[00:00:05] twentyafterfour: May I have your attention please! Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T0000) [00:03:41] (03Merged) 10jenkins-bot: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615439 (https://phabricator.wikimedia.org/T258664) (owner: 10Legoktm) [00:03:45] (03Merged) 10jenkins-bot: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615440 (https://phabricator.wikimedia.org/T258664) (owner: 10Legoktm) [00:04:17] twentyafterfour: are you deploying now? [00:07:27] syncing to mwdebug1001 [00:09:48] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:15] syncing... [00:11:49] !log legoktm@deploy1001 scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details) [00:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:54] uhoh [00:14:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:15:08] ok, it didn't sync properly [00:15:37] >[{exception_id}] {exception_url} Error from line 459 of /srv/mediawiki/php-1.36.0-wmf.1/includes/libs/rdbms/database/Database.php: Class 'MediaWiki\Installer\Services\InstallerDBSupport' not found [00:15:42] of course not, because I'm trying to remove that [00:15:51] trying once more... 
[00:15:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:16:56] !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.1/includes/: T258664: Revert "Add a new type of database to the installer from extension" (duration: 01m 09s) [00:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:03] T258664: 25% latency regression July 2nd due to InstallerExtensionSelector service running in production - https://phabricator.wikimedia.org/T258664 [00:17:26] uh [00:17:58] I think exceptions are spiking again [00:19:25] ok, looks fine [00:19:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:20:16] !log legoktm@deploy1001 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org) [00:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:35] ok, I'll sync more atomically... [00:21:43] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10aaron) Given the libketama-style consistent hashing in twemproxy and that, AFAIK, CentralAuth sessions can regenerate (notwithstanding one-off CSRF token failu... 
[00:21:45] sorry [00:22:47] !log legoktm@deploy1001 Synchronized php-1.35.0-wmf.41/includes/libs/rdbms/database/Database.php: T258664: Revert "Add a new type of database to the installer from extension" (duration: 01m 05s) [00:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:52] T258664: 25% latency regression July 2nd due to InstallerExtensionSelector service running in production - https://phabricator.wikimedia.org/T258664 [00:24:13] !log legoktm@deploy1001 Synchronized php-1.35.0-wmf.41/includes/: T258664: Revert "Add a new type of database to the installer from extension" (2/2) (duration: 01m 08s) [00:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:30] there we go [00:25:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:27:42] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19001592 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:31:26] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2024680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:11:20] (03CR) 10Andrew Bogott: [C: 03+2] docker::registry: Allow param config to override defaults [puppet] - 10https://gerrit.wikimedia.org/r/615581 (owner: 10BryanDavis) [02:16:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:18:36] RECOVERY - Prometheus jobs reduced availability on 
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:00:57] 10Operations, 10Graphoid, 10serviceops, 10Chinese-Sites, and 3 others: Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (10Shizhao) [04:31:38] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10aaron) >>! In T244340#6211682, @elukey wrote: > Side note: if not... [04:36:06] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5331 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:37:58] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 13 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:45:16] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [04:47:02] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:27:51] (03PS1) 10Marostegui: Revert "dbproxy1019: Reduce labsdb1009 weight" [puppet] - 10https://gerrit.wikimedia.org/r/615442 [05:28:57] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Reduce labsdb1009 weight" [puppet] - 10https://gerrit.wikimedia.org/r/615442 (owner: 10Marostegui) [05:29:28] !log Restore labsdb1009's original weight [05:29:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Joe) @Jrbranaa ping again :) [06:15:02] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:20:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:29:24] PROBLEM - ores on ores2009 is CRITICAL: connect to address 10.192.48.90 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:34:58] RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.878 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:55:21] good morning [06:55:32] anyone with logstash access that could get the stack trace for https://phabricator.wikimedia.org/T258666? [07:08:48] <_joe_> Majavah: done, turns out it's a duplicate of T212428 [07:08:49] T212428: includes/Revision/RevisionStore.php: Main slot of revision (number) not found in database! - https://phabricator.wikimedia.org/T212428 [07:09:25] _joe_: thank you! 
[07:10:07] <_joe_> also I'm looking at if it got worse over the last few days [07:10:18] <_joe_> but no [07:10:37] well that ticket is claiming that it broke one extension (FileImporter) completely on this train [07:10:55] <_joe_> Majavah: I don't see that from the error rate [07:12:11] <_joe_> but ok, I'll let someone else change that priority [07:22:05] (03PS1) 10Muehlenhoff: Also return uid from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/615664 [07:22:07] (03PS1) 10Muehlenhoff: Also return uid from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/615665 [07:24:01] hey, are there known Job Queue issues and/or who should I tag on Phab to flag them? Thanks. [07:35:42] <_joe_> Elitre: no known issues besides some occasional overload that might lose some jobs [07:35:52] <_joe_> please tag operations [07:36:12] _joe_: thanks. I had the same mass message delivered twice. I almost wanna cry. [07:36:20] (03PS4) 10Kormat: mariadb::monitor::prometheus: Remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) [07:36:24] <_joe_> uh, interesting [07:36:40] (03PS6) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [07:36:41] 10Operations, 10MassMessage: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) [07:36:48] <_joe_> so the problem is not that it was delivered, but that it was delivered twice [07:37:04] <_joe_> then sorry, probably we need to add more tags, I'll take care of it [07:37:08] yeah. [07:37:12] <_joe_> also that bug is a bit old :D [07:37:24] I can add them no problem, I just don't know which those are [07:37:28] _joe_: i think the preferred term is "mature" ;) [07:38:10] <_joe_> kormat: in this case, ripe [07:38:14] yes, they found evidence of the mass message system along with some dynos which were excavated recently. 
[07:38:34] 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Joe) [07:38:54] This isn't the first time this has happened. There's at least https://phabricator.wikimedia.org/T232379 as well. [07:39:47] _joe_: 👌 [07:40:08] (03PS2) 10Muehlenhoff: Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) [07:40:10] 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Joe) @Pchelolo @Ottomata do we have any way to verify how this happened from eventgate and changeprop logs? [07:40:55] I should probably specify, whatever you do to investigate, I hope that doesn't involve sending that message again :p [07:41:19] (03PS3) 10Muehlenhoff: Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) [07:41:19] <_joe_> Elitre: it looks like https://phabricator.wikimedia.org/T232379#5556920 is still the issue. [07:41:21] (03CR) 10jerkins-bot: [V: 04-1] Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [07:42:31] (03CR) 10jerkins-bot: [V: 04-1] Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [07:42:49] (03CR) 10Elukey: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/615664 (owner: 10Muehlenhoff) [07:46:39] (03PS4) 10Muehlenhoff: Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) [07:48:42] _joe_: yes, and IIUC what pchelolo wrote, then we probably should do something about it now!
[07:49:14] (03PS5) 10Muehlenhoff: Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) [07:54:34] (03CR) 10JMeybohm: [C: 04-1] ratelimit: add new docker image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [07:59:01] (03PS1) 10Volans: mgmt: netbox-generated data for mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) [08:08:44] (03CR) 10Muehlenhoff: [C: 03+2] Also return uid from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/615664 (owner: 10Muehlenhoff) [08:09:20] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [08:15:35] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo) Thanks for your work on this. One clarification, for those of us that are not that familiar with LV... 
[08:16:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 to move it to m2 T257540', diff saved to https://phabricator.wikimedia.org/P12024 and previous config saved to /var/cache/conftool/dbconfig/20200723-081650-marostegui.json [08:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:57] T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 [08:19:59] (03PS1) 10Marostegui: mariadb: Move db1107 from s1 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/615669 (https://phabricator.wikimedia.org/T257540) [08:20:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1107 from s1 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/615669 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [08:21:41]  [08:21:51] (03CR) 10Volans: "I've manually verified all of them both programmatically with a diff script and visually (screen-by-screen) comparison. Instructions on ho" [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [08:21:57] 10Operations, 10observability: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (10fgiunchedi) [08:22:13] 10Operations, 10observability: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (10fgiunchedi) [08:22:16] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [08:24:05] (03PS5) 10ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) [08:24:18] (03CR) 10JMeybohm: [C: 03+2] proton: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615254 
(https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [08:24:24] (03PS3) 10JMeybohm: proton: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615254 (https://phabricator.wikimedia.org/T256843) [08:24:30] (03CR) 10jerkins-bot: [V: 04-1] rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [08:24:38] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "misc requires a full refactor that will be done at a later time" [puppet] - 10https://gerrit.wikimedia.org/r/615669 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [08:26:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1107 from s1 T257540', diff saved to https://phabricator.wikimedia.org/P12025 and previous config saved to /var/cache/conftool/dbconfig/20200723-082647-marostegui.json [08:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:53] T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 [08:27:45] (03PS2) 10JMeybohm: termbox: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) [08:27:53] (03PS2) 10JMeybohm: wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) [08:27:59] (03PS2) 10JMeybohm: zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) [08:28:10] (03PS2) 10JMeybohm: _scaffold: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843) [08:29:13] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' . 
[08:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:31] 10Operations, 10observability: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (10ayounsi) 👍 Good idea! I'd say send alerts only from one host as it's already quite loud (no easy way to mute alerts). Also T169860 is most likely the future... [08:34:12] (03PS1) 10Filippo Giunchedi: smokeping: match documentroot with smokeping installation [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) [08:35:38] (03CR) 10Ayounsi: [C: 03+1] "I checked the codfw network records in authdns2001:/srv/git/netbox_dns_snippets and they lgtm." [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [08:37:34] (03Abandoned) 10Volans: puppetdb microservice: add some filtering [puppet] - 10https://gerrit.wikimedia.org/r/615232 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [08:38:18] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10akosiaris) >>! In T258614#6328624, @jcrespo wrote: > Thanks for your work on this. One clarification, for th...
[08:38:24] (03CR) 10QChris: [C: 03+1] remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn) [08:38:47] (03PS17) 10Alexandros Kosiaris: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [08:39:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [08:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:05] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' . [08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:51] !log remove pim-rp IPs from last routers - T257573 [08:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:56] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573 [08:41:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:18] !log test librenms poller from netmon2001 [08:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:43:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:45:28] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' . 
[08:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:40] (03CR) 10JMeybohm: [C: 03+2] termbox: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [08:45:48] (03PS1) 10Alexandros Kosiaris: mobileapps: Bump memory limits another 20% [deployment-charts] - 10https://gerrit.wikimedia.org/r/615671 (https://phabricator.wikimedia.org/T218733) [08:45:50] (03PS1) 10Alexandros Kosiaris: mobileapps: Lower replicas to 80 from 240 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615672 (https://phabricator.wikimedia.org/T218733) [08:45:54] (03PS1) 10Ayounsi: Reclaim PIM-RP IPs from the multicast gods [dns] - 10https://gerrit.wikimedia.org/r/615673 (https://phabricator.wikimedia.org/T257573) [08:46:07] (03PS6) 10ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) [08:46:34] (03CR) 10jerkins-bot: [V: 04-1] rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [08:46:36] (03PS7) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [08:46:55] (03CR) 10Ayounsi: [C: 03+2] Reclaim PIM-RP IPs from the multicast gods [dns] - 10https://gerrit.wikimedia.org/r/615673 (https://phabricator.wikimedia.org/T257573) (owner: 10Ayounsi) [08:47:01] (03Merged) 10jenkins-bot: termbox: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [08:47:43] godog: be carefull with the pollers/discovery as they update the DB as well, not sure how 2 instances would play together [08:47:52] (03PS1) 10DCausse: Fix bug 
that causes wrong prefixes in RDF output [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) [08:48:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Bump memory limits another 20% [deployment-charts] - 10https://gerrit.wikimedia.org/r/615671 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [08:49:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Lower replicas to 80 from 240 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615672 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [08:49:06] XioNoX: mmhh good point, thanks! yeah I won't mess with the crons more [08:49:49] (03Merged) 10jenkins-bot: mobileapps: Bump memory limits another 20% [deployment-charts] - 10https://gerrit.wikimedia.org/r/615671 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [08:50:08] 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) I am looking at the GUC log linked above and it appears that first delivery was successful on 145 of the [[ https://meta.wikimedia.o... [08:50:10] (03Merged) 10jenkins-bot: mobileapps: Lower replicas to 80 from 240 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615672 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [08:50:17] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (10ayounsi) 05Open→03Resolved Indeed, cleaned up as well: ` rancid-configs/configs$ ag 208.80.153.194 cr1-codfw.wikimedia.org 1661: address 208.80.153.194/32; cr2-codfw.w...
[08:51:49] 10Operations, 10serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) p:05Triage→03High [08:52:49] 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) For more quirkiness, [[ https://www.mediawiki.org/wiki/User_talk:Krinkle#GUC_Tool_error,_or? | I had recently brought up]] that in... [08:57:19] (03PS1) 10Alexandros Kosiaris: Update helm repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/615675 [08:58:13] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo) I see, thanks. [08:59:00] !log transfer --type=xtrabackup from db1117:3322 to db1107 T257540 [08:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:06] T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 [09:00:40] (03PS7) 10ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) [09:02:43] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[09:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:08] (03CR) 10jerkins-bot: [V: 04-1] Fix bug that causes wrong prefixes in RDF output [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: 10DCausse) [09:08:16] PROBLEM - librenms.wikimedia.org requires authentication on netmon2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:08:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] Update helm repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/615675 (owner: 10Alexandros Kosiaris) [09:08:58] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:09:08] (03PS8) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [09:09:17] kormat: ^ is that you ? 
[09:09:26] (03CR) 10DCausse: "recheck" [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: 10DCausse) [09:09:47] (03Merged) 10jenkins-bot: Update helm repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/615675 (owner: 10Alexandros Kosiaris) [09:09:50] (03PS20) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [09:10:26] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:10:27] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:10:40] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:10:50] ^ me [09:10:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll merge this in the interest of pushing this forward. If we encounter issues, we 'll solve them then. 
PCC across the fleet is OK, so I" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [09:11:15] ah, it is always the DBA™ [09:12:26] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:12:30] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:14:02] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:14:02] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:14:34] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:15:22] (03PS1) 10Alexandros Kosiaris: ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676 [09:15:27] (03PS6) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [09:16:35] (03CR) 10Hnowlan: ratelimit: add new docker image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [09:19:24] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [09:19:24] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [09:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:56] !log lower replica count back to 80 for mobileapps.
T218733 [09:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:02] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:20:37] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' . [09:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:20] (03CR) 10Volans: [C: 03+2] "Last PS includes the requested comment, merging." (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [09:22:23] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10jcrespo) BTW: ` webperf2002 Disk space WARNING 2020-07-23 09:12:15 0d 0h 38m 11s 3/3 DISK WARNING - free space: /srv 20077 MB (6% inode=99%): ` [09:24:27] (03Merged) 10jenkins-bot: GC: add time-based GC for Image objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [09:24:53] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' . [09:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [09:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:15] (03CR) 10Jcrespo: [C: 03+2] transfer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612384 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:27:29] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [09:27:29] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[09:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:53] (03PS5) 10Jcrespo: Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:30:02] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:30:04] (03PS3) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) [09:31:00] (03Abandoned) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [09:31:34] (03CR) 10Jcrespo: [C: 03+2] "This is ok, but if in the future you add more unit tests, please consider splitting the testing into several files. 
Putting unit tests for" [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:32:27] (03Merged) 10jenkins-bot: Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:35:01] (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:36:05] (03Merged) 10jenkins-bot: wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:36:38] RECOVERY - librenms.wikimedia.org requires authentication on netmon2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:38:04] (03CR) 10Jcrespo: "It needs rebase again :-(" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:38:22] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[09:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:01] (03PS1) 10Volans: Release v0.2.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679 [09:39:22] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: analytics1041.eqiad.wmnet, contint2001.wikimedia.org, contint1001.wikimedia.org, testreduce1001.eqiad.wmnet, aphlict1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [09:40:05] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [09:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:33] (03CR) 10Privacybatm: "> Patch Set 6:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:45:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679 (owner: 10Volans) [09:46:54] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[09:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:09] (03CR) 10JMeybohm: [C: 03+2] zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:47:35] (03PS7) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [09:48:12] (03Merged) 10jenkins-bot: zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:48:15] (03Merged) 10jenkins-bot: _scaffold: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:48:21] (03PS8) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [09:48:57] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:49:08] (03CR) 10Jcrespo: "2 nitpicks, let me know what you think." 
(032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [09:50:13] (03CR) 10Privacybatm: "> Patch Set 6:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:50:57] (03CR) 10Privacybatm: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:51:04] (03PS2) 10Volans: Release v0.2.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679 [09:51:25] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=mobileapps,name=kubernetes.* [09:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:53] !log prepare for pooling kubernetes mobileapps capacity in eqiad. T218733 [09:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:58] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:55:32] 10Operations: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:56:21] (03CR) 10Jcrespo: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:57:58] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:59:53] (03CR) 10Privacybatm: "I am pushing a new patch right away" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1000). 
[10:00:46] (03PS4) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) [10:01:02] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:03:20] (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.2.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679 (owner: 10Volans) [10:03:23] 10Operations, 10serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) The mc1020 spikes are interesting: https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=60&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prome... [10:04:18] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[10:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:09] !log volans@deploy1001 Started deploy [debmonitor/deploy@44aa1ee]: Release v0.2.6 [10:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:20] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [10:05:22] !log volans@deploy1001 Finished deploy [debmonitor/deploy@44aa1ee]: Release v0.2.6 (duration: 00m 14s) [10:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:20] !log volans@deploy1001 Started deploy [debmonitor/deploy@16d0c45]: Release v0.2.6 [10:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:56] !log volans@deploy1001 Finished deploy [debmonitor/deploy@16d0c45]: Release v0.2.6 (duration: 00m 36s) [10:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:14] (03CR) 10Jcrespo: "I like that this still works even if source and target are the same host, kudos." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:08:51] (03PS1) 10Giuseppe Lavagetto: Switch the base image to buster from stretch. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 [10:09:10] (03CR) 10Privacybatm: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [10:09:26] (03CR) 10Jcrespo: Transferer.py: Resolve concurrency issue with checksum file names (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:11:18] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=mobileapps,name=kubernetes.* [10:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:43] !log poole kubernetes in mobileapps/eqiad. T218733 [10:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:47] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [10:14:45] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' . [10:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:08] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[10:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:25] (03PS3) 10JMeybohm: mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843) [10:20:11] (03PS9) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [10:20:15] (03PS1) 10Privacybatm: transferpy: Change tox development environment to Python3.7 [software/transferpy] - 10https://gerrit.wikimedia.org/r/615688 (https://phabricator.wikimedia.org/T257600) [10:22:47] (03CR) 10Privacybatm: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [10:24:26] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb1001.* [10:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:34] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb1002.* [10:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:17] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=eqiad,service=mobileapps,name=scb* [10:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:27] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=eqiad,service=mobileapps,name=scb.* [10:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:30] (03CR) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:33:30] (03CR) 10Jcrespo: "Careful, you have now a couple of "IndexError: tuple index out of range" 
on jenkins." [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:39:46] (03CR) 10Hnowlan: [C: 03+1] Limit concurrency for processMediaModeration job [deployment-charts] - 10https://gerrit.wikimedia.org/r/615572 (https://phabricator.wikimedia.org/T258653) (owner: 10Ppchelko) [10:42:58] (03PS5) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) [10:44:11] (03CR) 10Privacybatm: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:44:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:46:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:51:53] (03CR) 10Jcrespo: [C: 03+2] "Testing went ok now." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:52:22] (03Merged) 10jenkins-bot: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:53:23] (03PS7) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [10:55:36] (03PS4) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) [10:56:35] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [10:56:53] (03PS1) 10Jbond: librenms: Fix hash definition [puppet] - 10https://gerrit.wikimedia.org/r/615702 [10:57:06] (03PS8) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) [10:57:39] (03CR) 10Jbond: [C: 03+2] librenms: Fix hash definition [puppet] - 10https://gerrit.wikimedia.org/r/615702 (owner: 10Jbond) [10:58:26] (03PS18) 10Effie Mouzeli: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [10:59:15] (03CR) 10Ema: [C: 03+2] varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1100). [11:00:04] dcausse: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:33] o/ [11:01:00] o/ [11:01:06] dcausse: do you want to deploy the changes yourself? [11:01:13] Lucas_WMDE: sure I can [11:01:20] ok! [11:01:37] (03PS1) 10Effie Mouzeli: Add certificates and API keys for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042) [11:01:53] (03CR) 10DCausse: [C: 03+2] "DEPLOYING" [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: 10DCausse) [11:02:06] (03PS2) 10DCausse: [sdoc] fix entity source base URIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) [11:02:17] (03CR) 10Jcrespo: "2 comments below:" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:04:21] (03CR) 10DCausse: [C: 03+2] "DEPLOYING" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) (owner: 10DCausse) [11:04:26] (03CR) 10Muehlenhoff: "Let's also change profile::idp::client::httpd::document_root to /usr/share/smokeping/www in the same patch? CAS will only be enabled once " [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [11:05:21] (03Merged) 10jenkins-bot: [sdoc] fix entity source base URIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) (owner: 10DCausse) [11:08:28] (03CR) 10Jcrespo: "Only one question- otherwise looks good, only needs deep testing on my side. 
Once merged we can merge at the same time as 612162." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm) [11:10:18] (03PS1) 10Jbond: librenms: add additional parameters [puppet] - 10https://gerrit.wikimedia.org/r/615706 [11:10:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] librenms: add additional parameters [puppet] - 10https://gerrit.wikimedia.org/r/615706 (owner: 10Jbond) [11:13:23] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T258474: [sdoc] fix entity source base URIs (duration: 01m 07s) [11:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:29] T258474: RDF dumps for Structured Data on Commons are broken - https://phabricator.wikimedia.org/T258474 [11:15:33] (03CR) 10Privacybatm: Transferer.py: Add proper cleanup (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:17:07] (03PS1) 10Jbond: librenms: correct group map type [puppet] - 10https://gerrit.wikimedia.org/r/615707 [11:17:46] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb.* [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:01] !log depool scb in mobileapps/eqiad. 
T218733 [11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:06] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [11:18:41] (03CR) 10Jbond: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24091/netmon2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615707 (owner: 10Jbond) [11:24:00] (03Merged) 10jenkins-bot: Fix bug that causes wrong prefixes in RDF output [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: 10DCausse) [11:25:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [11:26:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615665 (owner: 10Muehlenhoff) [11:26:55] (03CR) 10JMeybohm: [C: 03+1] "Nice!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [11:27:03] 10Operations, 10serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) >>! In T258679#6328948, @elukey wrote: > There is a baseline for slab 136 of constant GET traffic, that should be related to `ITEM WANCache:v:global:SqlBlobStore-blob:en... 
[11:28:09] (03CR) 10Jbond: [C: 03+1] "lol sorry 😊" [puppet] - 10https://gerrit.wikimedia.org/r/615509 (owner: 10Muehlenhoff) [11:28:14] (03CR) 10Jbond: [C: 03+2] Also remove priority for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/615509 (owner: 10Muehlenhoff) [11:28:18] (03CR) 10Jcrespo: Transferer.py: Add proper cleanup (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:30:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:31:53] !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Wikibase: T258507: Fix bug that causes wrong prefixes in RDF output (duration: 01m 11s) [11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:59] T258507: v: prefix not correctly prefixed in Wikibase when using entitysource config and extra prefixes - https://phabricator.wikimedia.org/T258507 [11:32:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:36:06] !log European mid-day backport window done [11:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:08] (03PS3) 10Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [11:42:44] (03PS12) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [11:46:00] (03PS1) 10Urbanecm: Log ClosedWikiProvider's start with info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615713 (https://phabricator.wikimedia.org/T258695) [11:46:03] (03CR) 10Privacybatm: Transferer.py: Add proper cleanup (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:46:25] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) >>! In T195981#5635590, @jbond wrote: > I attempted a [[ https://github.com/puppetlabs/puppet/pull/7802 | patch for this upstream ]] although its not quite wo... 
[11:47:18] (03CR) 10Urbanecm: [C: 03+2] Log ClosedWikiProvider's start with info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615713 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm) [11:47:32] (03CR) 10Jbond: [C: 03+2] thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [11:48:08] (03Merged) 10jenkins-bot: Log ClosedWikiProvider's start with info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615713 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm) [11:48:10] !log Deploy MCR schema change on db1145:3314 [11:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:56] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 745ff20f53e4914cf6e1717c963419e74b68e693: Log ClosedWikiProviders start with info level (T258695) (duration: 01m 05s) [11:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:01] T258695: Investigate why ClosedWikiProvider doesn't work - https://phabricator.wikimedia.org/T258695 [11:50:14] (03PS1) 10Marostegui: install_server: Do not reimage db1107 [puppet] - 10https://gerrit.wikimedia.org/r/615715 [11:51:12] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1107 [puppet] - 10https://gerrit.wikimedia.org/r/615715 (owner: 10Marostegui) [11:51:38] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:53:25] (03CR) 10Privacybatm: "Thank you for the review! please see my reply." 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm) [11:55:49] 10Operations: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10CDanis) [11:56:06] (03CR) 10JMeybohm: ""helmfile template" fails for me with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [12:00:12] (03PS1) 10Jbond: idp: add thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615718 [12:00:36] !log Stagging at mwdebug1001 [12:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:04] (03PS2) 10Privacybatm: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) [12:02:11] (03PS3) 10Jbond: idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 [12:02:20] (03CR) 10Jbond: [C: 03+2] idp: add thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615718 (owner: 10Jbond) [12:02:39] (03PS8) 10ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) [12:02:59] !log Stagging at mwdebug1001 ended, run scap pull to clean changes [12:03:02] (03CR) 10Jbond: [C: 03+2] idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 (owner: 10Jbond) [12:03:02] (03PS4) 10Jbond: idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 [12:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:04] (03PS2) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [12:03:06] (03CR) 
10Jbond: [V: 03+2 C: 03+2] idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 (owner: 10Jbond) [12:03:29] (03CR) 10jerkins-bot: [V: 04-1] Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:04:02] (03CR) 10Jcrespo: "Looks good. No comments except final testing on my side to validate it." [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:04:17] (03CR) 10ArielGlenn: [C: 03+2] rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [12:06:36] (03CR) 10Ayounsi: "I don't know enough to review that, but 2 notes:" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:06:38] (03CR) 10Jcrespo: [C: 03+2] transferpy: Change tox development environment to Python3.7 [software/transferpy] - 10https://gerrit.wikimedia.org/r/615688 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:07:08] (03Merged) 10jenkins-bot: transferpy: Change tox development environment to Python3.7 [software/transferpy] - 10https://gerrit.wikimedia.org/r/615688 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:07:33] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:08:07] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:08:32] (03CR) 10Jcrespo: "Give me some time to think of a proper alternative and I will answer you soon." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [12:09:36] (03PS1) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) [12:12:11] (03CR) 10Jcrespo: "Did I merge things in the wrong order?" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:12:34] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:14:02] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.152 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:14:27] (03CR) 10CDanis: "In addition to this stuff you'll also 1) need Pybal restarts and 2) be willing to tolerate some downtime (if you aren't, you'll have to se" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:15:55] (03PS2) 10Alexandros Kosiaris: ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676 [12:15:56] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.154 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:17:05] (03PS2) 10ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) [12:17:30] (03CR) 10jerkins-bot: [V: 04-1] script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 
(https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [12:17:35] !log Stagging at mwdebug1001 again [12:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:30] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:19:33] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/24062/db1080.eqiad.wmnet/index.html -> aren't those used for https://grafana.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat) [12:20:14] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.154 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:20:30] RECOVERY - MD RAID on restbase-dev1004 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:20:56] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10akosiaris) >>! In T256863#6320727, @Papaul wrote: > @Eevans @akosiaris we have 2 spare on site that we can use to replace this server > wmf6413 and wmf6414 in netbox both servers are: > HP ProLiant DL360... [12:20:58] (03PS1) 10Jbond: idp - librenms: only allow librenms-readers and ops groups for librenms [puppet] - 10https://gerrit.wikimedia.org/r/615721 [12:21:11] 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Ottomata) @Joe I don't have full context of how MassMessageJob and JobQueue work here, but at the very least it seems we are able to save th... 
[12:21:21] !log Stagging at mwdebug1001 ended, run scap pull to clean changes [12:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:38] (03CR) 10Jbond: [C: 03+2] idp - librenms: only allow librenms-readers and ops groups for librenms [puppet] - 10https://gerrit.wikimedia.org/r/615721 (owner: 10Jbond) [12:23:11] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) a:05Cmjohnson→03hnowlan [12:23:15] !log remove bogus lo0 IPs from cr3-knams [12:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:49] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) Thanks @Jclark-ctr ! We'll need to rebuild the raid0 that the cassandra storage is located upon. [12:23:50] (03PS1) 10Urbanecm: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) [12:24:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:24:42] (03CR) 10jerkins-bot: [V: 04-1] ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm) [12:24:47] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: add passive server for rsync server [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:24:54] (03PS3) 10Filippo Giunchedi: librenms: add passive server for rsync server [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) [12:25:13] (03CR) 10Jbond: [C: 03+1] "> 
Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:25:27] (03PS1) 10Muehlenhoff: profile::superset: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/615724 [12:26:00] (03PS2) 10Urbanecm: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) [12:26:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:23] (03PS1) 10Marostegui: dbproxy1019: Decrease a bit labsdb1009 weight [puppet] - 10https://gerrit.wikimedia.org/r/615725 [12:26:59] (03CR) 10Urbanecm: "Hello James, I'd appreciate your review here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm) [12:27:01] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Decrease a bit labsdb1009 weight [puppet] - 10https://gerrit.wikimedia.org/r/615725 (owner: 10Marostegui) [12:28:10] (03CR) 10Kormat: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat) [12:28:39] 10Operations, 10Analytics, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10MoritzMuehlenhoff) [12:29:13] !log Decrease labsdb1009 weight a bit, as it is lagging again. 
[12:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:00] (03PS2) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) [12:30:02] (03PS1) 10Jbond: thanos-query - lvs: update service to monitoring_setup while we update [puppet] - 10https://gerrit.wikimedia.org/r/615726 [12:30:20] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38741432 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:46] (03PS1) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727 [12:32:10] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 872 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:44] (03PS3) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [12:33:22] (03PS1) 10Esanders: Fix VE-RealTime CSP entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615728 [12:34:38] (03PS3) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) [12:35:36] (03CR) 10Jbond: "thanks, updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:35:38] (03CR) 10Privacybatm: "> Patch Set 9:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:36:32] (03PS2) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727 [12:40:38] (03PS10) 10Privacybatm: 
Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [12:41:29] (03CR) 10Privacybatm: "> Patch Set 9:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [12:41:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes: add namespace for api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/615521 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [12:42:46] (03Restored) 10Privacybatm: transferpy: Package transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [12:43:15] (03Abandoned) 10Privacybatm: transferpy: Package transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [12:44:37] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks everyone for the reviews! 
Going ahead with it now and we can tackle the refactor after the Buster migration" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:44:40] (03Abandoned) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:44:43] (03PS2) 10Filippo Giunchedi: smokeping: match documentroot with smokeping installation [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) [12:45:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add discovery and disabled LVS components for API gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [12:45:17] (03Abandoned) 10Privacybatm: [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:48:38] (03PS3) 10ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) [12:49:02] (03CR) 10jerkins-bot: [V: 04-1] script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [12:49:49] (03CR) 10Alexandros Kosiaris: "Quick q before I review. Does the WIP in the commit message still hold, or is this ready for review?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [12:52:58] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [12:54:58] (03PS3) 10Privacybatm: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) [12:55:00] (03PS4) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [12:55:02] (03PS6) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [12:55:30] ok who made icinga sad [12:58:13] akosiaris: Jul 23 12:46:56 icinga1001 icinga[75759]: Error: Could not find any host matching 'chartmuseum2001.codfw.wmnet' (config file '/etc/nagios/nagios_service.cfg', starting on line 23985) [12:58:15] Jul 23 12:46:56 icinga1001 icinga[75759]: Error: Could not expand hostgroups and/or hosts specified in service (config file '/etc/nagios/nagios_service.cfg', starting on line 23985) [12:58:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall! 
Other clients that will need changing are all in puppet, namely the grafana datasource and modules/profile/manifests/thanos/" [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:58:36] (03CR) 10Privacybatm: [C: 03+1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [13:00:04] longma and liw: (Dis)respected human, time to deploy Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1300). Please do the needful. [13:01:22] (03PS1) 10Filippo Giunchedi: librenms: create/update users when using SSO [puppet] - 10https://gerrit.wikimedia.org/r/615730 (https://phabricator.wikimedia.org/T247967) [13:01:24] (03PS1) 10Filippo Giunchedi: librenms: set bootstrap/cache permissions [puppet] - 10https://gerrit.wikimedia.org/r/615731 (https://phabricator.wikimedia.org/T247967) [13:01:26] (03PS1) 10Filippo Giunchedi: role: force mpm_prefork for netmon/librenms [puppet] - 10https://gerrit.wikimedia.org/r/615732 (https://phabricator.wikimedia.org/T247967) [13:02:51] (03CR) 10jerkins-bot: [V: 04-1] librenms: create/update users when using SSO [puppet] - 10https://gerrit.wikimedia.org/r/615730 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:03:34] (03PS1) 10Filippo Giunchedi: profile: move thanos-query clients to https [puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) [13:05:30] (03CR) 10CDanis: [C: 03+1] profile: move thanos-query clients to https [puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) (owner: 10Filippo Giunchedi) [13:05:33] (03CR) 10CDanis: [C: 03+1] lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 
(https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [13:08:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [13:09:50] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [13:11:36] (03PS1) 10Muehlenhoff: Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615734 [13:18:47] (03PS4) 10ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) [13:20:54] cdanis: looking [13:22:49] (03PS1) 10Muehlenhoff: Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736 [13:24:49] ah, the codfw.wmnet thing. /me fixing [13:26:20] (03PS3) 10Alexandros Kosiaris: ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676 [13:26:22] (03PS1) 10Alexandros Kosiaris: chartmuseum: Skip the domain names in service [puppet] - 10https://gerrit.wikimedia.org/r/615737 [13:26:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676 (owner: 10Alexandros Kosiaris) [13:26:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] chartmuseum: Skip the domain names in service [puppet] - 10https://gerrit.wikimedia.org/r/615737 (owner: 10Alexandros Kosiaris) [13:29:43] (03CR) 10Hnowlan: "> Patch Set 20:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:32:13] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [13:34:59] (03PS1) 10Filippo Giunchedi: hieradata: flip netmon2001 back 
to ldap auth [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967) [13:35:02] I'm seeing multiple onwiki reports of problems related to Commons, that may be unrelated but kinda smell of a common infrastructure problem somewhere. [13:35:21] 10Operations: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10Joe) p:05Triage→03Low that description is used in a comment in pybal, where the $::site is evaluated correctly. The problem here is that it was reused to make the icinga alerts unique AIUI. b... [13:35:33] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [13:35:44] cdanis: ^ [13:35:46] fixed [13:36:07] :) ty [13:36:10] Fæ reported trouble getting members of a category from the API: https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#Please_help_prioritize_the_Commons_API_"error_500"_bug_on_searches_and_category_queries [13:36:38] Multiple reports that FileImporter fails: https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#File_importer_is_broken [13:36:43] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:37:26] (imports tried from enWS, jaWP, and a third by a Chinese user) [13:37:57] And then there was a report of the ia-upload tool failing an upload. 
[13:37:58] (03Abandoned) 10Filippo Giunchedi: librenms: create/update users when using SSO [puppet] - 10https://gerrit.wikimedia.org/r/615730 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:38:56] (ia-upload, for those not aware, is a toolforge tool that grabs book scans from the Internet Archive and upload them to Commons) [13:38:57] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:40:12] I believe a manual/UploadWizard upload of the same ~300MB PDF (i.e. chunked upload) file also failed, but haven't tested that myself. [13:41:17] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:41:17] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:41:51] Fæ's API problem is apparently a couple of weeks old and ongoing, while the two upload/import problems look like they may have started yesterday. 
[13:42:03] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Krinkle) 05Resolved→03Open [13:42:33] (03PS1) 10Alexandros Kosiaris: discovery: Add helm-charts discovery stanzas [puppet] - 10https://gerrit.wikimedia.org/r/615739 [13:42:37] (03PS2) 10Muehlenhoff: Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736 [13:43:51] (03CR) 10jerkins-bot: [V: 04-1] Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736 (owner: 10Muehlenhoff) [13:44:48] xover: Allowed memory size of 698351616 bytes exhausted (tried to allocate 6128792 bytes) doesn't look very promising. That's a lot of memory [13:45:04] akosiaris: it sounds like that's new behavior on unchanged API calls, though [13:45:41] could also be new larger uploads as well. Difficult to say at this point [13:45:42] (03PS4) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) [13:45:57] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:45:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns4001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:46:04] we can try subscribing a few people to the task, they might have some more insight [13:46:09] akosiaris: does this seem like a thing CPT could/should look at? 
[13:46:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] discovery: Add helm-charts discovery stanzas [puppet] - 10https://gerrit.wikimedia.org/r/615739 (owner: 10Alexandros Kosiaris) [13:46:23] (03PS3) 10Muehlenhoff: Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736 [13:47:24] cdanis: I think so [13:47:59] excuse me, I guess I mean 'Platform Engineering'? [13:48:06] did they have a separate workboard for their clinic duty? [13:48:14] multiple ones IIRC [13:48:18] RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:19] RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:22] RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:28] RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:34] RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:34] RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:49:13] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=.* [13:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:58] I know metdata handling for PDF and DjVu files is suboptimal: OCR text layer wrapped in an XML structure is stored in DB fields along with things like dates, dpi, and x / y resolution. 
[13:51:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Should be good to go now" [dns] - 10https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:52:54] xover: hope you don't mind but I quoted you on https://phabricator.wikimedia.org/T255981, and put it on CPT's radar [13:53:28] cdanis: thanks! [13:55:36] (03PS1) 10Jbond: puppetboard: increase buffer size [puppet] - 10https://gerrit.wikimedia.org/r/615742 [13:57:24] (03CR) 10Jbond: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24096/" [puppet] - 10https://gerrit.wikimedia.org/r/615742 (owner: 10Jbond) [13:58:51] (03CR) 10CDanis: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/615742 (owner: 10Jbond) [14:00:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:00:52] PROBLEM - ganeti-metad running on ganeti3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti [14:02:36] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10RobH) [14:03:13] (03PS1) 10Ema: Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317) [14:03:19] PROBLEM - ganeti-metad running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti [14:04:39] akosiaris: do you know what's up with metad? [14:05:02] hmm [14:05:50] quite possibly eventual consistency. 
I had to rename that check [14:05:56] * akosiaris verifying [14:06:18] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH) [14:06:24] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:07:01] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: flip netmon2001 back to ldap auth [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:07:03] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:07:06] (03PS2) 10Filippo Giunchedi: hieradata: flip netmon2001 back to ldap auth [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967) [14:11:36] (03PS2) 10Ema: Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317) [14:12:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:13:33] (03CR) 10Vgutierrez: [C: 03+1] Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [14:13:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:33] (03PS4) 10Muehlenhoff: Add CAS support to Superset (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736 [14:23:18] (03CR) 10Filippo Giunchedi: [C: 03+2] role: force mpm_prefork for netmon/librenms [puppet] - 10https://gerrit.wikimedia.org/r/615732 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:23:22] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: set bootstrap/cache permissions [puppet] - 10https://gerrit.wikimedia.org/r/615731 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:23:37] (03PS2) 10Filippo Giunchedi: librenms: set bootstrap/cache permissions [puppet] - 10https://gerrit.wikimedia.org/r/615731 (https://phabricator.wikimedia.org/T247967) [14:24:53] PROBLEM - ganeti-metad running on ganeti5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti [14:25:02] (03CR) 10Elukey: [C: 03+1] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615734 (owner: 10Muehlenhoff) [14:25:28] (03CR) 10Elukey: [C: 03+1] profile::superset: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/615724 (owner: 10Muehlenhoff) [14:25:50] (03PS2) 10Filippo Giunchedi: role: force mpm_prefork for netmon/librenms [puppet] - 10https://gerrit.wikimedia.org/r/615732 (https://phabricator.wikimedia.org/T247967) [14:25:59] PROBLEM - librenms.wikimedia.org requires authentication on netmon2001 is CRITICAL: connect to address 208.80.153.110 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:29:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615734 (owner: 10Muehlenhoff) [14:34:25] (03PS5) 10Muehlenhoff: Add CAS support to Superset [puppet] - 10https://gerrit.wikimedia.org/r/615736 [14:36:56] 
(03PS1) 10Muehlenhoff: Enable CAS for Superset [puppet] - 10https://gerrit.wikimedia.org/r/615754 [14:38:46] PROBLEM - ganeti-metad running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti [14:41:37] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24098/" [puppet] - 10https://gerrit.wikimedia.org/r/615736 (owner: 10Muehlenhoff) [14:47:43] (03PS2) 10Ayounsi: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641 [14:48:17] (03PS2) 10Ayounsi: Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 [14:48:59] (03PS6) 10Muehlenhoff: Add CAS support to Superset [puppet] - 10https://gerrit.wikimedia.org/r/615736 [14:50:57] (03CR) 10Ayounsi: "This change is ready for review." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 (owner: 10Ayounsi) [14:51:50] (03CR) 10Ayounsi: "This change is ready for review." 
[homer/public] - 10https://gerrit.wikimedia.org/r/613641 (owner: 10Ayounsi) [14:51:52] (03PS1) 10Elukey: druid: allow different package/class prefixes for logging/alarming [puppet] - 10https://gerrit.wikimedia.org/r/615759 (https://phabricator.wikimedia.org/T244482) [14:52:38] (03PS1) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) [14:57:08] (03CR) 10Elukey: [C: 03+2] druid: allow different package/class prefixes for logging/alarming [puppet] - 10https://gerrit.wikimedia.org/r/615759 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [14:57:12] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24101/" [puppet] - 10https://gerrit.wikimedia.org/r/615759 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [14:57:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:59:47] (03PS1) 10Alexandros Kosiaris: ganeti: Fix wconfd monitoring typo [puppet] - 10https://gerrit.wikimedia.org/r/615761 [15:01:36] (03PS2) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) [15:02:56] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:03:17] (03PS1) 10Elukey: role::druid::test_analytics::worker: set middlemanager java opts [puppet] - 10https://gerrit.wikimedia.org/r/615762 (https://phabricator.wikimedia.org/T244482) [15:03:39] (03CR) 10Alexandros Kosiaris: [C: 
03+2] ganeti: Fix wconfd monitoring typo [puppet] - 10https://gerrit.wikimedia.org/r/615761 (owner: 10Alexandros Kosiaris) [15:03:49] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: set middlemanager java opts [puppet] - 10https://gerrit.wikimedia.org/r/615762 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [15:09:50] (03CR) 10Muehlenhoff: Modernise Apache config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff) [15:10:26] (03PS2) 10Muehlenhoff: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/615459 [15:10:37] (03PS3) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) [15:14:47] (03PS4) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) [15:16:11] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/24105/" [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) (owner: 10Filippo Giunchedi) [15:17:57] (03PS21) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [15:19:23] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:21:56] (03PS22) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [15:23:03] (03PS5) 10JMeybohm: Add helm-charts discovery record [dns] - 10https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) [15:23:15] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 
(https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:27:28] (03PS2) 10Cmjohnson: Revert "Adding cloudcephosd servers to private vlan" [dns] - 10https://gerrit.wikimedia.org/r/615436 [15:27:31] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Revert "Adding cloudcephosd servers to private vlan" [dns] - 10https://gerrit.wikimedia.org/r/615436 (owner: 10Cmjohnson) [15:32:03] 10Operations, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10MoritzMuehlenhoff) [15:35:14] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [15:36:47] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 05s) [15:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:27] (03PS1) 10Cmjohnson: Addig cloudcephosd to cloud-host vlan [dns] - 10https://gerrit.wikimedia.org/r/615765 (https://phabricator.wikimedia.org/T251619) [15:40:30] (03CR) 10Cmjohnson: [C: 03+2] Addig cloudcephosd to cloud-host vlan [dns] - 10https://gerrit.wikimedia.org/r/615765 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [15:42:36] (03CR) 10Zfilipin: [C: 03+1] "`npm t` and `npm run selenium` pass! 
https://phabricator.wikimedia.org/P12029" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [15:42:49] (03CR) 10Volans: "Few comments inline, pure from the python PoV" (035 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 (owner: 10Ayounsi) [15:50:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ayounsi) Current status on the switches side is that vlans (cloud-hosts + cloud-storage) a... [15:54:28] (03CR) 10JMeybohm: "> > you need to explicitly indicate the environment:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [15:58:10] (03PS1) 10Volans: GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 [15:58:19] (03CR) 10Volans: GC: fix reported counter (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 (owner: 10Volans) [15:58:52] (03PS1) 10Alexandros Kosiaris: ganeti: Remove metad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/615789 [15:59:14] (03PS1) 10Cmjohnson: Adding the mgmt dns entries created by netbox to dns file (not yet automated) [dns] - 10https://gerrit.wikimedia.org/r/615790 (https://phabricator.wikimedia.org/T251619) [15:59:21] (03CR) 10jerkins-bot: [V: 04-1] GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 (owner: 10Volans) [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1600). 
[16:00:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Remove metad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/615789 (owner: 10Alexandros Kosiaris) [16:01:53] (03CR) 10Dzahn: [C: 03+2] remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn) [16:01:56] (03PS2) 10Dzahn: remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) [16:02:51] (03CR) 10Cmjohnson: [C: 03+2] Adding the mgmt dns entries created by netbox to dns file (not yet automated) [dns] - 10https://gerrit.wikimedia.org/r/615790 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [16:07:24] (03CR) 10Dzahn: [V: 03+2 C: 03+2] remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn) [16:07:29] (03PS3) 10Dzahn: remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) [16:09:00] (03PS2) 10Volans: GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 [16:11:08] (03PS1) 10CDanis: secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 [16:11:30] (03CR) 10CDanis: "Manually tested with httpbb on mwdebug1002." 
[puppet] - 10https://gerrit.wikimedia.org/r/615792 (owner: 10CDanis) [16:11:54] (03PS2) 10CDanis: secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 (https://phabricator.wikimedia.org/T151977) [16:13:11] (03PS1) 10Jbond: python3: add tox checks for python3 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793 [16:13:12] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10ayounsi) No idea if it's useful here but came across https://github.com/jeremyschulman/netbox-plugin-auth-saml2 [16:14:06] (03CR) 10Jbond: "Ready for at least a first pass." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793 (owner: 10Jbond) [16:14:11] (03CR) 10RLazarus: [C: 03+1] secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [16:15:13] (03CR) 10CDanis: [C: 03+2] secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [16:25:45] (03PS1) 10CDanis: httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) [16:28:48] (03PS1) 10Dzahn: phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796 [16:30:49] (03PS27) 10Ryan Kemper: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [16:31:41] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [16:33:19] (03PS28) 10Ryan Kemper: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [16:34:24] (03CR) 10DCausse: [C: 03+1] [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 
10DCausse) [16:35:42] (03PS1) 10Dzahn: ATS: add backend for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/615797 [16:36:29] (03CR) 10Ryan Kemper: [C: 03+2] "pcc looks fine: https://puppet-compiler.wmflabs.org/compiler1003/24106/" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [16:36:53] (03CR) 10RLazarus: [C: 03+1] "Optional: These support comments (# syntax) so you could add one here, either explaining the vulnerability or just listing the task number" [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [16:37:52] (03CR) 10Lucas Werkmeister (WMDE): "This is probably okay to go in now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE)) [16:38:06] (03PS2) 10CDanis: httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) [16:38:50] (03PS1) 10CDanis: ATS: force cache revalidation on secure.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/615799 (https://phabricator.wikimedia.org/T151977) [16:39:15] (03CR) 10RLazarus: [C: 03+1] httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [16:39:37] (03CR) 10CDanis: [C: 03+2] httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [16:40:40] (03PS2) 10Dzahn: visualdiff: update git branch from ruthenium to scandium [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) [16:42:38] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [16:52:44] 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10sbassett) 05Stalledβ†’03Resolved a:03CDanis Thanks, @cdanis. Looks to be fixed. Resolving and making public. [16:52:51] 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10sbassett) [16:56:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [16:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:25] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [16:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:35] (03PS1) 10Vidhi-Mody: Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) [17:00:04] halfak and accraze: #bothumor I οΏ½ Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1700). 
[17:00:54] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [17:04:06] (03CR) 10BBlack: [C: 03+1] ATS: force cache revalidation on secure.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/615799 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [17:06:46] (03CR) 10CDanis: [C: 03+2] ATS: force cache revalidation on secure.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/615799 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis) [17:11:40] 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10ArielGlenn) 05Resolved→03Open Almost resolved, heh. [17:16:44] 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10CDanis) 05Open→03Resolved Re-validation forced for ATS-BE, and also a Varnish cache ban has been put in place, so we should no l... 
[17:22:36] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [17:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:18] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P [17:24:24] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [17:24:26] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P [17:24:57] ? 
[17:25:35] PROBLEM - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:25:53] here [17:25:55] ryankemper: ^ [17:26:16] looking [17:27:01] here if you need anything [17:27:02] Pybal says the `readiness-probe` endpoint is timing out after 5 seconds on all WDQS boxes [17:27:08] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [17:27:30] I'll probably want to de-pool these now while investigating, trying to figure out if they're all down or just a subset [17:27:32] 17:11 was the first occurrence and it got worse from there [17:27:57] "marked down but pooled" suggests it's all of them, or at least more than pybal will depool at once [17:28:03] ack, thanks [17:28:07] cdanis: have I got that right? ^ [17:28:13] is it related to the running cookbooks above? [17:28:15] more than pybal will depool at once [17:28:43] (03CR) 10Ebernhardson: "pcc looks as expected: https://puppet-compiler.wmflabs.org/compiler1003/24107/an-airflow1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [17:28:58] ryankemper: https://phabricator.wikimedia.org/P12032 [17:29:32] I believe 'partially up' means it isn't actually passing the readiness probe, but depooling it would lead to Pybal having too many depooled [17:29:43] what's the easiest way to find puppet's last run time? 
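Note on "marked down but pooled": as discussed at 17:27:57 and 17:29:32, PyBal appears to refuse to depool more than a certain share of a pool, so backends that fail their health check beyond that limit stay pooled. The sketch below is a simplified model of that behaviour, not PyBal's actual code. The function name, the 0.5 threshold, the rounding rule, and the three-host pool are all assumptions made for illustration.

```python
import math


def pool_state(health, depool_threshold=0.5):
    """Split failing servers into (depooled, down_but_pooled).

    Assumption: at least ceil(depool_threshold * total) servers must
    stay pooled, whether or not they pass their health check.
    """
    total = len(health)
    min_pooled = math.ceil(depool_threshold * total)
    max_depooled = total - min_pooled
    failing = [name for name, ok in health.items() if not ok]
    # Only the first max_depooled failures can actually be removed.
    return failing[:max_depooled], failing[max_depooled:]


# Hypothetical pool in which every backend fails the readiness probe.
health = {"wdqs2001": False, "wdqs2002": False, "wdqs2003": False}
depooled, down_but_pooled = pool_state(health)
print("depooled:", depooled)
print("marked down but pooled:", down_but_pooled)
```

Under these assumptions a fully broken pool still sends traffic to some dead backends, which would be consistent with the socket timeouts reported by the LVS checks.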
[17:29:53] ryankemper: sudo /etc/update-motd.d/97-last-puppet-run [17:30:59] PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:31:14] ryankemper: did your change only roll out to codfw, or is it in eqiad too? [17:31:23] ryankemper: either puppetboard or if you need on a lot of hosts via cumin [17:31:25] We merged a puppet change for some work we're doing so I was wondering if that somehow broke things, but the associated services shouldn't have restarted [17:31:40] I'm not sure of wdqs provisioning, but it might be prudent to DNS Discovery-depool codfw for wdqs services [17:32:40] trying it by hand, I do see the `/readiness-probe` handler time out on wdqs2003 [17:33:00] looks like it gets rewritten by nginx to a trivial sparql: rewrite ^/readiness-probe$ /sparql?query=%20ASK%7B%20%3Fx%20%3Fy%20%3Fz%20%7D; [17:33:01] cdanis: the change we rolled out is for all instances but we only actually restarted services on our canary instance in eqiad [17:33:56] I'm realizing I don't have a great notion of what handles incoming requests for wdqs, i.e. if we have nginx in front of it or what [17:34:03] we do [17:34:09] the `DNS Discovery-depool codfw for wdqs services` sounds like a good idea since we're not seeing any problems on eqiad currently [17:34:27] ryankemper: was this the change? 
[17:34:28] https://puppetboard.wikimedia.org/report/wdqs2004.codfw.wmnet/902d0affb63bd8bd9f79db2cadce5e611f1359e4 [17:34:40] ack, I don't think codfw is doing useful work right now: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-1h&to=now [17:35:14] volans: yes [17:35:32] !log cdanis@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs.*,name=codfw [17:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:40] also just realized I only fed eqiad nodes to the `pcc` command so that change is looking suspect [17:35:58] notably: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=7&fullscreen&orgId=1&refresh=1m&from=now-1h&to=now [17:36:07] that change was applied to wdqs2004 at 16:45:58. [17:36:11] I'm guessing from this graph that the codfw nodes stopped being able to count how many triples they're serving once puppet ran [17:36:41] Okay well first things first let's get that change backed out [17:36:47] Working on a revert patch [17:37:34] ryankemper: did you find the cause? [17:37:55] * gehel is just back, need any help? [17:38:18] not specifically but `cdanis` noted that codfw's triple graphite metrics aren't coming through anymore [17:38:26] and the problem started when the automated puppet ran occurred on codfw [17:38:31] blazegraph is stuck on these nodes [17:38:52] codfw will be DNS-Discovery-depooled in another 2 minutes btw (5 minute TTL) [17:39:20] okay, probably not worth reverting given the 2 mins [17:39:33] to dcausse's point here's logs for blazegraph on 2003 https://www.irccloud.com/pastebin/QlLehAS8/ [17:40:39] curl -d 'query=SELECT * WHERE {?s ?p ?o . 
} limit 1&format=json' http://localhost:9999/bigdata/sparql is not responding on the few codfw nodes I tried [17:41:26] gehel: tldr is all of wdqs codfw is down, eqiad seems totally fine, and it's presumably related to the application of https://puppetboard.wikimedia.org/report/wdqs2004.codfw.wmnet/902d0affb63bd8bd9f79db2cadce5e611f1359e4 [17:41:46] how can I see ^ ? [17:42:15] dcausse: https://gerrit.wikimedia.org/r/c/operations/puppet/+/615795 [17:43:00] dcausse: puppetboard is ops-only because it can have prod auth secrets recorded in it by accident, but, here's teh diff that was applied here: https://phabricator.wikimedia.org/P12033 [17:43:38] ryankemper: are you reverting that change? need any help? [17:44:02] gehel: codfw is no longer serving wdqs requests (dnsdisc-depooled) so I think we're working on a fix instead of a revert [17:44:11] I can revert, I wasn't sure if we wanted to given that we're not serving requests from codfw [17:44:16] ^ [17:44:31] I'd like to debug few things first [17:44:38] blazegraph has not been restarted [17:44:47] Yup, go ahead [17:44:48] only the main endpoint is stuck [17:45:03] ryankemper: make sure you have the revert ready, just in case [17:45:12] good point, will open up a patch [17:45:40] ryankemper: you can create a revert directly from the gerrit UI [17:45:49] wdqs2008 is fine [17:45:53] I don't see anything obviously wrong in https://phabricator.wikimedia.org/P12033 [17:46:21] the diff that was applied in eqiad looks similar [17:46:30] it's not all codfw, I hope it's not T242453 accross all the codfw fleet... 
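Note on the readiness probe: the nginx rewrite quoted at 17:33:00 sends `/readiness-probe` to a percent-encoded SPARQL query. A minimal sketch that decodes the target with the Python standard library shows what the probe asks the backend; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, urlsplit

# Rewrite target copied from the nginx rule quoted in the log.
rewritten = "/sparql?query=%20ASK%7B%20%3Fx%20%3Fy%20%3Fz%20%7D"

# parse_qs percent-decodes the value of the "query" parameter.
probe_query = parse_qs(urlsplit(rewritten).query)["query"][0].strip()
print(probe_query)  # ASK{ ?x ?y ?z }
```

Since the probe only asks whether any triple exists, a timeout on it suggests the SPARQL endpoint as a whole is not answering, which would match the unresponsive `curl` test against `localhost:9999`.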
[17:46:31] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [17:46:38] Yeah my thinking was given none of blazegraph, categories, updater were restarted, I don't quite understand how things would have broken [17:46:53] taking few stackdumps [17:47:00] this looks like a coincidence to me [17:47:24] dcausse: should we restart one of the stuck server, see if it recovers? [17:47:28] yes [17:47:48] dcausse: let us know when you're good on the thread dumps and on which server [17:47:49] Let me know when you have the trace and then I can restart blazegraph on that instance [17:48:47] restarted blazegraph on wdqs2001 [17:49:06] 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Quiddity) Just noting for the record, I had similar problems on Monday, whilst delivering TechNews. It delivered duplicates to 6 Wiktionary... [17:49:10] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmne [17:49:10] .wmnet, wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:49:38] dcausse: is blazegraph serving on 2001 yet? 
Pybal still reports it failing the readiness probe [17:49:39] Well this is now seeming awfully coincidental [17:49:42] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [17:50:01] (03PS1) 10ZPapierski: Migrate wcqs to wcqs-beta.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810 [17:50:01] ehm [17:50:05] PROBLEM - LVS wdqs-ssl eqiad port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:50:05] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 5.491 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:50:06] I think we're about to have a full outage [17:50:20] PROBLEM - Check the last execution of mediawiki_job_wikidata-updateQueryServiceLag on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_wikidata-updateQueryServiceLag https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:50:21] Agreed [17:50:26] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmne [17:50:26] .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:50:28] Opening up the revert [17:50:29] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:50:38] Should we just try restarting blazegraphs on eqiad now? [17:51:09] I can now confirm user impact, simple queries don't run for me [17:51:10] (03PS1) 10Ryan Kemper: Revert "[wdqs] add a new streaming updater profile" [puppet] - 10https://gerrit.wikimedia.org/r/615784 [17:51:12] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:32] (03CR) 10CDanis: [V: 03+2 C: 03+2] Revert "[wdqs] add a new streaming updater profile" [puppet] - 10https://gerrit.wikimedia.org/r/615784 (owner: 10Ryan Kemper) [17:51:37] RECOVERY - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:51:52] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:51:55] ^ Does this imply the blazegraph restart fixed 2001 and now codfw is responsive? 
[17:52:04] I think it does [17:52:07] Should we revert and restart blazegraph fleetwide or just restart without revert [17:52:08] I'm merging your puppet patch anyway [17:52:10] Okay [17:52:13] Sounds good [17:52:14] eqiad still down [17:52:15] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs.*,name=codfw [17:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:20] and repooling codfw [17:52:30] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [17:52:35] Should I restart blazegraph on an eqiad instance to get it serving, or just wait for your repool cdanis [17:52:43] I am only reporting user impact, as it is the only thing I know how to do [17:52:56] thanks jynus, helpful to know :) [17:53:11] ryankemper: wait for the patch to be merged, puppet-apply and restart [17:53:18] !log ❌cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin -b10 'wdqs*' "run-puppet-agent --unless-version 1a4ae81" [17:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:15] ryankemper: okay, revert is applied on all wdqs* hosts, please start restarting blazegraphs [17:54:23] Proceeding [17:56:22] internal clusters were fine, it's only server receiving queries from outside [17:58:00] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:58:33] Manually restarted blazegraph on `wdqs1003` to get eqiad back up asap, and am now restarting every wdqs instance except `wdqs1003` and `wdqs2001` which we've already restarted: [17:58:52] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:58:57] RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:58:57] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2002 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:59:00] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:59:02] !log sudo -E cumin -b 10 'A:wdqs-all and not A:wdqs-test and not P{wdqs1003.eqiad.wmnet} and not P{wdqs2001.codfw.wmnet}' 'sudo systemctl restart wdqs-blazegraph.service' [17:59:02] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:08] checking [17:59:25] query went through, I think we are back [17:59:25] RECOVERY - LVS wdqs-ssl eqiad port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.023 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:59:34] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:59:38] okay, we have some healthy wdqsen in both clusters now, so, we are out of outage [17:59:41] waiting for confirmation at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=2&fullscreen&orgId=1&from=1595516372253&to=1595527172254&var-cluster_name=wdqs 
[17:59:47] yup, WDQS works again for me too [17:59:47] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:59:52] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:59:54] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:59:56] (it's important to remember that DNS Discovery does *not* consider backend healthiness in its 'decisions') [17:59:58] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.686 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:00:00] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:03] ack re backend readiness [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I οΏ½ Unicode. All rise for Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1800). [18:00:05] Amir1: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:14] o/ [18:00:18] (03PS2) 10ZPapierski: Migrate wcqs to wcqs-beta.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810 [18:00:23] ryankemper: \o/ [18:00:23] RoanKattouw, Niharika, Urbanecm, Amir1: please hold off on deploying anything [18:00:24] rate of queries going up [18:00:30] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:38] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:43] close to previous rate [18:00:45] sure [18:00:46] rzl: ack [18:00:46] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:00:49] rzl: ack [18:00:50] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:00:54] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:00:56] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:00:59] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=2&fullscreen&orgId=1&from=1595516372000&to=1595527254469&var-cluster_name=wdqs [18:01:14] RECOVERY - Check the last execution of mediawiki_job_wikidata-updateQueryServiceLag on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_wikidata-updateQueryServiceLag https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:01:28] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: 
no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:01:34] So now that the dust is settling, sounds like we've got two theories to follow: one that something in the puppet change broke everything and the other being the deadlock issue that dcausse referenced [18:01:42] will this have impacted dispatch lag, or that was part of the internal "not impacted" part? [18:01:51] not dispatch [18:01:59] the api lag, not sure what that is called [18:02:05] maxlag? [18:02:05] jynus: yes it will [18:02:10] yes, that [18:02:12] ryankemper: that sounds right to me, yeah -- although if it was the former, I would have expected it to follow Puppet runs more closely [18:02:16] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:02:16] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:02:21] ok, checking mw api now [18:02:27] for wikidata [18:02:35] ryankemper et all: whenever you're satisfied prod is stable again, can you highlight the folks in that jouncebot message again and give them the all clear please [18:02:38] maxlag on https://grafana.wikimedia.org/d/000000170/wikidata-edits?refresh=1m&orgId=1&from=1595525551004&to=1595527351005 has gaps [18:02:38] and internal cluster would have been affested too [18:02:40] ryankemper: I didn't actually verify that it did follow puppet runs; that's just what it possibly looked like from the graph (failures staggered across a plausible-enough time interval) [18:02:52] cdanis: right, understood [18:03:02] icinga alerts for wdqs and for LVS-for-wdqs have cleared [18:03:08] my mental model on the deadlock is if it were somewhat related to load / cpu usage etc we could have had a domino-type effect [18:03:16] since if it's temporally independent we would never see the behavior we saw today [18:03:25] ryankemper: yeah, you can 
also get a similar-looking effect from a "query of death" from a user [18:03:36] deadlock is perhaps related to a bad query [18:03:45] gets LB'd to one server, crashes it --> user retries --> bad query goes to another server --> wash rinse repeat [18:03:45] that would make a lot of sense [18:03:52] if repeated enough it'll bring all clusters down [18:03:58] rzl: which jouncebot message are you referring to? things are stable enough now for me to give them the all-clear [18:03:59] not sure I see any impact on mediawiki api for wikibase behaviour [18:04:13] ryankemper: ack, thanks! [18:04:21] Niharika: Amir1: you can proceed :) [18:04:23] RoanKattouw, Niharika, Urbanecm, Amir1: disregard my last, go ahead at your convenience :) thanks [18:04:32] :) [18:04:35] (I know it is unrelated, but thinking about usual complain about maxlag) [18:05:21] jynus: since the WDQS lag wasn't reported by dead servers, it will probably take a few minutes to propagate back to Wikidata maxlag [18:05:22] (03PS3) 10ZPapierski: Migrate wcqs to wcqs-beta.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810 [18:05:47] gehel: I see, so the bug cancelled itself :-D [18:05:58] (03PS3) 10Dzahn: visualdiff: update git branch from ruthenium to scandium [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) [18:06:20] I have a meeting right now. Urbanecm would you be able to swat? 
[18:06:35] Niharika: I believe Amir1 is able to self-service :) [18:07:08] Sure, I can do it [18:07:20] so everything I am looking at looks healthy except, understandably, the lag [18:07:21] it's also risky things, so I need to test lots of things [18:07:56] ryankemper: FWIW, my money is on the query-of-death idea -- codfw hosts broke within a few minutes of each other, but no eqiad hosts until ~10 minutes after my dnsdisc-depool of codfw [18:08:18] and then when they did break, all the eqiad hosts broke ~simultaneously at 17:45 [18:08:54] (03CR) 10Dzahn: [C: 03+2] "scandium-only: https://puppet-compiler.wmflabs.org/compiler1003/24110/scandium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:09:37] yes, that would also explain why *almost* all but not literally all of codfw got into a bad state [18:09:40] uh [18:09:44] codfw hosts are going down again [18:09:55] it's not done [18:10:12] so, we need to figure out what query is doing it and then possibly figure out the actual user and make them stop?
[18:10:21] yeah [18:10:23] PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:10:23] meanwhile i'll need to play whackamole and try to restart blazegraph enough to keep service going [18:10:33] we need to find the query at fault, and block it somehow [18:10:41] any volunteers to try to fix the actual problem while I play whack-a-mole [18:10:46] we can also do it the other way around [18:11:04] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled https://wikitec [18:11:04] iki/PyBal [18:11:22] the query is unlikely to have a chance to be logged if it's causing the deadlock (at least from the backend [18:11:45] perhaps a stupid idea: could decreasing query timeout help? [18:11:45] dcausse: does blazegraph only log at the end of query execution? and not the start? 
[18:11:50] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:12:20] (03CR) 10Ladsgroup: [C: 03+2] "BACC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE)) [18:12:27] PROBLEM - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:12:28] cdanis: yes, we could log before but we might get the same info from webrequest logs [18:12:35] this might be the tail end of a/the query? https://www.irccloud.com/pastebin/r44IH5FF/ [18:12:37] cdanis: yes, blazegraph only logs after query completion [18:12:50] logging only at the end of query execution is one of my distributed systems pet peeves :) [18:13:04] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmne [18:13:04] but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:13:10] (03Merged) 10jenkins-bot: extension-list: Load WikibaseRepo via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE)) [18:13:15] !log restarted blazegraph on 2001 [18:13:17] actually not entirely true, it logs a number of operations, but probably nothing that will help us too much here [18:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:28] (that log message was a little too vague...oh well) [18:13:52]
gehel: any thoughts on Urbanecm 's timeout idea btw [18:14:04] I doubt it will help much [18:14:09] does /var/log/nginx/access.log contain info? (can't access it) [18:14:16] looking [18:14:27] but we might try (the timeout thing) [18:14:54] dcausse: here's a subset of what it looks like [18:15:02] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:15:06] https://www.irccloud.com/pastebin/AdxovFit/%2Fvar%2Flog%2Fnginx%2Faccess.log [18:15:21] users with too many requests in error should be banned by throttling eventually, but if the server freezes, that's not actually going to help [18:15:35] `2001` is back up, going to whack the 3 remaining codfw nodes [18:15:53] https://www.irccloud.com/pastebin/Cx6mXSc7/ [18:16:09] (03CR) 10Ladsgroup: "https://www.irccloud.com/pastebin/Cx6mXSc7/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE)) [18:16:24] (03CR) 10Dzahn: "This was a noop on scandium (so far i did not touch anything manually so the branch is still ruthenium there as before and puppet does not" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:16:54] (03CR) 10Ladsgroup: [C: 03+2] Load WikibaseClient from extension.json file instead of php one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613235 (https://phabricator.wikimedia.org/T256228) (owner: 10Ladsgroup) [18:17:31] this is the one from the paste: https://logstash.wikimedia.org/goto/1d2fabcb65d0c0520c3e58f31f3ca786 [18:17:39] (03Merged) 10jenkins-bot: Load WikibaseClient from extension.json file instead of php one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613235 (https://phabricator.wikimedia.org/T256228) (owner: 10Ladsgroup) [18:17:49] RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4
#page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 485 bytes in 7.922 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:17:54] but I don't know if it is the one causing issues, just searched it on logs [18:18:17] ryankemper: we might perhaps have the line in /var/log/nginx/error.log when nginx bails on a gateway error? [18:18:43] (03CR) 10ZPapierski: [C: 04-1] "This patch behaves correctly, but requires a compatible oauth consumer, so I'm blocking it until one is available." [puppet] - 10https://gerrit.wikimedia.org/r/615810 (owner: 10ZPapierski) [18:20:17] the first error.log entry I see when the 'new' round of errors begin at 18:12:01 is from maps2003 [18:20:22] on wdqs2003 [18:21:09] cdanis: could you copy /var/log/nginx/error.log somewhere I can read on wdqs2007, please? [18:21:21] the error logs are full of stuff, not sure what to search for [18:21:39] dcausse: copy in your homedir [18:21:44] thanks! [18:21:47] !log testreduce1001 - rm -rf /srv/testreduce and run puppet to re-clone testreduce to it from the scandium branch (T257906) [18:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:52] T257906: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 [18:22:08] also, AI for later: file a task to get devs access to those logs [18:22:26] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:22:48] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:23:29] PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:23:42] okay, so, operating on the theory that these are user queries that are being routed to codfw, here's the ones that are doing that and are failing at the Varnish level too: https://logstash.wikimedia.org/goto/bcbe3c97cc8f541aa44db95a523841af [18:24:20] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:24:40] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:25:10] it's important to note that this of course includes 'anyone who is sending close-to-significant query traffic to WDQS', [18:25:13] RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.165 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:25:20] looking at high time_firstbytes helps filter some [18:25:25] RECOVERY - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:25:36] I am going to disable notifs for those [18:26:02] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:26:12] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:26:48] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:27:30] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): 
Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10Andrew) [18:28:35] Deploying this big change [18:28:47] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service: wdqs admins should have access to nginx logs on wdqs machines - https://phabricator.wikimedia.org/T258739 (10Dzahn) [18:28:48] Keep in mind for performance, errors, etc. [18:28:58] cdanis: i made that task, hope it helped [18:29:21] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:613235|Load WikibaseClient from extension.json file instead of php one (T257437 T256228 T88258)]] (duration: 01m 05s) [18:29:26] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:29] T88258: Convert WikibaseRepository, WikibaseClient, WikibaseLib and WikibaseView to use extension registration - https://phabricator.wikimedia.org/T88258 [18:29:29] T257437: Deploy Client to production using extension registration - https://phabricator.wikimedia.org/T257437 [18:29:29] T256228: Convert WikibaseClient to use extension registration - https://phabricator.wikimedia.org/T256228 [18:34:28] mutante: thanks [18:35:26] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:35:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-sidecar site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:34] 
PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:36:26] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:38:02] time for another round of whack a mole [18:39:04] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:39:37] !log BACC is done [18:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:58] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [18:41:12] dcausse: I don't know how I can help, but could you give me read rights to that error.log? [18:41:56] thx!
[18:42:17] done [18:42:22] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:44:08] zpapierski_: dcausse: "sudo journalctl -u nginx" should work already, btw [18:44:14] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:44:19] because journalctl * is in your sudo privs [18:44:24] !log Restarted blazegraph on following codfw wdqs nodes: 2007, 2003, and 2002 [18:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:38] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:46:46] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:47:20] looks like wdqs2001 is presently unhealthy, but not others [18:48:08] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:48:12] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service: wdqs admins should have access to nginx logs on wdqs machines - https://phabricator.wikimedia.org/T258739 (10Dzahn) `sudo journalctl -u nginx` should already work but it does not contain the same information that is in the error.log... 
[18:48:28] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [18:49:58] 10Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (10Urbanecm) [18:50:48] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:51:00] !log restarted blazegraph on codfw wdqs2001 [18:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:55] So quick update, we've got some people spelunking for a potential bad query/actor, we've blocked a suspect IP at the varnish level and restarted blazegraph on affected codfw nodes so we're waiting to see if we get another round of outage [18:52:36] Also note the suspect ip was apparently entering via `ulsfo` which would hit `codfw` which lines up w/ the behavior we've seen [18:54:06] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [18:57:06] (03PS1) 10Dzahn: admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) [18:57:26] thanks ^ [18:59:02] (03PS1) 10Mholloway: Bump wikifeeds to 2020-07-23-185301-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615819 [19:00:04] longma and liw: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American+European Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1900). [19:01:56] Hello. 
The train is blocked currently so we might not have a deployment during the window [19:02:47] (03CR) 10DCausse: "thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [19:04:53] (03CR) 10Mholloway: [C: 03+2] Bump wikifeeds to 2020-07-23-185301-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615819 (owner: 10Mholloway) [19:05:58] (03Merged) 10jenkins-bot: Bump wikifeeds to 2020-07-23-185301-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615819 (owner: 10Mholloway) [19:06:05] (03PS1) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) [19:07:01] (03CR) 10Dzahn: admins: let wdqs-admins view nginx logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [19:07:22] (03CR) 10DCausse: [C: 03+1] admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [19:07:38] (03CR) 10DCausse: [C: 03+1] admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [19:09:01] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [19:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:38] Any chanop around? [19:11:02] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [19:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:22] _joe_: mind a pm about a chanopy thing? [19:12:52] <_joe_> what's up? 
[19:13:30] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [19:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:11] (03PS1) 10Bstorm: wiki-replicas: Add clouddb naming to regexes [puppet] - 10https://gerrit.wikimedia.org/r/615823 (https://phabricator.wikimedia.org/T257987) [19:19:36] PROBLEM - Long running screen/tmux on weblog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 9621, 1733239s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:21:09] ^ it's possible to whitelist hosts where long running screen/tmux should never alert [19:21:21] if that is desired for weblog* ..not sure [19:26:54] (03PS1) 10Ladsgroup: labs: Load Wikibase Repo using extension.json instead of php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) [19:27:07] 10Operations, 10Traffic, 10Sustainability (Incident Followup): upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517 (10Aklapper) >>! In T106517#5666579, @ema wrote: > I cannot reproduce with URLs such as https://upload.wikimedia.org/wikiped... [19:29:05] (03PS1) 10Dzahn: do not monitor long-running screens on weblog* hosts [puppet] - 10https://gerrit.wikimedia.org/r/615826 [19:29:54] (03CR) 10Dzahn: "19:19 <+icinga-wm> PROBLEM - Long running screen/tmux on weblog1001 is CRITICAL: CRIT: Long running tmux process. 
(user: cdanis PID: 9621," [puppet] - 10https://gerrit.wikimedia.org/r/615826 (owner: 10Dzahn) [19:33:51] (03CR) 10Ladsgroup: "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) (owner: 10Ladsgroup) [19:35:32] (03CR) 10Ladsgroup: [C: 03+2] labs: Load Wikibase Repo using extension.json instead of php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) (owner: 10Ladsgroup) [19:36:15] (03Merged) 10jenkins-bot: labs: Load Wikibase Repo using extension.json instead of php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) (owner: 10Ladsgroup) [19:43:20] (03CR) 10RLazarus: [C: 03+1] admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [19:43:30] (03CR) 10RLazarus: [C: 03+1] admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [19:48:05] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10RobH) [19:48:16] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10RobH) [19:51:29] (03PS1) 10Andrew Bogott: Rename cloudcephosd1004 through 1015. [puppet] - 10https://gerrit.wikimedia.org/r/615828 (https://phabricator.wikimedia.org/T251619) [19:52:11] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/24111/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615796 (owner: 10Dzahn) [19:52:32] (03CR) 10Andrew Bogott: [C: 03+2] Rename cloudcephosd1004 through 1015. 
[puppet] - 10https://gerrit.wikimedia.org/r/615828 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [19:54:18] (03PS1) 10BryanDavis: dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) [19:58:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10RobH) [19:58:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10RobH) [20:00:09] (03PS2) 10BryanDavis: dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) [20:01:44] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) Merging the change above was a noop on scandium. I did not manually touch it so far, so the git repo at /srv/testreduce is unchang... 
[20:07:34] (03PS1) 10Dzahn: parsoid: remove vd_server and vd_client from parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/615831 (https://phabricator.wikimedia.org/T257906) [20:10:09] (03CR) 10BryanDavis: "100% untested at this point" [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [20:17:40] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-30) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10RobH) [20:18:48] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-30) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10RobH) [20:20:26] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: () rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH) [20:20:35] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: () rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH) [20:21:16] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-09-30) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH) [20:22:46] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10RobH) [20:22:51] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-09-14) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH) [20:26:54] PROBLEM - Disk space on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops [20:27:10] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:10] PROBLEM - Check size of conntrack table on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:33:27] (03PS1) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619) [20:34:36] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:36] PROBLEM - Check size of conntrack table on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:36:18] PROBLEM - configured eth on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:36:26] ^^ that's me reimaging (which is slow to eqsin, and the downtime expired) [20:37:41] (03PS2) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619) [20:39:38] 10Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (10Aklapper) May want to update https://meta.wikimedia.org/wiki/IRC/wikimedia-ops/Operators once done [20:41:39] (03PS3) 10BryanDavis: dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) [20:44:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script
wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [20:45:39] (03CR) 10BryanDavis: "Tested via manual application to /etc/nginx/lua/domainproxy.lua on proxy-01.proxy-codfw1dev.codfw1dev.wikimedia.cloud." [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [20:47:18] (03PS1) 10RobH: updating with new power cord skus [software] - 10https://gerrit.wikimedia.org/r/615836 [20:48:21] (03CR) 10RobH: [C: 03+2] updating with new power cord skus [software] - 10https://gerrit.wikimedia.org/r/615836 (owner: 10RobH) [20:57:30] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [20:58:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:58:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:58:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:58:10] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:58:10] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:58:12] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:58:12] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:58:13] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:58:13] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:37] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:11] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:39] andrewbogott: fyi, the host rename breaks icinga config because of the relation between switches and the cloudcephos files. maybe it will just go away after the next puppet run though, i dunno yet [21:00:56] :( [21:01:02] it'll probably clear on its own after a run or two [21:02:34] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:11] Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent [21:03:18] let's check again later [21:07:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1008.eqiad.wmnet', 'c... 
[21:10:36] (03PS3) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619) [21:10:38] (03PS1) 10Andrew Bogott: cloudcephosd nodes: Experiment with using a hw raid for the / volume [puppet] - 10https://gerrit.wikimedia.org/r/615838 (https://phabricator.wikimedia.org/T251619) [21:11:34] (03CR) 10Andrew Bogott: [C: 03+2] cloudcephosd nodes: Experiment with using a hw raid for the / volume [puppet] - 10https://gerrit.wikimedia.org/r/615838 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [21:13:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [21:14:55] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10RKemper) [21:15:16] RECOVERY - Check size of conntrack table on prometheus5001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:15:16] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:18:36] (03CR) 10Herron: [C: 03+1] profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [21:19:24] DannyS712: can you remember which script unsets everyone from a user group? I think you asked for it to be ran before. [21:19:57] RhinosF1: emptyUserGroup.php iirc [21:20:12] Majavah: that would be obvious [21:20:28] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@c99c626]: airflow: centralize installation specific airflow Variables [21:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:43] clearly you haven't used createAndPromote.php [21:21:01] Majavah: last time I tried, I gave up [21:21:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@c99c626]: airflow: centralize installation specific airflow Variables (duration: 00m 34s) [21:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:20] heh. 
anyways I'm off to bed [21:27:07] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:28] RECOVERY - Disk space on prometheus5001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops [21:30:43] yeah I've asked a few times - eg T250575 [21:30:44] T250575: Remove user rights on test2.wikipedia.org for undeployed extension EducationProgram - https://phabricator.wikimedia.org/T250575 [21:31:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... 
[21:34:11] (03CR) 10Herron: [C: 03+1] "I'm not super familiar with the inner workings of smokeping, but the approach LGTM as long as it is valid to have alerts unset in the targ" [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) (owner: 10Filippo Giunchedi) [21:38:04] RECOVERY - configured eth on prometheus5001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:38:42] (03CR) 10Herron: [C: 03+1] mariadb: Remove puppet mysql grants for m1 misc databases [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo) [21:39:35] (03PS13) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [21:41:02] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: cloudcephosd1004.eqiad.wmnet, cloudcephosd1007.eqiad.wmnet, cloudcephosd1010.eqiad.wmnet, cloudcephosd1006.eqiad.wmnet, contint2001.wikimedia.org, cloudcephosd1009.eqiad.wmnet, cloudcephosd1005.eqiad.wmnet, cloudcephosd1008.eqiad.wmnet, contint1001.wikimedia.org, testred [21:41:02] et, aphlict1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:41:19] (03PS14) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [21:42:37] (03PS15) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [21:45:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:48:48] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:02] (03PS2) 10Dzahn: ATS: add backend for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) [21:53:07] (03PS2) 10Dzahn: phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796 (https://phabricator.wikimedia.org/T238593) [21:53:14] (03PS16) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [21:53:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1004.eqiad.wmnet'] `... [21:54:13] (03CR) 10Herron: [C: 03+1] lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [21:54:52] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) (owner: 10Filippo Giunchedi) [21:55:39] (03PS1) 10Dzahn: aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) [21:56:50] (03CR) 10Jeena Huneidi: [C: 03+2] Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [21:56:54] (03CR) 10jerkins-bot: [V: 04-1] aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [21:58:21] (03Merged) 10jenkins-bot: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [22:04:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [22:14:20] andrewbogott: it still has 7 errors [22:15:09] mutante: I will look shortly. 
Currently lost in partman :( [22:15:34] (03CR) 10Dzahn: "noop on phab1001 https://puppet-compiler.wmflabs.org/compiler1001/24112/" [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [22:16:18] andrewbogott: ack, thanks [22:18:53] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1004.eqiad.wmnet'] `... [22:35:31] (03PS1) 10BryanDavis: toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) [22:36:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [22:36:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [22:37:10] mutante: in addition to my ping in the other channel… I don't see which 7 errors you're seeing. Did they go away on their own? 
[22:37:43] oh maybe it's because everything is downtimed for reimage [22:41:57] (03PS1) 10CDanis: appserver hiera: nginx is no more, long live envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/615874 [22:42:24] (03PS2) 10Dzahn: aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) [22:43:17] andrewbogott: it's these: sudo icinga -v /etc/icinga/icinga.cfg | grep Errors [22:43:27] Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host ... [22:43:31] followed by the 7 new hosts [22:43:36] (03CR) 10jerkins-bot: [V: 04-1] aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [22:44:00] that means it can't reload the config to add new things [22:44:24] Ok, will recheck [22:44:49] there is a parent/child relationship between the cloud switches and these hosts [22:45:05] and for some reason the parent switches are not valid anymore now [22:45:35] this stuff is there to avoid that all the hosts are alerting when one switch is down (afaict) [22:46:22] (03PS2) 10BryanDavis: toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) [22:46:53] (03CR) 10CDanis: "pcc https://puppet-compiler.wmflabs.org/compiler1001/24113/mw2335.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615874 (owner: 10CDanis) [22:47:03] to avoid this you may have to properly decom the old hosts and then add the new hosts.. 
not sure [22:48:01] direct renaming often has (similar) issues [22:50:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:50:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:50:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:50:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:50:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:50:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1011.eqiad.wmnet'] `... [22:50:54] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:50:54] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:50:54] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:50:54] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:15] !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'echostore' for release 'staging' . 
[22:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:53] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:57] what's with the duplicate logging today [22:52:26] stashbot is doing everything 4 times? [22:52:26] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [22:52:55] !log stashbot quadruple log test [22:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:25] (03PS3) 10Dzahn: aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) [22:55:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:48] mutante: I think it is just stashbot falling behind on processing the rapid fire !log messages from the cookbooks [22:57:06] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10wkandek) p:05Triage→03Medium [22:57:48] bd808: ah, yep. maybe the difference between one host at a time or a regex [22:58:10] It does look funny here in the channel, but it seems to have the right number of entries in the actual log [22:59:55] ack [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! 
Just kidding. Time for Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T2300). [23:00:11] bleh, there isn't info about how to purge a host from icinga in the docs anymore [23:01:01] it's part of the decom cookbook [23:01:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1005.eqiad.wmnet', 'c... [23:07:25] (03CR) 10Dzahn: [C: 03+2] aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [23:13:11] (03PS3) 10BryanDavis: toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) [23:16:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [23:16:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... 
[23:16:42] (03PS1) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 [23:17:55] (03CR) 10jerkins-bot: [V: 04-1] WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (owner: 10CDanis) [23:18:00] yah I know jerkins [23:18:33] (03CR) 10BryanDavis: [C: 03+1] "Tested via manual application on tools-legacy-redirector.tools.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) (owner: 10BryanDavis) [23:18:35] (03PS2) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 [23:19:42] (03PS3) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 [23:19:59] (03PS1) 10Dzahn: aphlict: add phab_deploy_finalize and rollback scripts [puppet] - 10https://gerrit.wikimedia.org/r/615879 (https://phabricator.wikimedia.org/T238593) [23:21:02] (03CR) 10jerkins-bot: [V: 04-1] WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (owner: 10CDanis) [23:29:38] (03PS1) 10Tim Starling: Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851 [23:30:21] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [23:30:21] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [23:30:21] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [23:30:22] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [23:30:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:30:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:30:25] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:33] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1011.eqiad.wmnet'] `... [23:32:20] (03PS4) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) [23:32:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:46] (03PS1) 10BryanDavis: toolforge: Allow large POST to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/615881 (https://phabricator.wikimedia.org/T258760) [23:33:35] (03CR) 10jerkins-bot: [V: 04-1] WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) (owner: 10CDanis) [23:34:56] (03CR) 10CDanis: "This is mostly-correct, as far as I can tell from PCC*, but fails CI because I'm not smart enough to edit the fixtures there." 
[puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) (owner: 10CDanis) [23:37:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1007.eqiad.wmnet', 'c... [23:41:17] (03CR) 10BryanDavis: toolforge: Allow large POST to tools.wmflabs.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615881 (https://phabricator.wikimedia.org/T258760) (owner: 10BryanDavis) [23:42:02] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:00] (03CR) 10BryanDavis: [C: 03+1] toolforge: Temp handling for tools.wmflabs.org/wpcleaner (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) (owner: 10BryanDavis) [23:59:37] (03CR) 10Legoktm: [C: 03+1] Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851 (owner: 10Tim Starling) [23:59:40] (03CR) 10Reedy: [C: 03+1] Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851 (owner: 10Tim Starling)