[00:07:45] (CR) Dzahn: [C: +2] Add DNS entries for initiativeswiki [dns] - https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: Urbanecm)
[00:07:50] (PS4) Dzahn: Add DNS entries for initiativeswiki [dns] - https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: Urbanecm)
[00:08:55] !log DNS - add initiatives.wikimedia.org (and initiaves.m) for campaign wiki requested at T167375
[00:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:00] T167375: Creation of a "Campaign" Wiki - initiatives.wikimedia.org - https://phabricator.wikimedia.org/T167375
[00:17:44] PROBLEM - Check systemd state on mw1297 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:19:33] oic wm1297 is WIP :)
[00:19:58] chaomodus: yes, and there wasn't even an alert, right
[00:20:06] ah, yes
[00:20:10] no just here
[00:20:58] PROBLEM - Nginx local proxy to apache on mw1297 is CRITICAL: connect to address 10.64.16.62 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[00:21:05] yes, downtimed
[00:21:09] but can't do that before it exists
[00:21:17] ofc
[00:21:24] and yes.. a single appserver would never page
[00:21:36] that just happens for some cloud*
[00:22:26] also the systemd state is fixed by "have you tried rebooting it"
[00:23:11] hehe in this case it was apache because it didn't have configs in conf-enabled yet
[00:28:52] !log mw1297 - scap pull
[00:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:45] !log mw1297 - rebooting for nutcracker issue
[00:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:38] PROBLEM - Host mw1297 is DOWN: PING CRITICAL - Packet loss = 100%
[00:31:32] RECOVERY - Host mw1297 is UP: PING WARNING - Packet loss = 58%, RTA = 70.38 ms
[00:31:36] RECOVERY - Check systemd state on mw1297 is OK: OK - running: The system is fully operational
[00:31:46] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:32:39] ACKNOWLEDGEMENT - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: sgebastion class instances not spread out enough BryanDavis New instance added for testing, needs to be placed on a different cloudvirt manually. - The acknowledgement expires at: 2019-04-26 00:31:10. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:35:55] Operations, ops-esams, DC-Ops, Traffic: Multiple systems in esams OE10 showing PSU failures - https://phabricator.wikimedia.org/T177228 (Dzahn) cp3033 is shown as having CRIT redundancy https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3033&service=IPMI+Sensor+Status
[00:40:03] Operations, ops-esams, Traffic: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (Dzahn) The host also shows that power supplies are not redundant.. which had a comment linking to T177403 -> T177228. And support has expired (https://netbox.wikimedia.org/dcim...
[00:41:24] ACKNOWLEDGEMENT - Long running screen/tmux on notebook1003 is CRITICAL: CRIT: Long running SCREEN process. (user: fsalutari PID: 15618, 2604253s 1728000s). daniel_zahn already emailed user https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[00:44:18] Operations, Traffic: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (Dzahn)
[01:15:10] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:46:54] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:49:30] PROBLEM - mediawiki-installation DSH group on mw1297 is CRITICAL: Host mw1297 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[03:45:11] !log repooled wdqs1003, it's good now
[03:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:10] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (CDanis) @Cwek Thank you very much for the detailed report! I've rolled back the experimental change to our DNS...
[04:11:08] PROBLEM - puppet last run on aqs1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:37:34] RECOVERY - puppet last run on aqs1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[05:01:48] (PS1) Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819)
[05:03:13] (PS4) Santhosh: Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: KartikMistry)
[05:33:41] (PS1) Marostegui: db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046
[05:34:49] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046 (owner: Marostegui)
[05:35:49] (Merged) jenkins-bot: db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046 (owner: Marostegui)
[05:37:14] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2080 and db2083 (duration: 00m 54s)
[05:37:22] !log Upgrade db2080 and db2083
[05:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:52] (PS1) Marostegui: Revert "db-codfw.php: Depool db2080,db2083" [mediawiki-config] - https://gerrit.wikimedia.org/r/506047
[05:42:34] (CR) jenkins-bot: db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046 (owner: Marostegui)
[05:49:42] Operations, ops-codfw, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui) @Papaul can we upgrade firmware and BIOS on db2080?, I was bitten by this today.
[05:49:58] (Abandoned) Marostegui: Revert "db-codfw.php: Depool db2080,db2083" [mediawiki-config] - https://gerrit.wikimedia.org/r/506047 (owner: Marostegui)
[05:52:04] (PS1) Marostegui: db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048
[05:53:21] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048 (owner: Marostegui)
[05:54:19] (Merged) jenkins-bot: db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048 (owner: Marostegui)
[05:54:33] (CR) jenkins-bot: db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048 (owner: Marostegui)
[05:55:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2083 and depool db2086 (duration: 00m 52s)
[05:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:41] !log Upgrade db2086
[05:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:17] (PS1) Marostegui: db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050
[06:08:22] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050 (owner: Marostegui)
[06:09:21] (Merged) jenkins-bot: db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050 (owner: Marostegui)
[06:10:34] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2086, depool db2079 (duration: 00m 53s)
[06:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:43] !log Upgrade db2079
[06:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:41] (CR) jenkins-bot: db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050 (owner: Marostegui)
[06:15:54] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:18:07] !log Upgrade db2081
[06:18:08] (PS1) Marostegui: db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051
[06:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:30] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051 (owner: Marostegui)
[06:21:39] (Merged) jenkins-bot: db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051 (owner: Marostegui)
[06:22:50] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2079, depool db2082 (duration: 00m 55s)
[06:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:56] !log Upgrade db2082
[06:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:47] (CR) jenkins-bot: db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051 (owner: Marostegui)
[06:29:38] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:31:32] Operations, Traffic: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (Vgutierrez) p: Triage→Low that's expected, as @ema mentioned yesterday in -traffic: ` so we've got cp4021 reimaged as Varnish/ATS and it seems to be looking kind-of OK it is howeve...
[06:33:25] (PS1) Marostegui: db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053
[06:34:16] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.982 second response time https://phabricator.wikimedia.org/T174916
[06:34:59] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053 (owner: Marostegui)
[06:35:59] (Merged) jenkins-bot: db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053 (owner: Marostegui)
[06:37:03] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2082 (duration: 00m 52s)
[06:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:43] (CR) jenkins-bot: db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053 (owner: Marostegui)
[06:38:16] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:38:46] !log restart pdfrender on scb1003
[06:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:50] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:39:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916
[06:40:46] gehel, onimisionipe - o/ I can see an alert for "ElasticSearch unassigned shard check - 9243
[06:41:07] in icinga, not sure if super important or not (it is just a warning for the moment)
[06:41:20] !log Optimize tables on pc1010
[06:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:40] elukey: not super urgent, I'll have a look
[06:52:54] super thanks
[06:52:58] elukey: thanks for the ping!
[06:53:13] * gehel is still on the way back from daycare
[06:54:18] ah sorry!
[06:54:23] :(
[06:54:28] didn't check the calendar
[06:55:03] (PS1) Elukey: profile::superset: add libmariadb3 for buster [puppet] - https://gerrit.wikimedia.org/r/506054
[06:59:00] (CR) Muehlenhoff: [C: +1] "Looks good, I upgraded an-tool1004 yesterday to the latest Buster, that's probably when it broke" [puppet] - https://gerrit.wikimedia.org/r/506054 (owner: Elukey)
[07:00:27] (CR) Muehlenhoff: [C: +1] admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) (owner: Dzahn)
[07:02:04] (CR) Muehlenhoff: "I'm not sure this is still valid when the ongoing work is completed to allow wikitech user registration to be opened up again. I'd recomme" [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[07:02:15] (CR) Elukey: [C: +2] profile::superset: add libmariadb3 for buster [puppet] - https://gerrit.wikimedia.org/r/506054 (owner: Elukey)
[07:04:40] Operations, Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez) We need to keep an eye on https://github.com/apache/trafficserver/issues/5084
[07:11:25] (CR) Muehlenhoff: "IIRC we already tested more fine-grained ownerships/permissions for the various HDFS service keytabs that were deployed in the Kerberos pi" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:13:44] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[07:18:58] (CR) Elukey: "> IIRC we already tested more fine-grained ownerships/permissions for" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:20:08] (CR) Elukey: "More info https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/SecureContainer.html" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:23:00] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:25:46] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:26:54] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:30:11] (CR) Alexandros Kosiaris: [V: +2 C: +2] "This is working fine in my tests/benchmarks, merging." [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[07:30:13] (PS4) Alexandros Kosiaris: First version of the kask chart [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401)
[07:31:01] (CR) Alexandros Kosiaris: [V: +2 C: +2] First version of the kask chart [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[07:33:40] (CR) Muehlenhoff: [C: +1] "Ah, indeed, we spoke about LinuxContainerExecutor before, that makes a lot of sense, then." [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:35:18] (PS2) Alexandros Kosiaris: Add a dedicated=kask label to kask nodes [puppet] - https://gerrit.wikimedia.org/r/505832 (https://phabricator.wikimedia.org/T220821)
[07:35:29] (CR) Alexandros Kosiaris: [V: +2 C: +2] Add a dedicated=kask label to kask nodes [puppet] - https://gerrit.wikimedia.org/r/505832 (https://phabricator.wikimedia.org/T220821) (owner: Alexandros Kosiaris)
[07:47:20] (PS1) Muehlenhoff: grub: Remove fallback code for augeas < 1.2 [puppet] - https://gerrit.wikimedia.org/r/506061
[07:50:58] RECOVERY - mediawiki-installation DSH group on mw1297 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[08:01:58] (PS2) Gehel: maps: align tilerator CPU usage across all nodes [puppet] - https://gerrit.wikimedia.org/r/505819
[08:02:43] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[08:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:47] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[08:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:31] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[08:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:35] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[08:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:29] (CR) Gehel: [C: +2] maps: align tilerator CPU usage across all nodes [puppet] - https://gerrit.wikimedia.org/r/505819 (owner: Gehel)
[08:05:59] (PS2) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670)
[08:09:53] (CR) Filippo Giunchedi: [C: +1] codfw decom: halve non-object weights and 2/3rds object weights [software/swift-ring] - https://gerrit.wikimedia.org/r/505888 (https://phabricator.wikimedia.org/T221068) (owner: CDanis)
[08:13:42] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 56.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:14:29] (CR) Mathew.onipe: maps: smooth the tilerator load by reducing cpu assigned to tilerator (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[08:16:55] (CR) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[08:17:33] !log bounce prometheus on bast5001 after migration and backfill
[08:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:54] Operations, Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (MoritzMuehlenhoff) >>! In T220383#5133808, @Vgutierrez wrote: > We need to keep an eye on https://github.com/apache/trafficserver/issues/5084 Buster has OpenSSL 1.1.1b, so this affects ATS as shipped in Buster? Shou...
[08:27:56] Operations, Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez) so I guess it's affected but right now I'm working under the assumption that we will use stretch in the cp nodes, using our own ATS packaging. @ema can confirm that :)
[08:29:09] !log swift eqiad-prod: start decom for ms-be101[45] - T220590
[08:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:14] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590
[08:29:51] (CR) Alexandros Kosiaris: [V: +2 C: +2] Support affinity in all charts [deployment-charts] - https://gerrit.wikimedia.org/r/505185 (owner: Alexandros Kosiaris)
[08:30:13] (CR) Alexandros Kosiaris: [V: +2 C: +2] "Tested this locally, works fine, merging" [deployment-charts] - https://gerrit.wikimedia.org/r/505185 (owner: Alexandros Kosiaris)
[08:35:25] (CR) Filippo Giunchedi: [C: +1] kafka shipper: add ulogd to kafka forwarding rules [puppet] - https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[08:39:32] (PS3) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670)
[08:40:48] (CR) Mathew.onipe: [C: +1] maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[08:40:48] (CR) Filippo Giunchedi: "From reading the related task it looks like we're going to ship the logs as-is and then grok on the logstash side?" [puppet] - https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[08:43:46] (CR) Filippo Giunchedi: [C: +1] "LGTM, for rollout I think we should disable puppet fleetwide and reenable gradually because this change will mean a whole lot more kafka c" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[08:44:00] (PS7) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:44:19] !log rolling restart of Cassandra on restbase/codfw to pick up Java security update
[08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:45] (PS3) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[08:44:47] (PS1) Alexandros Kosiaris: Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075
[08:44:56] (CR) jerkins-bot: [V: -1] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:46:22] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles) itshappening
[08:47:11] (PS8) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:49:31] (PS9) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:51:48] (CR) Alexandros Kosiaris: [V: +2 C: +2] Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075 (owner: Alexandros Kosiaris)
[08:51:54] (PS2) Alexandros Kosiaris: Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075
[08:51:56] (CR) Alexandros Kosiaris: [V: +2 C: +2] Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075 (owner: Alexandros Kosiaris)
[08:51:58] (PS2) Alexandros Kosiaris: mariadb::ferm: Switch ferm::rule => ferm::service [puppet] - https://gerrit.wikimedia.org/r/505831
[08:52:00] (PS1) Alexandros Kosiaris: kafka_cluster_name: ifguard ::labsproject lookup [puppet] - https://gerrit.wikimedia.org/r/506084
[08:52:09] (PS10) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:53:00] (CR) jerkins-bot: [V: -1] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:54:02] (CR) Alexandros Kosiaris: [C: +1] "PCC happy at https://puppet-compiler.wmflabs.org/compiler1002/15969/" [puppet] - https://gerrit.wikimedia.org/r/505831 (owner: Alexandros Kosiaris)
[08:56:34] (PS11) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:57:07] Operations, Traffic: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (ema) Indeed our Varnish mailbox lag Icinga check only applies to Varnish backends, given that backends are those affected by T145661 and similar issues. During the Puppet refactoring splitting front...
[08:58:05] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) @Tarrow, @WMDE-leszek. I 've been working on the termbox helm chart and while the service seems to be up and running...
[08:58:33] (PS1) Ema: cache: move check_varnish_expiry_mailbox_lag to backend profile [puppet] - https://gerrit.wikimedia.org/r/506090 (https://phabricator.wikimedia.org/T145661)
[08:58:38] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/15975/" [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:01:19] (CR) Alexandros Kosiaris: [C: +2] "Fixes the" [puppet] - https://gerrit.wikimedia.org/r/506084 (owner: Alexandros Kosiaris)
[09:01:20] (PS2) Alexandros Kosiaris: kafka_cluster_name: ifguard ::labsproject lookup [puppet] - https://gerrit.wikimedia.org/r/506084
[09:02:34] (PS12) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[09:05:52] (CR) Ema: [C: +1] "Nice! Two nits." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:11:53] (PS4) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[09:11:55] (PS1) Alexandros Kosiaris: cxserver: Fix typo in GC metric name [deployment-charts] - https://gerrit.wikimedia.org/r/506098
[09:13:21] (PS2) Alexandros Kosiaris: cxserver: Fix typo in GC metric name [deployment-charts] - https://gerrit.wikimedia.org/r/506098
[09:13:23] (PS5) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[09:13:25] (PS2) Ema: cache: move check_varnish_expiry_mailbox_lag to backend profile [puppet] - https://gerrit.wikimedia.org/r/506090 (https://phabricator.wikimedia.org/T145661)
[09:14:20] (CR) Alexandros Kosiaris: [V: +2 C: +2] cxserver: Fix typo in GC metric name [deployment-charts] - https://gerrit.wikimedia.org/r/506098 (owner: Alexandros Kosiaris)
[09:15:23] (CR) Ema: [C: +2] cache: move check_varnish_expiry_mailbox_lag to backend profile [puppet] - https://gerrit.wikimedia.org/r/506090 (https://phabricator.wikimedia.org/T145661) (owner: Ema)
[09:16:54] (PS13) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[09:17:08] (CR) Vgutierrez: trafficserver: wrap TLS settings using a type alias (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:17:39] (CR) Ema: [C: +1] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:17:49] (PS6) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[09:17:51] (PS1) Alexandros Kosiaris: Publish the kask chart in the repo [deployment-charts] - https://gerrit.wikimedia.org/r/506104 (https://phabricator.wikimedia.org/T220401)
[09:18:17] (CR) Jcrespo: "I am ok with this, but this is not a noop (even if technically is), I would like to see this config change to single host first:" [puppet] - https://gerrit.wikimedia.org/r/505831 (owner: Alexandros Kosiaris)
[09:18:36] RECOVERY - tools project instance distribution on cloudcontrol1003 is OK: OK: All critical instances are spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:49] (CR) Alexandros Kosiaris: [V: +2 C: +2] Publish the kask chart in the repo [deployment-charts] - https://gerrit.wikimedia.org/r/506104 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[09:22:33] (CR) Vgutierrez: [C: +2] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:22:41] (PS14) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[09:23:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[09:24:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[09:24:21] woot?
[09:24:42] the cr2-eqiad port down might be a transit
[09:25:02] I don't see any scheduled maintenance
[09:25:24] the 500s spike looks gone now
[09:25:38] BGP Session Down: 91.198.174.249 (AS65003)
[09:25:56] this is cr2-eqiad <-> cr2-esams
[09:26:47] upload does not seem affected
[09:27:40] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:27:41] it seems the Level3 link between eqiad and esams
[09:27:52] https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png
[09:28:00] we are probably going through knams now?
[09:29:00] akosiaris: ^ ?
[09:29:50] * akosiaris looking
[09:32:11] (PS4) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670)
[09:32:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[09:32:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[09:32:56] (CR) Gehel: [C: +2] maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[09:34:19] we have some GRE tunnel there as well it seems
[09:34:26] not sure what it is about
[09:34:28] mark ^ ?
[09:35:16] the overall device traffic hasn't fallen
[09:35:50] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 198.1 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[09:35:51] akosiaris: is it the dotted line in https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png between eqiad and esams?
[09:35:52] (CR) Vgutierrez: [C: +1] "LGTM, should we notify wikitech-l regarding this change?" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[09:36:01] (the GRE tunnel I mean)
[09:36:31] elukey: I guess so?
[09:36:37] I don't see traffic over it fwiw
[09:36:54] but graphs say nan
[09:36:58] NaN that is
[09:37:14] so... I am not sure it's actually 0 traffic, sounds more like something not being graphed?
[09:37:16] is traffic going through knams now? (trying to understand)
[09:37:20] me too
[09:37:45] (PS2) Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819)
[09:39:50] elukey: yes it's going through cr2-knams
[09:40:32] yep I am seeing it via librenms, the spike in traffic is nice :D
[09:40:51] should we open a task or something to level3?
[09:40:52] if only we could change the timespan easily
[09:41:05] it's killing me that I have to chase the spike in 6h graphs
[09:41:24] otherwise where would be the joy??
[09:41:30] * elukey runs
[09:41:47] lol
[09:42:31] elukey: I guess we should
[09:43:43] elukey: runbook is at https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down
[09:45:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:45:38] yeah yeah we know :D
[09:45:45] heh, icinga was slow on this one
[09:46:49] (PS17) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[09:47:51] let's open up a task for that
[09:48:07] I see a scheduled maint on May 1st for that link
[09:48:25] with reasoning cheduled maintenance to (Project modifier - troubleshoot and clear network alarms, clean fiber to clear network alarms) in order to prevent future service interruptions to customer services.
[09:48:41] so it maybe they just weren't fast enough
[09:54:20] (PS18) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[09:54:45] (CR) Gilles: [C: +2] Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[09:54:47] (CR) Gilles: [V: +2 C: +2] Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[09:56:06] (PS1) Gilles: Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/506115 (https://phabricator.wikimedia.org/T221562)
[09:56:19] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/15979/" [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[09:56:29] (CR) Gilles: [V: +2 C: +2] Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/506115 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[09:56:38] (CR) Jcrespo: [C: +1] "I just realized this is for the extra_port only, so this can go anytime now." [puppet] - https://gerrit.wikimedia.org/r/505831 (owner: Alexandros Kosiaris)
[09:56:42] akosiaris: if you want I can take care of the netops task while you contact their support
[09:56:50] (creating etc..
as the runbook states) [09:58:14] !log installing rsync security updates on jessie [09:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:37] (03PS1) 10Gilles: Upgrade to 2.5 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/506116 (https://phabricator.wikimedia.org/T221562) [10:00:25] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) [10:00:44] 10Operations, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout while loading Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930 - https://phabricator.wikimedia.org/T140879 (10abian) [10:02:52] 10Operations, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout when loading diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (10abian) [10:03:55] 10Operations, 10Traffic, 10Patch-For-Review: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (10ema) 05Open→03Resolved [10:04:47] 10Operations, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout when loading diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (10abian) [10:10:49] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) [10:12:19] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) Received information for Level3 ` This is to confirm that ticket 16262986 has been created regarding your service. Customer Name: Wikimedia Foundation Billing Account Number: 1-DCG6LL Customer... 
[10:14:04] 10Operations, 10MediaWiki-History-and-Diffs, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout when loading diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (10Epidosis) [10:17:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "More like cleanup. ferm::rule isn't really meant to be used much. It's there for the cases where ferm::service just doesn't cut it." [puppet] - 10https://gerrit.wikimedia.org/r/505831 (owner: 10Alexandros Kosiaris) [10:18:51] (03PS1) 10Ema: prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) [10:19:31] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) p:05Triage→03High [10:19:42] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:21:35] (03CR) 10Gilles: "I've built the package for Buster successfully using this on WMCS." [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/506116 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [10:22:29] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) @Eevans @Clarakosi chart has been merged and is published. The only thing missing before we can move on to... [10:22:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks, merging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/505831 (owner: 10Alexandros Kosiaris) [10:22:56] (03PS3) 10Alexandros Kosiaris: mariadb::ferm: Switch ferm::rule => ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/505831 [10:23:08] !log Restarting php-fpm on mw1238 for 505383 and T211488 [10:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:16] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [10:25:17] (03PS4) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) [10:25:22] (03PS3) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [10:26:02] (03CR) 10Jbond: "updated to incorporate changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/505817" [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:26:05] (03CR) 10jerkins-bot: [V: 04-1] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:27:08] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) @CDanis I can't confirm it completely, but it seems the side effect may have formed. I extracted some su... 
[10:27:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:28:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:28:19] (03CR) 10Muehlenhoff: kafka shipper: move kafka rsyslog shipping to base profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:28:53] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [10:28:54] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed [10:28:54] !log akosiaris@deploy1001 scap-helm cxserver finished [10:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:10] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) @CDanis Thanks your help, but it seems the side effect may have formed. I extracted some subdomains of w... 
[10:30:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:30:36] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:31:02] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:06] (03PS5) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) [10:31:17] (03CR) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:31:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1001.eqiad.wmnet'] ` The... 
[10:31:24] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Ran on db1063 first, looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/505831 (owner: 10Alexandros Kosiaris) [10:31:27] (03PS3) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) [10:31:38] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:42] (03PS4) 10Jbond: ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) [10:32:00] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:32:12] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:32:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:32:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:32:37] seems a single spike [10:32:40] RECOVERY - Check systemd state on ms-be1024 is OK: OK - 
running: The system is fully operational [10:33:10] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:33:14] akosiaris: I think that the interface is flapping [10:33:16] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:33:19] https://librenms.wikimedia.org/eventlog [10:33:26] ifOperStatus: lowerLayerDown -> up [10:33:27] and then [10:33:36] (03Abandoned) 10Jbond: kafka: It was pointed out that kafak shipping may not work for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:33:38] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:33:59] ah no wait seems up now [10:34:12] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:34:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:34:29] so the above mess was traffic shifted back to level3? [10:34:48] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:35:02] (got confused with asw2-a-eqiad) [10:35:45] yep traffic back to the interface [10:35:49] gooood [10:35:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [10:36:31] elukey: yup, looks like it [10:36:36] but it might indeed flap [10:36:42] so, let's keep an eye out for it [10:36:44] (03CR) 10Dr0ptp4kt: "Question / suggestion on test cases." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh) [10:36:58] that weird announcement makes me anxious [10:37:11] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) @jijiki this is all good to go, successfully built on buster.thumbor.eqiad.wmflabs python-thumbor-community-core and thumbor need tiny pa... [10:37:14] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:37:34] akosiaris: where did the email from CenturyLink land? noc or something else? [10:37:36] elukey: lol [10:37:38] Field Operations dispatched and upon arrival to the site determined the fiber near the equipment had been burned. Field Operations are currently working to install a new fiber pair to restore services. [10:37:41] burned? 
[10:37:43] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) [10:37:53] ahahhaha [10:37:55] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) a:05Gilles→03jijiki [10:38:03] elukey: ops-maintenance@ [10:38:13] ahhh I think I am not on it [10:38:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506061 (owner: 10Muehlenhoff) [10:38:35] elukey: you can access it through the google groups interface [10:38:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:39:14] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:40:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:40:23] jynus: ack thanks! [10:40:44] (going afk) [10:41:08] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) New update says: ` Field Operations dispatched and upon arrival to the site determined the fiber near the equipment had been burned. Field Operations are currently working to install a new fibe... 
[10:41:26] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) [10:41:33] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) p:05High→03Low [10:41:59] (03CR) 10Jbond: "> LGTM, for rollout I think we should disable puppet fleetwide and reenable gradually because this change will mean a whole lot more kafka" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:46:20] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:51:09] (03PS9) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [10:51:11] (03PS7) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [10:51:13] (03PS1) 10Jcrespo: mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 [10:51:28] (03PS3) 10Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) [10:51:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... 
[10:52:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 (owner: 10Jcrespo) [10:53:28] (03CR) 10Gilles: "@Ema any chance this could get looked at this quarter?" [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) [10:58:30] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: expose metrics in prometheus format for new docker-registry and create a grafana dashboard - https://phabricator.wikimedia.org/T221099 (10fsero) metrics has been exposed and there is a preliminar grafana dashboard https://grafana.wi... [10:58:41] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [10:58:48] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: expose metrics in prometheus format for new docker-registry and create a grafana dashboard - https://phabricator.wikimedia.org/T221099 (10fsero) 05Open→03Resolved [10:58:51] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [11:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:30] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: create and enable alerting on docker_registry_ha - https://phabricator.wikimedia.org/T221759 (10fsero) [11:02:48] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... [11:03:47] I'm stealing SWAT to do a small ores deployment [11:03:56] akosiaris: FYI ^ [11:04:03] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:04:19] ok [11:07:12] Revision to revert just in case: 8f01d40bfac1c3026472efcaedb70a5df54fa0fb [11:07:22] !log ladsgroup@deploy1001 Started deploy [ores/deploy@060fc37]: (no justification provided) [11:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:05] Amir1: is your swat done? [11:13:16] oh not yet, sorry [11:13:17] not yet [11:14:01] the canary is healthy, let's roll [11:19:28] (03PS1) 10Ladsgroup: nagios: Migrate ores checks from testwiki to fakewiki [puppet] - 10https://gerrit.wikimedia.org/r/506127 (https://phabricator.wikimedia.org/T219930) [11:19:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... [11:20:04] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... 
[11:21:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... [11:22:05] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... [11:22:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... [11:22:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... [11:23:40] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@060fc37]: (no justification provided) (duration: 16m 18s) [11:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:49] jijiki: ^ bam [11:23:50] :D [11:23:54] tx! 
[11:24:55] (03PS4) 10Muehlenhoff: Support upgrades which introduce changes to binary package names [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [11:25:12] !log Restarting php7.2-fpm on mw-canary for 505383 and T211488 [11:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:17] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [11:26:17] PROBLEM - Check systemd state on ms-be1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] nagios: Migrate ores checks from testwiki to fakewiki [puppet] - 10https://gerrit.wikimedia.org/r/506127 (https://phabricator.wikimedia.org/T219930) (owner: 10Ladsgroup) [11:30:08] akosiaris: Thanks! [11:33:55] !log security update ghostscript on scb jessie servers [11:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:45] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:36:01] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 101.2 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:41:22] (03PS4) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [11:45:46] !log restarting relforge for jvm ugprade [11:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:50] moritzm: ^ [11:46:06] ack [11:46:32] 10Operations, 10Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10jcrespo) I think it is better to hardcode the constants on `modules/profile/manifests/mariadb/ferm.pp` (for now, not as an ideal situation) than to go on a multi-file refactoring co... 
[11:47:03] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [11:47:16] (03CR) 10Jcrespo: [C: 04-1] network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [11:48:47] (03CR) 10Jcrespo: [C: 04-1] "Why wasn't I added as reviewer? See phabricator comment." [puppet] - 10https://gerrit.wikimedia.org/r/505406 (owner: 10Alex Monk) [11:49:35] (03PS1) 10Mathew.onipe: maps: add pgpass file [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) [11:51:16] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10jijiki) p:05Triage→03Unbreak! [11:52:39] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10jijiki) [11:53:34] (03PS1) 10Ladsgroup: icinga: change dashboard uid of ores to the new dashboard [puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) [11:55:56] (03CR) 10Mathew.onipe: "PCC output looks good: https://puppet-compiler.wmflabs.org/compiler1002/15980/" [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:56:53] (03PS2) 10Gehel: maps: add pgpass file [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:56:57] (03CR) 10Ladsgroup: "Is this correct?" 
[puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) (owner: 10Ladsgroup) [11:58:09] (03CR) 10Gehel: [C: 03+2] maps: add pgpass file [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:59:36] !log Restarting php7.2-fpm on mw12* for 505383 and T211488 [11:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:41] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [12:07:53] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [12:09:17] RECOVERY - Check systemd state on ms-be1031 is OK: OK - running: The system is fully operational [12:09:58] (03CR) 10Ladsgroup: [C: 04-1] [DNM] Rename JADE to Jade (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480284 (https://phabricator.wikimedia.org/T212182) (owner: 10Awight) [12:12:01] (03CR) 10Jbond: "LGTM, some minor comments" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [12:14:24] (03PS1) 10Urbanecm: Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) [12:17:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: split profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) [12:17:14] (03PS1) 10ArielGlenn: allow the display of only index or column differences for db table checker [software] - 10https://gerrit.wikimedia.org/r/506136 [12:17:16] (03PS1) 10ArielGlenn: script to show section/dbhost info by asking mediawiki for it [software] - 10https://gerrit.wikimedia.org/r/506137 [12:18:06] (03CR) 10jerkins-bot: [V: 04-1] allow the display of 
only index or column differences for db table checker [software] - 10https://gerrit.wikimedia.org/r/506136 (owner: 10ArielGlenn) [12:18:11] (03CR) 10jerkins-bot: [V: 04-1] script to show section/dbhost info by asking mediawiki for it [software] - 10https://gerrit.wikimedia.org/r/506137 (owner: 10ArielGlenn) [12:20:45] (03PS2) 10Ema: prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) [12:22:11] (03CR) 10Ema: [C: 03+2] prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:22:34] (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: split profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) [12:23:56] !log rolling restart of Cassandra on restbase/eqiad to pick up Java security update [12:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:29] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:30:17] (03PS3) 10Arturo Borrero Gonzalez: openstack: clientpackages: split profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) [12:36:27] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [12:36:46] !log restarting pdfrender on scb1004 [12:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:20] (03PS1) 10Ema: Revert "cache: define ATS nodes in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/506141 (https://phabricator.wikimedia.org/T213263) [12:38:53] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [12:42:51] godog: FYI there's been a couple of alerts related to ms-be1013 (I see T220590, perhaps some decom steps missing)? [12:42:51] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [12:44:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC for toolforge: https://puppet-compiler.wmflabs.org/compiler1001/15982/" [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [12:44:28] (03CR) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [12:44:36] (03PS5) 10Muehlenhoff: Support upgrades which introduce changes to binary package names [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [12:48:05] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [12:51:42] (03PS2) 10Ema: Revert "cache: define ATS nodes in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/506141 (https://phabricator.wikimedia.org/T213263) [12:52:36] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: vms: fix assert for realm [puppet] 
- 10https://gerrit.wikimedia.org/r/506146 (https://phabricator.wikimedia.org/T220051) [12:52:45] (03CR) 10Ema: [C: 03+2] Revert "cache: define ATS nodes in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/506141 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [12:53:44] (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: vms: fix assert for realm [puppet] - 10https://gerrit.wikimedia.org/r/506146 (https://phabricator.wikimedia.org/T220051) [12:54:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: clientpackages: vms: fix assert for realm [puppet] - 10https://gerrit.wikimedia.org/r/506146 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [13:00:57] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.142: Connection reset by peer [13:01:21] !log Restarting php7.2-fpm on mw13* for 505383 and T211488 [13:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:27] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [13:15:46] ema: thanks I'll take a look! [13:17:46] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10jijiki) [13:26:17] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 Effie Mouzeli https://phabricator.wikimedia.org/T215411 - The acknowledgement expires at: 2019-05-25 13:25:49. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:30:51] (03PS1) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [13:34:07] (03PS1) 10Matthias Mullie: Allow cross-site requests from Commons' mobile domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) [13:37:44] !log Poweroff db2080 for onsite maintenance - T216240 [13:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:24] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 [13:38:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "We might have to rename the extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505816 (https://phabricator.wikimedia.org/T221651) (owner: 10Lucas Werkmeister (WMDE)) [13:38:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "We might have to rename the extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505813 (https://phabricator.wikimedia.org/T221650) (owner: 10Lucas Werkmeister (WMDE)) [13:42:02] (03PS1) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [13:42:15] (03CR) 10Filippo Giunchedi: "Not a puppet/hiera expert so can't really say, LGTM to my untrained eye tho" [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [13:42:59] (03CR) 10jerkins-bot: [V: 04-1] cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:48:32] (03PS2) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [13:50:31] (03CR) 
10Jforrester: "Wow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) (owner: 10Matthias Mullie) [13:54:05] (03CR) 10Alex Monk: "While here we should probably add anything else missing from the list" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) (owner: 10Matthias Mullie) [13:54:13] (03CR) 10CDanis: [V: 03+2 C: 03+2] codfw decom: halve non-object weights and 2/3rds object weights [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505888 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [13:54:42] (03CR) 10Fsero: "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/15986/registry1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [13:54:53] (03CR) 10Fsero: [C: 03+2] registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [13:55:11] (03PS2) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [13:59:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "That's indeed correct Amir, I'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) (owner: 10Ladsgroup) [13:59:18] (03PS2) 10Filippo Giunchedi: icinga: change dashboard uid of ores to the new dashboard [puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) (owner: 10Ladsgroup) [13:59:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [14:00:07] (03PS3) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [14:02:58] (03PS3) 
10Mathew.onipe: Add maps postgres init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:03:22] (03CR) 10Mathew.onipe: Add maps postgres init cookbook (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:04:41] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) 05Open→03Resolved a:03akosiaris CenturyLink sent a summary and a notification they 'll close the issues as resolved on their end. ` Summary: On April 24, 2019 at 9:21 GMT, CenturyLink ide... [14:05:03] elukey: https://phabricator.wikimedia.org/T221758. Resolved. [14:05:35] (03PS5) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [14:05:53] (03PS3) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [14:06:19] (03PS4) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [14:07:47] (03CR) 10Jbond: [C: 03+2] puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [14:08:26] (03PS4) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [14:11:06] akosiaris: thanks! 
[14:17:10] 10Operations, 10User-fgiunchedi: Upgrade jessie hosts to rsyslog 8.1901.0-1 - https://phabricator.wikimedia.org/T219764 (10fgiunchedi) [14:18:45] (03CR) 10Ema: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1001/15994/" [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:19:01] (03PS1) 10Vgutierrez: trafficserver: Provide support for incoming TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:19:39] (03PS1) 10Fsero: registryha: better description for health [puppet] - 10https://gerrit.wikimedia.org/r/506161 (https://phabricator.wikimedia.org/T221759) [14:20:37] (03PS2) 10Fsero: registryha: better description for health [puppet] - 10https://gerrit.wikimedia.org/r/506161 (https://phabricator.wikimedia.org/T221759) [14:21:22] (03CR) 10Fsero: [C: 03+2] registryha: better description for health [puppet] - 10https://gerrit.wikimedia.org/r/506161 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [14:22:12] (03PS2) 10Vgutierrez: trafficserver: Provide support for incoming TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:24:03] (03CR) 10Vgutierrez: "PCC looks happy for existing cp-ats hosts, resulting almost in a NOOP: https://puppet-compiler.wmflabs.org/compiler1001/15996/" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:28:31] !log begin rollout of rsyslog 8.1901.0-1 to jessie hosts - T219764 [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] T219764: Upgrade jessie hosts to rsyslog 8.1901.0-1 - https://phabricator.wikimedia.org/T219764 [14:29:47] (03CR) 10Ema: "A couple of comments." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:29:58] 2 != 3 [14:29:59] error!
[14:30:12] :) [14:30:43] thx for the review <3 [14:30:55] (03PS1) 10Ottomata: Add eventgate-main chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) [14:33:37] (03PS1) 10Jbond: Hiera backend: update the hiera configuration to remove the role backend [puppet] - 10https://gerrit.wikimedia.org/r/506167 [14:34:47] (03CR) 10Vgutierrez: [C: 03+1] cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:38:03] PROBLEM - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Docker [14:38:05] (03CR) 1020after4: [C: 03+1] Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (owner: 10Aklapper) [14:39:50] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (10Papaul) a:05Papaul→03Marostegui complete [14:40:03] ACKNOWLEDGEMENT - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.069 second response time Fsero this is expected because swift replication is kind of slow and has not finished.
https://wikitech.wikimedia.org/wiki/Docker [14:40:07] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [14:41:09] (03PS2) 1020after4: Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [14:41:15] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [14:41:38] !log restart pdfrender on scb1002 [14:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:14] (03PS3) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:43:53] PROBLEM - swift-container-updater on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [14:43:55] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer [14:44:04] (03PS4) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:44:22] (03CR) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:44:24] (03PS4) 10Herron: phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) [14:44:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS 
- https://phabricator.wikimedia.org/T208263 (10BBlack) @Cwek - Thanks for the reports! Have you tried other Wikimedia projects (e.g. wikiversity, wikiquote,... [14:45:01] (03CR) 10Jbond: [C: 03+1] "LGTM" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [14:45:07] RECOVERY - swift-container-updater on ms-be2024 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [14:45:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2024 is OK: OK ferm input default policy is set [14:45:39] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2037 - https://phabricator.wikimedia.org/T221512 (10Papaul) a:05Papaul→03Marostegui Complete [14:45:57] (03CR) 10Eric Gardner: [C: 03+1] Allow cross-site requests from Commons' mobile domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) (owner: 10Matthias Mullie) [14:47:03] (03CR) 10Herron: [C: 03+2] phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: 10Herron) [14:47:15] (03CR) 10Fsero: "syntactically looks good, however, how this chart relates to the eventgate-analytics one? it seems there is some duplication between both." [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:47:55] (03PS5) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [14:48:56] (03CR) 10Ema: [C: 03+2] cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:49:48] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:50:02] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 (10Papaul) 05Open→03Resolved Complete [14:50:21] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 (10Papaul) [14:51:06] PROBLEM - Docker registry health on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry1002.eqiad.wmnet:443/debug/health - 366 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Docker [14:51:48] (03CR) 10Ottomata: "Yeah there's a lot of duplication. -analytics and -main are two separate deployments with different destination Kafka clusters, serving d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:54:48] (03CR) 10Vgutierrez: "pcc is still happy after all the renaming: https://puppet-compiler.wmflabs.org/compiler1002/15998/" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:55:24] 10Operations, 10Analytics, 10Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (10Ottomata) [14:55:29] 10Operations, 10Analytics, 10Analytics-Cluster: Remove Hadoop configs and unmount /mnt/hdfs from unused backup hosts (furud, +) - https://phabricator.wikimedia.org/T221629 (10Ottomata) 05Open→03Declined Oh, actually, /mnt/hdfs is not puppetized. It was leftover from when it was. I just removed it from f... 
[14:55:51] ACKNOWLEDGEMENT - Docker registry health on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry1002.eqiad.wmnet:443/debug/health - 366 bytes in 0.010 second response time Fsero /debug/health listens on 5001 and not on 443.. tried to acknowledge this before alerting but icinga UI is so fast Sigh https://wikitech.wikimedia.org/wiki/Docker [14:58:32] PROBLEM - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2001.codfw.wmnet:443/debug/health - 366 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Docker [14:58:37] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) a:05Ottomata→03None [14:59:33] (03PS2) 10Herron: lvs: switch kibana scheduler to source hash [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) [15:00:31] !log switching kibana lvs to source hash scheduler [15:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:58] (03CR) 10Herron: [C: 03+2] lvs: switch kibana scheduler to source hash [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) (owner: 10Herron) [15:01:14] RECOVERY - Device not healthy -SMART- on db2047 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw+prometheus/ops [15:01:47] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Cmjohnson) 05Open→03Resolved All the disks were securely wiped and the server reset to defaults [15:02:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission labvirt101[01].eqiad.wmnet (Dec 2018 lease return) - https://phabricator.wikimedia.org/T210735 (10Cmjohnson) [15:02:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission labvirt101[01].eqiad.wmnet (Dec 2018 lease return) - https://phabricator.wikimedia.org/T210735 (10Cmjohnson) 05Open→03Resolved All the disks were securely wiped and the server reset to defaults [15:04:10] PROBLEM - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2002.codfw.wmnet:443/debug/health - 366 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Docker [15:09:38] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) @BBlack You can read this [[ https://zh.wikipedia.org/wiki/Help:%E5%A6%82%E4%BD%95%E8%AE%BF%E9%97%AE%E7%B...
[15:10:28] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:10:50] (03PS1) 10Fsero: registryha: registry health check was querying wrong port [puppet] - 10https://gerrit.wikimedia.org/r/506170 (https://phabricator.wikimedia.org/T221759) [15:11:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Support upgrades which introduce changes to binary package names [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [15:12:41] ACKNOWLEDGEMENT - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2001.codfw.wmnet:443/debug/health - 366 bytes in 0.151 second response time Fsero known problem, see registry1002 ack https://wikitech.wikimedia.org/wiki/Docker [15:12:41] ACKNOWLEDGEMENT - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2002.codfw.wmnet:443/debug/health - 366 bytes in 0.154 second response time Fsero known problem, see registry1002 ack https://wikitech.wikimedia.org/wiki/Docker [15:15:15] (03CR) 10Fsero: [C: 03+2] registryha: registry health check was querying wrong port [puppet] - 10https://gerrit.wikimedia.org/r/506170 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:15:37] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10Cmjohnson) [15:15:54] (03PS1) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (client side) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506171 [15:18:44] 10Operations, 10cloud-services-team (Kanban): Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10herron) [15:18:45] (03PS10) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) 
[15:18:48] (03PS8) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [15:18:50] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10herron) [15:18:51] (03PS2) 10Jcrespo: mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 [15:19:13] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10ayounsi) a:03ayounsi [15:20:59] (03PS3) 10Ema: cache: distinguish between Varnish and ATS nodes [puppet] - 10https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) [15:24:59] 10Operations, 10ops-codfw, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Papaul) Before: BIOS Version 2.4.3 Firmware Version 2.40.40.40 IP Address(es) 10.193.1.75 iDRAC MAC Address 84:7B:EB:F6:99:B2 DNS Domain Name Lifecyc... 
[15:25:54] (03PS1) 10Fsero: registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) [15:26:21] (03CR) 10jerkins-bot: [V: 04-1] registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:27:32] (03CR) 10Niedzielski: [C: 03+1] Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [15:27:48] (03PS2) 10Matthias Mullie: Allow cross-site requests from mobile domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) [15:28:16] (03PS2) 10Fsero: registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) [15:28:19] (03CR) 10Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/16000/" [puppet] - 10https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [15:29:03] (03CR) 10jerkins-bot: [V: 04-1] registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:30:39] 10Operations, 10Wikimedia-Logstash: logstash stuck on its persistent queue - https://phabricator.wikimedia.org/T212640 (10herron) 05Open→03Resolved a:03herron I think it's safe to resolve this now since we're on logstash 5.6.15, and have disabled the logstash persistent queue. 
[15:31:18] (03CR) 10WMDE-leszek: First draft of a wikibase-termbox chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) (owner: 10Alexandros Kosiaris) [15:32:03] !log Restarting php7.2-fpm on mw2* in codfw for 505383 and T211488 [15:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:08] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [15:33:02] 10Operations, 10ops-codfw, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) Thanks @Papaul I am rebooting the server a few times to confirm it is indeed solved! [15:33:41] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10herron) Is there anything left to do before closing this? [15:34:45] (03PS1) 10Ema: debdeploy: make filter_services default to empty hash [puppet] - 10https://gerrit.wikimedia.org/r/506176 [15:35:16] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decom rigel.frack.codfw.wmnet - https://phabricator.wikimedia.org/T202535 (10Papaul) papaul@fasw-c-codfw> show interfaces descriptions | match "ge-[0-1]/0/14" ge-0/0/14 down down DISABLED ge-1/0/14 down... 
[15:35:20] (03PS3) 10Fsero: registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) [15:37:03] (03CR) 10Fsero: [C: 03+2] registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:38:06] (03PS3) 10Jcrespo: mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 [15:38:10] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10crusnov) I forget where but in digging about this it seems that Puppet will return 503 if it is too busy, there are numerous reports of this (to be clear I don't know if it's puppet itself or an... [15:40:37] (03PS2) 10Ema: debdeploy: make filter_services default to empty hash [puppet] - 10https://gerrit.wikimedia.org/r/506176 [15:40:39] (03PS1) 10Ema: cumin: add ATS production hosts to aliases [puppet] - 10https://gerrit.wikimedia.org/r/506177 (https://phabricator.wikimedia.org/T219967) [15:40:51] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10Papaul) [15:43:34] 10Operations, 10Puppet, 10Icinga, 10monitoring: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (10ema) [15:43:38] (03CR) 10Alexandros Kosiaris: "Haven't seen the chart, I 'll have a look tomorrow." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [15:43:46] 10Operations, 10Puppet, 10Icinga, 10monitoring: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (10ema) p:05Triage→03Normal [15:45:02] 10Operations, 10ops-ulsfo: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) p:05Triage→03Normal [15:45:08] PROBLEM - swift-account-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.83: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:50:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [15:51:14] (03CR) 10Jcrespo: [C: 03+2] mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 (owner: 10Jcrespo) [15:52:36] !log performing rolling restart of pybal on low-traffic eqiad/codfw lvs hosts [15:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:22] (03PS1) 10Fsero: registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) [15:55:35] (03CR) 10jerkins-bot: [V: 04-1] registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:57:11] (03PS2) 10Fsero: registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) [15:57:35] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Kibana breaks during rolling upgrade - https://phabricator.wikimedia.org/T221143 (10herron) 05Open→03Resolved a:03herron The Kibana lvs has been updated to use the 
source hash scheduler [15:58:50] (03CR) 10Fsero: [C: 03+2] registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:59:49] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2037 - https://phabricator.wikimedia.org/T221512 (10Marostegui) 05Open→03Resolved Thanks! All good! ` root@db2037:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380312088E0) Port Name: 1I Port Name:... [16:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T1600). Please do the needful. [16:00:04] kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:27] i'm here [16:02:36] 10Operations, 10ops-codfw, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [16:02:47] (03PS1) 10Elukey: profile::analytics::database::meta: add properties to my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) [16:04:16] 10Operations, 10cloud-services-team (Kanban): cumin: leaked aliases - https://phabricator.wikimedia.org/T221788 (10aborrero) [16:09:33] (03CR) 10Marostegui: [C: 03+1] "Remember you will need the table to have ROW_FORMAT=DYNAMIC as shown during the earlier discussion." 
[puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [16:09:48] I'll do the SWAT, sorry for the delay [16:10:08] RECOVERY - swift-account-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [16:10:33] (03CR) 10Elukey: "> Remember you will need the table to have ROW_FORMAT=DYNAMIC as" [puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [16:10:44] RECOVERY - Device not healthy -SMART- on db2037 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2037&var-datasource=codfw+prometheus/ops [16:16:37] RECOVERY - Docker registry health on registry1002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [16:17:07] RECOVERY - Docker registry health on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [16:17:33] (03PS11) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [16:17:35] (03PS9) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [16:17:37] (03PS1) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [16:18:19] (03CR) 10Jcrespo: "First version, needs more work on the previous steps still, and probably more improvements I may be missing now." 
[puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [16:19:41] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2658 MB (5% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:19:46] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) ` papaul@asw-b-codfw> show interfaces ge-6/0/4 descriptions Interface Admin Link Description ge-6/0/4 up up db2020 papau... [16:19:56] (03PS2) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [16:20:59] (03CR) 10Jcrespo: "This will also need a better distribution later so we minimize simultaneous backups on the same server (target or source)." [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [16:23:43] kostajh: Patch is live on mwdebug1002, please test there insofar as possible [16:23:53] RoanKattouw: checking [16:30:18] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [16:32:14] RoanKattouw: it looks OK, although there's another issue now (not caused by this patch) [16:32:22] OK syncing this for now then [16:33:52] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/GrowthExperiments/: Fix exceptions in Homepage logging (duration: 00m 56s) [16:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:11] (03PS2) 10Elukey: profile::analytics::database::meta: add properties to my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) [16:37:38] (03CR) 10Ottomata: "Hm, ok.
Was worried there would be potential for bugs in one causing errors in the other, but I'll give that a go. I'd prefer DRY charts" [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [16:40:07] RECOVERY - Docker registry health on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [16:40:15] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer [16:46:04] (03PS1) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [16:55:13] PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:55:47] (03CR) 10Lucas Werkmeister (WMDE): "The phabricator task also mentions an alias for the talk namespace – I don’t see that being added in this change…?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [16:56:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506176 (owner: 10Ema) [16:58:08] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) >>! In T220402#5133994, @akosiaris wrote: > @Tarrow, @WMDE-leszek. I 've been working on the termbox helm chart and w... 
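Marostegui's review note above ("the table will need ROW_FORMAT=DYNAMIC") relates to the my.cnf properties under review in change 506179. On pre-10.2 MariaDB, DYNAMIC row format additionally requires the Barracuda InnoDB file format. A hedged sketch of the kind of settings involved (illustrative fragment, not the contents of the actual patch):

```ini
# Illustrative my.cnf fragment -- NOT the contents of Gerrit change 506179.
# On MariaDB before 10.2, ROW_FORMAT=DYNAMIC tables need:
[mysqld]
innodb_file_per_table = 1
innodb_file_format    = Barracuda   # required for DYNAMIC/COMPRESSED rows
innodb_large_prefix   = 1           # allows index key prefixes beyond 767 bytes
```

From MariaDB 10.2 onward, DYNAMIC is the default row format and `innodb_file_format` is deprecated, so the exact properties depend on the server version in use.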
[16:58:26] (03PS2) 10Dzahn: admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) [16:59:47] (03CR) 10Dzahn: [C: 03+2] admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [17:02:01] 10Operations, 10ops-codfw: find horizontal PDUs in codfw - https://phabricator.wikimedia.org/T221153 (10Papaul) a:05Papaul→03RobH {F28756699} [17:02:15] (03CR) 10Jbond: "compiler suggests this is a noop (which is what i would expected)" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [17:04:20] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [17:04:28] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [17:09:25] (03PS2) 10Urbanecm: Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) [17:10:23] (03CR) 10Urbanecm: "Thanks for catching that Lucas, fixed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [17:12:40] (03PS12) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:13:11] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:17:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "no problem :) it looks like I’ll be deploying this tomorrow btw, I have two other changes in SWAT" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [17:22:43] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [17:23:46] !log proton1001 - restarting proton service - low RAM caused facter/puppet fails (https://tickets.puppetlabs.com/browse/PUP-8048) freed memory and fixed puppet run (cc: T219456 T214975) [17:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:54] T219456: Add ram to Proton* - https://phabricator.wikimedia.org/T219456 [17:23:54] T214975: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 [17:26:35] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:28:50] ACKNOWLEDGEMENT - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T219854 [17:29:09] RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational [17:32:00] (03PS13) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:33:53] !log contint1001 - apt-get clean for 1% more disk space [17:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) Icinga alerting again: contint1001 - Disk space CRITICAL 2019-04-24 17:29:39 0d 1h 10m 17s 3/3... [17:35:03] (03PS14) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:35:05] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10mobrovac) p:05Unbreak!→03High [17:35:42] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:36:28] (03PS15) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:37:33] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2649 MB (5% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:38:03] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: 
Connection reset by peer [17:39:06] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) p:05Normal→03High [17:39:23] PROBLEM - swift-object-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:27] PROBLEM - swift-object-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:31] PROBLEM - swift-account-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:31] PROBLEM - dhclient process on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:39:35] PROBLEM - swift-container-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:39] PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:39] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:39:45] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:39:47] PROBLEM - swift-account-reaper on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:53] PROBLEM - MD RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect 
to 10.192.16.161: Connection reset by peer [17:39:53] PROBLEM - swift-container-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:57] PROBLEM - Disk space on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:39:57] PROBLEM - DPKG on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:40:09] PROBLEM - Check size of conntrack table on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:40:11] PROBLEM - swift-container-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:13] PROBLEM - swift-object-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:23] PROBLEM - configured eth on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:40:27] PROBLEM - swift-account-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:31] PROBLEM - swift-object-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:33] PROBLEM - swift-account-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:45] PROBLEM - swift-container-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Swift [17:41:05] probably nagios-nrpe-server crashed.. looking [17:41:05] RECOVERY - MD RAID on ms-be2019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:41:05] RECOVERY - swift-container-updater on ms-be2019 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [17:41:08] eh, ok [17:41:09] RECOVERY - Disk space on ms-be2019 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:41:09] RECOVERY - DPKG on ms-be2019 is OK: All packages OK [17:41:21] RECOVERY - Check size of conntrack table on ms-be2019 is OK: OK: nf_conntrack is 7 % full [17:41:23] RECOVERY - swift-container-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [17:41:25] RECOVERY - swift-object-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [17:41:31] noisy though [17:41:35] RECOVERY - configured eth on ms-be2019 is OK: OK - interfaces up [17:41:39] RECOVERY - swift-account-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [17:41:41] yea, noisy for being just a single server [17:41:41] RECOVERY - swift-object-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [17:41:45] RECOVERY - swift-account-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [17:41:51] did not restart anything [17:41:53] RECOVERY - swift-object-server on ms-be2019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
https://wikitech.wikimedia.org/wiki/Swift [17:41:57] RECOVERY - swift-container-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [17:41:57] RECOVERY - swift-object-auditor on ms-be2019 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [17:41:59] RECOVERY - swift-account-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [17:41:59] RECOVERY - dhclient process on ms-be2019 is OK: PROCS OK: 0 processes with command name dhclient [17:42:05] RECOVERY - swift-container-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [17:42:09] RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 27.53, 29.85, 28.51 https://wikitech.wikimedia.org/wiki/Swift [17:42:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2019 is OK: OK ferm input default policy is set [17:42:17] RECOVERY - swift-account-reaper on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [17:42:23] that is how it looks when nagios-nrpe gets killed due to OOM [17:42:35] Yah i'm familiar :) [17:42:41] but this time i did nothing to fix it [17:42:55] the swift backend machines in both codfw and eqiad are busy right now, lots of data moving around because of decomming some hosts in the cluster [17:43:06] ack, gotcha [17:43:21] I am overdue for lunch but I'll spend some time looking if there's an easy way to rate-limit replication [17:43:56] I have to run too, but +1 to what cdanis said [17:44:54] alright, no rush. 
enjoy lunch & dinner [17:44:57] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:47:16] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10Bstorm) [17:47:41] thanks mutante ! [17:48:15] but yeah from past experience the cluster can freak out a little right after a rebalancing has begun and then settles [17:49:29] gotta go! [17:50:18] yep! laters [17:51:55] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:52:38] !log contint1001 - for logfile in $(find /var/log/zuul/ ! -name "*.gz"); do gzip $logfile; done to get more disk space (T207707) [17:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:44] T207707: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 [17:57:14] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) p:05High→03Normal gzipping all files in /var/log/zuul that were not already gzipped saved almost... [18:03:00] (03CR) 10CDanis: "FWIW I feel incompetent to review this change; well beyond my Puppet knowledge.
Sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [18:10:31] PROBLEM - swift-object-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:10:31] PROBLEM - swift-container-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:10:31] PROBLEM - swift-container-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:11:35] PROBLEM - DPKG on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:11:51] PROBLEM - swift-container-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:11:59] PROBLEM - swift-account-reaper on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:05] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:12:15] PROBLEM - Disk space on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [18:12:17] PROBLEM - MD RAID on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:12:21] PROBLEM - swift-account-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:23] PROBLEM - swift-account-server on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Swift [18:12:37] PROBLEM - swift-object-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:37] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:47] PROBLEM - configured eth on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:12:53] PROBLEM - swift-account-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:55] PROBLEM - dhclient process on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:13:01] PROBLEM - Check size of conntrack table on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:13:03] PROBLEM - swift-object-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:03] PROBLEM - swift-container-server on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:09] PROBLEM - swift-container-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:34] yea.. 
now that's known [18:13:43] PROBLEM - swift-object-server on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:43] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:14:21] PROBLEM - HP RAID on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:14:57] PROBLEM - MD RAID on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:14:59] PROBLEM - swift-account-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:15:29] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:15:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:15:39] PROBLEM - swift-object-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:15:47] PROBLEM - swift-container-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:15:47] PROBLEM - swift-container-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:16:19] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2031 is OK: OK ferm input default policy is set [18:16:21] RECOVERY - swift-object-server on ms-be2031 is OK: PROCS OK: 101 processes with regex args 
^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [18:16:21] RECOVERY - swift-account-server on ms-be2031 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [18:16:27] RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 35.97, 47.61, 45.85 https://wikitech.wikimedia.org/wiki/Swift [18:16:27] RECOVERY - swift-object-updater on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [18:16:39] RECOVERY - configured eth on ms-be2031 is OK: OK - interfaces up [18:16:45] RECOVERY - swift-account-auditor on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [18:16:45] RECOVERY - dhclient process on ms-be2031 is OK: PROCS OK: 0 processes with command name dhclient [18:16:47] RECOVERY - DPKG on ms-be2031 is OK: All packages OK [18:16:51] RECOVERY - Check size of conntrack table on ms-be2031 is OK: OK: nf_conntrack is 3 % full [18:16:51] RECOVERY - swift-object-replicator on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [18:16:51] RECOVERY - swift-container-server on ms-be2031 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-container-replicator on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-object-auditor on ms-be2031 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-container-updater on ms-be2031 
is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-container-auditor on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [18:17:09] RECOVERY - swift-account-reaper on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [18:17:15] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational [18:17:25] RECOVERY - Disk space on ms-be2031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [18:17:26] !log sudo icinga-downtime -h ms-be2031 -r swift-rebalancing -d 86400 [18:17:27] RECOVERY - MD RAID on ms-be2031 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:31] RECOVERY - swift-account-replicator on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [18:20:41] RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:20:47] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:33] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) Hi @wiki_willy your shell account has been created. You should be able to ssh to the following hosts: - bastion hosts, to jump to other hosts in the internal netw... [18:25:00] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10wiki_willy) Thanks @Dzahn , much appreciated. 
~Willy [18:25:55] (03CR) 10Herron: [C: 03+1] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:30:19] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational [18:37:31] (03PS17) 10CRusnov: Port MakeVM to a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [18:40:24] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [18:41:43] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) @wiki_willy You're welcome. I also just added you to root@ mail, mainly because then you receive noc@ mail which is an alias for it. Prepare for a _little_ more mail... [18:43:09] (03CR) 10Herron: "> > Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:44:19] (03CR) 10Herron: [C: 03+1] kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:45:05] (03CR) 10Herron: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [18:45:49] !log mw1297 - scap pull [18:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1297.eqiad.wmnet,cluster=api_appserver [18:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:22] !log pooled mw1297 as a new API server (T192457) [18:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:28] T192457: Reallocate former image scalers - 
https://phabricator.wikimedia.org/T192457 [18:47:49] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:49:52] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [18:49:56] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) 05Open→03Stalled Ticket is done besides one check box and that is T215332 unless a different server is used, making sure in T215332#5133171. [18:50:43] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:51:00] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:55:51] 10Operations, 10cloud-services-team (Kanban): cumin: leaked aliases - https://phabricator.wikimedia.org/T221788 (10Dzahn) almost duplicate of T221125 [18:59:54] hi, I've got the following notification and hope to find help here: Puppet is failing to run on the "wikimedia-ui.design.eqiad.wmflabs" instance in Wikimedia Cloud VPS. [18:59:58] :) [19:01:31] 10Operations, 10ops-codfw: wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Dzahn) History of this host: * wtp2019 - hardware (RAM) check (T146113) * wtp2019 has faulty memory (T146009) * wtp2019 issues an uncorrectable memory error (T148710) * wtp2019.... [19:02:27] Volker_E: on the host there should be a /var/log/puppet.log with additional info as to why it failed to run [19:08:29] (03CR) 10Cwhite: "This change looks like a step in the right direction, but I don't see where $_role comes from. 
Is it a datapoint added by the role() func" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [19:09:40] (03CR) 10Cwhite: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [19:11:33] PROBLEM - swift-account-server on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:11] PROBLEM - MD RAID on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:12:33] PROBLEM - swift-account-replicator on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:47] PROBLEM - Disk space on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:12:49] PROBLEM - swift-container-replicator on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:53] PROBLEM - swift-account-reaper on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:53] PROBLEM - swift-container-auditor on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:01] PROBLEM - swift-container-updater on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:07] PROBLEM - very high load average likely xfs on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:09] 
PROBLEM - configured eth on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:13:23] PROBLEM - Check size of conntrack table on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:13:23] PROBLEM - DPKG on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:13:25] PROBLEM - swift-account-auditor on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:31] RECOVERY - MD RAID on ms-be2033 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:13:47] RECOVERY - swift-account-replicator on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [19:13:57] RECOVERY - Disk space on ms-be2033 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:14:01] RECOVERY - swift-container-replicator on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [19:14:05] RECOVERY - swift-account-reaper on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [19:14:05] RECOVERY - swift-container-auditor on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [19:14:05] RECOVERY - swift-account-server on ms-be2033 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [19:14:13] RECOVERY - swift-container-updater on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift 
[19:14:19] RECOVERY - very high load average likely xfs on ms-be2033 is OK: OK - load average: 56.72, 56.58, 50.93 https://wikitech.wikimedia.org/wiki/Swift [19:14:21] RECOVERY - configured eth on ms-be2033 is OK: OK - interfaces up [19:14:35] RECOVERY - Check size of conntrack table on ms-be2033 is OK: OK: nf_conntrack is 3 % full [19:14:35] RECOVERY - DPKG on ms-be2033 is OK: All packages OK [19:14:37] RECOVERY - swift-account-auditor on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [19:20:06] (03CR) 10CRusnov: [C: 03+2] Port MakeVM to a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [19:22:31] (03CR) 10Jforrester: [C: 03+1] "Eurgh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: 10Matthias Mullie) [19:23:39] (03CR) 10Reedy: Allow cross-site requests from mobile domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: 10Matthias Mullie) [19:32:47] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-6/0/17 descriptions Interface Admin Link Description ge-6/0/17 down down DISABLED... [19:34:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) [19:39:58] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [19:40:37] mutante, shall I write a post? [19:41:22] Krenair: yes please :) [19:42:44] mutante, do we have a particular date/time in mind? [19:43:29] uhm.. 
no.. should we merge before sending the post? [19:44:05] (03PS1) 10ArielGlenn: allow section, list of dbs or list of wikis stand alone as arg [software] - 10https://gerrit.wikimedia.org/r/506225 [19:44:13] Krenair: or we can put it on deployment calendar. what do you think? [19:44:53] (03CR) 10jerkins-bot: [V: 04-1] allow section, list of dbs or list of wikis stand alone as arg [software] - 10https://gerrit.wikimedia.org/r/506225 (owner: 10ArielGlenn) [19:45:08] i could add it to a puppet swat window.. just so that there is a scheduled time [19:50:06] (03CR) 10Brian Wolff: [C: 03+1] Allow cross-site requests from mobile domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: 10Matthias Mullie) [19:55:20] mutante, not sure. [19:57:29] !log mobrovac@deploy1001 Started deploy [restbase/deploy@8a6b6fc] (dev-cluster): Switch Parsoid stashing to simple key/value [19:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:37] mutante, realistically this probably breaks old versions of IE [19:57:46] and a bunch of unsupported other stuff [20:00:04] cscott, arlolra, subbu, bearND, and halfak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T2000). [20:01:47] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@8a6b6fc] (dev-cluster): Switch Parsoid stashing to simple key/value (duration: 04m 18s) [20:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:51] https://etherpad.wikimedia.org/p/g505410announce [20:10:54] thanks Krenair! [20:18:17] Krenair: thank you for the etherpad. i will look at getting it on the calendar [20:19:57] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
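The "Check systemd state" alert on ms-be1037 above keys off `systemctl is-system-running`, which reports `degraded` when one or more units have failed. A rough sketch of how that output maps to the Icinga messages seen in this log (the `state_summary` helper is hypothetical; the message strings are taken from the alerts themselves):

```shell
# Hypothetical mapping from `systemctl is-system-running` output to the
# Icinga status lines that appear in this log.
state_summary() {
  case $1 in
    running)  echo 'OK - running: The system is fully operational' ;;
    degraded) echo 'CRITICAL - degraded: The system is operational but one or more units failed' ;;
    *)        echo "UNKNOWN - $1" ;;
  esac
}

state_summary degraded
```

On the affected host, `systemctl --failed` would then list which unit actually failed.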
[20:21:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@8a6b6fc]: Parsoid storage simplification step 1: switch Parsoid stashing to simple key/value - T215956 [20:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:08] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [20:32:47] (03PS1) 10Sbisson: Cleanup old EchoCrossWikiBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) [20:35:11] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) [20:35:27] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) a:05Bstorm→03None [20:35:58] Krenair: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1824248&oldid=1824244 [20:40:18] (03CR) 10Jforrester: [C: 03+1] "Looks good." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [20:40:48] (03CR) 10Dzahn: [C: 03+1] "added to Deployment calendar in the Puppet SWAT section tomorrow: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revisi" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [20:41:22] thanks mutante [20:41:41] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@8a6b6fc]: Parsoid storage simplification step 1: switch Parsoid stashing to simple key/value - T215956 (duration: 20m 39s) [20:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:46] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [20:44:32] Krenair: feel free to send the mail to wikitech-l and i reply with the calendar link or just add it. still making sure there are no concerns in -releng [20:44:53] am waiting for -releng too [20:44:53] but "likely shortly" is true [20:44:56] ok [20:46:05] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational [20:56:50] (03PS1) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [21:07:11] (03PS1) 10CDanis: swift-object-replicator: nice & ionice it [puppet] - 10https://gerrit.wikimedia.org/r/506321 [21:09:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:11:15] (03PS1) 10CDanis: standard_packages: add iotop [puppet] - 10https://gerrit.wikimedia.org/r/506322 [21:14:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 25 probes of 409 (alerts on 35) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:19:21] (03CR) 10CDanis: [C: 03+2] standard_packages: add iotop [puppet] - 10https://gerrit.wikimedia.org/r/506322 (owner: 10CDanis) [21:20:09] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:21:01] RECOVERY - HP RAID on ms-be2031 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [21:23:35] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:24:07] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 74, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:26:13] XioNoX: cr1-eqsin - interface down. type: Peering: Equinix Singapore (WIKIMEDIA-SG1-IX-00, MAC filter) . i am not sure i can identify who from looking at https://netbox.wikimedia.org/circuits/providers/ ticket worthy? [21:27:52] well.. 
and it recovered [21:28:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 29 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:29:24] (03PS2) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [21:31:50] !log icinga-downtime -h ms-be2038 -r swift-rebalancing -d 86400 [21:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16003/" [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [21:46:45] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) We are indeed using service-runner; I don't think this provides /_info or /?spec though? Or are we missing something? 
[21:49:03] XioNoX: mr1-ulsfo wil blip ] [21:49:05] it just power cycled [21:51:05] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host bast4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:35] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [21:52:17] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [21:52:25] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:25] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:30] oh [21:52:55] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:55] PROBLEM - Host dns4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:58] oic [21:53:08] robh: sounds expected then.. 
pheew [21:53:09] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:53:33] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:53:33] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:53:33] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:53:47] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [21:53:49] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.20 ms [21:54:29] RECOVERY - Host cp4030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.21 ms [21:55:37] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [21:56:29] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.01 ms [21:56:29] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.79 ms [21:56:29] RECOVERY - Host cp4022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [21:56:29] RECOVERY - Host bast4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [21:56:29] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [21:56:32] that was a bit more than just mr1? [21:56:33] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.10 ms [21:56:33] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.24 ms [21:56:33] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.89 ms [21:56:47] (03PS6) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [21:56:53] oh right, because those hosts' mgmt interfaces would be linked via mr1 [21:57:29] there are more details in the -dcops channel. they are working on power [21:57:36] ripe-atlas-ulsfo too? 
[21:57:41] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.29 ms [21:57:46] yes, all of them (the regular hosts) were mgmt [21:57:49] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.87 ms [21:58:19] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.86 ms [21:58:19] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.99 ms [21:58:33] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 73.05 ms [21:58:57] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.81 ms [21:58:57] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.08 ms [21:58:57] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.78 ms [21:59:23] mgmt goes down when mr1 goes down [21:59:26] but its ok and expected [22:00:23] yeah I made that mental link shortly afterwards, what about ripe-atlas though? [22:00:26] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [22:01:39] ripe atlas is just a ping destination device [22:01:49] and is single power supply so normalizing power on racks means it gets unplugged some [22:01:55] but it comes back on its own with no interference [22:02:17] basically today we're normalizing and labeling every single power input/cable in the racks into netbox [22:02:23] ah [22:02:30] so it involved moving power plugs around in the rack [22:02:38] so server X uses port 2 on both A and B towers, etc... 
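robh's explanation boils down to: during the PDU re-cabling, a device rides through work on one power tower only if it has a second feed on the other tower, which is why the dual-fed cp servers stayed up while mr1, ripe-atlas, and the mgmt switches blipped. A toy illustration of that rule (the function and feed counts are made up for the example):

```shell
# Toy model of the dual- vs single-feed behaviour described above:
# with 2+ independent feeds a device survives losing one tower,
# with a single feed it goes down.
survives_tower_work() {
  local feeds=$1   # number of independent power feeds
  if [ "$feeds" -ge 2 ]; then echo up; else echo down; fi
}

survives_tower_work 2   # dual-feed server, e.g. a cp40xx host -> up
survives_tower_work 1   # single-feed device, e.g. mr1 or ripe-atlas -> down
```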
[22:02:50] for 99% of the stuff, its dual power feeds so no one notices [22:03:03] but mr1, atlas, and mgmt switches are single power supply fed, so they go down for this [22:03:08] yeah [22:04:26] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) https://docs.google.com/spreadsheets/d/13XPw-PyFqUUO5oqeljQvwpN9N7aWT5Yyii00GJrkeaE/edit?usp=sharing has all the power connections documented, I'll import that into netbox shortly [22:07:57] sorry about the channel spam =] [22:09:01] (03PS3) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [22:09:51] no worries I was just curious is all [22:09:55] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10jijiki) @Dzahn I need to to talk with our team before I green light this, also mentioned in T221132. Is it Possible to revisit this in a week from now? Thank you! 
[22:10:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 41 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:10:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:11:24] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10faidon) [22:12:33] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 49 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:13:17] (03CR) 10Dzahn: [C: 03+2] varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [22:13:25] (03PS7) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [22:13:54] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) [22:13:59] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (10jijiki) 05Open→03Resolved a:03jijiki @Joe @Krinkle, since we have pushed enable_dl => 0 to production, I am resolving this. Feel free to reop... 
[22:14:30] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) [22:14:43] 10Operations, 10ops-ulsfo: Degraded RAID on cp4032 - https://phabricator.wikimedia.org/T219586 (10RobH) 05Open→03Invalid not sure why that check made this task, as the output in the task description shows a perfectly fine raid. checked the system manually as well, no errors in systlog either [22:15:21] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:15:39] 10Operations, 10ops-ulsfo: Degraded RAID on cp4032 - https://phabricator.wikimedia.org/T219586 (10Dzahn) It looks like the cause wasn't an actual RAID failure but a networking or DNS failure: ` connect to address 10.128.0.132 port 5666: No route to host ` [22:18:29] PROBLEM - swift-object-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.83: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:18:56] !log icinga-downtime -h ms-be2039 -r swift-rebalancing -d 86400 [22:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:29] !log deploying varnish/trafficserver change to cover www.wikiba.se (not prod yet) [22:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:35] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
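The `icinga-downtime` invocations logged during the swift rebalancing all share one shape: `-h HOST -r REASON -d SECONDS`. A tiny wrapper that reproduces the exact command line from the log (`icinga-downtime` itself is the WMF helper script on the Icinga host and is not defined here; only the flag order is taken from the log):

```shell
# Build the downtime command as logged above; -d defaults to 86400s (1 day),
# the duration used for the swift-rebalancing downtimes in this log.
downtime_cmd() {
  local host=$1 reason=$2 seconds=${3:-86400}
  printf 'icinga-downtime -h %s -r %s -d %s\n' "$host" "$reason" "$seconds"
}

downtime_cmd ms-be2039 swift-rebalancing
# -> icinga-downtime -h ms-be2039 -r swift-rebalancing -d 86400
```

This only formats the command; actually silencing a host still requires running the real script with Icinga access.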
[22:21:49] PROBLEM - Docker registry HTTPS interface on darmstadtium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [22:22:29] RECOVERY - swift-object-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [22:23:01] RECOVERY - Docker registry HTTPS interface on darmstadtium is OK: HTTP OK: HTTP/1.1 200 OK - 2482 bytes in 0.656 second response time https://wikitech.wikimedia.org/wiki/Docker [22:23:09] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:23:17] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) @jijiki Yes, of course it can wait, i just realized again the holiday situation. [22:26:15] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 15 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:26:20] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) We have pushed https://gerrit.wikimedia.org/r/502986 and (its update) https://gerrit.wikimed... [22:31:03] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10ayounsi) Thanks, sounds good! There is nothing special to do, make sure to reserve it in DNS, eg. 208.80.155.119/2620:0:861:... 
[22:31:30] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) [22:36:56] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) Great! I'll sort that out, then. [22:37:08] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) a:03Bstorm [22:44:45] PROBLEM - HP RAID on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:15] PROBLEM - swift-account-reaper on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:17] PROBLEM - swift-object-updater on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:21] PROBLEM - swift-object-auditor on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:21] PROBLEM - swift-account-replicator on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:27] PROBLEM - Check size of conntrack table on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:31] PROBLEM - configured eth on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:35] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:39] 
PROBLEM - swift-object-server on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:47] PROBLEM - very high load average likely xfs on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:49] PROBLEM - swift-account-auditor on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:57] PROBLEM - swift-container-server on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:46:01] PROBLEM - swift-container-auditor on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:46:11] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.16.82: Connection reset by peer [22:46:27] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) >>! In T155359#5108338, @Dzahn wrote: > Next is deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/500715 for T99531#50771... 
[22:46:56] !log icinga-downtime -h ms-be2034 -r swift-rebalancing -d 86400 [22:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:54] 10Operations, 10ops-ulsfo: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) updated all but the atlas, which has no serial connection to query for serial number [22:49:41] RECOVERY - very high load average likely xfs on ms-be2034 is OK: OK - load average: 39.99, 47.31, 45.74 https://wikitech.wikimedia.org/wiki/Swift [22:49:45] RECOVERY - swift-account-auditor on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [22:49:49] RECOVERY - swift-container-server on ms-be2034 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [22:49:53] RECOVERY - swift-container-auditor on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [22:50:25] RECOVERY - swift-object-updater on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [22:50:25] RECOVERY - swift-account-reaper on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [22:50:29] RECOVERY - swift-account-replicator on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [22:50:29] RECOVERY - swift-object-auditor on ms-be2034 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [22:50:37] RECOVERY - Check size of conntrack table on ms-be2034 is OK: OK: nf_conntrack is 4 % full [22:50:41] RECOVERY - configured eth on 
ms-be2034 is OK: OK - interfaces up [22:50:45] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2034 is OK: OK ferm input default policy is set [22:50:49] RECOVERY - swift-object-server on ms-be2034 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [22:54:41] (03PS1) 10Bstorm: cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) [22:55:01] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) (owner: 10Bstorm) [22:55:59] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [22:57:41] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [23:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T2300). [23:00:04] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:15] I'm here [23:00:23] Guess I can do it [23:00:29] cool [23:01:29] it's a cherry-pick for a maintenance script, so shouldn't make any trouble [23:03:04] Who will run it? 
[23:03:11] I will [23:04:00] (03PS2) 10Bstorm: cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) [23:06:13] (03CR) 10Bstorm: "Just checking my approach. This is how we do it on tools NFS, but this is public." [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) (owner: 10Bstorm) [23:08:04] (03CR) 10Cwhite: [C: 03+1] "Looks good to me. I hope Giuseppe will give more context around the prior attempts." [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [23:09:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:09:19] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:09:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:39] 10Operations, 10ops-ulsfo, 10netops: Interface errors on cr4-ulsfo:et-0/0/1 - https://phabricator.wikimedia.org/T205937 (10RobH) 05Open→03Resolved a:03RobH [23:10:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:10:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:11:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [23:13:15] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:13:28] (03PS4) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [23:17:11] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:17:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:18:36] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) Hi @tramm Any update on this from your side? [23:20:26] @robh I see recoveries - are we good to deploy? [23:23:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:23:45] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:23:57] Anyone? [23:24:51] MaxSem: i dont know the details of the ongoing work but i saw the other channel and yes i see it all recovered on the graphs. 
so afaict, yes [23:25:15] Cool, thanks [23:26:04] (03PS1) 10Bstorm: toolforge: iotop was added to standard_packages, removing from exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/506329 [23:27:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [23:27:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:27:47] (03CR) 10Bstorm: [C: 03+2] toolforge: iotop was added to standard_packages, removing from exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/506329 (owner: 10Bstorm) [23:27:57] (03CR) 10Dzahn: "caused a duplicate declaration on tool labs -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/506329" [puppet] - 10https://gerrit.wikimedia.org/r/506322 (owner: 10CDanis) [23:28:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:30:20] 23:27:19 sync-file failed: Command 'find -O2 '/srv/mediawiki-staging/php-1.34.0-wmf.1/extensions/CirrusSearch' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 125 [23:32:25] hmm [23:32:29] what's that [23:33:01] merge conflict? [23:34:51] Don't see conflict tokens anywhere [23:35:19] hmm [23:35:40] MaxSem: does it say which file it has trouble with? 
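A note on the `exit status 125` in the sync-file failure above: GNU xargs reserves its high exit codes for distinct failure modes, so 125 already narrows the diagnosis before any debugging. Per xargs(1), 125 means one of the spawned commands was killed by a signal, whereas an invocation that merely exited nonzero would surface as 123. A minimal, self-contained sketch of that convention (plain POSIX sh under GNU xargs; `lint-stub` is just a placeholder $0 for the child shell, not anything from the log):

```shell
# GNU xargs exit-status conventions (xargs(1)):
#   123  some invocation exited with status 1-125
#   124  an invocation exited with status 255
#   125  an invocation was killed by a signal
#   126  command found but not executable; 127 command not found
# Simulate a child that dies from SIGABRT (signal 6), as a crashing linter
# would, and observe xargs' own exit status:
echo dummy | xargs -n1 sh -c 'kill -ABRT "$$"' lint-stub 2>/dev/null
echo "xargs exit status: $?"
```

On GNU xargs this prints `xargs exit status: 125`; an ordinary failing lint run (command exiting 1) would instead yield 123, which is why 125 points at a crashed child rather than a plain syntax error.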
[23:35:57] Nah, xargs seems to swallow that :O [23:37:09] weird, it's a pretty simple patch only touching 2 php files [23:37:58] I ran it manually and got this: xargs: php: terminated by signal 6 [23:38:04] this is weird [23:38:57] MaxSem: this seems to also happen on old code, pre-patch [23:39:22] i.e. if I run it on mwmaint1002 now, it doesn't have the patch and exit code still 125 [23:39:59] in fact php-1.33.0-wmf.25 produces the same [23:40:05] is it some kind of new check? [23:40:37] HHVM vs. PHP7 shenanigans? [23:40:57] find -O2 '/srv/mediawiki-staging/php-1.34.0-wmf.1/extensions/CirrusSearch' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php7.2 -l | grep -v 'No syntax' [23:40:57] no idea, but definitely not new for this patch... [23:41:05] ^ runs without problems [23:41:40] yeah signal 6 is SIGABRT... I have no idea why hhvm is doing that [23:41:49] let me try to find which file it is [23:43:10] For Satan's sake... [23:43:27] (03CR) 10Bstorm: "Ok, this passes the puppet compiler for various projects at this point. The question I need to check on is whether a value for $mode or $" [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [23:43:42] (03PS5) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [23:43:47] HHVM just sets exit code and doesn't output anything in case of syntax errors... [23:44:15] Because why TF be user friendly? [23:44:53] Or not? [23:46:23] I wonder if this has something to do with -P [23:46:56] Nah, I forgot that everything that doesn't have nope produces signal 6 even without parallel [23:51:19] MaxSem: i didn't have anything to do with those =] [23:51:32] Can you prove that? [23:51:54] can you disprove? ;D [23:53:48] well this is definitely weird... with xargs signal 6 is reproducible...
but I can [23:53:56] I can not cause it without xargs [23:55:17] yeah when I run them without xargs every file passes with code 0 [23:55:57] 6877 Aborted (core dumped) [23:56:19] oh cool what causes this? [23:56:35] That's a harder question XD [23:56:57] well you should have core, right? [23:57:02] so that might give some clues maybe [23:57:54] (03CR) 10CDanis: "Sorry for the breakage when I modified standard_packages! I didn't even know this could happen :\" [puppet] - 10https://gerrit.wikimedia.org/r/506329 (owner: 10Bstorm)
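A postscript on the "xargs seems to swallow that" problem above: running the linter one file at a time, outside xargs, lets the shell report which path triggered the abort, since a signal-killed child shows up as status 128+signo (134 for SIGABRT). A self-contained sketch of that approach, with everything stubbed: `lint` stands in for `php -l` (it aborts on files containing BAD, mimicking HHVM's crash on a syntax error) and a temp dir stands in for the CirrusSearch tree; for the real case one would keep the scap command's find expression and call the real interpreter.

```shell
# Lint one file at a time so the path that kills the linter is printed
# instead of being swallowed by `xargs -P30`.
set -u
tmp=$(mktemp -d)
printf 'fine\n' > "$tmp/good.php"
printf 'BAD\n'  > "$tmp/bad.php"

lint() {
    # Subshell, so the simulated SIGABRT kills only the "linter".
    ( if grep -q BAD "$1"; then sh -c 'kill -ABRT "$$"'; fi )
}

for f in "$tmp"/*.php; do
    lint "$f" >/dev/null 2>&1
    status=$?
    # A signal-killed child is reported as 128+signo: SIGABRT shows as 134.
    [ "$status" -ne 0 ] && echo "lint failed (status $status): ${f##*/}"
done
rm -rf "$tmp"
```

With the real linter, a status of 134 fingers the file that makes HHVM abort, while an ordinary parse error would show up as a small nonzero status and could be reported the same way.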