[00:09:58] Operations, ops-eqiad, Cloud-VPS, decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (Papaul) ` apaul@asw2-b-eqiad# show | compare [edit interfaces] - ge-4/0/8 { - description iron; - }
[00:10:20] Operations, ops-eqiad, Cloud-VPS, decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (Papaul)
[00:13:38] (PS1) Papaul: DNS: Remove mgmt DNS for iron [dns] - https://gerrit.wikimedia.org/r/542635
[00:20:47] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for iron [dns] - https://gerrit.wikimedia.org/r/542635 (owner: Papaul)
[00:22:10] Operations, ops-eqiad, Cloud-VPS, decommission, Patch-For-Review: Decommission iron - https://phabricator.wikimedia.org/T220505 (Papaul)
[00:22:33] Operations, ops-eqiad, Cloud-VPS, decommission, Patch-For-Review: Decommission iron - https://phabricator.wikimedia.org/T220505 (Papaul) Open→Resolved Complete
[00:31:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[00:37:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 23 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[01:52:55] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:12:13] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 55 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:14:09] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:16:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:16:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 65 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:19:03] (PS1) Krinkle: Remove unused wmgReduceStartupExpiry logic in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/542641 (https://phabricator.wikimedia.org/T235314)
[02:19:05] (PS1) Krinkle: Remove wmgReduceStartupExpiry (no longer used) [mediawiki-config] - https://gerrit.wikimedia.org/r/542642 (https://phabricator.wikimedia.org/T235314)
[02:21:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 25 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:21:59] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 23 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:23:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 464 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:36:23] (PS1) Niharika29: Auto-set global prefs for email-blacklist and echo-notifications-blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/542644
[02:39:38] (CR) Niharika29: [C: +2] "Patch for labs only." [mediawiki-config] - https://gerrit.wikimedia.org/r/542644 (owner: Niharika29)
[02:40:28] (Merged) jenkins-bot: Auto-set global prefs for email-blacklist and echo-notifications-blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/542644 (owner: Niharika29)
[02:49:31] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 30486840 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:59:17] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2104 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:30:05] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:40:43] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:12:44] (PS1) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[04:13:28] (CR) jerkins-bot: [V: -1] [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[04:14:01] (PS2) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[04:14:42] (CR) jerkins-bot: [V: -1] [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[04:14:58] (PS3) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[04:15:36] (CR) jerkins-bot: [V: -1] [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[04:15:40] (PS4) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[04:16:21] (CR) jerkins-bot: [V: -1] [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[04:21:47] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:58] (PS3) Krinkle: Drop HHVM XHProf and Arclamp code, no longer called [mediawiki-config] - https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester)
[04:23:09] (CR) Krinkle: [C: +1] "Updated to retain some of the comments only present in the original copy." [mediawiki-config] - https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester)
[04:24:24] (PS4) Krinkle: Drop HHVM XHProf and Arclamp code, no longer called [mediawiki-config] - https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester)
[04:25:45] (CR) Krinkle: "While the code only conditionally does something, it is included from PhpAutoPrepend on all web requests. Splitting up so that there is no" [mediawiki-config] - https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester)
[04:26:30] (CR) Krinkle: [C: +2] Drop HHVM XHProf and Arclamp code, no longer called [mediawiki-config] - https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester)
[04:27:18] (Merged) jenkins-bot: Drop HHVM XHProf and Arclamp code, no longer called [mediawiki-config] - https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: Jforrester)
[04:29:13] * Krinkle staging X-Wikimedia-Debug fix on mwdebug1002
[04:37:30] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: 29d846938c898dd (duration: 00m 57s)
[04:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:30] (PS5) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[04:38:45] (CR) jerkins-bot: [V: -1] [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[04:39:37] (PS6) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[04:40:26] (CR) jerkins-bot: [V: -1] [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[05:08:13] (PS7) Krinkle: [WIP] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[05:47:45] (PS1) Ammarpad: Add custom Minerva wordmark for Hebrew wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278)
[05:50:42] (CR) Ammarpad: "I don't know how these width and height figures are calculated, so I use the generic wikivoyage values as default" [mediawiki-config] - https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: Ammarpad)
[06:24:18] (CR) Masumrezarock100: [C: +1] "I just hope its dimension is correct. Rest of the code looks OK to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: Ammarpad)
[08:57:33] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:43] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:11:21] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:23:14] Operations, ops-eqiad, Cloud-VPS, decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (Peachey88)
[11:41:29] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:45:47] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (mobrovac) We have to resolve the same problem here to the one we encountered in Beta. Namely, both php-fpm and parsoid services use port 8000 to listen to incoming reque...
[11:52:07] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:18:24] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) @mobrovac Yes, i agree. Making 2 new LVS and DNS services, one parsoid-php and one parsoid-js and then switching first from old parsoid to parsoid-js seems like t...
[15:08:31] Operations, SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (Aklapper) @jkumalah: Please use `ssh -v`, `ssh -vv`, or `ssh -vvv` to get verbose debug output to potentially identify the underlying issue.
[15:15:27] Operations, Phabricator, Release-Engineering-Team-TODO, Traffic, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Aklapper) @BMueller: Could you answer the last question, please?
[15:32:27] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 220.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[15:39:00] that's interesting
[15:39:19] ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace
[15:45:51] Operations, Wikimedia-production-error: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (CDanis)
[15:50:43] Operations, Core Platform Team, MediaWiki-API, Wikimedia-production-error: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Reedy)
[15:52:09] Operations, Core Platform Team, MediaWiki-API, Wikimedia-production-error: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Reedy)
[15:54:23] ty Reedy
[15:54:36] I'm guessing this is mostly someone has started hammering a query that's broken
[15:54:56] yeah that was my impression as well; at a quick glance a lot of the same URLs were repeated
[15:55:06] the interval also suggests that
[15:58:11] cdanis: I blame https://github.com/wikimedia/mediawiki/commit/a8525d7201dada88f3117142fe1919d0b9b80d4e
[15:59:05] Operations, Core Platform Team, MediaWiki-API, Wikimedia-production-error: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Reedy) I think https://g...
[15:59:38] Operations, Core Platform Team, MediaWiki-API, MW-1.34-release, Wikimedia-production-error: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Reed...
[15:59:47] cdanis: Question is whether it's spammy enough to backport a revert of that to shut it up
[16:00:14] 240K errors in 4 hours suggests maybe so
[16:00:27] it's worse than that, it's 240K errors in about 1.5 hours
[16:00:36] ah, ok
[16:00:38] Let's do that
[16:02:04] Operations, Core Platform Team, MediaWiki-API, MW-1.34-release, and 2 others: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Reedy) Reverting out...
[16:06:01] Operations, Release-Engineering-Team, Scap, Wikimedia-General-or-Unknown: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (Reedy) Current implementation: `lang=html

Currently active MediaWiki versions:
Operations, Core Platform Team, MediaWiki-API, MW-1.34-release, and 2 others: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Umherirrender) Alread...
[16:15:36] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.1/includes/api/ApiQueryBacklinksprop.php: T235334 (duration: 00m 56s)
[16:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:43] T235334: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334
[16:16:42] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: T235334 (duration: 00m 51s)
[16:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:36] cdanis: That should shut it up
[16:43:58] (CR) Krinkle: [C: +1] Document Apache gzip sidestepping [puppet] - https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: Gilles)
[16:44:34] Operations, observability, serviceops, Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (Krinkle)
[16:47:24] Operations, Core Platform Team, MediaWiki-API, MW-1.34-release, and 2 others: ErrorException from line 596 of /srv/mediawiki/php-1.35.0-wmf.1/includes/api/ApiQueryBase.php: PHP Notice: Undefined property: stdClass::$page_namespace - https://phabricator.wikimedia.org/T235334 (Krinkle)
[16:49:37] Operations, Commons, Multimedia, media-storage, User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (AlexisJazz) https://commons.wikimedia.org/wiki/File:Busan_tower_by_night.jpg First four revisions missing!
[16:50:33] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[16:53:19] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle) Removed a few unrelated tasks from the tree that need to happen after this, but aren't part of the sam...
[16:55:54] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[16:59:50] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[16:59:59] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[17:33:03] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 104.2 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[18:00:10] (CR) Thiemo Kreuz (WMDE): "Uh, allow me to question this:" [dns] - https://gerrit.wikimedia.org/r/521966 (owner: Jforrester)
[18:27:37] PROBLEM - LVS HTTP IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:29:31] * godog shakes fist at zotero
[18:29:58] I'm taking a look
[18:30:51] <_joe_> godog: I would bet it's the usual memleak
[18:31:03] on my phone far from my laptop, let me know if there is anything network related
[18:31:34] <_joe_> godog: you would need to go kill the pods
[18:31:54] _joe_: I wouldn't be suprised if it was memleak indeed, what's the easiest way to confirm/deny ?
[18:32:11] and/or kill the pods ?
[18:32:25] <_joe_> for killing the pods https://wikitech.wikimedia.org/wiki/Mathoid#Restarting_all_pods_for_the_service
[18:32:37] <_joe_> just use zotero instead of mathoid
[18:33:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[18:33:32] <_joe_> lemme et to a computer
[18:33:40] found https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?orgId=1&from=now-3h&to=now so indeed seems memory
[18:33:47] I'll kill the pods now, thanks _joe_
[18:33:59] RECOVERY - LVS HTTP IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:34:09] <_joe_> uh that was fast :P
[18:34:16] or not? didn't touch anything :(
[18:34:58] <_joe_> https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?orgId=1&from=now-1h&to=now
[18:35:01] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:35:07] <_joe_> yeah just got a less hogged zotero pod
[18:35:15] <_joe_> ok lemme see what's going on
[18:36:13] _joe_: I'm going to hold off then for now
[18:36:24] <_joe_> godog: what I am doing is looking at grafana
[18:36:34] <_joe_> seeing which pods are high on memory usage
[18:36:38] <_joe_> and go to delete them only
[18:37:53] ok
[18:38:05] <_joe_> !log deleting zotero pods with excessive memory usage in eqiad
[18:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:37] <_joe_> ok crisis averted I think
[18:42:44] <_joe_> I ended up deleting a few pods
[18:43:29] <_joe_> interestingly the pods that went crazy were the only ones who were never restarted
[18:45:03] hah! restarted automatically by k8s ?
[18:46:33] looks like we're back and can go afk again
[18:49:07] <_joe_> yes
[18:49:26] <_joe_> the way to restart pods in a Deployment is to delete the running ones
[18:49:34] <_joe_> k8s just starts new ones
[18:50:43] Or a rollout restart command. Not sure if it's supported in our version of k8s
[18:56:02] <_joe_> I don't think that's supported by any current version. If you want a rolling restart the best way is to bump your deployment version and run helm
[18:56:23] <_joe_> a rolling restart can be easily created with some bash / awk.
[18:57:51] two questions: 1) if some pods were okay, why did the service address fail? is the liveness check not working? 2) should zotero page?
[18:58:38] It's supported in 1.16
[19:00:26] With kubectl rollout restart deployment -n
[19:02:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 30.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:04:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 55.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:06:13] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 99.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:07:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 83.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:17:08] <_joe_> cdanis: 1) it's a known problem that the liveness check is not effective, and there is IIRC a request to upstream for a non-POST url 2) definitely, it breaks functionality
[20:12:48] (CR) Masumrezarock100: [C: +1] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: Ammarpad)
[20:57:03] !log Reset user email of User:Gardini (T235318)
[20:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:59] Operations, Phabricator, Release-Engineering-Team-TODO, Traffic, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Bmueller) Hey @CDanis, sorry that I missed your question! (thanks for the ping, @AKlapper :-) >>! In T2260...
[20:58:03] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1546970425[3](2019-10-09T14:42:44.498Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:07:25] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.1/includes/resourceloader/ResourceLoaderStartUpModule.php: 8c6baeae2 (duration: 00m 53s)
[21:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:30] (CR) Krinkle: [C: -1] "Keeping as redirect, like we do for the root m.wikipedia.org, seems preferred." [dns] - https://gerrit.wikimedia.org/r/521966 (owner: Jforrester)
[21:57:49] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:57:51] PROBLEM - cassandra-b service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:58:47] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:58:59] PROBLEM - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[22:10:13] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:55] RECOVERY - cassandra-b service on restbase2017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:11:59] RECOVERY - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-b valid until 2020-11-29 09:26:18 +0000 (expires in 413 days) https://phabricator.wikimedia.org/T120662
[22:12:27] RECOVERY - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.122 port 9042 https://phabricator.wikimedia.org/T93886
[22:15:47] Operations, media-storage, serviceops, Patch-For-Review, User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (CDanis) Is this still an issue?
[23:11:40] (PS1) Krinkle: profiler.php: Remove trigger_error call [mediawiki-config] - https://gerrit.wikimedia.org/r/542732 (https://phabricator.wikimedia.org/T231564)
[23:14:33] (CR) Krinkle: [C: +2] profiler.php: Remove trigger_error call [mediawiki-config] - https://gerrit.wikimedia.org/r/542732 (https://phabricator.wikimedia.org/T231564) (owner: Krinkle)
[23:15:20] (Merged) jenkins-bot: profiler.php: Remove trigger_error call [mediawiki-config] - https://gerrit.wikimedia.org/r/542732 (https://phabricator.wikimedia.org/T231564) (owner: Krinkle)
[23:17:02] Operations, Release-Engineering-Team, serviceops, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle)
[23:19:06] Operations, Release-Engineering-Team, serviceops, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle) Tagging RelEng for awareness as this means whe...
[23:19:24] Operations, Release-Engineering-Team, serviceops, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle) p:Low→Normal
[23:21:02] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: bfa8bb69c1f, T231564 (duration: 00m 51s)
[23:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:06] T231564: profiler.php: PHP Notice: RedisException: Connection timed out - https://phabricator.wikimedia.org/T231564