[00:00:05] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T0000). [00:00:34] (03CR) 10jerkins-bot: [V: 04-1] add generic interface to metrics gathering [software/service-checker] - 10https://gerrit.wikimedia.org/r/532807 (owner: 10Cwhite) [00:00:41] (03PS8) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [00:00:43] (03PS5) 10Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 [00:00:45] (03PS2) 10Jforrester: Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 [00:00:47] (03PS1) 10Jforrester: tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 [00:02:23] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [00:02:42] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 (owner: 10Jforrester) [00:03:07] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 (owner: 10Jforrester) [00:03:15] (03CR) 10jerkins-bot: [V: 04-1] tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 (owner: 10Jforrester) [00:04:51] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10srishakatux) 
[00:07:06] (03PS2) 10Jforrester: tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 [00:07:08] (03PS1) 10Jforrester: tests: Re-try to restore globals in dbConfigTests just in case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 [00:08:13] (03CR) 10jerkins-bot: [V: 04-1] tests: Re-try to restore globals in dbConfigTests just in case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (owner: 10Jforrester) [00:08:15] (03CR) 10jerkins-bot: [V: 04-1] tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 (owner: 10Jforrester) [00:10:33] (03PS3) 10Jforrester: tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 [00:10:35] (03PS2) 10Jforrester: tests: Skip all dbConfigTests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 [00:11:28] (03CR) 10jerkins-bot: [V: 04-1] tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 (owner: 10Jforrester) [00:12:06] (03PS3) 10Jforrester: tests: Skip all dbConfigTests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 [00:12:08] (03PS4) 10Jforrester: tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 [00:14:19] (03Abandoned) 10Jforrester: tests: Skip all CirrusSearch tests when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535972 (owner: 10Jforrester) [00:14:44] (03PS5) 10Cwhite: add generic interface to metrics gathering [software/service-checker] - 10https://gerrit.wikimedia.org/r/532807 [00:15:33] (03PS4) 10Jforrester: tests: Skip dbConfigTest:testSectionLoadsInHostsbyname when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 [00:16:27] (03CR) 10jerkins-bot: [V: 04-1] tests: Skip 
dbConfigTest:testSectionLoadsInHostsbyname when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (owner: 10Jforrester) [00:18:09] (03PS5) 10Jforrester: tests: Skip dbConfigTest::testDbAssignedToAnExistingCluster when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 [00:19:22] (03CR) 10jerkins-bot: [V: 04-1] tests: Skip dbConfigTest::testDbAssignedToAnExistingCluster when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (owner: 10Jforrester) [00:20:13] (03PS6) 10Jforrester: tests: Skip all dbConfigTest testsr when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (https://phabricator.wikimedia.org/T232691) [00:21:57] (03CR) 10Jforrester: [C: 03+2] tests: Skip all dbConfigTest testsr when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [00:22:55] (03Merged) 10jenkins-bot: tests: Skip all dbConfigTest testsr when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [00:23:10] (03CR) 10jenkins-bot: tests: Skip all dbConfigTest testsr when not on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535973 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [00:23:28] (03PS9) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [00:25:42] (03PS6) 10Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 [00:26:09] (03PS3) 10Jforrester: Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 [00:26:30] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) 
[00:27:24] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [00:27:45] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [00:28:39] (03CR) 10Jforrester: "In particular, given that for T232691 I've had to disable some of the tests on HHVM, we should fix that before proceeding. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [00:28:41] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [00:29:08] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [00:30:17] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [00:30:36] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [00:31:43] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [00:31:45] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [00:32:57] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [00:35:20] (03PS3) 10Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 [00:35:22] (03PS3) 10Jforrester: Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 [00:35:24] (03PS3) 10Jforrester: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/535707 [00:35:26] (03PS4) 10Jforrester: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 [00:35:28] (03PS4) 10Jforrester: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 [00:36:56] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [00:37:01] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [00:37:11] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [00:37:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:42] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [00:38:03] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [00:40:29] RECOVERY - MegaRAID on helium is OK: OK: optimal, 1 logical, 12 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:43:29] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:43:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:14] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10Aklapper) @Agusbou2015: Why? Please always elaborate why when adding comments. [01:13:03] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) restbase1018 is decommissioned and ready to be reimaged. [01:17:20] (03PS4) 10Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 [01:17:22] (03PS4) 10Jforrester: Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 [01:17:24] (03PS4) 10Jforrester: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 [01:17:26] (03PS5) 10Jforrester: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 [01:17:28] (03PS5) 10Jforrester: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 [01:17:30] (03PS9) 10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [01:17:32] (03PS1) 10Jforrester: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 [01:17:34] (03PS1) 10Jforrester: WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 [01:19:03] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [01:19:44] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update --no-dev` 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [01:19:49] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [01:19:59] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [01:20:05] (03CR) 10jerkins-bot: [V: 04-1] Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [01:20:08] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [01:20:26] (03CR) 10jerkins-bot: [V: 04-1] tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 (owner: 10Jforrester) [01:21:04] (03CR) 10jerkins-bot: [V: 04-1] WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 (owner: 10Jforrester) [01:37:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:49] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:01:44] (PS1) Ayounsi: Revert "Depool ulsfo for DC power work" [dns] - https://gerrit.wikimedia.org/r/535982
[02:02:48] (CR) Ayounsi: [C: +2] Revert "Depool ulsfo for DC power work" [dns] - https://gerrit.wikimedia.org/r/535982 (owner: Ayounsi)
[02:02:53] (PS2) Ayounsi: Revert "Depool ulsfo for DC power work" [dns] - https://gerrit.wikimedia.org/r/535982
[02:03:36] !log repooling ulsfo
[02:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:05] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.4 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:23:28] PROBLEM - LVS HTTP IPv4 #page on ncredir-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:24:25] paged. is that due to repool ulsfo?
[02:24:46] uh?
[02:24:50] I don't think so
[02:24:52] yeah I just got paged as well
[02:24:54] RECOVERY - LVS HTTP IPv4 #page on ncredir-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:25:11] of course, the directions don't tell you how to determine which system is lvs director, heh
[02:25:11] ncredir it's on eqiad / codfw only right now
[02:25:12] odd
[02:25:27] is on text load balancers
[02:25:38] on all, oh ok
[02:25:46] "on all"?
[02:25:51] :?
[02:26:00] sorry, misparse
[02:26:22] instance hiccup maybe?
[02:26:55] so there are two ncredir instances per DC
[02:27:06] https://grafana.wikimedia.org/d/000000545/ganeti?orgId=1
[02:27:20] there was some odd spiky stuff on ganeti graphs past several mins
[02:27:25] lvs1013 is having issues reaching both of them
[02:27:28] big traffic jump
[02:28:13] wow: https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1
[02:28:47] ganeti1001 and ganeti1006 had the network bumps
[02:28:53] I'm guessing that's where the 2x ncredir live
[02:29:17] crapload of reqs for www.wikipedia.com ?
[02:29:49] to wikipedia.com, not www
[02:31:40] assuming no further pages / alerts, we're probably fine! :)
[02:43:31] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.62 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:49:23] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 75855568 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:52:33] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 55984 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:22:28] (PS1) Rxy: Revert "Add CSP headers for doc.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T232697)
[03:31:09] JSDuck is now broken for me...
[03:32:36] infinite looping gears icon
[03:32:49] https://doc.wikimedia.org/oojs-ui/master/js/
[03:37:10] JSDuck is now lame duck
[03:37:18] -.-
[03:39:37] ⚙⚙
[03:47:30] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:48:41] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:49] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:49] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [05:02:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:59] Um, if it hasn't been reported yet, there are a bunch of 503 errors [05:03:30] I couldn't access phab to report it, but see, eg, `Request from [snip] via cp2013 cp2013, Varnish XID 880596179 [05:03:49] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d 
[05:03:51] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [05:03:59] (03PS1) 10Marostegui: mariadb: Decommission db1073 [puppet] - 10https://gerrit.wikimedia.org/r/535994 (https://phabricator.wikimedia.org/T231892) [05:04:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:04:51] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:05:00] ping jbond42 see my report above. 
Still can't access wikidata or commons [05:05:09] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:05:13] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:05:35] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:05:35] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:05:47] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:05:55] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): 
/{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Pr [05:05:55] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [05:05:55] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:06:33] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [05:07:09] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:07:10] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:07:23] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:07:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:07:31] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [05:08:05] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:08:07] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [05:08:21] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2984 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [05:08:25] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:09:21] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.5079 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [05:09:30] Like Danny, I'm seeing lots of errors from cp2012, cp2013, and cp2023 [05:09:53] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.4225 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [05:09:57] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:10:25] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10JJMC89) [05:11:11] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10DannyS712) [05:11:29] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2006_v4, cp2007_v4, cp2010_v4, cp2012_v4, cp2013_v4, cp2016_v4, cp2019_v4, cp2023_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:11:41] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2001_v4, cp2004_v4, cp2006_v4, cp2010_v4, cp2016_v4, cp2023_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:11:43] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2001_v4, cp2004_v4, 
cp2007_v4, cp2010_v4, cp2012_v4, cp2013_v4, cp2019_v4, cp2023_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:11:49] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [05:11:53] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 51 not-conn: cp2001_v4, cp2007_v4, cp2012_v4, cp2013_v4, cp2016_v4, cp2019_v4, cp2023_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:12:15] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 51 not-conn: cp2001_v4, cp2004_v4, cp2006_v4, cp2012_v4, cp2013_v4, cp2016_v4, cp2019_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:12:15] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10JJMC89) I've seen similar errors from cp2012, cp2013, and cp2023 when trying to edit/preview/save and view page histories. [05:12:33] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [05:12:39] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 53 not-conn: cp2004_v4, cp2007_v4, cp2010_v4, cp2012_v4, cp2016_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:12:47] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2004_v4, cp2006_v4, cp2007_v4, cp2013_v4, cp2019_v4, cp2023_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:13:23] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [05:13:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:13:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:13:59] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10DannyS712) In case it helps, another specific error code: 880596179 - cp2013 [05:14:01] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10JJMC89) [05:14:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:14:39] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [05:14:45] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [05:15:03] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[05:16:11] Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (JarrahTree) Who knows - where I am - via cp2019 cp2019, Varnish XID 773227245 Error: 503, Backend fetch failed at Thu, 12 Sep 2019 05:03:55 GMT
[05:17:31] Definitely looks like something started going badly about 15 minutes ago - massive spike in 500 errors
[05:17:34] Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (Marostegui) We are checking
[05:18:27] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logg
[05:18:27] ic=All&var-consumer_group=All
[05:20:40] <_joe_> twentyafterfour: 500 or 5xx?
[05:20:55] 503 for me
[05:21:03] 5xx
[05:21:07] <_joe_> DannyS712: still seeing those?
[05:21:37] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[05:21:40] yeah mostly 503 ... https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X?_g=h@66534ad&_a=h@a2e0e60
[05:22:18] Per the graphs it looks like it is gone now?
https://grafana.wikimedia.org/d/000000503/varnish-http-errors?refresh=5m&orgId=1
[05:22:30] yeah, gone for me
[05:22:32] <_joe_> yeah hence my question
[05:22:39] <_joe_> as soon as we started looking...
[05:22:51] weird
[05:23:39] <_joe_> !log restarting strongswan on cp1077
[05:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:19] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:24:25] <_joe_> yep
[05:24:28] <_joe_> that was enough
[05:24:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:39] <_joe_> I know strongswan well enough to know how it can fail
[05:24:44] <_joe_> and never recover
[05:24:57] <_joe_> ok, gonna do the same on all eqiad cp hosts
[05:25:08] Seems to have subsided. I did see 503s on enwiki and phab during the spike.
[05:25:13] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:26:01] <_joe_> !log restarting strongswan on all eqiad caches that need it
[05:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:17] _joe_: you want me to do it on codfw ones?
[05:26:27] <_joe_> marostegui: no need [05:26:30] gotcha [05:26:59] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:27:17] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:27:43] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:28:03] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:28:29] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:29:05] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [05:36:53] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2) [05:37:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:33] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 214.72 ms [05:43:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:41] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:45:57] PROBLEM - Apache HTTP on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:46:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 259, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:46:11] PROBLEM - Nginx local proxy to apache on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:46:13] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:46:19] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:47:15] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 82454 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:47:31] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:47:47] RECOVERY - Nginx local proxy to apache on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:48:49] PROBLEM - Check the Netbox report coherence for fail status. 
on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:55:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1073 [puppet] - 10https://gerrit.wikimedia.org/r/535994 (https://phabricator.wikimedia.org/T231892) (owner: 10Marostegui) [05:55:33] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) [05:58:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 261, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:58:55] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:59:28] !log Remove db1073 from tendril and zarcillo T231892 [05:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:32] T231892: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 [06:00:49] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10Joe) 05Open→03Resolved a:03Joe Hi, we had some connectivity issues earlier. As soon as we were alerted and started checking, the issues recovered. We suspect the root cause to be a network maint... [06:00:57] !log Stop MySQL on db1073 for decommission T231892 [06:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) a:05Marostegui→03RobH [06:02:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) This host is now ready for #dc-ops to finish its decommission steps. 
[06:02:55] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:03:38] (03PS3) 10Giuseppe Lavagetto: sudo: use validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/492718 [06:05:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sudo: use validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/492718 (owner: 10Giuseppe Lavagetto) [06:07:16] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10Pine) Thank you for the quick response. [06:10:29] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:37] 10Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10JarrahTree) Thanks [06:12:10] (03PS1) 10Vgutierrez: ATS: Shrink SSL session cache size [puppet] - 10https://gerrit.wikimedia.org/r/535995 (https://phabricator.wikimedia.org/T232298) [06:34:46] PROBLEM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:44:40] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:46:37] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/18247/" [puppet] - 10https://gerrit.wikimedia.org/r/535995 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [06:48:16] ACKNOWLEDGEMENT - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL Cas Rusnov Will address. https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:48:31] ACKNOWLEDGEMENT - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL Cas Rusnov Will address. 
https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:50:38] (03PS1) 10Marostegui: control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/536005 [06:50:55] (03CR) 10Marostegui: [C: 04-2] "Not yet ready" [software] - 10https://gerrit.wikimedia.org/r/536005 (owner: 10Marostegui) [06:51:31] !log restarting ATS-TLS on cp4021 and cp2002 to get the new SSL session cache size - T232298 [06:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:35] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 [07:00:42] (03PS1) 10Vgutierrez: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536006 (https://phabricator.wikimedia.org/T219765) [07:00:44] (03PS1) 10Vgutierrez: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536007 (https://phabricator.wikimedia.org/T219765) [07:00:46] (03PS1) 10Vgutierrez: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536008 (https://phabricator.wikimedia.org/T219765) [07:00:48] (03PS1) 10Vgutierrez: ocsp: Provide basic test coverage [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536009 (https://phabricator.wikimedia.org/T219765) [07:00:50] (03PS1) 10Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536010 (https://phabricator.wikimedia.org/T219765) [07:00:52] (03PS1) 10Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536011 (https://phabricator.wikimedia.org/T219765) [07:00:54] (03PS1) 10Vgutierrez: Release 0.21 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536012 (https://phabricator.wikimedia.org/T219765) [07:07:05] (03CR) 
10Vgutierrez: [C: 03+2] x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536006 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:07:13] (03CR) 10Vgutierrez: [C: 03+2] ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536007 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:07:18] (03CR) 10Vgutierrez: [C: 03+2] ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536008 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:07:22] (03CR) 10Vgutierrez: [C: 03+2] ocsp: Provide basic test coverage [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536009 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:07:26] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Provide OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536010 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:07:32] (03CR) 10Vgutierrez: [C: 03+2] api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536011 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:07:36] (03CR) 10Vgutierrez: [C: 03+2] Release 0.21 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536012 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:09:53] (03Merged) 10jenkins-bot: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536006 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:10:33] (03Merged) 10jenkins-bot: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536007 
(https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:10:37] (03Merged) 10jenkins-bot: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536008 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:10:41] (03Merged) 10jenkins-bot: ocsp: Provide basic test coverage [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536009 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:10:44] (03Merged) 10jenkins-bot: acme_chief: Provide OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536010 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:14:12] (03Merged) 10jenkins-bot: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536011 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:14:48] (03CR) 10jenkins-bot: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536006 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:14:53] (03CR) 10jenkins-bot: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536007 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:15:00] (03Merged) 10jenkins-bot: Release 0.21 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536012 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:15:08] (03CR) 10jenkins-bot: ocsp: Provide basic test coverage [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536009 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:15:18] (03CR) 10jenkins-bot: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536008 
(https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:15:24] (03CR) 10jenkins-bot: acme_chief: Provide OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536010 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:17:06] (03CR) 10jenkins-bot: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536011 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:17:45] (03CR) 10jenkins-bot: Release 0.21 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536012 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:20:25] (03PS1) 10Vgutierrez: debian: Add release 0.21 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536015 (https://phabricator.wikimedia.org/T219765) [07:26:16] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.21 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536015 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:29:26] (03CR) 10jenkins-bot: debian: Add release 0.21 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/536015 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:39:57] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/18249/" [puppet] - 10https://gerrit.wikimedia.org/r/535859 (owner: 10Elukey) [07:45:36] !log uploaded acme-chief 0.21 to apt.wikimedia.org (buster) - T219765 [07:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:39] T219765: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 [07:46:41] (03CR) 10Vgutierrez: [C: 03+2] wdqs: allow port 8888 for domain networks [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [07:46:50] (03PS5) 10Vgutierrez: wdqs: 
allow port 8888 for domain networks [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [07:47:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:20] (03CR) 10Jcrespo: [C: 03+1] control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/536005 (owner: 10Marostegui) [07:52:15] (03PS2) 10Elukey: profile::mediawiki::webserver: remove hhvm restart cron when needed [puppet] - 10https://gerrit.wikimedia.org/r/535859 [07:53:45] (03CR) 10Elukey: "Effie: something seems to right in the pcc's output, since I can see the conditional restart cron in mw1270's new catalog, but it has prof" [puppet] - 10https://gerrit.wikimedia.org/r/535859 (owner: 10Elukey) [07:54:02] (03CR) 10Elukey: "> Effie: something seems to right in the pcc's output, since I can" [puppet] - 10https://gerrit.wikimedia.org/r/535859 (owner: 10Elukey) [07:55:35] (03CR) 10Elukey: "Ah ok it is in the catalog because I moved it out the if block, but it is absented. 
Good :)" [puppet] - 10https://gerrit.wikimedia.org/r/535859 (owner: 10Elukey) [07:56:24] (03CR) 10Elukey: [C: 03+2] profile::mediawiki::webserver: remove hhvm restart cron when needed [puppet] - 10https://gerrit.wikimedia.org/r/535859 (owner: 10Elukey) [07:58:00] (03CR) 10Vgutierrez: [C: 03+2] lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [07:58:10] (03PS6) 10Vgutierrez: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [08:01:59] !log restarting pybal on lvs1016 - T176875 [08:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:02] T176875: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 [08:04:52] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.32:8888]) https://wikitech.wikimedia.org/wiki/PyBal [08:05:07] yeah that's expected :) [08:07:01] !log restarting pybal on lvs2006 - T176875 [08:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:04] T176875: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 [08:08:10] 08̶W̶a̶r̶n̶i̶n̶g [08:11:23] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=puppetmaster1001.eqiad.wmnet,service=wdqs-heavy-queries [08:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:48] the empty messages from librenms are spooky [08:13:09] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wdqs,service=wdqs-heavy-queries [08:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:20] now that looks better :) [08:15:02] PROBLEM - Check the last execution of 
netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:16:18] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:17:01] !log restarting pybal on lvs1015 and lvs2003 - T176875 [08:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:04] T176875: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 [08:22:54] !log upgrading to acme-chief 0.21 on acmechief-test instances - T219765 [08:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:57] T219765: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 [08:25:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:40:07] (03CR) 10Jcrespo: "https://phabricator.wikimedia.org/T232704" [puppet] - 10https://gerrit.wikimedia.org/r/535895 (https://phabricator.wikimedia.org/T213223) (owner: 10Dzahn) [08:40:30] (03CR) 10Elukey: "The code is ready for review :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [08:40:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 259, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:13] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:41:18] 10Operations, 10MediaWiki-Documentation, 10Release-Engineering-Team, 10Security-Team, 10Documentation: Wikimedia 
Documentation Broken - https://phabricator.wikimedia.org/T232704 (10jcrespo) [08:45:22] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for MARTIN GERLACH - https://phabricator.wikimedia.org/T232707 (10MGerlach) [08:47:22] 10Operations, 10MediaWiki-Documentation, 10Release-Engineering-Team, 10Security-Team, 10Documentation: Wikimedia Documentation Broken - https://phabricator.wikimedia.org/T232704 (10JaydenKieran) Looks like @Rxy has a revert patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/535987/ [08:48:04] (03CR) 10Filippo Giunchedi: "journald by default should forward everything to (r)syslog, thus a form of persistence is already there, unless of course journald doesn't" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [08:51:40] 10Operations, 10MediaWiki-Documentation, 10Release-Engineering-Team, 10Security-Team, 10Documentation: Wikimedia Documentation Broken - https://phabricator.wikimedia.org/T232704 (10jcrespo) [08:54:09] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Mathew.onipe) @Addshore @Ladsgroup @WMDE-leszek, can you test that you can reach wdqs.svc.eqiad.wmnet on port 8888. LVS and othe... 
[08:54:20] The ospf alert is related to the Telia link, there seems to be planned maintenance [08:56:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 261, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:57:31] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:01:22] (03PS2) 10Alexandros Kosiaris: calico: remove eventbus fw rules [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [09:01:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: remove eventbus fw rules [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [09:05:26] 10Operations, 10netops, 10observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (10fgiunchedi) [09:06:32] (03CR) 10Jcrespo: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [09:06:33] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:07:55] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:10:16] (03CR) 10Jcrespo: "Note the above patch is only for 2 hosts. (but the common code should be at least infrastructure wide-correct)." 
[puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [09:11:31] (03PS2) 10Filippo Giunchedi: swift: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535515 (https://phabricator.wikimedia.org/T205870) [09:11:59] (03PS1) 10Jbond: ldap account: add Alex Hollender [puppet] - 10https://gerrit.wikimedia.org/r/536131 (https://phabricator.wikimedia.org/T232476) [09:12:02] (03CR) 10Hashar: "I will repurpose this patch to use Content-Security-Policy-Report-Only instead, so we get the logs on our side and can finely tweak the ru" [puppet] - 10https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T232697) (owner: 10Rxy) [09:14:49] (03CR) 10Jbond: [C: 03+2] ldap account: add Alex Hollender [puppet] - 10https://gerrit.wikimedia.org/r/536131 (https://phabricator.wikimedia.org/T232476) (owner: 10Jbond) [09:15:56] (03PS2) 10Hashar: Change doc.wikimedia CSP header to report only [puppet] - 10https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T213223) (owner: 10Rxy) [09:16:19] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/18250/" [puppet] - 10https://gerrit.wikimedia.org/r/535515 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:16:34] (03PS3) 10Filippo Giunchedi: swift: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535515 (https://phabricator.wikimedia.org/T205870) [09:17:15] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] swift: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535515 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:18:26] (03CR) 10Rxy: [C: 03+1] "> I will repurpose this patch to use Content-Security-Policy-Report-Only" [puppet] - 10https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T213223) (owner: 10Rxy) [09:19:31] (03PS3) 10Giuseppe Lavagetto: Change doc.wikimedia CSP header to report only [puppet] - 
10https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T213223) (owner: 10Rxy) [09:19:46] (03PS3) 10Alexandros Kosiaris: calico: remove eventbus fw rules [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [09:20:40] <_joe_> akosiaris: so eventbus is no more? [09:21:08] _joe_: looks like it! [09:21:33] \o/ \o/ \o/ [09:21:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Change doc.wikimedia CSP header to report only [puppet] - 10https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T213223) (owner: 10Rxy) [09:22:04] yesssss only eventgate now :) [09:22:15] <_joe_> so we just need to remove it [09:22:28] Andrew is working on the clean up, there is a task [09:22:37] <_joe_> and finally have the software use kafka instead of a dumbed down rest interface, and we're ready \o/ [09:22:38] <_joe_> :D [09:22:45] https://phabricator.wikimedia.org/T232122 [09:24:02] (03PS1) 10Elukey: debconf::set: use set-selections instead of communicate [puppet] - 10https://gerrit.wikimedia.org/r/536132 [09:25:24] (03PS4) 10Alexandros Kosiaris: calico: remove eventbus fw rules [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [09:25:34] 3rd time I rebase this [09:25:40] this time I am not waiting for CI [09:25:44] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] calico: remove eventbus fw rules [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [09:28:10] 08Warning [09:28:19] (03PS3) 10Urbanecm: Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535944 (https://phabricator.wikimedia.org/T232657) (owner: 104nn1l2) [09:37:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:15] !log 
compressing tables on labsdb1012 T232446 [09:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:18] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [09:44:25] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [09:47:32] 10Operations, 10Traffic, 10netops: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (10Aklapper) [09:47:39] 10Operations, 10netops, 10observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (10jbond) I think this is a great idea. As to which host, the cumin server makes sense to me or perhaps bastion? The user is a bit of a pain, it would be nice if we could ha... [10:01:25] (03PS1) 10Filippo Giunchedi: statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 [10:01:45] (03Abandoned) 10Giuseppe Lavagetto: conftool: schema for database configuration on etcd [puppet] - 10https://gerrit.wikimedia.org/r/422373 (https://phabricator.wikimedia.org/T197531) (owner: 10Giuseppe Lavagetto) [10:02:04] (03Abandoned) 10Giuseppe Lavagetto: Manage slave databases load/presence via etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422374 (owner: 10Giuseppe Lavagetto) [10:06:27] (03PS2) 10Filippo Giunchedi: statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 [10:10:27] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/18252/" [puppet] - 10https://gerrit.wikimedia.org/r/536136 (owner: 10Filippo Giunchedi) [10:14:02] (03PS1) 10Filippo Giunchedi: facilities: ps1-b6-eqiad replaced with newer PDU [puppet] - 10https://gerrit.wikimedia.org/r/536140 (https://phabricator.wikimedia.org/T227541) [10:14:12] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - 
https://phabricator.wikimedia.org/T229452 (10Marostegui) Any update about this? Thanks! [10:17:42] (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: ps1-b6-eqiad replaced with newer PDU [puppet] - 10https://gerrit.wikimedia.org/r/536140 (https://phabricator.wikimedia.org/T227541) (owner: 10Filippo Giunchedi) [10:21:35] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Ladsgroup) The requests work but the TLS ones give me this error: ` ladsgroup@stat1007:~$ curl https://wdqs.svc.eqiad.wmnet:8888... [10:25:00] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Mathew.onipe) @Ladsgroup there's no TLS termination on that port for now. We should have and I will work on it in the nearest fu... [10:26:31] (03PS1) 10Ladsgroup: statistics: Use the new wdqs address [puppet] - 10https://gerrit.wikimedia.org/r/536143 (https://phabricator.wikimedia.org/T176875) [10:27:50] (03CR) 10Jcrespo: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [10:32:43] (03PS1) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [10:33:14] (03PS1) 10Filippo Giunchedi: swift: remove statsite [puppet] - 10https://gerrit.wikimedia.org/r/536146 (https://phabricator.wikimedia.org/T205870) [10:34:37] (03PS1) 10Elukey: Rename hiera private settings for Piwik's role [labs/private] - 10https://gerrit.wikimedia.org/r/536147 (https://phabricator.wikimedia.org/T231208) [10:34:47] (03CR) 10Elukey: [V: 03+2 C: 03+2] Rename hiera private settings for Piwik's role [labs/private] - 10https://gerrit.wikimedia.org/r/536147 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [10:42:22] (03PS2) 10Elukey: Generalize 
Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [10:44:58] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18256/" [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [10:50:46] (03PS1) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/536149 [10:50:48] (03PS1) 10Giuseppe Lavagetto: envoy: add command-line options to be used for zone-aware routing [puppet] - 10https://gerrit.wikimedia.org/r/536150 [10:52:57] (03CR) 10jerkins-bot: [V: 04-1] envoy: add command-line options to be used for zone-aware routing [puppet] - 10https://gerrit.wikimedia.org/r/536150 (owner: 10Giuseppe Lavagetto) [10:58:00] (03PS1) 10Alexandros Kosiaris: Remove eventbus from calico rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/536151 [10:58:17] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) [10:58:30] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) p:05Triage→03High [10:59:41] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove eventbus from calico rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/536151 (owner: 10Alexandros Kosiaris) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1100). Please do the needful. [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:54] o/ [11:00:59] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:20] any last-minute SWAT requests? 
[11:03:45] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) Enabling the session cache debug on a local instance shows this: `willikins:~ vgutierrez$ docker logs -f ats_ats_1 |fgrep timeout [E. Mgmt] log ==> [TrafficManager] using root directory '/usr'... [11:04:46] (03CR) 10Jbond: "this is definitely a question for moritz to answer. however it is worth noting that debconf::set is only used in mailmain::listserve so a" [puppet] - 10https://gerrit.wikimedia.org/r/536132 (owner: 10Elukey) [11:05:31] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) [11:06:05] 10Operations, 10netops, 10observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (10jbond) p:05Triage→03Normal [11:06:45] 10Operations, 10Analytics, 10Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10jbond) p:05Triage→03Normal [11:06:49] (03PS1) 10Alexandros Kosiaris: Force pods recreation for calico controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/536152 [11:07:11] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Force pods recreation for calico controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/536152 (owner: 10Alexandros Kosiaris) [11:07:51] 10Operations, 10FR-Q2-FY2019-20-cleanup-list, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10jbond) p:05Triage→03Normal [11:08:21] 10Operations, 10hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (10jbond) p:05Triage→03Normal [11:09:13] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[11:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:34] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) [11:09:37] 10Operations, 10Traffic: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 (10Vgutierrez) [11:09:57] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:02] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:10] (03CR) 10Elukey: "> this is definitely a question for moritz to answer. however it is" [puppet] - 10https://gerrit.wikimedia.org/r/536132 (owner: 10Elukey) [11:11:47] !log akosiaris@ helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[11:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:25] (03PS2) 10Alexandros Kosiaris: Remove informational default-kubernetes-policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/526114 [11:14:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove informational default-kubernetes-policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/526114 (owner: 10Alexandros Kosiaris) [11:15:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] "everything has been moved into helmfile and the deployment-charts repo, deleting this" [puppet] - 10https://gerrit.wikimedia.org/r/526114 (owner: 10Alexandros Kosiaris) [11:16:05] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (10jbond) wiki has been updated https://wikitech.wikimedia.org/w/index.php?title=Mailman&type=revision&diff=1837413&oldid=1826755 [11:16:12] elukey: merging your labs/private change [11:16:27] 182fe9f that is [11:26:31] ack thanks! [11:33:42] (03PS1) 10Alexandros Kosiaris: blubberoid: Remove monitoring support [deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 [11:41:27] (03PS1) 10KartikMistry: apertium-nob: New upstream release [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/536165 (https://phabricator.wikimedia.org/T218184) [11:43:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:11] Anyone to merge this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/536143 [11:44:54] Amir1: with pleasure :) [11:45:04] elukey: Thanks \o/ [11:45:11] Amir1: have you already tested the endpoint from a stat host? 
[11:45:14] (triple checking) [11:45:54] elukey: yup [11:45:57] super [11:46:03] https://phabricator.wikimedia.org/T176875#5487109 [11:46:06] (03PS2) 10Elukey: statistics: Use the new wdqs address [puppet] - 10https://gerrit.wikimedia.org/r/536143 (https://phabricator.wikimedia.org/T176875) (owner: 10Ladsgroup) [11:47:45] (03CR) 10Elukey: [C: 03+2] statistics: Use the new wdqs address [puppet] - 10https://gerrit.wikimedia.org/r/536143 (https://phabricator.wikimedia.org/T176875) (owner: 10Ladsgroup) [11:50:55] Amir1: I am running puppet on stat1007, let's sync later on when you will not need anymore any wdqs100X config (so I'll update the network's ACLs accordingly) [11:51:46] elukey: there's a tool that I need to build and deploy, that's slightly annoying but I can do it [11:51:49] later today [11:52:54] Amir1: oh yes absolutely not urgent, when you have time [11:52:56] even next week [11:53:00] just let's keep track of this [11:53:03] :) [11:53:09] sure [11:54:11] I have a quick SWAT thingy right now [11:56:05] (03PS1) 10Ladsgroup: Set item terms on write both up to Q20Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536167 (https://phabricator.wikimedia.org/T225055) [11:56:19] that’s extremely last minute, but okay [11:56:27] (03CR) 10Ladsgroup: [C: 03+2] Set item terms on write both up to Q20Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536167 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:57:07] Sorry [11:57:33] no problem [11:57:36] added it to the calendar [11:57:47] you’re taking care of deployment? [11:57:51] oh thanks. I was about to [11:57:57] yeah, I deploy it [11:59:32] jenkins :/ [11:59:46] The new quibble is not deployed yet I think [12:00:19] it’s not blocked on the rest of the gate-and-submit chain, though, right? 
[12:00:23] should be done soon [12:00:38] it gets queued because every node is busy [12:01:10] (03Merged) 10jenkins-bot: Set item terms on write both up to Q20Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536167 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [12:01:29] (03CR) 10jenkins-bot: Set item terms on write both up to Q20Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536167 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [12:02:20] (03PS3) 10Muehlenhoff: Fix distro check in puppetdb default file for JAVA_BIN [puppet] - 10https://gerrit.wikimedia.org/r/535877 [12:03:25] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:536167|Set item terms on write both up to Q20mio (T225055)]] (duration: 01m 31s) [12:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:28] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055 [12:03:35] !log EU SWAT is done [12:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:44] ^ jynus marostegui [12:03:58] thanks [12:05:03] (03CR) 10Muehlenhoff: [C: 03+2] Fix distro check in puppetdb default file for JAVA_BIN [puppet] - 10https://gerrit.wikimedia.org/r/535877 (owner: 10Muehlenhoff) [12:05:39] marostegui: in a couple of hours we can start https://gerrit.wikimedia.org/r/c/operations/puppet/+/535526 [12:07:35] (03CR) 10Lucas Werkmeister (WMDE): mediawiki: Start rebuildItermTerms for wikidatawiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535526 (https://phabricator.wikimedia.org/T225056) (owner: 10Ladsgroup) [12:07:50] 10Operations, 10netops, 10observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (10fgiunchedi) Yeah I think cumin host would be ok and ditto for user atlas, and we can also use the `ripe-atlas-tools` debian package! 
[12:08:05] (03CR) 10Ladsgroup: mediawiki: Start rebuildItermTerms for wikidatawiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535526 (https://phabricator.wikimedia.org/T225056) (owner: 10Ladsgroup) [12:09:17] (03PS2) 10Ladsgroup: mediawiki: Start rebuildItermTerms for wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/535526 (https://phabricator.wikimedia.org/T225056) [12:16:10] (03PS2) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/536149 [12:16:50] (03PS3) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [12:19:42] (03PS4) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [12:22:53] (03PS2) 10Filippo Giunchedi: thumbor: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) [12:22:55] (03PS3) 10Filippo Giunchedi: statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 [12:22:57] (03PS2) 10Filippo Giunchedi: swift: remove statsite [puppet] - 10https://gerrit.wikimedia.org/r/536146 (https://phabricator.wikimedia.org/T205870) [12:25:47] (03CR) 10Filippo Giunchedi: [C: 03+1] statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 (owner: 10Filippo Giunchedi) [12:27:41] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10MoritzMuehlenhoff) [12:28:00] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This 
is complete, restbase-dev is running Stretch. [12:28:03] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) [12:28:05] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:29:07] (03PS5) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [12:32:23] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:32:29] (03PS6) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [12:34:01] (03PS7) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [12:34:25] 10Operations, 10Wikimedia-Incident: September 2019 DoS attacks [Public] - https://phabricator.wikimedia.org/T232224 (10Patriccck) [12:35:20] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18262/matomo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [12:38:10] (03PS1) 10Elukey: role::analytics_cluster::cordinator: add mysqldump/bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/536177 (https://phabricator.wikimedia.org/T231208) [12:38:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:37] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - 
https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:38:40] !log reimaging restbase1018 to stretch [12:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:24] (03PS1) 10Elukey: Add new hiera backup configs for analytics coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/536178 [12:41:34] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add new hiera backup configs for analytics coordinator [labs/private] - 10https://gerrit.wikimedia.org/r/536178 (owner: 10Elukey) [12:43:43] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18264/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/536177 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [12:44:30] (03PS2) 10Andrew Bogott: boostrapvz: disable systemd-timesyncd during first boot [puppet] - 10https://gerrit.wikimedia.org/r/535931 [12:45:28] (03CR) 10Andrew Bogott: [C: 03+2] boostrapvz: disable systemd-timesyncd during first boot [puppet] - 10https://gerrit.wikimedia.org/r/535931 (owner: 10Andrew Bogott) [12:50:53] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10fgiunchedi) [12:51:18] (03CR) 10Muehlenhoff: [C: 03+1] "debconf-communicate is like a debug tool which allows you to run the same settings as if you were running on the debconf frontends. debcon" [puppet] - 10https://gerrit.wikimedia.org/r/536132 (owner: 10Elukey) [12:51:37] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:51:43] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:52:06] ^ checking [12:54:05] (03PS3) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter everywhere. 
[puppet] - 10https://gerrit.wikimedia.org/r/536149 [12:54:39] 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10mark) >>! In T231387#5471833, @Varnent wrote: > @mark - Thank you very much for that thoughtful and helpful reply! > > Talking it over, we would like to try the first option if you believe that will work. >... [12:56:01] !log depool mw1233 [12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoyproxy: use the hot restarter everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/536149 (owner: 10Giuseppe Lavagetto) [12:57:56] !log restarting hhvm on mw1233 and repooling [12:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:09] (03PS2) 10Elukey: debconf::set: use set-selections instead of communicate [puppet] - 10https://gerrit.wikimedia.org/r/536132 [13:00:04] hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1300). 
[13:00:51] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 82875 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:00:59] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:01:00] (03CR) 10Elukey: [C: 03+2] debconf::set: use set-selections instead of communicate [puppet] - 10https://gerrit.wikimedia.org/r/536132 (owner: 10Elukey) [13:01:10] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime [13:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:10] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:00] (03PS2) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) [13:06:05] (03PS1) 10Alexandros Kosiaris: coredns: Make it actually prometheus scrapable [deployment-charts] - 10https://gerrit.wikimedia.org/r/536181 [13:06:07] (03PS1) 10Alexandros Kosiaris: Add a network policy to coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/536182 [13:07:57] (03PS2) 10Alexandros Kosiaris: blubberoid: Remove monitoring support [deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 [13:16:18] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add a network policy to coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/536182 (owner: 10Alexandros Kosiaris) [13:16:27] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] coredns: Make it actually prometheus scrapable [deployment-charts] - 10https://gerrit.wikimedia.org/r/536181 (owner: 10Alexandros Kosiaris) [13:17:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:03] (03PS3) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) [13:22:17] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [13:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:20] (03CR) 10Elukey: "Moritz: after the recent change in debconf::set, I have basically simplified everything to be set before package krb5-kdc (since the confi" [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:33:38] 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10Ottomata) [13:33:49] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10Ottomata) a:05Ottomata→03None [13:35:04] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10Ottomata) [13:36:29] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:22] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 4 others: Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Gehel) [13:40:36] (03CR) 10Ottomata: [C: 03+1] Generalize Piwik's db backup profile for Analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [13:40:42] (03CR) 10Ottomata: [C: 03+1] "Thanks Luca!" 
[puppet] - 10https://gerrit.wikimedia.org/r/536177 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [13:41:07] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 52.23 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:42:46] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (10Lucas_Werkmeister_WMDE) [13:44:13] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 92.97 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:47:13] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:15] Amir1: error rate for wikidata seems low [13:57:47] could it be then the bot the reason for the issues? [14:01:29] jynus: yeah. Now it's also only up to Q20mio [14:01:38] ah, thanks [14:02:21] Amir1: does wikibase have an irc channel, I feel bad ping ing you only? [14:03:24] #wikimedia-de-tech ? [14:03:31] ok, that works, thanks [14:03:50] I just don't want to bother you every time :-D [14:04:39] and sometimes a quick interactive chat is better than a ticket [14:05:52] Yeah, no worries [14:10:04] (03CR) 10Muehlenhoff: "Looks good, two comments inline. 
The next step would be to figure out the best way to integrate the krb5_newrealm invocation into the pupp" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [14:22:37] (03PS2) 10Giuseppe Lavagetto: envoy: add command-line options to be used for zone-aware routing [puppet] - 10https://gerrit.wikimedia.org/r/536150 [14:27:11] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10herron) a:03herron [14:27:55] (03PS1) 10Muehlenhoff: Add missing JBOD config for restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/536201 (https://phabricator.wikimedia.org/T224553) [14:29:28] !log ensure cr1-eqiad is vrrp backup for all groups - T226424 [14:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:31] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [14:30:50] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10ayounsi) [14:32:24] (03PS1) 10Subramanya Sastry: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 [14:32:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry) [14:32:56] (03PS2) 10Subramanya Sastry: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 [14:33:24] (03CR) 10jerkins-bot: [V: 04-1] Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry) [14:34:13] (03PS1) 10Muehlenhoff: Remove obsolete restbase hiera settings [puppet] - 
10https://gerrit.wikimedia.org/r/536204 [14:35:35] (03PS1) 10Alexandros Kosiaris: calico: Add API access for coreDNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/536206 [14:36:43] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] calico: Add API access for coreDNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/536206 (owner: 10Alexandros Kosiaris) [14:37:17] (03PS1) 10Subramanya Sastry: Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) [14:37:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:35] (03CR) 10jerkins-bot: [V: 04-1] Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:37:55] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:00] (03PS3) 10Subramanya Sastry: Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 [14:39:56] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:39] (03CR) 10Subramanya Sastry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:41:01] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[14:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:43] (03CR) 10jerkins-bot: [V: 04-1] Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:43:08] (03PS1) 10BBlack: Temporarily connect all eqiad pybal to cr2 [puppet] - 10https://gerrit.wikimedia.org/r/536209 (https://phabricator.wikimedia.org/T226424) [14:45:07] (03PS2) 10BBlack: Temporarily connect all eqiad pybal to cr2 [puppet] - 10https://gerrit.wikimedia.org/r/536209 (https://phabricator.wikimedia.org/T226424) [14:45:18] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:32] (03PS3) 10Halfak: Adds git::lfs class and requirement to ores::base [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) [14:45:48] heh I did PS2 because I saw that I left unaligned => in PS1, but jenkins ended up V+2 on PS1 anyways... [14:46:04] ho ho [14:46:10] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ores::base puppet role - https://phabricator.wikimedia.org/T232494 (10Halfak) We need git-lfs on the nodes we use to build models too. Hence, why I suggested ores::base. But otherwise, I think this sounds like a f... [14:46:19] did we kill arrow alignment from the V rules? [14:46:24] (03CR) 10Filippo Giunchedi: Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:46:42] (03CR) 10CRusnov: "Looks good minor question inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535945 (owner: 10Ayounsi) [14:46:43] that'd be gold [14:47:34] (03CR) 10jerkins-bot: [V: 04-1] Adds git::lfs class and requirement to ores::base [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [14:48:06] (03CR) 10BBlack: [C: 03+2] Temporarily connect all eqiad pybal to cr2 [puppet] - 10https://gerrit.wikimedia.org/r/536209 (https://phabricator.wikimedia.org/T226424) (owner: 10BBlack) [14:49:55] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536201 (https://phabricator.wikimedia.org/T224553) (owner: 10Muehlenhoff) [14:50:45] !log restart pybal on lvs1016 to move BGP conn to cr2-eqiad - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536209 - T226424 [14:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [14:51:56] (03PS2) 10Subramanya Sastry: Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) [14:53:20] !log restart pybal on lvs1013 to move BGP conn to cr2-eqiad - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536209 - T226424 [14:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] (03CR) 10jerkins-bot: [V: 04-1] Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:55:15] (03CR) 10Subramanya Sastry: Take #2: Redirect Parsoid/PHP rt-testing log events to "parsoid-tests" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536208 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [14:58:09] (03PS1) 10BBlack: Temporarily connect all 
eqiad pybal to cr1 [puppet] - 10https://gerrit.wikimedia.org/r/536211 (https://phabricator.wikimedia.org/T226424) [14:58:20] (03CR) 10Jbennett: "From the security side while this isn't ideal it's low risk. Please ensure these changes are still OK with legal and please prioritize imp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [14:58:52] (03PS2) 10Muehlenhoff: Add missing JBOD config for restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/536201 (https://phabricator.wikimedia.org/T224553) [15:00:09] (03CR) 10jerkins-bot: [V: 04-1] Temporarily connect all eqiad pybal to cr1 [puppet] - 10https://gerrit.wikimedia.org/r/536211 (https://phabricator.wikimedia.org/T226424) (owner: 10BBlack) [15:00:27] (03CR) 10Muehlenhoff: [C: 03+2] Add missing JBOD config for restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/536201 (https://phabricator.wikimedia.org/T224553) (owner: 10Muehlenhoff) [15:01:08] (mostly noting it in case there's some alert-spam) [15:02:33] !log disable primary tunnel to CF in eqiad [15:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:17] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:03:27] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:03:27] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:03:40] !log rolled back disable primary tunnel to CF in eqiad [15:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:51] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: 
connect to address 10.64.48.100 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:04:05] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete restbase hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/536204 (owner: 10Muehlenhoff) [15:04:07] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:09] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:04:31] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:35] PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:04:43] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:56] host is being reimaged btw ^ [15:05:10] oic [15:05:14] godog: ah, well, I thought my change caused the issue [15:05:18] :) [15:05:28] yeah, downtime expired, I'll extend [15:05:50] !log disable primary tunnel to CF in eqiad (for real this time, I did see an uptake of traffic on backup link before the rollback) [15:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:17] godog: oh ! 
[15:06:18] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused Muehlenhoff Reimage to Stretch https://phabricator.wikimedia.org/T93886 [15:06:18] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Reimage to Stretch https://phabricator.wikimedia.org/T120662 [15:06:18] ACKNOWLEDGEMENT - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive Muehlenhoff Reimage to Stretch https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:18] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused Muehlenhoff Reimage to Stretch https://phabricator.wikimedia.org/T93886 [15:06:18] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Reimage to Stretch https://phabricator.wikimedia.org/T120662 [15:06:19] ACKNOWLEDGEMENT - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive Muehlenhoff Reimage to Stretch https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:25] I had just logged in [15:06:30] I am leaving :p [15:06:38] tx [15:06:58] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused Muehlenhoff Reimage to Stretch https://phabricator.wikimedia.org/T93886 [15:06:58] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff Reimage to Stretch https://phabricator.wikimedia.org/T120662 [15:06:58] ACKNOWLEDGEMENT - cassandra-c service on restbase1018 is 
CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive Muehlenhoff Reimage to Stretch https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:07:45] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10Krinkle) >>! In T189333#5483346, @fgiunchedi wrote: >>>! In T189333#5481492, @Krinkle wrote: >> I re-ran my analysis today, and oddly eno... [15:09:13] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18266/" [puppet] - 10https://gerrit.wikimedia.org/r/536177 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [15:09:25] (03PS8) 10Elukey: Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) [15:09:38] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10jcrespo) I got asked specifically about this by mark. He asked me to track the progress of this as it blocks an important goal and general service (backups). Old backup hardwar... [15:09:58] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) restbase1018 is reimaged and ready for Cassandra bootstrap. 
[15:10:14] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10ayounsi) [15:10:51] PROBLEM - SSH wtp1031.mgmt on wtp1031.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:13:36] !log shutdown re1.cr1-eqiad - T226424 [15:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:40] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [15:14:16] (03CR) 10Elukey: [C: 03+2] Generalize Piwik's db backup profile for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/536145 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [15:18:09] (03PS6) 10Cwhite: add generic interface to metrics gathering [software/service-checker] - 10https://gerrit.wikimedia.org/r/532807 [15:18:46] (03PS2) 10Elukey: role::analytics_cluster::cordinator: add mysqldump/bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/536177 (https://phabricator.wikimedia.org/T231208) [15:22:59] !log failover master RE from RE0 to RE1 on cr1-eqiad - T226424 [15:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:02] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [15:24:05] (03PS1) 10Herron: prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) [15:25:17] linecards still booting up [15:25:32] (03CR) 10jerkins-bot: [V: 04-1] prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [15:28:07] (03PS2) 10Herron: prometheus: switch to per-site aggregate ipsec checks [puppet] - 
10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) [15:28:22] linecards are back [15:28:53] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:29:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:29:41] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) timed out before a response was received: /sessions/v1/{key} (Store value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [15:29:50] bblack: ^ [15:30:03] _joe_: ^ ? [15:30:29] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:30:33] I don't think it's related to my maintenanc [15:30:34] e [15:30:39] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:30:48] <_joe_> well I have no idea either, but the timing is suspect :P [15:31:07] RECOVERY - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [15:31:09] ok [15:31:20] I don't think it's a blocker? 
[15:31:46] !log shutdown re0.cr1-eqiad - T226424 [15:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [15:31:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 107 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [15:33:10] 04Critical [15:33:14] ah, I focused so much on CF that I forgot about v6 [15:35:48] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/18267/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [15:37:33] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 24 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [15:38:05] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10leila) I approve. thanks! [15:39:30] !log deactivate transit4/6 on cr1-eqiad - T226424 [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:34] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [15:40:21] I wonder why just the k8s alerted there [15:41:09] maybe more sensible? 
[15:41:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "nit inline, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [15:41:20] all the failures looked like: [15:41:24] Sep 12 15:29:31 lvs1015 pybal[12097]: [sessionstore_8081 ProxyFetch] WARN: kubernetes1001.eqiad.wmnet (enabled/down/not pooled): Fetch failed (http [15:41:27] s://localhost/healthz), 30.001 s [15:41:37] (30s timeout on the healthz endpoint) [15:41:59] oh no, there was also idleconn failure to connect [15:42:03] Sep 12 15:28:54 lvs1015 pybal[12097]: [sessionstore_8081 IdleConnection] WARN: kubernetes1002.eqiad.wmnet (enabled/down/not pooled): Connection to [15:42:06] 10.64.16.75:8081 failed. [15:43:33] there was another spate of these back around :12/:13 near the re1 shutdown, just didn't escalate to icinga alert here, but it's in pybal logs [15:43:45] (03CR) 10Jforrester: "I think the broken JSDuck documentation was a reasonable cost given the plan for the original patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/535987 (https://phabricator.wikimedia.org/T213223) (owner: 10Rxy) [15:43:48] weird [15:44:03] re1 is backup so the shutdown should be 100% transparent [15:44:09] it doesn't do routing or anything [15:44:39] about to switchover master back to cr0, that's the bumpy part as it reboots linecards [15:44:43] actually, they seem to fail pretty routinely now that I look further back in logs [15:45:15] !log failover master RE from RE1 to RE0 on cr1-eqiad - T226424 [15:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:18] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [15:45:46] although the exact "30.001s" timeout is a little new this morning [15:46:02] but there were some like that in the various failure noise of it yesterday too [15:46:08] (03PS3) 10Herron: prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) [15:46:42] waiting for the linecards to come back up [15:46:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10media-storage, 10User-fgiunchedi: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 (10fgiunchedi) [15:46:54] then next step is a full reboot of cr1-eqiad [15:47:18] (03CR) 10jerkins-bot: [V: 04-1] prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [15:47:58] surprizingly the linecards are slower to boot than the routing engines [15:47:59] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) @wiki_willy not really but I reseated it anyway. As far as I can tell in bios everything looks normal. I did swap the 2 disks. @gehel try again please. 
[15:49:19] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 56, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:49:31] ah, forgot to downtime that one [15:50:09] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:18] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) I did notice that ssds are different types The new ssd is a DC3320 series The old ssd is a DC3610 series [15:50:37] (03PS4) 10Herron: prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) [15:50:59] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:51:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:52:19] downtimed cr2-eqdfw [15:52:25] still waiting for the linecards [15:52:33] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) timed out before a response was received: /sessions/v1/{key} (Store value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [15:53:24] bblack: the same 3 alerts ^ [15:53:36] (03CR) 10Herron: "PCC looks good as well https://puppet-compiler.wmflabs.org/compiler1002/18269/icinga1001.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536216 
(https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [15:53:45] (03PS5) 10Herron: prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) [15:54:45] yeah clearly something's not right with sessionstore during cr1 maint, but what? [15:54:48] and why? [15:55:04] do the k8s hosts have a weird subnet or router setup? [15:55:45] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) timed out before a response was received: /sessions/v1/{key} (Store value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [15:56:09] (03CR) 10Herron: [C: 03+2] prometheus: switch to per-site aggregate ipsec checks [puppet] - 10https://gerrit.wikimedia.org/r/536216 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [15:57:15] PROBLEM - LVS HTTP IPv4 #page on sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:57:24] ah yeah, they peer with the routers [15:57:38] but at this point it should have failed over cr2 [15:57:55] and the linecards are still not coming back online [15:58:03] are we even sure that works? have we ever failed them over to cr2? 
[15:58:40] bblack: yes, they are routers [15:58:41] RECOVERY - LVS HTTP IPv4 #page on sessionstore.svc.eqiad.wmnet is OK: HTTP OK: Status line output matched 200 - 258 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:58:53] and are peering with both crs [15:59:31] hm, ContentTranslation might not fail, actually [15:59:35] sorry, wrong window [15:59:36] akosiaris: no I mean, have we ever tested the 1-router scenario [15:59:40] akosiaris: cr2 is missing kubernetes1005/1006 [15:59:49] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200): /sessions/v1/{key} (Store value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [16:00:00] bblack: I think we have in codfw [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1600). [16:00:04] subbu: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:07] on the bgp config [16:00:07] but it looks like XioNoX found the problem [16:00:19] <_joe_> subbu: I'm in a meeting sorry :/ [16:00:19] and it also explains perfectly while only sessionstore alerted [16:00:21] are we doing swat or delaying it (given issues)? [16:00:23] yeah but we lost all, not just 5/6 [16:00:35] I so no other alerts [16:00:38] saw* [16:00:44] akosiaris: added the 2 [16:00:48] _joe_: I'll take it, though I'm not sure we should now heh [16:00:51] XioNoX: thanks! 
that should fix it [16:00:58] sessionstore is just on 5 and 6 btw [16:01:01] <_joe_> godog: that too yeah [16:01:01] let's hold off for a few minutes on the swat [16:01:08] those nodes are VMs and are dedicated to just those [16:01:14] to just sessionstore* [16:01:19] sorry in a meeting at the same time [16:01:36] it's not critical btw, sessionstore is currently unused [16:01:41] _joe_, godog no worries .. there isn't any great rush to get them out now. [16:01:53] and it's great luck we found out the misconfiguration today [16:01:56] RECOVERY - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [16:01:56] !log force offline/online of FPC3 on cr1-eqiad [16:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] 08̶W̶a̶r̶n̶i̶n̶g [16:03:15] from the pybal perspective, we had pybal->k8s100x failures for all of them, not just 5/6 [16:03:24] but maybe there's some indirect fallout once two are missing? [16:03:54] none of the linecards on cr1 are coming online, I'm going to reboot the whole box [16:04:12] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:04:18] !log reboot cr1-eqiad - T226424 [16:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:21] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [16:05:47] bblack: lvs1016 ? [16:05:58] I 'd like to take a look in the logs [16:06:04] I was looking at 1015 [16:06:12] but yeah 1015 + 1016 should be similar, both hit it [16:06:28] grep kubernetes /var/log/pybal.log|grep -i fail|less [16:06:34] (03PS3) 10Lucas Werkmeister (WMDE): dologmsg: fix variable [puppet] - 10https://gerrit.wikimedia.org/r/511750 [16:07:35] subbu: ack! 
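[editor's note] The grep pipeline quoted above (`grep kubernetes /var/log/pybal.log|grep -i fail`) can be approximated in Python to tally failures per backend and per monitor. This is only a sketch: the line format is inferred from the two pybal lines pasted earlier in this log, and the regex, the function name `failures_by_host` and the `SAMPLE` list are illustrative assumptions, not part of PyBal.

```python
import re
from collections import Counter

# Assumed format, inferred from the two pybal lines quoted in the log:
#   "... pybal[PID]: [<service> <monitor>] WARN: <host> (<state>): <message>"
LINE_RE = re.compile(
    r"\[(?P<service>\S+) (?P<monitor>\w+)\] WARN: "
    r"(?P<host>\S+) \((?P<state>[^)]*)\): (?P<message>.*)"
)


def failures_by_host(lines):
    """Count failure lines per (host, monitor), mirroring the two greps."""
    counts = Counter()
    for line in lines:
        # Same filter as: grep kubernetes | grep -i fail
        if "kubernetes" not in line or "fail" not in line.lower():
            continue
        match = LINE_RE.search(line)
        if match:
            counts[(match["host"], match["monitor"])] += 1
    return counts


# The two lines pasted in the channel (rejoined), plus one unrelated line.
SAMPLE = [
    "Sep 12 15:29:31 lvs1015 pybal[12097]: [sessionstore_8081 ProxyFetch] "
    "WARN: kubernetes1001.eqiad.wmnet (enabled/down/not pooled): "
    "Fetch failed (https://localhost/healthz), 30.001 s",
    "Sep 12 15:28:54 lvs1015 pybal[12097]: [sessionstore_8081 IdleConnection] "
    "WARN: kubernetes1002.eqiad.wmnet (enabled/down/not pooled): "
    "Connection to 10.64.16.75:8081 failed.",
    "Sep 12 15:29:00 lvs1015 pybal[12097]: unrelated line",
]

if __name__ == "__main__":
    for (host, monitor), n in sorted(failures_by_host(SAMPLE).items()):
        print(host, monitor, n)
```

Grouping by monitor as well as host keeps the healthz timeouts (ProxyFetch) apart from the plain connection failures (IdleConnection), which is the distinction drawn in the channel.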
[16:08:46] RECOVERY - cassandra-a service on restbase1018 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:09:12] RECOVERY - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-a valid until 2020-06-24 13:01:20 +0000 (expires in 285 days) https://phabricator.wikimedia.org/T120662 [16:09:48] !log bootstrapping Cassandra, restbase1018-a -- T224553 [16:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:51] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [16:10:17] cr1 is back online, still waiting for the linecards [16:10:55] they are all indeed just sessionstore, despite all the other services [16:10:58] RECOVERY - SSH wtp1031.mgmt on wtp1031.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:11:21] bblack: I think it's exactly that. cr1 rebooted, 5 and 6 did not advertise their pod ips to cr2 because of the misconfiguration [16:11:36] and then services failed and thus the pybal alerts and pages [16:12:05] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10Nuria) Approved on my end as well. [16:12:30] XioNoX: more nodes are going to be added this year btw, we should automate this process. 
It's a shame I end up making mistakes like this [16:12:45] thanks for catching it so quickly btw [16:13:15] akosiaris: homer will solve all those issues :) [16:13:18] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Resolve local commits on cloud-puppetmaster-01.cloudinfra.eqiad.wmflabs and cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs - https://phabricator.wikimedia.org/T232428 (10Andrew) Here is what happens without those three reverts: {P9095} [16:13:23] can't wait for it ;-) [16:13:36] ah, linecards are coming back from the deads [16:14:12] (03PS1) 10CRusnov: profile::puppetdb: Make microservice use actual cert [puppet] - 10https://gerrit.wikimedia.org/r/536226 [16:14:27] seems like cr1-eqiad is back to normal [16:14:28] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:14:33] see ^ [16:15:59] !log activate transit4/6 on cr1-eqiad - T226424 [16:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:02] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [16:16:30] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks, LGTM. 
I can't find any traces how /etc/nginx/ssl/cert.pem|server.key were created via puppet on puppetdb1001/2001 and base::expose" [puppet] - 10https://gerrit.wikimedia.org/r/536226 (owner: 10CRusnov) [16:16:35] !log activate CF tunnel on cr1-eqiad - T226424 [16:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:40] (03CR) 10CRusnov: [C: 03+2] profile::puppetdb: Make microservice use actual cert [puppet] - 10https://gerrit.wikimedia.org/r/536226 (owner: 10CRusnov) [16:17:40] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:19:24] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:19:52] !log rollback force VRRP backup on cr1-eqiad - T226424 [16:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:31] bblack: we can fail the LVS over [16:21:44] done with settling in, things are stable? [16:21:57] (also, might want to check k8s on cr1 in case more missing!) [16:23:21] bblack: yeah everything is stable, I re-enabled everything [16:23:37] both routers have the same k8s peers now [16:24:08] I added the LVS peers IPs on cr1 eqiad, so as soon as you're done on your side they should come up [16:24:42] maybe do it in 2 phases? 1/ move the cr2 LVS back to cr2, then move the cr1 LVS over cr2 ? [16:25:09] er, 1/ move the cr1 LVS back to cr1, then move the cr2 LVS over cr1 ? 
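[editor's note] The misconfiguration discussed above (cr2 missing the BGP peers for kubernetes1005/1006) is the kind of drift a simple set comparison can flag. The sketch below assumes the peer lists have already been extracted from each router by some other means; the function name `missing_peers` and the example data are hypothetical and do not reflect the actual router configuration or any existing tool.

```python
def missing_peers(peers_by_router):
    """Return, per router, the peers configured elsewhere but absent here."""
    # Union of every peer seen on any router.
    all_peers = set().union(*peers_by_router.values())
    report = {}
    for router, peers in peers_by_router.items():
        absent = all_peers - set(peers)
        if absent:
            report[router] = sorted(absent)
    return report


if __name__ == "__main__":
    # Illustrative data only, shaped after the situation described in the log.
    example = {
        "cr1-eqiad": [f"kubernetes100{i}" for i in range(1, 7)],
        "cr2-eqiad": [f"kubernetes100{i}" for i in range(1, 5)],
    }
    print(missing_peers(example))
```

An empty result means every router carries the same peer set, which is the state reported at 16:23:37.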
[16:26:50] fixing the -1 on the patch, I think [16:27:24] (03PS2) 10BBlack: Temporarily connect all eqiad pybal to cr1 [puppet] - 10https://gerrit.wikimedia.org/r/536211 (https://phabricator.wikimedia.org/T226424) [16:27:26] Ihave to step through all of them 1 by 1 anyways [16:27:35] I usually start with 16 for all things, then the other 3 [16:27:50] ok [16:28:03] so I guess I'll go in reverse numerical order [16:28:03] I added the two missing v6 sessions for k8s too [16:28:07] _joe_, godog but is today a possibility for those patches? [16:28:47] <_joe_> subbu: it's a bit late for me, but if you don't need them to be merged in sync with you, I can do it tomorrow morning [16:28:53] <_joe_> or maybe mutante could help [16:29:04] ya, i don't need to be around for them. [16:29:09] jerkins... [16:30:17] (03CR) 10BBlack: [C: 03+2] Temporarily connect all eqiad pybal to cr1 [puppet] - 10https://gerrit.wikimedia.org/r/536211 (https://phabricator.wikimedia.org/T226424) (owner: 10BBlack) [16:30:26] but, it would be helpful to have them out latest tomorrow so we have error logs we can use to fix bugs. [16:31:06] subbu: a bit late for me as well, I'll take a look tomorrow with _joe_ tho [16:31:17] k [16:33:10] 04̶C̶r̶i̶t̶i̶c̶a̶l [16:34:30] !log lvs1016: restart pybal to move bgp session to cr1 - T226424 [16:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:35] that's a regression from yesterday's librenms upgrade [16:34:36] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [16:35:17] XioNoX: 1016 sessions look ok? 
[16:35:24] bblack: I see 1016 up with 44/6 prefixes [16:35:28] ok [16:35:42] !log lvs1015: restart pybal to move bgp session to cr1 - T226424 [16:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:47] (on cr1) [16:36:19] !log lvs1014: restart pybal to move bgp session to cr1 - T226424 [16:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:37] !log lvs1013: restart pybal to move bgp session to cr1 - T226424 [16:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:11] if all 4 look good on your end (all on cr1), good to go! [16:37:37] all 4 are established on cr1 [16:38:44] alright, time to do the failovers [16:38:46] thanks [16:40:59] confirmed that CF went back to the cr1 tunnel [16:41:07] (03PS1) 10Krinkle: tests: Fix "Deprecated: The each() function is deprecated" from timelineTest.php:32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536232 (https://phabricator.wikimedia.org/T232691) [16:42:13] !log switch VRRP master to cr2-eqiad - T226424 [16:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:15] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [16:42:33] !log er, switch VRRP master to cr1-eqiad - T226424 [16:42:33] (03PS1) 10Muehlenhoff: Adapt path for puppet cert used by nginx/puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/536233 [16:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:46:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:46:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:46:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:46:54] bblack: ^ is that due to the vrrp failover? [16:47:13] !log installing NSS security updates on buster [16:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:21] (03CR) 10Elukey: ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [16:47:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:48:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:48:37] XioNoX: presumably... 
[16:49:02] !log Deactivate IX/transit/private-peer v4/v6 BGP on cr2-eqiad - T226424 [16:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:05] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [16:49:22] (03PS4) 10Elukey: profile::kerberos::kdc: add debconf settings [puppet] - 10https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) [16:50:00] all the sites had a blip there of 5xx [16:50:25] ~:45 -> ~:47 [16:50:41] bblack: if a TCP session gets interrupted between the caches do they retry? Fail with a 500? [16:51:32] from this pov, there's a single retry [16:51:53] the backend->backend part will fail with a 500, the frontend will restart the whole transaction once, over a new backend->backend conn [16:52:00] but there were 5xx in eqiad too, so it's not all transport-related [16:52:24] more likely there was a hiccup with traffic to some backend service inside eqiad [16:53:02] but that availability check is kinda trigger-happy too [16:53:14] vrrp failover should be sub-second brief, it didn't cause any issues in codfw [16:53:26] (when we did it in codfw) [16:53:30] it was a ~2-3 minute sustained chunk of 5xx, but we're talking a net global rate of ~30/sec 5xx's, out of ~150K/sec [16:53:38] but there there are less backend services there [16:53:51] which is lik 0.02% [16:54:31] we don't need to do another failover so it shouldn't happen again during this maintenance [16:54:38] next step OSPF metrics [16:57:05] !log installing libxslt security updates on buster [16:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:58] (03PS1) 10Muehlenhoff: Add library hint for libxslt [puppet] - 10https://gerrit.wikimedia.org/r/536236 [17:00:04] cscott, arlolra, subbu, halfak, and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1700). [17:00:08] !log +1000 metric to all transport to/from cr2-eqiad - T226424 [17:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:11] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [17:00:16] no parsoid deploy today [17:00:20] ORES is going out. [17:00:27] ^deploy [17:01:27] cr2 is fully drained [17:02:26] !log installing unzip security updates on buster [17:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:11] !log power off re1.cr2-eqiad - T226424 [17:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:50] !log halfak@deploy1001 Started deploy [ores/deploy@7d45b80]: T232660 [17:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:53] here we go! [17:05:53] T232660: ORES deploy early Sept. 2019 - https://phabricator.wikimedia.org/T232660 [17:12:04] (03PS1) 10Hashar: Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) [17:13:08] (03CR) 10Hashar: "And I just found out the mediawiki/core patch I wrote does not log anything when it will take the core dump. 
But we should be able to find" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [17:14:24] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10fdans) [17:15:29] (03PS2) 10Dzahn: acme_chief: add gerrit1001 as authorized host for gerrit certs [puppet] - 10https://gerrit.wikimedia.org/r/535962 (https://phabricator.wikimedia.org/T222391) [17:15:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10fdans) 05Open→03Resolved [17:16:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10fdans) 05Resolved→03Open [17:17:07] (03CR) 10jerkins-bot: [V: 04-1] Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [17:18:24] (03CR) 10Dzahn: [C: 03+2] acme_chief: add gerrit1001 as authorized host for gerrit certs [puppet] - 10https://gerrit.wikimedia.org/r/535962 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [17:19:01] (03CR) 10Jforrester: "Before and after this patch these tests are skipped?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536232 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [17:19:29] !log halfak@deploy1001 Finished deploy [ores/deploy@7d45b80]: T232660 (duration: 13m 41s) [17:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:32] T232660: ORES deploy early Sept. 2019 - https://phabricator.wikimedia.org/T232660 [17:22:21] (03CR) 10Thcipriani: [C: 03+1] "Reasoning sounds fine to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 (owner: 10Alexandros Kosiaris) [17:23:10] 04Critical [17:23:20] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:24:52] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:59] !log failover cr2-eqiad master RE from RE0 to RE1 - T226424 [17:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:02] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [17:25:49] (03PS3) 10Dzahn: gerrit: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) [17:26:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:34] waiting for linecards to boot up [17:26:44] (03CR) 10Krinkle: "Ugh, is it not checking out submodules?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536232 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [17:27:18] Krinkle: Oh, wow, yes, mw-config has submodules. [17:27:28] James_F: it needs those for the tests. [17:27:28] Krinkle: Yeah, no submodule magic for the CI job. [17:27:33] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Ottomata) 05Open→03Resolved Closing this task. {T222941} is still an issue, but for now we will ensure that we don't accidentally upgrade to 1.4.6 o... 
[17:27:38] jouncebot: now [17:27:38] For the next 0 hour(s) and 32 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1700) [17:27:42] I tested it locally and found the warning that way [17:27:44] Oh well [17:28:03] (03PS1) 10Ottomata: Ensure python-kafka at 1.4.1 on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) [17:28:09] (03CR) 10Jforrester: [C: 03+2] "Fixed locally, not seen unless you have the submodule checked out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536232 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [17:28:28] I'm fiddling in prod. [17:28:41] (03CR) 10jerkins-bot: [V: 04-1] Ensure python-kafka at 1.4.1 on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) (owner: 10Ottomata) [17:29:41] (03CR) 10Dzahn: [C: 03+2] gerrit: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [17:30:21] (03Merged) 10jenkins-bot: tests: Fix "Deprecated: The each() function is deprecated" from timelineTest.php:32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536232 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [17:30:45] (03CR) 10jenkins-bot: tests: Fix "Deprecated: The each() function is deprecated" from timelineTest.php:32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536232 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [17:31:30] !log power off re0.cr2-eqiad - T226424 [17:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:35] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [17:31:53] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: T232613 
Add ability to core dump on empty string array key that should exist (wmf.22 only, flagged off) (duration: 01m 03s) [17:31:58] (03PS2) 10Ottomata: Ensure python-kafka at 1.4.1 on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) [17:32:14] (03PS2) 10Dzahn: ci: allow ssh from new gerrit server gerrit1001 in ferm [puppet] - 10https://gerrit.wikimedia.org/r/535965 (https://phabricator.wikimedia.org/T222391) [17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:19] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [17:33:31] (03CR) 10Brennen Bearnes: [C: 03+1] blubberoid: Remove monitoring support [deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 (owner: 10Alexandros Kosiaris) [17:33:54] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:35:15] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/18271/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) (owner: 10Ottomata) [17:35:18] (03CR) 10Ottomata: [C: 03+2] Ensure python-kafka at 1.4.1 on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) (owner: 10Ottomata) [17:35:20] (03PS2) 10Jforrester: Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [17:35:25] (03PS3) 10Ottomata: Ensure python-kafka at 1.4.1 on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) [17:35:35] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Ensure python-kafka at 1.4.1 on eventlogging server 
[puppet] - 10https://gerrit.wikimedia.org/r/536300 (https://phabricator.wikimedia.org/T222941) (owner: 10Ottomata) [17:36:20] (03PS3) 10Dzahn: ci: allow ssh from new gerrit server gerrit1001 in ferm [puppet] - 10https://gerrit.wikimedia.org/r/535965 (https://phabricator.wikimedia.org/T222391) [17:36:40] (03CR) 10Dzahn: [C: 03+2] ci: allow ssh from new gerrit server gerrit1001 in ferm [puppet] - 10https://gerrit.wikimedia.org/r/535965 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [17:40:30] !log failover cr2-eqiad master RE from RE1 to RE0 - T226424 [17:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:32] (03PS1) 10Vgutierrez: Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) [17:40:33] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [17:41:14] RECOVERY - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.98 port 9042 https://phabricator.wikimedia.org/T93886 [17:42:55] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [17:43:15] !log reboot cr2-eqiad - T226424 [17:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:12] (03CR) 10Daimona Eaytoy: [C: 03+1] Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [17:44:19] (03PS2) 10Vgutierrez: Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) [17:46:18] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [17:46:45] !log bootstrapping 
Cassandra, restbase1018-b -- T224553 [17:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:48] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [17:47:00] RECOVERY - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-b valid until 2020-06-24 13:01:21 +0000 (expires in 285 days) https://phabricator.wikimedia.org/T120662 [17:47:48] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:49:27] (03CR) 10Krinkle: [C: 03+1] Revert "Direct Parsoid/PHP rt-testing log events to a different target" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536202 (owner: 10Subramanya Sastry) [17:50:07] (03PS1) 10BBlack: Revert "Temporarily connect all eqiad pybal to cr1" [puppet] - 10https://gerrit.wikimedia.org/r/536303 (https://phabricator.wikimedia.org/T226424) [17:50:09] (03PS1) 10BBlack: Revert "Temporarily connect all eqiad pybal to cr2" [puppet] - 10https://gerrit.wikimedia.org/r/536304 (https://phabricator.wikimedia.org/T226424) [17:52:28] (03CR) 10Jforrester: [C: 03+1] "This should be fine. Leaving to Antoine/Tim to deploy for hunting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [17:53:00] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) After debugging why the eviction is triggered: `#0 ssl_rm_cached_session (ctx=0x188db10, sess=0x2ad5e802a910) at SSLUtils.cc:304 #1 0x00002ad5d30030f9 in remove_session_lock (ctx=0x188db10,... 
[17:53:30] !log re-enabled external BGP on cr2-eqiad - T226424 [17:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:33] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [17:54:03] !log revert OSPF priority change on cr2-eqiad - T226424 [17:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:18] !log revert VRRP priority change cr2-eqiad - T226424 [17:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:06] bblack: alright, ready to get the LVS back to cr2 [17:59:33] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [17:59:34] done done? [17:59:39] ok [17:59:50] done done done [18:00:02] (03CR) 10BBlack: [C: 03+2] Revert "Temporarily connect all eqiad pybal to cr1" [puppet] - 10https://gerrit.wikimedia.org/r/536303 (https://phabricator.wikimedia.org/T226424) (owner: 10BBlack) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:17] (03CR) 10BBlack: [C: 03+2] Revert "Temporarily connect all eqiad pybal to cr2" [puppet] - 10https://gerrit.wikimedia.org/r/536304 (https://phabricator.wikimedia.org/T226424) (owner: 10BBlack) [18:02:30] Krinkle: https://phabricator.wikimedia.org/T232764#5488580 – seems like the submodule should be there? 
[18:03:53] !log lvs1014: restart pybal to return BGP session to cr2 - T226424 [18:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:56] T226424: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 [18:04:02] !log lvs1015: restart pybal to return BGP session to cr2 - T226424 [18:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:36] (03PS2) 10Jforrester: tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534546 [18:05:03] (03CR) 10Jeena Huneidi: [C: 03+1] blubberoid: Remove monitoring support [deployment-charts] - 10https://gerrit.wikimedia.org/r/536163 (owner: 10Alexandros Kosiaris) [18:05:11] James_F: uh, indeed. [18:05:26] James_F: I see now, also, that it is skipping from the data provider [18:05:31] which only looks at timeline.php [18:05:36] Oh. [18:05:40] the inspection of fonts/ directory is in the test itself [18:05:45] My bad :) [18:05:55] bblack: I see them both on cr2 [18:06:01] So I guess the magic php parsing there isn't working quite right [18:06:14] removing the config from cr1 [18:06:20] ok [18:06:23] I'm shocked. Magical PHP crap is so reliable normally. 
;-) [18:07:40] (03PS1) 10Krinkle: tests: Use fill $token array in timelineTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 [18:08:34] (03PS4) 10Dzahn: ci: allow ssh from new gerrit server gerrit1001 in ferm [puppet] - 10https://gerrit.wikimedia.org/r/535965 (https://phabricator.wikimedia.org/T222391) [18:08:49] (03CR) 10jerkins-bot: [V: 04-1] tests: Use fill $token array in timelineTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:08:55] (03PS2) 10Krinkle: tests: Use full $token array in timelineTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 [18:08:57] (03CR) 10jerkins-bot: [V: 04-1] tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534546 (owner: 10Jforrester) [18:08:59] (03CR) 10Krinkle: "The data provider specified for TimelineTest::testTimelineFontFileDoesNotHaveTtfSuffix is invalid." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:09:43] (03PS3) 10Vgutierrez: Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) [18:11:21] (03CR) 10jerkins-bot: [V: 04-1] tests: Use full $token array in timelineTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:11:32] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:12:29] bblack: so the feature is there https://phabricator.wikimedia.org/T180069? [18:12:39] but not in the prod version of pybal? [18:13:10] 04̶C̶r̶i̶t̶i̶c̶a̶l [18:13:42] XioNoX: right [18:13:46] it's complicated! :) [18:14:34] I'm sure it is [18:14:58] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10ayounsi) 05Open→03Resolved Alright everything here is done. 
And was quite smooth. Some notes: * k8s1005 and k8s1006 only had v4/v6 sessions to cr1 and not cr2, which... [18:15:01] 10Operations, 10Traffic, 10netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) [18:15:03] (03PS2) 10Dzahn: gerrit: add gerrit1001 to SSH known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/535971 (https://phabricator.wikimedia.org/T222391) [18:15:12] just wanted a task to link in my summary https://phabricator.wikimedia.org/T226424#5488615 [18:16:52] hmmm or maybe I'm wrong [18:17:01] I'm looking at the 1.15-stretch branch and the feature seems to be there [18:17:07] anyways, can take to traffic chan :) [18:18:03] (03CR) 10Dzahn: [C: 03+2] gerrit: add gerrit1001 to SSH known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/535971 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:19:57] mutante \o/ [18:21:29] paladox: fyi we were able to fix the ores' issue switching to ssh :) [18:21:33] !log lvs1016: restart pybal to test dual bgp peering [18:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:36] great! 
[18:25:12] (03PS2) 10Muehlenhoff: Add library hint for libxslt [puppet] - 10https://gerrit.wikimedia.org/r/536236 [18:26:19] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [18:28:46] !log lvs1016: restart pybal to revert test [18:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:28] (03PS3) 10Krinkle: tests: Fix broken timelineTest.php provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 [18:29:30] James_F: ^ [18:29:57] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libxslt [puppet] - 10https://gerrit.wikimedia.org/r/536236 (owner: 10Muehlenhoff) [18:30:14] (03CR) 10Krinkle: tests: Fix broken timelineTest.php provider (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:32:34] !log installing gdb updates from buster 10.1 point release [18:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:23] 10Operations, 10Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) It looks like the culprit is https://github.com/apache/trafficserver/commit/03734d05e28af8a7b105a0579056c913fb5d1bc5, I've tested ` https://gerrit.wikimedia.org/g/operations/debs/trafficserve... 
[18:37:49] (03CR) 10Jforrester: tests: Fix broken timelineTest.php provider (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:44:17] (03CR) 10Krinkle: tests: Fix broken timelineTest.php provider (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:45:18] (03PS1) 10Herron: exim: add pr.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/536313 (https://phabricator.wikimedia.org/T231387) [18:53:16] jouncebot: now [18:53:16] For the next 0 hour(s) and 6 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T1800) [18:53:20] well [18:53:52] hashar: UBNs take priority. [18:53:59] Also SWAT is empty. [18:54:30] (03CR) 10Jforrester: [C: 03+2] tests: Fix broken timelineTest.php provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:55:53] (03Merged) 10jenkins-bot: tests: Fix broken timelineTest.php provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:55:55] (03CR) 10Hashar: [C: 03+2] Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [18:56:37] (03CR) 10jenkins-bot: tests: Fix broken timelineTest.php provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536306 (owner: 10Krinkle) [18:57:28] (03Merged) 10jenkins-bot: Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 (https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [18:57:47] (03PS3) 10Jforrester: tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534546 [18:58:40] (03CR) 10jenkins-bot: Enable coredump on some mysterious php7.2 failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536243 
(https://phabricator.wikimedia.org/T232613) (owner: 10Hashar) [18:58:54] I bet that the code deployment would cause the specific bug to no more occur [18:59:23] !log hashar@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable coredump on some mysterious php7.2 failure - T232613 (duration: 01m 04s) [18:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:26] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [18:59:57] Krinkle: I was thinking about re-doing the Cirrus tests to only test on wmfVariantSettings() and ignore the Beta Cluster config, scrapping all the crap. Sound sane? [19:00:52] (03CR) 10jerkins-bot: [V: 04-1] tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534546 (owner: 10Jforrester) [19:03:49] James_F: I think it'd be nice if we validate that as well, especially since it's so easy to forget about Beta. But we can/should definitely bypass the wgConf complexity there. We can `$config = require 'InitialiseSettings.php'` for prod and then for it'd just be one array_merge() away from its own include. [19:04:05] note that cirrusTest.php doesn't seem to cover beta currently. [19:04:14] It's failing at covering prod config [19:05:04] There's a test for 'labs', isn't there? 
[19:05:14] E_TOOMANYCONFIGS [19:06:02] (03PS1) 10Herron: dns: add mail records for pr.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/536318 (https://phabricator.wikimedia.org/T231387) [19:07:21] (03PS1) 10Krinkle: beta: Fix mismatching indentation in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536319 [19:08:14] RECOVERY - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.99 port 9042 https://phabricator.wikimedia.org/T93886 [19:11:13] 10Operations, 10netops, 10observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (10ayounsi) LGTM! [19:11:50] (03CR) 10Krinkle: [C: 03+2] beta: Fix mismatching indentation in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536319 (owner: 10Krinkle) [19:14:09] Krinkle: i made performance.discovery.wmnet. it points to webperf1001.eqiad.wmnet as that is the currently active backend in ATS. next i would like to switch the ATS config to use https://performance.discovery.wmnet so TLS between caching and backend. [19:14:10] (03PS2) 10Ayounsi: LibreNMS: fix new deployments permissions errors [puppet] - 10https://gerrit.wikimedia.org/r/535945 [19:15:08] (03PS1) 10Krinkle: beta: Remove unused wgConf feature of '+beta' tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) [19:15:29] the Varnish config (where it was active/active) is not used afaict. replaced by ATS already [19:15:33] (03PS2) 10Krinkle: beta: Remove unused wgConf feature of '+beta' tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) [19:16:03] (03CR) 10Jforrester: [C: 03+1] beta: Remove unused wgConf feature of '+beta' tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:16:09] mutante: I see. 
What is the path towards being active-active again? [19:16:40] (03CR) 10Krinkle: [C: 03+2] beta: Remove unused wgConf feature of '+beta' tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:16:45] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/535945 (owner: 10Ayounsi) [19:17:18] (03CR) 10Ayounsi: LibreNMS: fix new deployments permissions errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535945 (owner: 10Ayounsi) [19:17:41] (03CR) 10Ayounsi: [C: 03+2] LibreNMS: fix new deployments permissions errors [puppet] - 10https://gerrit.wikimedia.org/r/535945 (owner: 10Ayounsi) [19:18:31] (03Merged) 10jenkins-bot: beta: Fix mismatching indentation in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536319 (owner: 10Krinkle) [19:18:48] (03CR) 10jenkins-bot: beta: Fix mismatching indentation in wmfLabsSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536319 (owner: 10Krinkle) [19:19:01] Krinkle: i am not sure. would have to ask Ema but he is on vacation. it's an existing thing since the ATS switch though and separate from the protocol change. per https://gerrit.wikimedia.org/r/c/operations/puppet/+/535929/1/hieradata/common/profile/trafficserver/backend.yaml [19:19:16] (03Merged) 10jenkins-bot: beta: Remove unused wgConf feature of '+beta' tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:20:24] 10Operations, 10Traffic, 10Patch-For-Review: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10BBlack) What's missing here is turning on BGP peering with all local routers, which is available in our current 1.15 pybal releases. Will fix that up here and then resolv... 
[19:20:54] (03CR) 10jenkins-bot: beta: Remove unused wgConf feature of '+beta' tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:21:40] (03CR) 10Jforrester: beta: Remove unused wgConf feature of '+beta' tag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:23:55] (03PS1) 10Krinkle: beta: Move the only dynamic config from IS-labs to CS-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536323 (https://phabricator.wikimedia.org/T223602) [19:25:01] (03CR) 10Krinkle: [C: 03+2] beta: Remove unused wgConf feature of '+beta' tag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536321 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:25:30] 10Operations, 10Mail, 10WMF-Communications, 10Patch-For-Review: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10herron) @mark you bet! I've uploaded a few patches to get mail flowing for this subdomain. @varnent were there any SPF records provided by muckrack/ses? Also could you plea... [19:27:57] 10Operations, 10Traffic, 10Patch-For-Review: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10BBlack) T180069 - Ticket from the feature add for pybal itself [19:28:42] Krinkle: TLS termination would be handled by envoy which is already installed and running on 443 on webperf1001. can i switch it? already done for a bunch of other misc services without issue and separate from the number of backends [19:29:25] mutante: yes, that's fine. I'm just asking when we get active-active back as until recently before ATS. 
[19:29:51] I didn't realise it had stopped (or I forgot) [19:30:12] (03PS1) 10BBlack: codfw backup LVS: BGP sessions with both routers [puppet] - 10https://gerrit.wikimedia.org/r/536324 (https://phabricator.wikimedia.org/T165765) [19:30:33] Krinkle: yea, i understand. will let you know once i have a better answer for that. i noticed that for planet too but this is only the second service after that that was really active/active [19:32:54] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:34:08] RECOVERY - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-c valid until 2020-06-24 13:01:22 +0000 (expires in 285 days) https://phabricator.wikimedia.org/T120662 [19:34:13] !log bootstrapping Cassandra, restbase1018-c -- T224553 [19:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:16] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [19:36:27] [{exception_id}] {exception_url} ErrorException from line 304 of /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: PHP Warning: Use of undefined constant SIGABRT - assumed 'SIGABRT' (this will throw an Error in a future [19:36:28] I aM DOOMED [19:36:29] really [19:36:30] :D [19:37:55] (03PS5) 10Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 [19:38:37] (03CR) 10Jforrester: [C: 03+2] beta: Move the only dynamic config from IS-labs to CS-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536323 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:40:16] ErrorException from line 304 of /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: PHP Warning: Use of undefined constant SIGABRT - assumed 'SIGABRT' (this will throw an Error in a future [19:40:43] hmm 
[19:41:40] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [19:42:41] (03Merged) 10jenkins-bot: beta: Move the only dynamic config from IS-labs to CS-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536323 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:43:35] (03CR) 10jenkins-bot: beta: Move the only dynamic config from IS-labs to CS-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536323 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [19:43:37] (03PS1) 10Krinkle: beta: Remove redundant 'flow_only_labs.dblist' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536328 [19:44:52] !log installing 4.19.67 kernel from 10.1 point release on Buster systems [19:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:52] (03PS1) 10Jforrester: tests/WgConfTestCase: HACK: Downgrade throw to echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) [19:51:04] !log installing systemd bugfix update from Buster 10.1 point release [19:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:23] (03CR) 10Jforrester: "I mean, it works…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [19:51:39] (03PS6) 10Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 [19:52:51] hashar: Anything I can do to help? 
[19:53:04] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [19:57:04] (03CR) 10Jforrester: [C: 03+1] beta: Remove redundant 'flow_only_labs.dblist' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536328 (owner: 10Krinkle) [20:01:01] (03CR) 10Krinkle: [C: 03+2] beta: Remove redundant 'flow_only_labs.dblist' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536328 (owner: 10Krinkle) [20:01:16] James_F: I'll do a sync if calendar is free to prodify those beta no-ops [20:01:21] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [20:01:34] calendar looks clear [20:01:36] Krinkle: Do it. [20:01:44] Just Antoine trying to rescue the train. [20:02:04] (I've been syncing the beta patches as I've merged them to avoid confusion.) [20:02:18] s/syncing/pulling onto the deployment server/ [20:02:24] k [20:02:28] * Krinkle too [20:02:38] (03Merged) 10jenkins-bot: beta: Remove redundant 'flow_only_labs.dblist' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536328 (owner: 10Krinkle) [20:02:54] (03CR) 10jenkins-bot: beta: Remove redundant 'flow_only_labs.dblist' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536328 (owner: 10Krinkle) [20:02:55] !log installing firmware-nonfree update from Buster 10.1 point release [20:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:11] (03PS1) 10Krinkle: beta: Remove use of SiteConfiguration->siteParamsCallback feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 [20:03:17] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: beta-only (duration: 01m 04s) [20:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:13] (03CR) 10Krinkle: beta: Remove use of SiteConfiguration->siteParamsCallback feature (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:04:26] James_F: thoughts on this one? ^ [20:05:38] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [20:05:45] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta-only (duration: 01m 02s) [20:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:29] Krinkle: Alternatively, we could scrap the idea of labs-only dblists? [20:08:21] The complexity is very high for the value of a single, 14 line list. [20:09:02] (03PS1) 10Krinkle: dblists: Restore flow_only_labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536337 [20:09:09] (03CR) 10Krinkle: [C: 03+2] dblists: Restore flow_only_labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536337 (owner: 10Krinkle) [20:09:19] James_F: yeah, agreed. [20:09:27] (03CR) 10Jforrester: beta: Remove use of SiteConfiguration->siteParamsCallback feature (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:09:43] But yes, getting rid of uses of siteParamsCallback seems like a good idea. [20:10:37] AFAICT it's only ever used for Labs and for the SplitPrivateWiki extension? 
[20:11:00] (03Merged) 10jenkins-bot: dblists: Restore flow_only_labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536337 (owner: 10Krinkle) [20:11:04] (03CR) 10jenkins-bot: dblists: Restore flow_only_labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536337 (owner: 10Krinkle) [20:11:15] (03CR) 10Krinkle: beta: Remove use of SiteConfiguration->siteParamsCallback feature (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:11:54] I don't know what SplitPrivateWiki is, but yeah, codesearch shows it only used in IS-labs.php [20:12:04] oh I see [20:12:07] right, I was on /deployed/ [20:12:39] yeah, maybe deprecate and recommend if a callback is desired, for the caller to call the relevant function directly from their own code before calling getAll() [20:13:47] Sure. [20:14:20] (03CR) 10Jforrester: beta: Remove use of SiteConfiguration->siteParamsCallback feature (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:14:45] I'm sure siteParamsCallback() was clever but it's too much magic. [20:15:40] (03CR) 10Jforrester: [C: 03+1] beta: Remove use of SiteConfiguration->siteParamsCallback feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:18:59] James_F: Took a look at SplitPrivateWiki, it's not trivial to migrate their use case. 
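The suggestion above (deprecate siteParamsCallback and have the caller compute the parameters itself before calling getAll()) can be sketched in a few lines. This is a toy model in Python, not MediaWiki's SiteConfiguration API: `get_all`, `site_params_for`, and the settings layout are hypothetical names chosen for illustration.

```python
from string import Template

def site_params_for(wiki):
    # Hypothetical helper: derive placeholder values from a wiki ID.
    # Assumes IDs of the form "<2-letter lang>wiki", purely for this sketch.
    return {"lang": wiki[:2], "site": "wikipedia"}

def get_all(settings, wiki, params):
    """Resolve settings for `wiki`, filling "$name" placeholders from `params`.

    No hidden callback is involved: the caller passes `params` explicitly.
    """
    resolved = {}
    for name, per_wiki in settings.items():
        value = per_wiki.get(wiki, per_wiki.get("default"))
        if isinstance(value, str):
            # safe_substitute leaves unknown placeholders untouched.
            value = Template(value).safe_substitute(params)
        resolved[name] = value
    return resolved

settings = {"wgServer": {"default": "https://$lang.$site.org"}}
# The caller runs the "callback" logic itself, then hands over the result.
config = get_all(settings, "enwiki", site_params_for("enwiki"))
```

The point of the design is that nothing happens implicitly inside `get_all`: whatever computes the parameters is visible at the call site.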
[20:19:08] fwiw [20:19:32] (03PS2) 10Krinkle: beta: Remove use of SiteConfiguration->siteParamsCallback feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 [20:19:45] (03CR) 10Krinkle: [C: 03+2] beta: Remove use of SiteConfiguration->siteParamsCallback feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:20:48] (03Merged) 10jenkins-bot: beta: Remove use of SiteConfiguration->siteParamsCallback feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:21:03] (03CR) 10jenkins-bot: beta: Remove use of SiteConfiguration->siteParamsCallback feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536335 (owner: 10Krinkle) [20:21:22] Krinkle: I have another patch for the php7 core dump :-\ https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/536338/1/includes/libs/rdbms/lbfactory/LBFactoryMulti.php [20:21:30] this time adding some logging statement I am not sure about [20:22:28] Krinkle: File a TODO task to investigate deprecation? [20:23:20] (03PS1) 10Dzahn: DHCP: switch gerrit1001 to use buster [puppet] - 10https://gerrit.wikimedia.org/r/536341 [20:24:07] hashar: ok [20:24:20] and pcntl is not defined in php-fpm :) [20:24:56] * Krinkle staging on mwdebug1002 [20:25:31] hashar: wfDebugLog('ncod', …) would work as well [20:25:34] AdhocDebug [20:25:47] oh [20:25:51] yeah probably better [20:25:53] AdHocDebug * [20:26:05] but then the 'ncod' bucket would have to be defined in mediawiki-config? 
[20:27:15] hashar: that was a typo [20:27:16] 'AdHocDebug' [20:27:26] thx [20:27:46] this one is enabled in wmf-config for this purpose :) [20:28:09] !log krinkle@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: d495d5e24949 (duration: 01m 04s) [20:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:31] (03PS1) 10Dzahn: gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 [20:33:23] (03CR) 10jerkins-bot: [V: 04-1] gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:33:31] Krinkle: If you're content with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/536329/1 I can land a few things and do VariantConfig-labs etc.? [20:33:57] !log krinkle@deploy1001 Synchronized wmf-config/: d495d5e24949 (duration: 01m 03s) [20:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:08] * Krinkle done staging [20:34:25] got one last one to push, but won't roll out mine today [20:34:33] (03PS1) 10Krinkle: Remove dependency on wgConfig from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) [20:34:36] (03PS2) 10Dzahn: gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 [20:34:42] James_F: not sure about this one, but might help [20:34:49] I'll leave it at this for now [20:35:05] checking out that patch now [20:35:58] (03CR) 10Paladox: gerrit: switch JDK package to version 11 if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:36:00] (03CR) 10Paladox: [C: 03+1] gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:36:06] Krinkle: Rummana is having issues with Beta Cluster. Something that we broke? 
Some services seem flaky. [20:36:47] (03CR) 10Krinkle: "Unlike core, wmf-config has a very loose phpunit config." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [20:37:15] (03CR) 10Krinkle: "Basically the same as skipping the test I mean. Which is fine for the time being I guess." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [20:37:45] James_F: huh, that looks weird indeed. [20:37:49] https://en.wikipedia.%24variant.wmflabs.org/wiki/Main_Page [20:37:53] being redirected to that domain [20:38:08] Yeah, she spotted that too. [20:38:14] (03CR) 10Muehlenhoff: gerrit: switch JDK package to version 11 if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:38:21] It's working OK for me. [20:38:22] OK, looking into it now [20:38:39] Oh, no, not if I trigger Special:Random [20:38:58] yeah, canonical urls are fine, self-reference and redirect is broken [20:39:24] 'wikipedia' => 'https://$lang.wikipedia.$variant.wmflabs.org', [20:39:31] from wgServer [20:39:54] (03CR) 10Jforrester: [C: 03+1] Remove dependency on wgConfig from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [20:39:56] (03PS3) 10Dzahn: gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 [20:40:18] (03CR) 10Dzahn: gerrit: switch JDK package to version 11 if on buster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:40:52] Oh, $variant is meant to be 'm.' or ''? [20:41:04] found it [20:41:08] And it was silently inheriting? [20:41:11] no, it's meant to be 'beta' [20:41:11] No global. [20:41:30] Oh. Yes, but we only do beta? [20:41:38] Was this for the mythical beta vs. staging concept? 
[20:41:39] I didn't find it because it's not used like a real variable. It uses the magic substitution logic and I removed it by accident [20:41:40] will fix [20:41:44] Kk. [20:41:57] yeah, if we want to support that we should vary '.beta.wmflabs.org' as a whole probably which we already do in various places [20:43:31] (03PS1) 10Krinkle: beta: Unbreak config vars that used '$variant' placeholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536346 [20:44:07] James_F: "TODO" - validate that wgConfig doesn't contain placeholders that aren't declared :P [20:44:54] (03CR) 10Krinkle: [C: 03+2] beta: Unbreak config vars that used '$variant' placeholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536346 (owner: 10Krinkle) [20:45:41] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/536338/2/includes/libs/rdbms/lbfactory/LBFactoryMulti.php [20:45:48] should be better now :] [20:45:53] (03Merged) 10jenkins-bot: beta: Unbreak config vars that used '$variant' placeholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536346 (owner: 10Krinkle) [20:46:27] (03CR) 10jenkins-bot: beta: Unbreak config vars that used '$variant' placeholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536346 (owner: 10Krinkle) [20:48:24] RECOVERY - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.100 port 9042 https://phabricator.wikimedia.org/T93886 [20:50:17] (03CR) 10Jforrester: [C: 03+2] tests/WgConfTestCase: HACK: Downgrade throw to echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [20:50:34] OK, I'll drop the HHVM test (but keep the HHVM lint) for the repo. 
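The "TODO" above (validate that the config contains no placeholders that aren't declared) could look roughly like the following. It is a minimal sketch in Python rather than the repository's PHP, and it assumes placeholders are written as "$name" inside string values; the function names and the nested-dict layout are illustrative, not taken from wmf-config.

```python
import re

# Matches "$name" style placeholders such as "$lang" or "$variant".
PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")

def iter_strings(value):
    """Yield every string value found in a nested dict/list structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)

def find_undeclared_placeholders(settings, declared):
    """Return (setting, placeholder) pairs whose placeholder is not declared."""
    problems = []
    for setting, value in settings.items():
        for text in iter_strings(value):
            for name in PLACEHOLDER.findall(text):
                if name not in declared:
                    problems.append((setting, name))
    return problems

settings = {
    "wgServer": {"wikipedia": "https://$lang.wikipedia.$variant.wmflabs.org"},
}
# With "variant" no longer declared, the check flags it instead of letting
# the literal "$variant" leak into a URL.
broken = find_undeclared_placeholders(settings, {"lang"})
fixed = find_undeclared_placeholders(settings, {"lang", "variant"})
```

Run as a test, a check of this kind would presumably have caught the accidental removal of the placeholder before it reached Beta Cluster.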
[20:51:17] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) [20:51:19] (03Merged) 10jenkins-bot: tests/WgConfTestCase: HACK: Downgrade throw to echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [20:51:34] (03CR) 10jenkins-bot: tests/WgConfTestCase: HACK: Downgrade throw to echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536329 (https://phabricator.wikimedia.org/T232691) (owner: 10Jforrester) [20:51:51] James_F: the test suite loads commonsettings / initialisesettings iirc so loading them with hhvm is a good way to prevent bad things? [20:51:57] or maybe the lint is good enough yeah [20:52:15] hashar: The test suite doesn't work in HHVM and fatals in PHP72, so it's not that good. [20:52:23] (03PS1) 10Dzahn: gerrit: switch from mysql to mariadb db driver if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536349 [20:52:29] hashar: Hence the task. :-) But you have a UBN to fix first. [20:52:32] oops :( [20:52:51] well [20:52:54] (03CR) 10Muehlenhoff: [C: 03+1] gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:52:59] that UBN has been floating around for a few weeks already [20:53:19] Sure, but it's also nearly 23:00 for you. 
[20:53:38] (03CR) 10jerkins-bot: [V: 04-1] gerrit: switch from mysql to mariadb db driver if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536349 (owner: 10Dzahn) [20:53:45] yeah :-\ [20:54:03] (03PS1) 10Krinkle: [WIP] Simplify WgConfTestCase and re-enable DbconfigTest for PHP7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 [20:55:00] (03PS7) 10Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 [20:55:13] (03PS2) 10Krinkle: [WIP] Simplify WgConfTestCase and re-enable DbconfigTest for PHP7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 [20:55:59] (03CR) 10Dzahn: [C: 03+2] gerrit: switch JDK package to version 11 if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:56:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Simplify WgConfTestCase and re-enable DbconfigTest for PHP7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 (owner: 10Krinkle) [20:56:07] (03CR) 10Paladox: gerrit: switch JDK package to version 11 if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:58:10] (03PS5) 10Jforrester: Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 [20:58:19] (03PS5) 10Jforrester: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 [20:58:29] (03PS6) 10Jforrester: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 [20:58:40] (03PS6) 10Jforrester: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 [20:58:55] (03CR) 10Dzahn: gerrit: switch JDK package to version 11 if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536344 (owner: 10Dzahn) [20:59:52] (03PS2) 10Dzahn: gerrit: switch from mysql to mariadb db driver if on buster [puppet] - 
10https://gerrit.wikimedia.org/r/536349 [21:00:02] !log decommissioning Cassandra, restbase2009 -- T224553 [21:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [21:01:45] (03CR) 10Jforrester: [C: 03+1] "This doesn't expose it for testing, of course, unlike my VariantSettings approach…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [21:05:19] (03CR) 10Krinkle: "The beta config array is already exposed from a callable function. The logic to apply the beta config is previously not testable without m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [21:05:28] (03PS2) 10Krinkle: Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) [21:05:51] (03PS10) 10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [21:06:23] (03CR) 10Jforrester: [C: 03+2] Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [21:06:26] (03CR) 10Jforrester: [C: 03+2] Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [21:06:29] (03CR) 10Jforrester: [C: 03+2] composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [21:06:33] (03CR) 10Jforrester: [C: 03+2] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [21:06:37] (03PS1) 10Dzahn: gerrit: add gerrit1001 as a replica host [puppet] - 
10https://gerrit.wikimedia.org/r/536352 [21:06:44] (All no-ops.) [21:06:50] well, for prod [21:06:57] it changes stuff for ci ;) [21:07:04] Yes, it makes it work. ;-) [21:07:27] (03Merged) 10jenkins-bot: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [21:07:37] (03Merged) 10jenkins-bot: Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [21:07:43] (03Merged) 10jenkins-bot: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [21:07:49] (03Merged) 10jenkins-bot: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [21:08:49] uh, James_F... [21:08:54] If you did --no-dev... [21:09:01] Oh, the patch after wasn't --no-dev [21:09:02] nvm [21:09:05] The followup wasn't, yeah. [21:09:10] I was like, why is there phpunit [21:09:13] Because the repo was in a mixed state. [21:09:18] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@c4c9e8b]: Deploy kartotherian 1.1.4-wmf.0 [21:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:22] Which wasn't good for anything. 
[21:09:28] (03PS2) 10Dzahn: gerrit: add gerrit1001 as a replica host [puppet] - 10https://gerrit.wikimedia.org/r/536352 [21:09:32] (03CR) 10jenkins-bot: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [21:10:14] (03CR) 10Jforrester: [C: 03+2] composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [21:11:15] (03Merged) 10jenkins-bot: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [21:11:44] (03CR) 10jenkins-bot: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [21:12:53] Krinkle: Want to land https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/536345 (wgConf dependency removal)? [21:13:10] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@c4c9e8b]: Deploy kartotherian 1.1.4-wmf.0 (duration: 03m 52s) [21:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:16] (03PS2) 10Dzahn: ATS: switch webperf backends to TLS and discovery name [puppet] - 10https://gerrit.wikimedia.org/r/535929 (https://phabricator.wikimedia.org/T210411) [21:13:54] PROBLEM - SSH wtp1031.mgmt on wtp1031.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:16] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@5996843]: Deploy tilerator 1.1.4-wmf.0 [21:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:18] (03PS3) 10Jforrester: Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [21:16:09] (03PS11) 
10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [21:16:28] going to deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/536338/2/includes/libs/rdbms/lbfactory/LBFactoryMulti.php [21:16:36] hashar: Go for it. [21:16:54] (03CR) 10Dzahn: [C: 04-2] "oops, no. i renamed the directory from webperf.discovery to performance.discovery to match the director name and did not create a new cert" [puppet] - 10https://gerrit.wikimedia.org/r/535929 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:16:54] James_F: not sure, bit scared to do that right now [21:17:03] currently figuring out the reason wmf-config tests fail on php72 [21:17:12] I can repro locally [21:17:17] ruled out and commented out most code [21:17:34] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@5996843]: Deploy tilerator 1.1.4-wmf.0 (duration: 03m 18s) [21:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:44] I'm down now to the following working: setGlobals([ 'wmfDatacenter' => 'eqiad' ]); restoreGlobals() [21:17:49] but the following failing: [21:17:58] setGlobals([ 'wmfDatacenter' => 'eqiad' ]); require db-eqiad.php; restoreGlobals() [21:18:16] Yeah, I re-wrote it locally and it still dirtied the globals. [21:18:16] it's somehow causing WgConfigTestCase to forget it stashed that global [21:18:18] so random [21:18:24] !log hashar@deploy1001 Synchronized php-1.34.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Hardcode posix signal and log coredump - T232613 (duration: 01m 04s) [21:18:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/536349 (owner: 10Dzahn) [21:18:41] Is it calling the setGlobals twice and over-writing the memory of what is written? 
[21:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:45] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [21:20:05] (03CR) 10Dzahn: [C: 03+2] gerrit: switch from mysql to mariadb db driver if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536349 (owner: 10Dzahn) [21:20:47] moauhahahhahaha [21:20:54] I got a coredump right now :-] [21:20:59] on mw1227 ! [21:21:00] hurrah [21:21:19] (03CR) 10Muehlenhoff: [C: 03+1] "Actually, needs a followup patch, sorry only just noticed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536349 (owner: 10Dzahn) [21:21:20] Success! [21:22:00] (03CR) 10Dzahn: gerrit: switch from mysql to mariadb db driver if on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536349 (owner: 10Dzahn) [21:25:56] just in time for TimStarling breakfast :] [21:26:28] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:33:50] (03PS2) 10Jforrester: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 [21:34:00] (03PS2) 10Jforrester: WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 [21:34:03] (03PS1) 10Dzahn: gerrit: fix db driver package dependency if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536355 [21:37:28] James_F: oh god, this is horrible [21:37:34] James_F: It's a phpunit bug [21:37:52] backupGlobals="true" [21:37:52] convertNoticesToExceptions="true" [21:37:52] convertWarningsToExceptions="true" [21:37:53] Krinkle: … fun. 
[21:37:56] These are the settings we use [21:38:06] DbconfigTest includes db-eqiad.php [21:38:16] which includes a few references to undefined globals [21:38:22] like wmgOldExtTemplate and wgSecretKey [21:38:29] so they cause E_NOTICE undefined variable [21:38:37] which, like in mw core, is turned into an exception by PHPUnit [21:38:48] and would normally cause something like this: [21:38:49] There was 1 error: [21:38:49] 1) DbconfigTest::testSectionLoadsInHostsbyname with data set #0 ('production', 'eqiad', 'eqiad') [21:38:49] Undefined variable: wmgOldExtTemplate [21:38:53] But... [21:39:07] PHPUnit also validates that no globals were left modified [21:39:12] The backupGlobals="true" feature [21:39:14] Because wmgOldExtTemplate isn't defined for "unittest"? [21:39:31] The loadConfig in DbconfigTest has its own very basic mock. [21:39:37] uses realm 'production' [21:39:38] anyway [21:39:48] So when PHPUnit catches this exception and wants to tell us. [21:40:06] It decides it is a good idea to first "end" the test [21:40:11] which then triggers the check for new globals [21:40:15] Oh. [21:40:17] Oh dear. [21:40:27] and that finds an issue of course becase we bloody haven't reached the restoreGlobals() call [21:40:28] That's… quite poor. [21:40:30] it just stops the test half-way [21:40:34] and then complains about something unrelated [21:40:40] PHPUnit4 didn't have this bug [21:40:42] :) [21:40:44] Hi, postmerge for operations/mediawiki-config on https://integration.wikimedia.org/zuul/ stucks [21:40:45] (03PS2) 10Dzahn: gerrit: mariadb-java-client to replace mysql-connector-java if on buster [puppet] - 10https://gerrit.wikimedia.org/r/536355 [21:41:00] Zoranzoki21: It's probably just overloaded. [21:41:20] Krinkle: So the solution is to mock wmgOldExtTemplate? [21:41:30] James_F: beta-mediawiki-config-update-eqiad overloaded? [21:41:48] Zoranzoki21: Plausibly. 
[21:42:25] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ -> Waiting for next available executor on ‘deployment-deploy01' [21:42:32] James_F: wgSecretKey [21:42:42] at which point the test no longer randomly stops mid-way complaining about something unrelated [21:42:44] but... [21:42:47] then the real issue is uncovered [21:42:58] Undefined index: sectionLoads [21:43:05] The test is for something that no longer exists in wmf-config [21:43:14] LBFactoryConf['sectionLoads'] comes from etcd now [21:43:15] Oh, because it moved to etcdconfig? [21:43:17] Yeah. [21:43:21] Whoops. [21:43:49] James_F: It's confusing .. It doesn't look like it's loaded, there aren't many patches. [21:44:32] (03PS2) 10Dzahn: DHCP: switch gerrit1001 to use buster [puppet] - 10https://gerrit.wikimedia.org/r/536341 [21:45:56] (03CR) 10Dzahn: [C: 03+2] "We have to do an upgrade from jessie either way so not like we can avoid that going together with the hardware upgrade...so going ahead." [puppet] - 10https://gerrit.wikimedia.org/r/536341 (owner: 10Dzahn) [21:47:03] (03CR) 10jenkins-bot: Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [21:47:04] (03CR) 10jenkins-bot: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [21:47:06] (03CR) 10jenkins-bot: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [21:47:13] Zoranzoki21: There. [21:47:38] James_F: Cool now, ty. [21:48:57] (03PS3) 10Krinkle: tests: Remove dbconfigTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 (https://phabricator.wikimedia.org/T232691) [21:49:51] (03CR) 10Dzahn: "it does not mean i reinstall it right away, we can still talk about the Java version ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/536341 (owner: 10Dzahn) [21:50:05] Krinkle: So the entire set of tests is now pointless? [21:50:34] (03CR) 10Paladox: "We carn't use buster just yet as gerrit 2.15 does not support java 11. We might as well use stretch, upgrade gerrit to 2.16, then reimage " [puppet] - 10https://gerrit.wikimedia.org/r/536341 (owner: 10Dzahn) [21:50:39] Do we have an equivalent set of tests in the etcd pipeline. [21:51:55] (03CR) 10Jforrester: [C: 03+2] tests: Remove dbconfigTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [21:52:38] (03PS2) 10Nuria: Bumping up jar version to correct typo [puppet] - 10https://gerrit.wikimedia.org/r/535910 [21:52:54] (03Merged) 10jenkins-bot: tests: Remove dbconfigTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [21:53:19] (03CR) 10jenkins-bot: tests: Remove dbconfigTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536350 (https://phabricator.wikimedia.org/T232691) (owner: 10Krinkle) [21:53:46] Krinkle: Do you want to keep the task open for the CirrusSearch tests? [21:54:51] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:20] (03PS1) 10Dzahn: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/536357 (https://phabricator.wikimedia.org/T222391) [21:56:47] (03CR) 10Jforrester: [C: 03+1] Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [21:57:07] (03CR) 10Paladox: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/536357 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [21:57:21] (03PS4) 10Jforrester: tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534546 [21:58:10] (03CR) 10jerkins-bot: [V: 04-1] tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534546 (owner: 10Jforrester) [21:59:46] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:00:57] (03CR) 10Ottomata: [C: 03+2] Bumping up jar version to correct typo [puppet] - 10https://gerrit.wikimedia.org/r/535910 (owner: 10Nuria) [22:01:00] James_F: maybe yeah, would be good for them to take a look [22:01:49] (03PS1) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) [22:02:03] (03CR) 10MSantos: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [22:02:29] (03CR) 10Paladox: gerrit: add gerrit1001 as a replica host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536352 (owner: 10Dzahn) [22:03:23] (03CR) 10Paladox: [C: 03+1] mariadb::ferm_misc: 
allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [22:07:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:56] (03CR) 10Ayounsi: [C: 03+1] codfw backup LVS: BGP sessions with both routers [puppet] - 10https://gerrit.wikimedia.org/r/536324 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [22:10:20] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:10:51] (03PS1) 10Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359 [22:11:20] (03PS2) 10Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359 [22:11:31] (03PS1) 10Dzahn: gerrit2001: re-enable Icinga monitoring of systemd [puppet] - 10https://gerrit.wikimedia.org/r/536360 [22:11:50] (03CR) 10Paladox: [C: 03+1] gerrit2001: re-enable Icinga monitoring of systemd [puppet] - 10https://gerrit.wikimedia.org/r/536360 (owner: 10Dzahn) [22:12:44] (03CR) 10Dzahn: [C: 03+2] gerrit2001: re-enable Icinga monitoring of systemd [puppet] - 10https://gerrit.wikimedia.org/r/536360 (owner: 10Dzahn) [22:12:52] (03PS2) 10Dzahn: gerrit2001: re-enable Icinga monitoring of systemd [puppet] - 10https://gerrit.wikimedia.org/r/536360 [22:14:16] !log remove extra prepend in AMS-IX [22:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:28] RECOVERY - SSH wtp1031.mgmt on wtp1031.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:37] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/535969 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [22:21:15] 10Operations, 10Analytics, 10Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10Nuria) cc @Ottomata just in case he can do the change too [22:22:02] (03PS1) 10Cwhite: prometheus: make statsd.relay-address toggle-able [puppet] - 10https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) [22:24:29] (03PS2) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) [22:25:23] (03PS1) 10Eevans: hieradata: specify restbase2009 jbod devices [puppet] - 10https://gerrit.wikimedia.org/r/536367 (https://phabricator.wikimedia.org/T224553) [22:25:55] (03CR) 10Cwhite: [C: 03+1] swift: remove statsite [puppet] - 10https://gerrit.wikimedia.org/r/536146 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [22:27:30] (03CR) 10Cwhite: [C: 03+1] statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 (owner: 10Filippo Giunchedi) [22:29:31] (03CR) 10Cwhite: [C: 03+1] prometheus: add alert for widespread systemd failed units [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [22:30:14] (03PS1) 10Ayounsi: Reserve DNS for netflow2001 [dns] - 10https://gerrit.wikimedia.org/r/536369 [22:33:16] (03PS2) 10Ayounsi: Reserve DNS for netflow2001 [dns] - 10https://gerrit.wikimedia.org/r/536369 [22:34:45] (03PS1) 10Dzahn: requesttracker: comment out envoy inclusion for now [puppet] - 10https://gerrit.wikimedia.org/r/536370 [22:35:08] PROBLEM - Host backup1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:36:16] (03PS3) 10Dzahn: smokeping: replace cobalt with gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535969 (https://phabricator.wikimedia.org/T222391) [22:36:45] (03PS3) 10Ayounsi: 
Reserve DNS for netflow2001 [dns] - https://gerrit.wikimedia.org/r/536369 [22:37:01] i cant confirm what icinga says about backup1001.mgmt [22:37:15] it works [22:38:21] r4!Epru2C5 [22:39:10] (CR) Dzahn: [C: +2] smokeping: replace cobalt with gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535969 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn) [22:39:12] (CR) Ayounsi: [C: +2] Reserve DNS for netflow2001 [dns] - https://gerrit.wikimedia.org/r/536369 (owner: Ayounsi) [22:40:48] RECOVERY - Host backup1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [22:42:25] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (colewhite) [22:43:47] (CR) Dzahn: [C: +2] requesttracker: comment out envoy inclusion for now [puppet] - https://gerrit.wikimedia.org/r/536370 (owner: Dzahn) [22:43:49] !log ayounsi@cumin2001 START - Cookbook sre.ganeti.makevm [22:43:49] !log ayounsi@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [22:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:50] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:55] (PS2) Dzahn: requesttracker: comment out envoy inclusion for now [puppet] - https://gerrit.wikimedia.org/r/536370 [22:47:09] (PS2) Cwhite: prometheus: make statsd.relay-address toggle-able [puppet] - https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) [22:47:26] (PS1) Ayounsi: netflow2001: put record in good DC [dns] - https://gerrit.wikimedia.org/r/536372 [22:48:02] (CR) Ayounsi: [C: +2] netflow2001: put record in good DC [dns] - https://gerrit.wikimedia.org/r/536372 (owner: Ayounsi) [22:48:41] !log ayounsi@cumin2001 START - Cookbook sre.ganeti.makevm [22:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:44] !log ayounsi@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [22:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190912T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:02:56] Operations, Analytics, Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria) [23:05:05] (PS1) Ayounsi: DHCP: add netflow2001 [puppet] - https://gerrit.wikimedia.org/r/536373 [23:07:32] (PS2) Ayounsi: DHCP: add netflow2001 [puppet] - https://gerrit.wikimedia.org/r/536373 [23:08:21] (CR) Ayounsi: [C: +2] DHCP: add netflow2001 [puppet] - https://gerrit.wikimedia.org/r/536373 (owner: Ayounsi) [23:08:46] (PS3) Ayounsi: DHCP: add netflow2001 [puppet] - https://gerrit.wikimedia.org/r/536373 [23:10:41] (PS10) Jforrester: Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [23:10:46] (CR) Jforrester: [C: +2] Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester) [23:11:02] (CR) Ayounsi: [C: +2] Add role netinsight to netflow2001 [puppet] - https://gerrit.wikimedia.org/r/535942 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi) [23:12:10] (CR) jenkins-bot: Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester) [23:14:07] Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (Jclark-ctr) Opened Dell Support Ticket Reseated cables in back of array and controller link light appeared Will wait for verification if working prior to closing @jcrespo [23:17:55] (PS7) Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/535257 [23:18:00] !log jforrester@deploy1001 Synchronized
multiversion/MWConfigCacheGenerator.php: T223602 Add ability to read config from JSON, not serialised PHP (duration: 01m 04s) [23:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:04] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [23:18:31] (PS1) Cwhite: initial commit [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/536376 [23:19:03] (PS4) Jforrester: Variant configuration: Read JSON config for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/535963 [23:19:32] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T223602 Read config from JSON, not serialised PHP on testwiki (duration: 01m 03s) [23:19:35] (PS6) Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [23:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:55] (CR) Jforrester: "Let's think about enabling this on Monday if all seems well over the weekend."
[mediawiki-config] - https://gerrit.wikimedia.org/r/535257 (owner: Jforrester) [23:21:16] !log decommissioning Cassandra, restbase2009-b -- T224553 [23:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:19] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [23:26:00] (PS1) Bstorm: tools-manifest: apply black formatting to webservicemonitor [software/tools-manifest] - https://gerrit.wikimedia.org/r/536377 [23:26:02] (PS1) Bstorm: tools-manifest: increase the timeout to 30s [software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) [23:26:24] (CR) jerkins-bot: [V: -1] tools-manifest: apply black formatting to webservicemonitor [software/tools-manifest] - https://gerrit.wikimedia.org/r/536377 (owner: Bstorm) [23:26:28] (CR) jerkins-bot: [V: -1] tools-manifest: increase the timeout to 30s [software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) (owner: Bstorm) [23:30:35] (PS2) Bstorm: tools-manifest: increase the timeout to 30s [software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) [23:33:48] (CR) Bstorm: "Just old fussing because I'm looking at it. I don't recall if we decided to update the changelog on commits or after it is being built?"
[software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) (owner: Bstorm) [23:34:25] (CR) BryanDavis: [C: +1] tools-manifest: increase the timeout to 30s [software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) (owner: Bstorm) [23:35:03] !log enable netflow sampling on cr1-codfw [23:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:07] (CR) BryanDavis: "Maybe move the tox change from the followup to this patch to make jerkins happy? Or just force past its -1 knowing the following patch fix" (1 comment) [software/tools-manifest] - https://gerrit.wikimedia.org/r/536377 (owner: Bstorm) [23:38:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:42] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:50:28] (PS3) Bstorm: tools-manifest: increase the timeout to 30s [software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) [23:51:28] (PS2) Bstorm: tools-manifest: apply black formatting to webservicemonitor [software/tools-manifest] - https://gerrit.wikimedia.org/r/536377 [23:54:23] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:56:00] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Nuria) @srishakatux hello, can you be a bit more specific about what data you are after? Are you thinking eventlog...
[23:56:22] (PS4) Bstorm: tools-manifest: increase the timeout to 30s [software/tools-manifest] - https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) [23:56:28] ACKNOWLEDGEMENT - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL Cas Rusnov Will address ASAP. https://wikitech.wikimedia.org/wiki/Netbox%23Reports