[00:16:23] (PS8) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[00:16:25] (PS11) EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[00:16:27] (PS39) EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[00:18:52] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:22:12] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational
[02:01:41] (PS7) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894
[02:01:43] (PS9) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[02:01:45] (PS12) EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[02:01:47] (PS40) EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[02:19:25] (PS10) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[02:19:27] (PS13) EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[02:19:29] (PS41) EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[02:21:39] (CR) EBernhardson: "puppet compiler runs for this patch and the rest of the series against their direct parent can be found at:" [puppet] - https://gerrit.wikimedia.org/r/441894 (owner: EBernhardson)
[02:33:59] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 13m 26s)
[02:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:09] (PS2) Samwilson: Enable Draft namespace and AfC mode for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143) (owner: MusikAnimal)
[03:05:26] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.999) (duration: 12m 59s)
[03:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:47] Operations, ops-eqiad: Degraded RAID on db1054 - https://phabricator.wikimedia.org/T198157#4313908 (ops-monitoring-bot)
[03:43:32] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received
[03:43:32] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 503 (expecting: 200)
[03:43:52] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received
[03:43:52] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200)
[03:44:32] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[03:44:33] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[03:44:43] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[03:44:52] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[04:41:58] Operations, Cloud-Services, Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138#4313113 (demon) >>! In T198138#4313147, @Krenair wrote: > Might be better to ensure that all privileged users at least know that they should never use -A to a host they do not...
[04:46:57] (CR) Chad: [C: 1] Scap clean: remove remote cache directory [mediawiki-config] - https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: Thcipriani)
[05:05:00] Operations, ops-eqiad: Degraded RAID on db1054 - https://phabricator.wikimedia.org/T198157#4314051 (Marostegui) Open>Invalid This server is going to be decommissioned, so no need to act on this.
[05:10:48] (PS1) Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/442012 (https://phabricator.wikimedia.org/T191316)
[05:13:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/442012 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:13:17] (PS5) Marostegui: mariadb: Set db1095 as spare, remove unused code [puppet] - https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376)
[05:14:24] (Merged) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/442012 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:15:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 for alter table (duration: 00m 59s)
[05:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:40] !log Deploy schema change on db1122 T191316 T192926 T89737 T195193
[05:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:44] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:15:44] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:15:45] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:15:45] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:16:35] (CR) Marostegui: mariadb: Set db1095 as spare, remove unused code [puppet] - https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[05:17:24] (CR) Marostegui: [C: 2] mariadb: Set db1095 as spare, remove unused code [puppet] - https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[05:18:38] (CR) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/442012 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:28:57] !log testing fix for T197447 on wdqs1009
[05:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:59] T197447: Default Blazegraph configuration confuses strings with and without RTL mark - https://phabricator.wikimedia.org/T197447
[05:29:26] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:29:35] PROBLEM - Blazegraph process on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war
[05:29:36] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[05:30:05] PROBLEM - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[05:32:36] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational
[05:32:45] RECOVERY - Blazegraph process on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war
[05:32:46] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.065 second response time
[05:33:15] RECOVERY - Blazegraph Port on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[05:33:43] (PS1) Marostegui: s*.hosts: Remove db1095 [software] - https://gerrit.wikimedia.org/r/442013 (https://phabricator.wikimedia.org/T196376)
[05:34:49] (CR) Marostegui: [C: 2] s*.hosts: Remove db1095 [software] - https://gerrit.wikimedia.org/r/442013 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[05:35:34] (Merged) jenkins-bot: s*.hosts: Remove db1095 [software] - https://gerrit.wikimedia.org/r/442013 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[05:43:55] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4314094 (Marostegui)
[05:44:10] (PS2) Marostegui: s2.hosts: Remove db1054 [software] - https://gerrit.wikimedia.org/r/440989
[05:47:05] (PS1) Marostegui: mariadb: Set db1054 as spare [puppet] - https://gerrit.wikimedia.org/r/442014 (https://phabricator.wikimedia.org/T197063)
[05:51:17] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11568/" [puppet] - https://gerrit.wikimedia.org/r/442014 (https://phabricator.wikimedia.org/T197063) (owner: Marostegui)
[05:52:12] !log Stop MySQL on db1054 as it is going to be decommissioned - T197063
[05:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:14] T197063: Decommission db1054 - https://phabricator.wikimedia.org/T197063
[05:53:14] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/442015 (https://phabricator.wikimedia.org/T197063)
[05:54:39] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/442015 (https://phabricator.wikimedia.org/T197063) (owner: Marostegui)
[05:55:47] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/442015 (https://phabricator.wikimedia.org/T197063) (owner: Marostegui)
[05:57:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1054, it is going to be decommissioned T197063 (duration: 00m 57s)
[05:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:18] T197063: Decommission db1054 - https://phabricator.wikimedia.org/T197063
[05:58:17] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1054, it is going to be decommissioned T197063 (duration: 00m 55s)
[05:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:32] Operations, ops-eqiad, DBA, decommission, Patch-For-Review: Decommission db1054 - https://phabricator.wikimedia.org/T197063#4314111 (Marostegui) a:Marostegui>Cmjohnson db1054 is now ready to be handed over to DCOps for its decommissioning
[06:01:52] (CR) Marostegui: [C: 2] s2.hosts: Remove db1054 [software] - https://gerrit.wikimedia.org/r/440989 (owner: Marostegui)
[06:03:57] (Merged) jenkins-bot: s2.hosts: Remove db1054 [software] - https://gerrit.wikimedia.org/r/440989 (owner: Marostegui)
[06:18:25] (PS1) Marostegui: mariadb: Set db1102 as spare [puppet] - https://gerrit.wikimedia.org/r/442016 (https://phabricator.wikimedia.org/T196376)
[06:20:41] (PS1) Marostegui: s*.hosts: Remove db1102 [software] - https://gerrit.wikimedia.org/r/442017 (https://phabricator.wikimedia.org/T196376)
[06:22:02] (CR) Marostegui: [C: 2] s*.hosts: Remove db1102 [software] - https://gerrit.wikimedia.org/r/442017 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[06:22:50] (Merged) jenkins-bot: s*.hosts: Remove db1102 [software] - https://gerrit.wikimedia.org/r/442017 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[06:23:37] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11569/" [puppet] - https://gerrit.wikimedia.org/r/442016 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[06:32:07] (PS2) Elukey: Avoid unnecessary runs for zk-init [puppet/cdh] - https://gerrit.wikimedia.org/r/350542
[06:32:33] (CR) Elukey: [C: 2] Avoid unnecessary runs for zk-init [puppet/cdh] - https://gerrit.wikimedia.org/r/350542 (owner: Elukey)
[06:34:26] (PS1) Elukey: Update the cdh module to its latest sha [puppet] - https://gerrit.wikimedia.org/r/442018
[06:37:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/442019
[06:37:08] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/442019
[06:38:02] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1054 [mediawiki-config] - https://gerrit.wikimedia.org/r/442015 (https://phabricator.wikimedia.org/T197063) (owner: Marostegui)
[06:39:13] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/442019 (owner: Marostegui)
[06:39:26] (PS2) Vgutierrez: smokeping: Replace acamar & achernar with dns2001 and dns2002 [puppet] - https://gerrit.wikimedia.org/r/441919 (https://phabricator.wikimedia.org/T196493)
[06:40:23] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/442019 (owner: Marostegui)
[06:41:46] (CR) Vgutierrez: smokeping: Replace acamar & achernar with dns2001 and dns2002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/441919 (https://phabricator.wikimedia.org/T196493) (owner: Vgutierrez)
[06:42:29] (PS2) Elukey: Update the cdh module to its latest sha [puppet] - https://gerrit.wikimedia.org/r/442018
[06:42:35] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11570/" [puppet] - https://gerrit.wikimedia.org/r/442018 (owner: Elukey)
[06:42:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122 after alter table (duration: 00m 57s)
[06:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:28] (CR) Hagar Shilo: "> I do wonder about whether we should alphasort these in" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/441096 (https://phabricator.wikimedia.org/T181165) (owner: Hagar Shilo)
[06:46:37] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/442019 (owner: Marostegui)
[06:47:44] (PS1) Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/442020 (https://phabricator.wikimedia.org/T191316)
[07:02:28] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/442020 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:03:41] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/442020 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:03:57] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/442020 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:05:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 for alter table (duration: 00m 56s)
[07:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:15] !log Deploy schema change on db1103:3312 T191316 T192926 T89737 T195193
[07:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:19] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[07:05:19] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[07:05:19] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[07:05:19] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[07:06:26] (PS4) Hashar: Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - https://gerrit.wikimedia.org/r/441397 (owner: Paladox)
[07:07:15] (CR) Hashar: [C: 1] "I have edited the commit message to capture the default values and cache hit ratios." [puppet] - https://gerrit.wikimedia.org/r/441397 (owner: Paladox)
[07:07:50] Operations, Analytics, Cleanup, Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4294420 (mmodell) I also deactivated {rOPVK}
[07:08:08] Operations, Analytics, Cleanup, Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4314201 (mmodell)
[07:09:00] <_joe_> ircecho is gone
[07:09:02] <_joe_> wtf?
[07:11:49] still in the channel, maybe stuck?
[07:12:22] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4314204 (10mmodell) Github now has an archive feature, do we still delete them rather than just archive on github? [07:13:26] <_joe_> !log restarting ircecho, was not reporting alerts [07:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:51] (03CR) 10Vgutierrez: [C: 032] smokeping: Replace acamar & achernar with dns2001 and dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/441919 (https://phabricator.wikimedia.org/T196493) (owner: 10Vgutierrez) [07:17:30] (03CR) 10Vgutierrez: [C: 032] lvs: use the new dns200[12] recursive DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/441916 (https://phabricator.wikimedia.org/T196493) (owner: 10Vgutierrez) [07:17:44] (03PS3) 10Vgutierrez: lvs: use the new dns200[12] recursive DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/441916 (https://phabricator.wikimedia.org/T196493) [07:24:58] (03PS3) 10Vgutierrez: smokeping: Replace acamar & achernar with dns2001 and dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/441919 (https://phabricator.wikimedia.org/T196493) [07:26:53] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4314230 (10elukey) [07:27:43] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/441808 (https://phabricator.wikimedia.org/T198070) (owner: 10Elukey) [07:27:50] (03PS4) 10Elukey: role::cache::kafka:*: add more alarms for varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/441808 (https://phabricator.wikimedia.org/T198070) [07:39:11] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11571/" [puppet] - 10https://gerrit.wikimedia.org/r/441808 (https://phabricator.wikimedia.org/T198070) (owner: 10Elukey) [07:41:21] 
10Operations, 10Discovery-Search, 10hardware-requests: replace elastic2001-2024 (codfw) with newer servers - https://phabricator.wikimedia.org/T198169#4314268 (10Gehel) [07:45:02] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/kafkatee repository - https://phabricator.wikimedia.org/T198098#4314285 (10elukey) https://phabricator.wikimedia.org/diffusion/OPKT/ deactivated [07:45:09] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4314287 (10Marostegui) @Cmjohnson do you have an estimate date more or less when we can do this movement? Asking just to organize ourselves (within DBA) [07:45:19] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/kafkatee repository - https://phabricator.wikimedia.org/T198098#4314288 (10elukey) [07:45:45] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/jmxtrans repository - https://phabricator.wikimedia.org/T198097#4314289 (10elukey) https://phabricator.wikimedia.org/diffusion/OPJM/ deactivated [07:45:54] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/jmxtrans repository - https://phabricator.wikimedia.org/T198097#4314290 (10elukey) [07:47:05] (03PS2) 10Giuseppe Lavagetto: [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [07:48:11] (03PS4) 10Vgutierrez: smokeping: Replace acamar & achernar with dns2001 and dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/441919 (https://phabricator.wikimedia.org/T196493) [07:48:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 
10Giuseppe Lavagetto) [07:49:15] (03CR) 10Filippo Giunchedi: [C: 032] DNS: Add mgmt & production DNS entries for graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/441908 (https://phabricator.wikimedia.org/T196483) (owner: 10Papaul) [07:49:41] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/jmxtrans repository - https://phabricator.wikimedia.org/T198097#4311727 (10mmodell) archived the github mirror [07:49:50] (03PS3) 10Giuseppe Lavagetto: [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [07:50:19] 10Operations, 10Discovery-Search, 10hardware-requests: replace elastic2001-2024 (codfw) with newer servers - https://phabricator.wikimedia.org/T198169#4314298 (10Gehel) @EBernhardson / @dcausse : the quote above is based on our last order of servers. Since we plan on keeping those servers for 4-5 years, now... [07:51:06] (03PS2) 10Vgutierrez: hieradata: Get rid of acamar and achernar references [puppet] - 10https://gerrit.wikimedia.org/r/441918 (https://phabricator.wikimedia.org/T196493) [07:51:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [07:51:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/kafkatee repository - https://phabricator.wikimedia.org/T198098#4311739 (10mmodell) archived https://github.com/wikimedia/operations-puppet-kafkatee [07:52:13] (03CR) 10Vgutierrez: [C: 032] hieradata: Get rid of acamar and achernar references [puppet] - 10https://gerrit.wikimedia.org/r/441918 (https://phabricator.wikimedia.org/T196493) (owner: 10Vgutierrez) [07:52:21] (03PS3) 10Vgutierrez: hieradata: Get rid of acamar and achernar references [puppet] - 
10https://gerrit.wikimedia.org/r/441918 (https://phabricator.wikimedia.org/T196493) [07:58:26] !log Replaced acamar and achernar with dns200[12] as main lvs name servers in codfw - T196493 [07:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:29] T196493: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493 [07:58:54] !ops [08:00:46] <_joe_> thanks jynus [08:00:49] thx :D [08:00:56] :) [08:01:00] <_joe_> we should really +t this channel IMHO [08:01:24] even allow only registered nicks here as well [08:01:31] <_joe_> no [08:01:59] <_joe_> we want random users to connect to any irc web interface to report problems to be able to do so [08:03:46] marostegui: one page with 40000 revisions [08:03:53] how do I delete in small batch? [08:04:27] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4314338 (10elukey) [08:04:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4314342 (10elukey) [08:04:39] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4294420 (10elukey) 05Open>03Resolved [08:05:07] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4005170 (10elukey) [08:05:11] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/jmxtrans repository - https://phabricator.wikimedia.org/T198097#4314343 (10elukey) 05Open>03Resolved [08:05:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/kafkatee repository - 
https://phabricator.wikimedia.org/T198098#4314350 (10elukey) [08:09:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [08:09:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [08:09:37] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4314368 (10elukey) [08:09:38] XioNoX: ^^ [08:09:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archive operations/puppet/kafkatee repository - https://phabricator.wikimedia.org/T198098#4314367 (10elukey) 05Open>03Resolved [08:10:37] XioNoX: related to the expected hard down reported by Zayo for today? [08:19:36] (03PS5) 10Filippo Giunchedi: prometheus: alert on config reload failure [puppet] - 10https://gerrit.wikimedia.org/r/432059 [08:19:53] (03PS6) 10Filippo Giunchedi: prometheus: alert on config reload failure [puppet] - 10https://gerrit.wikimedia.org/r/432059 [08:20:51] revi: m.arostegui is offline for some hours today [08:21:01] talked to him via task [08:21:06] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: alert on config reload failure [puppet] - 10https://gerrit.wikimedia.org/r/432059 (owner: 10Filippo Giunchedi) [08:21:27] https://phabricator.wikimedia.org/T198156 if you wonder :P [08:22:46] and I've never heard of any known tool that allows me to delete one page in a small batch [08:28:41] deleting [08:32:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [08:34:15] vgutierrez: probably, I'm on my way to the airport [08:34:54] travel safe! :) [08:35:18] let me know if it causes other issues [08:37:45] (03CR) 10Mobrovac: [C: 032] Switch all jobs to the new queue and clean up the old queue configs. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/437767 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [08:38:18] XioNoX: I'm trying to debug some slowness in ulsfo [08:38:24] XioNoX: dunno if it could be related to that [08:38:53] probably not, that's esams related [08:39:44] (03PS2) 10Gilles: Upgrade to 2.0 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/427612 (https://phabricator.wikimedia.org/T27611) [08:40:15] (03PS5) 10Mobrovac: Switch all jobs to the new queue and clean up the old queue configs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437767 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [08:42:09] so the Level3 link between eqiad and esams is down, planned maintenance [08:42:41] and there is a planned maintenance for the Zayo link between codfw and eqiad [08:43:28] also fyi, https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down [08:43:38] (03CR) 10Elukey: "> I think we talked about this in IRC...but I forget what we said!" 
[puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [08:44:15] hah, the l3 maint isn't in the calendar, adding it [08:44:34] or rather, it got moved [08:44:48] it was supposed to be tomorrow [08:45:00] * mobrovac is taking over deploy1001 [08:46:32] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@7d9a1aa]: Enable all jobs in kafka queue T190327 [08:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:37] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [08:46:47] godog: it is, it's the CenturyLink one [08:47:00] CenturyLink bought Level3 [08:47:19] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4314450 (10Pchelolo) 05Open>03Resolved [08:47:31] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@7d9a1aa]: Enable all jobs in kafka queue T190327 (duration: 00m 58s) [08:47:31] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Switch the last remaining jobs to EventBus, only CommonSettings.php now - T190327 (duration: 00m 58s) [08:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:07] !log mobrovac@deploy1001 Started scap: Switch the last remaining jobs to EventBus, full scap sync for clean-up - T190327 [08:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:43] <_joe_> mobrovac: fingers crossed [08:48:56] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 52624 MB (10% inode=99%) [08:49:52] <_joe_> gehel: , is this expected? ^^ [08:51:12] Probably transient... 
Having a look [08:51:17] they usually auto-recover [08:51:50] We increased the threshold alert yesterday to try to reduce those false positives, but not enough it seems... [08:52:35] !log mobrovac@deploy1001 scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [08:52:35] !log mobrovac@deploy1001 scap failed: RuntimeError scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) (duration: 04m 27s) [08:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:11] (03PS1) 10Volans: debmonitor: add the CSP header in proxy section [puppet] - 10https://gerrit.wikimedia.org/r/442036 (https://phabricator.wikimedia.org/T191299) [09:01:32] <3 [09:01:32] (03PS1) 10Ppchelko: Put back the wmgUseEventBus into InitializeSettings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442037 [09:02:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [09:02:18] (03CR) 10Vgutierrez: [C: 031] "LGTM! sorry for missing this on the previous CR :(" [puppet] - 10https://gerrit.wikimedia.org/r/442036 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:02:28] lot of relocations going on on the elasticsearch cluster... it will probably stabilize at some point [09:02:34] (03CR) 10Mobrovac: [V: 032 C: 032] Put back the wmgUseEventBus into InitializeSettings. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/442037 (owner: 10Ppchelko) [09:03:12] !log mobrovac@deploy1001 Started scap: Switch the last remaining jobs to EventBus, full scap sync, take #2 - T190327 [09:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:15] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [09:03:27] gehel: could the check be smarted and have 2 threshold, one for normal situation (alarm early) and another when relocations are ongoing (alarm later)? [09:03:33] *smarter [09:03:54] thanks vgutierrez! [09:04:18] (03CR) 10Volans: [C: 032] debmonitor: add the CSP header in proxy section [puppet] - 10https://gerrit.wikimedia.org/r/442036 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:04:22] volans: that seems too smart to be stable [09:04:35] !log unbanning elastic1042 [09:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:08] no, we should have a threshold that covers all operations. We are getting short on disk space, and we need to do something about it... [09:05:16] ack [09:05:55] 10Operations, 10Analytics, 10procurement, 10User-Elukey: eqiad (3): hardware refresh for dbstore1002 - https://phabricator.wikimedia.org/T198174#4314468 (10elukey) [09:06:02] * volans handsover an old 8GB SD card to gehel for additional space :-P [09:06:19] :) [09:06:28] !log T194342. 
Disable puppet on scb hosts for apertium-apy upgrade [09:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:30] T194342: Update apertium-apy - https://phabricator.wikimedia.org/T194342 [09:06:51] * gehel thanks volans and asks rob.h to buy a few SD adapters for our servers [09:08:00] RECOVERY - Disk space on elastic1018 is OK: DISK OK [09:08:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [09:08:25] !log upgrade scb1001 apertium-apy, apertium-fra-cat, enable puppet and run puppet [09:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [09:09:00] (03CR) 10jenkins-bot: Switch all jobs to the new queue and clean up the old queue configs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437767 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [09:09:02] (03CR) 10jenkins-bot: Put back the wmgUseEventBus into InitializeSettings. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/442037 (owner: 10Ppchelko) [09:09:08] (03PS4) 10Alexandros Kosiaris: Update apertium-apy initscripts [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [09:09:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update apertium-apy initscripts [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [09:09:22] !log upload librdkafka_0.11.3-1~bpo8+1+wikimedia2 to apt.w.o jessie-wikimedia - T182993 [09:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:24] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993 [09:10:41] PROBLEM - apertium apy on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 2737: Connection refused [09:11:06] kart_: met a puppet snafu, fixing [09:11:30] oh [09:11:47] !log mobrovac@deploy1001 Finished scap: Switch the last remaining jobs to EventBus, full scap sync, take #2 - T190327 (duration: 08m 34s) [09:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:49] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [09:11:52] akosiaris: Is package updated on scb? [09:12:53] kart_: only scb1001 [09:13:35] (03PS1) 10Alexandros Kosiaris: Remove upstart support from apertium service_unit [puppet] - 10https://gerrit.wikimedia.org/r/442039 (https://phabricator.wikimedia.org/T194342) [09:15:09] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC says it's fine in https://puppet-compiler.wmflabs.org/compiler02/11572/" [puppet] - 10https://gerrit.wikimedia.org/r/442039 (https://phabricator.wikimedia.org/T194342) (owner: 10Alexandros Kosiaris) [09:16:10] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:17:00] kart_: scb1001 fixed. Let's review [09:17:10] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.002 second response time [09:17:16] nice [09:17:21] logs point out everything is fine, so does icinga [09:17:36] can you do a quick check and then I proceed with the rest of the scb hosts [09:17:36] ? [09:18:11] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@5d8eaf1]: Bug fix: Add restore to WD description checks - T197626 [09:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:13] T197626: Public summary and mobile-sections endpoints return old description for a specific page - https://phabricator.wikimedia.org/T197626 [09:18:41] akosiaris: sure. Checking. [09:19:24] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@5d8eaf1]: Bug fix: Add restore to WD description checks - T197626 (duration: 01m 14s) [09:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:11] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:21:32] kart_: Apertium streamparser not installed, spelling handler disabled [09:21:48] I guess we need that, right ? Per https://phabricator.wikimedia.org/T192978 [09:22:34] I 'll update puppet. It's in recommends of the package anyway [09:22:55] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#4314573 (10mobrovac) [09:24:37] (03PS1) 10Alexandros Kosiaris: Add python3-streamparser to apertium [puppet] - 10https://gerrit.wikimedia.org/r/442040 (https://phabricator.wikimedia.org/T194342) [09:25:35] !log manually install python3-streamparser on scb1001 for testing. 
T194342 [09:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:37] T194342: Update apertium-apy - https://phabricator.wikimedia.org/T194342 [09:25:38] kart_: ^ [09:25:48] akosiaris: yep. [09:26:01] akosiaris: thanks. [09:27:10] heads up, i will be doing a full scap sync with hhvm restarts now to pick up the changes in jobqueue classes (hhvm is spitting errors currently about it) [09:28:01] _joe_: ^ [09:28:21] mobrovac: ok I 'll keep an eye on it as well [09:28:25] thnx [09:28:28] !log mobrovac@deploy1001 Started scap: Switch the last remaining jobs to EventBus, full scap sync with HHVM restarts, take #3 - T190327 [09:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:30] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [09:28:38] kart_: yw. Lemme know when we can proceed with the rest of the hosts [09:28:53] <_joe_> hhvm restarts don't work in scap [09:28:57] <_joe_> they never did [09:29:04] <_joe_> and whatever it tries to do, is harmful [09:29:10] akosiaris: I guess we can go ahead. I see no errors in log. and listPairs is OK too. [09:29:17] kart_: ok [09:29:19] _joe_: what ? [09:29:21] dammit [09:29:32] <_joe_> ? [09:29:45] <_joe_> can you tell me what's going on? [09:29:55] !log elastic@eqiad: forcemerge (only_expunge_deletes=true) on commonswiki_file [09:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:03] shit [09:30:08] _joe_: should i stop it? 
[09:30:58] <_joe_> mobrovac: I have no idea, maybe it was fixed [09:31:04] _joe_: we synced a version of the conf that didn't load the eventbus jobqueue class, and now even after syncing the correct stuff we still see these "unknown class" errors [09:31:09] ok, will wait for it and see [09:31:23] !log updating librdkafka1 and restarting varnishkafka-webrequest on cache::misc nodes - T182993 [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:25] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993 [09:31:26] <_joe_> how many done by now? [09:31:36] <_joe_> I mean is it doing the hhvm restarts already? [09:31:47] still syncing, the restarts haven't started yet [09:31:48] <_joe_> mobrovac: next time, we just need an hhvm rolling restart [09:31:54] !log upgrade apertium-apy, apertium-fra-cat, python3-streamparser, enable puppet, run puppet on scb1002 as a last test before full cluster upgrade. T194342 [09:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:56] T194342: Update apertium-apy - https://phabricator.wikimedia.org/T194342 [09:32:26] _joe_: nope, hhvm restarts are failing [09:32:53] <_joe_> heh I'm surprised [09:32:55] akosiaris: fra-cat? [09:32:59] <_joe_> mobrovac: still seeing the errors? [09:33:01] "Failed to restart hhvm.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files" heh [09:33:09] <_joe_> ahahahahahah [09:33:17] 200 failed, 0 success [09:33:21] and growing [09:33:26] <_joe_> can you look at logstash? 
[09:33:34] <_joe_> oh fuck [09:33:35] !log mobrovac@deploy1001 Finished scap: Switch the last remaining jobs to EventBus, full scap sync with HHVM restarts, take #3 - T190327 (duration: 05m 07s) [09:33:38] <_joe_> stop it mobrovac [09:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:42] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [09:33:45] stopped [09:33:49] <_joe_> jeez [09:33:50] <_joe_> ok [09:33:57] <_joe_> now let's hope pybal is saving us [09:33:59] <_joe_> as it should [09:34:03] akosiaris: fra-cat should not be updated. I'm fixing it with dependency :) [09:34:12] kart_: from 1.2.0~r78602-1+wmf2 to 1.3.0~r84327-1+wmf1 [09:34:26] hmm, ok lemme roll it back then [09:34:30] good thing you noticed [09:34:30] akosiaris: nope. 1.3.0 is broken. [09:34:31] "Request from 88.97.96.89 via cp1054 cp1054, Varnish XID 855310345 [09:34:31] Error: 503, Backend fetch failed at Tue, 26 Jun 2018 09:33:51 GMT" [09:34:36] <_joe_> akosiaris: wait please [09:34:44] Page- https://en.wikipedia.org/wiki/File:Simurgh_returns_Zal_to_his_father_Sam.png [09:34:50] Known issue? [09:35:08] (03PS1) 10Volans: Fix table ordering on click [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442042 (https://phabricator.wikimedia.org/T167504) [09:35:11] ShakespeareFan00: in the middle of 2 upgrades currently. Will take a look later on [09:35:14] Request from [redacted] via cp1065 cp1065, Varnish XID 39951007 [09:35:14] error on Spanish Wikipedia https://usercontent.irccloud-cdn.com/file/fonREdDM/wmf.png [09:35:14] Error: 503, Backend fetch failed at Tue, 26 Jun 2018 09:34:30 GMT [09:35:15] <_joe_> ShakespeareFan00: yes [09:35:20] <_joe_> yes it's known [09:35:21] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [09:35:21] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [09:35:21] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) [09:35:21] the whole cluster seems down [09:35:26] <_joe_> it's what marko did [09:35:30] oh ouch [09:35:34] <_joe_> I'm fixing it now [09:35:35] Planned maintenance? [09:35:37] ok I take that back, looking right now [09:35:38] <_joe_> with the broad cannon [09:35:44] PROBLEM - LVS HTTP IPv4 on api.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.22 and port 80: Connection refused [09:35:46] hmm? 
[09:35:47] i iz taken down teh_wikiz [09:35:57] what did you do this time marco [09:36:01] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=appserver [09:36:02] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a [09:36:02] ved: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) [09:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:05] (03CR) 10jenkins-bot: [V: 04-1] Fix table ordering on click [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442042 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:36:06] dat bigdelete? 
[09:36:10] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [09:36:15] I'm here too [09:36:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v [09:36:20] tle}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/media/{title} [09:36:20] dia in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vade [09:36:20] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /api/rest_v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /api/rest_v1/data/citation/{fo [09:36:20] citation for Darth Vader) timed out before a response was received [09:36:20] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} 
(retrieve PNG from mediawiki.org) timed out before a response was received [09:36:20] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 504 (expecting: 200): /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (bad seed [09:36:21] bad seed returned the unexpected status 504 (expecting: 404) [09:36:21] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [09:36:26] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.22 and port 80: Connection refused [09:36:30] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [09:36:30] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [09:36:30] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /api/rest_v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /api/rest_v1/data/citation/{fo [09:36:30] citation for Darth Vader) timed out before a response was received [09:36:30] PROBLEM - graphoid endpoints health on scb2002 is 
CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [09:36:31] I 'll silence icinga-wm [09:36:38] can I help? [09:36:40] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [09:36:40] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpecte [09:36:40] ting: 200): /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia retur [09:36:40] status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200) [09:36:41] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [09:36:41] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpecte [09:36:41] ting: 200): /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia retur [09:36:41] status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get m [09:36:41] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [09:36:42] what is the issue? [09:36:48] Looks like we're full crash cart on this one [09:36:51] <_joe_> jynus: marko depooled all the appservers [09:36:56] hhvm mess [09:36:56] <_joe_> and the api [09:36:59] <_joe_> and the jobrunner [09:37:03] all that ? 
[09:37:03] <_joe_> not his fault [09:37:06] <_joe_> scap's fault [09:37:09] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=eqiad [09:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:13] hmm [09:37:20] <_joe_> are things better now? [09:37:27] <_joe_> akosiaris: can you check logstash please? [09:37:31] _joe_: yes [09:37:31] I assume there was a planned upgrade that didn't hold [09:37:35] ;) [09:37:41] at least https://en.wikipedia.org/wiki/File:Simurgh_returns_Zal_to_his_father_Sam.png is ok now [09:37:42] better now [09:37:47] * akosiaris watches icinga [09:37:54] I can read and edit [09:37:54] enwiki is back up [09:37:55] it's working for me again [09:38:01] <_joe_> ok [09:38:07] leave the topic for a minute just in case [09:38:08] recovery pages have arrived as well [09:38:10] Up but intermittent and slow for me [09:38:11] <_joe_> I fixed it with a broad cannon [09:38:36] so, what happened ? scap depooled all mw apiservers ? [09:38:45] while failing to restart hhvm and repool it ? 
[09:38:47] <_joe_> akosiaris: yes [09:38:50] yeah [09:38:57] damn [09:39:15] :( [09:39:15] <_joe_> akosiaris: so I think the issue here is that clearly our depool threshold is not high enough for appservers [09:39:23] _joe_: api in codfw is still connection refused for me on port 80 [09:39:25] _joe_: I was about to ask that [09:39:30] <_joe_> godog: yes, wait for it [09:39:40] I am checking metrics for actual impact [09:39:59] !log re-enable icinga-wm that I 've previously silenced [09:40:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [09:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:07] <_joe_> can someone check the pybal logs? [09:40:11] <_joe_> vgutierrez: maybe? [09:40:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [09:40:13] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw [09:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:20] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [09:40:35] RECOVERY - LVS HTTP IPv4 on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 25520 bytes in 0.436 second response time [09:40:36] _joe_: yup [09:41:06] <_joe_> it looks like we didn't honour 
depool_threshold to me vgutierrez [09:41:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [09:42:08] <_joe_> look at the cpu usage here https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All [09:42:12] <_joe_> they all went down [09:42:48] <_joe_> so it seems pybal doesn't honour the depool threshold anymore when we set the backends to enabled=False [09:42:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:42:50] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:42:50] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:42:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:42:51] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:42:51] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:43:01] there is something weird going on on lvs2001 and pybal.log [09:43:11] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:43:14] it's empty since May 29. Did we change it or something ? [09:43:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:43:39] traffic stats are slow to be sent through; based on other stats it seems we are ok [09:43:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:44:48] or just there was nothing to log all this time.. weird [09:45:22] <_joe_> vgutierrez: yeah we definitely don't honour the depool threshold anymore when we disable servers [09:45:34] <_joe_> we used to, something has probably been changed in pybal recently [09:46:02] 09:36-09:41 [09:46:06] <_joe_> I can showcase this in codfw, where the mw api is not serving traffic if you want to reproduce [09:46:12] <_joe_> jynus: more or less, yes [09:46:31] based on 503s [09:47:17] <_joe_> jynus: nah i think that's inaccurate [09:47:28] I can check on logstash [09:47:29] _joe_: yeah.. you were too aggressive [09:47:30] Jun 26 09:37:04 lvs1016 pybal[24528]: [api_80] ERROR: Could not depool server mw1232.eqiad.wmnet because of too many down! 
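[editor's note] The "too many down" refusal above is the depool-threshold safety net being discussed. A minimal sketch of the idea follows, assuming the threshold is a fraction of the configured pool that must stay pooled. This is not PyBal's actual code: `Server`, `can_depool`, `depool` and the 0.5 value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical backend record; not PyBal's real data model."""
    name: str
    pooled: bool = True


def can_depool(servers, threshold):
    """Return True if one more server may be depooled.

    Assumed semantics: after the depool, the number of pooled servers
    must not drop below `threshold` times the size of the whole pool.
    """
    total = len(servers)
    pooled = sum(1 for s in servers if s.pooled)
    return (pooled - 1) >= total * threshold


def depool(servers, name, threshold):
    """Depool `name` unless that would cross the threshold."""
    for s in servers:
        if s.name == name and s.pooled:
            if not can_depool(servers, threshold):
                return False  # refused: too many servers already down
            s.pooled = False
            return True
    return False  # unknown server, or already depooled


# Illustrative run: four made-up hosts, threshold 0.5.
servers = [Server("mw-a"), Server("mw-b"), Server("mw-c"), Server("mw-d")]
results = [depool(servers, s.name, 0.5) for s in list(servers)]
```

Under these assumptions, a tool that asks to depool every backend in turn is refused once half the pool is gone, which is the protection _joe_ expected to apply to administratively disabled servers as well.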
[09:48:00] <_joe_> vgutierrez: yes, but that server was failing for other reasons [09:48:15] <_joe_> what I am saying is that in the past even if we disabled a server from etcd [09:48:20] there was a Notice: Undefined variable: wmgUseEventBus in /srv/mediawiki/wmf-config/CommonSettings.php [09:48:26] <_joe_> it would not get depooled if it crossed that threshold [09:48:27] much before this [09:48:35] _joe_: https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/404694/ [09:48:39] but it is only a notice, I guess [09:49:04] <_joe_> mark: I guess so, yes [09:49:20] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [09:49:21] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [09:49:22] <_joe_> the reason why I didn't scream to marko to stop was I was sure we had the safety net from pybal [09:49:27] or really https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/403677/ [09:50:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [09:50:31] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [09:50:37] <_joe_> mark: should we just revert that patch? 
[09:50:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [09:50:49] it seems a semantics issue [09:51:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [09:51:15] so.. https://phabricator.wikimedia.org/T184715#4286220 [09:51:22] https://phabricator.wikimedia.org/T184715 [09:51:23] sigh :) [09:51:24] <_joe_> it's pretty dangerous and I'm surprised it took so long before this caused an outage [09:52:26] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 2.0 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/427612 (https://phabricator.wikimedia.org/T27611) (owner: 10Gilles) [09:52:32] <_joe_> vgutierrez: https://phabricator.wikimedia.org/T184715#3896284 even :P [09:54:39] i would be inclined to say that the depool threshold is the main reason for the distinction between "server is in pybal pool but disabled" and "server is removed from the pool" [09:54:57] <_joe_> yes [09:54:59] so indeed, probably the older behavior is better? [09:55:19] <_joe_> I'm still checking production, I'll take a look once I'm done [09:55:23] sure [09:55:33] ack [09:55:53] <_joe_> ok, it looks like everything's all right at a first glance [09:56:18] <_joe_> mobrovac: did your error recover at least? :D [09:56:29] yes, it disappeared [09:56:33] sorry about this guys [09:56:40] completely unexpected outcome [09:56:40] <_joe_> mobrovac: not your fault [09:56:45] <_joe_> a double bug :D [09:57:02] <_joe_> can I ask you where was that functionality of scap documented? 
[09:57:12] yeah, 1) scap restart not working; 2) pybal not refusing to depool servers [09:57:55] <_joe_> mark: I'm not sure the old semantics was correct either? [09:58:04] _joe_: scap sync --help lists "-r, --restart Restart HHVM process on target hosts" [09:58:33] <_joe_> mobrovac: ok open a UBN! ticket for scap [09:58:38] k [09:58:39] <_joe_> about REMOVING that thing [09:58:43] yeah [09:58:45] <_joe_> :P [09:58:55] <_joe_> do we still have pybal-test? [09:59:06] <_joe_> we do [09:59:06] i believe so [09:59:14] <_joe_> yeah let me try something [10:01:21] !log downgrade erroneously upgraded apertium-fra-cat on scb1001, scb1002 from 1.3.0~r84327-1+wmf1 to 1.2.0~r78602-1+wmf2 T194342 [10:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:23] T194342: Update apertium-apy - https://phabricator.wikimedia.org/T194342 [10:05:08] kart_: I 'll resume the upgrade now [10:05:15] okay! [10:07:23] (03CR) 10Alexandros Kosiaris: [C: 032] Add python3-streamparser to apertium [puppet] - 10https://gerrit.wikimedia.org/r/442040 (https://phabricator.wikimedia.org/T194342) (owner: 10Alexandros Kosiaris) [10:09:06] 10Operations, 10Scap (Scap3-MediaWiki-MVP), 10Wikimedia-Incident: Scap sync --restart not working - https://phabricator.wikimedia.org/T198185#4314698 (10mobrovac) p:05Triage>03Unbreak! [10:10:13] !log proceed with apertium-apy upgrade on all scb hosts. 
T194342 [10:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:15] T194342: Update apertium-apy - https://phabricator.wikimedia.org/T194342 [10:10:27] !log test upgrade thumbor 2.0 on thumbor2001 [10:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:11] PROBLEM - apertium apy on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 2737: Connection refused [10:12:20] PROBLEM - apertium apy on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 2737: Connection refused [10:12:20] PROBLEM - DPKG on scb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:12:30] PROBLEM - DPKG on scb2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:12:58] What. [10:14:41] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:15:10] akosiaris: ^ [10:15:30] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:40] RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [10:16:41] RECOVERY - DPKG on scb2001 is OK: All packages OK [10:17:01] yeah expected [10:18:01] OK. [10:18:17] process is going on. Gonna take a while [10:18:24] I am extra careful with what just happened [10:19:50] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:20:45] <_joe_> I'm going to play with mw2258 for a bit, sorry for any noise [10:23:21] RECOVERY - apertium apy on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.076 second response time [10:23:40] RECOVERY - DPKG on scb2002 is OK: All packages OK [10:24:04] akosiaris: Thanks! [10:24:16] akosiaris: seems errors are stable in logs. 
[10:24:34] still ongoing btw [10:24:57] :) [10:25:23] finally done [10:25:40] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:25:52] kart_: ok I think we are done [10:26:39] cool. [10:31:27] (03CR) 10Elukey: [V: 032 C: 032] Change varnishkafka to use the new VUT API [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 (owner: 10R4q3NWnUx2CEhVyr) [10:40:15] !log upgrade thumbor to 2.0-1 on thumbor1001 [10:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:23] (03PS1) 10Elukey: Update wikimedia's copiright and current branch policy [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/442046 (https://phabricator.wikimedia.org/T177647) [10:47:19] (03PS2) 10Elukey: Update wikimedia's copiright and current branch policy [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/442046 (https://phabricator.wikimedia.org/T177647) [10:48:10] (03CR) 10Elukey: [C: 032] Update wikimedia's copiright and current branch policy [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/442046 (https://phabricator.wikimedia.org/T177647) (owner: 10Elukey) [10:51:02] (03PS1) 10Gilles: Add missing CWEBP_PATH definition [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/442047 (https://phabricator.wikimedia.org/T27611) [10:51:34] (03CR) 10Filippo Giunchedi: [C: 032] Add missing CWEBP_PATH definition [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/442047 (https://phabricator.wikimedia.org/T27611) (owner: 10Gilles) [10:53:06] Incident might have created some data loss? See https://phabricator.wikimedia.org/T198177 [10:54:11] ...and potentially also https://phabricator.wikimedia.org/T198175 I guess. 
[10:54:39] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: bootstrap keystone [puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196633) [10:57:59] andre__: can you ask to verify the issue still exists in both cases? Both seem to be temporary issues that are no longer happening, and they just seem worried about breaking things at the time [10:58:32] e.g. the supposed data loss doesn't seem to be happening, but maybe some cache issues on their side [10:58:40] or something I may be missing [10:58:52] jynus, sure, I'll asl [10:58:54] *ask [10:59:36] andre__: now that you're here, I mentioned you in a couple of tasks. In one security task a phab diffusion repo needs deleting. [11:00:17] you can never know, but db issues are unlikely to be done partially- however purges and cache issues can indeed happen [11:00:38] (03PS1) 10Arturo Borrero Gonzalez: hieradata: add eqiad1 keystone admin_token [labs/private] - 10https://gerrit.wikimedia.org/r/442063 (https://phabricator.wikimedia.org/T196633) [11:00:40] or job queue thing like category updates, wikidata propagations, etc. [11:00:54] Hauskatze: I am not sure what I'm supposed to answer...? [11:01:23] andre__: to which? :) [11:01:28] Hauskatze, to your line? [11:01:34] Hauskatze, what are you asking for? [11:01:45] which a null edit/purge should fix, or just wait until job queue retries [11:01:59] andre__: ok, let's try again :-) : I need a Diffusion repository deletion. [11:02:16] and I mentioned you in the private task [11:02:28] Hauskatze: and you ping me here because...? [11:02:36] Hauskatze: I don't handle Diffusion repository deletions. [11:03:04] andre__: 'cause you do have CLI access to do so -BUT- if you do not handle that stuff, then okay, idk.
[11:03:13] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] hieradata: add eqiad1 keystone admin_token [labs/private] - 10https://gerrit.wikimedia.org/r/442063 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:03:32] Hauskatze: oh, well, sorry! You are right, actually! I think I can, sorry! [11:03:45] np [11:04:20] Hauskatze: is deleting the repo really the recommended way? I mean, does the author have a backup? [11:04:31] plus this should probably go to #wikimedia-devtools instead. [11:04:49] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: eqiad1: bootstrap keystone [puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:05:01] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:05:06] andre__: I thought that channel was decommissioned? I don't know if the author has any backup. Probably we should ask in the task.
[11:05:31] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54544 MB (3% inode=99%) [11:05:50] Hauskatze: I'm not after creating data loss :P [11:05:59] ^No need to delete anything, just remove git content through the usual channels (commit) and change all secrets involved- I already did half of that for you [11:06:08] ok, thanks [11:06:16] jynus: ok then [11:06:31] tell the contributor to change their on-wiki password [11:06:45] and we will just need cloud to reenable the other account [11:06:59] I'm going afk for a meeting [11:07:25] s/meeting/call [11:08:12] !log upload thumbor 2.0-2 to apt.wikimedia.org [11:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:14] !log roll-upgrade thumbor in codfw/eqiad [11:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:47] andre__: Re: reports of users we should take all those seriously, but normally people are too worried they broke something, and we should just make sure they didn't, for the most cases :-) [11:16:33] right [11:17:50] RECOVERY - Disk space on maps1001 is OK: DISK OK [11:59:46] (03PS21) 10DCausse: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [11:59:59] (03PS6) 10DCausse: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) [12:00:04] addshore and leszek_wmde: Your horoscope predicts another unfortunate Enable WikibaseLexeme on Wikibase client wikis deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1200). [12:00:13] o/ [12:00:17] hi leszek_wmde [12:00:25] heh [12:00:28] leszek_wmde1: ! [12:01:07] here [12:01:14] silly pidgin [12:01:43] right then [12:02:08] oops, we need backport first, right addshore?
[12:02:14] leszek_wmde1: yup [12:02:22] to .8 and .999 [12:02:30] jouncebot: next [12:02:31] In 0 hour(s) and 57 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1300) [12:02:46] Ah right it is addshore again :) [12:02:50] o/ [12:02:56] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442080 [12:02:59] (03PS22) 10DCausse: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [12:03:01] (03PS7) 10DCausse: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) [12:03:03] (03PS6) 10DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) [12:03:05] * addshore and evil wmde scheduling all the eu daytime deploy slots [12:03:09] haha [12:03:17] can I deploy two mediawiki-config changes? 
[12:03:19] yup [12:03:38] \o/ [12:04:54] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442080 (owner: 10Marostegui) [12:05:15] leszek_wmde1: I made the cherry picks at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseLexeme/+/442084/ and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseLexeme/+/442083/ [12:05:32] waiting for CI now [12:06:14] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442080 (owner: 10Marostegui) [12:06:33] addshore: deploying my patch [12:06:57] marostegui: ack, mine wont be merged for another 15 mins probs [12:07:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3312 after alter table (duration: 00m 57s) [12:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442080 (owner: 10Marostegui) [12:10:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442086 (https://phabricator.wikimedia.org/T191316) [12:10:46] addshore: Will merge that one ^ and that should be it for now [12:10:51] awesome! [12:10:57] Thanks :) [12:11:22] _joe_: o/ I'm sure you're busy, but wanted to flag this comment to you: https://phabricator.wikimedia.org/T196547#4296506 [12:11:50] We can wait however long it takes to get more review, but I'd like to at least work out what that timeline might look like. 
[12:12:24] <_joe_> awight: yeah sorry most of the team is just back from the summit [12:12:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442086 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:13:02] <_joe_> I'll followup with them this week [12:13:15] _joe_: no worries, I hope it was fun :) [12:13:36] <_joe_> I wasn't there :/ [12:13:37] Nothing builds a team like locking them in a small space together. [12:13:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442086 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:14:24] hmmm, leszek_wmde1, failure? https://integration.wikimedia.org/ci/job/mwext-testextension-php70-composer-jessie/885/console [12:14:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105:3312 for alter table (duration: 00m 57s) [12:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:59] addshore: I am done! [12:15:04] !log Deploy schema change on db1105:3312 T191316 T192926 T89737 T195193 [12:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:08] marostegui: thanks! [12:15:09] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [12:15:09] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [12:15:09] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [12:15:09] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [12:15:48] addshore: oh dear, do we also need to backport some qualityconstraints commit? 
[12:16:13] leszek_wmde1: I'm really not sure what the failure is for [12:16:31] it looks like https://gerrit.wikimedia.org/r/#/c/442084/ for .999 is passing [12:16:35] maybe just needs a recheck? [12:16:43] addshore: Special:ConstraintReport [12:18:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442086 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:18:53] leszek_wmde1: hmm, qunit on the second one alos failed [12:18:54] addshore: maybe recheck helps. I reckon it might be related to https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseQualityConstraints/+/439981/ [12:18:59] waat [12:19:26] might be less work to wait until the code rolls out with the train this week, and just leave the bug open for 3/4 more days [12:19:38] fine with me [12:19:45] okay, lets do that then [12:19:48] what is it qunit failure saying? [12:20:17] leszek_wmde1: https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-selenium-jessie/20593/console [12:20:41] oooh [12:20:44] browser tests [12:20:57] let's hold up and go with the train [12:21:05] should be fixed by friday [12:28:29] (03CR) 10Alexandros Kosiaris: [C: 031] Query optimizations [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440659 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:29:26] (03CR) 10jerkins-bot: [V: 04-1] Query optimizations [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440659 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:39:21] (03CR) 10Volans: [V: 032 C: 032] Query optimizations [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440659 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:40:34] (03PS2) 10Volans: Fix table ordering on click [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442042 (https://phabricator.wikimedia.org/T167504) [12:41:39] (03CR) 10jerkins-bot: [V: 04-1] Fix table ordering on click [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/442042 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:42:07] (03CR) 10Volans: [V: 032 C: 032] Fix table ordering on click [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442042 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:53:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:53:10] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [12:54:18] that's graphite labs I believe [12:54:53] !log labcontrol1003 aptitude install ldap-utils python-ldappool [12:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:52] recovering now, I'm assuming overload [12:58:51] (03PS1) 10Elukey: profile::piwik::instance: allow hiera settings for trusted hosts [puppet] - 10https://gerrit.wikimedia.org/r/442092 [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:48] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11576/" [puppet] - 10https://gerrit.wikimedia.org/r/442092 (owner: 10Elukey) [13:00:49] o/ [13:01:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:01:50] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:02:00] (03PS2) 10Hashar: Add some namespace aliases to ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441988 (https://phabricator.wikimedia.org/T197058) (owner: 10Urbanecm) [13:02:16] (03CR) 10Hashar: [C: 032] Add some namespace aliases to ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441988 (https://phabricator.wikimedia.org/T197058) (owner: 10Urbanecm) [13:02:46] (03PS2) 10Hashar: Create a few of namespace aliases for ruwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440863 (https://phabricator.wikimedia.org/T197565) (owner: 10Urbanecm) [13:02:56] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440863 (https://phabricator.wikimedia.org/T197565) (owner: 10Urbanecm) [13:03:38] (03Merged) 10jenkins-bot: Add some namespace aliases to ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441988 (https://phabricator.wikimedia.org/T197058) (owner: 10Urbanecm) [13:03:55] (03CR) 10jenkins-bot: Add some namespace aliases to ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441988 (https://phabricator.wikimedia.org/T197058) (owner: 10Urbanecm) [13:04:26] (03Merged) 10jenkins-bot: Create a few of namespace aliases for 
ruwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440863 (https://phabricator.wikimedia.org/T197565) (owner: 10Urbanecm) [13:04:42] (03PS2) 10Hashar: Enable TemplateStyles on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440859 (https://phabricator.wikimedia.org/T197526) (owner: 10Urbanecm) [13:04:44] (03PS2) 10Hashar: Change logo resources for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440655 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:04:46] (03PS2) 10Hashar: Use uploaded HD logos for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440656 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:04:48] (03PS2) 10Hashar: Enable zh-my variant on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440651 (https://phabricator.wikimedia.org/T193983) (owner: 10星耀晨曦) [13:06:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:06:20] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:06:47] what [13:06:50] it hasn't even synced yet [13:06:52] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add some namespace aliases to ruwikiquote - T197058 (duration: 00m 56s) [13:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:54] T197058: Abbreviations for namespaces to ruwikiquote - https://phabricator.wikimedia.org/T197058 [13:08:23] 10Operations, 10Cloud-Services, 10Graphite: Graphite returning 500 @ nagf and 
graphite url - https://phabricator.wikimedia.org/T198209#4315338 (10Paladox) [13:09:20] (03CR) 10jenkins-bot: Create a few of namespace aliases for ruwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440863 (https://phabricator.wikimedia.org/T197565) (owner: 10Urbanecm) [13:11:45] for varnish esams and misc, looks like they were one time spikes [13:17:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:17:21] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:19:21] yeah it is graphite on labs [13:19:28] I'll take a lok [13:19:58] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create a few of namespace aliases for ruwiktionary - T197565 (duration: 00m 57s) [13:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] T197565: Aliases in the Russian Wiktionary - https://phabricator.wikimedia.org/T197565 [13:20:38] godog i think it is related to https://phabricator.wikimedia.org/T198209 [13:21:01] !log elastic@eqiad deleting stale index viwiki_general_1525301649 [13:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:39] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440859 (https://phabricator.wikimedia.org/T197526) (owner: 10Urbanecm) [13:21:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:21:42] paladox: indeed [13:21:50] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:22:24] !log elastic@eqiad: forcemerge (only_expunge_deletes=true) on wikidatawiki_content [13:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:56] (03Merged) 10jenkins-bot: Enable TemplateStyles on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440859 (https://phabricator.wikimedia.org/T197526) (owner: 10Urbanecm) [13:23:12] (03CR) 10jenkins-bot: Enable TemplateStyles on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440859 (https://phabricator.wikimedia.org/T197526) (owner: 10Urbanecm) [13:24:55] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440655 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:25:14] (03CR) 10Hashar: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440656 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:25:18] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable TemplateStyles on ruwikiquote - T197526 (duration: 00m 57s) [13:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] T197526: TemplateStyles in the Russian Wikiquote - https://phabricator.wikimedia.org/T197526 [13:26:05] (03Merged) 10jenkins-bot: Change logo resources for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440655 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:26:47] 10Operations, 10TemplateStyles, 10Traffic, 
10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4315414 (10hashar) [13:27:20] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [13:27:55] !log hashar@deploy1001 Synchronized static/images/project-logos: Change logo resources for ruwikiquote - T197508 (duration: 00m 56s) [13:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] T197508: Update Russian Wikiquote logo - https://phabricator.wikimedia.org/T197508 [13:28:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:28:22] (03CR) 10Hashar: [C: 032] Use uploaded HD logos for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440656 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:28:40] 10Operations, 10SRE-Access-Requests: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4315430 (10Milimetric) [13:28:47] (03CR) 10jenkins-bot: Change logo resources for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440655 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:29:35] !log purgeList.php for the various ruwikiquote logos [13:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:47] (03Merged) 10jenkins-bot: Use uploaded HD logos for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440656 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:31:23] !log hashar@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: Use uploaded HD logos for ruwikiquote - T197508 (duration: 00m 57s) [13:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:51] (03PS3) 10Ottomata: profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [13:32:32] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [13:32:50] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440651 (https://phabricator.wikimedia.org/T193983) (owner: 10星耀晨曦) [13:33:42] (03CR) 10jenkins-bot: Use uploaded HD logos for ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440656 (https://phabricator.wikimedia.org/T197508) (owner: 10Urbanecm) [13:34:03] (03Merged) 10jenkins-bot: Enable zh-my variant on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440651 (https://phabricator.wikimedia.org/T193983) (owner: 10星耀晨曦) [13:34:17] (03CR) 10jenkins-bot: Enable zh-my variant on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440651 (https://phabricator.wikimedia.org/T193983) (owner: 10星耀晨曦) [13:36:42] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable zh-my variant on zhwiki - T193983 (duration: 00m 57s) [13:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] T193983: Enable zh-my language localization at the Chinese Wikipedia - https://phabricator.wikimedia.org/T193983 [13:36:49] (03PS4) 10Ottomata: profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [13:37:00] Urbanecm: all your patches are deployed :] [13:37:05] !log European swat completed [13:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:28] i am still baby sitting 
the namespaceDupes.php run for ruwiktionary ( https://phabricator.wikimedia.org/T197565 ) [13:42:21] <_joe_> hashar: I just imagined you cuddling the php interpreter [13:42:41] <_joe_> "don't cry all those notices, my beloved!" [13:43:00] ;]]] [13:46:31] (03PS2) 10Aaron Schulz: Use ReplicatedBagOStuff for WAN cache on deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441881 [13:47:05] 10Operations: systemd-logind fails with result 'timeout' in db2093 and dns4001 - https://phabricator.wikimedia.org/T198215#4315520 (10Vgutierrez) [14:01:17] (03CR) 10Elukey: profile::hadoop::spark2: explicitly require hive client's config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [14:04:05] (03PS1) 10Rush: openstack: add notes to keystone bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/442106 [14:04:52] (03CR) 10Rush: [C: 032] openstack: add notes to keystone bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/442106 (owner: 10Rush) [14:05:32] (03PS5) 10Ottomata: profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [14:17:02] (03CR) 10Gehel: [C: 04-1] "mostly minor comments inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [14:17:06] (03CR) 10Ottomata: [C: 032] profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [14:17:34] (03PS6) 10Ottomata: profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [14:17:37] (03CR) 10Ottomata: [V: 032 C: 032] profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 (owner: 10Elukey) [14:19:13] (03PS3) 10Ottomata: Re-enable job and change-prop topic mirroring from main-eqiad -> main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/440386 
(https://phabricator.wikimedia.org/T197254) [14:19:23] !log Re-enable job and change-prop Kafka topic mirroring from main-eqiad -> main-codfw [14:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:29] (03CR) 10Ottomata: [V: 032 C: 032] Re-enable job and change-prop topic mirroring from main-eqiad -> main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/440386 (https://phabricator.wikimedia.org/T197254) (owner: 10Ottomata) [14:22:18] hashar, thank you a lot! [14:29:26] (03PS1) 10Volans: Set CSP header for all views [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442110 (https://phabricator.wikimedia.org/T167504) [14:30:24] (03CR) 10jerkins-bot: [V: 04-1] Set CSP header for all views [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442110 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [14:31:20] PROBLEM - Host labvirt1020 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:45] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4315664 (10Cmjohnson) @marostegui next week sometime is the best I can do right now. [14:35:05] (03PS1) 10Volans: debmonitor: remove CSP header now set upstream [puppet] - 10https://gerrit.wikimedia.org/r/442111 (https://phabricator.wikimedia.org/T191299) [14:35:42] (03CR) 10Volans: "To be merged after Idd8ed2053538142505121400c56c884d60ee9ac4 has been released in prod." [puppet] - 10https://gerrit.wikimedia.org/r/442111 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:39:21] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (watching): Stop and remove old job runners - https://phabricator.wikimedia.org/T198220#4315678 (10Pchelolo) p:05Triage>03Normal [14:39:45] bstorm_: I just put labvirt1020 in downtime until next week (I don't know if you're working on it or if it's someone else) [14:40:02] I'm not...I suspect DC ops might be. [14:40:11] Thanks, though! 
:) [14:41:10] Did it alert? I don't have one for it. [14:41:53] There is a pending task that needs to be done or four for it. [14:42:07] I don't think it paged, it just complained here in the channel [14:43:10] !log rm syslog.1.gz puppet.log.1.gz on tegment to fix cronspam [14:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] uff tegmen [14:43:41] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (watching): Stop and remove old job runners - https://phabricator.wikimedia.org/T198220#4315709 (10mobrovac) [14:43:43] Ah ok [14:43:48] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Stop and remove old job runners - https://phabricator.wikimedia.org/T198220#4315678 (10mobrovac) [14:44:59] (03PS3) 10Krinkle: Use ReplicatedBagOStuff for WAN cache on deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441881 (owner: 10Aaron Schulz) [14:45:04] (03PS2) 10Andrew Bogott: Beta: Add scap repository for dumps/dumps [puppet] - 10https://gerrit.wikimedia.org/r/441491 (owner: 10Thcipriani) [14:45:06] (03CR) 10Krinkle: [C: 032] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441881 (owner: 10Aaron Schulz) [14:45:51] (03CR) 10Andrew Bogott: [C: 032] Beta: Add scap repository for dumps/dumps [puppet] - 10https://gerrit.wikimedia.org/r/441491 (owner: 10Thcipriani) [14:46:38] (03Merged) 10jenkins-bot: Use ReplicatedBagOStuff for WAN cache on deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441881 (owner: 10Aaron Schulz) [14:46:53] (03CR) 10Andrew Bogott: [C: 032] "Sorry for the delay! Was away at the offsite." 
[puppet] - 10https://gerrit.wikimedia.org/r/441491 (owner: 10Thcipriani) [14:47:52] (03PS2) 10Andrew Bogott: shinkengen: Ignore instances that are turned off in Nova [puppet] - 10https://gerrit.wikimedia.org/r/440562 (owner: 10Alex Monk) [14:49:02] (03CR) 10Andrew Bogott: [C: 032] shinkengen: Ignore instances that are turned off in Nova [puppet] - 10https://gerrit.wikimedia.org/r/440562 (owner: 10Alex Monk) [14:49:11] (03CR) 10Andrew Bogott: [C: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/440562 (owner: 10Alex Monk) [14:49:28] (03CR) 10jenkins-bot: Use ReplicatedBagOStuff for WAN cache on deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441881 (owner: 10Aaron Schulz) [14:54:52] (03CR) 10Andrew Bogott: "I'm not convinced about the elimination of hieradata/labs.yaml -- common.yaml applies to MANY more hosts and it seems useful to only set t" [labs/private] - 10https://gerrit.wikimedia.org/r/423189 (owner: 10EddieGP) [14:58:31] (03PS1) 10Mforns: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) [15:00:53] 10Operations, 10ops-eqiad, 10DBA: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072#4315764 (10Marostegui) Sounds good! Just let us know with a day in advance so we can organize it :) Thanks! [15:00:56] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4315763 (10Cmjohnson) labvirt1020 Cabling to 10G ports completed Switch cfg completed Bios changes as per Faidon's instructions completed... [15:01:30] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4315765 (10Bstorm) Thanks! 
[15:03:08] (03PS2) 10Mforns: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) [15:03:46] (03CR) 10Mforns: [C: 04-1] "This shouldn't be merged until we put the whitelist inside refinery." [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [15:21:08] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4315808 (10Cmjohnson) labnet1003 cabled Updated BIOS per Faidon's instructions Updaetd the switch cfg xe-7/0/9 up up labnet1003 eth0 xe-7/0/19 up u... [15:27:21] 10Operations, 10SRE-Access-Requests: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4315824 (10Amire80) [15:29:36] (03PS1) 10Papaul: DNS: ADD DNS entries for authdns2001 [dns] - 10https://gerrit.wikimedia.org/r/442129 (https://phabricator.wikimedia.org/T196664) [15:36:19] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664#4315868 (10Papaul) [15:48:03] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4315907 (10Cmjohnson) labnet1004 cabled bios updated switch cfg updated xe-4/0/3 up up labnet1004 eth0 xe-4/0/46 up up labnet1004 eth1 [15:52:39] 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4315939 (10Cmjohnson) This is scheduled for this coming Friday 29/6/2018 at 1000(EST) [15:54:51] 10Operations, 10ops-eqiad, 10DC-Ops: Replace memory bank on scb1002 - https://phabricator.wikimedia.org/T196901#4272179 (10Cmjohnson) the hardware log does not show any indication of a bad DIMM I can probably pull from a decommissioned spare /admin1-> racadm getsel Record: 1 
Date/Time: 01/11/201... [15:55:38] 10Operations, 10ops-eqiad, 10DC-Ops: Replace memory bank on scb1002 - https://phabricator.wikimedia.org/T196901#4315952 (10Cmjohnson) @joe can you stress the DIMM? A simple reseating of the DIMM may also work. Let me know if I can power it down and do that. Thanks [16:00:04] godog, moritzm, and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:12:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission osm-web100[1-4] - https://phabricator.wikimedia.org/T182033#4316014 (10Cmjohnson) [16:12:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission osm-web100[1-4] - https://phabricator.wikimedia.org/T182033#3810615 (10Cmjohnson) 05Open>03Resolved [16:14:17] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-cp100[1-4] - https://phabricator.wikimedia.org/T182034#4316035 (10Cmjohnson) [16:14:23] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-cp100[1-4] - https://phabricator.wikimedia.org/T182034#3810632 (10Cmjohnson) 05Open>03Resolved [16:15:42] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442138 [16:17:23] 10Operations, 10Discovery-Search, 10hardware-requests: replace elastic2001-2024 (codfw) with newer servers - https://phabricator.wikimedia.org/T198169#4316079 (10EBernhardson) Increased disk makes sense to me. I would perhaps lean towards 4x 800GB over 2x 1.6TB, but it's probably not a big deal. What i'm thi... 
[16:18:53] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442138 (owner: 10Marostegui) [16:20:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442138 (owner: 10Marostegui) [16:20:22] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442138 (owner: 10Marostegui) [16:21:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decommission mobile 1004 and mobile1005 - https://phabricator.wikimedia.org/T181750#4316110 (10Cmjohnson) [16:21:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decommission mobile 1004 and mobile1005 - https://phabricator.wikimedia.org/T181750#3800915 (10Cmjohnson) 05Open>03Resolved [16:21:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105:3312 after alter table (duration: 00m 57s) [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:56] 10Operations, 10Maps-Sprint, 10Maps (Maps-data): Monitor PostgreSQL connection slots - https://phabricator.wikimedia.org/T168767#4316180 (10Pnorman) p:05Normal>03Low [16:27:06] (03PS1) 10Subramanya Sastry: Replace Tidy with RemexHtml everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442142 (https://phabricator.wikimedia.org/T175706) [16:27:59] (03CR) 10Subramanya Sastry: "For swat deploy on July 5th." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/442142 (https://phabricator.wikimedia.org/T175706) (owner: 10Subramanya Sastry) [16:32:35] (03CR) 10EBernhardson: Prep work for multi-instance elasticsearch refactor (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [16:36:49] 10Operations, 10SRE-Access-Requests: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316237 (10RobH) [16:41:35] (03PS4) 10EddieGP: Remove labs//common.yaml hiera path [labs/private] - 10https://gerrit.wikimedia.org/r/423189 [16:42:02] (03PS9) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 [16:42:04] (03PS8) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 [16:42:06] (03PS11) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 [16:42:08] (03PS14) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [16:42:12] (03PS42) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [16:42:22] (03PS3) 10EddieGP: cloud hiera: Remove unused paths from hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/423190 [16:43:41] 10Operations, 10SRE-Access-Requests: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316269 (10RobH) a:03Amire80 As part of clinic duty, I've appended in the checklist. In reviewing those steps, I don't see @amire80's signature on the L3 document. @amire80: Plea... 
[16:44:10] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#4316274 (10Cmjohnson) [16:44:12] 10Operations, 10SRE-Access-Requests: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316275 (10RobH) p:05Triage>03Normal [16:44:26] 10Operations, 10DBA, 10decommission, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4316279 (10Cmjohnson) [16:44:29] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4316278 (10Cmjohnson) [16:45:48] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4316287 (10Cmjohnson) [16:45:52] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#3967299 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [16:47:25] (03PS1) 10RobH: adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442143 (https://phabricator.wikimedia.org/T198211) [16:47:41] Anyone from the security team around here? [16:48:49] (03CR) 10EddieGP: [C: 031] Fix en-rtl in Special:SiteMatrix in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441422 (https://phabricator.wikimedia.org/T195675) (owner: 10C. 
Scott Ananian) [16:50:52] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#4316326 (10jcrespo) [16:51:36] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#3013769 (10jcrespo) CC @vgutierrez pending work regarding TLS rollout FYI [16:51:40] (03PS1) 10RobH: adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442144 (https://phabricator.wikimedia.org/T198211) [16:52:16] (03Abandoned) 10RobH: adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442144 (https://phabricator.wikimedia.org/T198211) (owner: 10RobH) [16:52:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316339 (10RobH) [16:52:59] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4315430 (10RobH) [16:53:39] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission unused host wmf3565 - https://phabricator.wikimedia.org/T190225#4316341 (10Cmjohnson) [16:53:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4316343 (10Cmjohnson) [16:53:52] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission unused host wmf3565 - https://phabricator.wikimedia.org/T190225#4066759 (10Cmjohnson) 05Open>03Resolved [16:55:26] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316352 (10Amire80) My manager is @Arrbee. I have shell access and I use other servers, and I have a faint recollection of having a manager approval already, bu... 
[16:58:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316390 (10RobH) >>! In T198211#4316352, @Amire80 wrote: > My manager is @Arrbee. I have shell access and I use other servers, and I have a faint recollection o... [16:58:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316401 (10RobH) a:05Amire80>03Arrbee [16:59:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4315430 (10RobH) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor I ♥ Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1700). [17:00:05] (03PS1) 10WMDE-Fisch: Change FileImporter config data location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) [17:02:08] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 3 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4316415 (10Pnorman) We don't see anything for the maps team to do on this - at the very least, we don't think it needs any resources from us. [17:03:08] (03CR) 10Jforrester: Replace Tidy with RemexHtml everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442142 (https://phabricator.wikimedia.org/T175706) (owner: 10Subramanya Sastry) [17:04:29] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4316426 (10RobH) I've emailed both @faidon and @mark for how to handle this, since it was merged without the meeting approval (at...
[17:16:18] !log starting branch cut for 1.32-wmf.10 [17:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:05] (03CR) 10Gehel: Prep work for multi-instance elasticsearch refactor (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [17:36:51] 10Operations, 10DBA, 10decommission, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4316559 (10Marostegui) 05Open>03Resolved All the hosts are now totally decommissioned. So this is all done! Thanks everyone for getting all these hosts decommissioned!! [17:44:31] (03CR) 10Niharika29: [C: 031] "Let's SWAT this later today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143) (owner: 10MusikAnimal) [17:47:32] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Change FileImporter config data location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [17:47:37] (03CR) 10Dvorapa: [C: 031] Fix en-rtl in Special:SiteMatrix in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441422 (https://phabricator.wikimedia.org/T195675) (owner: 10C. 
Scott Ananian) [17:50:10] (03PS1) 10Bstorm: Change labvirt1020 MAC [puppet] - 10https://gerrit.wikimedia.org/r/442158 (https://phabricator.wikimedia.org/T194964) [17:53:10] (03PS1) 10Dduvall: Group0 to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442160 [17:58:16] (03CR) 10Framawiki: prometheus: tools: scrape paws metrics into prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: 10Chico Venancio) [17:59:05] !log dduvall@deploy1001 Started scap: testwiki to php-1.32.0-wmf.10 and rebuild l10n cache [17:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1800) [18:00:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4315430 (10Krenair) > which I think should imply being part of researchers for what it's worth. Well at that point what exactly distinguishes it from researchers? [18:01:28] !log dduvall@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_2905568126" --threads=30 --lang en --quiet' returned non-zero exit status 255 (duration: 02m 23s) [18:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:55] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316677 (10Arrbee) >>! In T198211#4316390, @RobH wrote: > > Expansions typically still require manager approval. Thanks for signing the L3! Hello, please cons... 
[18:04:05] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4316678 (10Arrbee) a:05Arrbee>03None [18:05:03] (03PS3) 10Chico Venancio: prometheus: tools: scrape paws metrics into prometheus [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) [18:05:27] (03CR) 10jerkins-bot: [V: 04-1] prometheus: tools: scrape paws metrics into prometheus [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: 10Chico Venancio) [18:05:33] (03CR) 10Bstorm: [C: 032] Change labvirt1020 MAC [puppet] - 10https://gerrit.wikimedia.org/r/442158 (https://phabricator.wikimedia.org/T194964) (owner: 10Bstorm) [18:07:02] (03CR) 10Alex Monk: [WIP] Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [18:08:39] (03CR) 10Alex Monk: [WIP] Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [18:14:09] !log scap sync for testwiki is currently failing on l10update due to MwEmbedSupport's deprecation but no change yet to mediawiki/extensions to remove the extension. 
blocking train on T197918 [18:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:12] T197918: Archive the MwEmbedSupport extension - https://phabricator.wikimedia.org/T197918 [18:14:16] (03PS4) 10Chico Venancio: prometheus: tools: scrape paws metrics into prometheus [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) [18:14:31] greg-g: ^ fyi [18:14:58] marxarelli: crap [18:15:04] James_F: ^ [18:15:43] fwict, we just need a patch to mediawiki/extensions to remove the extension [18:20:25] (03CR) 10Alex Monk: [WIP] Central certificates service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [18:20:38] Meh. [18:20:50] RECOVERY - Host labvirt1020 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [18:20:59] We can drop the i18n right now. [18:21:36] It's just the name of the extension for Special:Version. [18:21:38] wait, MwEmbedSupport _was_ removed from .gitmodules [18:21:49] Yeah, it shouldn't have branched it. [18:22:08] Oh, but there's one extension-list for all hetdeploy versions? [18:22:30] i'm not sure where l10update gets its extension list [18:22:33] thcipriani: ^ ? [18:23:15] extension list is generated and put into wmf-config/ExtensionMessages-[version] iirc [18:23:57] 10Operations, 10fundraising-tech-ops, 10netops: new pfw policy for monitor server - https://phabricator.wikimedia.org/T198237#4316809 (10cwdent) [18:23:58] oh [18:24:04] I have a vague memory of this happening before [18:24:06] * thcipriani digs [18:24:15] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4196993 (10pmiazga) Time to see how service performs with new settings/changes [18:24:31] i'll take vague over none [18:24:46] what's the error message you're seeing [18:24:47] ? 
[18:24:58] thcipriani: https://phabricator.wikimedia.org/T191056 [18:25:16] oh crap. my full copy/paste didn't work [18:25:27] tmux in tmux blues. sec [18:26:07] I was just coming back to say...looks normal [18:26:28] (Also 1.31.0-wmf.28 should probably get deleted.) [18:27:00] aside from that, yeah [18:28:44] thcipriani: https://phabricator.wikimedia.org/T191056#4316774 [18:30:51] https://phabricator.wikimedia.org/T125678 [18:31:31] marxarelli: so from the looks of it twentyafterfour added support for an extension-list file inside the php-[version] directory that is checked before the global wmf-config/extension-list file [18:32:06] yeah that's untested though [18:32:15] oh good :) [18:32:20] it should work [18:32:25] twentyafterfour: how have you worked around this in the past? [18:32:53] thcipriani: there's an explicit call to wfLoadExtension( 'MwEmbedSupport' ); in CommonSettings.php [18:33:16] thcipriani: never have been able to work around it properly [18:33:23] conditional on wmgUseMwEmbedSupport [18:33:29] line 904 [18:33:31] there is always a race between wmf-config and the branches [18:34:07] Yes, see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/441518 etc. [18:34:17] so... add `$wmgUseMwEmbedSupport = true` to LocalSettings.php? :) [18:34:17] But we can't merge those before wmf.10 is everywhere. [18:34:18] marxarelli: but it's wrapped in if ( $wmgUseMwEmbedSupport ) [18:34:24] James_F: right [18:34:43] so, maybe add it to LocalSettings.php until after group2 deploy [18:34:52] I guess we could set wmgUseMwEmbedSupport to false for group0 right now, and scap quickly so that group0 only loses video for a bit? [18:34:55] then remove it and merge the patch [18:35:14] Hmm, no, because TMH would PHP fatal. [18:35:37] it would? for 1.32.0-wmf.10? [18:35:41] For wmf.8 [18:35:58] We need to set wmgUseMwEmbedSupport to false for wmf.10 but true for wmf.8 somehow. 
[18:36:07] right, i'm saying add the override to php-1.32.0-wmf.10/LocalSettings.php [18:36:15] Oh, yeah, that should work. [18:36:49] cool. thcipriani, twentyafterfour: ^ seem (not in)sane? [18:37:24] seems sane [18:37:41] ^ [18:37:55] alright [18:38:21] I'll rephrase to: seems like it'll work :) [18:39:30] Does LocalSettings.php get executed before or after InitialiseSettings.php? [18:39:41] !log setting $wmgUseMwEmbedSupport = false in php-1.32.0-wmf.10/LocalSettings.php to extension registry exception (see T197918 and T191056) [18:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:44] T191056: 1.32.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T191056 [18:39:45] T197918: Archive the MwEmbedSupport extension - https://phabricator.wikimedia.org/T197918 [18:40:15] hrm, no wait, maybe it'll get overridden :( [18:40:18] James_F: good question. it should be after [18:40:24] but maybe not [18:40:25] James_F: after? but it includes CommonSettings which is where the exception originates [18:40:29] hmm [18:40:30] marxarelli: Looking at the top of https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php it's "both". [18:40:41] oh yes, it's both that's right [18:40:50] :D [18:40:54] huh. [18:40:56] I remember finding some crazy circular stuff in that code [18:41:01] mediawiki/LocalSettings.php calls CommonSettings.php which calls InitialiseSettings.php; we need to over-ride after Init and before Common. [18:41:37] So setting only in LocalSettings won't work? [18:42:13] if ( $wmgVersionNumber == etc [18:42:28] Add $wgIgnoreMwEmbed in LocalSettings, set true, change CommonSettings to be $wgUseMwEmbedSupport && !$wgIgnoreMwEmbed then? [18:42:48] Or just wmgVersioNumber, just as much of a hack. [18:43:14] your version seems like less of a hack though :) [18:43:26] Want me to write a patch? [18:43:36] sure [18:43:41] Hmm. 
Maybe better to leave to marxarelli as it'll need live hacking prod… [18:43:42] James_F: would be great [18:43:47] OK, one second. [18:44:55] (03PS3) 10Aaron Schulz: Make mediawiki.org write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440469 [18:45:14] maybe do !(isset($wgIgnoreMwEmbed) && $wgIgnoreMwEmbed) to avoid an uptick in E_NOTICEs? [18:45:37] Just isset is enough. [18:45:42] ah true [18:46:36] I still think we may run into an additional problem with the extension-list that I was trying to solve initially. Not entirely sure though. [18:46:39] (03PS1) 10Jforrester: Be able to over-ride wmgUseMwEmbedSupport from LocalSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442163 [18:46:45] Can't do a LocalSettings.php patch from gerrit, have to do it locally, right? [18:47:07] James_F: i can do that once we have the config patch merged [18:47:12] Kk. [18:47:19] ^^ [18:48:05] I think that's correct since LocalSettings is generated [18:48:20] And has private stuff in it. 
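[editor's note] The load-order problem discussed above (LocalSettings.php pulls in CommonSettings.php, which pulls in InitialiseSettings.php) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the contents of the merged patch: the two functions are hypothetical stand-ins for the real settings files, and the extension load is replaced by appending to an array instead of calling wfLoadExtension(). Only $wmgUseMwEmbedSupport and $wgIgnoreMwEmbed are names taken from the conversation.

```php
<?php
// Stand-in for the per-branch php-<version>/LocalSettings.php (hypothetical contents).
$wmgUseMwEmbedSupport = false; // naive override: gets overwritten further down
$wgIgnoreMwEmbed = true;       // separate flag, as proposed in the channel

// Hypothetical stand-in for InitialiseSettings.php: it applies the per-wiki
// defaults and so clobbers any wmg* value that LocalSettings.php set earlier.
function applyInitialiseSettings(): void {
    global $wmgUseMwEmbedSupport;
    $wmgUseMwEmbedSupport = true;
}

// Hypothetical stand-in for CommonSettings.php: it runs InitialiseSettings
// first, then decides which extensions to load.
function applyCommonSettings(): array {
    global $wmgUseMwEmbedSupport, $wgIgnoreMwEmbed;
    applyInitialiseSettings();

    $loaded = [];
    // Guard in the spirit of "just isset is enough": the mere presence of the
    // flag suppresses the load, and isset() raises no notice when it is absent.
    if ( $wmgUseMwEmbedSupport && !isset( $wgIgnoreMwEmbed ) ) {
        $loaded[] = 'MwEmbedSupport'; // real code would call wfLoadExtension() here
    }
    return $loaded;
}

var_dump( applyCommonSettings() );
```

In this sketch the direct assignment to $wmgUseMwEmbedSupport has no effect because the InitialiseSettings stand-in resets it, while the separate flag survives because nothing else touches it. A branch whose LocalSettings.php never defines the flag keeps loading the extension as before.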
[18:48:43] (03PS10) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 [18:48:45] (03PS9) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 [18:48:47] (03PS12) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 [18:48:49] (03PS15) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [18:48:51] (03PS43) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [18:49:24] (03PS2) 10Jforrester: Stop loading the MwEmbedSupport extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441518 [18:49:26] (03PS2) 10Jforrester: Stop loading the MwEmbedSupport extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441519 [18:49:28] (03PS2) 10Jforrester: Stop loading the MwEmbedSupport extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441520 [18:49:30] (03PS2) 10Jforrester: Stop loading the MwEmbedSupport extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441521 [18:49:42] (Just a rebase of that stack, not needed to unblock life now.) 
[18:50:02] (03PS4) 10Aaron Schulz: Make mediawiki.org write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440469 [18:50:41] (03CR) 10EBernhardson: Prep work for multi-instance elasticsearch refactor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [18:50:57] (03PS2) 10Dduvall: Be able to over-ride wmgUseMwEmbedSupport from LocalSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442163 (https://phabricator.wikimedia.org/T191056) (owner: 10Jforrester) [18:51:25] ^ just referencing the deployment task [18:51:26] marxarelli: Oh, yeah, should have tagged it [18:51:29] James_F: thanks! [18:52:24] (03CR) 10Dduvall: [C: 032] Be able to over-ride wmgUseMwEmbedSupport from LocalSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442163 (https://phabricator.wikimedia.org/T191056) (owner: 10Jforrester) [18:53:40] (03Merged) 10jenkins-bot: Be able to over-ride wmgUseMwEmbedSupport from LocalSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442163 (https://phabricator.wikimedia.org/T191056) (owner: 10Jforrester) [18:53:56] (03CR) 10jenkins-bot: Be able to over-ride wmgUseMwEmbedSupport from LocalSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442163 (https://phabricator.wikimedia.org/T191056) (owner: 10Jforrester) [18:54:18] (03PS1) 10Rush: openstack: addition keystone bootstrap instructions [puppet] - 10https://gerrit.wikimedia.org/r/442166 [18:54:33] Hmm. LocalSettings.php says it's Web-viewable, but it's .gitignored from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/wmf/1.32.0-wmf.10 [18:55:34] * James_F shrugs. [19:00:04] marxarelli: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T1900). 
[19:02:05] !log dduvall@deploy1001 Started scap: testwiki to php-1.32.0-wmf.10 and rebuild l10n cache [19:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:11] * James_F crosses fingers. [19:04:23] * thcipriani too [19:04:35] so far so good [19:04:42] Now I want to peer into marxarelli's `screen` session. :-) [19:05:25] thar be dragons :) [19:05:39] "19:03:56 Updating LocalisationCache for 1.32.0-wmf.10 using 30 thread(s)" is cranking away [19:05:48] neat [19:06:08] * thcipriani wanders away [19:09:28] marxarelli: Sorry that I forgot about scap's funniness. :-( [19:11:00] James_F: oh, no worries. TIL more about MW initialization/settings :) [19:11:40] and fatalmonitor is currently happy so i'm happy [19:12:26] Yay. [19:17:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211#4317129 (10RobH) As long as no objections are noted, I'll merge this request live on Friday, June 29th, 2018. 
[19:28:55] (03PS11) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 [19:28:57] (03PS10) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 [19:28:59] (03PS13) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 [19:29:01] (03PS16) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 [19:29:03] (03PS44) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [19:36:50] PROBLEM - Disk space on mwdebug2002 is CRITICAL: DISK CRITICAL - free space: / 1107 MB (2% inode=71%) [19:41:17] !log dduvall@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.10 and rebuild l10n cache (duration: 39m 11s) [19:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:11] !log cdb-rebuild failed on mwdebug2002 due to lack of disk space [19:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:55] mwdebug2002 again? [19:59:55] the sync was successful aside from the cdb update on mwdebug2002 due to it falling over [20:00:08] so i'm going to move ahead with group0 [20:01:57] 10Operations, 10Cloud-Services, 10Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138#4317285 (10Krenair) >>! In T198138#4314012, @demon wrote: >>>! In T198138#4313147, @Krenair wrote: >> Might be better to ensure that all privileged users at least know that they... 
[20:21:14] !log dduvall@deploy1001 Pruned MediaWiki: 1.32.0-wmf.4 [keeping static files] (duration: 03m 21s)
[20:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:24] !log manually cleaning up on mwdebug2002 due to full disk and scap failures
[20:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:10] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54500 MB (3% inode=99%)
[20:27:12] (PS2) Thcipriani: Scap clean: remove remote cache directory [mediawiki-config] - https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030)
[20:29:31] RECOVERY - Disk space on maps1001 is OK: DISK OK
[20:34:31] RECOVERY - Disk space on mwdebug2002 is OK: DISK OK
[20:34:53] !log removed /srv/mediawiki/php-1.32.0-wmf.4 on mwdebug2002 to free disk space
[20:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:25] Operations, Cloud-Services, Cloud-VPS, Shinken, Graphite: Clean up labs graphite datapoints - https://phabricator.wikimedia.org/T111540#4317390 (Krenair) Can someone archive deployment-prep.deployment-tin.diskspace._mnt.byte_percentfree ?
The mount no longer exists
[20:38:25] !log regenerated debian installer for jessie 8.11 point release (released 3 days ago)
[20:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:33] (PS1) Krinkle: Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/442198
[20:47:43] (PS2) Krinkle: Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/442198 (https://phabricator.wikimedia.org/T195314)
[20:47:48] !log removing orphaned /srv/mediawiki/php-1.31.0-wmf.28 dir on deploy1001
[20:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:43] (PS3) MusikAnimal: Enable Draft namespace and AfC mode for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143)
[20:50:04] (PS2) Andrew Bogott: deployment-prep: Fix shinken check for Citoid [puppet] - https://gerrit.wikimedia.org/r/440561 (owner: Alex Monk)
[20:50:19] (CR) Imarlier: [C: 1] "One small quibble, LGTM whether that change is made or no."
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/442198 (https://phabricator.wikimedia.org/T195314) (owner: Krinkle)
[20:50:50] (CR) Andrew Bogott: [C: 2] deployment-prep: Fix shinken check for Citoid [puppet] - https://gerrit.wikimedia.org/r/440561 (owner: Alex Monk)
[20:53:04] (PS3) Krinkle: Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/442198
[20:53:45] (CR) Krinkle: Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/442198 (owner: Krinkle)
[20:53:49] (CR) Krinkle: [C: 2] Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/442198 (owner: Krinkle)
[20:55:03] (Merged) jenkins-bot: Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/442198 (owner: Krinkle)
[20:55:59] !log dduvall@deploy1001 Started scap: testwiki to php-1.32.0-wmf.10 and rebuild l10n cache
[20:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:16] (CR) jenkins-bot: Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/442198 (owner: Krinkle)
[21:11:26] (PS11) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894
[21:11:28] (PS14) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[21:11:30] (PS17) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[21:11:32] (PS45) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[21:12:43] (CR) 
jerkins-bot: [V: -1] prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321 (owner: EBernhardson)
[21:18:50] !log dduvall@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.10 and rebuild l10n cache (duration: 22m 50s)
[21:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:31] (PS15) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[21:20:33] (PS18) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[21:20:35] (PS46) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[21:20:42] (PS2) Dduvall: Group0 to 1.32.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/442160
[21:23:12] (CR) Dduvall: [C: 2] Group0 to 1.32.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/442160 (owner: Dduvall)
[21:24:32] (Merged) jenkins-bot: Group0 to 1.32.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/442160 (owner: Dduvall)
[21:24:53] Operations, Cloud-Services, Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138#4317516 (demon) Teaching to not use `-A` is good. Setting it in your `~/.ssh/config` is useful in case it's not disabled system-wide by default :)
[21:26:09] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Group0 to 1.32.0-wmf.10
[21:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:39] (CR) jenkins-bot: Group0 to 1.32.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/442160 (owner: Dduvall)
[21:32:03] Train all going well? 
:)
[21:40:39] addshore: a few small but annoying bumps but all seems well now
[21:52:17] modules/docker/manifests/engine.pp: keyfile => 'puppet:///modules/docker/docker.gpg',
[21:52:34] # find -name docker.gpg
[21:52:34] #
[21:54:07] (PS2) Rush: openstack: addition keystone bootstrap instructions [puppet] - https://gerrit.wikimedia.org/r/442166
[21:55:19] marxarelli: awesome!
[21:56:55] (CR) Krinkle: [C: 1] "LGTM. Though would prefer that before we flip all non-test wikis (currently next commit), that we first straighten the naming difference b" [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (owner: Aaron Schulz)
[22:05:54] (CR) C. Scott Ananian: [C: 1] Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[22:14:39] Operations, Performance-Team, monitoring, Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4317640 (Krinkle)
[22:15:51] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4317643 (Krenair) ```lang=diff,name=crappy cherry-picked hack to try to get npm installed and puppet happy diff --...
[22:24:25] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4317653 (Krenair) I've armed keyholder on the new host. So what next @thcipriani?
[22:33:03] Operations, Performance-Team, monitoring, Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4317658 (Krinkle)
[22:35:18] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4317659 (Bstorm) >>! In T194964#4305423, @ayounsi wrote: >>>! In T194964#4298470, @Bstorm wrote: >> The bad, for some reason, even thoug...
[22:37:20] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.12 seconds
[22:42:12] Anyone recognise 10.4.0.58 ?
[22:42:18] Maybe an old .pmtpa.wmflabs IP?
[22:42:55] (PS1) Volans: Logging: avoid duplicate logging [software/debmonitor] - https://gerrit.wikimedia.org/r/442214 (https://phabricator.wikimedia.org/T167504)
[22:42:57] (PS1) Volans: Client CLI: silence external loggers [software/debmonitor] - https://gerrit.wikimedia.org/r/442215 (https://phabricator.wikimedia.org/T191300)
[22:42:59] (PS1) Volans: Client CLI: send also empty updates to the server [software/debmonitor] - https://gerrit.wikimedia.org/r/442216 (https://phabricator.wikimedia.org/T191300)
[22:43:01] (PS1) Volans: Client CLI: distinguish the type of update further [software/debmonitor] - https://gerrit.wikimedia.org/r/442217 (https://phabricator.wikimedia.org/T191300)
[22:43:36] and now the -1s.. 
as CI still have a very old tox :(
[22:44:00] (CR) jerkins-bot: [V: -1] Logging: avoid duplicate logging [software/debmonitor] - https://gerrit.wikimedia.org/r/442214 (https://phabricator.wikimedia.org/T167504) (owner: Volans)
[22:44:02] (CR) jerkins-bot: [V: -1] Client CLI: silence external loggers [software/debmonitor] - https://gerrit.wikimedia.org/r/442215 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[22:44:07] (CR) jerkins-bot: [V: -1] Client CLI: send also empty updates to the server [software/debmonitor] - https://gerrit.wikimedia.org/r/442216 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[22:44:14] (CR) jerkins-bot: [V: -1] Client CLI: distinguish the type of update further [software/debmonitor] - https://gerrit.wikimedia.org/r/442217 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[22:44:22] yeah here we are
[22:44:25] ; 10.4.0.0/21 - guest VMs subnet
[22:44:35] Operations, Performance-Team, monitoring, Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4317706 (Krinkle) Note to self with regards to the perf-site. The site's Apache configuration ([src](https://github.com/wikimedia/puppet/blo...
[22:44:39] alright that explains that I guess
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180626T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:14:05] (PS12) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894
[23:14:07] (PS16) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[23:14:09] (PS19) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[23:14:11] (PS47) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[23:15:56] (PS12) EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - https://gerrit.wikimedia.org/r/440498
[23:16:08] (PS13) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894
[23:16:19] (PS17) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[23:16:29] (PS20) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[23:16:40] (PS48) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[23:21:27] * ebernhardson spams it up
[23:21:35] (PS13) EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - https://gerrit.wikimedia.org/r/440498
[23:21:37] (PS14) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894
[23:21:39] (PS18) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[23:21:41] (PS21) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[23:21:43] (PS49) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[23:40:30] 
RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 6.01 seconds
[23:46:17] Operations, Scap, Wikimedia-Incident: Update Debian Package for Scap3 to 3.8.3-1 - https://phabricator.wikimedia.org/T198277#4317855 (thcipriani) p:Triage>Unbreak!
[23:47:09] (PS1) Thcipriani: Scap: Bump version to 3.8.3-1 [puppet] - https://gerrit.wikimedia.org/r/442226 (https://phabricator.wikimedia.org/T198277)
[23:59:17] (PS1) EBernhardson: Improve diff output with sorting [software/puppet-compiler] - https://gerrit.wikimedia.org/r/442228