[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190308T0000).
[00:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:43] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga2001 is OK: (C)60 le (W)70 le 74.09 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:00:54] \o
[00:01:26] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2001.wikimedia.org,service=pdns_recursor
[00:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:07] I can SWAT
[00:05:00] thanks thcipriani
[00:05:21] thcipriani: 2 are beta cluster so should be pretty easy :)
[00:05:31] 10Operations, 10netops: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (10ayounsi) After changing the port range to IANA recommended range and restarting Bird, we can see the BFD packets leaving from the proper port: `IP dns2001.wikimedia.org.55170 > cr1-codfw.wikimedia.org.4784: UDP, length 24`...
[00:05:53] jdlrobson: does this need to be a change in IS-labs.php? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/495023
[00:06:54] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:07:13] (03CR) 10Jdlrobson: [C: 04-1] Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:07:17] yess eeke
[00:07:21] thanks for sanity checking me
[00:07:36] fixing now
[00:07:39] cool, thanks
[00:08:56] (03PS2) 10Jdlrobson: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023
[00:09:02] corrected!
[00:09:49] thansk
[00:09:51] *thanks
[00:10:26] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:11:38] (03Merged) 10jenkins-bot: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:11:53] (03PS2) 10Thcipriani: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:11:59] (03CR) 10Thcipriani: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:12:05] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:12:51] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Krinkle) I don't have input on the numbers, but I do think we need a p99 and max as well...
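[Editor's note: the `conftool action` SAL entry above is the audit trail confctl writes when a backend is (re)pooled. A minimal sketch of the CLI side, using the same selector syntax the log records; any flags beyond what the log shows are assumptions:]

```
# Pool dns2001's pdns_recursor service (this is what generated the SAL line above)
confctl select 'name=dns2001.wikimedia.org,service=pdns_recursor' set/pooled=yes

# The inverse, for depooling:
confctl select 'name=dns2001.wikimedia.org,service=pdns_recursor' set/pooled=no

# Inspect current state before/after:
confctl select 'name=dns2001.wikimedia.org' get
```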
[00:13:01] (03Merged) 10jenkins-bot: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:13:05] for some reason in the new gerrit UI it's harder for me to tell if things can merge or if they need a rebase.
[00:15:41] (03CR) 10jenkins-bot: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:15:45] (03CR) 10jenkins-bot: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:16:09] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:495024|Cleanup beta cluster config]] T213599; [[gerrit:495023|Enable advanced mobile contributions mode on beta cluster]] beta-only (noop) sync (duration: 00m 49s)
[00:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:12] T213599: Clean up beta cluster mobile configuration - https://phabricator.wikimedia.org/T213599
[00:18:01] jdlrobson: minervaneue update is live on mwdebug1002, check please
[00:18:16] Sweet! Checking!
[00:19:44] fix confirmed! please sync!
[00:20:20] * thcipriani does
[00:23:43] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.20/skins/MinervaNeue/resources/skins.minerva.scripts/toc.js: SWAT: [[gerrit:495021|Passing page parameter to TOC toggler]] T217820 (duration: 00m 50s)
[00:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:46] T217820: Regression: "Expand all sections" setting in Minerva is broken - https://phabricator.wikimedia.org/T217820
[00:23:50] ^ jdlrobson all live now
[00:24:03] next run of beta-scap-eqiad should make the beta things live
[00:25:41] thcipriani: checking again :S
[00:26:13] thanks i'll ping you in a bit if i'm not seeing them, if there's any problems it can wait till monday anyhow
[00:26:30] thanks for the SWAT fix though!
[00:27:28] sure thing, thanks for checking patches!
[00:28:27] thcipriani it should show "Merge Conflict" but at least in 2.16 it is much more noticeable.
[00:28:56] old ui says something like "Cannot Merge" in red letters
[00:31:22] thcipriani the new ui on 2.16 shows a status badge now
[00:31:25] *badge
[00:31:51] nice, less subtle would be good :)
[00:32:01] thcipriani like https://gerrit-review.googlesource.com/c/gerrit/+/201895/5
[01:00:01] thcipriani the healthcheck plugin is fixed in https://gerrit-review.googlesource.com/c/plugins/healthcheck/+/217192
[01:02:12] (03PS1) 10MarkAHershberger: Add --timely option to help with debugging [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/495158
[01:03:24] (03CR) 10jerkins-bot: [V: 04-1] Add --timely option to help with debugging [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/495158 (owner: 10MarkAHershberger)
[01:29:36] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10zhuyifei1999) ` root@tools-sgeexec-0914:~# strace -s 1024 -p 543 -p 560 -p 561 -p 562 -p 563 -p 564 -p 568 -p 56...
[01:37:35] (03CR) 10Krinkle: [C: 03+1] "LGTM. Won't be used until I7de713adbcc2f1ee9f2 goes out, but is harmless to roll out ahead of time, as far as I can see." [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles)
[03:53:50] (03PS2) 10KartikMistry: Enable ExternalGuidance to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129)
[04:27:38] (03PS1) 10Paladox: Add missing "wikimedia" plugin to tools/bzl/plugins.bzl [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495170
[04:28:05] (03PS2) 10Paladox: Add missing "wikimedia" plugin to tools/bzl/plugins.bzl [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495170
[04:28:52] (03CR) 10Paladox: [V: 03+2 C: 03+2] Add missing "wikimedia" plugin to tools/bzl/plugins.bzl [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495170 (owner: 10Paladox)
[06:15:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173
[06:16:47] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173 (owner: 10Marostegui)
[06:17:23] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/495174
[06:17:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173 (owner: 10Marostegui)
[06:18:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 51s)
[06:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:10] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/495174 (owner: 10Marostegui)
[06:20:38] !log Reload haproxy on dbproxy1010 to depool labsdb1010
[06:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173 (owner: 10Marostegui)
[06:21:01] !log Stop replication on s3 on labsdb1009 and labsdb1011
[06:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:07] !log Deploy schema change on s3 db1077 with replication (lag will happen on s3 labs)
[06:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:13] marostegui: o/
[06:35:32] going to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494874/, ok for you?
[06:35:42] o/
[06:35:43] let me see
[06:35:44] (03PS13) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231)
[06:35:48] Ah
[06:36:09] (03CR) 10Marostegui: [C: 03+1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey)
[06:36:13] super :)
[06:36:15] I thought I already +1ed
[06:36:16] Sorry!
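[Editor's note: the SWAT verification dance earlier (stage on mwdebug1002, "check please", "fix confirmed! please sync!") is the standard canary flow. A rough sketch of the deployer's side, assuming the scap subcommands of that era (`scap pull` on the debug host, `scap sync-file` from deploy1001):]

```
# On mwdebug1002: pull the staged code so the patch owner can test it
# (testers route their traffic here with the X-Wikimedia-Debug header/extension)
scap pull

# On deploy1001, once the owner confirms: sync just the touched file fleet-wide.
# This is what produces the "Synchronized php-1.33.0-wmf.20/.../toc.js" SAL entry.
scap sync-file php-1.33.0-wmf.20/skins/MinervaNeue/resources/skins.minerva.scripts/toc.js \
    'SWAT: [[gerrit:495021|Passing page parameter to TOC toggler]] T217820'
```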
[06:36:40] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey)
[06:38:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15038/" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey)
[06:38:53] marostegui: nono I only wanted to know if it was ok for you :)
[06:49:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175
[06:50:49] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) @Legoktm , @bd808 : The PHP 7.2 packages from the component/php72 are working fine in the Mediawiki PHP productio...
[06:50:58] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175 (owner: 10Marostegui)
[06:51:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175 (owner: 10Marostegui)
[06:52:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 for mysql upgrade (duration: 00m 49s)
[06:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:45] !log Stop MySQL on db1076 for upgrade
[06:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:40] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175 (owner: 10Marostegui)
[06:57:52] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176
[07:02:55] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176 (owner: 10Marostegui)
[07:03:52] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176 (owner: 10Marostegui)
[07:04:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 after mysql upgrade (duration: 00m 48s)
[07:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:44] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176 (owner: 10Marostegui)
[07:07:32] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179
[07:11:08] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179 (owner: 10Marostegui)
[07:12:03] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179 (owner: 10Marostegui)
[07:13:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 into API after mysql upgrade (duration: 00m 48s)
[07:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:57] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179 (owner: 10Marostegui)
[07:17:24] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180
[07:21:09] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:19] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[07:21:22] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180 (owner: 10Marostegui)
[07:21:23] uh?
[07:21:37] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:39] PROBLEM - Wikitech-static main page has content on labtestweb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:47] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180 (owner: 10Marostegui)
[07:22:05] (03PS3) 10Dzahn: DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul)
[07:22:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 after mysql upgrade (duration: 00m 49s)
[07:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 'd rather we just added the hour, minute, second to the nightly flag instead of adding another one. It should be way smaller as a change" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/495158 (owner: 10MarkAHershberger)
[07:28:25] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180 (owner: 10Marostegui)
[07:30:11] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/495181
[07:30:37] (03PS2) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/495181
[07:31:03] (03CR) 10Dzahn: [C: 03+2] DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul)
[07:31:18] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/495181 (owner: 10Marostegui)
[07:31:59] (03PS4) 10Dzahn: DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul)
[07:34:29] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:00] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 30s)
[07:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:57] RECOVERY - Wikitech-static main page has content on labtestweb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 33651 bytes in 3.308 second response time
[07:35:57] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 33651 bytes in 3.321 second response time
[07:36:03] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.06 ms
[07:36:37] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33651 bytes in 0.214 second response time
[07:37:52] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/495183
[07:39:04] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/495183 (owner: 10Marostegui)
[07:41:06] weird. i have a change where i simply move an .erb template to a new location, no change inside it, yet the compiler "fails to parse" it
[07:45:48] (03PS1) 10Marostegui: db-eqiad.php: More traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184
[07:46:20] (03PS2) 10Marostegui: db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184
[07:47:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184 (owner: 10Marostegui)
[07:48:30] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184 (owner: 10Marostegui)
[07:49:02] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185
[07:49:06] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185
[07:49:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 after mysql upgrade (duration: 00m 49s)
[07:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:12] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185 (owner: 10Marostegui)
[07:50:50] (03CR) 10jenkins-bot: db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184 (owner: 10Marostegui)
[07:51:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185 (owner: 10Marostegui)
[07:51:21] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185 (owner: 10Marostegui)
[07:51:29] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 (duration: 00m 48s)
[07:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:48] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 01m 18s)
[07:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:04] elukey: did labsdb1012 change work?
[07:55:06] marostegui: yep!
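[Editor's note: two different (de)pooling mechanisms are in play above. The MediaWiki replicas (db1076/db1077) are pooled by weight in wmf-config/db-eqiad.php and each step is just an edit plus a sync; the wiki-replica proxies (dbproxy1010/1011) are haproxy boxes whose backend list is puppet-managed. A rough sketch of both, with the gradual-repool rationale in comments; the details of the PHP weight array are not shown in the log, so treat the first comment as an assumption:]

```
# 1) MediaWiki side: edit the replica's weight in wmf-config/db-eqiad.php
#    (0 or absent = depooled; repooling goes in steps so the restarted
#    server's buffer pool warms up before taking full traffic), then:
scap sync-file wmf-config/db-eqiad.php 'Depool db1076 for mysql upgrade'

# 2) dbproxy side: merge the puppet change, then apply it and reload haproxy
#    (reload, not restart, so existing connections drain gracefully):
sudo puppet agent -t
sudo systemctl reload haproxy
```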
[07:55:12] I can telnet from all analytics now
[07:55:20] so my team can start the test from hadoop
[07:56:42] coool
[07:57:51] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:53] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 02s)
[07:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:51] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:31] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 40s)
[07:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:30] (03PS7) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761)
[08:08:07] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186
[08:08:58] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15040/" [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn)
[08:09:05] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186 (owner: 10Marostegui)
[08:10:02] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186 (owner: 10Marostegui)
[08:11:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1076 (duration: 00m 48s)
[08:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:44] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186 (owner: 10Marostegui)
[08:16:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188
[08:17:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188 (owner: 10Marostegui)
[08:18:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188 (owner: 10Marostegui)
[08:20:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3311,db1096:3315 (duration: 00m 48s)
[08:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:20] (03CR) 10Dzahn: [C: 03+2] profile: point to real modules for specs [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar)
[08:24:12] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/495189
[08:25:06] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188 (owner: 10Marostegui)
[08:27:53] (03CR) 10Dzahn: [C: 03+2] "needs manual rebase apparently" [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar)
[08:28:06] (03PS4) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[08:28:38] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[08:29:40] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/495189 (owner: 10Marostegui)
[08:31:14] !log Reload haproxy on dbproxy1010 to repool labsdb1011
[08:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:27] (03PS1) 10Marostegui: dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/495190
[08:36:10] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/495190 (owner: 10Marostegui)
[08:37:19] !log Reload haproxy on dbproxy1011 to depool labsdb1009
[08:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:00] 10Operations, 10MediaWiki-Vagrant: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles)
[08:45:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191
[08:50:47] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris)
[08:51:09] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris) p:05Triage→03Normal
[09:02:59] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris)
[09:04:05] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris)
[09:13:04] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191 (owner: 10Marostegui)
[09:14:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191 (owner: 10Marostegui)
[09:15:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3311,db1096:3315 (duration: 00m 49s)
[09:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191 (owner: 10Marostegui)
[09:21:16] 10Operations, 10MediaWiki-Vagrant: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles) p:05Triage→03Normal a:03Gilles
[09:21:27] 10Operations, 10MediaWiki-Vagrant, 10Performance-Team: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles)
[09:28:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193
[09:31:44] 10Operations, 10MediaWiki-Vagrant, 10Performance-Team, 10Patch-For-Review: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles) 05Open→03Resolved
[09:31:46] (03PS1) 10Dzahn: Revert "Revert "icinga: merge https and http checks"" [puppet] - 10https://gerrit.wikimedia.org/r/495194
[09:32:25] (03PS2) 10Dzahn: Revert "Revert "icinga: merge https and http checks"" [puppet] - 10https://gerrit.wikimedia.org/r/495194
[09:32:42] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "icinga: merge https and http checks"" [puppet] - 10https://gerrit.wikimedia.org/r/495194 (owner: 10Dzahn)
[09:36:54] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193 (owner: 10Marostegui)
[09:37:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193 (owner: 10Marostegui)
[09:38:42] there will be some alerts for Elastic search but we are already debugging them
[09:39:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080, db1110 (duration: 00m 49s)
[09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:18] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: (null)
[09:39:18] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: (null)
[09:39:22] those
[09:39:28] downtiming now
[09:39:29] sorry
[09:39:34] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: (null)
[09:39:52] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: (null)
[09:39:52] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: (null)
[09:39:54] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2004 is CRITICAL: (null)
[09:39:58] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: (null)
[09:39:59] no problem, we only had seconds to do that after they got created
[09:40:00] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: (null)
[09:40:00] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: (null)
[09:40:02] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: (null)
[09:40:06] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: (null)
[09:41:02] also doing some (without disabling all other checks on those hosts)
[09:41:42] yep.. I only disable those checks
[09:41:48] perfect
[09:42:10] that's why i did not use the script, it does all services on a host
[09:43:52] check_command check_elasticsearch_shards_threshold!9200!>=0.34 debugging ... it appears to work when running manual
[09:44:44] for all hosts?
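[Editor's note: "running manual" here can be reproduced from any host that can reach port 9200, since the shards check ultimately queries Elasticsearch's cluster health API. A minimal sketch; the `jq` filter is just one way to pull the fields the alerts quote, and a healthy cluster reports status green with zero unassigned shards:]

```
# Same data the icinga plugin evaluates, fetched by hand:
curl -s 'http://relforge1001:9200/_cluster/health?pretty'

# Or just the fields the alert messages quote:
curl -s 'http://relforge1001:9200/_cluster/health' \
    | jq '{status, unassigned_shards, active_shards_percent_as_number}'
```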
[09:45:28] tested with relforge1001 which is in the list above
[09:45:38] also with both hostname or IP
[09:45:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193 (owner: 10Marostegui)
[09:45:51] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[09:45:51] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[09:45:58] oh, lol ?
[09:47:18] Ok
[09:48:35] well.. "works for one random host but not the others" is weird
[09:48:44] especially when i can do that for multiple hosts and get OK on shell
[09:49:56] (03PS5) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[09:50:24] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I would put the puppet code in modules/openstack/manifests/nova/compute/service.pp since I didn't detect anything mitaka specific or jessi" [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni)
[09:50:51] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[09:51:29] mutante: I think I found the issue
[09:51:40] !log temp disabling puppet on icinga to debug an issue with elastic checks
[09:51:41] mutante: https://icinga.wikimedia.org/cgi-bin/icinga/config.cgi?type=command&host=relforge1001&service=ElasticSearch%20health%20check%20for%20shards%20on%209200&expand=check_elasticsearch_shards_threshold%219200%21%3E%3D0.15
[09:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:13] onimisionipe: " " around the ">" part, ack
[09:52:20] that's the same thing i was gonna test
[09:54:33] 3/3 CRITICAL - elasticsearch 9200://relforge1002:>=0.15/_cluster/health error while fetching: No connection adapters were found for '9200://relforge1002:>=0.15/_cluster/health'
[09:54:47] onimisionipe: ^ oh look what happens next if we add " "
[09:55:15] the host name is not even in there
[09:55:53] hmm
[09:56:15] 9200://relforge
[09:56:23] something is mixed up with the order of the arguments
[09:56:30] yes
[09:56:30] that should be the protocol in there
[09:56:37] yep
[09:59:50] mutante: the order of args seems Ok in the CR
[10:00:01] I might be missing something
[10:01:16] i can't say i see the reason yet either. what you said
[10:05:42] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/495195
[10:11:23] (03PS6) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:12:13] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:12:49] (03PS2) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/495195
[10:13:27] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/495195 (owner: 10Marostegui)
[10:14:28] !log Reload haproxy on dbproxy1011 to repool labsdb1009
[10:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:55] (03PS7) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:19:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:21:07] godog: ^
[10:21:37] if you are around, we need some help with some issues with icinga
[10:24:13] (03PS8) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:24:15] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:26:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196
[10:28:53] (03PS9) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:29:51] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:32:03] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196 (owner: 10Marostegui)
[10:33:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196 (owner: 10Marostegui)
[10:33:14] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196 (owner: 10Marostegui)
[10:34:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1080, db1110 (duration: 00m 49s)
[10:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:37] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:34:37] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:34:51] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:34:51] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:34:51] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:34:51] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:34:52] mutante ^
[10:35:14] onimisionipe: lol, wtf :)
[10:35:19] i re-enabled puppet
[10:35:21] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in
[10:35:21] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[10:35:24] that is what happened
[10:35:26] mutante: did you revert or something
[10:35:33] no, i just let puppet run again
[10:35:48] i guess it needed 2 runs to first adjust the check command
[10:35:52] and then all the checks using it
[10:36:06] which would explain why it worked for one host .. maybe :p
[10:37:47] well, that was some wasted time :) but glad it works, hehee
[10:39:45] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:39:45] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:40:21] onimisionipe: there we go.. ^ just that it is too long for a single line
[10:40:25] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:40:25] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:41:19] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:41:19] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:41:29] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in
[10:41:29] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[10:42:29] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2004 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:42:29] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:43:09] sorry for spam
[10:43:20] recovery spam is the good spam :)
[10:43:49] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:44:29] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:44:29] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:44:43] mutante: Thanks a lot for the help!
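[Editor's note: what the argument mix-up above comes down to: icinga expands the `!`-separated parts of a service's check_command into positional `$ARG1$`, `$ARG2$`, ... macros inside the command definition, so `check_elasticsearch_shards_threshold!9200!>=0.15` only works if the definition plugs each macro into the right slot. A minimal sketch of the shape involved; the plugin's real option names are not shown in the log, so the ones below are illustrative:]

```
# Service side: check_command "name!arg1!arg2"
#   check_elasticsearch_shards_threshold!9200!>=0.15  ->  $ARG1$=9200, $ARG2$=">=0.15"
define command {
    command_name check_elasticsearch_shards_threshold
    command_line $USER1$/check_elasticsearch_shards --host $HOSTADDRESS$ --port $ARG1$ --shards-inactive '$ARG2$'
}
```

[If the macros land in the wrong positions (say `$ARG1$` where the URL scheme belongs), the plugin builds a URL like `9200://relforge1002:>=0.15/_cluster/health`, which is exactly the "No connection adapters" error quoted earlier. And because puppet updates the command definition and the service definitions as separate resources, a single run can leave them briefly mismatched, which matches the "needed 2 runs" observation above.]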
[10:44:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:45:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:45:10] all green
[10:45:15] oops
[10:45:52] onimisionipe: you're welcome
[10:47:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:48:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:49:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:07:55] !log uploaded tideways 4.0.7-1+wmf1 for component/php72 (T216712)
[11:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:58] T216712: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712
[11:10:33] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10Dzahn) phab1002 done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/494885 doc1001 done in https://gerrit.wikimedia.org/r/...
[11:10:37] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10dr0ptp4kt) Great framing, nice job! One question, though, what's this part about and how does...
[11:15:53] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, and 2 others: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10Marostegui) 05Open→03Resolved As per our earlier chat - this seems to be working fine after the puppet change to get the FW opened for labsdb1012
[11:31:19] Request from via cp1077 cp1077, Varnish XID 1011843221
[11:32:43] Back again
[11:39:58] (03PS2) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 4 [puppet] - 10https://gerrit.wikimedia.org/r/495008
[11:40:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some first comments" (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto)
[11:49:54] (03PS1) 10Dzahn: openstack: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205
[11:59:13] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10dr0ptp4kt) ^ Well, I intended for that to be on email. But it stands: I think Olga put this i...
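[Editor's note: the tideways upload above targets component/php72, the internal apt component from T216712. On a host that consumes it, that roughly corresponds to a sources entry like the sketch below. The distribution/component layout is assumed from the component name in the log, and "tideways" is the source package name, so the binary package may be named differently:]

```
# /etc/apt/sources.list.d/php72.list (assumed layout)
deb http://apt.wikimedia.org/wikimedia stretch-wikimedia component/php72

# then verify the component is preferred for the package:
sudo apt-get update
apt-cache policy tideways
```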
[12:00:21] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:00:21] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:00:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:00:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:00:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:03:57] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:04:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:05:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:05:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:06:21] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:07:08] !log rolling security updates of sqlite3 on jessie and trusty
[12:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:17:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:17:25] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:17:37] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2842 bytes in 0.312 second response time
[12:17:55] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:17:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:18:14] * arturo paged
[12:18:29] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:18:34] ^ ema_ vgutierrez bblack
[12:18:39] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2840 bytes in 1.298 second response time
[12:18:49] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2817 bytes in 0.312 second response time
[12:18:55] Request from [redacted] via cp1077 cp1077, Varnish XID 1070238236 Error: 503, Backend fetch failed at Fri, 08 Mar 2019 12:15:49 GMT and Request from [redacted] via cp1077 cp1077, Varnish XID 184418667 Error: 503, Backend fetch failed at Fri, 08 Mar 2019 12:17:13 GMT - but I guess you know :)
[12:19:05] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17211 bytes in 0.306 second response time
[12:19:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:19:21] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:19:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:19:31] what's up?
[12:19:33] it's all 2001, is something happening with that box?
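[Editor's note: with 503 "Backend fetch failed" reports repeatedly naming cp1077, a natural next step is inspecting failed fetches on the cache host itself. A minimal sketch using Varnish's VSL query language, run on the suspect host; the grouping choice is illustrative:]

```
# Show whole client transactions whose backend fetch came back 503:
sudo varnishlog -g request -q 'BerespStatus == 503'

# Counter-level view of fetch failures:
sudo varnishstat -1 | grep -i fetch_failed
```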
[12:19:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:19:51] should we depool ulsfo?
[12:19:56] <_joe_> I don't think it's 2001
[12:20:00] <_joe_> and yes, depool ulsfo
[12:20:04] ok doing
[12:20:06] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17212 bytes in 1.108 second response time
[12:20:09] <_joe_> but it seems it's codfw/eqsin/ulsfo
[12:20:16] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17198 bytes in 0.331 second response time
[12:20:17] it's not only ulsfo
[12:20:21] I see errors everywhere: https://grafana.wikimedia.org/d/000000508/prometheus-varnish-http-errors-datacenters?orgId=1
[12:20:21] <_joe_> yeah
[12:20:23] https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=4&fullscreen&orgId=1&from=1552047213993&to=1552047605664
[12:20:27] <_joe_> it's also codfw
[12:20:31] ah it's not only ulsfo, depooling would not help
[12:20:45] <_joe_> yeah
[12:20:46] eqiad as well
[12:20:53] all text + esams/eqiad upload
[12:21:01] <_joe_> jfc
[12:21:03] le
[12:21:59] cp1077 seems misbehaving from https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[12:22:09] <_joe_> ok so, apart from what elukey just found
[12:22:12] <_joe_> anything else
[12:22:18] <_joe_> ?
[12:22:31] I got nothing yet
[12:22:32] seems kinda recovering
[12:22:36] me neither
[12:23:08] yeah.. cp1077 looks to be in pain
[12:23:13] nothing on the load-balancers dashboard
[12:23:23] volans: there's also some mailbox lag, not sure if related
[12:23:32] is there any dependency to sqlite? (I am guessing no) but it was the only thing ongoing at the time
[12:23:42] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:23:42] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:24:06] do we graph the mailbox lag?
[12:24:18] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:24:18] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:24:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:24:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:24:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:24:39] volans: we do yes - https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=13&fullscreen
[12:24:55] jbond42: cp1008 is pink unicorn. It doesn't serve production traffic
[12:25:01] so maybe network?
[12:25:04] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:25:05] jynus: all the LVSes/varnishes are stretch, those updates are all for jessie/trusty
[12:25:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:25:08] across the internet?
[12:25:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:25:23] all of our pops?
[12:25:26] I don't know, I am just throwing options
[12:25:43] a network issue on eqiad should be enough to create issues everywhere
[12:25:43] vgutierrez: did you do anything with cp1077?
[12:25:48] yeah and that's good, I am trying to figure out reasons it can't be that so we reach the root cause
[12:25:54] volans: noep
[12:25:56] *nope
[12:26:22] akosiaris: cp1077 is eqiad, if misbehaving could explain a widespread temporary issue?
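[Editor's note: "mailbox lag" above is the backlog between Varnish worker threads mailing expiry notices and the expiry thread consuming them. The grafana panel linked at 12:24:39 is derived from two varnishstat counters, so it can also be read directly on the host. A sketch, assuming the lag-as-difference interpretation (exp_mailed minus exp_received):]

```
# On cp1077: a large, growing gap between these two counters is the "mailbox lag"
sudo varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received
```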
[12:26:23] first whines were about 1 hr 45 mins ago (10:43 utc) [12:26:26] in here [12:26:48] volans: I can see an increase in uplaoad retransmits, but that could be just a consequence [12:26:54] sorry ,that was for akosiaris [12:27:02] elukey: the only thing that I'm having hard time to explain with cp1077 is why we got also esams/eqiad upload with issues [12:27:08] elukey: we 've have cps misbehaving before and the issues were not network wide [12:27:22] I am not sure if it is related, probably not [12:27:25] https://logstash.wikimedia.org/goto/13e5e1f56d28e0c40e6ce9e846086f2d [12:27:51] looking into network issues anyway via librenms [12:27:52] cp1077 is a text node, so upload issues shouldn't be related to it [12:28:05] akosiaris: yep yep agreed, the mailbox lag is always concerning me, in the past it was causing horrible things [12:28:13] jijiki: that looks bad but doesn't seem new [12:28:16] and it is also upload yes [12:28:28] volans: --^ [12:29:11] lots of abusefilter errors [12:32:02] Mar 8 12:25:29 asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1: Rx loss cleared [12:32:02] Mar 8 12:28:29 asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1: Rx loss set [12:32:06] hmm [12:32:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:32:28] oh, so akosiaris now you listen to me :-P [12:32:36] jynus: I always do :P [12:33:07] but that would not explain such a big pain [12:33:35] (wikicommons down) [12:33:35] the issues seem ongoing based on that graph ^ [12:33:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:33:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:34:01] Steinsplitter: there is some instability, but things should not be hard down, retry [12:34:26] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:34:26] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:34:33] jynus :) [12:34:38] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:34:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:35:04] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:35:04] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:35:16] PROBLEM - HTTP 
availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:35:41] fyi cp1077 is peaking again [12:36:04] nothing weird in traffic graphs btw [12:36:32] there is a 98.5% availability, that is a lot of errors [12:36:46] and seems recurring [12:37:14] text, all dcs [12:37:15] nothing weird in peering transit or core ports in librenms [12:38:14] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:38:38] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:38:43] hmm [12:38:54] that QSFP said again [12:39:10] Mar 8 12:33:30 asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1: Rx loss set [12:39:12] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:39:23] Steinsplitter: I don't see any issues with commons, at least browsing [12:39:35] what's on fpc3 [12:39:48] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:39:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:39:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:00] looks like ipsec bounced on cp1077 [12:40:04] apergos: his issues were real, there is at times a 1.5-2% error rate [12:40:24] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:40:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:41:01] with peaks of 350 errors/s [12:41:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:41:42] https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1&var-site=eqiad&var-site=codfw&var-site=ulsfo&var-site=eqsin&var-site=esams&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&from=1552034133713&to=1552048829080 [12:44:26] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Our docs aren't good :-) You could probably just point everything to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troublesho" [puppet] - 10https://gerrit.wikimedia.org/r/495205 (owner: 10Dzahn) [12:45:48] akosiaris: at the moment, my only recommendation would be to depool cp1077 [12:45:59] because it has a strange network pattern [12:46:06] unlike the others [12:46:07] I got nothing better either [12:46:10] sure [12:46:13] lemme do that [12:46:21] 2 dips on network [12:46:27] while the other don't have that [12:47:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1077.* [12:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:33] !log depooling cp1077 just in case, high mailbox lag https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=13&fullscreen [12:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:37] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1009 to cloudvirt1009 - https://phabricator.wikimedia.org/T216281 (10aborrero) a:05aborrero→03Cmjohnson [12:51:27] I am getting errors as well [12:51:46] errors on which, Bsadowski1? [12:51:46] "Request from xxxxxx via cp1089 cp1089, Varnish XID 162562291" [12:51:53] Error: 503, Backend fetch failed at Fri, 08 Mar 2019 12:51:09 GMT [12:52:04] wikis? [12:52:08] Simple [12:52:11] thx [12:55:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:55:06] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [12:55:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:56:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:57:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:57:42] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10TheDJ) Just stumbled across this in CommonSettings: ` $wgSVGConverters['rsvg-broken'] = '$path/rsvg-convert -w $widt... 
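For reference, the depool recommended above would look something like the confctl invocation below (a sketch based on the selector recorded in the !log above; exact flags per the conftool docs). Note that the SAL entry above actually recorded set/pooled=yes rather than no, which becomes relevant later in the log.

  # on the cluster management host (sketch; selector taken from the !log above)
  sudo confctl select 'name=cp1077.eqiad.wmnet' set/pooled=no
  # verify the object state afterwards
  sudo confctl select 'name=cp1077.eqiad.wmnet' get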
[13:03:18] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [13:05:13] (03PS3) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:05:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:05] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10Vgutierrez) {F28347567} it looks indeed like purge requests [13:06:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:20] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:56] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:56] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:58] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [13:07:08] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:07:22] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2842 bytes in 0.315 second response time [13:07:26] All my requests to Meta result in a Varnish error [13:07:32] nvm [13:07:37] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:07:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:08:12] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) >>! 
In T217893#5011033, @Vgutierrez wrote: > {F28347567} it looks indeed like purge requests I updated the comment- Those seem recurring, there wa... [13:08:40] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17209 bytes in 0.342 second response time [13:08:48] the report is only about IPv6, weird [13:08:51] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10hashar) The spike of PURGE requests to the Varnish text frontends seems to be recurring. A view over 24 hours from https://grafana.wikimedia.org/d/000000180... [13:09:11] no it's ipv4 as well [13:09:43] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1077.eqiad.wmnet [13:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] vgutierrez: was it still pooled? I am pretty sure I depooled it [13:10:35] akosiaris: according to conftool yes [13:10:50] ah dammit, yes I now see the sal log [13:10:54] *confctl sorry [13:10:55] ` set/pooled=yes; selector: name=cp1077.*` akosiaris :-P [13:10:55] I passed pooled=yes [13:11:01] yeah yeah my bad [13:11:01] oh, I didn't check that [13:11:08] sorry [13:11:18] so it went over me too! [13:12:36] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:12:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:12:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:13:10] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:13:12] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:13:24] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:13:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:13:37] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10Vgutierrez) cp1077 effectively depooled at 13:09 UTC [13:13:50] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:14:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:14:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:15:49] (03PS3) 10GTirloni: openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) [13:17:48] (03PS4) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:19:01] marostegui, jynus - Hi my beloved DBAs - Letting you know I have started a test sqooping from labsdb1012 with a few more workers than usual - So far the host is (very) busy but doesn't complain - Please let me know if I should stop :) [13:19:40] joal: not a good time now [13:19:46] there are ongoing issues [13:19:52] please ping us at a later time [13:20:24] ack! [13:21:08] (03CR) 10GTirloni: "Puppet compiler output - https://puppet-compiler.wmflabs.org/compiler1002/15043/" [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni) [13:31:56] (03PS5) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:40:59] 10Operations, 10Traffic: Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [13:43:13] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (10Gehel) [13:43:42] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (10Gehel) [13:48:11] joal: labsdb1012 is "yours" so it is not having any other user traffic that might be affected by the load you might be generating [13:49:09] marostegui: I knew that, I prefer however to let you know, in case there is a moment in the day when not everything is burning and you want to check how it is doing with me hammering it :) [13:49:20] the only load worries would be if you can do the same thing, but faster in a different way, and if it creates so much load that you yourself get affected [13:49:21] joal: sure!
appreciate it :) [13:49:45] the issues should be fixed now [13:50:05] Happy to hear that - Thanks for caring :) [13:51:38] (03PS6) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:54:26] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Gehel) [13:59:52] (03CR) 10Marostegui: "As requested, gave a quick look at daily_snapshot.py so far so good apart from the things we already discussed in IRC" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:13:30] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10BBlack) Looking at an internal version of the flavor=dump outputs of an entity, related observations: Test request from the in... [14:15:40] (03PS2) 10Dzahn: openstack: monitoring: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205 [14:15:55] !log gilles@deploy1001 Started deploy [performance/navtiming@f2d8a5f]: (no justification provided) [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:00] !log gilles@deploy1001 Finished deploy [performance/navtiming@f2d8a5f]: (no justification provided) (duration: 00m 05s) [14:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:19] (03PS3) 10Dzahn: openstack: monitoring: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205 [14:22:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/495205 (owner: 10Dzahn) [14:23:15] (03PS1) 10Arturo Borrero Gonzalez: postinst: explicitly mention home directory [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 [14:23:17] (03PS1) 10Arturo Borrero Gonzalez: d/rules: avoid running dh_installinit [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495229 [14:23:19] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.4 unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495230 [14:27:09] (03PS2) 10Arturo Borrero Gonzalez: postinst: explicitly mention home directory [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 [14:27:11] (03PS2) 10Arturo Borrero Gonzalez: d/rules: avoid running dh_installinit [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495229 [14:27:13] (03PS2) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.4 unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495230 [14:27:15] (03PS1) 10Arturo Borrero Gonzalez: d/control: include python dep [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495232 [14:27:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] postinst: explicitly mention home directory [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 (owner: 10Arturo Borrero Gonzalez) [14:28:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/control: include python dep [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495232 (owner: 10Arturo Borrero Gonzalez) [14:28:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/rules: avoid running dh_installinit [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495229 (owner: 10Arturo Borrero Gonzalez) [14:28:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.4 unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495230 (owner: 10Arturo Borrero Gonzalez) [14:31:52] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: release package 0.4 as unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495233 [14:34:27] !log T215605 add prometheus-rabbitmq-exporter v0.4 to stretch-wikimedia [14:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:30] T215605: cloudvps: missing packages in stretch for cloudcontrol servers - https://phabricator.wikimedia.org/T215605 [14:35:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: release package 0.4 as unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495233 (owner: 10Arturo Borrero Gonzalez) [14:40:20] (03PS2) 10GTirloni: ldap: increase group TTL from 60 to 3600 seconds in labs [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) [14:42:07] (03Abandoned) 10GTirloni: openldap: Set thread pool based on processor count [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [14:43:28] (03CR) 10GTirloni: "Hey Arturo, when you get some spare time, could you review this? Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [14:44:21] (03Abandoned) 10GTirloni: wmcs::nfs::misc - Backup for misc server (cloudstore1008) [puppet] - 10https://gerrit.wikimedia.org/r/485375 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [14:45:08] (03PS3) 10GTirloni: ldap: increase group TTL from 60 to 300 seconds in labs [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) [14:56:02] (03PS7) 10Dzahn: create a new role 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) [15:09:37] (03Abandoned) 10Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/490225 (owner: 10Paladox) [15:10:31] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10MSantos) @MoritzMuehlenhoff and @Mholloway >>! In T216521#4986352,... [15:11:22] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10MSantos) 05Open→03Invalid [15:13:22] (03CR) 10Effie Mouzeli: [C: 03+2] Have coal watch the PaintTiming schema [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles) [15:13:33] (03CR) 10Effie Mouzeli: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15047/webperf1001.eqiad.wmnet/ Looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles) [15:13:56] (03PS2) 10Effie Mouzeli: Have coal watch the PaintTiming schema [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles) [15:21:33] (03PS1) 10Alexandros Kosiaris: Fix typo in comments [software/service-checker] - 10https://gerrit.wikimedia.org/r/495237 [15:21:36] (03PS1) 10Alexandros Kosiaris: Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 [15:25:38] https://fr.wikipedia.org/wiki/Hôtel_de_Blossac can we get a couple folks to see if the map in the infobox loads for them? if not, where are you located (country)? if so, where are you located (country)? [15:25:50] (it's failing for me and at least a couple other users in Europe) [15:25:59] tried a couple other random pages with maps, same [15:26:12] map loads for me [15:26:24] ok, another vote for 'works in the us' [15:26:28] (03CR) 10Gehel: [C: 04-1] "mostly minor comments about structure" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:27:00] do we have some idea on the nature of the failure? does some request timeout or 503? [15:27:50] nope, hey andrewbogott let's move the discussion on the maps here? 
[15:28:11] sure [15:28:39] https://en.wikipedia.org/wiki/R504_Kolyma_Highway#/map/0 fails for me, etc [15:29:20] same with mutante's cache-busting trick, with foo it still fails [15:29:37] same for me [15:29:45] it works for me as https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac#/map/0?foo [15:29:49] it's failing for me as well with a 400 on [15:29:49] https://maps.wikimedia.org/geoline?getgeojson=1&query=++SELECT+%3Fid+%3Flength+++%28if%28%3Fid+%3D+wd%3AQ1142859%2C+%27%23C12838%27%2C+%27%2307c63e%27%29+as+%3Fstroke%29+++%28concat%28%27Line+length%3A+%27%2C+str%28%3Flength%29%2C+%27+km%27%29+as+%3Fdescription%29+++%28if%28BOUND%28%3Flink%29%2C+++++++concat%28%27%5B%5B%27%2C+substr%28str%28%3Flink%29%2C31%2C500%29%2C+%27%7C%27%2C+%3FidLabel%2C+%27%5D%5D%27%29%2C+++++++%3FidLabe [15:29:50] l%29++++as+%3Ftitle%29+WHERE+%7B+++++%7B%3Fid+wdt%3AP16+wd%3AQ260792.%7D++++++SERVICE+wikibase%3Alabel+%7B+++++bd%3AserviceParam+wikibase%3Alanguage+%27en%27+.+++++%3Fid+rdfs%3Alabel+%3FidLabel+.+++%7D+++OPTIONAL+%7B%3Flink+schema%3Aabout+%3Fid.+++%3Flink+schema%3AisPartOf+%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E.%7D+%7D+GROUP+BY+%3Fid+%3Flink+%3FidLabel+%3Flength++ [15:29:50] so the enwiki or frwiki fetch succeeds, but then it loads data from maps.wikimedia.org, I'm assuming that's where the failure is, in a subrequest [15:29:54] bblack: in devtools I see a call to maps.wikimedia.org that returns HTTP 400 [15:30:03] same as cdanis :) [15:30:04] https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754 returns HTTP 400 payload "headers is not defined" [15:30:32] those URIs look pretty funky [15:30:33] !log gilles@deploy1001 Started deploy [performance/coal@8766469]: (no justification provided) [15:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:39] !log gilles@deploy1001 Finished deploy [performance/coal@8766469]: (no justification provided) (duration: 00m 06s) [15:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:42] why is there a select query inside a URI, and do we really support that? [15:30:57] gah I have no idea what to search for in logstash [15:31:02] (03PS2) 10Alexandros Kosiaris: Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 [15:31:24] yes we do support it [15:31:40] ok, we probably shouldn't, that seems a little crazy [15:31:47] the response refers to some headers missing: "headers is not defined" [15:31:49] but that's not something we can fix in the moment [15:32:02] dunno if client or server side though [15:32:07] this is to get some info from wikidata for building the map [15:32:17] regardless [15:32:30] so that's a big feature to cut if we do cut it, but as you say, discuss later [15:32:36] there's a SQL query encoded inside a URI, and it's fetched as a subrequest to a normal article pageview [15:32:39] for me it is as if a blank map got cached [15:32:41] it's wrong on many levels [15:32:56] * andrewbogott waiting for this to turn out to have an accidental wmcs dependency [15:33:01] lol [15:33:10] don't leave town :-P [15:33:38] mmm, kartotherian does not seem to be in wmflabs codesearch?
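Decoded for readability, the percent-encoded query embedded in that geoline URL reads roughly as the following SPARQL (a best-effort manual decode of the URL above, reindented; not an authoritative copy):

  SELECT ?id ?length
    (if(?id = wd:Q1142859, '#C12838', '#07c63e') as ?stroke)
    (concat('Line length: ', str(?length), ' km') as ?description)
    (if(BOUND(?link),
        concat('[[', substr(str(?link),31,500), '|', ?idLabel, ']]'),
        ?idLabel) as ?title)
  WHERE {
    { ?id wdt:P16 wd:Q260792. }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language 'en' .
      ?id rdfs:label ?idLabel .
    }
    OPTIONAL {
      ?link schema:about ?id.
      ?link schema:isPartOf <https://en.wikipedia.org/>.
    }
  }
  GROUP BY ?id ?link ?idLabel ?length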
[15:33:52] maps* [15:33:55] kartotherian is a production service behind cache_upload [15:33:57] it's not a labs thing, it's a production extension [15:34:04] so other fr maps links work for me [15:34:08] (maps.wikimedia.org goes through cache_upload to kartotherian) [15:34:17] yes yes, what I meant is that the source code does not seem to be *indexed* by https://codesearch.wmflabs.org [15:34:19] and in general the tiles seem to work [15:34:21] maps1001 et.. nothing in nginx error log [15:34:22] as the thing I want to do is search for this error message ;) [15:35:23] ah, sorry c danis, misunderstood [15:35:37] I am not sure this is a traffic error could be a client code one [15:36:07] yeah it's only the esams edge giving the 400 [15:36:23] Kartographer is the extension [15:36:30] hmmm maybe, but the error is returned by karthotherian backends [15:36:31] (that is indexed) [15:36:34] as in, the 400 is correct, just it is trying to load the incorrect url [15:36:39] looking at the headers... x-powered-by: kartotherian: 1.0.0 (439910027a55bf7c1effb2d34a1f42a5a995268d) [15:37:19] and the ones that work for me (200 through other edges) have: [15:37:22] x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b) [15:37:29] oops [15:38:09] so it's eqiad vs codfw [15:38:39] codfw + ulsfo + eqsin edges work (goes to codfw kartotherian backends running 0.0.38), eqiad + esams edges go to eqiad karto backends running this 1.0.0 [15:38:57] see SAL yesterday [15:38:59] 18:39 gehel@puppetmaster1001: conftool action : set/pooled=yes; selector: dc=codfw,cluster=maps,name=maps2004.codfw.wmnet [15:39:02] 18:32 mbsantos@deploy1001: Finished deploy [kartotherian/deploy@248b8c4] (stretch): Updating eqiad cluster before repool of maps2004.codfw.wmnet (duration: 01m 25s) [15:39:14] I've DNS-overridden my traffic to go via ulsfo and it works [15:39:37] damn, what did I do again? [15:40:00] gehel: what's the current intended state of kartoetherian versions on the two dcs' maps clusters? [15:40:04] if I don't override the resolution of text-lb, and just override maps.wikimedia.org to go to upload-lb.ulsfo, it also works [15:40:05] gehel: it appears there are different versions of kartotherian [15:40:18] so it does appear that the new kartotherian release is at fault [15:40:33] in the working pageload, x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b) [15:40:51] and 1.0.0 fails? huh [15:41:08] right, but is 1.0.0 even supposed to be live? maybe they were meant to stay depooled or something, I have no idea [15:41:18] neither do I [15:41:18] yep, we're in the process of migrating to stretch and there are a few servers lagging behind [15:41:39] does stretch imply a kartotherian upgrade from 0.0.38 to 1.0.0 as well? [15:41:44] the only recent change should be on maps2004, the other ones should not have anything new [15:42:00] hmm so the version mismatch is on purpose? [15:42:14] gehel: what we're observing so far, is certain maps URIs return 200 OK as expected, from codfw-maps cluster, which claims to run karto 0.0.38 [15:42:25] and the same URIs fail when routed into eqiad-maps cluster, which claims to run karto 1.0.0 [15:42:44] and the failed URL are on geoshape? 
[15:42:47] yes [15:43:10] so that's not tiles serving but one of the peripheral services [15:43:40] example https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac?foo#/map/0 [15:43:52] sorry, I said tiles were not loading, but that was not true, I was just getting a blank page [15:43:59] maps-eqiad should not have changed recently, so we're probably hitting a previously existing bug [15:44:16] it looks like maps2001-2003.codfw are running jessie, maps1001-1004.eqiad + maps2004.codfw are running stretch [15:44:17] sorry, this: https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac#/map/0 [15:44:17] mateusbs17: ^ you might know something about all that [15:44:24] It can also be seen on the infobox at https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac [15:44:35] bblack: correct, that's at least what I'm expecting [15:44:41] scrolling down, it shows nothing to me [15:44:49] and all are pool [15:44:52] *pooled [15:45:10] !log OS install on restbase2019 and restbase2020 [15:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:21] all pooled is also what's expected [15:45:22] gehel: the strong suspicion at this point is all the stretches are failing at this geoshape query, and all the jessies aren't [15:45:36] and they're reporting wildly different karto versions, too [15:46:31] yep, but eqiad has been running stretch for a number of weeks [15:47:08] is the OS version actually related to the karto version? [15:47:10] fwiw from https://logstash.wikimedia.org/goto/126571f3ef3e2163b342235e3795d05d it looks like there are at least two recurring errors [15:47:14] it's not a package [15:47:15] bblack: gehel: I found the issue, and it is related to unpublished code https://github.com/kartotherian/geoshapes/releases/tag/v1.0.4 [15:47:24] 'headers is not defined' and 'Bad geojson - unknown type ExternalData' [15:48:14] mateusbs17: you mean: this is a known issue and we have a fix ready to be deployed? [15:48:33] btw it is pretty easy to reproduce this with varying upload-lb locations from the command line: curl -v --resolve maps.wikimedia.org:443:$(dig +short upload-lb.codfw.wikimedia.org) 'https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754' [15:48:38] (3) maps[2001-2003].codfw.wmnet [15:48:42] version 0.0.1 [15:48:43] yes [15:48:46] https://github.com/kartotherian/geoshapes/commit/8995ed4ac0050c9e8ae36e78d860e4e84b3185b9 [15:48:56] but that fix is in 1.0.3 as well [15:48:57] some are 1.0 and some are 0.0.1 [15:49:13] sudo cumin 'maps*' 'grep version /srv/deployment/kartotherian/deploy/src/package.json' [15:49:30] so it sounds like we should depool maps@eqiad, yes? [15:49:40] gehel: Yes. bblack: this v1.0.3 is not matching npm for some reason. v1.0.4 fixes that. [15:49:51] cdanis: I'm not sure, I can log failures when querying the codfw applayer directly now, too [15:50:16] I am finishing some build tests and the next kartotherian deploy will come with a fix [15:50:28] someone give me a reason to not immediately depool maps@eqiad [15:50:28] maps2004 is already upgraded on codfw, so it would also exhibit the problem [15:50:35] maybe the whole codfw/eqiad split is a red herring about who has these items cached [15:50:48] I get the 400 when I internally query any of maps2001-4.codfw [15:50:53] (behind the caches) [15:50:56] i was able to make a URL work by adding ?foo to it..
without changing location [15:50:56] huh, I am yet to reproduce a failure on codfw with caches bypassed [15:51:10] cdanis: you can try depooling eqiad [15:51:18] 10Operations, 10Gerrit, 10Release-Engineering-Team: Deploy multi-site plugin to cobalt and gerrit2001 - https://phabricator.wikimedia.org/T217174 (10Paladox) a:05Paladox→03None This can be deployed to prod (if there's kafka in prod). In my testing this worked really well (we only want replication from co... [15:51:28] I don't think we should depool [15:51:39] may I suggest to search or create a task first [15:51:43] :-) [15:51:45] in terms of error log volume maps2004 is quite low compared to the eqiad maps hosts [15:51:53] if the problem is really the stretch upgrade we'll also need to depool 2004 [15:51:58] all the 200's I get against codfw are cache hits, so far, and when I add a random param to bust, they're 400s [15:52:02] and the risk at that point is overloading the service [15:52:05] so we can send people that may be having issues to it [15:52:25] mateusbs17: do you know if we already have a task for this? [15:52:43] oh random 200/400 maybe, depends on server? the data is confusing! [15:53:03] the state of that cluster is confusing :/ [15:53:27] We have two different problems on the example above: geoshapes and the page_props external data linking with JsonConfig. [15:53:31] how about getting the same karto version on all first? [15:54:09] well if stretch has a new one (a few different minor versions of it) and some of those are suspect, I'm not sure we can start there [15:54:13] I see that the geoshapes problem is not properly deployed. I spotted this a couple hours ago and started fixing the version [15:54:23] i dont see a relation to stretch since it's deployed with scap and not as a deb ? [15:54:37] ugh, some of my previous tests are faulty, curl apparently overriding --resolve by using ipv6 instead? [15:54:48] mutante: there are fixes in the 1.0.0 for stretch compatibility [15:54:50] * Added maps.wikimedia.org:443:208.80.153.240 to DNS cache [15:54:50] * Trying 2620:0:861:ed1a::2:b... [15:54:52] wtf :P [15:55:13] gehel: ah [15:55:14] bblack: -4 [15:55:31] yeah I see that, but that seems like really dumb default behavior [15:55:46] yeah.. unexpected at least [15:55:53] I didn't say --resolve-only-for-ipv4-but-then-use-ipv6-instead [15:55:56] I said --resolve :P [15:56:00] lolol [15:56:16] querying the hosts directly, maps200[123] work and maps2004 does not, just for the simple https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754 [15:56:44] we can depool only maps2004 since that is also what changed yesterday [15:56:49] maps2004 is also < x-powered-by: kartotherian: 1.0.0 (439910027a55bf7c1effb2d34a1f42a5a995268d) [15:57:02] and 2001,2,3 have which?
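The curl gotcha in the exchange above: the --resolve pin carried only an IPv4 address, while curl still connected over IPv6 from the normal DNS answer, so the pin was silently bypassed. Forcing the address family with -4 (bblack's "-4") makes the earlier reproduction command deterministic; a sketch reusing the command from the log:

  # force IPv4 so the IPv4 --resolve pin actually takes effect
  curl -4 -v --resolve maps.wikimedia.org:443:$(dig +short upload-lb.codfw.wikimedia.org) \
    'https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754'
  # append a throwaway parameter to bust Varnish cache hits, as done above
  curl -4 -sv --resolve maps.wikimedia.org:443:$(dig +short upload-lb.codfw.wikimedia.org) \
    'https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754&foo=1' -o /dev/null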
[15:57:15] 0.0.1 [15:57:18] right [15:57:21] depending where you look [15:57:33] 0.0.38 according to the headers [15:57:35] so ignore the stuff I said in the middle about some codfw seeming to fail [15:57:36] ok [15:57:36] but 1001-1004 also have 1.0.0 [15:57:47] it is the version split that defines the 400 vs 200 responses [15:57:54] i am looking at package.json in the deploy dir though [15:58:16] x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b) [15:58:16] that one [15:58:21] so 1001-1004 and 2004 are bad due to having 1.0.0 and 2001-3 are good due to having 0.0.38 [15:58:22] (the OS version split, and the http-reported karto version, which may or may not correlate to some other version) [15:58:24] got it [15:58:41] ok, some more info: [15:58:50] since the thing that happened yesterday is pooling 2004 and 2004 looks faulty, let's depool just that one? [15:58:54] so if we want to fix this quickly, yes, we could depool all the stretches and maybe-overload the remaining 3 in codfw that are older [15:59:11] or we can upgrade the stretches to the 1.0.4 karto package for the bugfix [15:59:12] that issue was monkey patched because we did not want to package a full version during the stretch upgrade [15:59:31] https://phabricator.wikimedia.org/P8173 [15:59:58] awesome thanks c danis [16:00:12] yesterday's deployment did erase that monkey patch [16:00:19] if we overload the servers are they worse off than now? [16:00:32] can they cause any other sort of issue? [16:00:38] apergos: yes, that would be worse [16:00:46] ah it figures [16:00:53] at the moment, all the tile serving is fine, which is the main use case [16:01:08] right [16:01:15] do we actually think 3 servers will be overloaded? [16:01:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [16:01:23] the risk is minor [16:01:26] because that means maybe we need to expand these clusters [16:01:34] mateus will have a proper deploy with this single fix in a few minutes [16:01:41] it should be an ok situation in design terms: have 1/4 hardware machines dead in a site, and depooling 1/2 redundant sites [16:01:44] as I said, I think it is more important to create a ticket and communicate at this point [16:01:53] a few minutes? worth the wait [16:01:56] ok let's wait on the fix then [16:02:03] ticket is T217898 [16:02:06] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 [16:02:06] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi All yours [16:02:07] gehel: thanks [16:02:49] waiting sounds fine, we have already had that for 24h or so i guess [16:03:01] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=maps&var-instance=All the maps clusters do look like they run reasonably hot on CPU some fraction of the time [16:03:11] so I agree that overload is a possible issue [16:03:23] I'm going to ask a few postmortem-y questions. [16:03:44] - are there any blackbox probes done against the HTTP API by icinga or the like?
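The same version census could also be taken over HTTP rather than from package.json, using the x-powered-by header quoted above; a sketch (the host list and the kartotherian port, assumed here to be 6533, are unverified assumptions):

  for h in maps100{1..4}.eqiad.wmnet maps200{1..4}.codfw.wmnet; do
    printf '%s: ' "$h"
    # -sD- dumps the response headers to stdout; the body is discarded
    curl -sD- -o /dev/null "http://$h:6533/geoshape?getgeojson=1&ids=Q3145754" \
      | grep -i '^x-powered-by' || echo '(no x-powered-by header)'
  done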
[16:03:45] mmm [16:04:34] bblack: it's just the time to package kartotherian and deploy it [16:04:54] - why is there not an HTTP response code breakdown on https://grafana.wikimedia.org/d/000000030/service-kartotherian?refresh=5m&orgId=1 [16:05:20] cdanis: I don't think it is yet the time while the issue is ongoing [16:05:30] I don't need answers now jynus [16:05:44] I just want to get these out of my head into somewhere less ephemeral ;) [16:05:53] but if you'd rather that be a local text editor that's fine [16:06:00] cdanis: use the ticket :-) [16:06:15] I'm adding some notes to T217898, just to be sure to not forget those [16:06:20] PROBLEM - Device not healthy -SMART- on phab1002 is CRITICAL: cluster=misc device=sdc instance=phab1002:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1002&var-datasource=eqiad+prometheus/ops [16:06:25] cdanis: feel free to add more things on the ticket [16:06:34] we can all use that pastebin and later embed the pastebin into the ticket [16:08:54] mutante: which pastebin? (I lost track) [16:09:05] https://phabricator.wikimedia.org/P8173 [16:09:13] thanks! [16:11:03] once the issue is fixed in production, does anyone else think it would be a good idea to open a proper incident report under https://wikitech.wikimedia.org/wiki/Incident_documentation, construct a timeline, and get an idea of the number of affected requests? [16:11:45] yes [16:11:48] sounds like an incident, ack [16:12:21] Oh yes! [16:12:39] I can do a first pass on the incident report once the fire is under control [16:15:21] (03CR) 10Anomie: "1.33.0-wmf.20 is now on all wikis. This should be safe to deploy." [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [16:17:17] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@d71df87] (stretch): UBN geoshapes services (T217898) [16:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:21] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 [16:19:17] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@d71df87] (stretch): UBN geoshapes services (T217898) (duration: 02m 00s) [16:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:42] https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac works for me [16:21:46] looks like you fixed it [16:21:52] mateusbs17: thanks a lot!!! [16:21:58] (03CR) 10Krinkle: [C: 03+1] "LGTM. I'm unable to find docs for the 'mail' program used here ( maybe mailx? 
https://linux.die.net/man/1/mail ), or a -a param, but I ass" [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [16:22:13] mateusbs17: confirmed working, thanks [16:22:55] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@cc302de] (stretch): UBN geoshapes services on maps2004.codfw.wmnet (T217898) [16:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:58] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 [16:23:19] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@cc302de] (stretch): UBN geoshapes services on maps2004.codfw.wmnet (T217898) (duration: 00m 24s) [16:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:34] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) That for this, it's appreciated. Note that we haven't still decided over which time window the availability will be calculated,... [16:26:02] (03CR) 10Muehlenhoff: "/usr/bin/mail uses the alternatives mechanism but is ultimately provided by bsd-mailx which is installed as a dependency by Icinga." [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [16:26:08] maps, or at least the one, have returned for me. nice! [16:26:23] I will add more details on T217898, but the problem now seems that's fixed. [16:27:53] (03CR) 10Dzahn: "yea, that's mailx but still different. -a is adding a header not attachment." [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [16:27:53] I've created a stub incident report (https://wikitech.wikimedia.org/wiki/Incident_documentation/20190308-wdqs) I'll add more content, but feel free to already add whatever you have. [16:30:17] (03PS1) 10Alexandros Kosiaris: cxserver: Add kademlia support [deployment-charts] - 10https://gerrit.wikimedia.org/r/495252 [16:30:39] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Works in testing. 
Merging, we can always improve on it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/492301 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [16:31:23] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Worked as well in minikube, Production is not so easy as we lack a coredns component, but this will have to do for now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/495252 (owner: 10Alexandros Kosiaris) [16:32:23] (03PS1) 10Mforns: Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) [16:36:32] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] openstack - Convert cron jobs to systemd timers (0316 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [16:39:29] (03PS1) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:39:41] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:40:17] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10bd808) >>! In T216712#5010593, @MoritzMuehlenhoff wrote: > Is php-xdebug (and php-tideways) used in the stretch-based Toolforge PHP... [16:41:46] (03PS2) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:41:59] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:43:33] mateusbs17: I still see maps1004.eqiad returning a 400 (and on kartotherian 1.0.0, not 1.0.1) [16:43:34] (03PS4) 10Dzahn: openstack: monitoring: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205 [16:45:25] cdanis, mateusbs17: I confirm that package.json on maps1004 is still 1.0.0 [16:46:08] (03CR) 10Matthias Mullie: External api url for wbsearchentities api call on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:46:59] (03PS3) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:47:10] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:47:13] (03CR) 10Cparle: External api url for wbsearchentities api call on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:47:51] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@acf2694] (stretch): UBN geoshapes services on maps1004.eqiad.wmnet (T217898) [16:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:54] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 
[16:47:59] (03CR) 10Dzahn: [C: 03+2] "thank you too for the review" [puppet] - 10https://gerrit.wikimedia.org/r/495205 (owner: 10Dzahn) [16:48:12] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@acf2694] (stretch): UBN geoshapes services on maps1004.eqiad.wmnet (T217898) (duration: 00m 22s) [16:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:36] looks good now, thanks :) [16:49:11] cdanis: Thanks for finding this one [16:50:58] (03PS4) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:51:56] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:53:43] (03CR) 10Reedy: [C: 04-1] "needs moar tabs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:53:48] (03PS5) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:54:57] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:55:47] (03PS6) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [17:02:32] (03PS1) 10Bstorm: toolforge: reformatting promethus grid engine collector [puppet] - 10https://gerrit.wikimedia.org/r/495261 [17:02:34] (03PS1) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) [17:02:45] nice dropoff of 400s in upload: [17:02:48] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&panelId=2&fullscreen&var-site=All&var-cache_type=upload&var-status_type=4&from=now-3h&to=now [17:04:11] bblack: so how would one do an accurate census of 400s that were kartotherian? [17:05:01] (03CR) 10Bstorm: "Note that I included a commit that reformats with black so that doesn't screw up the diff. I think I did this right, and it runs. I can " [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm) [17:06:43] RECOVERY - Device not healthy -SMART- on phab1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1002&var-datasource=eqiad+prometheus/ops [17:14:00] (03PS1) 10Paladox: Update image-diff plugin [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495264 [17:14:10] cdanis: I don't think we have that "live", but we could get it near-realtime from analytics stuff [17:14:31] (03CR) 10Paladox: [V: 03+2 C: 03+2] "This should fix it so that the plugin can be installed now." 
[software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495264 (owner: 10Paladox) [17:15:10] (03CR) 10Ottomata: [C: 03+2] Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) (owner: 10Mforns) [17:15:17] (03PS2) 10Ottomata: Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) (owner: 10Mforns) [17:15:19] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) (owner: 10Mforns) [17:18:04] (03PS1) 10Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495265 [17:18:45] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:18:49] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:18:49] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:19:15] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:19:21] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:20:36] 10Operations, 10netops: Increase network capacity (2018-19 Q3 Goal) - https://phabricator.wikimedia.org/T213122 (10ayounsi) [17:20:43] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) 05Open→03Resolved the redundancy testing is outside the scope of the goal, so everything needed here is done. 
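One near-realtime census of those 400s from the analytics side would be a Hive query against the webrequest data; a sketch, assuming the wmf.webrequest table and field names as documented on Wikitech, with the path filter a guess at the affected endpoints:

  hive -e "
    SELECT hour, COUNT(*) AS failed
    FROM wmf.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2019 AND month = 3 AND day = 8
      AND uri_host = 'maps.wikimedia.org'
      AND uri_path LIKE '/geo%'   -- geoshape + geoline endpoints
      AND http_status = '400'
    GROUP BY hour
    ORDER BY hour;
  "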
[17:21:49] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [17:22:37] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [17:22:51] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [17:24:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [17:26:30] 10Operations, 10decommission, 10Patch-For-Review, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki) [17:26:34] what's up with aqs/druid? [17:26:49] (03CR) 10Bstorm: "Cherry-picked this into toolsbeta and it works great. Merging." [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [17:26:53] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [17:26:57] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [17:26:59] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:26:59] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [17:27:10] (03PS2) 10Bstorm: toolforge: switch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [17:27:13] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [17:27:21] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [17:27:29] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [17:27:41] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [17:27:53] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [17:28:51] (03PS1) 10CRusnov: Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/495267 [17:28:58] (03CR) 10Bstorm: [C: 03+2] toolforge: switch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [17:30:00] !log decom in progress for rdb100[123478] via T209181 [17:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:03] T209181: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 [17:30:06] who can help with a possible Android app unique bug/vandalism ? [17:30:25] NotASpy: is this the same issue as reported at https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Incidents#dick_picks_on_the_Wikipedia_app ? [17:30:43] it is, and I've confirmed with a screenshot if needed [17:31:06] yeah, I can reproduce on desktop. 
[17:31:16] done the usual sweep for template vandalism, Wikidata vandalism etc, and can't find any obvious candidates
[17:31:17] (03CR) 10Muehlenhoff: "The change is fine as-is, but will need some future fixup when the exporter gets built for buster, see https://phabricator.wikimedia.org/T" [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 (owner: 10Arturo Borrero Gonzalez)
[17:33:28] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10Muehlenhoff) >>! In T216521#5011235, @MSantos wrote: > I tested Mich...
[17:33:41] cdanis: actually, it has been figured out. It's a caching issue from some vandalism about 12 hours ago. Purging is fixing it.
[17:47:04] (03CR) 10BryanDavis: [C: 03+1] "Seems worth trying as a temporary measure to reduce load on the LDAP directories while we continue to dig for deeper solutions." [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni)
[17:47:11] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:47:52] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH)
[17:49:11] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (10debt) 05Open→03Resolved
[17:51:19] Hey all! Wondering if anyone is available to deploy a quick labs config change (or if we can skip the deploy to production if it's labs-only now)
[17:51:23] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (10debt) 05Open→03Resolved
[17:51:25] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (10debt)
[17:51:29] https://gerrit.wikimedia.org/r/495256
[17:52:13] 10Operations, 10Maps (Kartotherian), 10Wikimedia-Incident: Create test in spec.yaml for the kartotherian / geoshape service - https://phabricator.wikimedia.org/T217910 (10Gehel)
[17:52:26] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions (this is happening on cloud) - https://phabricator.wikimedia.org/T76090 (10debt) 05Open→03Resolved
[18:00:43] (03PS4) 10Paladox: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865
[18:03:13] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH) NETWORK PORT INFO rdb1001:asw2-c-eqiad:ge-4/0/9 rdb1002:asw2-c-eqiad:ge-7/0/18 rdb1003:asw-a-eqiad:ge-4/0/43 rdb1004:asw2-b-eqia...
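The purge fix described at 17:33:41 uses MediaWiki's standard cache invalidation: a purge discards the cached rendering of a page so it is regenerated from the current wikitext. A minimal sketch via the action API (the wiki URL and page title are illustrative assumptions; on-wiki, appending ?action=purge to a page URL does the same thing):

    import urllib.parse
    import urllib.request

    # Minimal sketch of the standard action=purge API call; the wiki and
    # page title here are illustrative assumptions.
    api = "https://en.wikipedia.org/w/api.php"
    data = urllib.parse.urlencode({
        "action": "purge",
        "titles": "Example_page",
        "format": "json",
    }).encode()

    # The purge module requires a POST request.
    req = urllib.request.Request(api, data=data)
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.read().decode())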
[18:03:44] (03CR) 10BryanDavis: gridengine: Add prometheus monitor for queue/host health (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[18:03:51] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1001.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:04] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1002.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:16] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1003.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:28] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1004.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:41] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1007.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:54] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1008.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:05:20] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10MSantos) We have the epic with more general information that needs t...
[18:05:33] (03CR) 10Bstorm: gridengine: Add prometheus monitor for queue/host health (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[18:06:49] (03PS1) 10RobH: decom rdb100[123478].eqiad.wmnet dns entries [dns] - 10https://gerrit.wikimedia.org/r/495274 (https://phabricator.wikimedia.org/T209181)
[18:08:49] (03CR) 10RobH: [C: 03+2] decom rdb100[123478].eqiad.wmnet dns entries [dns] - 10https://gerrit.wikimedia.org/r/495274 (https://phabricator.wikimedia.org/T209181) (owner: 10RobH)
[18:09:49] (03PS1) 10RobH: decom of rdb100[123478] [puppet] - 10https://gerrit.wikimedia.org/r/495275 (https://phabricator.wikimedia.org/T209181)
[18:11:05] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH)
[18:11:19] (03CR) 10RobH: [C: 03+2] decom of rdb100[123478] [puppet] - 10https://gerrit.wikimedia.org/r/495275 (https://phabricator.wikimedia.org/T209181) (owner: 10RobH)
[18:12:09] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH) a:03Cmjohnson
[18:12:47] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:14:12] bblack: re: aqs/druid - the new edit metrics were published in druid, and caused a series of cache misses in there that bubbled up in aqs as well (the alerts are all for edit metrics afaics)
[18:14:20] in theory it shouldn't have happened
[18:14:29] (just seen the alerts)
[18:15:30] ok thanks :)
[18:16:46] I am not super happy about it, will do a proper investigation on monday :(
[18:17:41] (03CR) 10Reedy: [C: 03+2] External api url for wbsearchentities api call on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:17:45] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Wikimedia-Incident: Create test in spec.yaml for the kartotherian / geoshape service - https://phabricator.wikimedia.org/T217910 (10MSantos)
[18:17:46] (03CR) 10Reedy: [C: 03+2] "(labs!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:18:52] (03Merged) 10jenkins-bot: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:19:15] (03CR) 10jenkins-bot: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:20:31] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta only (duration: 00m 49s)
[18:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:05] (03PS3) 10Fomafix: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/469262
[18:23:39] (03PS2) 10Fomafix: Add redirects from 'lzh' to 'zh-classical' [puppet] - 10https://gerrit.wikimedia.org/r/481533 (https://phabricator.wikimedia.org/T167513)
[18:24:02] (03PS8) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[18:25:09] (03PS2) 10Fomafix: Add 'lzh' as alias for 'zh-classical' [dns] - 10https://gerrit.wikimedia.org/r/481532 (https://phabricator.wikimedia.org/T167513)
[18:25:56] (03PS2) 10Fomafix: Add 'sgs' as alias for 'bat-smg' [dns] - 10https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830)
[18:26:56] (03PS2) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845)
[18:27:20] (03CR) 10Bstorm: gridengine: Add prometheus monitor for queue/host health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[19:03:58] (03PS4) 10Paladox: WIP: Update gerrit to 2.16.6 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012
[19:17:39] (03PS1) 10MarkTraceur: Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157)
[19:20:37] (03CR) 10Reedy: [C: 03+2] Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157) (owner: 10MarkTraceur)
[19:21:08] !log installing php updates on netmon1002
[19:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:06] (03Merged) 10jenkins-bot: Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157) (owner: 10MarkTraceur)
[19:25:31] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta only (duration: 00m 50s)
[19:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:31] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:29:04] (03CR) 10jenkins-bot: Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157) (owner: 10MarkTraceur)
[19:42:41] (03PS1) 10Bstorm: osmdb: Switch the replica to the VM that needs to become the master [puppet] - 10https://gerrit.wikimedia.org/r/495290 (https://phabricator.wikimedia.org/T193264)
[19:46:31] (03PS1) 10Legoktm: php72: Switch from thirdparty/php72 to component/php72 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/495291
[19:46:42] (03PS2) 10Legoktm: php72: Switch from thirdparty/php72 to component/php72 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/495291 (https://phabricator.wikimedia.org/T216712)
[19:51:21] PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:52:21] RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 81541 bytes in 0.292 second response time
[19:53:23] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[19:55:27] (03CR) 10BryanDavis: gridengine: Add prometheus monitor for queue/host health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[20:12:33] (03CR) 10Bstorm: gridengine: Add prometheus monitor for queue/host health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[20:14:46] (03PS2) 10Bstorm: toolforge: reformatting promethus grid engine collector [puppet] - 10https://gerrit.wikimedia.org/r/495261
[20:14:48] (03PS3) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845)
[20:32:11] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:33:11] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2
[20:35:33] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2
[20:35:45] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:35:46] ^ hm
[20:35:47] ?
[20:36:25] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore
[20:36:50] oom-killer
[20:41:09] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore
[20:41:27] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2
[20:49:57] (03CR) 10BryanDavis: [C: 03+1] gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[21:03:17] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) We've been around this topic a number of times, so I'll write a summary where we're at so far. I'm sorry it's going...
[21:05:26] (03PS4) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845)
[21:11:21] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > disable cache busting by default, enable it internally This would immediately break all external updaters. They'd...
[21:11:25] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > disable cache busting by default, enable it internally This would immediately break all external updaters. They'd...
[21:47:46] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > don't do cache busting on events older than X This however gave me an idea. If we kept a map of all latest revisi...
[22:27:53] (03CR) 10Bstorm: [C: 03+2] gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[22:28:07] (03CR) 10Bstorm: [C: 03+2] toolforge: reformatting promethus grid engine collector [puppet] - 10https://gerrit.wikimedia.org/r/495261 (owner: 10Bstorm)
[23:08:50] (03PS1) 10MaxSem: Remove $wgMediaInTargetLanguage, matches the MW default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495408
[23:21:23] (03CR) 10Nuria: [C: 03+1] "Adding higher puppet beings so they can +2" [puppet] - 10https://gerrit.wikimedia.org/r/484994 (https://phabricator.wikimedia.org/T209857) (owner: 10Gilles)
[23:49:10] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10bd808)
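Smalyshev's last (truncated) comment on T217897 above floats keeping a map of the latest revision seen per entity, so the updater only busts caches for events that actually advance an entity's state. A speculative Python sketch of that idea, purely as one reading of the truncated comment and not the actual wdqs-updater implementation:

    # Speculative sketch: track the newest revision seen per entity and only
    # cache-bust for events that advance it. Names and types are assumptions.
    latest_seen: dict[str, int] = {}

    def should_cache_bust(entity_id: str, revision: int) -> bool:
        """True only for events carrying a revision newer than any seen so far."""
        if revision <= latest_seen.get(entity_id, -1):
            return False  # stale or duplicate event: a plain, cacheable fetch suffices
        latest_seen[entity_id] = revision
        return True

    # Duplicate and out-of-order events skip the cache-busting fetch:
    for eid, rev in [("Q42", 100), ("Q42", 100), ("Q42", 99), ("Q42", 101)]:
        print(eid, rev, "bust" if should_cache_bust(eid, rev) else "no-bust")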