[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190308T0000).
[00:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:43] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga2001 is OK: (C)60 le (W)70 le 74.09 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:00:54] \o
[00:01:26] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2001.wikimedia.org,service=pdns_recursor
[00:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:07] I can SWAT
[00:05:00] thanks thcipriani
[00:05:21] thcipriani: 2 are beta cluster so should be pretty easy :)
[00:05:31] 10Operations, 10netops: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (10ayounsi) After changing the port range to IANA recommended range and restarting Bird, we can see the BFD packets leaving from the proper port: `IP dns2001.wikimedia.org.55170 > cr1-codfw.wikimedia.org.4784: UDP, length 24`...
[00:05:53] jdlrobson: does this need to be a change in IS-labs.php? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/495023
[00:06:54] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:07:13] (03CR) 10Jdlrobson: [C: 04-1] Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:07:17] yess eeke
[00:07:21] thanks for sanity checking me
[00:07:36] fixing now
[00:07:39] cool, thanks
[00:08:56] (03PS2) 10Jdlrobson: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023
[00:09:02] corrected!
[00:09:49] thansk
[00:09:51] *thanks
[00:10:26] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:11:38] (03Merged) 10jenkins-bot: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:11:53] (03PS2) 10Thcipriani: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:11:59] (03CR) 10Thcipriani: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:12:05] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:12:51] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Krinkle) I don't have input on the numbers, but I do think we need a p99 and max as well...
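[Editor's note: the `conftool action` SAL entry above is the audit trail confctl writes when a backend is (re)pooled. A minimal sketch of the CLI side, using the same selector syntax the log records; any flags beyond what the log shows are assumptions:]

```
# Pool dns2001's pdns_recursor service (this is what generated the SAL line above)
confctl select 'name=dns2001.wikimedia.org,service=pdns_recursor' set/pooled=yes

# The inverse, for depooling:
confctl select 'name=dns2001.wikimedia.org,service=pdns_recursor' set/pooled=no

# Inspect current state before/after:
confctl select 'name=dns2001.wikimedia.org' get
```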
[00:13:01] (03Merged) 10jenkins-bot: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:13:05] for some reason in the new gerrit UI it's harder for me to tell if things can merge or if they need a rebase.
[00:15:41] (03CR) 10jenkins-bot: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 (owner: 10Jdlrobson)
[00:15:45] (03CR) 10jenkins-bot: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) (owner: 10Jdlrobson)
[00:16:09] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:495024|Cleanup beta cluster config]] T213599; [[gerrit:495023|Enable advanced mobile contributions mode on beta cluster]] beta-only (noop) sync (duration: 00m 49s)
[00:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:12] T213599: Clean up beta cluster mobile configuration - https://phabricator.wikimedia.org/T213599
[00:18:01] jdlrobson: minervaneue update is live on mwdebug1002, check please
[00:18:16] Sweet! Checking!
[00:19:44] fix confirmed! please sync!
[00:20:20] * thcipriani does
[00:23:43] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.20/skins/MinervaNeue/resources/skins.minerva.scripts/toc.js: SWAT: [[gerrit:495021|Passing page parameter to TOC toggler]] T217820 (duration: 00m 50s)
[00:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:46] T217820: Regression: "Expand all sections" setting in Minerva is broken - https://phabricator.wikimedia.org/T217820
[00:23:50] ^ jdlrobson all live now
[00:24:03] next run of beta-scap-eqiad should make the beta things live
[00:25:41] thcipriani: checking again :S
[00:26:13] thanks i'll ping you in a bit if i'm not seeing them, if there's any problems it can wait till monday anyhow
[00:26:30] thanks for the SWAT fix though!
[00:27:28] sure thing, thanks for checking patches!
[00:28:27] thcipriani it should show "Merge Conflict" but at least in 2.16 it is much more noticeable.
[00:28:56] old ui says something like "Cannot Merge" in red letters
[00:31:22] thcipriani the new ui on 2.16 shows a status badge now
[00:31:25] *badge
[00:31:51] nice, less subtle would be good :)
[00:32:01] thcipriani like https://gerrit-review.googlesource.com/c/gerrit/+/201895/5
[01:00:01] thcipriani the healthcheck plugin is fixed in https://gerrit-review.googlesource.com/c/plugins/healthcheck/+/217192
[01:02:12] (03PS1) 10MarkAHershberger: Add --timely option to help with debugging [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/495158
[01:03:24] (03CR) 10jerkins-bot: [V: 04-1] Add --timely option to help with debugging [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/495158 (owner: 10MarkAHershberger)
[01:29:36] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10zhuyifei1999) ` root@tools-sgeexec-0914:~# strace -s 1024 -p 543 -p 560 -p 561 -p 562 -p 563 -p 564 -p 568 -p 56...
[01:37:35] (03CR) 10Krinkle: [C: 03+1] "LGTM. Won't be used until I7de713adbcc2f1ee9f2 goes out, but is harmless to roll out ahead of time, as far as I can see." [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles)
[03:53:50] (03PS2) 10KartikMistry: Enable ExternalGuidance to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129)
[04:27:38] (03PS1) 10Paladox: Add missing "wikimedia" plugin to tools/bzl/plugins.bzl [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495170
[04:28:05] (03PS2) 10Paladox: Add missing "wikimedia" plugin to tools/bzl/plugins.bzl [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495170
[04:28:52] (03CR) 10Paladox: [V: 03+2 C: 03+2] Add missing "wikimedia" plugin to tools/bzl/plugins.bzl [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495170 (owner: 10Paladox)
[06:15:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173
[06:16:47] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173 (owner: 10Marostegui)
[06:17:23] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/495174
[06:17:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173 (owner: 10Marostegui)
[06:18:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 51s)
[06:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:10] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/495174 (owner: 10Marostegui)
[06:20:38] !log Reload haproxy on dbproxy1010 to depool labsdb1010
[06:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495173 (owner: 10Marostegui)
[06:21:01] !log Stop replication on s3 on labsdb1009 and labsdb1011
[06:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:07] !log Deploy schema change on s3 db1077 with replication (lag will happen on s3 labs)
[06:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:13] marostegui: o/
[06:35:32] going to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494874/, ok for you?
[06:35:42] o/
[06:35:43] let me see
[06:35:44] (03PS13) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231)
[06:35:48] Ah
[06:36:09] (03CR) 10Marostegui: [C: 03+1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey)
[06:36:13] super :)
[06:36:15] I thought I already +1ed
[06:36:16] Sorry!
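[Editor's note: the SWAT verification dance earlier (stage on mwdebug1002, "check please", "fix confirmed! please sync!") is the standard canary flow. A rough sketch of the deployer's side, assuming the scap subcommands of that era (`scap pull` on the debug host, `scap sync-file` from deploy1001):]

```
# On mwdebug1002: pull the staged code so the patch owner can test it
# (testers route their traffic here with the X-Wikimedia-Debug header/extension)
scap pull

# On deploy1001, once the owner confirms: sync just the touched file fleet-wide.
# This is what produces the "Synchronized php-1.33.0-wmf.20/.../toc.js" SAL entry.
scap sync-file php-1.33.0-wmf.20/skins/MinervaNeue/resources/skins.minerva.scripts/toc.js \
    'SWAT: [[gerrit:495021|Passing page parameter to TOC toggler]] T217820'
```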
[06:36:40] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey)
[06:38:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15038/" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey)
[06:38:53] marostegui: nono I only wanted to know if it was ok for you :)
[06:49:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175
[06:50:49] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) @Legoktm , @bd808 : The PHP 7.2 packages from the component/php72 are working fine in the Mediawiki PHP productio...
[06:50:58] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175 (owner: 10Marostegui)
[06:51:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175 (owner: 10Marostegui)
[06:52:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 for mysql upgrade (duration: 00m 49s)
[06:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:45] !log Stop MySQL on db1076 for upgrade
[06:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:40] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495175 (owner: 10Marostegui)
[06:57:52] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176
[07:02:55] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176 (owner: 10Marostegui)
[07:03:52] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176 (owner: 10Marostegui)
[07:04:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 after mysql upgrade (duration: 00m 48s)
[07:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:44] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495176 (owner: 10Marostegui)
[07:07:32] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179
[07:11:08] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179 (owner: 10Marostegui)
[07:12:03] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179 (owner: 10Marostegui)
[07:13:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 into API after mysql upgrade (duration: 00m 48s)
[07:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:57] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495179 (owner: 10Marostegui)
[07:17:24] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180
[07:21:09] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:19] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[07:21:22] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180 (owner: 10Marostegui)
[07:21:23] uh?
[07:21:37] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:39] PROBLEM - Wikitech-static main page has content on labtestweb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:47] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180 (owner: 10Marostegui)
[07:22:05] (03PS3) 10Dzahn: DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul)
[07:22:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 after mysql upgrade (duration: 00m 49s)
[07:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 'd rather we just added the hour, minute, second to the nightly flag instead of adding another one. It should be way smaller as a change" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/495158 (owner: 10MarkAHershberger)
[07:28:25] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495180 (owner: 10Marostegui)
[07:30:11] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/495181
[07:30:37] (03PS2) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/495181
[07:31:03] (03CR) 10Dzahn: [C: 03+2] DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul)
[07:31:18] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/495181 (owner: 10Marostegui)
[07:31:59] (03PS4) 10Dzahn: DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul)
[07:34:29] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:00] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 30s)
[07:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:57] RECOVERY - Wikitech-static main page has content on labtestweb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 33651 bytes in 3.308 second response time
[07:35:57] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 33651 bytes in 3.321 second response time
[07:36:03] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.06 ms
[07:36:37] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33651 bytes in 0.214 second response time
[07:37:52] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/495183
[07:39:04] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/495183 (owner: 10Marostegui)
[07:41:06] weird. i have a change where i simply move an .erb template to a new location, no change inside it, yet the compiler "fails to parse" it
[07:45:48] (03PS1) 10Marostegui: db-eqiad.php: More traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184
[07:46:20] (03PS2) 10Marostegui: db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184
[07:47:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184 (owner: 10Marostegui)
[07:48:30] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184 (owner: 10Marostegui)
[07:49:02] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185
[07:49:06] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185
[07:49:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1076 after mysql upgrade (duration: 00m 49s)
[07:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:12] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185 (owner: 10Marostegui)
[07:50:50] (03CR) 10jenkins-bot: db-eqiad.php: More traffic for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495184 (owner: 10Marostegui)
[07:51:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185 (owner: 10Marostegui)
[07:51:21] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495185 (owner: 10Marostegui)
[07:51:29] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 (duration: 00m 48s)
[07:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:48] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 01m 18s)
[07:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:04] elukey: did labsdb1012 change work?
[07:55:06] marostegui: yep!
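[Editor's note: two different (de)pooling mechanisms are in play above. The MediaWiki replicas (db1076/db1077) are pooled by weight in wmf-config/db-eqiad.php and each step is just an edit plus a sync; the wiki-replica proxies (dbproxy1010/1011) are haproxy boxes whose backend list is puppet-managed. A rough sketch of both, with the gradual-repool rationale in comments; the details of the PHP weight array are not shown in the log, so treat the first comment as an assumption:]

```
# 1) MediaWiki side: edit the replica's weight in wmf-config/db-eqiad.php
#    (0 or absent = depooled; repooling goes in steps so the restarted
#    server's buffer pool warms up before taking full traffic), then:
scap sync-file wmf-config/db-eqiad.php 'Depool db1076 for mysql upgrade'

# 2) dbproxy side: merge the puppet change, then apply it and reload haproxy
#    (reload, not restart, so existing connections drain gracefully):
sudo puppet agent -t
sudo systemctl reload haproxy
```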
[07:55:12] I can telnet from all analytics now
[07:55:20] so my team can start the test from hadoop
[07:56:42] coool
[07:57:51] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:53] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 02s)
[07:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:51] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster
[07:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:31] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 40s)
[07:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:30] (03PS7) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761)
[08:08:07] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186
[08:08:58] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15040/" [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn)
[08:09:05] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186 (owner: 10Marostegui)
[08:10:02] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186 (owner: 10Marostegui)
[08:11:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1076 (duration: 00m 48s)
[08:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:44] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495186 (owner: 10Marostegui)
[08:16:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188
[08:17:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188 (owner: 10Marostegui)
[08:18:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188 (owner: 10Marostegui)
[08:20:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3311,db1096:3315 (duration: 00m 48s)
[08:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:20] (03CR) 10Dzahn: [C: 03+2] profile: point to real modules for specs [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar)
[08:24:12] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/495189
[08:25:06] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495188 (owner: 10Marostegui)
[08:27:53] (03CR) 10Dzahn: [C: 03+2] "needs manual rebase apparently" [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar)
[08:28:06] (03PS4) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[08:28:38] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[08:29:40] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/495189 (owner: 10Marostegui)
[08:31:14] !log Reload haproxy on dbproxy1010 to repool labsdb1011
[08:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:27] (03PS1) 10Marostegui: dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/495190
[08:36:10] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/495190 (owner: 10Marostegui)
[08:37:19] !log Reload haproxy on dbproxy1011 to depool labsdb1009
[08:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:00] 10Operations, 10MediaWiki-Vagrant: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles)
[08:45:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191
[08:50:47] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris)
[08:51:09] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris) p:05Triage→03Normal
[09:02:59] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris)
[09:04:05] 10Operations, 10serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10akosiaris)
[09:13:04] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191 (owner: 10Marostegui)
[09:14:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191 (owner: 10Marostegui)
[09:15:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3311,db1096:3315 (duration: 00m 49s)
[09:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495191 (owner: 10Marostegui)
[09:21:16] 10Operations, 10MediaWiki-Vagrant: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles) p:05Triage→03Normal a:03Gilles
[09:21:27] 10Operations, 10MediaWiki-Vagrant, 10Performance-Team: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles)
[09:28:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193
[09:31:44] 10Operations, 10MediaWiki-Vagrant, 10Performance-Team, 10Patch-For-Review: Can't provision elk role on Vagrant anymore: logstash Debian package is nowhere to be found - https://phabricator.wikimedia.org/T217666 (10Gilles) 05Open→03Resolved
[09:31:46] (03PS1) 10Dzahn: Revert "Revert "icinga: merge https and http checks"" [puppet] - 10https://gerrit.wikimedia.org/r/495194
[09:32:25] (03PS2) 10Dzahn: Revert "Revert "icinga: merge https and http checks"" [puppet] - 10https://gerrit.wikimedia.org/r/495194
[09:32:42] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "icinga: merge https and http checks"" [puppet] - 10https://gerrit.wikimedia.org/r/495194 (owner: 10Dzahn)
[09:36:54] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193 (owner: 10Marostegui)
[09:37:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193 (owner: 10Marostegui)
[09:38:42] there will be some alerts for Elastic search but we are already debugging them
[09:39:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080, db1110 (duration: 00m 49s)
[09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:18] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: (null)
[09:39:18] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: (null)
[09:39:22] those
[09:39:28] downtiming now
[09:39:29] sorry
[09:39:34] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: (null)
[09:39:52] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: (null)
[09:39:52] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: (null)
[09:39:54] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2004 is CRITICAL: (null)
[09:39:58] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: (null)
[09:39:59] no problem, we only had seconds to do that after they got created
[09:40:00] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: (null)
[09:40:00] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: (null)
[09:40:02] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: (null)
[09:40:06] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: (null)
[09:41:02] also doing some (without disabling all other checks on those hosts)
[09:41:42] yep.. I only disable those checks
[09:41:48] perfect
[09:42:10] that's why i did not use the script, it does all services on a host
[09:43:52] check_command check_elasticsearch_shards_threshold!9200!>=0.34 debugging ... it appears to work when running manual
[09:44:44] for all hosts?
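[Editor's note: "running manual" here can be reproduced from any host that can reach port 9200, since the shards check ultimately queries Elasticsearch's cluster health API. A minimal sketch; the `jq` filter is just one way to pull the fields the alerts quote, and a healthy cluster reports status green with zero unassigned shards:]

```
# Same data the icinga plugin evaluates, fetched by hand:
curl -s 'http://relforge1001:9200/_cluster/health?pretty'

# Or just the fields the alert messages quote:
curl -s 'http://relforge1001:9200/_cluster/health' \
    | jq '{status, unassigned_shards, active_shards_percent_as_number}'
```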
[09:45:28] tested with relforge1001 which is in the list above
[09:45:38] also with both hostname or IP
[09:45:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 and db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495193 (owner: 10Marostegui)
[09:45:51] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[09:45:51] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[09:45:58] oh, lol ?
[09:47:18] Ok
[09:48:35] well.. "works for one random host but not the others" is weird
[09:48:44] especially when i can do that for multiple hosts and get OK on shell
[09:49:56] (03PS5) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[09:50:24] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I would put the puppet code in modules/openstack/manifests/nova/compute/service.pp since I didn't detect anything mitaka specific or jessi" [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni)
[09:50:51] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[09:51:29] mutante: I think I found the issue
[09:51:40] !log temp disabling puppet on icinga to debug an issue with elastic checks
[09:51:41] mutante: https://icinga.wikimedia.org/cgi-bin/icinga/config.cgi?type=command&host=relforge1001&service=ElasticSearch%20health%20check%20for%20shards%20on%209200&expand=check_elasticsearch_shards_threshold%219200%21%3E%3D0.15
[09:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:13] onimisionipe: " " around the ">" part, ack
[09:52:20] that's the same thing i was gonna test
[09:54:33] 3/3 CRITICAL - elasticsearch 9200://relforge1002:>=0.15/_cluster/health error while fetching: No connection adapters were found for '9200://relforge1002:>=0.15/_cluster/health'
[09:54:47] onimisionipe: ^ oh look what happens next if we add " "
[09:55:15] the host name is not even in there
[09:55:53] hmm
[09:56:15] 9200://relforge
[09:56:23] something is mixed up with the order of the arguments
[09:56:30] yes
[09:56:30] that should be the protocol in there
[09:56:37] yep
[09:59:50] mutante: the order of args seems Ok in the CR
[10:00:01] I might be missing something
[10:01:16] i can't say i see the reason yet either. what you said
[10:05:42] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/495195
[10:11:23] (03PS6) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:12:13] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:12:49] (03PS2) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/495195
[10:13:27] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/495195 (owner: 10Marostegui)
[10:14:28] !log Reload haproxy on dbproxy1011 to repool labsdb1009
[10:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:55] (03PS7) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:19:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:21:07] godog: ^
[10:21:37] if you are around, we need some help with some issues with icinga
[10:24:13] (03PS8) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:24:15] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:26:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196
[10:28:53] (03PS9) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[10:29:51] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[10:32:03] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196 (owner: 10Marostegui)
[10:33:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196 (owner: 10Marostegui)
[10:33:14] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080 and db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495196 (owner: 10Marostegui)
[10:34:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1080, db1110 (duration: 00m 49s)
[10:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:37] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:34:37] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:34:51] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:34:51] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:34:51] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:34:51] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:34:52] mutante ^
[10:35:14] onimisionipe: lol, wtf :)
[10:35:19] i re-enabled puppet
[10:35:21] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in
[10:35:21] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[10:35:24] that is what happened
[10:35:26] mutante: did you revert or something
[10:35:33] no, i just let puppet run again
[10:35:48] i guess it needed 2 runs to first adjust the check command
[10:35:52] and then all the checks using it
[10:36:06] which would explain why it worked for one host .. maybe :p
[10:37:47] well, that was some wasted time :) but glad it works, hehee
[10:39:45] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:39:45] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:40:21] onimisionipe: there we go.. ^ just that it is too long for a single line
[10:40:25] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:40:25] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:41:19] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:41:19] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:41:29] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in
[10:41:29] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[10:42:29] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2004 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:42:29] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:43:09] sorry for spam
[10:43:20] recovery spam is the good spam :)
[10:43:49] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:44:29] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0,
[10:44:29] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[10:44:43] mutante: Thanks a lot for the help!
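[Editor's note: what the argument mix-up above comes down to: icinga expands the `!`-separated parts of a service's check_command into positional `$ARG1$`, `$ARG2$`, ... macros inside the command definition, so `check_elasticsearch_shards_threshold!9200!>=0.15` only works if the definition plugs each macro into the right slot. A minimal sketch of the shape involved; the plugin's real option names are not shown in the log, so the ones below are illustrative:]

```
# Service side: check_command "name!arg1!arg2"
#   check_elasticsearch_shards_threshold!9200!>=0.15  ->  $ARG1$=9200, $ARG2$=">=0.15"
define command {
    command_name check_elasticsearch_shards_threshold
    command_line $USER1$/check_elasticsearch_shards --host $HOSTADDRESS$ --port $ARG1$ --shards-inactive '$ARG2$'
}
```

[If the macros land in the wrong positions (say `$ARG1$` where the URL scheme belongs), the plugin builds a URL like `9200://relforge1002:>=0.15/_cluster/health`, which is exactly the "No connection adapters" error quoted earlier. And because puppet updates the command definition and the service definitions as separate resources, a single run can leave them briefly mismatched, which matches the "needed 2 runs" observation above.]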
[10:44:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:45:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:45:10] all green
[10:45:15] oops
[10:45:52] onimisionipe: you're welcome
[10:47:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:48:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:49:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:07:55] !log uploaded tideways 4.0.7-1+wmf1 for component/php72 (T216712)
[11:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:58] T216712: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712
[11:10:33] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10Dzahn) phab1002 done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/494885 doc1001 done in https://gerrit.wikimedia.org/r/...
[11:10:37] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10dr0ptp4kt) Great framing, nice job! One question, though, what's this part about and how does...
[11:15:53] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, and 2 others: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10Marostegui) 05Open→03Resolved As per our earlier chat - this seems to be working fine after the puppet change to get the FW opened for labsdb1012
[11:31:19] Request from via cp1077 cp1077, Varnish XID 1011843221
[11:32:43] Back again
[11:39:58] (03PS2) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 4 [puppet] - 10https://gerrit.wikimedia.org/r/495008
[11:40:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some first comments" (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto)
[11:49:54] (03PS1) 10Dzahn: openstack: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205
[11:59:13] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10dr0ptp4kt) ^ Well, I intended for that to be on email. But it stands: I think Olga put this i...
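[Editor's note: the tideways upload above targets component/php72, the internal apt component from T216712. On a host that consumes it, that roughly corresponds to a sources entry like the sketch below. The distribution/component layout is assumed from the component name in the log, and "tideways" is the source package name, so the binary package may be named differently:]

```
# /etc/apt/sources.list.d/php72.list (assumed layout)
deb http://apt.wikimedia.org/wikimedia stretch-wikimedia component/php72

# then verify the component is preferred for the package:
sudo apt-get update
apt-cache policy tideways
```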
[12:00:21] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:00:21] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:00:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:00:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:00:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:03:57] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:04:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:05:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:05:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:06:21] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:07:08] !log rolling security updates of sqlite3 on jessie and trusty
[12:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:17:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:17:25] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:17:37] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2842 bytes in 0.312 second response time
[12:17:55] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:17:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:18:14] * arturo paged
[12:18:29] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:18:34] ^ ema_ vgutierrez bblack
[12:18:39] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2840 bytes in 1.298 second response time
[12:18:49] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2817 bytes in 0.312 second response time
[12:18:55] Request from [redacted] via cp1077 cp1077, Varnish XID 1070238236 Error: 503, Backend fetch failed at Fri, 08 Mar 2019 12:15:49 GMT and Request from [redacted] via cp1077 cp1077, Varnish XID 184418667 Error: 503, Backend fetch failed at Fri, 08 Mar 2019 12:17:13 GMT - but I guess you know :)
[12:19:05] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17211 bytes in 0.306 second response time
[12:19:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:19:21] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:19:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:19:31] what's up?
[12:19:33] it's all 2001, is something happening with that box?
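[Editor's note: with 503 "Backend fetch failed" reports repeatedly naming cp1077, a natural next step is inspecting failed fetches on the cache host itself. A minimal sketch using Varnish's VSL query language, run on the suspect host; the grouping choice is illustrative:]

```
# Show whole client transactions whose backend fetch came back 503:
sudo varnishlog -g request -q 'BerespStatus == 503'

# Counter-level view of fetch failures:
sudo varnishstat -1 | grep -i fetch_failed
```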
[12:19:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:19:51] should we depool ulsfo?
[12:19:56] <_joe_> I don't think it's 2001
[12:20:00] <_joe_> and yes, depool ulsfo
[12:20:04] ok doing
[12:20:06] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17212 bytes in 1.108 second response time
[12:20:09] <_joe_> but it seems it's codfw/eqsin/ulsfo
[12:20:16] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17198 bytes in 0.331 second response time
[12:20:17] it's not only ulsfo
[12:20:21] I see errors everywhere: https://grafana.wikimedia.org/d/000000508/prometheus-varnish-http-errors-datacenters?orgId=1
[12:20:21] <_joe_> yeah
[12:20:23] https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=4&fullscreen&orgId=1&from=1552047213993&to=1552047605664
[12:20:27] <_joe_> it's also codfw
[12:20:31] ah it's not only ulsfo, depooling would not help
[12:20:45] <_joe_> yeah
[12:20:46] eqiad as well
[12:20:53] all text + esams/eqiad upload
[12:21:01] <_joe_> jfc
[12:21:03] le
[12:21:59] cp1077 seems misbehaving from https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[12:22:09] <_joe_> ok so, apart from what elukey just found
[12:22:12] <_joe_> anything else
[12:22:18] <_joe_> ?
[12:22:31] I got nothing yet
[12:22:32] seems kinda recovering
[12:22:36] me neither
[12:23:08] yeah.. cp1077 looks to be in pain
[12:23:13] nothing on the load-balancers dashboard
[12:23:23] volans: there's also some mailbox lag, not sure if related
[12:23:32] is there any dependency to sqlite? (I am guessing no) but it was the only thing ongoing at the time
[12:23:42] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:23:42] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:24:06] do we graph the mailbox lag?
[12:24:18] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:24:18] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:24:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:24:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:24:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:24:39] volans: we do yes - https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=13&fullscreen
[12:24:55] jbond42: cp1008 is pink unicorn. It doesn't serve production traffic
[12:25:01] so maybe network?
[12:25:04] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[12:25:05] jynus: all the LVSes/varnishes are stretch, those updates are all for jessie/trusty
[12:25:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:25:08] across the internet?
[12:25:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:25:23] all of our pops?
[12:25:26] I don't know, I am just throwing options
[12:25:43] a network issue on eqiad should be enough to create issues everywhere
[12:25:43] vgutierrez: did you do anything with cp1077?
[12:25:48] yeah and that's good, I am trying to figure out reasons it can't be that so we reach the root cause
[12:25:54] volans: noep
[12:25:56] *nope
[12:26:22] akosiaris: cp1077 is eqiad, if misbehaving could explain a widespread temporary issue?
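[Editor's note: "mailbox lag" above is the backlog between Varnish worker threads mailing expiry notices and the expiry thread consuming them. The grafana panel linked at 12:24:39 is derived from two varnishstat counters, so it can also be read directly on the host. A sketch, assuming the lag-as-difference interpretation (exp_mailed minus exp_received):]

```
# On cp1077: a large, growing gap between these two counters is the "mailbox lag"
sudo varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received
```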
[12:26:23] first whines were about 1 hr 45 mins ago (10:43 utc) [12:26:26] in here [12:26:48] volans: I can see an increase in uplaoad retransmits, but that could be just a consequence [12:26:54] sorry ,that was for akosiaris [12:27:02] elukey: the only thing that I'm having hard time to explain with cp1077 is why we got also esams/eqiad upload with issues [12:27:08] elukey: we 've have cps misbehaving before and the issues were not network wide [12:27:22] I am not sure if it is related, probably not [12:27:25] https://logstash.wikimedia.org/goto/13e5e1f56d28e0c40e6ce9e846086f2d [12:27:51] looking into network issues anyway via librenms [12:27:52] cp1077 is a text node, so upload issues shouldn't be related to it [12:28:05] akosiaris: yep yep agreed, the mailbox lag is always concerning me, in the past it was causing horrible things [12:28:13] jijiki: that looks bad but doesn't seem new [12:28:16] and it is also upload yes [12:28:28] volans: --^ [12:29:11] lots of abusefilter errors [12:32:02] Mar 8 12:25:29 asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1: Rx loss cleared [12:32:02] Mar 8 12:28:29 asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1: Rx loss set [12:32:06] hmm [12:32:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:32:28] oh, so akosiaris now you listen to me :-P [12:32:36] jynus: I always do :P [12:33:07] but that would not explain such a big pain [12:33:35] (wikicommons down) [12:33:35] the issues seem ongoing based on that graph ^ [12:33:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:33:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:34:01] Steinsplitter: there is some instability, but things should not be hard down, retry [12:34:26] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:34:26] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:34:33] jynus :) [12:34:38] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:34:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:35:04] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:35:04] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:35:16] PROBLEM - HTTP 
availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:35:41] fyi cp1077 is peaking again [12:36:04] nothing weird in traffic graphs btw [12:36:32] there is a 98.5% availability, that is a lot of errors [12:36:46] and seems recurring [12:37:14] text, all dcs [12:37:15] nothing weird in peering transit or core ports in librenms [12:38:14] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:38:38] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:38:43] hmm [12:38:54] that QSFP said again [12:39:10] Mar 8 12:33:30 asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1: Rx loss set [12:39:12] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:39:23] Steinsplitter: I don't see any issues with commons, at least browsing [12:39:35] what's on fpc3 [12:39:48] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:39:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:39:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:00] looks like ipsec bounced on cp1077 [12:40:04] apergos: his issues were real, there is at times a 1.5-2% error rate [12:40:24] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:40:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:41:01] with peaks of 350 errors/s [12:41:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:41:42] https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1&var-site=eqiad&var-site=codfw&var-site=ulsfo&var-site=eqsin&var-site=esams&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&from=1552034133713&to=1552048829080 [12:44:26] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Our docs aren't good :-) You could probably just point everything to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troublesho" [puppet] - 10https://gerrit.wikimedia.org/r/495205 (owner: 10Dzahn) [12:45:48] akosiaris: at the moment, my only recommendation would be to depool cp1077 [12:45:59] because it has a strange network pattern [12:46:06] unlike the others [12:46:07] I got nothing better either [12:46:10] sure [12:46:13] lemme do that [12:46:21] 2 dips on network [12:46:27] while the other don't have that [12:47:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1077.* [12:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:33] !log depooling cp1077 just in case, high mailbox lag https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=13&fullscreen [12:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:37] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1009 to cloudvirt1009 - https://phabricator.wikimedia.org/T216281 (10aborrero) a:05aborrero→03Cmjohnson [12:51:27] I am getting errors as well [12:51:46] errors on which, Bsadowski1? [12:51:46] "Request from xxxxxx via cp1089 cp1089, Varnish XID 162562291" [12:51:53] Error: 503, Backend fetch failed at Fri, 08 Mar 2019 12:51:09 GMT [12:52:04] wikis? [12:52:08] Simple [12:52:11] thx [12:55:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:55:06] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [12:55:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:56:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:57:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:57:42] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10TheDJ) Just stumbled across this in CommonSettings: ` $wgSVGConverters['rsvg-broken'] = '$path/rsvg-convert -w $widt... 
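For reference, the depool recommended above would look something like the confctl invocation below (a sketch based on the selector recorded in the !log above; exact flags per the conftool docs). Note that the SAL entry above actually recorded set/pooled=yes rather than no, which becomes relevant later in the log.

  # on the cluster management host (sketch; selector taken from the !log above)
  sudo confctl select 'name=cp1077.eqiad.wmnet' set/pooled=no
  # verify the object state afterwards
  sudo confctl select 'name=cp1077.eqiad.wmnet' get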
[13:03:18] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [13:05:13] (03PS3) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:05:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:05] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10Vgutierrez) {F28347567} it looks indeed like purge requests [13:06:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:20] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:06:56] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:56] PROBLEM - HTTP availability for Varnish at eqiad on icinga2001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:58] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [13:07:08] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:07:22] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2842 bytes in 0.315 second response time [13:07:26] All my requests to Meta result in a Varnish error [13:07:32] nvm [13:07:37] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:07:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:08:12] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) >>! 
In T217893#5011033, @Vgutierrez wrote: > {F28347567} it looks indeed like purge requests I updated the comment- Those seem recurring, there wa... [13:08:40] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17209 bytes in 0.342 second response time [13:08:48] the report is only about IPv6, weird [13:08:51] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10hashar) The spike of PURGE requests to the Varnish text frontends seems to be recurring. A view over 24 hours from https://grafana.wikimedia.org/d/000000180... [13:09:11] no it's ipv4 as well [13:09:43] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1077.eqiad.wmnet [13:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] vgutierrez: was it still pooled? I am pretty sure I depooled it [13:10:35] akosiaris: according to conftool yes [13:10:50] ah dammit, yes I now see the sal log [13:10:54] *confctl sorry [13:10:55] ` set/pooled=yes; selector: name=cp1077.*` akosiaris :-P [13:10:55] I passed pooled=yes [13:11:01] yeah yeah my bad [13:11:01] oh, I didn't check that [13:11:08] sorry [13:11:18] so it went over me too! [13:12:36] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:12:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:12:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:13:10] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:13:12] RECOVERY - HTTP availability for Varnish at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:13:24] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:13:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:13:37] 10Operations, 10Traffic: Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10Vgutierrez) cp1077 effectively depooled at 13:09 UTC [13:13:50] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:14:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:14:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:15:49] (03PS3) 10GTirloni: openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) [13:17:48] (03PS4) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:19:01] marostegui, jynus - Hi my beloved DBAs - Letting you know I have started a test sqooping from labsdb1012 with a few more workers than usual - So far the host is (very) busy but doesn't complain - Please let me know if I should stop :) [13:19:40] joal: not a good time now [13:19:46] there are ongoing issues [13:19:52] please ping us at a later time [13:20:24] ack! [13:21:08] (03CR) 10GTirloni: "Puppet compiler output - https://puppet-compiler.wmflabs.org/compiler1002/15043/" [puppet] - 10https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: 10GTirloni) [13:31:56] (03PS5) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:40:59] 10Operations, 10Traffic: Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing - https://phabricator.wikimedia.org/T217893 (10jcrespo) [13:43:13] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (10Gehel) [13:43:42] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (10Gehel) [13:48:11] joal: labsdb1012 is "yours" so it is not having any other user traffic that might be affected by the load you might be generating [13:49:09] marostegui: I knew that, I prefer however to let you know, in case there is a moment in the day when not everything is burning and you want to check how it is doing with me hammering it :) [13:49:20] the only load worries would be if you can do the same thing, but faster in a different way, and if it creates so much load that you yourself get affected [13:49:21] joal: sure!
appreciate it :) [13:49:45] the issues should be fixed now [13:50:05] Happy to hear that - Thanks for caring :) [13:51:38] (03PS6) 10Mathew.onipe: elasticsearch: refactor elastic icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [13:54:26] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Gehel) [13:59:52] (03CR) 10Marostegui: "As requested, gave a quick look at daily_snapshot.py so far so good apart from the things we already discussed in IRC" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:13:30] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10BBlack) Looking at an internal version of the flavor=dump outputs of an entity, related observations: Test request from the in... [14:15:40] (03PS2) 10Dzahn: openstack: monitoring: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205 [14:15:55] !log gilles@deploy1001 Started deploy [performance/navtiming@f2d8a5f]: (no justification provided) [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:00] !log gilles@deploy1001 Finished deploy [performance/navtiming@f2d8a5f]: (no justification provided) (duration: 00m 05s) [14:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:19] (03PS3) 10Dzahn: openstack: monitoring: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205 [14:22:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/495205 (owner: 10Dzahn) [14:23:15] (03PS1) 10Arturo Borrero Gonzalez: postinst: explicitly mention home directory [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 [14:23:17] (03PS1) 10Arturo Borrero Gonzalez: d/rules: avoid running dh_installinit [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495229 [14:23:19] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.4 unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495230 [14:27:09] (03PS2) 10Arturo Borrero Gonzalez: postinst: explicitly mention home directory [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 [14:27:11] (03PS2) 10Arturo Borrero Gonzalez: d/rules: avoid running dh_installinit [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495229 [14:27:13] (03PS2) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.4 unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495230 [14:27:15] (03PS1) 10Arturo Borrero Gonzalez: d/control: include python dep [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495232 [14:27:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] postinst: explicitly mention home directory [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 (owner: 10Arturo Borrero Gonzalez) [14:28:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/control: include python dep [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495232 (owner: 10Arturo Borrero Gonzalez) [14:28:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/rules: avoid running dh_installinit [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495229 (owner: 10Arturo Borrero Gonzalez) [14:28:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.4 unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495230 (owner: 10Arturo Borrero Gonzalez) [14:31:52] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: release package 0.4 as unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495233 [14:34:27] !log T215605 add prometheus-rabbitmq-exporter v0.4 to stretch-wikimedia [14:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:30] T215605: cloudvps: missing packages in stretch for cloudcontrol servers - https://phabricator.wikimedia.org/T215605 [14:35:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: release package 0.4 as unstable [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495233 (owner: 10Arturo Borrero Gonzalez) [14:40:20] (03PS2) 10GTirloni: ldap: increase group TTL from 60 to 3600 seconds in labs [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) [14:42:07] (03Abandoned) 10GTirloni: openldap: Set thread pool based on processor count [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [14:43:28] (03CR) 10GTirloni: "Hey Arturo, when you get some spare time, could you review this? Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [14:44:21] (03Abandoned) 10GTirloni: wmcs::nfs::misc - Backup for misc server (cloudstore1008) [puppet] - 10https://gerrit.wikimedia.org/r/485375 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [14:45:08] (03PS3) 10GTirloni: ldap: increase group TTL from 60 to 300 seconds in labs [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) [14:56:02] (03PS7) 10Dzahn: create a new role 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) [15:09:37] (03Abandoned) 10Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/490225 (owner: 10Paladox) [15:10:31] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10MSantos) @MoritzMuehlenhoff and @Mholloway >>! In T216521#4986352,... [15:11:22] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10MSantos) 05Open→03Invalid [15:13:22] (03CR) 10Effie Mouzeli: [C: 03+2] Have coal watch the PaintTiming schema [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles) [15:13:33] (03CR) 10Effie Mouzeli: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15047/webperf1001.eqiad.wmnet/ Looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles) [15:13:56] (03PS2) 10Effie Mouzeli: Have coal watch the PaintTiming schema [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) (owner: 10Gilles) [15:21:33] (03PS1) 10Alexandros Kosiaris: Fix typo in comments [software/service-checker] - 10https://gerrit.wikimedia.org/r/495237 [15:21:36] (03PS1) 10Alexandros Kosiaris: Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 [15:25:38] https://fr.wikipedia.org/wiki/Hôtel_de_Blossac can we get a couple folks to see if the map in the infobox loads for them? if not, where are you located (country)? if so, where are you located (country)? [15:25:50] (it's failing for me and at least a couple other users in Europe) [15:25:59] tried a couple other random pages with maps, same [15:26:12] map loads for me [15:26:24] ok, another vote for 'works in the us' [15:26:28] (03CR) 10Gehel: [C: 04-1] "mostly minor comments about structure" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:27:00] do we have some idea on the nature of the failure? does some request timeout or 503? [15:27:50] nope, hey andrewbogott let's move the discussion on the maps here? 
[15:28:11] sure [15:28:39] https://en.wikipedia.org/wiki/R504_Kolyma_Highway#/map/0 fails for me, etc [15:29:20] same with mutante's cache-busting trick, with foo it still fails [15:29:37] same for me [15:29:45] it works for me as https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac#/map/0?foo [15:29:49] it's failing for me as well with a 400 on [15:29:49] https://maps.wikimedia.org/geoline?getgeojson=1&query=++SELECT+%3Fid+%3Flength+++%28if%28%3Fid+%3D+wd%3AQ1142859%2C+%27%23C12838%27%2C+%27%2307c63e%27%29+as+%3Fstroke%29+++%28concat%28%27Line+length%3A+%27%2C+str%28%3Flength%29%2C+%27+km%27%29+as+%3Fdescription%29+++%28if%28BOUND%28%3Flink%29%2C+++++++concat%28%27%5B%5B%27%2C+substr%28str%28%3Flink%29%2C31%2C500%29%2C+%27%7C%27%2C+%3FidLabel%2C+%27%5D%5D%27%29%2C+++++++%3FidLabe [15:29:50] l%29++++as+%3Ftitle%29+WHERE+%7B+++++%7B%3Fid+wdt%3AP16+wd%3AQ260792.%7D++++++SERVICE+wikibase%3Alabel+%7B+++++bd%3AserviceParam+wikibase%3Alanguage+%27en%27+.+++++%3Fid+rdfs%3Alabel+%3FidLabel+.+++%7D+++OPTIONAL+%7B%3Flink+schema%3Aabout+%3Fid.+++%3Flink+schema%3AisPartOf+%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E.%7D+%7D+GROUP+BY+%3Fid+%3Flink+%3FidLabel+%3Flength++ [15:29:50] so the enwiki or frwiki fetch succeeds, but then it loads data from maps.wikimedia.org, I'm assuming that's where the failure is, in a subrequest [15:29:54] bblack: in devtools I see a call to maps.wikimedia.org that returns HTTP 400 [15:30:03] same as cdanis :) [15:30:04] https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754 returns HTTP 400 payload "headers is not defined" [15:30:32] those URIs look pretty funky [15:30:33] !log gilles@deploy1001 Started deploy [performance/coal@8766469]: (no justification provided) [15:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:39] !log gilles@deploy1001 Finished deploy [performance/coal@8766469]: (no justification provided) (duration: 00m 06s) [15:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:42] why is there a select query inside a URI, and do we really support that? [15:30:57] gah I have no idea what to search for in logstash [15:31:02] (03PS2) 10Alexandros Kosiaris: Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 [15:31:24] yes we do support it [15:31:40] ok, we probably shouldn't, that seems a little crazy [15:31:47] the response refers to some headers missing: "headers is not defined" [15:31:49] but that's not something we can fix in the moment [15:32:02] dunno if client or server side though [15:32:07] this is to get some info from wikidata for building the map [15:32:17] regardless [15:32:30] so that's a big feature to cut if we do cut it, but as you say, discuss later [15:32:36] there's a SQL query encoded inside a URI, and it's fetched as a subrequest to a normal article pageview [15:32:39] for me it is as if a blank map got cached [15:32:41] it's wrong on many levels [15:32:56] * andrewbogott waiting for this to turn out to have an accidental wmcs dependency [15:33:01] lol [15:33:10] don't leave town :-P [15:33:38] mmm, kartotherian does not seem to be in wmflabs codesearch?
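Decoded for readability, the percent-encoded query embedded in that geoline URL reads roughly as the following SPARQL (a best-effort manual decode of the URL above, reindented; not an authoritative copy):

  SELECT ?id ?length
    (if(?id = wd:Q1142859, '#C12838', '#07c63e') as ?stroke)
    (concat('Line length: ', str(?length), ' km') as ?description)
    (if(BOUND(?link),
        concat('[[', substr(str(?link),31,500), '|', ?idLabel, ']]'),
        ?idLabel) as ?title)
  WHERE {
    { ?id wdt:P16 wd:Q260792. }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language 'en' .
      ?id rdfs:label ?idLabel .
    }
    OPTIONAL {
      ?link schema:about ?id.
      ?link schema:isPartOf <https://en.wikipedia.org/>.
    }
  }
  GROUP BY ?id ?link ?idLabel ?length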
[15:33:52] maps* [15:33:55] kartotherian is a production service behind cache_upload [15:33:57] it's not a labs thing, it's a production extension [15:34:04] so other fr maps links work for me [15:34:08] (maps.wikimedia.org goes through cache_upload to kartotherian) [15:34:17] yes yes, what I meant is that the source code does not seem to be *indexed* by https://codesearch.wmflabs.org [15:34:19] and in general the tiles seem to work [15:34:21] maps1001 et.. nothing in nginx error log [15:34:22] as the thing I want to do is search for this error message ;) [15:35:23] ah, sorry c danis, misunderstood [15:35:37] I am not sure this is a traffic error could be a client code one [15:36:07] yeah it's only the esams edge giving the 400 [15:36:23] Kartographer is the extension [15:36:30] hmmm maybe, but the error is returned by karthotherian backends [15:36:31] (that is indexed) [15:36:34] as in, the 400 is correct, just it is trying to load the incorrect url [15:36:39] looking at the headers... x-powered-by: kartotherian: 1.0.0 (439910027a55bf7c1effb2d34a1f42a5a995268d) [15:37:19] and the ones that work for me (200 through other edges) have: [15:37:22] x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b) [15:37:29] oops [15:38:09] so it's eqiad vs codfw [15:38:39] codfw + ulsfo + eqsin edges work (goes to codfw kartotherian backends running 0.0.38), eqiad + esams edges go to eqiad karto backends running this 1.0.0 [15:38:57] see SAL yesterday [15:38:59] 18:39 gehel@puppetmaster1001: conftool action : set/pooled=yes; selector: dc=codfw,cluster=maps,name=maps2004.codfw.wmnet [15:39:02] 18:32 mbsantos@deploy1001: Finished deploy [kartotherian/deploy@248b8c4] (stretch): Updating eqiad cluster before repool of maps2004.codfw.wmnet (duration: 01m 25s) [15:39:14] I've DNS-overridden my traffic to go via ulsfo and it works [15:39:37] damn, what did I do again? [15:40:00] gehel: what's the current intended state of kartoetherian versions on the two dcs' maps clusters? [15:40:04] if I don't override the resolution of text-lb, and just override maps.wikimedia.org to go to upload-lb.ulsfo, it also works [15:40:05] gehel: it appears there are different versions of kartotherian [15:40:18] so it does appear that the new kartotherian release is at fault [15:40:33] in the working pageload, x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b) [15:40:51] and 1.0.0 fails? huh [15:41:08] right, but is 1.0.0 even supposed to be live? maybe they were meant to stay depooled or something, I have no idea [15:41:18] neither do I [15:41:18] yep, we're in the process of migrating to stretch and there are a few servers lagging behind [15:41:39] does stretch imply a kartotherian upgrade from 0.0.38 to 1.0.0 as well? [15:41:44] the only recent change should be on maps2004, the other ones should not have anything new [15:42:00] hmm so the version mismatch is on purpose? [15:42:14] gehel: what we're observing so far, is certain maps URIs return 200 OK as expected, from codfw-maps cluster, which claims to run karto 0.0.38 [15:42:25] and the same URIs fail when routed into eqiad-maps cluster, which claims to run karto 1.0.0 [15:42:44] and the failed URL are on geoshape? 
[15:42:47] yes [15:43:10] so that's not tiles serving but one of the peripheral services [15:43:40] example https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac?foo#/map/0 [15:43:52] sorry, I said tiles were not loading, but that was not true, I was just getting a blank page [15:43:59] maps-eqiad should not have changed recently, so we're probably hitting a previously existing bug [15:44:16] it looks like maps2001-2003.codfw are running jessie, maps1001-1004.eqiad + maps2004.codfw are running stretch [15:44:17] sorry, this: https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac#/map/0 [15:44:17] mateusbs17: ^ you might know something about all that [15:44:24] It can also be seen on the infobox at https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac [15:44:35] bblack: correct, that's at least what I'm expecting [15:44:41] scrolling down, it shows nothing to me [15:44:49] and all are pool [15:44:52] *pooled [15:45:10] !log OS install on restbase2019 and restbase2020 [15:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:21] all pooled is also what's expected [15:45:22] gehel: the strong suspicion at this point is all the stretches are failing at this geoshape query, and all the jessies aren't [15:45:36] and they're reporting wildly different karto versions, too [15:46:31] yep, but eqiad has been running stretch for a number of weeks [15:47:08] is the OS version actually related to the karto version? [15:47:10] fwiw from https://logstash.wikimedia.org/goto/126571f3ef3e2163b342235e3795d05d it looks like there are at least two recurring errors [15:47:14] it's not a package [15:47:15] bblack: gehel: I found the issue, and it is related to unpublished code https://github.com/kartotherian/geoshapes/releases/tag/v1.0.4 [15:47:24] 'headers is not defined' and 'Bad geojson - unknown type ExternalData' [15:48:14] mateusbs17: you mean: this is a known issue and we have a fix ready to be deployed? [15:48:33] btw it is pretty easy to reproduce this with varying upload-lb locations from the command line: curl -v --resolve maps.wikimedia.org:443:$(dig +short upload-lb.codfw.wikimedia.org) 'https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754' [15:48:38] (3) maps[2001-2003].codfw.wmnet [15:48:42] version 0.0.1 [15:48:43] yes [15:48:46] https://github.com/kartotherian/geoshapes/commit/8995ed4ac0050c9e8ae36e78d860e4e84b3185b9 [15:48:56] but that fix is in 1.0.3 as well [15:48:57] some are 1.0 and some are 0.0.1 [15:49:13] sudo cumin 'maps*' 'grep version /srv/deployment/kartotherian/deploy/src/package.json' [15:49:30] so it sounds like we should depool maps@eqiad, yes? [15:49:40] gehel: Yes. bblack: this v1.0.3 is not matching npm for some reason. v1.0.4 fixes that. [15:49:51] cdanis: I'm not sure, I can log failures when querying the codfw applayer directly now, too [15:50:16] I am finishing some build tests and the next kartotherian deploy will come with a fix [15:50:28] someone give me a reason to not immediately depool maps@eqiad [15:50:28] maps2004 is already upgraded on codfw, so it would also exhibit the problem [15:50:35] maybe the whole codfw/eqiad split is a red herring about who has these items cached [15:50:48] I get the 400 when I internally query any of maps2001-4.codfw [15:50:53] (behind the caches) [15:50:56] i was able to make a URL work by adding ?foo to it..
without changing location [15:50:56] huh, I am yet to reproduce a failure on codfw with caches bypassed [15:51:10] cdanis: you can try depooling eqiad [15:51:18] 10Operations, 10Gerrit, 10Release-Engineering-Team: Deploy multi-site plugin to cobalt and gerrit2001 - https://phabricator.wikimedia.org/T217174 (10Paladox) a:05Paladox→03None This can be deployed to prod (if there's kafka in prod). In my testing this worked really well (we only want replication from co... [15:51:28] I don't think we should depool [15:51:39] may I suggest to search or create a task first [15:51:43] :-) [15:51:45] in terms of error log volume maps2004 is quite low compared to the eqiad maps hosts [15:51:53] if the problem is really the stretch upgrade we'll also need to depool 2004 [15:51:58] all the 200's I get against codfw are cache hits, so far, and when I add a random param to bust, they're 400s [15:52:02] and the risk at that point is overloading the service [15:52:05] so we can send people that may be having issues to it [15:52:25] mateusbs17: do you know if we already have a task for this? [15:52:43] oh random 200/400 maybe, depends on server? the data is confusing! [15:53:03] the state of that cluster is confusing :/ [15:53:27] We have two different problems on the example above: geoshapes and the page_props external data linking with JsonConfig. [15:53:31] how about getting the same karto version on all first? [15:54:09] well if stretch has a new one (a few different minor versions of it) and some of those are suspect, I'm not sure we can start there [15:54:13] I see that the geoshapes problem is not properly deployed. I spotted this a couple hours ago and started fixing the version [15:54:23] i dont see a relation to stretch since it's deployed with scap and not as a deb ? [15:54:37] ugh, some of my previous tests are faulty, curl apparently overriding --resolve by using ipv6 instead? [15:54:48] mutante: there are fixes in the 1.0.0 for stretch compatibility [15:54:50] * Added maps.wikimedia.org:443:208.80.153.240 to DNS cache [15:54:50] * Trying 2620:0:861:ed1a::2:b... [15:54:52] wtf :P [15:55:13] gehel: ah [15:55:14] bblack: -4 [15:55:31] yeah I see that, but that seems like really dumb default behavior [15:55:46] yeah.. unexpected at least [15:55:53] I didn't say --resolve-only-for-ipv4-but-then-use-ipv6-instead [15:55:56] I said --resolve :P [15:56:00] lolol [15:56:16] querying the hosts directly, maps200[123] work and maps2004 does not, just for the simple https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754 [15:56:44] we can depool only maps2004 since that is also what changed yesterday [15:56:49] maps2004 is also < x-powered-by: kartotherian: 1.0.0 (439910027a55bf7c1effb2d34a1f42a5a995268d) [15:57:02] and 2001,2,3 have which?
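The curl gotcha in the exchange above: the --resolve pin carried only an IPv4 address, while curl still connected over IPv6 from the normal DNS answer, so the pin was silently bypassed. Forcing the address family with -4 (bblack's "-4") makes the earlier reproduction command deterministic; a sketch reusing the command from the log:

  # force IPv4 so the IPv4 --resolve pin actually takes effect
  curl -4 -v --resolve maps.wikimedia.org:443:$(dig +short upload-lb.codfw.wikimedia.org) \
    'https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754'
  # append a throwaway parameter to bust Varnish cache hits, as done above
  curl -4 -sv --resolve maps.wikimedia.org:443:$(dig +short upload-lb.codfw.wikimedia.org) \
    'https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q3145754&foo=1' -o /dev/null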
[15:57:15] 0.0.1 [15:57:18] right [15:57:21] depending where you look [15:57:33] 0.0.38 according to the headers [15:57:35] so ignore the stuff I said in the middle about some codfw seeming to fail [15:57:36] ok [15:57:36] but 1001-1004 also have 1.0.0 [15:57:47] it is the version split that defines the 400 vs 200 responses [15:57:54] i am looking at package.json in the deploy dir though [15:58:16] x-powered-by: kartotherian: 0.0.38 (c49f37c39515675d95d3dd7da09ca535ec0d448b) [15:58:16] that one [15:58:21] so 1001-1004 and 2004 are bad due to having 1.0.0 and 2001-3 are good due to having 0.0.38 [15:58:22] (the OS version split, and the http-reported karto version, which may or may not correlate to some other version) [15:58:24] got it [15:58:41] ok, some more info: [15:58:50] since the thing that happened yesterday is pooling 2004 and 2004 looks faulty, let's depool just that one? [15:58:54] so if we want to fix this quickly, yes, we could depool all the stretches and maybe-overload the remaining 3 in codfw that are older [15:59:11] or we can upgrade the stretches to the 1.0.4 karto package for the bugfix [15:59:12] that issue was monkey patched because we did not want to package a full version during the stretch upgrade [15:59:31] https://phabricator.wikimedia.org/P8173 [15:59:58] awesome thanks c danis [16:00:12] yesterday's deployment did erase that monkey patch [16:00:19] if we overload the servers are they worse off than now? [16:00:32] can they cause any other sort of issue? [16:00:38] apergos: yes, that would be worse [16:00:46] ah it figures [16:00:53] at the moment, all the tile serving is fine, which is the main use case [16:01:08] right [16:01:15] do we actually think 3 servers will be overloaded? [16:01:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [16:01:23] the risk is minor [16:01:26] because that means maybe we need to expand these clusters [16:01:34] mateus will have a proper deploy with this single fix in a few minutes [16:01:41] it should be an ok situation in design terms: have 1/4 hardware machines dead in a site, and depooling 1/2 redundant sites [16:01:44] as I said, I think it is more important to create a ticket and communicate at this point [16:01:53] a few minutes? worth the wait [16:01:56] ok let's wait on the fix then [16:02:03] ticket is T217898 [16:02:06] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 [16:02:06] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi All yours [16:02:07] gehel: thanks [16:02:49] waiting sounds fine, we have already had that for 24h or so i guess [16:03:01] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=maps&var-instance=All the maps clusters do look like they run reasonably hot on CPU some fraction of the time [16:03:11] so I agree that overload is a possible issue [16:03:23] I'm going to ask a few postmortem-y questions. [16:03:44] - are there any blackbox probes done against the HTTP API by icinga or the like?
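The same version census could also be taken over HTTP rather than from package.json, using the x-powered-by header quoted above; a sketch (the host list and the kartotherian port, assumed here to be 6533, are unverified assumptions):

  for h in maps100{1..4}.eqiad.wmnet maps200{1..4}.codfw.wmnet; do
    printf '%s: ' "$h"
    # -sD- dumps the response headers to stdout; the body is discarded
    curl -sD- -o /dev/null "http://$h:6533/geoshape?getgeojson=1&ids=Q3145754" \
      | grep -i '^x-powered-by' || echo '(no x-powered-by header)'
  done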
[16:03:45] mmm [16:04:34] bblack: it's just the time to package kartotherian and deploy it [16:04:54] - why is there not an HTTP response code breakdown on https://grafana.wikimedia.org/d/000000030/service-kartotherian?refresh=5m&orgId=1 [16:05:20] cdanis: I don't think it is yet the time while the issue is ongoing [16:05:30] I don't need answers now jynus [16:05:44] I just want to get these out of my head into somewhere less ephemeral ;) [16:05:53] but if you'd rather that be a local text editor that's fine [16:06:00] cdanis: use the ticket :-) [16:06:15] I'm adding some notes to T217898, just to be sure to not forget those [16:06:20] PROBLEM - Device not healthy -SMART- on phab1002 is CRITICAL: cluster=misc device=sdc instance=phab1002:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1002&var-datasource=eqiad+prometheus/ops [16:06:25] cdanis: feel free to add more things on the ticket [16:06:34] we can all use that pastebin and later embed the pastebin into the ticket [16:08:54] mutante: which pastebin? (I lost track) [16:09:05] https://phabricator.wikimedia.org/P8173 [16:09:13] thanks! [16:11:03] once the issue is fixed in production, does anyone else think it would be a good idea to open a proper incident report under https://wikitech.wikimedia.org/wiki/Incident_documentation, construct a timeline, and get an idea of the number of affected requests? [16:11:45] yes [16:11:48] sounds like an incident, ack [16:12:21] Oh yes! [16:12:39] I can do a first pass on the incident report once the fire is under control [16:15:21] (03CR) 10Anomie: "1.33.0-wmf.20 is now on all wikis. This should be safe to deploy." [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [16:17:17] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@d71df87] (stretch): UBN geoshapes services (T217898) [16:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:21] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 [16:19:17] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@d71df87] (stretch): UBN geoshapes services (T217898) (duration: 02m 00s) [16:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:42] https://fr.wikipedia.org/wiki/H%C3%B4tel_de_Blossac works for me [16:21:46] looks like you fixed it [16:21:52] mateusbs17: thanks a lot!!! [16:21:58] (03CR) 10Krinkle: [C: 03+1] "LGTM. I'm unable to find docs for the 'mail' program used here ( maybe mailx? 
https://linux.die.net/man/1/mail ), or a -a param, but I ass" [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [16:22:13] mateusbs17: confirmed working, thanks [16:22:55] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@cc302de] (stretch): UBN geoshapes services on maps2004.codfw.wmnet (T217898) [16:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:58] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 [16:23:19] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@cc302de] (stretch): UBN geoshapes services on maps2004.codfw.wmnet (T217898) (duration: 00m 24s) [16:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:34] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) That for this, it's appreciated. Note that we haven't still decided over which time window the availability will be calculated,... [16:26:02] (03CR) 10Muehlenhoff: "/usr/bin/mail uses the alternatives mechanism but is ultimately provided by bsd-mailx which is installed as a dependency by Icinga." [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [16:26:08] maps, or at least the one, have returned for me. nice! [16:26:23] I will add more details on T217898, but the problem now seems that's fixed. [16:27:53] (03CR) 10Dzahn: "yea, that's mailx but still different. -a is adding a header not attachment." [puppet] - 10https://gerrit.wikimedia.org/r/494464 (owner: 10Volans) [16:27:53] I've created a stub incident report (https://wikitech.wikimedia.org/wiki/Incident_documentation/20190308-wdqs) I'll add more content, but feel free to already add whatever you have. [16:30:17] (03PS1) 10Alexandros Kosiaris: cxserver: Add kademlia support [deployment-charts] - 10https://gerrit.wikimedia.org/r/495252 [16:30:39] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Works in testing. 
Merging, we can always improve on it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/492301 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [16:31:23] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Worked as well in minikube, Production is not so easy as we lack a coredns component, but this will have to do for now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/495252 (owner: 10Alexandros Kosiaris) [16:32:23] (03PS1) 10Mforns: Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) [16:36:32] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] openstack - Convert cron jobs to systemd timers (0316 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [16:39:29] (03PS1) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:39:41] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:40:17] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10bd808) >>! In T216712#5010593, @MoritzMuehlenhoff wrote: > Is php-xdebug (and php-tideways) used in the stretch-based Toolforge PHP... [16:41:46] (03PS2) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:41:59] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:43:33] mateusbs17: I still see maps1004.eqiad returning a 400 (and on kartotherian 1.0.0, not 1.0.1) [16:43:34] (03PS4) 10Dzahn: openstack: monitoring: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/495205 [16:45:25] cdanis, mateusbs17: I confirm that package.json on maps1004 is still 1.0.0 [16:46:08] (03CR) 10Matthias Mullie: External api url for wbsearchentities api call on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:46:59] (03PS3) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:47:10] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:47:13] (03CR) 10Cparle: External api url for wbsearchentities api call on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:47:51] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@acf2694] (stretch): UBN geoshapes services on maps1004.eqiad.wmnet (T217898) [16:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:54] T217898: Geoshape service fails to deliver geoshapes from OSM - https://phabricator.wikimedia.org/T217898 
[16:47:59] (03CR) 10Dzahn: [C: 03+2] "thank you too for the review" [puppet] - 10https://gerrit.wikimedia.org/r/495205 (owner: 10Dzahn) [16:48:12] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@acf2694] (stretch): UBN geoshapes services on maps1004.eqiad.wmnet (T217898) (duration: 00m 22s) [16:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:36] looks good now, thanks :) [16:49:11] cdanis: Thanks for finding this one [16:50:58] (03PS4) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:51:56] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:53:43] (03CR) 10Reedy: [C: 04-1] "needs moar tabs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:53:48] (03PS5) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [16:54:57] (03CR) 10jerkins-bot: [V: 04-1] External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle) [16:55:47] (03PS6) 10Cparle: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) [17:02:32] (03PS1) 10Bstorm: toolforge: reformatting promethus grid engine collector [puppet] - 10https://gerrit.wikimedia.org/r/495261 [17:02:34] (03PS1) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) [17:02:45] nice dropoff of 400s in upload: [17:02:48] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&panelId=2&fullscreen&var-site=All&var-cache_type=upload&var-status_type=4&from=now-3h&to=now [17:04:11] bblack: so how would one do an accurate census of 400s that were kartotherian? [17:05:01] (03CR) 10Bstorm: "Note that I included a commit that reformats with black so that doesn't screw up the diff. I think I did this right, and it runs. I can " [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm) [17:06:43] RECOVERY - Device not healthy -SMART- on phab1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1002&var-datasource=eqiad+prometheus/ops [17:14:00] (03PS1) 10Paladox: Update image-diff plugin [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495264 [17:14:10] cdanis: I don't think we have that "live", but we could get it near-realtime from analytics stuff [17:14:31] (03CR) 10Paladox: [V: 03+2 C: 03+2] "This should fix it so that the plugin can be installed now." 
[software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495264 (owner: 10Paladox) [17:15:10] (03CR) 10Ottomata: [C: 03+2] Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) (owner: 10Mforns) [17:15:17] (03PS2) 10Ottomata: Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) (owner: 10Mforns) [17:15:19] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Reenable reportupdater jobs after analytics replica shard fix [puppet] - 10https://gerrit.wikimedia.org/r/495253 (https://phabricator.wikimedia.org/T215289) (owner: 10Mforns) [17:18:04] (03PS1) 10Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495265 [17:18:45] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:18:49] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:18:49] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:19:15] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:19:21] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:20:36] 10Operations, 10netops: Increase network capacity (2018-19 Q3 Goal) - https://phabricator.wikimedia.org/T213122 (10ayounsi) [17:20:43] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) 05Open→03Resolved the redundancy testing is outside the scope of the goal, so everything needed here is done. 
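One near-realtime census of those 400s from the analytics side would be a Hive query against the webrequest data; a sketch, assuming the wmf.webrequest table and field names as documented on Wikitech, with the path filter a guess at the affected endpoints:

  hive -e "
    SELECT hour, COUNT(*) AS failed
    FROM wmf.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2019 AND month = 3 AND day = 8
      AND uri_host = 'maps.wikimedia.org'
      AND uri_path LIKE '/geo%'   -- geoshape + geoline endpoints
      AND http_status = '400'
    GROUP BY hour
    ORDER BY hour;
  "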
[17:21:49] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [17:22:37] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [17:22:51] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [17:24:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [17:26:30] 10Operations, 10decommission, 10Patch-For-Review, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki) [17:26:34] what's up with aqs/druid? [17:26:49] (03CR) 10Bstorm: "Cherry-picked this into toolsbeta and it works great. Merging." [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [17:26:53] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [17:26:57] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [17:26:59] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:26:59] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [17:27:10] (03PS2) 10Bstorm: toolforge: switch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [17:27:13] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [17:27:21] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [17:27:29] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [17:27:41] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [17:27:53] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [17:28:51] (03PS1) 10CRusnov: Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/495267 [17:28:58] (03CR) 10Bstorm: [C: 03+2] toolforge: switch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [17:30:00] !log decom in progress for rdb100[123478] via T209181 [17:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:03] T209181: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 [17:30:06] who can help with a possible Android app unique bug/vandalism ? [17:30:25] NotASpy: is this the same issue as reported at https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Incidents#dick_picks_on_the_Wikipedia_app ? [17:30:43] it is, and I've confirmed with a screenshot if needed [17:31:06] yeah, I can reproduce on desktop. 
[17:31:16] done the usual sweep for template vandalism, Wikidata vandalism etc, and can't find any obvious candidates
[17:31:17] (03CR) 10Muehlenhoff: "The change is fine as-is, but will need some future fixup when the exporter gets built for buster, see https://phabricator.wikimedia.org/T" [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/495228 (owner: 10Arturo Borrero Gonzalez)
[17:33:28] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10Muehlenhoff) >>! In T216521#5011235, @MSantos wrote: > I tested Mich...
[17:33:41] cdanis: actually, it has been figured out. It's a caching issue from some vandalism about 12 hours ago. Purging is fixing it.
[17:47:04] (03CR) 10BryanDavis: [C: 03+1] "Seems worth trying as a temporary measure to reduce load on the LDAP directories while we continue to dig for deeper solutions." [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni)
[17:47:11] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:47:52] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH)
[17:49:11] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (10debt) 05Open→03Resolved
[17:51:19] Hey all! Wondering if anyone is available to deploy a quick labs config change (or if we can skip the deploy to production if it's labs-only now)
[17:51:23] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (10debt) 05Open→03Resolved
[17:51:25] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (10debt)
[17:51:29] https://gerrit.wikimedia.org/r/495256
[17:52:13] 10Operations, 10Maps (Kartotherian), 10Wikimedia-Incident: Create test in spec.yaml for the kartotherian / geoshape service - https://phabricator.wikimedia.org/T217910 (10Gehel)
[17:52:26] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions (this is happening on cloud) - https://phabricator.wikimedia.org/T76090 (10debt) 05Open→03Resolved
[18:00:43] (03PS4) 10Paladox: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865
[18:03:13] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH) NETWORK PORT INFO rdb1001:asw2-c-eqiad:ge-4/0/9 rdb1002:asw2-c-eqiad:ge-7/0/18 rdb1003:asw-a-eqiad:ge-4/0/43 rdb1004:asw2-b-eqia...
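The purge fix described at 17:33:41 uses MediaWiki's standard cache invalidation: a purge discards the cached rendering of a page so it is regenerated from the current wikitext. A minimal sketch via the action API (the wiki URL and page title are illustrative assumptions; on-wiki, appending ?action=purge to a page URL does the same thing):

    import urllib.parse
    import urllib.request

    # Minimal sketch of the standard action=purge API call; the wiki and
    # page title here are illustrative assumptions.
    api = "https://en.wikipedia.org/w/api.php"
    data = urllib.parse.urlencode({
        "action": "purge",
        "titles": "Example_page",
        "format": "json",
    }).encode()

    # The purge module requires a POST request.
    req = urllib.request.Request(api, data=data)
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.read().decode())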
[18:03:44] (03CR) 10BryanDavis: gridengine: Add prometheus monitor for queue/host health (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[18:03:51] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1001.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:04] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1002.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:16] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1003.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:28] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1004.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:41] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1007.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:04:54] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb1008.eqiad.wmnet and performed the following actions: - Revoked Pu...
[18:05:20] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10MSantos) We have the epic with more general information that needs t...
[18:05:33] (03CR) 10Bstorm: gridengine: Add prometheus monitor for queue/host health (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[18:06:49] (03PS1) 10RobH: decom rdb100[123478].eqiad.wmnet dns entries [dns] - 10https://gerrit.wikimedia.org/r/495274 (https://phabricator.wikimedia.org/T209181)
[18:08:49] (03CR) 10RobH: [C: 03+2] decom rdb100[123478].eqiad.wmnet dns entries [dns] - 10https://gerrit.wikimedia.org/r/495274 (https://phabricator.wikimedia.org/T209181) (owner: 10RobH)
[18:09:49] (03PS1) 10RobH: decom of rdb100[123478] [puppet] - 10https://gerrit.wikimedia.org/r/495275 (https://phabricator.wikimedia.org/T209181)
[18:11:05] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH)
[18:11:19] (03CR) 10RobH: [C: 03+2] decom of rdb100[123478] [puppet] - 10https://gerrit.wikimedia.org/r/495275 (https://phabricator.wikimedia.org/T209181) (owner: 10RobH)
[18:12:09] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH) a:03Cmjohnson
[18:12:47] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:14:12] bblack: re: aqs/druid - the new edit metrics were published in druid, and caused a series of cache misses in there that bubbled up in aqs as well (the alerts are all for edit metrics afaics)
[18:14:20] in theory it shouldn't have happened
[18:14:29] (just seen the alerts)
[18:15:30] ok thanks :)
[18:16:46] I am not super happy about it, will do a proper investigation on monday :(
[18:17:41] (03CR) 10Reedy: [C: 03+2] External api url for wbsearchentities api call on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:17:45] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Wikimedia-Incident: Create test in spec.yaml for the kartotherian / geoshape service - https://phabricator.wikimedia.org/T217910 (10MSantos)
[18:17:46] (03CR) 10Reedy: [C: 03+2] "(labs!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:18:52] (03Merged) 10jenkins-bot: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:19:15] (03CR) 10jenkins-bot: External api url for wbsearchentities api call on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495256 (https://phabricator.wikimedia.org/T217157) (owner: 10Cparle)
[18:20:31] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta only (duration: 00m 49s)
[18:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:05] (03PS3) 10Fomafix: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/469262
[18:23:39] (03PS2) 10Fomafix: Add redirects from 'lzh' to 'zh-classical' [puppet] - 10https://gerrit.wikimedia.org/r/481533 (https://phabricator.wikimedia.org/T167513)
[18:24:02] (03PS8) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[18:25:09] (03PS2) 10Fomafix: Add 'lzh' as alias for 'zh-classical' [dns] - 10https://gerrit.wikimedia.org/r/481532 (https://phabricator.wikimedia.org/T167513)
[18:25:56] (03PS2) 10Fomafix: Add 'sgs' as alias for 'bat-smg' [dns] - 10https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830)
[18:26:56] (03PS2) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845)
[18:27:20] (03CR) 10Bstorm: gridengine: Add prometheus monitor for queue/host health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[19:03:58] (03PS4) 10Paladox: WIP: Update gerrit to 2.16.6 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012
[19:17:39] (03PS1) 10MarkTraceur: Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157)
[19:20:37] (03CR) 10Reedy: [C: 03+2] Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157) (owner: 10MarkTraceur)
[19:21:08] !log installing php updates on netmon1002
[19:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:06] (03Merged) 10jenkins-bot: Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157) (owner: 10MarkTraceur)
[19:25:31] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta only (duration: 00m 50s)
[19:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:31] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:29:04] (03CR) 10jenkins-bot: Fix typo in config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495282 (https://phabricator.wikimedia.org/T217157) (owner: 10MarkTraceur)
[19:42:41] (03PS1) 10Bstorm: osmdb: Switch the replica to the VM that needs to become the master [puppet] - 10https://gerrit.wikimedia.org/r/495290 (https://phabricator.wikimedia.org/T193264)
[19:46:31] (03PS1) 10Legoktm: php72: Switch from thirdparty/php72 to component/php72 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/495291
[19:46:42] (03PS2) 10Legoktm: php72: Switch from thirdparty/php72 to component/php72 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/495291 (https://phabricator.wikimedia.org/T216712)
[19:51:21] PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:52:21] RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 81541 bytes in 0.292 second response time
[19:53:23] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[19:55:27] (03CR) 10BryanDavis: gridengine: Add prometheus monitor for queue/host health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[20:12:33] (03CR) 10Bstorm: gridengine: Add prometheus monitor for queue/host health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[20:14:46] (03PS2) 10Bstorm: toolforge: reformatting promethus grid engine collector [puppet] - 10https://gerrit.wikimedia.org/r/495261
[20:14:48] (03PS3) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845)
[20:32:11] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:33:11] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2
[20:35:33] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2
[20:35:45] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:35:46] ^ hm
[20:35:47] ?
[20:36:25] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore
[20:36:50] oom-killer
[20:41:09] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore
[20:41:27] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2
[20:49:57] (03CR) 10BryanDavis: [C: 03+1] gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[21:03:17] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) We've been around this topic a number of times, so I'll write a summary where we're at so far. I'm sorry it's going...
[21:05:26] (03PS4) 10Bstorm: gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845)
[21:11:21] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > disable cache busting by default, enable it internally This would immediately break all external updaters. They'd...
[21:11:25] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > disable cache busting by default, enable it internally This would immediately break all external updaters. They'd...
[21:47:46] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > don't do cache busting on events older than X This however gave me an idea. If we kept a map of all latest revisi...
[22:27:53] (03CR) 10Bstorm: [C: 03+2] gridengine: Add prometheus monitor for queue/host health [puppet] - 10https://gerrit.wikimedia.org/r/495262 (https://phabricator.wikimedia.org/T215845) (owner: 10Bstorm)
[22:28:07] (03CR) 10Bstorm: [C: 03+2] toolforge: reformatting promethus grid engine collector [puppet] - 10https://gerrit.wikimedia.org/r/495261 (owner: 10Bstorm)
[23:08:50] (03PS1) 10MaxSem: Remove $wgMediaInTargetLanguage, matches the MW default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495408
[23:21:23] (03CR) 10Nuria: [C: 03+1] "Adding higher puppet beings so they can +2" [puppet] - 10https://gerrit.wikimedia.org/r/484994 (https://phabricator.wikimedia.org/T209857) (owner: 10Gilles)
[23:49:10] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10bd808)
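Smalyshev's last (truncated) comment on T217897 above floats keeping a map of the latest revision seen per entity, so the updater only busts caches for events that actually advance an entity's state. A speculative Python sketch of that idea, purely as one reading of the truncated comment and not the actual wdqs-updater implementation:

    # Speculative sketch: track the newest revision seen per entity and only
    # cache-bust for events that advance it. Names and types are assumptions.
    latest_seen: dict[str, int] = {}

    def should_cache_bust(entity_id: str, revision: int) -> bool:
        """True only for events carrying a revision newer than any seen so far."""
        if revision <= latest_seen.get(entity_id, -1):
            return False  # stale or duplicate event: a plain, cacheable fetch suffices
        latest_seen[entity_id] = revision
        return True

    # Duplicate and out-of-order events skip the cache-busting fetch:
    for eid, rev in [("Q42", 100), ("Q42", 100), ("Q42", 99), ("Q42", 101)]:
        print(eid, rev, "bust" if should_cache_bust(eid, rev) else "no-bust")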