[00:00:05] <jouncebot>	 Deploy window No Deploys - WMF US Staff holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190619T0000)
[01:50:18] <wikibugs>	 10Operations, 10Wikimedia Australia, 10Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (10StevenCrossin) Hi, @Dzahn can you help with this by any chance?
[04:09:43] <wikibugs>	 (03Abandoned) 10ArielGlenn: testing paging settings for labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/517693 (owner: 10ArielGlenn)
[04:13:15] <wikibugs>	 (03PS3) 10ArielGlenn: phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804)
[04:14:19] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804) (owner: 10ArielGlenn)
[04:16:19] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852)
[04:17:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:18:07] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:18:19] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10ArielGlenn) I'll leave this ticket open until we see that the next month's report has shown up.
[04:18:22] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:19:53] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 T224852 (duration: 00m 57s)
[04:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:58] <stashbot>	 T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[04:20:01] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[04:20:02] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[04:20:02] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver finished
[04:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:07] <wikibugs>	 (03PS4) 10Marostegui: db-eqiad.php: Set s4 in read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852)
[04:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:34] <onimisionipe>	 !log depooling maps1002 for reimaging into new partition scheme - T224395
[04:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:39] <stashbot>	 T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[04:23:39] <wikibugs>	 (03PS4) 10Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852)
[04:24:07] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[04:24:09] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[04:24:09] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver finished
[04:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:35] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[04:25:36] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[04:25:36] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver finished
[04:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:46] <wikibugs>	 (03PS1) 10Bmansurov: Labs: enable surveys for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517789 (https://phabricator.wikimedia.org/T225819)
[04:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Labs: enable surveys for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517789 (https://phabricator.wikimedia.org/T225819) (owner: 10Bmansurov)
[04:28:19] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v1/page/{language}/{title}{/revision} (Fetch enwiki protected pa
[04:28:19] <icinga-wm>	 Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Translate enwiki protected page) is CRITICAL: Test Translate enwiki protected page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[04:28:22] <marostegui>	 !log Starting pre-steps for the s4 failover that will happen at 05:00 UTC - T224852
[04:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:27] <stashbot>	 T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[04:28:34] <marostegui>	 kart_: Is that your deployment ^
[04:30:29] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v1/page/{language}/{title}{/revision} (Fetch enwiki protected pa
[04:30:29] <icinga-wm>	 Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Translate enwiki protected page) is CRITICAL: Test Translate enwiki protected page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[04:31:03] <marostegui>	 kart_: ^
[04:31:50] <kart_>	 marostegui: looking..
[04:32:09] <marostegui>	 kart_: thanks - also, we have a maintenance window requested in 30 minutes to failover s4 master
[04:32:13] <marostegui>	 (requires read only)
[04:32:48] <kart_>	 marostegui: oh. right.
[04:34:51] <jynus>	 is that expected, or should it be reverted?
[04:35:13] <kart_>	 jynus: if more is happening, I can revert. Should not be..
[04:36:18] <kart_>	 waiting for some minutes. 
[04:36:33] <jynus>	 https://grafana.wikimedia.org/d/F7rttgqmz/cxserver?refresh=1m&panelId=15&fullscreen&orgId=1&from=1560915388343&to=1560918988343&var-dc=eqiad%20prometheus%2Fk8s&var-service=cxserver
[04:37:09] <kart_>	 OK. Seems broken.
[04:37:16] <kart_>	 Reverting..
[04:39:57] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[04:39:58] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[04:39:58] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver finished
[04:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:16] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[04:40:17] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[04:40:17] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver finished
[04:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:38] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[04:40:39] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[04:40:39] <logmsgbot>	 !log kartik@deploy1001 scap-helm cxserver finished
[04:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:07] <wikibugs>	 (03CR) 10Marostegui: mariadb: Promote db1081 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:41:14] <wikibugs>	 (03PS4) 10Marostegui: mariadb: Promote db1081 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852)
[04:41:27] <wikibugs>	 (03CR) 10Marostegui: db-eqiad.php: Set s4 in read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:41:27] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:41:36] <wikibugs>	 (03CR) 10Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:41:49] <wikibugs>	 (03CR) 10Marostegui: wmnet: Update s4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:41:53] <wikibugs>	 (03PS3) 10Marostegui: wmnet: Update s4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852)
[04:42:01] <kart_>	 marostegui: jynus reverted.
[04:42:05] <marostegui>	 thanks
[04:42:11] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:47:12] <jynus>	 kart_: "MT processing error for: en > qqq. Error: invalid distance too far back    at Zlib.zlibOnError [as onerror]"
[04:47:39] <jynus>	 (not ongoing anymore)
[04:47:46] <kart_>	 jynus: hit by https://github.com/nodejs/node/issues/22839 it seems.
[04:47:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1081 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:48:42] <kart_>	 jynus: zlib needs an update on the host in this case, but let's see how we can do it with docker.
[04:49:16] <jynus>	 kart_: sorry, we are moving with our own deployment atm
[04:50:12] <kart_>	 jynus: yeah. Nothing needs to be done right now, no emergency.
[04:51:45] <marostegui>	 We are going to take over puppet and mediawiki deployment for the s4 failover, if you need to deploy please coordinate with us. I will communicate once it is all done and deployments can happen normally again
[04:52:01] <jynus>	 marostegui: I can see the new topology, no errors on logs
[04:52:07] <marostegui>	 yep :)
[04:53:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Set s4 in read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:53:48] <marostegui>	 ^ will not merge on deploy1001 yet 
[04:53:56] <marostegui>	 just +2 so I can create the revert and all that
[04:54:25] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Set s4 in read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:54:58] <jynus>	 ok
[04:54:59] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517790
[04:56:21] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Set s4 in read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:57:07] <wikibugs>	 (03PS5) 10Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852)
[04:58:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:58:36] <jynus>	 I see the banner
[04:58:48] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:58:59] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517790
[04:59:02] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[04:59:05] <marostegui>	 jynus: 1 minute to go \o/
[04:59:24] <jynus>	 see it also on watchlist
[04:59:47] <jynus>	 you lead? I check?
[04:59:50] <marostegui>	 yep
[05:00:04] <jouncebot>	 marostegui and jynus: #bothumor My software never has bugs. It just develops random features. Rise for s4 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190619T0500).
[05:00:05] <marostegui>	 jynus: ready?
[05:00:10] <jynus>	 yes
[05:00:14] <marostegui>	 let's go then
[05:00:16] <marostegui>	 !log Starting s4 failover from db1068 to db1081 - T224852
[05:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:21] <stashbot>	 T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:01:02] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s4 on read-only T224852  (duration: 00m 34s)
[05:01:03] <marostegui>	 we are on RO
[05:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517790 (owner: 10Marostegui)
[05:01:38] <jynus>	 confirmed
[05:01:44] <marostegui>	 failover done
[05:02:00] <marostegui>	 replication looks good
[05:02:03] <jynus>	 confirmed with tendril
[05:02:04] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517790 (owner: 10Marostegui)
[05:02:18] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517790 (owner: 10Marostegui)
[05:02:24] <jynus>	 don't see errors so far
[05:02:25] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover s4 master eqiad from db1068 to db1081 T224852  (duration: 00m 33s)
[05:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:37] <marostegui>	 removing read only
[05:03:05] <jynus>	 still no errors on kibana
[05:03:20] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove s4 read only T224852  (duration: 00m 33s)
[05:03:20] <marostegui>	 we are no longer in RO
[05:03:23] <marostegui>	 checking
[05:03:25] <jynus>	 I can see that on msg
[05:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:54] <jynus>	 I can edit
[05:03:58] <marostegui>	 me too
[05:04:16] <jynus>	 I can see rcs
[05:04:31] <jynus>	 the issue here is potential load/replication issues
[05:04:42] <jynus>	 (those could arise later on)
[05:05:05] <marostegui>	 yeah
[05:05:07] <marostegui>	 so far so good
[05:05:16] <jynus>	 but monitoring looking good so far
[05:05:23] <jynus>	 strange to see no errors?
[05:05:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Update s4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui)
[05:05:37] <marostegui>	 it was fast
[05:05:53] <jynus>	 yeah, but jobqueue yada yada
[05:05:56] <marostegui>	 yeah
[05:06:02] <jynus>	 plus the whole opcache thing
[05:06:05] <marostegui>	 Maybe it is fixed for good? :)
[05:06:36] <jynus>	 https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php&2 looking good
[05:06:58] <jynus>	 no traffic on old master
[05:07:19] <jynus>	 traffic on new one
[05:07:23] <marostegui>	 yeah
[05:07:27] <jynus>	 it is just too good to be true
[05:07:27] <marostegui>	 so far everything looks good
[05:07:45] <marostegui>	 haha I know
[05:07:48] <marostegui>	 I have the same feeling
[05:07:52] <marostegui>	 I am double checking all the steps
[05:08:06] <jynus>	 also no gtid chronology protector complaints?
[05:08:18] <marostegui>	 those normally come a bit later
[05:08:42] <jynus>	 no wikibase ones? although I am not sure how much it is being used for commons proportionally
[05:10:07] <jynus>	 lots of [{exception_id}] {exception_url} ErrorException from line 125 of /srv/mediawiki/php-1.34.0-wmf.10/includes/api/ApiQueryQueryPage.php: PHP Notice: Undefined property: stdClass::$value
[05:10:29] <jynus>	 but on frwikisource
[05:10:40] <marostegui>	 yeah, those have been there for days
[05:10:51] <marostegui>	 I am going to "release" the repos "locks"
[05:11:21] <marostegui>	 Failover was done, mediawiki and puppet deployments can happen as usual
[05:11:42] <jynus>	 ok, now I can see some exceptions
[05:11:48] <JJMC89[m]>	 just did a wikibase edit on Commons for ya
[05:11:53] <jynus>	 very low, but the expected number
[05:11:59] <jynus>	 JJMC89[m]: did it work?
[05:12:04] <marostegui>	 JJMC89[m]: worked?
[05:12:10] <JJMC89[m]>	 Yes
[05:12:14] <jynus>	 great!
[05:12:15] <jynus>	 thanks
[05:12:15] <marostegui>	 \o/
[05:12:43] <jynus>	 so 83 exceptions
[05:12:56] <jynus>	 [{exception_id}] {exception_url} Wikimedia\Rdbms\DBTransactionError from line 268 of /srv/mediawiki/php-1.34.0-wmf.10/includes/libs/rdbms/lbfactory/LBFactory.php: MediaWiki::restInPeace: transaction round 'LinksUpdate::doUpdate' still running.
[05:13:31] <jynus>	 linksupdate would be a minor issue
[05:13:37] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:13:51] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:08] <jynus>	 so everything seems fine
[05:14:14] <jynus>	 but let's keep monitoring
[05:14:18] <marostegui>	 yeah
[05:14:26] <marostegui>	 I am doing the rest of tasks and keeping an eye too
[05:14:50] <wikibugs>	 10Operations, 10DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (10Marostegui) This happened successfully. Read only times (UTC):  Start: 05:01:02 Stop: 05:03:20 Total read only time: 2:18 minutes
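The read-only total quoted in the ticket comment above can be checked from the two timestamps. This is only a minimal sketch: the helper name `read_only_duration` is hypothetical (not part of any Wikimedia tooling), and it assumes both timestamps fall on the same UTC day.

```python
from datetime import datetime


def read_only_duration(start: str, stop: str) -> str:
    """Return the elapsed time between two HH:MM:SS stamps as M:SS.

    Assumes both timestamps are on the same day (no midnight rollover).
    """
    fmt = "%H:%M:%S"
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}:{seconds:02d}"


# Start and stop times as reported in the T224852 comment.
print(read_only_duration("05:01:02", "05:03:20"))  # 2:18
```

The result agrees with the "2:18 minutes" reported for the s4 read-only window.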
[05:15:07] <wikibugs>	 10Operations, 10DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (10Marostegui)
[05:19:05] <jynus>	 I am still suspicious, because there is normally a non-trivial amount of background renderinf onon files and videos on commons
[05:19:13] <jynus>	 *rendering on
[05:19:28] <marostegui>	 i am checking logstash as much as I can and so far so good
[05:19:38] <jynus>	 I know
[05:21:08] <jynus>	 I don't see many mass uploads during read only: https://commons.wikimedia.org/wiki/Special:NewFiles
[05:21:41] <marostegui>	 yeah, maybe the long heads up we gave allowed power users to plan for their massive uploads
[05:23:22] <wikibugs>	 10Operations, 10DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (10Marostegui) 05Open→03Resolved So far everything looks good, so closing this.
[05:27:26] <jijiki>	 :D
[05:27:50] <wikibugs>	 (03PS1) 10Marostegui: db1068: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/517792 (https://phabricator.wikimedia.org/T217396)
[05:29:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1068: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/517792 (https://phabricator.wikimedia.org/T217396) (owner: 10Marostegui)
[05:32:20] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: db1138 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517793
[05:33:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: db1138 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517793 (owner: 10Marostegui)
[05:34:08] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: db1138 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517793 (owner: 10Marostegui)
[05:34:22] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: db1138 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517793 (owner: 10Marostegui)
[05:35:31] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1138 status (duration: 00m 55s)
[05:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:43] <wikibugs>	 (03PS1) 10Jcrespo: WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794
[05:36:45] <wikibugs>	 (03PS1) 10Jcrespo: CuminExecution: Update namespace so it works without being deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517795
[05:36:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo)
[05:36:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] CuminExecution: Update namespace so it works without being deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517795 (owner: 10Jcrespo)
[05:37:21] <marostegui>	 !log Upgrade db1068 (old s4 master) to 10.1.39
[05:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:24] <wikibugs>	 (03PS2) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[05:46:03] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682)
[05:47:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui)
[05:48:39] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui)
[05:48:53] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui)
[05:50:00] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1135 T222682 (duration: 00m 56s)
[05:50:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:05] <stashbot>	 T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[05:51:17] <wikibugs>	 (03CR) 10ArielGlenn: "Note this is still very much a WIP and likely will eat all of your wikidata dumps for lunch." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn)
[05:58:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:01:04] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:06:12] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:06:32] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:34] <wikibugs>	 (03PS1) 10Marostegui: db1112: Move to s3 [puppet] - 10https://gerrit.wikimedia.org/r/517799 (https://phabricator.wikimedia.org/T225981)
[06:35:26] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/16996/" [puppet] - 10https://gerrit.wikimedia.org/r/517799 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[06:35:42] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1112: Move to s3 [puppet] - 10https://gerrit.wikimedia.org/r/517799 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[06:41:09] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981)
[06:42:45] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Nikerabbit) @Aklapper Do you mean the "You have been unsubscribed" email or one of the bounces? I don't see how the former would help, but I can share th...
[06:46:31] <wikibugs>	 (03PS9) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[06:46:42] <wikibugs>	 (03CR) 10Mathew.onipe: Add maps reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[06:47:47] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981)
[06:50:24] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[06:51:20] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[06:51:35] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[06:53:17] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 T225981 (duration: 01m 06s)
[06:53:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:24] <stashbot>	 T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981
[06:53:32] <wikibugs>	 10Operations, 10Annual-Report, 10serviceops: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (10jijiki) p:05Triage→03High
[06:53:56] <wikibugs>	 10Operations, 10Wikimedia Australia, 10Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (10jijiki) p:05Triage→03Normal
[06:57:35] <marostegui>	 !log Stop MySQL on db1077 to transfer its data to db1112 - T225981
[06:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:16] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10KartikMistry) We recently tried to upgrade to nodejs10 for cxserver but it seems zlib 1.2.11 is required.  Example error: `...
[07:01:59] <XioNoX>	 !log jnt push to ulsfo, remove old protect-old-lvs-servers term + update syslog target T224128
[07:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:04] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[07:02:22] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) Are you using component/node10? This should be fixed already, see https://phabricator.wikimedia.org/T215...
[07:06:33] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:40] <wikibugs>	 (03CR) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn)
[07:08:53] <wikibugs>	 (03PS3) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[07:09:15] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:09:55] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:10:57] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 768.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:11:26] <marostegui>	 ^ expected
[07:12:05] <marostegui>	 !log s3 will be lagging on labsdb hosts due to maintenance on db1077 - T225981
[07:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:11] <stashbot>	 T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981
[07:13:22] <XioNoX>	 !log jnt push to eqsin, remove old protect-old-lvs-servers term + update syslog target T224128
[07:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:27] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[07:17:24] <XioNoX>	 !log jnt push to eqord, remove old protect-old-lvs-servers term + update syslog target T224128
[07:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:30] <XioNoX>	 !log jnt push to eqdfw, remove old protect-old-lvs-servers term + update syslog target T224128
[07:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:35] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[07:19:39] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:51] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Joe) >>! In T210704#5267695, @MoritzMuehlenhoff wrote: > Are you using component/node10? This should be fixed already, see...
[07:25:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: citoid, mathoid, termbox: Switch GC metric to microseconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/517803 (https://phabricator.wikimedia.org/T220709)
[07:28:08] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Joe) >>! In T210704#5267739, @Joe wrote: >>>! In T210704#5267695, @MoritzMuehlenhoff wrote: >> Are you using component/node...
[07:28:35] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:30:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] citoid, mathoid, termbox: Switch GC metric to microseconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/517803 (https://phabricator.wikimedia.org/T220709) (owner: 10Alexandros Kosiaris)
[07:32:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[07:34:42] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: eqiad,codfw]
[07:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:15] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging]
[07:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:29] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:37:31] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:38:01] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:38:49] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add forgotten citoid, mathoid, termbox helm packages [deployment-charts] - 10https://gerrit.wikimedia.org/r/517804
[07:39:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add forgotten citoid, mathoid, termbox helm packages [deployment-charts] - 10https://gerrit.wikimedia.org/r/517804 (owner: 10Alexandros Kosiaris)
[07:39:10] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10KartikMistry) >>! In T210704#5267750, @Joe wrote: > To correct myself: we already use that component. I'm nonetheless creat...
[07:43:05] <wikibugs>	 (03PS1) 10Matthias Mullie: [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805
[07:46:13] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging]
[07:46:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:08] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "The bits all look right to me. I'd like someone else familiar with docroot setups to look at it though." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516055 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza)
[07:48:37] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:52] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725)
[07:50:18] <yannf>	 Internal error: [XQnonApAMF0AAH55EVgAAACT] 2019-06-19 07:47:40: Fatal exception of type "BadMethodCallException" on Commons
[07:50:26] <yannf>	 is this known?
[07:51:17] <yannf>	 it seems the issue was temporary
[07:51:25] <marostegui>	 We did a failover on commons at 05:00 UTC
[07:51:25] <moritzm>	 !log installing vim security updates on stretch
[07:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:31] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725)
[07:52:14] <marostegui>	 yannf: looks like there is just one error, so probably just temporary indeed
[07:53:00] <marostegui>	 yannf: from what i can see it has happened in the last 24h a few times
[07:53:01] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10ema) >>! In T225998#5264757, @Gilles wrote: > loadEventEnd seems to have regressed around the time the change was deployed  I'm gonn...
[07:53:44] <marostegui>	 I am filtering for BadMethodCallException on commons
[07:56:53] <moritzm>	 !log rearmed keyholder on acmechief-test2001
[07:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:06] <vgutierrez>	 sigh.. I forgot that one :_(
[07:57:09] <vgutierrez>	 thx moritzm 
[07:57:55] <moritzm>	 I broke it, I fix it :-)
[07:59:55] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging]
[07:59:56] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid cluster staging completed
[07:59:56] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid finished
[07:59:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:37] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-codfw-values.yaml production stable/citoid [namespace: citoid, clusters: codfw]
[08:00:39] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid cluster codfw completed
[08:00:39] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid finished
[08:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:26] <ema>	 !log cache nodes: resume rolling reboots for kernel and varnish upgrades T224694
[08:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:31] <stashbot>	 T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694
[08:02:03] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] haproxy: haproxy.cfg.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) (owner: 10Effie Mouzeli)
[08:05:28] <wikibugs>	 (03PS3) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073)
[08:05:47] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:05:53] <wikibugs>	 (03CR) 10Mathew.onipe: icinga: cirrus masters eligible check (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: 10Mathew.onipe)
[08:05:57] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: 10Mathew.onipe)
[08:06:47] <icinga-wm>	 PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim]
[08:07:10] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:27] <ema>	 hello icinga-wm!
[08:08:23] <wikibugs>	 (03PS4) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073)
[08:08:47] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[08:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:01] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui)
[08:11:51] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:12:23] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:13:05] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-eqiad-values.yaml production stable/citoid [namespace: citoid, clusters: eqiad]
[08:13:06] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid cluster eqiad completed
[08:13:06] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm citoid finished
[08:13:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:31] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging]
[08:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:37] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging]
[08:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:44] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed
[08:13:44] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid finished
[08:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:23] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[08:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:07] * akosiaris looking into cr1-eqiad
[08:18:27] <wikibugs>	 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (10elukey) Created https://github.com/Dev25/mcrouter_exporter/pull/10
[08:18:49] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw]
[08:18:51] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed
[08:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:53] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed
[08:18:54] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm mathoid finished
[08:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:01] <wikibugs>	 (03PS10) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257)
[08:20:13] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-values.yaml production stable/termbox [namespace: termbox, clusters: eqiad,codfw]
[08:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:24] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox cluster eqiad completed
[08:20:27] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox cluster codfw completed
[08:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:27] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox finished
[08:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:42] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-staging-values.yaml staging stable/termbox [namespace: termbox, clusters: staging]
[08:20:42] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox cluster staging completed
[08:20:43] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox finished
[08:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:19] <akosiaris>	 !log upgrade citoid, mathoid, termbox to latest chart releases to address the GC metric naming issue T220709 T222795
[08:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:25] <stashbot>	 T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709
[08:21:25] <stashbot>	 T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795
[08:22:02] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[08:24:16] <moritzm>	 !log installing new kernels with SACK fix on jessie servers
[08:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:28] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Aklapper) Would it be possible to run that script once manually within this month, to get the stats for May 2019? Otherwise we'll neve...
[08:28:48] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:28:50] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Marostegui) I just ran it, but I think it gave the delta between today and 1 month ago as most of the queries are: `  SELECT COUNT(DIS...
[08:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:53] <wikibugs>	 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['maps1002.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906190828_gehel_2...
[08:30:44] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Marostegui) So if we want the ones from may we need to modify all the queries on that script to make them to pick the right range, not...
[08:33:28] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10ArielGlenn) I could probably hack the script to let one optionally specify start date and end date, is it worth it though?
[08:33:57] <icinga-wm>	 RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:34:23] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:47] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Aklapper) No need to change any code, the [script](https://phabricator.wikimedia.org/source/operations-puppet/browse/production/module...
[08:35:46] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Aklapper) Argh. I should get my first coffee before posting I guess, because `date_format(NOW())`, indeed. Sorry.
[08:36:45] <akosiaris>	 !log cordon kubernetes2001 to investigate some IP out discard statistics
[08:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:57] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[08:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:21] <icinga-wm>	 PROBLEM - DPKG on restbase-dev1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:37:38] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Aklapper) Ah, true, indeed that won't help because there's no specific reason provided.  I have not received any `Bounce action notification` messages in...
[08:38:21] <wikibugs>	 (03CR) 10Cparle: [C: 03+1] [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie)
[08:42:37] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Marostegui) Regardless of the query...the email hasn't arrived yet and the script didn't show any errors. So probably some debugging i...
[08:42:42] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes2001.*
[08:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:00] <akosiaris>	 !log depool kubernetes2001 from all services to investigate some IP out discard statistics
[08:43:07] <wikibugs>	 (03PS5) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073)
[08:43:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:20] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[08:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:21] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:51:51] <XioNoX>	 !log jnt push to codfw, remove old protect-old-lvs-servers term + update syslog target T224128
[08:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:55] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[08:54:17] <akosiaris>	 !log uncordon kubernetes2001, reschedule some pods on it. Investigating out discards still
[08:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:34] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes2002.*
[08:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:42] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes2003.*
[08:56:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:57] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:56:59] <akosiaris>	 !log depool kubernetes200{2,3} for the same out discards investigation
[08:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:13] <wikibugs>	 (03CR) 10Hashar: "Applied and cleaned up the instances" [puppet] - 10https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar)
[08:59:08] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10RhinosF1) The email has arrived to us for may now @Marostegui
[09:00:36] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Applied / instances purged." [puppet] - 10https://gerrit.wikimedia.org/r/517091 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar)
[09:00:38] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Applied / instances purged." [puppet] - 10https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar)
[09:01:04] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Marostegui) Yeah, it got some delay: ` Wed, Jun 19, 2019 at 10:27 AM (Delivered after 1752 seconds) `
[09:03:20] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:03:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:58] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[09:06:00] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2003.*
[09:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:08] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2002.*
[09:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:31] <akosiaris>	 !log repool kubernetes2002, kubernetes2003. Point proven, chasing down load
[09:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:36] <akosiaris>	 !log repool kubernetes2002, kubernetes2003. Point proven, chasing down lead
[09:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:48] <yannf>	 https://commons.wikimedia.org/wiki/File:Heidi_Klum_with_Liza_Minnelli_at_The_Heart_Truth_Fashion_Show_2008.jpg
[09:07:06] <yannf>	 the description was moved, but not the file :/
[09:07:38] <yannf>	 2nd time I got this today
[09:07:45] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10Marostegui) 05Open→03Resolved a:03ArielGlenn Looks fixed then
[09:08:15] <wikibugs>	 (03PS1) 10Ema: varnishfetcherror: do not overwrite multiple tags [puppet] - 10https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994)
[09:09:50] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[09:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:02] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:11:07] <wikibugs>	 (03PS6) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073)
[09:11:38] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: 10Mathew.onipe)
[09:12:54] <wikibugs>	 (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/517812 (https://phabricator.wikimedia.org/T212257)
[09:13:05] <wikibugs>	 (03PS7) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073)
[09:13:48] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2001.*
[09:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/517812 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[09:14:51] <marostegui>	 !log Start MySQL on db1077 - s3 labsdb lag should start catching up T225981
[09:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:57] <stashbot>	 T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981
[09:17:06] <wikibugs>	 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1002.eqiad.wmnet'] `  and were **ALL** successful.
[09:19:59] <XioNoX>	 !log jnt push to esams, remove old protect-old-lvs-servers term + update syslog target T224128
[09:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:05] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[09:20:54] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981)
[09:21:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[09:22:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[09:23:13] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[09:23:32] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[09:24:55] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1077 T225981 (duration: 01m 00s)
[09:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:00] <stashbot>	 T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981
[09:25:18] <wikibugs>	 (03PS1) 10Elukey: profile::kerberos::keytabs: ensure the keytab's parent dir [puppet] - 10https://gerrit.wikimedia.org/r/517814 (https://phabricator.wikimedia.org/T212257)
[09:25:58] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:55] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517815
[09:29:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] profile::kerberos::keytabs: ensure the keytab's parent dir [puppet] - 10https://gerrit.wikimedia.org/r/517814 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[09:29:50] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::kerberos::keytabs: ensure the keytab's parent dir [puppet] - 10https://gerrit.wikimedia.org/r/517814 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[09:31:31] <wikibugs>	 (03PS2) 10Ema: varnishfetcherror: do not overwrite multiple tags [puppet] - 10https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994)
[09:32:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517815 (owner: 10Marostegui)
[09:32:54] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517815 (owner: 10Marostegui)
[09:33:08] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517815 (owner: 10Marostegui)
[09:34:08] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 (duration: 00m 55s)
[09:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:22] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[09:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:53] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[09:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:10] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517816
[09:42:19] <wikibugs>	 (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051)
[09:42:44] <wikibugs>	 (03PS1) 10Elukey: Deploy keytabs to the Analytics Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/517817 (https://phabricator.wikimedia.org/T212257)
[09:43:47] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Deploy keytabs to the Analytics Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/517817 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[09:44:22] <wikibugs>	 (03PS3) 10Ema: varnishfetcherror: do not overwrite multiple tags [puppet] - 10https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994)
[09:44:50] <zeljkof>	 jouncebot: next
[09:44:50] <jouncebot>	 In 25 hour(s) and 15 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190620T1100)
[09:44:56] <zeljkof>	 :)
[09:45:14] <wikibugs>	 (03CR) 10Ema: [C: 03+2] varnishfetcherror: do not overwrite multiple tags [puppet] - 10https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994) (owner: 10Ema)
[09:46:17] <wikibugs>	 (03PS2) 10Muehlenhoff: rm old ssh public key for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/517475 (owner: 10Mholloway)
[09:46:55] <wikibugs>	 (03PS10) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[09:47:21] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517816 (owner: 10Marostegui)
[09:47:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] rm old ssh public key for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/517475 (owner: 10Mholloway)
[09:48:11] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517816 (owner: 10Marostegui)
[09:48:26] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517816 (owner: 10Marostegui)
[09:49:17] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 (duration: 00m 55s)
[09:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:12] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981)
[09:53:40] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: 10Mathew.onipe)
[09:53:52] <wikibugs>	 (03PS8) 10Gehel: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: 10Mathew.onipe)
[09:54:22] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:53] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[10:00:58] <wikibugs>	 (03PS1) 10Alaa Sarhan: Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086)
[10:01:00] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[10:01:37] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:09] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1077 (duration: 00m 55s)
[10:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) (owner: 10Alaa Sarhan)
[10:03:14] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:41] <ema>	 !log cp3030: increase varnish-be thread_pool_max from 12000 (250 * 48) to 14400 (300 * 48) to observe impact on fetcherrors
[10:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:50] <wikibugs>	 (03PS1) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225055)
[10:06:34] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) (owner: 10Marostegui)
[10:07:10] <wikibugs>	 (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051)
[10:12:16] <wikibugs>	 (03PS1) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517822 (https://phabricator.wikimedia.org/T225051)
[10:14:39] <wikibugs>	 (03PS2) 10Alaa Sarhan: Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086)
[10:15:15] <wikibugs>	 (03PS3) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051)
[10:15:25] <wikibugs>	 (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517822 (https://phabricator.wikimedia.org/T225051)
[10:16:16] <wikibugs>	 (03PS8) 10Gilles: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339)
[10:16:32] <wikibugs>	 (03CR) 10Gilles: "(just rebasing for now)" [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles)
[10:20:07] <wikibugs>	 (03PS1) 10Elukey: profile::kerberos::keytabs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/517823
[10:20:52] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::kerberos::keytabs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/517823 (owner: 10Elukey)
[10:21:37] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:15] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:23:15] <wikibugs>	 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (10Marostegui)
[10:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:59] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Marostegui) This host is no longer a master and will be decommissioned in a few days
[10:25:15] <wikibugs>	 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff)
[10:27:00] <logmsgbot>	 !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=99)
[10:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:58] <wikibugs>	 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff)
[10:28:30] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+2] [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie)
[10:29:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install - late_command: Ensure correct version of puppet/facter are installed [puppet] - 10https://gerrit.wikimedia.org/r/515087 (owner: 10Jbond)
[10:29:21] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-values.yaml production stable/termbox [namespace: termbox, clusters: eqiad]
[10:29:22] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox cluster eqiad completed
[10:29:22] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox finished
[10:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:28] <wikibugs>	 (03Merged) 10jenkins-bot: [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie)
[10:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:30] <wikibugs>	 (03PS2) 10Jbond: install - late_command: Ensure correct version of puppet/facter are installed [puppet] - 10https://gerrit.wikimedia.org/r/515087
[10:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:43] <wikibugs>	 (03CR) 10jenkins-bot: [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie)
[10:30:11] <moritzm>	 !log installing glibc and ca-certificates-java updates from stretch point release
[10:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:21] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:59] <jbond42>	 !log update late-install so it installs the correct puppet version https://gerrit.wikimedia.org/r/c/operations/puppet/+/515087
[10:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall logging: Enable logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511704 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[10:32:23] <wikibugs>	 (03PS2) 10Jbond: firewall logging: Enable logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511704 (https://phabricator.wikimedia.org/T116011)
[10:33:34] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-staging-values.yaml staging stable/termbox [namespace: termbox, clusters: staging]
[10:33:35] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox cluster staging completed
[10:33:35] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm termbox finished
[10:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:14] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[10:35:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:16] <moritzm>	 !log rebooting mx2001 for kernel security update
[10:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:46] <wikibugs>	 (03Abandoned) 10Jbond: RAID: replace hpssacli with sscli [puppet] - 10https://gerrit.wikimedia.org/r/505760 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond)
[10:38:32] <logmsgbot>	 !log ladsgroup@deploy1001 scap-helm termbox upgrade -f termbox-values.yaml production stable/termbox [namespace: termbox, clusters: codfw]
[10:38:34] <logmsgbot>	 !log ladsgroup@deploy1001 scap-helm termbox cluster codfw completed
[10:38:34] <logmsgbot>	 !log ladsgroup@deploy1001 scap-helm termbox finished
[10:38:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:07] <wikibugs>	 (03Abandoned) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 (owner: 10Jbond)
[10:40:40] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on restbase-dev1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Muehlenhoff T224260
[10:40:41] <wikibugs>	 (03Abandoned) 10Jbond: late_command: rollback puppet5 changes [puppet] - 10https://gerrit.wikimedia.org/r/514865 (owner: 10Jbond)
[10:42:27] <wikibugs>	 (03Abandoned) 10Jbond: puppet agent: mask service [puppet] - 10https://gerrit.wikimedia.org/r/515075 (owner: 10Jbond)
[10:42:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[10:43:05] <wikibugs>	 (03PS10) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459
[10:47:01] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:10] <wikibugs>	 (03CR) 10Hoo man: [C: 03+1] Introduce config variables for new terms store in mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) (owner: 10Alaa Sarhan)
[10:50:22] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:15] <moritzm>	 !log rebooting mx1001 for kernel security update
[10:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:13] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:45] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita)
[10:55:14] <wikibugs>	 (03CR) 10Hoo man: [C: 03+1] Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan)
[10:59:11] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:59:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828
[10:59:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:10] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:05:16] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita)
[11:07:13] <ema>	 !log cache nodes: pause rolling reboots for kernel and varnish upgrades T224694 T225998
[11:07:13] <wikibugs>	 (03PS9) 10Gilles: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339)
[11:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:19] <stashbot>	 T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998
[11:07:19] <stashbot>	 T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694
[11:07:56] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10aborrero)
[11:09:42] <wikibugs>	 (03CR) 10Gilles: "@Ema the tests pass now, using run.py" [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles)
[11:12:58] <wikibugs>	 (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051)
[11:13:36] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Aklapper) Let's see... https://www.mediawiki.org/wiki/User:Dvpita states that the user is "QA Engineer (contract) , Language team (International)". I don't see Jpita listed on the public...
[11:15:36] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita) how can I fix that?
[11:19:15] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui)
[11:20:41] <wikibugs>	 (03PS1) 10Jbond: firewall logging: enable ulog by default [puppet] - 10https://gerrit.wikimedia.org/r/517832 (https://phabricator.wikimedia.org/T116011)
[11:20:43] <wikibugs>	 (03PS1) 10Jbond: firewall logging: clean up old roll-out classes [puppet] - 10https://gerrit.wikimedia.org/r/517833 (https://phabricator.wikimedia.org/T116011)
[11:20:45] <wikibugs>	 (03PS1) 10Jbond: firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011)
[11:20:48] <wikibugs>	 (03PS1) 10Jbond: firewall logging: make profile::base::firewall::log  a private class [puppet] - 10https://gerrit.wikimedia.org/r/517835 (https://phabricator.wikimedia.org/T116011)
[11:22:54] <wikibugs>	 (03PS2) 10Jbond: firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011)
[11:23:41] <wikibugs>	 (03PS2) 10Jbond: firewall logging: enable ulog by default [puppet] - 10https://gerrit.wikimedia.org/r/517832 (https://phabricator.wikimedia.org/T116011)
[11:29:03] <wikibugs>	 10Operations, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) p:05Triage→03Normal
[11:29:49] <wikibugs>	 10Operations, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn)
[11:34:10] <wikibugs>	 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn)
[11:44:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall logging: enable ulog by default [puppet] - 10https://gerrit.wikimedia.org/r/517832 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[11:45:16] <wikibugs>	 (03PS1) 10Jbond: Revert "firewall logging: enable ulog by default" [puppet] - 10https://gerrit.wikimedia.org/r/517838
[11:50:10] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:50:20] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:50:38] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): etcd: listen-peer-urls only supports IP addresses and no FQDNs - https://phabricator.wikimedia.org/T226095 (10aborrero)
[11:53:38] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10zeljkofilipin) @Aklapper do we have an official on-boarding document? :) Meaning, is there a process new hire should follow?  #release-engineering-team has a checklist at [[ https://www.m...
[11:54:32] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:54:40] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:58:17] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Arrbee) Hello, this is an approved request for @Jpita . Thanks.
[12:04:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "IS.php needs to be synced first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) (owner: 10Alaa Sarhan)
[12:07:36] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:07:44] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:08:51] <wikibugs>	 (03PS3) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051)
[12:09:55] <wikibugs>	 (03PS4) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051)
[12:09:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan)
[12:10:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:11:45] <wikibugs>	 (03PS2) 10Matthias Mullie: Increase rate limits for newbies on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516633 (https://phabricator.wikimedia.org/T225148)
[12:12:07] <wikibugs>	 (03PS2) 10Matthias Mullie: [SDC] Enable depicts qualifiers on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517381
[12:12:14] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:14:06] <wikibugs>	 (03PS3) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051)
[12:14:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:14:58] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:16:18] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:16:26] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:16:58] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:20:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall logging: clean up old roll-out classes [puppet] - 10https://gerrit.wikimedia.org/r/517833 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:20:25] <wikibugs>	 (03PS2) 10Jbond: firewall logging: clean up old roll-out classes [puppet] - 10https://gerrit.wikimedia.org/r/517833 (https://phabricator.wikimedia.org/T116011)
[12:20:43] <wikibugs>	 (03PS3) 10Jbond: firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011)
[12:20:48] <ema>	 !log cache nodes: resume rolling reboots for kernel and varnish upgrades T224694 T225998
[12:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:55] <stashbot>	 T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998
[12:20:55] <stashbot>	 T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694
[12:21:04] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[12:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:54] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "Noop for the dump workers and servers, so signing off on that part of this commit." [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:22:16] <wikibugs>	 (03Abandoned) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517822 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan)
[12:22:49] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:22:54] <wikibugs>	 (03Abandoned) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan)
[12:25:10] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[12:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:37] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[12:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:49] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[12:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:30] <wikibugs>	 (03PS2) 10Jbond: firewall logging: make profile::base::firewall::log  a private class [puppet] - 10https://gerrit.wikimedia.org/r/517835 (https://phabricator.wikimedia.org/T116011)
[12:50:46] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.16.42:9042 on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[12:51:38] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[12:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:50] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[12:51:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall logging: make profile::base::firewall::log  a private class [puppet] - 10https://gerrit.wikimedia.org/r/517835 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:53:08] <marostegui>	 !log Deploy schema change on the private wikis listed at T225643
[12:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:13] <stashbot>	 T225643: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643
[12:57:51] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[12:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:07] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[13:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:14] <wikibugs>	 (03CR) 10Hashar: "recheck" [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto)
[13:00:27] <wikibugs>	 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10jbond) Closing as logging is now enabled by default for any role with the `profile::base::firewall` class.   Please reopen if more work is required
[13:00:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto)
[13:02:14] <wikibugs>	 (03Abandoned) 10Jbond: Revert "firewall logging: enable ulog by default" [puppet] - 10https://gerrit.wikimedia.org/r/517838 (owner: 10Jbond)
[13:02:58] <wikibugs>	 (03PS3) 10Jbond: hiera: update search order [puppet] - 10https://gerrit.wikimedia.org/r/511686
[13:03:14] <wikibugs>	 (03CR) 10Jbond: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond)
[13:03:48] <wikibugs>	 (03PS1) 10Hashar: Add .gitreview [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517851
[13:04:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add .gitreview [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517851 (owner: 10Hashar)
[13:06:25] <wikibugs>	 (03PS2) 10Hashar: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto)
[13:14:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto)
[13:17:52] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[13:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:08] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[13:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3  directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[13:24:30] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[13:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:56] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10BBlack)
[13:27:03] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[13:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "If this is intended to perform tests on the behaviour of AsyncRoute, I think it's ok. On the long run, though, I'd prefer to see both an A" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz)
[13:44:30] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[13:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:04] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[13:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:18] <wikibugs>	 (03PS2) 10Ottomata: Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267)
[13:48:21] <cdanis>	 !log apt upgrade on wikitech-static
[13:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:41] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10Krinkle) If re-using `techblog.wikimedia.org`, please take care not to break existing urls. The root path would be fine to change as it...
[13:50:30] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata)
[13:50:39] <cdanis>	 !log rebooting wikitech-static
[13:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:49] <wikibugs>	 (03CR) 10jenkins-bot: Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata)
[13:51:39] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[13:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:30] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[13:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:23] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10BBlack) Implementing a blanket redirect to the legacy blog URI for `^/20(0[7-9]|1[0-8])/` should be feasible in VCL or Lua at the edge....
[13:56:04] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disabling Avro ApiAction Monolog channel - T222267 (duration: 00m 57s)
[13:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:10] <stashbot>	 T222267: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267
[13:57:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Premise LGTM, some comments though." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond)
[13:58:46] <icinga-wm>	 RECOVERY - cassandra CQL 10.64.16.42:9042 on maps1002 is OK: TCP OK - 0.000 second response time on 10.64.16.42 port 9042 https://phabricator.wikimedia.org/T93886
[14:00:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes)
[14:00:25] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes)
[14:05:01] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Aklapper) >>! In T226091#5268341, @zeljkofilipin wrote: > @Aklapper do we have an official on-boarding document? :) Meaning, is there a process new hire should follow?  I'm afraid you hav...
[14:06:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:06:36] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:41] <moritzm>	 !log rolling reboot of mwdebug servers for kernel security update
[14:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:41] <Elitre>	 hey guys, are there issues with job queues currently? I'm again in the situation of having sent a MassMessage hours ago and hasn't arrived yet... https://meta.wikimedia.org/wiki/Special:Log/massmessage
[14:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:47] <apergos>	 when I looked earlier at the graphs and the kibana errors there was nothing that looked weird
[14:10:35] <apergos>	 but joe had problems earlier with a template included in a page not forcing the rerender (a template included only on 1-2 pages)
[14:10:41] <wikibugs>	 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Gilles)
[14:11:40] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[14:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:04] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles)
[14:12:29] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (10Gilles) 05Stalled→03Open
[14:12:34] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) @WMDE-leszek, @Tarrow.   I 've noticed we are missing one thing. We have a dashboard for the service's metrics in https://grafan...
[14:12:41] <wikibugs>	 10Operations, 10MassMessage, 10Patch-For-Review: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (10Elitre) 05duplicate→03Open I'm experiencing this again :/ Help? It's been several hours. https://meta.wikimedia.org/wiki/Special:Log/massmessage
[14:13:10] <Elitre>	 I've reopened a task since it's the same, can file a different one if necessary.
[14:13:31] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[14:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:35] <Elitre>	 If I have to ping someone in particular here or there, also please LMK. Thanks as usual :)
[14:15:29] <wikibugs>	 (03PS8) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280
[14:16:48] <jynus>	 Elitre: I don't see problems on the infrastructure handling jobs: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1
[14:17:03] <jynus>	 but of course it could be they are not sent or processed in the first place
[14:19:50] <Elitre>	 jynus: thanks, I have no idea what's going on :/ there's no error messages. I thought maybe I had targeted an impossible namespace, but Johan's regular tests also didn't go thru, so it's not that.
[14:19:53] <Reedy>	 I don't see anything in the exception/fatal logs
[14:20:28] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[14:20:29] <wikibugs>	 (03CR) 10Jbond: "Thanks for the review, updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond)
[14:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:48] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[14:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:45] <wikibugs>	 (03CR) 10Jbond: "PCC - https://puppet-compiler.wmflabs.org/compiler1002/17020/" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond)
[14:21:59] <jynus>	 Reedy: same, no errors, no change of patterns (except on the wikibase ones, which is expected)
[14:22:53] <Reedy>	 https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.10/Changelog#MassMessage
[14:22:55] <jynus>	 Reedy: it could be something on code that doesn't error out but doesn't do stuff (?) do you know a good way to check the latest merges ?
[14:23:15] <Reedy>	 That list should be wmf.8...wmf.10 for the extension
[14:23:33] <Reedy>	 No obvious other commit message that suggests job queue changes
[14:23:52] <apergos>	 note that the earlier issue was not massMessage but refreshLinks or whatever queues those up
[14:24:01] <wikibugs>	 (03PS9) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280
[14:24:09] <Reedy>	 Elitre: It was fine on the 17th though, right?
[14:24:39] <Elitre>	 The last Tech News issue went thru just fine AFAIK
[14:24:49] <jynus>	 when was that?
[14:25:04] <Reedy>	 https://meta.wikimedia.org/wiki/Special:Log/massmessage
[14:25:08] <Reedy>	 20:37, 17 June 2019 Johan (WMF) talk contribs sent a message to Global message delivery/Targets/Tech ambassadors (Tech News: 2019-25) Tag: PHP7
[14:25:25] <Reedy>	 I note Johan's was PHP7 but Elitre's wasn't, so not obviously a PHP7 bug either
[14:26:01] <Reedy>	 Any idea which wikis mass message is most active on?
[14:26:12] <Reedy>	 https://en.wikipedia.org/wiki/Special:Log/massmessage last was yesterday well before the branch
[14:26:58] <wikibugs>	 10Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (10jbond) 05Open→03Resolved This should all be fixed now; please reopen if more errors are observed
[14:27:08] <Elitre>	 I guess Meta and en.wiki, but it's a guess
[14:27:34] <wikibugs>	 (03PS10) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280
[14:28:26] <Reedy>	 Elitre: Any chance you could test one on enwiki?
[14:28:40] <Reedy>	 To see if we can narrow if it might be a MW change, or if it's some potential backend job queue issue
[14:28:48] <Elitre>	 I don't have MM rights there AFAIK
[14:29:10] <wikibugs>	 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10jbond)
[14:29:13] <wikibugs>	 10Operations, 10Patch-For-Review: cron-spam: /usr/local/sbin/check-cumin-aliases - https://phabricator.wikimedia.org/T222443 (10jbond) 05Open→03Resolved These boxes were switched to spare, so spam should be halted; please reopen if not
[14:32:01] <Reedy>	 https://id.wikipedia.org/wiki/Istimewa:Catatan/massmessage
[14:32:29] <wikibugs>	 10Operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10jbond) 05Open→03Resolved
[14:32:52] <Reedy>	 ok, in JobExecutor.log...
[14:33:03] <Reedy>	 I don't see any .10 jobs for MassMessage
[14:33:06] <Reedy>	 literally only .8 ones
[14:33:10] <wikibugs>	 (03PS1) 10Elukey: Add fake kerberos keytabs for the Hadoop testing cluster [labs/private] - 10https://gerrit.wikimedia.org/r/517870
[14:33:12] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: Expose linux kernel firewall and connections statistics - https://phabricator.wikimedia.org/T215277 (10jbond) 05Stalled→03Resolved Closing this; a new task can be created if we revisit this
[14:33:27] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytabs for the Hadoop testing cluster [labs/private] - 10https://gerrit.wikimedia.org/r/517870 (owner: 10Elukey)
[14:34:06] <jynus>	 Reedy: so either not enqueued or not executed- code regression?
[14:34:17] <apergos>	 https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now-2d&to=now&panelId=5&fullscreen
[14:34:24] <apergos>	 I wonder if this reflects what's going on
[14:34:42] <Reedy>	 Possibly... Other .10 jobs are running, so job queue stuff definitely isn't completely broken on .10
[14:34:43] <apergos>	 backlog shorter because things not getting queued?
[14:34:56] <wikibugs>	 (03PS1) 10Ottomata: Disable CirrusSearchRequestSet avro monolog channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T222268)
[14:35:06] <jynus>	 apergos: note that is only cirrus
[14:35:19] <jynus>	 which had a temporary anomaly
[14:35:22] <apergos>	 mm point
[14:35:37] <jynus>	 apergos: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now-7d&to=now&panelId=5&fullscreen
[14:35:47] <jynus>	 the rest look "normal"
[14:35:56] <apergos>	 oic
[14:35:57] <wikibugs>	 (03CR) 10Ottomata: "This should not be merged until the Search team verifies that they are using cirrusseach-request instead of this avro stream." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata)
[14:36:10] <wikibugs>	 (03PS1) 10Elukey: Add missing kerberos fake keytab for analytics1039 [labs/private] - 10https://gerrit.wikimedia.org/r/517872
[14:36:25] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add missing kerberos fake keytab for analytics1039 [labs/private] - 10https://gerrit.wikimedia.org/r/517872 (owner: 10Elukey)
[14:37:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: ipmi - pxe: Ensure ipmi is not overriding the boot order (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond)
[14:37:32] <wikibugs>	 10Operations, 10MassMessage, 10Patch-For-Review: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (10Reedy)
[14:37:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] hiera: update search order [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond)
[14:38:40] <wikibugs>	 10Operations, 10MassMessage, 10WMF-JobQueue: MassMessages apparently not being sent on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy)
[14:38:49] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17024/ - seems that we are ready to test :)" [puppet] - 10https://gerrit.wikimedia.org/r/504280 (owner: 10Elukey)
[14:39:09] <wikibugs>	 10Operations, 10MassMessage, 10WMF-JobQueue: MassMessages apparently not being sent on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy)
[14:39:29] <wikibugs>	 (03PS11) 10Elukey: Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280
[14:39:42] <elukey>	 ottomata: --^ we are ready to test it finally :D
[14:39:49] <wikibugs>	 10Operations, 10MassMessage, 10WMF-JobQueue: MassMessages apparently not being sent on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) Looking at `mwlog1001:/srv/mw-log/JobExecutor.log` there are no massmessage jobs for .10 only for .8
[14:40:10] <wikibugs>	 (03CR) 10Aaron Schulz: "Both would be nice for implementing BagOStuff::WRITE_SYNC. I guess such a feature could just use another prefix, like mw-wan-sync or such." [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz)
[14:40:14] <Reedy>	 apergos: jynus thoughts on rolling back all of group1?
[14:40:28] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[14:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:33] <revi>	 hi, global renames are stuck
[14:40:36] <wikibugs>	 (03PS1) 10Ottomata: Remove Monolog Kafka handler and configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517874 (https://phabricator.wikimedia.org/T222268)
[14:40:45] <Reedy>	 revi: ughhh
[14:40:49] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[14:40:52] <jynus>	 I think it may not be only mass message
[14:40:52] <revi>	 screenshots (too lazy on mobile) https://usercontent.irccloud-cdn.com/file/ziTYFoe1/IMG_0209.PNG
[14:40:53] <Reedy>	 That's a second data point though
[14:40:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:55] <jynus>	 but most jobs
[14:41:03] <apergos>	 uhhh
[14:41:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527)
[14:41:17] <jynus>	 most .10 jobs
[14:41:32] <wikibugs>	 (03PS1) 10Reedy: Revert "group1 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109)
[14:41:45] <apergos>	 no us staff here today because holiday. right 
[14:41:55] <jynus>	 I only see brief errors from the job queue on .10, when normally it is a constant noise
[14:42:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove Monolog Kafka handler and configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517874 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata)
[14:42:10] <jynus>	 probably the amount is not large enough to change patterns?
[14:42:39] <Reedy>	 considering 635 wikis are on .10 vs 299 on .8...
[14:42:43] <apergos>	 from group 1? maybe
[14:42:50] <Reedy>	 The number of log entries into that log file
[14:43:00] <Reedy>	 I'll revert
[14:43:03] <jynus>	 +1
[14:43:05] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Revert "group1 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[14:43:10] <jynus>	 but let's check that solve it
[14:43:16] <Reedy>	 Sure
[14:43:30] <Reedy>	 We might want to revert group0 too, but leaving those broken might be useful
[14:43:32] <jynus>	 I am convinced of the issue, not sure if that is the cause
[14:43:47] <Reedy>	 Yeah, indeed
[14:44:10] <jynus>	 revi: even if it looks like we are ignoring you, we are actually doing something about that :-D
[14:44:17] <revi>	 jynus: assumed so
[14:44:21] <jynus>	 more like Reedy is
[14:44:32] <wikibugs>	 (03PS3) 10Jbond: ipmi - pxe: Ensure ipmi is not overriding the boot order [puppet] - 10https://gerrit.wikimedia.org/r/517694
[14:44:32] <revi>	 by scrolling up and found discussing JobQueue
[14:44:33] <Reedy>	 revi: Thanks for coming and telling us though, appreciated
[14:44:42] <revi>	 well we spotted it a few hours ago
[14:44:44] <jynus>	 indeed
[14:44:46] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[14:44:48] <wikibugs>	 (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond)
[14:44:54] <revi>	 by someone in [global-rename] complaining 'rename stuck'
[14:45:21] <wikibugs>	 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) From email from @MarkTraceur   == Database needs == * 54 million files on Commons * Estimated average of 10-20 statements per file * Estimate...
[14:45:29] <revi>	 just issued 'stop rename' warning mails
[14:45:35] <wikibugs>	 10Operations, 10MassMessage, 10WMF-JobQueue, 10Patch-For-Review: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy)
[14:46:17] <wikibugs>	 (03CR) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[14:46:36] <logmsgbot>	 !log reedy@deploy1001 rebuilt and synchronized wikiversions files: group1 back to .8 T226109
[14:46:51] <wikibugs>	 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue, 10Patch-For-Review: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Krinkle)
[14:46:56] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[14:47:19] <Reedy>	 There's some "Producer error" and some "Retry count exceeded"
[14:47:40] <wikibugs>	 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) p:05Triage→03High
[14:47:41] <apergos>	 I looked at the few of them I saw and they were most unenlightening
[14:47:47] <XioNoX>	 !log jnt push to eqiad, remove old protect-old-lvs-servers term + update syslog target T224128
[14:47:47] <wikibugs>	 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Krinkle) p:05High→03Unbreak! Train blockers are UBN.
[14:48:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:15] <stashbot>	 T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109
[14:48:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:25] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[14:48:28] <jynus>	 how to test at least with new jobs?
[14:48:30] <Reedy>	 Elitre: Want to give it a try on meta again please?
[14:48:34] <jynus>	 ^that
[14:48:35] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[14:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:47] <Elitre>	 Reedy: promise it won't deliver it twice? :)
[14:48:49] * apergos crosses fingers
[14:49:25] <revi>	 or try a test delivery
[14:49:39] <wikibugs>	 (03PS2) 10Bstorm: dumps distribution: enable the service for nfs to start on reboot [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474)
[14:49:41] <jynus>	 just 1 person would be enough
[14:49:41] <Reedy>	 I honestly can't
[14:49:44] <Reedy>	 But yeah, a test one is fine
[14:49:55] <revi>	 I can try test delivery if Elitre isn't doing it
[14:50:06] <Elitre>	 I am testing it. Thanks.
[14:50:10] <revi>	 k gud
[14:50:17] <revi>	 what will happen to stuck stuff?
[14:50:23] <revi>	 kick them again?
[14:50:26] <apergos>	 depends if it ever made it into the queue
[14:50:32] <jynus>	 we need to prove that was the issue first
[14:50:38] <revi>	 those currently in the RenameQueue
[14:50:44] <jynus>	 later we will deal with stuck, etc.
[14:50:48] <revi>	 kk
[14:51:16] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] dumps distribution: enable the service for nfs to start on reboot [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474) (owner: 10Bstorm)
[14:51:25] <wikibugs>	 (03CR) 10Jbond: "some questions probably obvious to most :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff)
[14:51:34] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828
[14:52:43] <revi>	 23:52:24 <bot> [[User talk:-revi]]; MediaWiki message delivery; /* Test */ new section; https://meta.wikimedia.org/w/index.php?diff=19160132&oldid=19159693&rcid=13727936
[14:52:55] <apergos>	 oh ho ho!
[14:53:06] <Reedy>	 So it's definitely something in .10 then
[14:53:10] <apergos>	 who's gonna git bisect that mess?
[14:53:16] <Elitre>	 sent.
[14:53:22] <Reedy>	 Though, revi is also sending messages in the future
[14:53:27] * Reedy eyes revi suspiciously
[14:53:31] <revi>	 >_<
[14:53:34] * Reedy grins
[14:53:35] <apergos>	 :-D
[14:54:03] <jynus>	 the problem is if it was enqueuing and not execution
[14:54:04] <moritzm>	 !log rolling reboot of URL downloaders for kernel security update
[14:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:01] <XioNoX>	 !log jnt push to knams, remove old protect-old-lvs-servers term + update syslog target (T224128) + replace /28 with /29 (T211254)
[14:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:07] <stashbot>	 T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[14:55:07] <stashbot>	 T211254: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254
[14:55:20] <Elitre>	 I will be back later to see if I'm good to go with the old message. TA.
[14:55:22] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:55:22] <logmsgbot>	 !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[14:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond)
[14:57:01] <Krinkle>	 Reedy: I hope it's only execution loss, and not queuing as well.
[14:57:09] <Krinkle>	 24 hours of jobs for all group1 wikis is pretty annoying.
[14:57:14] <jynus>	 indeed
[14:57:44] <Krinkle>	 e.g. cascading updates all gone, don't auto-correct. For page views, 30 days will eventually fix it, but not for whatlinkshere, search index, and category tree.
[14:57:51] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:57:52] <Krinkle>	 doesn't*
[14:57:54] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:58] <XioNoX>	 !log update syslog target on frack network devices (T224128)
[14:57:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:39] <wikibugs>	 (03CR) 10Elukey: "Aaron/Giuseppe: if you are ok I'd merge this and get as far as enabling it to mw canaries (testing on single nodes first). The idea is to " [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz)
[15:00:22] <wikibugs>	 (03CR) 10Bstorm: "Jhedden found that there's another piece to this, but this is going to be needed either way, so I'll merge it." [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) (owner: 10Bstorm)
[15:00:38] <wikibugs>	 (03PS2) 10Bstorm: cloudstore: move secondary monitoring stuff into profile and fix it [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265)
[15:01:17] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10netops, 10User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (10ayounsi) 05Open→03Resolved All done here!
[15:02:07] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: move secondary monitoring stuff into profile and fix it [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) (owner: 10Bstorm)
[15:06:57] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[15:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:35] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[15:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:26] <wikibugs>	 (03CR) 10Aaron Schulz: "Seems OK by me." [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz)
[15:10:10] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3  directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez)
[15:12:46] <Reedy>	 AaronSchulz: You actually about?
[15:13:33] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[15:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:52] <_joe_>	 Krinkle: I don't think it was just execution, no
[15:14:43] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3  directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:16:34] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[15:16:35] <Krinkle>	 _joe_: aye, yeah, the switch to Kafka has added a fair number of additional dependencies including UID generator and JSON encoding, both of which have repeatedly caused job outages or major perf overhead.
[15:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:54] <Krinkle>	 I've fixed UID generator to be less slow, and some uses of v1 have been swapped for v4.
[15:17:11] <Krinkle>	 although for the most part it seems pointless for jobs.
[15:17:23] <Krinkle>	 the json encoding is not yet solved, but being worked on.
[15:17:46] <Krinkle>	 I've also fixed the UID generator to not throw under normal conditions as it used to.
[15:18:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:18:03] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:18] <moritzm>	 !log rebooting boron for kernel security update
[15:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:40] <wikibugs>	 (03PS1) 10Elukey: Add a host override to analytics1032 to avoid using /dev/sdc [puppet] - 10https://gerrit.wikimedia.org/r/517881 (https://phabricator.wikimedia.org/T225864)
[15:20:12] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add a host override to analytics1032 to avoid using /dev/sdc [puppet] - 10https://gerrit.wikimedia.org/r/517881 (https://phabricator.wikimedia.org/T225864) (owner: 10Elukey)
[15:21:41] <_joe_>	 Krinkle: well the switch to kafka also caused the jobqueue to be reliable in most cases, instead of a constant dumpster fire as it used to be (in the general disinterest of everyone but me). I'd call it a win anyways
[15:21:43] <moritzm>	 !log rolling reboot of proton* for kernel security update
[15:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:34] <Krinkle>	 _joe_: Yeah, operationally a win. But from consumer and user perspective (MW and editors) we've had more outages and irrecoverable job loss than before. It'll get better, I know, but just stating how it is.
[15:24:08] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:24:11] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:13] <Krinkle>	 also more visible fatals (e.g. in web reqs instead of in the queue)
[15:24:14] <_joe_>	 I disagree about the editors part. The jobs were being dropped all the time before the switch
[15:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:44] <_joe_>	 and no, we didn't have more outages, it's just that weeks-long outages or lags on some jobs were just the norm.
[15:25:55] <Krinkle>	 I don't recall a user-visible outage (HTTP 500) being caused when we were on Redis. Afaik it's always been able to store new jobs, or the client was sufficiently simple, robust, or tolerant to not fatal.
[15:26:10] <Krinkle>	 But yes, lots of benefits, and jobs actually running is also important.
[15:26:46] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on analytics1032 - https://phabricator.wikimedia.org/T225864 (10elukey) 05Open→03Resolved a:03elukey
[15:27:51] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3  directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:28:26] <elukey>	 Krinkle: out of curiosity, what is the source of the HTTP 500 when sending jobs to eventbus?
[15:28:53] <elukey>	 corner cases that we didn't think about, etc.. ?
[15:29:45] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:30:08] <elukey>	 Joe has a good point though: probably we didn't cause 500s (that are bad) with Redis, but I recall weeks of great pain when dealing with gigantic backlogs of jobs to be processed
[15:30:48] <elukey>	 sometimes jobs were executed days after the enqueuing time, and usually because Joe was fixing it manually :D
[15:31:26] <hauskatze>	 Krinkle: are you discussing the global rename outage? If so, do we have a task?
[15:32:02] <cdanis>	 hauskatze: T226109
[15:32:02] <stashbot>	 T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109
[15:32:11] <hauskatze>	 cdanis: thanks :)
[15:33:34] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[15:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:44] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10jijiki) p:05Triage→03Normal
[15:35:04] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:35:39] <Krinkle>	 elukey: mainly these things: 1) UID generator fatalling, 2) JWT signing failing, 3) json encoding not able to represent some jobs (several times, still on-going), 4) kafka unavailable (only once afaik), 5) EventBus implementation out of sync with core (several times)
[15:35:51] <Krinkle>	 and these are all on the MW side, hence fatalling
[15:36:34] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[15:36:36] <Krinkle>	 also several issues with http/curl handling that I didn't quite understand at the time.
[15:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:39] <wikibugs>	 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Joe) A lot of jobs weren't even being executed/queued. Just looking at today's jobs  `lang=bash $ fgrep commonswiki JobExecutor.log | fgrep wmf.10 | per...
[15:37:35] <onimisionipe>	 !log pooled maps1002 - postgres init is complete and successfully joined to its cluster - T224395
[15:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:42] <stashbot>	 T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[15:38:06] <_joe_>	 Reedy: I think we should rollback group0 as well, see https://phabricator.wikimedia.org/T226109#5268921
[15:38:12] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:38:34] <Reedy>	 _joe_: Sure, I did suggest that we might. Patch incoming
[15:39:02] <_joe_>	 also, the friday rule applies to off days during the week as well
[15:39:19] <Reedy>	 I wasn't the engine driver ;)
[15:39:55] <Reedy>	 _joe_: test wikis too? or just group0?
[15:39:56] <_joe_>	 I know
[15:39:58] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:40:10] <_joe_>	 Reedy: test wikis is not my call
[15:40:21] <wikibugs>	 (03PS1) 10Reedy: Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109)
[15:40:24] <_joe_>	 but group0 for sure
[15:40:44] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[15:40:45] <wikibugs>	 (03PS2) 10Reedy: Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109)
[15:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:49] <wikibugs>	 (03PS3) 10Reedy: Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109)
[15:40:51] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[15:40:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[15:41:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[15:42:04] <wikibugs>	 (03CR) 10jenkins-bot: Revert "group0 wikis to 1.34.0-wmf.10  refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy)
[15:43:16] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[15:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:37] <logmsgbot>	 !log reedy@deploy1001 rebuilt and synchronized wikiversions files: group0 back to .8 T226109
[15:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:42] <stashbot>	 T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109
[15:45:25] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Cmjohnson) The DIMM has been reseated and swapped to the opposite sides.
[15:45:49] <wikibugs>	 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy)
[15:47:08] <icinga-wm>	 RECOVERY - Host elastic1029 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:51:20] <wikibugs>	 (03PS1) 10Fsero: introducing helmfile.d values for staging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130)
[15:51:28] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:51:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:52:24] <icinga-wm>	 PROBLEM - Apache HTTP on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:52:56] <icinga-wm>	 RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 77491 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:53:28] <icinga-wm>	 RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:53:38] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:53:40] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098)
[15:53:59] <paravoid>	 what's all that ^^?
[15:55:40] <wikibugs>	 (03PS1) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130)
[15:55:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/17032/" [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez)
[16:01:26] <ema>	 !log cache nodes: stop rolling reboots for today, 47/80 done T224694 T225998
[16:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:32] <stashbot>	 T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998
[16:01:32] <stashbot>	 T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694
[16:02:53] <wikibugs>	 (03PS1) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891
[16:03:39] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 (owner: 10Fsero)
[16:07:21] <wikibugs>	 (03PS2) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891
[16:07:46] <wikibugs>	 (03PS3) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891
[16:07:53] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 (owner: 10Fsero)
[16:10:22] <wikibugs>	 (03CR) 10Fsero: "PCC seems happy https://puppet-compiler.wmflabs.org/compiler1001/17034/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero)
[16:10:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Closing this for now, let me know if there is another issue. Keep in mind this...
[16:16:44] <hauskatze>	 Reedy: I was wondering if we could restart the queued global renames or if you need further debugging before running them
[16:17:50] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:18:11] <Reedy>	 hauskatze: I think you're ok for the moment...
[16:18:30] <hauskatze>	 Reedy: I meant https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress
[16:20:24] <Reedy>	 hauskatze: You mean, can I unstick them or something? (run the maintenance script)
[16:20:46] <hauskatze>	 Reedy: yup. But maybe we can wait
[16:20:55] <Reedy>	 I think it's fine tbh
[16:21:00] <hauskatze>	 I mean, there's no urgency
[16:23:29] <onimisionipe>	 !log pooling elastic1029 - T214283
[16:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:35] <stashbot>	 T214283: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283
[16:23:38] <Reedy>	 How do we deduce logwiki?
[16:23:43] <Reedy>	 just metawiki?
[16:24:01] <hauskatze>	 Reedy: yes, it's where the global rename started; which is always metawiki on production wikis
[16:24:30] <Reedy>	 we almost need a useful output of https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress to dump the commands
[16:24:50] <hauskatze>	 we can try to do the first one queued in that list and see how it goes
[16:25:03] <Reedy>	 I just did
[16:25:48] <hauskatze>	 argh
[16:25:56] <hauskatze>	 it keeps messing up CentralAuth data: https://meta.wikimedia.org/wiki/Special:CentralAuth/Renamed_user_4fjkqyebdp5nlxbnzvc2n9lltjc5gv1
[16:26:05] <Reedy>	 it?
[16:26:35] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[16:26:36] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:56] <hauskatze>	 Reedy: the script
[16:27:21] <Reedy>	 so you don't want it running?
[16:27:22] <hauskatze>	 tgr or legoktm explained to me why but I can't remember
[16:27:35] <hauskatze>	 Reedy: maybe we can wait
[16:27:39] <Reedy>	 lol
[16:28:00] <XioNoX>	 !log redirect ns1 to authdns1001
[16:28:01] <legoktm>	 hmm
[16:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: specify more certificates [puppet] - 10https://gerrit.wikimedia.org/r/517896 (https://phabricator.wikimedia.org/T226098)
[16:33:02] <icinga-wm>	 PROBLEM - HHVM rendering on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:33:09] <legoktm>	 hauskatze: there's a fixme to, well, fix that, but right now it's not going to work, the data is effectively lost
[16:33:13] <legoktm>	 https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CentralAuth/+/master/maintenance/fixStuckGlobalRename.php#85
[16:33:32] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:34:16] <wikibugs>	 10Operations, 10netops: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10ayounsi) Interesting! This would be useful before doing maintenance on a whole router. I opened an issue upstream asking for a per AS option, see https://github.com/mwiget/bgp_graceful_shutdown/issues/1  My su...
[16:34:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 77553 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:34:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/17035/toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/517896 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez)
[16:34:43] <moritzm>	 !log rebooting authdns2001 for kernel security update
[16:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:58] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:36:02] <hauskatze>	 legoktm: thanks :) So I guess the data will be lost for every rename in the GlobalRenameProgress page right? Unless there's a way to tell the jobqueue to pick those renames w/o using the script...
[16:37:02] <legoktm>	 https://phabricator.wikimedia.org/T226109#5268921 says "So it looks like the jobs were not even being enqueued and are now lost."
[16:37:37] <XioNoX>	 !log rollback redirect ns1 to authdns1001
[16:37:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:11] <hauskatze>	 ouch :(
[16:38:24] <wikibugs>	 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) I thought I had a ticket for this open with HPE but it doesn't look that way.  I will take care of it ASAP
[16:39:30] <XioNoX>	 !log redirect ns0 to authdns2001
[16:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:18] <wikibugs>	 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Cmjohnson) Dell is sending me a new Raid card, cables and backplane.  Sorry, it took so long, I had to call them after they denied my second request.
[16:40:48] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[16:41:23] <legoktm>	 hauskatze: so yeah, if Reedy or someone else wants to run the script that would be appreciated
[16:41:27] <legoktm>	 I'm not in a good place to do it right now
[16:41:38] <Reedy>	 I already parsed the list to make a list of copy paste commands
[16:42:11] <hauskatze>	 legoktm: okay. Since I can't do it, if Reedy wants to continue it'd be good, otherwise I'll create a task
[16:42:46] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[16:42:47] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:56] <Reedy>	 hauskatze: Done
[16:44:08] <Reedy>	 https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress 8 still there atm
[16:44:11] <Reedy>	 5
[16:44:22] <Reedy>	 4
[16:44:27] <Reedy>	 3
[16:44:42] <Reedy>	 2
[16:44:58] <wikibugs>	 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) Great news! Thanks a lot!!
[16:45:06] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:45:07] <Reedy>	 1
[16:45:09] <moritzm>	 !log rebooting authdns1001 for kernel security update
[16:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:01] <hauskatze>	 and boom
[16:46:06] <hauskatze>	 thanks Sam :)
[16:47:08] <wikibugs>	 (03PS8) 10Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104
[16:49:14] <wikibugs>	 (03PS9) 10Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104
[16:49:48] <wikibugs>	 (03CR) 10Bstorm: "There!  Got rid of the submodule merge conflict so this can actually rebase/deploy" [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm)
[16:50:31] <XioNoX>	 !log rollback redirect ns0 to authdns2001
[16:50:34] <moritzm>	 !log running racreset on multatuli
[16:50:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:57] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm)
[16:51:56] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: add etcd user to the puppet group [puppet] - 10https://gerrit.wikimedia.org/r/517905 (https://phabricator.wikimedia.org/T226098)
[16:52:04] <wikibugs>	 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10CDanis) @Reedy manually ran the global renames that were never queued properly.
[16:52:47] * legoktm hugs Reedy
[16:53:59] <hauskatze>	 oh cdanis also has a cat <3
[16:54:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Cmjohnson) a:05Cmjohnson→03RobH I updated the switch config to private1-d.....both servers are currently off and ready for installs. assigning to @robh to install
[16:54:14] <cdanis>	 I have two, but only one of them will sit in my lap :(
[16:54:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Cmjohnson)
[16:56:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: add etcd user to the puppet group [puppet] - 10https://gerrit.wikimedia.org/r/517905 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez)
[16:56:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson)
[16:57:22] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) @Bstorm @ayounsi  I will need very clear instructions on which racks/rows these servers can go in before I physically rack and cab...
[16:57:38] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: use complete fqdn in node name [puppet] - 10https://gerrit.wikimedia.org/r/517906 (https://phabricator.wikimedia.org/T226098)
[16:58:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: use complete fqdn in node name [puppet] - 10https://gerrit.wikimedia.org/r/517906 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez)
[17:03:02] <wikibugs>	 (03PS2) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527)
[17:03:19] <wikibugs>	 (03CR) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff)
[17:07:45] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "Where I'd be most concerned was labstores, and they come up as a noop in the compiler.  Seems ok." [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond)
[17:58:46] <wikibugs>	 (03PS3) 10Bstorm: Add new terms normalized schema tables as public 1:1 views in labs. [puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: 10Alaa Sarhan)
[18:04:35] <Elitre>	 cdanis, Reedy or anyone else, is it now safe to send MassMessages again? (re: https://phabricator.wikimedia.org/T226109 )
[18:04:49] <cdanis>	 AIUI it should be, Elitre
[18:05:23] <Elitre>	 alright, will test and then try again then. fingers crossed. thanks for working on this ^_^
[18:09:59] <legoktm>	 !log added MatmaRex to extension-VisualEditor-staff Gerrit group
[18:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:04] <MatmaRex>	 :o
[18:31:47] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[18:33:03] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[18:33:48] <wikibugs>	 (03PS1) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267)
[18:34:10] <wikibugs>	 (03PS2) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267)
[18:34:25] <wikibugs>	 (03PS3) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267)
[18:34:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata)
[18:39:35] <cdanis>	 Elitre: seems like it worked this time?
[18:41:43] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[18:45:20] <wikibugs>	 (03PS4) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267)
[18:45:59] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:46:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata)
[19:04:36] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Ensure no lossy WTE→VE switching in public wikis (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516567
[19:04:38] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Centralize enwiki's VisualEditor feedback page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517924 (https://phabricator.wikimedia.org/T224851)
[19:06:39] <wikibugs>	 (03PS5) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267)
[19:23:11] <wikibugs>	 (03PS1) 10Krinkle: peopleweb: Remove php module from httpd [puppet] - 10https://gerrit.wikimedia.org/r/517926
[19:23:12] <Krinkle>	 hashar: Reedy: ^
[19:25:04] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata)
[19:27:28] <revi>	 When I tried to do `git fetch` on ssh://vcs@git-ssh.wikimedia.org/source/tool-wmopbot.git I got "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!", where do I see the fingerprints? Should I just "meh"?
[19:30:03] <revi>	 nvm, found https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/git-ssh.wikimedia.org
[19:30:47] <legoktm>	 revi: https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fgit-ssh.wikimedia.org&type=revision&diff=1827965&oldid=1768511
[19:31:19] <revi>	 gotcha
[19:34:23] <icinga-wm>	 PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:35:37] <icinga-wm>	 RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 77601 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:41:01] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[19:45:21] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:57:39] <wikibugs>	 (CR) Hashar: [C: +1] "There is indeed now PHP support in user home dirs due to:" [puppet] - https://gerrit.wikimedia.org/r/517926 (owner: Krinkle)
[20:01:32] <wikibugs>	 (PS2) Effie Mouzeli: haproxy: Disable global logging to syslog [puppet] - https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284)
[20:02:21] <wikibugs>	 (CR) Effie Mouzeli: haproxy: Disable global logging to syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) (owner: Effie Mouzeli)
[20:17:51] <wikibugs>	 (PS3) Alaa Sarhan: Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086)
[20:18:31] <wikibugs>	 (PS4) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051)
[20:23:16] <wikibugs>	 (PS4) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051)
[20:26:13] <icinga-wm>	 PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:26:43] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:27:21] <icinga-wm>	 RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 77539 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:27:59] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 6.722 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:32:41] <wikibugs>	 (PS1) Awight: Configuration migration for Translate [mediawiki-config] - https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985)
[20:40:59] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[20:45:17] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:12:17] <wikibugs>	 Operations, Performance-Team, Traffic, Performance: Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Aklapper) Stalled→Open p:Triage→High
[21:34:29] <wikibugs>	 (PS4) Bstorm: Add new terms normalized schema tables as public 1:1 views in labs. [puppet] - https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: Alaa Sarhan)
[21:35:23] <wikibugs>	 (CR) Bstorm: [C: +2] Add new terms normalized schema tables as public 1:1 views in labs. [puppet] - https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: Alaa Sarhan)
[21:37:55] <wikibugs>	 (PS1) Alaa Sarhan: Switch property terms migration stage to WRITE_BOTH on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517948 (https://phabricator.wikimedia.org/T226129)
[21:41:46] <wikibugs>	 (Abandoned) Alaa Sarhan: Switch property terms migration stage to WRITE_BOTH on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517948 (https://phabricator.wikimedia.org/T226129) (owner: Alaa Sarhan)
[22:04:03] <wikibugs>	 (CR) Alaa Sarhan: [C: +1] Set EntityUsageTable addUsage batch size to 150 [mediawiki-config] - https://gerrit.wikimedia.org/r/517669 (https://phabricator.wikimedia.org/T225500) (owner: Ladsgroup)
[22:11:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[22:15:36] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:41:36] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[22:45:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:47:12] <icinga-wm>	 PROBLEM - Check systemd state on es2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:41:44] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[23:46:00] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:55:41] <wikibugs>	 Operations, Performance-Team, Traffic, Performance: Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (PM3) The "some" can be removed from the caption. I am experiencing this problem since Tuesday (dew...