[00:00:05] Deploy window No Deploys - WMF US Staff holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190619T0000)
[01:50:18] Operations, Wikimedia Australia, Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (StevenCrossin) Hi, @Dzahn can you help with this by any chance?
[04:09:43] (Abandoned) ArielGlenn: testing paging settings for labstore1006 [puppet] - https://gerrit.wikimedia.org/r/517693 (owner: ArielGlenn)
[04:13:15] (PS3) ArielGlenn: phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804)
[04:14:19] (CR) ArielGlenn: [C: +2] phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804) (owner: ArielGlenn)
[04:16:19] (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852)
[04:17:10] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:18:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:18:19] Operations, Mail, Phabricator, Patch-For-Review, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (ArielGlenn) I'll leave this ticket open until we see that the next month's report has shown up.
[04:18:22] (CR) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/517787 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:19:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 T224852 (duration: 00m 57s)
[04:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:58] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[04:20:01] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[04:20:02] !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[04:20:02] !log kartik@deploy1001 scap-helm cxserver finished
[04:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:07] (PS4) Marostegui: db-eqiad.php: Set s4 in read only [mediawiki-config] - https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852)
[04:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:34] !log depooling maps1002 for reimaging into new partition scheme - T224395
[04:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:39] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[04:23:39] (PS4) Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852)
[04:24:07] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[04:24:09] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[04:24:09] !log kartik@deploy1001 scap-helm cxserver finished
[04:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:35] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[04:25:36] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[04:25:36] !log kartik@deploy1001 scap-helm cxserver finished
[04:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:46] (PS1) Bmansurov: Labs: enable surveys for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/517789 (https://phabricator.wikimedia.org/T225819)
[04:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:28] (CR) jerkins-bot: [V: -1] Labs: enable surveys for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/517789 (https://phabricator.wikimedia.org/T225819) (owner: Bmansurov)
[04:28:19] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v1/page/{language}/{title}{/revision} (Fetch enwiki protected pa
[04:28:19] Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Translate enwiki protected page) is CRITICAL: Test Translate enwiki protected page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[04:28:22] !log Starting pre-steps for the s4 failover that will happen at 05:00 UTC - T224852
[04:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:27] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[04:28:34] kart_: Is that your deployment ^
[04:30:29] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v1/page/{language}/{title}{/revision} (Fetch enwiki protected pa
[04:30:29] Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Translate enwiki protected page) is CRITICAL: Test Translate enwiki protected page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[04:31:03] kart_: ^
[04:31:50] marostegui: looking..
[04:32:09] kart_: thanks - also, we have a maintenance window requested in 30 minutes to failover s4 master
[04:32:13] (requires read only)
[04:32:48] marostegui: oh. right.
[04:34:51] is that expected, or should it be reverted?
[04:35:13] jynus: if more is happening, I can revert. Should not be..
[04:36:18] waiting for some minutes.
[04:36:33] https://grafana.wikimedia.org/d/F7rttgqmz/cxserver?refresh=1m&panelId=15&fullscreen&orgId=1&from=1560915388343&to=1560918988343&var-dc=eqiad%20prometheus%2Fk8s&var-service=cxserver
[04:37:09] OK. Seems broken.
[04:37:16] Reverting..
[04:39:57] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[04:39:58] !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[04:39:58] !log kartik@deploy1001 scap-helm cxserver finished
[04:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:16] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[04:40:17] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[04:40:17] !log kartik@deploy1001 scap-helm cxserver finished
[04:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:38] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[04:40:39] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[04:40:39] !log kartik@deploy1001 scap-helm cxserver finished
[04:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:07] (CR) Marostegui: mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:41:14] (PS4) Marostegui: mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852)
[04:41:27] (CR) Marostegui: db-eqiad.php: Set s4 in read only [mediawiki-config] - https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:41:27] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:41:36] (CR) Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:41:49] (CR) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:41:53] (PS3) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852)
[04:42:01] marostegui: jynus reverted.
[04:42:05] thanks
[04:42:11] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:47:12] kart_: "MT processing error for: en > qqq. Error: invalid distance too far back at Zlib.zlibOnError [as onerror]"
[04:47:39] (not ongoing anymore)
[04:47:46] jynus: hit by https://github.com/nodejs/node/issues/22839 it seems.
[04:47:54] (CR) Marostegui: [C: +2] mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:48:42] jynus: zlib need update on host in this case, but let's see how we can do it with docker.
[04:49:16] kart_: sorry, we are moving with our own deployment atm
[04:50:12] jynus: yeah. Nothing need to be done right now or no emergency.
[04:51:45] We are going to take over puppet and mediawiki deployment for the s4 failover, if you need to deploy please coordinate with us. I will communicate once it is all done and deployments can happen normally again
[04:52:01] marostegui: I can see the new topology, no errors on logs
[04:52:07] yep :)
[04:53:37] (CR) Marostegui: [C: +2] db-eqiad.php: Set s4 in read only [mediawiki-config] - https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:53:48] ^ will not merge on deploy1001 yet
[04:53:56] just +2 so I can create the revert and all that
[04:54:25] (Merged) jenkins-bot: db-eqiad.php: Set s4 in read only [mediawiki-config] - https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:54:58] ok
[04:54:59] (PS1) Marostegui: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/517790
[04:56:21] (CR) jenkins-bot: db-eqiad.php: Set s4 in read only [mediawiki-config] - https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:57:07] (PS5) Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852)
[04:58:01] (CR) Marostegui: [C: +2] db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:58:36] I see the banner
[04:58:48] (Merged) jenkins-bot: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:58:59] (PS2) Marostegui: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/517790
[04:59:02] (CR) jenkins-bot: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:59:05] jynus: 1 minute to go \o/
[04:59:24] see it also on watchlist
[04:59:47] you lead? I check?
[04:59:50] yep
[05:00:04] marostegui and jynus: #bothumor My software never has bugs. It just develops random features. Rise for s4 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190619T0500).
[05:00:05] jynus: ready?
[05:00:10] yes
[05:00:14] let's go then
[05:00:16] !log Starting s4 failover from db1068 to db1081 - T224852
[05:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:21] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:01:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s4 on read-only T224852 (duration: 00m 34s)
[05:01:03] we are on RO
[05:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:17] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/517790 (owner: Marostegui)
[05:01:38] confirmed
[05:01:44] failover done
[05:02:00] replication looks good
[05:02:03] confirmed with tendril
[05:02:04] (Merged) jenkins-bot: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/517790 (owner: Marostegui)
[05:02:18] (CR) jenkins-bot: Revert "db-eqiad.php: Set s4 in read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/517790 (owner: Marostegui)
[05:02:24] don't see errors so far
[05:02:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover s4 master eqiad from db1068 to db1081 T224852 (duration: 00m 33s)
[05:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:37] removing read only
[05:03:05] still no errors on kibana
[05:03:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove s4 ready only T224852 (duration: 00m 33s)
[05:03:20] we are no longer in RO
[05:03:23] checking
[05:03:25] I can see that on msg
[05:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:54] I can edit
[05:03:58] me too
[05:04:16] I can see rcs
[05:04:31] the issues here is potential load/replication issues
[05:04:42] (those could arise later on)
[05:05:05] yeah
[05:05:07] so far so good
[05:05:16] but monitoring looking good so far
[05:05:23] strange to see no errors?
[05:05:30] (CR) Marostegui: [C: +2] wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:05:37] it was fast
[05:05:53] yeah, but jobque yada yada
[05:05:56] yeah
[05:06:02] plus the whole opcache thing
[05:06:05] Maybe it is fixed for good? :)
[05:06:36] https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php&2 looking good
[05:06:58] no traffic on old master
[05:07:19] traffic on new one
[05:07:23] yeah
[05:07:27] it is just too good to be true
[05:07:27] so far eveything looks good
[05:07:45] haha I know
[05:07:48] I have the same feeling
[05:07:52] I am double checking all the steps
[05:08:06] also no gtid chonology protector complains?
[05:08:18] those normally come a bit later
[05:08:42] no wikibase ones? although I am not sure how much it is being used for commons propotionally
[05:10:07] lots of [{exception_id}] {exception_url} ErrorException from line 125 of /srv/mediawiki/php-1.34.0-wmf.10/includes/api/ApiQueryQueryPage.php: PHP Notice: Undefined property: stdClass::$value
[05:10:29] but on frwikisource
[05:10:40] yeah, those have been there for days
[05:10:51] I am going to "release" the repos "locks"
[05:11:21] Failover was done, mediawiki and puppet deployments can happen as usual
[05:11:42] ok, now I can see some exceptions
[05:11:48] just did a wikibase edit on Commons for ya
[05:11:53] very low, but the expected number
[05:11:59] JJMC89[m]: did it work?
[05:12:04] JJMC89[m]: worked?
[05:12:10] Yes
[05:12:14] great!
[05:12:15] thanks
[05:12:15] \o/
[05:12:43] so 83 exceptions
[05:12:56] [{exception_id}] {exception_url} Wikimedia\Rdbms\DBTransactionError from line 268 of /srv/mediawiki/php-1.34.0-wmf.10/includes/libs/rdbms/lbfactory/LBFactory.php: MediaWiki::restInPeace: transaction round 'LinksUpdate::doUpdate' still running.
[05:13:31] linksupdate would be a minor issue
[05:13:37] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:13:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:08] so everything seems fine
[05:14:14] but lets keep monitoring
[05:14:18] yeah
[05:14:26] I am doing the rest of tasks and keeping an eye too
[05:14:50] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) This happened successfully. Read only times (UTC): Start: 05:01:02 Stop: 05:03:20 Total read only time: 2:18 minutes
[05:15:07] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui)
[05:19:05] I am still suspicious, because there is normally a non-trivial amoung of background renderinf onon files and videos on commons
[05:19:13] *rendering on
[05:19:28] i am checking logstash as much as I can and so far so good
[05:19:38] I know
[05:21:08] I don't see many mass upload during read only: https://commons.wikimedia.org/wiki/Special:NewFiles
[05:21:41] yeah, maybe the long heads up we gave allowed power users to plan for their massive uploads
[05:23:22] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) Open→Resolved So far everything looks good, so closing this.
[05:27:26] :D
[05:27:50] (PS1) Marostegui: db1068: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/517792 (https://phabricator.wikimedia.org/T217396)
[05:29:55] (CR) Marostegui: [C: +2] db1068: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/517792 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[05:32:20] (PS1) Marostegui: db-eqiad.php: db1138 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/517793
[05:33:19] (CR) Marostegui: [C: +2] db-eqiad.php: db1138 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/517793 (owner: Marostegui)
[05:34:08] (Merged) jenkins-bot: db-eqiad.php: db1138 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/517793 (owner: Marostegui)
[05:34:22] (CR) jenkins-bot: db-eqiad.php: db1138 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/517793 (owner: Marostegui)
[05:35:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1138 status (duration: 00m 55s)
[05:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:43] (PS1) Jcrespo: WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794
[05:36:45] (PS1) Jcrespo: CuminExecution: Update namespace so it works without being deployed [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517795
[05:36:47] (CR) jerkins-bot: [V: -1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517794 (owner: Jcrespo)
[05:36:49] (CR) jerkins-bot: [V: -1] CuminExecution: Update namespace so it works without being deployed [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/517795 (owner: Jcrespo)
[05:37:21] !log Upgrade db1068 (old s4 master) to 10.1.39
[05:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:24] (PS2) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[05:46:03] (PS1) Marostegui: db-eqiad.php: Depool db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682)
[05:47:50] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:48:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:48:53] (CR) jenkins-bot: db-eqiad.php: Depool db1135 [mediawiki-config] - https://gerrit.wikimedia.org/r/517797 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:50:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1135 T222682 (duration: 00m 56s)
[05:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:05] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[05:51:17] (CR) ArielGlenn: "Note this is still very much a WIP and likely will eat all of your wikidata dumps for lunch." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[05:58:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:01:04] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:06:12] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:06:32] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:34] (PS1) Marostegui: db1112: Move to s3 [puppet] - https://gerrit.wikimedia.org/r/517799 (https://phabricator.wikimedia.org/T225981)
[06:35:26] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/16996/" [puppet] - https://gerrit.wikimedia.org/r/517799 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui)
[06:35:42] (CR) Marostegui: [C: +2] db1112: Move to s3 [puppet] - https://gerrit.wikimedia.org/r/517799 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui)
[06:41:09] (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981)
[06:42:45] Operations, Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (Nikerabbit) @Aklapper Do you mean the "You have been unsubscribed" email or one of the bounces? I don't see how the former would help, but I can share th...
[06:46:31] (PS9) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[06:46:42] (CR) Mathew.onipe: Add maps reboot cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[06:47:47] (PS2) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981)
[06:50:24] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui)
[06:51:20] (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui)
[06:51:35] (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517801 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui)
[06:53:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 T225981 (duration: 01m 06s)
[06:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:24] T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981
[06:53:32] Operations, Annual-Report, serviceops: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (jijiki) p:Triage→High
[06:53:56] Operations, Wikimedia Australia, Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (jijiki) p:Triage→Normal
[06:57:35] !log Stop MySQL on db1077 to transfer its data to db1112 - T225981
[06:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:16] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry) We recently tried to upgrade to nodejs10 for cxserver but it seems zlib 1.2.11 is required. Example error: `...
[07:01:59] !log jnt push to ulsfo, remove old protect-old-lvs-servers term + update syslog target T224128
[07:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:04] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[07:02:22] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (MoritzMuehlenhoff) Are you using component/node10? This should be fixed already, see https://phabricator.wikimedia.org/T215...
[07:06:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:40] (CR) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[07:08:53] (PS3) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[07:09:15] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:09:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:10:57] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 768.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:11:26] ^ expected
[07:12:05] !log s3 will be lagging on labsdb hosts due to maintenance on db1077 - T225981
[07:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:11] T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981
[07:13:22] !log jnt push to eqsin, remove old protect-old-lvs-servers term + update syslog target T224128
[07:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:27] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[07:17:24] !log jnt push to eqord, remove old protect-old-lvs-servers term + update syslog target T224128
[07:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:30] !log jnt push to eqdfw, remove old protect-old-lvs-servers term + update syslog target T224128
[07:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:35] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128
[07:19:39] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:51] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Joe) >>! In T210704#5267695, @MoritzMuehlenhoff wrote: > Are you using component/node10? This should be fixed already, see...
[07:25:52] (PS1) Alexandros Kosiaris: citoid, mathoid, termbox: Switch GC metric to microseconds [deployment-charts] - https://gerrit.wikimedia.org/r/517803 (https://phabricator.wikimedia.org/T220709)
[07:28:08] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Joe) >>! In T210704#5267739, @Joe wrote: >>>! In T210704#5267695, @MoritzMuehlenhoff wrote: >> Are you using component/node...
[07:28:35] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:30:16] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] citoid, mathoid, termbox: Switch GC metric to microseconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/517803 (https://phabricator.wikimedia.org/T220709) (owner: 10Alexandros Kosiaris) [07:32:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [07:34:42] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [07:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:15] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [07:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:29] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:37:31] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:38:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:38:49] (03PS1) 10Alexandros Kosiaris: Add forgotten citoid, mathoid, termbox helm packages [deployment-charts] - 10https://gerrit.wikimedia.org/r/517804 [07:39:01] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add forgotten citoid, mathoid, termbox helm packages [deployment-charts] - 10https://gerrit.wikimedia.org/r/517804 (owner: 10Alexandros Kosiaris) [07:39:10] 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): 
Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry) >>! In T210704#5267750, @Joe wrote: > To correct myself: we already use that component. I'm nonetheless creat... [07:43:05] (PS1) Matthias Mullie: [SDC] Enable other statements on betacommons [mediawiki-config] - https://gerrit.wikimedia.org/r/517805 [07:46:13] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [07:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:08] (CR) ArielGlenn: [C: +1] "The bits all look right to me. I'd like someone else familiar with docroot setups to look at it though." [mediawiki-config] - https://gerrit.wikimedia.org/r/516055 (https://phabricator.wikimedia.org/T223835) (owner: Gergő Tisza) [07:48:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:52] (PS1) Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) [07:50:18] Internal error: [XQnonApAMF0AAH55EVgAAACT] 2019-06-19 07:47:40: Fatal exception of type "BadMethodCallException" on Commons [07:50:26] is this known?
[07:51:17] it seems the issue was temporary [07:51:25] We did a failover on commons at 05:00 UTC [07:51:25] !log installing vim security updates on stretch [07:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:31] (PS2) Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) [07:52:14] yannf: looks like there is just one error, so probably just temporary indeed [07:53:00] yannf: from what i can see it has happened in the last 24h a few times [07:53:01] Operations, Performance-Team, Traffic, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (ema) >>! In T225998#5264757, @Gilles wrote: > loadEventEnd seems to have regressed around the time the change was deployed I'm gonn... [07:53:44] I am filtering for BadMethodCallException on commons [07:56:53] !log rearmed keyholder on acmechief-test2001 [07:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:06] sigh..
I forgot that one :_( [07:57:09] thx moritzm [07:57:55] I broke it, I fix it :-) [07:59:55] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging] [07:59:56] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed [07:59:56] !log akosiaris@deploy1001 scap-helm citoid finished [07:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:37] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-codfw-values.yaml production stable/citoid [namespace: citoid, clusters: codfw] [08:00:39] !log akosiaris@deploy1001 scap-helm citoid cluster codfw completed [08:00:39] !log akosiaris@deploy1001 scap-helm citoid finished [08:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:26] !log cache nodes: resume rolling reboots for kernel and varnish upgrades T224694 [08:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:31] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [08:02:03] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [08:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:10] (CR) Alexandros Kosiaris: [C: -1] haproxy: haproxy.cfg.erb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) (owner: Effie Mouzeli) [08:05:28] (PS3) Mathew.onipe: icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992
(https://phabricator.wikimedia.org/T224073) [08:05:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:05:53] (CR) Mathew.onipe: icinga: cirrus masters eligible check (4 comments) [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: Mathew.onipe) [08:05:57] (CR) jerkins-bot: [V: -1] icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: Mathew.onipe) [08:06:47] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] [08:07:10] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [08:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:27] hello icinga-wm!
[08:08:23] (PS4) Mathew.onipe: icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) [08:08:47] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [08:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:01] (CR) Jcrespo: [C: +1] db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui) [08:11:51] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:12:23] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:13:05] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-eqiad-values.yaml production stable/citoid [namespace: citoid, clusters: eqiad] [08:13:06] !log akosiaris@deploy1001 scap-helm citoid cluster eqiad completed [08:13:06] !log akosiaris@deploy1001 scap-helm citoid finished [08:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:31] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [08:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:37] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [08:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:44] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [08:13:44] !log
akosiaris@deploy1001 scap-helm mathoid finished [08:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:23] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [08:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:07] * akosiaris looking into cr1-eqiad [08:18:27] Operations, observability, Performance-Team (Radar), User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey) Created https://github.com/Dev25/mcrouter_exporter/pull/10 [08:18:49] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [08:18:51] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [08:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:53] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [08:18:54] !log akosiaris@deploy1001 scap-helm mathoid finished [08:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:01] (PS10) Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) [08:20:13] !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-values.yaml production stable/termbox [namespace: termbox, clusters: eqiad,codfw] [08:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:24] !log akosiaris@deploy1001 scap-helm termbox cluster eqiad completed [08:20:27] !log akosiaris@deploy1001 scap-helm termbox
cluster codfw completed [08:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:27] !log akosiaris@deploy1001 scap-helm termbox finished [08:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:42] !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-staging-values.yaml staging stable/termbox [namespace: termbox, clusters: staging] [08:20:42] !log akosiaris@deploy1001 scap-helm termbox cluster staging completed [08:20:43] !log akosiaris@deploy1001 scap-helm termbox finished [08:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:19] !log upgrade citoid, mathoid, termbox to latest chart releases to address the GC metric naming issue T220709 T222795 [08:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:25] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [08:21:25] T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 [08:22:02] (CR) Elukey: [C: +2] Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey) [08:24:16] !log installing new kernels with SACK fix on jessie servers [08:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:28] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Aklapper) Would it be possible to run that script once manually within this month, to get
the stats for May 2019? Otherwise we'll neve... [08:28:48] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [08:28:50] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Marostegui) I just ran it, but I think it gave the delta between today and 1 month ago as most of the queries are: ` SELECT COUNT(DIS... [08:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:53] Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['maps1002.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906190828_gehel_2... [08:30:44] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Marostegui) So if we want the ones from may we need to modify all the queries on that script to make them to pick the right range, not... [08:33:28] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (ArielGlenn) I could probably hack the script to let one optionally specify start date and end date, is it worth it though?
[08:33:57] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:34:23] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [08:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:47] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Aklapper) No need to change any code, the [script](https://phabricator.wikimedia.org/source/operations-puppet/browse/production/module... [08:35:46] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Aklapper) Argh. I should get my first coffee before posting I guess, because `date_format(NOW())`, indeed. Sorry. [08:36:45] !log cordon kubernetes2001 to investigate some IP out discard statistics [08:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:57] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [08:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:21] PROBLEM - DPKG on restbase-dev1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:37:38] Operations, Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (Aklapper) Ah, true, indeed that won't help because there's no specific reason provided. I have not received any `Bounce action notification` messages in...
[08:38:21] (CR) Cparle: [C: +1] [SDC] Enable other statements on betacommons [mediawiki-config] - https://gerrit.wikimedia.org/r/517805 (owner: Matthias Mullie) [08:42:37] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Marostegui) Regardless of the query...the email hasn't arrived yet and the script didn't show any errors. So probably some debugging i... [08:42:42] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes2001.* [08:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:00] !log depool kubernetes2001 from all services to investigate some IP out discard statistics [08:43:07] (PS5) Mathew.onipe: icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) [08:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:20] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [08:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:21] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:51:51] !log jnt push to codfw, remove old protect-old-lvs-servers term + update syslog target T224128 [08:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:55] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 [08:54:17] !log uncordon kubernetes2001, reschedule some pods on it.
Investigating out discards still [08:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:34] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes2002.* [08:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:42] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes2003.* [08:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:57] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [08:56:59] !log depool kubernetes200{2,3} for the same out discards investigation [08:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:13] (CR) Hashar: "Applied and cleaned up the instances" [puppet] - https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) (owner: Hashar) [08:59:08] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (RhinosF1) The email has arrived to us for may now @Marostegui [09:00:36] (CR) Hashar: [C: +1] "Applied / instances purged." [puppet] - https://gerrit.wikimedia.org/r/517091 (https://phabricator.wikimedia.org/T225735) (owner: Hashar) [09:00:38] (CR) Hashar: [C: +1] "Applied / instances purged."
[puppet] - https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) (owner: Hashar) [09:01:04] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Marostegui) Yeah, it got some delay: ` Wed, Jun 19, 2019 at 10:27 AM (Delivered after 1752 seconds) ` [09:03:20] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:58] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:06:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2003.* [09:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:08] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2002.* [09:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:31] !log repool kubernetes2002, kubernetes2003. Point proven, chasing down load [09:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:36] !log repool kubernetes2002, kubernetes2003.
Point proven, chasing down lead [09:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:48] https://commons.wikimedia.org/wiki/File:Heidi_Klum_with_Liza_Minnelli_at_The_Heart_Truth_Fashion_Show_2008.jpg [09:07:06] the description was moved, but not the file :/ [09:07:38] 2nd time I got this today [09:07:45] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (Marostegui) Open→Resolved a:ArielGlenn Looks fixed then [09:08:15] (PS1) Ema: varnishfetcherror: do not overwrite multiple tags [puppet] - https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994) [09:09:50] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:02] RECOVERY - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:11:07] (PS6) Mathew.onipe: icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) [09:11:38] (CR) jerkins-bot: [V: -1] icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: Mathew.onipe) [09:12:54] (PS1) Elukey: role::analytics_test_cluster::coordinator: deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/517812 (https://phabricator.wikimedia.org/T212257) [09:13:05] (PS7) Mathew.onipe: icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) [09:13:48] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2001.* [09:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:15] (CR) Elukey: [C: +2] role::analytics_test_cluster::coordinator: deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/517812 (https://phabricator.wikimedia.org/T212257) (owner: Elukey) [09:14:51] !log Start MySQL on db1077 - s3 labsdb lag should start catching up T225981 [09:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:57] T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 [09:17:06] Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1002.eqiad.wmnet'] ` and were **ALL** successful.
[09:19:59] !log jnt push to esams, remove old protect-old-lvs-servers term + update syslog target T224128 [09:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:05] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 [09:20:54] (PS1) Marostegui: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) [09:21:09] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:22:18] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui) [09:23:13] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui) [09:23:32] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517813 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui) [09:24:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1077 T225981 (duration: 01m 00s) [09:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:00] T225981: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 [09:25:18] (PS1) Elukey: profile::kerberos::keytabs: ensure the keytab's parent dir [puppet] - https://gerrit.wikimedia.org/r/517814 (https://phabricator.wikimedia.org/T212257) [09:25:58] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:55] (PS1) Marostegui: db-eqiad.php: More traffic to db1077
[mediawiki-config] - https://gerrit.wikimedia.org/r/517815 [09:29:09] (CR) Muehlenhoff: [C: +1] profile::kerberos::keytabs: ensure the keytab's parent dir [puppet] - https://gerrit.wikimedia.org/r/517814 (https://phabricator.wikimedia.org/T212257) (owner: Elukey) [09:29:50] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:39] (CR) Elukey: [C: +2] profile::kerberos::keytabs: ensure the keytab's parent dir [puppet] - https://gerrit.wikimedia.org/r/517814 (https://phabricator.wikimedia.org/T212257) (owner: Elukey) [09:31:31] (PS2) Ema: varnishfetcherror: do not overwrite multiple tags [puppet] - https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994) [09:32:07] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517815 (owner: Marostegui) [09:32:54] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517815 (owner: Marostegui) [09:33:08] (CR) jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517815 (owner: Marostegui) [09:34:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 (duration: 00m 55s) [09:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:22] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:53] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:10] (PS1) Marostegui: db-eqiad.php: More traffic to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517816 [09:42:19]
(PS2) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) [09:42:44] (PS1) Elukey: Deploy keytabs to the Analytics Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/517817 (https://phabricator.wikimedia.org/T212257) [09:43:47] (CR) Elukey: [C: +2] Deploy keytabs to the Analytics Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/517817 (https://phabricator.wikimedia.org/T212257) (owner: Elukey) [09:44:22] (PS3) Ema: varnishfetcherror: do not overwrite multiple tags [puppet] - https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994) [09:44:50] jouncebot: next [09:44:50] In 25 hour(s) and 15 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190620T1100) [09:44:56] :) [09:45:14] (CR) Ema: [C: +2] varnishfetcherror: do not overwrite multiple tags [puppet] - https://gerrit.wikimedia.org/r/517811 (https://phabricator.wikimedia.org/T224994) (owner: Ema) [09:46:17] (PS2) Muehlenhoff: rm old ssh public key for mholloway-shell [puppet] - https://gerrit.wikimedia.org/r/517475 (owner: Mholloway) [09:46:55] (PS10) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [09:47:21] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517816 (owner: Marostegui) [09:47:53] (CR) Muehlenhoff: [C: +2] rm old ssh public key for mholloway-shell [puppet] - https://gerrit.wikimedia.org/r/517475 (owner: Mholloway) [09:48:11] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517816 (owner: Marostegui) [09:48:26] (CR) jenkins-bot: db-eqiad.php: More traffic to db1077
[mediawiki-config] - https://gerrit.wikimedia.org/r/517816 (owner: Marostegui) [09:49:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 (duration: 00m 55s) [09:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:12] (PS1) Marostegui: db-eqiad.php: Fully repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) [09:53:40] (CR) Gehel: [C: +2] icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: Mathew.onipe) [09:53:52] (PS8) Gehel: icinga: cirrus masters eligible check [puppet] - https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: Mathew.onipe) [09:54:22] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:53] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:30] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui) [10:00:58] (PS1) Alaa Sarhan: Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) [10:01:00] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui) [10:01:37] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:09] !log marostegui@deploy1001
Synchronized wmf-config/db-eqiad.php: Fully repool db1077 (duration: 00m 55s) [10:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:44] (CR) jerkins-bot: [V: -1] Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) (owner: Alaa Sarhan) [10:03:14] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:41] !log cp3030: increase varnish-be thread_pool_max from 12000 (250 * 48) to 14400 (300 * 48) to observe impact on fetcherrors [10:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:50] (PS1) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225055) [10:06:34] (CR) jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/517818 (https://phabricator.wikimedia.org/T225981) (owner: Marostegui) [10:07:10] (PS2) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051) [10:12:16] (PS1) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on production wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517822 (https://phabricator.wikimedia.org/T225051) [10:14:39] (PS2) Alaa Sarhan: Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) [10:15:15] (PS3) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/517820
(https://phabricator.wikimedia.org/T225051) [10:15:25] (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517822 (https://phabricator.wikimedia.org/T225051) [10:16:16] (03PS8) 10Gilles: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) [10:16:32] (03CR) 10Gilles: "(just rebasing for now)" [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) [10:20:07] (03PS1) 10Elukey: profile::kerberos::keytabs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/517823 [10:20:52] (03CR) 10Elukey: [C: 03+2] profile::kerberos::keytabs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/517823 (owner: 10Elukey) [10:21:37] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:15] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:23:15] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (10Marostegui) [10:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:59] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Marostegui) This host is no longer a master and will be decommissioned in a few days [10:25:15] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) [10:27:00] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=99) [10:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:58] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 
(10MoritzMuehlenhoff) [10:28:30] (03CR) 10Matthias Mullie: [C: 03+2] [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie) [10:29:21] (03CR) 10Jbond: [C: 03+2] install - late_command: Ensure correct version of puppet/facter are installed [puppet] - 10https://gerrit.wikimedia.org/r/515087 (owner: 10Jbond) [10:29:21] !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-values.yaml production stable/termbox [namespace: termbox, clusters: eqiad] [10:29:22] !log akosiaris@deploy1001 scap-helm termbox cluster eqiad completed [10:29:22] !log akosiaris@deploy1001 scap-helm termbox finished [10:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:28] (03Merged) 10jenkins-bot: [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie) [10:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:30] (03PS2) 10Jbond: install - late_command: Ensure correct version of puppet/facter are installed [puppet] - 10https://gerrit.wikimedia.org/r/515087 [10:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:43] (03CR) 10jenkins-bot: [SDC] Enable other statements on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517805 (owner: 10Matthias Mullie) [10:30:11] !log installing glibc and ca-certificates-java updates from stretch point release [10:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:21] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:59] !log update late-install so it installs the correct puppet version https://gerrit.wikimedia.org/r/c/operations/puppet/+/515087 [10:31:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:07] (03CR) 10Jbond: [C: 03+2] firewall logging: Enable logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511704 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [10:32:23] (03PS2) 10Jbond: firewall logging: Enable logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511704 (https://phabricator.wikimedia.org/T116011) [10:33:34] !log akosiaris@deploy1001 scap-helm termbox upgrade -f termbox-staging-values.yaml staging stable/termbox [namespace: termbox, clusters: staging] [10:33:35] !log akosiaris@deploy1001 scap-helm termbox cluster staging completed [10:33:35] !log akosiaris@deploy1001 scap-helm termbox finished [10:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:14] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:35:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:16] !log rebooting mx2001 for kernel security update [10:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:46] (03Abandoned) 10Jbond: RAID: replace hpssacli with sscli [puppet] - 10https://gerrit.wikimedia.org/r/505760 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [10:38:32] !log ladsgroup@deploy1001 scap-helm termbox upgrade -f termbox-values.yaml production stable/termbox [namespace: termbox, clusters: codfw] [10:38:34] !log ladsgroup@deploy1001 scap-helm termbox cluster codfw completed [10:38:34] !log ladsgroup@deploy1001 scap-helm termbox finished [10:38:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:07] (03Abandoned) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 (owner: 10Jbond) [10:40:40] ACKNOWLEDGEMENT - DPKG on restbase-dev1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Muehlenhoff T224260 [10:40:41] (03Abandoned) 10Jbond: late_command: rollback puppet5 changes [puppet] - 10https://gerrit.wikimedia.org/r/514865 (owner: 10Jbond) [10:42:27] (03Abandoned) 10Jbond: puppet agent: mask service [puppet] - 10https://gerrit.wikimedia.org/r/515075 (owner: 10Jbond) [10:42:55] (03CR) 10Jbond: [C: 03+2] icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [10:43:05] (03PS10) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 [10:47:01] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:10] (03CR) 10Hoo man: [C: 03+1] Introduce config variables for new terms store in mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) (owner: 10Alaa Sarhan) [10:50:22] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:15] !log rebooting mx1001 for kernel security update [10:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:13] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:45] 10Operations, 
10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita) [10:55:14] (03CR) 10Hoo man: [C: 03+1] Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [10:59:11] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:59:13] (03PS1) 10Giuseppe Lavagetto: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 [10:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:05:16] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita) [11:07:13] !log cache nodes: pause rolling reboots for kernel and varnish upgrades T224694 T225998 [11:07:13] (03PS9) 10Gilles: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) [11:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:19] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [11:07:19] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [11:07:56] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10aborrero) [11:09:42] (03CR) 10Gilles: "@Ema the tests pass now, using run.py" [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) 
[11:12:58] (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051) [11:13:36] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Aklapper) Let's see... https://www.mediawiki.org/wiki/User:Dvpita states that the user is "QA Engineer (contract) , Language team (International)". I don't see Jpita listed on the public... [11:15:36] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita) how can I fix that? [11:19:15] (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [11:20:41] (03PS1) 10Jbond: firewall logging: enable ulog by default [puppet] - 10https://gerrit.wikimedia.org/r/517832 (https://phabricator.wikimedia.org/T116011) [11:20:43] (03PS1) 10Jbond: firewall logging: clean up old roll-out classes [puppet] - 10https://gerrit.wikimedia.org/r/517833 (https://phabricator.wikimedia.org/T116011) [11:20:45] (03PS1) 10Jbond: firewall logging: clean up old role out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) [11:20:48] (03PS1) 10Jbond: firewall logging: make profile::base::firewall::log a private class [puppet] - 10https://gerrit.wikimedia.org/r/517835 (https://phabricator.wikimedia.org/T116011) [11:22:54] (03PS2) 10Jbond: firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) [11:23:41] (03PS2) 10Jbond: firewall logging: enable ulog by default [puppet] - 10https://gerrit.wikimedia.org/r/517832 (https://phabricator.wikimedia.org/T116011) [11:29:03] 10Operations, 10SDC General, 10Wikidata: 
Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) p:05Triage→03Normal [11:29:49] 10Operations, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) [11:34:10] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) [11:44:49] (03CR) 10Jbond: [C: 03+2] firewall logging: enable ulog by default [puppet] - 10https://gerrit.wikimedia.org/r/517832 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [11:45:16] (03PS1) 10Jbond: Revert "firewall logging: enable ulog by default" [puppet] - 10https://gerrit.wikimedia.org/r/517838 [11:50:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:50:20] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:50:38] 10Operations, 10cloud-services-team (Kanban): etcd: listen-peer-urls only supports IP addresses and no FQDNs - https://phabricator.wikimedia.org/T226095 (10aborrero) [11:53:38] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10zeljkofilipin) @Aklapper do we have an official on-boarding document? :) Meaning, is there a process new hire should follow? #release-engineering-team has a checklist at [[ https://www.m... 
[11:54:32] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:54:40] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:58:17] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Arrbee) Hello, this is an approved request for @Jpita . Thanks. [12:04:35] (03CR) 10Ladsgroup: [C: 03+1] "IS.php needs to be synced first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) (owner: 10Alaa Sarhan) [12:07:36] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:07:44] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:08:51] (03PS3) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) [12:09:55] (03PS4) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) [12:09:57] (03CR) 10jerkins-bot: [V: 04-1] Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [12:10:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:11:45] (03PS2) 10Matthias Mullie: Increase rate limits for newbies on Commons [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/516633 (https://phabricator.wikimedia.org/T225148) [12:12:07] (03PS2) 10Matthias Mullie: [SDC] Enable depicts qualifiers on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517381 [12:12:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:14:06] (03PS3) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051) [12:14:52] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:14:58] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:16:18] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:16:26] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:16:58] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:20:16] (03CR) 10Jbond: [C: 03+2] firewall logging: clean up old roll-out classes [puppet] - 10https://gerrit.wikimedia.org/r/517833 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:20:25] (03PS2) 10Jbond: firewall logging: clean up old roll-out classes [puppet] - 10https://gerrit.wikimedia.org/r/517833 (https://phabricator.wikimedia.org/T116011) [12:20:43] (03PS3) 10Jbond: firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) [12:20:48] !log cache nodes: resume 
rolling reboots for kernel and varnish upgrades T224694 T225998 [12:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:55] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [12:20:55] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [12:21:04] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:54] (03CR) 10ArielGlenn: [C: 03+1] "Noop for the dump woerkers and servers, so signing off on that part of this commit." [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:22:16] (03Abandoned) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517822 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [12:22:49] (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:22:54] (03Abandoned) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517673 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [12:25:10] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:37] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:49] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:28] (03CR) 10Jbond: [C: 
03+2] firewall logging: clean up old roll-out class [puppet] - 10https://gerrit.wikimedia.org/r/517834 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:50:30] (03PS2) 10Jbond: firewall logging: make profile::base::firewall::log a private class [puppet] - 10https://gerrit.wikimedia.org/r/517835 (https://phabricator.wikimedia.org/T116011) [12:50:46] PROBLEM - cassandra CQL 10.64.16.42:9042 on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:51:38] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:50] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:21] (03CR) 10Jbond: [C: 03+2] firewall logging: make profile::base::firewall::log a private class [puppet] - 10https://gerrit.wikimedia.org/r/517835 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:53:08] !log Deploy schema change on the private wikis listed at T225643 [12:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:13] T225643: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 [12:57:51] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:07] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:14] (03CR) 10Hashar: "recheck" [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [13:00:27] 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10jbond) Closing as logging is now enabled 
by default for any role with the `profile::base::firewall` class. Please reopen if more work is required [13:00:45] (03CR) 10jerkins-bot: [V: 04-1] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [13:02:14] (03Abandoned) 10Jbond: Revert "firewall logging: enable ulog by default" [puppet] - 10https://gerrit.wikimedia.org/r/517838 (owner: 10Jbond) [13:02:58] (03PS3) 10Jbond: hiera: update search order [puppet] - 10https://gerrit.wikimedia.org/r/511686 [13:03:14] (03CR) 10Jbond: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [13:03:48] (03PS1) 10Hashar: Add .gitreview [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517851 [13:04:00] (03CR) 10jerkins-bot: [V: 04-1] Add .gitreview [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517851 (owner: 10Hashar) [13:06:25] (03PS2) 10Hashar: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [13:14:47] (03CR) 10jerkins-bot: [V: 04-1] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [13:17:52] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:08] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [13:24:30] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:24:34] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:56] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10BBlack) [13:27:03] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "If this is intended to perform tests on the behaviour of AsyncRoute, I think it's ok. On the long run, though, I'd prefer to see both an A" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [13:44:30] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:04] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:18] (03PS2) 10Ottomata: Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) [13:48:21] !log apt upgrade on wikitech-static [13:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10Krinkle) If re-using `techblog.wikimedia.org`, please take care not to break existing urls. The root path would be fine to change as it... 
[13:50:30] (03CR) 10Ottomata: [C: 03+2] Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata) [13:50:39] !log rebooting wikitech-static [13:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:49] (03CR) 10jenkins-bot: Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata) [13:51:39] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:30] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:23] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10BBlack) Implementing a blanket redirect to the legacy blog URI for `^/20(0[7-9]|1[0-8])/` should be feasible in VCL or Lua at the edge.... [13:56:04] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disabling Avro ApiAction Monolog channel - T222267 (duration: 00m 57s) [13:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:10] T222267: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 [13:57:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Premise LGTM, some comments though." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond) [13:58:46] RECOVERY - cassandra CQL 10.64.16.42:9042 on maps1002 is OK: TCP OK - 0.000 second response time on 10.64.16.42 port 9042 https://phabricator.wikimedia.org/T93886 [14:00:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes) [14:00:25] (03PS3) 10Alexandros Kosiaris: CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes) [14:05:01] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Aklapper) >>! In T226091#5268341, @zeljkofilipin wrote: > @Aklapper do we have an official on-boarding document? :) Meaning, is there a process new hire should follow? I'm afraid you hav... [14:06:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:06:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:41] !log rolling reboot of mwdebug servers for kernel security update [14:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:41] hey guys, are there issues with job queues currently? I'm again in the situation of having sent a MassMessage hours ago and hasn't arrived yet... 
https://meta.wikimedia.org/wiki/Special:Log/massmessage [14:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:47] when I looked earlier at the graphs and the kibana errors there was nothing that looked weird [14:10:35] but joe had problems earlier with a template included in a page not forcing the rerender (a template included only on 1-2 pages) [14:10:41] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Gilles) [14:11:40] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:04] 10Operations, 10Performance-Team, 10Traffic, 10media-storage, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) [14:12:29] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (10Gilles) 05Stalled→03Open [14:12:34] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) @WMDE-leszek, @Tarrow. I 've noticed we are missing one thing. We have a dashboard for the service's metrics in https://grafan... [14:12:41] 10Operations, 10MassMessage, 10Patch-For-Review: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (10Elitre) 05duplicate→03Open I'm experiencing this again :/ Help? It's been several hours. https://meta.wikimedia.org/wiki/Special:Log/massmessage [14:13:10] I've reopened a task since it's the same, can file a different one if necessary. 
[14:13:31] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:35] If I have to ping someone in particular here or there, also please LMK. Thanks as usual :) [14:15:29] (03PS8) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [14:16:48] Elitre: I don't see problems on the infrastructure handling jobs: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 [14:17:03] but of course it could be the are not sent or processed in the first place [14:19:50] jynus: thanks, I have no idea what's going on :/ there's no error messages. I thought maybe I had targeted an impossible namespace, but Johan's regular tests also didn't go thru, so it's not that. [14:19:53] I don't see anything in the exception/fatal logs [14:20:28] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:20:29] (03CR) 10Jbond: "Thanks for the review, updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond) [14:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:48] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:45] (03CR) 10Jbond: "PCC - https://puppet-compiler.wmflabs.org/compiler1002/17020/" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [14:21:59] Reedy: same, no errors, no change of patterns (except on wikibase one s with is expected) [14:22:53] https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.10/Changelog#MassMessage [14:22:55] Reedy: it could be something on code that doesn't error out but doesn't do stuff (?) do you know a good way to check the latest merges ? 
[14:23:15] That list should be wmf.8...wmf.10 for the extension [14:23:33] No obvious other commit message that suggests job queue changes [14:23:52] note that the earlier issue was not massMessage but refreshLinks or whatever queues those up [14:24:01] (03PS9) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [14:24:09] Elitre: It was fine on the 17th though, right? [14:24:39] The last Tech News issue went thru just fine AFAIK [14:24:49] when was that? [14:25:04] https://meta.wikimedia.org/wiki/Special:Log/massmessage [14:25:08] 20:37, 17 June 2019 Johan (WMF) talk contribs sent a message to Global message delivery/Targets/Tech ambassadors (Tech News: 2019-25) Tag: PHP7 [14:25:25] I note Johan's was PHP7 but Elitre's wasn't, so not obviously a PHP7 bug either [14:26:01] Any idea which wikis mass message is most active on? [14:26:12] https://en.wikipedia.org/wiki/Special:Log/massmessage last was yesterday well before the branch [14:26:58] 10Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (10jbond) 05Open→03Resolved This should all be fixed no please reopen if more errors are observed [14:27:08] I guess Meta and en.wiki, but it's a guess [14:27:34] (03PS10) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [14:28:26] Elitre: Any chance you could test one on enwiki? 
[14:28:40] To see if we can narrow if it might be a MW change, or if it's some potential backend job queue issue [14:28:48] I don't have MM rights there AFAIK [14:29:10] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10jbond) [14:29:13] 10Operations, 10Patch-For-Review: cron-spam: /usr/local/sbin/check-cumin-aliases - https://phabricator.wikimedia.org/T222443 (10jbond) 05Open→03Resolved Theses boxes where switch to spare so spam should be halted please reopen if not [14:32:01] https://id.wikipedia.org/wiki/Istimewa:Catatan/massmessage [14:32:29] 10Operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10jbond) 05Open→03Resolved [14:32:52] ok, in JobExecutor.log... [14:33:03] I don't see any .10 jobs for MassMessage [14:33:06] literally only .8 ones [14:33:10] (03PS1) 10Elukey: Add fake kerberos keytabs for the Hadoop testing cluster [labs/private] - 10https://gerrit.wikimedia.org/r/517870 [14:33:12] 10Operations, 10observability, 10Patch-For-Review: Expose linux kernel firewall and connections statistics - https://phabricator.wikimedia.org/T215277 (10jbond) 05Stalled→03Resolved Closing this a new task can be created if we re-visit this [14:33:27] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytabs for the Hadoop testing cluster [labs/private] - 10https://gerrit.wikimedia.org/r/517870 (owner: 10Elukey) [14:34:06] Reedy: so either not enqueued or not executed- code regression? [14:34:17] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now-2d&to=now&panelId=5&fullscreen [14:34:24] I wonder if this reflects what's going on [14:34:42] Possibly... Other .10 jobs are running, so job queue stuff definitely isn't completely broken on .10 [14:34:43] backlog shorter because things not getting queued? 
[14:34:56] (03PS1) 10Ottomata: Disable CirrusSearchRequestSet avro monolog channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T222268) [14:35:06] apergos: note that is only cirrus [14:35:19] which had a temporary anomaly [14:35:22] mm point [14:35:37] apergos: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now-7d&to=now&panelId=5&fullscreen [14:35:47] the rest look "normal" [14:35:56] oic [14:35:57] (03CR) 10Ottomata: "This should not be merged until the Search team verifies that they are using cirrusseach-request instead of this avro stream." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [14:36:10] (03PS1) 10Elukey: Add missing kerberos fake keytab for analytics1039 [labs/private] - 10https://gerrit.wikimedia.org/r/517872 [14:36:25] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add missing kerberos fake keytab for analytics1039 [labs/private] - 10https://gerrit.wikimedia.org/r/517872 (owner: 10Elukey) [14:37:08] (03CR) 10Alexandros Kosiaris: ipmi - pxe: Ensure ipmi is not overriding the boot order (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond) [14:37:32] 10Operations, 10MassMessage, 10Patch-For-Review: MassMessage not delivering - https://phabricator.wikimedia.org/T221365 (10Reedy) [14:37:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] hiera: update search order [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [14:38:40] 10Operations, 10MassMessage, 10WMF-JobQueue: MassMessages apparently not being sent on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) [14:38:49] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17024/ - seems that we are ready to test :)" [puppet] - 10https://gerrit.wikimedia.org/r/504280 (owner: 10Elukey) [14:39:09] 10Operations, 10MassMessage, 10WMF-JobQueue: MassMessages apparently not being sent on 1.34.0-wmf.10 
- https://phabricator.wikimedia.org/T226109 (10Reedy) [14:39:29] (03PS11) 10Elukey: Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [14:39:42] ottomata: --^ we are ready to test it finally :D [14:39:49] 10Operations, 10MassMessage, 10WMF-JobQueue: MassMessages apparently not being sent on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) Looking at `mwlog1001:/srv/mw-log/JobExecutor.log` there are no massmessage jobs for .10 only for .8 [14:40:10] (03CR) 10Aaron Schulz: "Both would be nice for implementing BagOStuff::WRITE_SYNC. I guess such a feature could just use another prefix, like mw-wan-sync or such." [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [14:40:14] apergos: jynus thoughts on rolling back all of group1? [14:40:28] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:33] hi, global renames are stuck [14:40:36] (03PS1) 10Ottomata: Remove Monolog Kafka handler and configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517874 (https://phabricator.wikimedia.org/T222268) [14:40:45] revi: ughhh [14:40:49] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:40:52] I think it may not be only mass message [14:40:52] screenshots (too lazy on mobile) https://usercontent.irccloud-cdn.com/file/ziTYFoe1/IMG_0209.PNG [14:40:53] That's a second data point though [14:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:55] but most jobs [14:41:03] uhhh [14:41:06] (03PS1) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) [14:41:17] most .10 jobs [14:41:32] (03PS1) 10Reedy: Revert "group1 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) [14:41:45] no us staff here today because holiday. right [14:41:55] I only see brief errors from the job queue on .10, when normally it is a constant noise [14:42:00] (03CR) 10jerkins-bot: [V: 04-1] Remove Monolog Kafka handler and configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517874 (https://phabricator.wikimedia.org/T222268) (owner: 10Ottomata) [14:42:10] probably the amount is not large enough to change patterns? [14:42:39] considering 635 wikis are on .10 vs 299 on .8... [14:42:43] from group 1? maybe [14:42:50] The number of log entries into that log file [14:43:00] I'll revert [14:43:03] +1 [14:43:05] (03CR) 10Reedy: [C: 03+2] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [14:43:10] but let's check that solve it [14:43:16] Sure [14:43:30] We might want to revert group0 too, but leaving those broken might be useful [14:43:32] I am convinced of the issue, not sure if that is the cause [14:43:47] Yeah, indeed [14:44:10] revi: even ig it looks like we are ignoring you, we are actually doing something about that :-D [14:44:17] jynus: assumed so [14:44:21] more like Reedy is [14:44:32] (03PS3) 10Jbond: ipmi - pxe: Ensure ipmi is not overriding the boot order [puppet] - 10https://gerrit.wikimedia.org/r/517694 [14:44:32] by scrolling up and found discussing JobQueue [14:44:33] revi: Thanks for coming and telling us though, appreciated [14:44:42] well we spotted it few hours ago [14:44:44] indeed [14:44:46] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [14:44:48] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond) [14:44:54] by 
someone in [global-rename] complaining 'rename stuck' [14:45:21] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) From email from @MarkTraceur == Database needs == * 54 million files on Commons * Estimated average of 10-20 statements per file * Estimate... [14:45:29] just issued 'stop rename' warning mails [14:45:35] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Patch-For-Review: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) [14:46:17] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517876 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [14:46:36] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: group1 back to .8 T226109 [14:46:51] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue, 10Patch-For-Review: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Krinkle) [14:46:56] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:47:19] There's some "Producer error" and some "Retry count exceeded" [14:47:40] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) p:05Triage→03High [14:47:41] I looked at the few of them I saw and they were most unenlightening [14:47:47] !log jnt push to eqiad, remove old protect-old-lvs-servers term + update syslog target T224128 [14:47:47] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Krinkle) p:05High→03Unbreak! Train blockers are UBN. 
[14:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:15] T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 [14:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:25] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 [14:48:28] how to test at least with new jobs? [14:48:30] Elitre: Want to give it a try on meta again please? [14:48:34] ^that [14:48:35] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:47] Reedy: promise it won't deliver it twice? :) [14:48:49] * apergos crosses fingers [14:49:25] or try a test delivery [14:49:39] (03PS2) 10Bstorm: dumps distribution: enable the service for nfs to start on reboot [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474) [14:49:41] just 1 person would be enough [14:49:41] I honestly can't [14:49:44] But yeah, a test one is fine [14:49:55] I can try test delivery if Elitre isn't doing it [14:50:06] I am testing it. Thanks. [14:50:10] k gud [14:50:17] what will happen to stuck stuff? [14:50:23] kick them again? [14:50:26] depends if it ever made it into the queue [14:50:32] we need to proof that was the issue first [14:50:38] those currently in the RenameQueue [14:50:44] later we will deal with stuck, etc. 
[14:50:48] kk [14:51:16] (03CR) 10Bstorm: [C: 03+2] dumps distribution: enable the service for nfs to start on reboot [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474) (owner: 10Bstorm) [14:51:25] (03CR) 10Jbond: "some questions probably obvious to most :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [14:51:34] (03PS3) 10Giuseppe Lavagetto: New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 [14:52:43] 23:52:24 [[User talk:-revi]]; MediaWiki message delivery; /* Test */ new section; https://meta.wikimedia.org/w/index.php?diff=19160132&oldid=19159693&rcid=13727936 [14:52:55] oh ho ho! [14:53:06] So it's definitely something in .10 then [14:53:10] who's gonna git bisect that mess? [14:53:16] sent. [14:53:22] Though, revi is also sending messages in the future [14:53:27] * Reedy eyes revi suspiciously [14:53:31] >_< [14:53:34] * Reedy grins [14:53:35] :-D [14:54:03] the problem is that if it was enqueing and not execution [14:54:04] !log rolling reboot of URL downloaders for kernel security update [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:01] !log jnt push to knams, remove old protect-old-lvs-servers term + update syslog target (T224128) + replace /28 with /29 (T211254) [14:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:07] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 [14:55:07] T211254: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 [14:55:20] I will be back later to see if I'm good to go with the old message. TA. 
[14:55:22] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:55:22] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/517694 (owner: 10Jbond) [14:57:01] Reedy: I hope it's only execution loss, and not queuing as well. [14:57:09] 24 hours of jobs for all group1 wikis is pretty annoying. [14:57:14] indeed [14:57:44] e.g. cascading updates all gone, don't auto-correct. For page views, 30 days will eventually fix it, but not for whatlinkshere, search index, and category tree. [14:57:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:57:52] doesn't* [14:57:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:58] !log update syslog target on frack network devices (T224128) [14:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:39] (03CR) 10Elukey: "Aaron/Giuseppe: if you are ok I'd merge this and get as far as enabling it to mw canaries (testing on single nodes first). The idea is to " [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [15:00:22] (03CR) 10Bstorm: "Jhedden found that there's another piece to this, but this is going to be needed either way, so I'll merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) (owner: 10Bstorm) [15:00:38] (03PS2) 10Bstorm: cloudstore: move secondary monitoring stuff into profile and fix it [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) [15:01:17] 10Operations, 10Wikimedia-Logstash, 10netops, 10User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (10ayounsi) 05Open→03Resolved All done here! [15:02:07] (03CR) 10Bstorm: [C: 03+2] cloudstore: move secondary monitoring stuff into profile and fix it [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) (owner: 10Bstorm) [15:06:57] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [15:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:35] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [15:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:26] (03CR) 10Aaron Schulz: "Seems OK by me." [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [15:10:10] (03CR) 10Bstorm: [C: 03+1] toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [15:12:46] AaronSchulz: You actually about? 
[15:13:33] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [15:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:52] <_joe_> Krinkle: I don't think it was just execution, no [15:14:43] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:16:34] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [15:16:35] _joe_: aye, yeah, the switch to Kafka has added a fair number of additional dependencies including UID generator and JSON encoding, both of which have repeatedly caused job outages or major perf overhead. [15:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:54] I've fixed UID generator to be less slow, and some uses of v1 have been swapped for v4. [15:17:11] although for the most part it seems pointless for jobs. [15:17:23] the json encoding is not yet solved, but being worked on. [15:17:46] I've also fixed the UID generator to not throw under normal conditions as it used to. 
[15:18:02] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:18:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:18] !log rebooting boron for kernel security update [15:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:40] (03PS1) 10Elukey: Add a host override to analytics1032 to avoid using /dev/sdc [puppet] - 10https://gerrit.wikimedia.org/r/517881 (https://phabricator.wikimedia.org/T225864) [15:20:12] (03CR) 10Elukey: [C: 03+2] Add a host override to analytics1032 to avoid using /dev/sdc [puppet] - 10https://gerrit.wikimedia.org/r/517881 (https://phabricator.wikimedia.org/T225864) (owner: 10Elukey) [15:21:41] <_joe_> Krinkle: well the switch to kafka also caused the jobqueue to be reliable in most cases, instead of a constant dumpster fire as it used to be (in the general disinterest of everyone but me). I'd call it a win anyways [15:21:43] !log rolling reboot of proton* for kernel security update [15:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] _joe_: Yeah, operationally a win. But from consumer and user perspective (MW and editors) we've had more outages and irrecoverable job loss than before. I'll get better I know, but just stating how it is. [15:23:53] It'll* [15:24:08] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:24:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:13] also more visible fatals (e.g. in web reqs instead of in the queue) [15:24:14] <_joe_> I disagree about the editors part. 
The jobs were being dropped all the time before the switch [15:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:44] <_joe_> and no, we didn't have more outages, it's just that weeks long outages or lags on some jobs were just the norm. [15:25:55] I don't recall a user-visible outage (HTTP 500) being caused when we were on Redis. Afaik it's always been able to store new jobs, or the client was sufficiently simple/robust/or tolerant to not fatal. [15:26:10] But yes, lots of benefits, and jobs actually running is also important. [15:26:46] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on analytics1032 - https://phabricator.wikimedia.org/T225864 (10elukey) 05Open→03Resolved a:03elukey [15:27:51] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:28:26] Krinkle: out of curiosity, what is the source of the HTTP 500 when sending jobs to eventbus? [15:28:53] corner cases that we didn't think about, etc.. ? [15:29:45] (03PS4) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:30:08] Joe has a good point though: probably we didn't cause 500s (that are bad) with Redis, but I recall weeks of great pain when dealing with gigantic backlogs of jobs to be processed [15:30:48] sometimes jobs were executed after days from the enqueuing time, and usually because Joe was fixing it manually :D [15:31:26] Krinkle: are you discussing the global rename outage? If so, do we have a task? 
[15:32:02] hauskatze: T226109 [15:32:02] T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 [15:32:11] cdanis: thanks :) [15:33:34] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [15:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:44] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10jijiki) p:05Triage→03Normal [15:35:04] (03PS5) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:35:39] elukey: mainly these things: 1) UID generator fatalling, 2) JWT signing failing, 3) json encoding not able to represent some jobs (several times, still on-going), 4) kafka unavailable (only once faik), 5) EventBus implementation out of sync with core (several times) [15:35:51] and these are all on the MW side, hence fatalling [15:36:34] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [15:36:36] also several issues http/curl handling stuff that I didn't quite understand at the time. [15:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:39] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Joe) A lot of jobs weren't even being executed/queued. Just looking at today's jobs `lang=bash $ fgrep commonswiki JobExecutor.log | fgrep wmf.10 | per... 
[15:37:35] !log pooled maps1002 - postgres init is complete and successfully joined to its cluster - T224395 [15:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:42] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [15:38:06] <_joe_> Reedy: I think we should rollback group0 as well, see https://phabricator.wikimedia.org/T226109#5268921 [15:38:12] (03PS6) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:38:34] _joe_: Sure, I did suggest that we might. Patch incoming [15:39:02] <_joe_> also, the friday rule applies to off days during the week as well [15:39:19] I wasn't the engine driver ;) [15:39:55] _joe_: test wikis too? or just group0? [15:39:56] <_joe_> I know [15:39:58] (03PS7) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:40:10] <_joe_> Reedy: test wikis is not my call [15:40:21] (03PS1) 10Reedy: Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) [15:40:24] <_joe_> but group0 for sure [15:40:44] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [15:40:45] (03PS2) 10Reedy: Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) [15:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:49] (03PS3) 10Reedy: Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) [15:40:51] (03CR) 10Reedy: [C: 03+2] 
Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [15:40:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [15:41:49] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [15:42:04] (03CR) 10jenkins-bot: Revert "group0 wikis to 1.34.0-wmf.10 refs T220735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517885 (https://phabricator.wikimedia.org/T226109) (owner: 10Reedy) [15:43:16] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [15:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:37] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: group0 back to .8 T226109 [15:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:42] T226109: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 [15:45:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Cmjohnson) The DIMM has been reseated and swapped to the opposite sides. 
[15:45:49] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Reedy) [15:47:08] RECOVERY - Host elastic1029 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:51:20] (03PS1) 10Fsero: introducing helmfile.d values for staging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130) [15:51:28] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:51:52] PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:52:24] PROBLEM - Apache HTTP on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:52:56] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 77491 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:53:28] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:53:38] RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:53:40] (03PS8) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) [15:53:59] what's all that ^^? 
[15:55:40] (03PS1) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [15:55:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/17032/" [puppet] - 10https://gerrit.wikimedia.org/r/517858 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [16:01:26] !log cache nodes: stop rolling reboots for today, 47/80 done T224694 T225998 [16:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:32] T225998: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 [16:01:32] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [16:02:53] (03PS1) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 [16:03:39] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 (owner: 10Fsero) [16:07:21] (03PS2) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 [16:07:46] (03PS3) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 [16:07:53] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/517891 (owner: 10Fsero) [16:10:22] (03CR) 10Fsero: "PCC seems happy https://puppet-compiler.wmflabs.org/compiler1001/17034/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [16:10:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 
(10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Closing this for now, let me know if there is another issue. Keep in mind this... [16:16:44] Reedy: I was wondering if we could restart the queued global renames or if you need further debugging before running them [16:17:50] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:11] hauskatze: I think you're ok for the moment... [16:18:30] Reedy: I meant https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress [16:20:24] hauskatze: You mean, can I unstick them or something? (run the maintenance script) [16:20:46] Reedy: yup. But maybe we can wait [16:20:55] I think it's fine tbh [16:21:00] I mean, there's no urgency [16:23:29] !log pooling elastic1029 - T214283 [16:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:35] T214283: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 [16:23:38] How do we deduce logwiki? [16:23:43] just metawiki? [16:24:01] Reedy: yes, it's where the global rename started; which is always metawiki on production wikis [16:24:30] we almost need a useful output of https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress to dump the commands [16:24:50] we can try to do the first one queued in that list and see how it goes [16:25:03] I just did [16:25:48] argh [16:25:56] it keeps messing CentralAuth data: https://meta.wikimedia.org/wiki/Special:CentralAuth/Renamed_user_4fjkqyebdp5nlxbnzvc2n9lltjc5gv1 [16:26:05] it? [16:26:35] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:26:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:56] Reedy: the script [16:27:21] so you don't want it running? 
[16:27:22] tgr or legoktm explained to me why but I can't remember [16:27:35] Reedy: maybe we can wait [16:27:39] lol [16:28:00] !log redirect ns1 to authdns1001 [16:28:01] hmm [16:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:50] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: specify more certificates [puppet] - 10https://gerrit.wikimedia.org/r/517896 (https://phabricator.wikimedia.org/T226098) [16:33:02] PROBLEM - HHVM rendering on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:33:09] hauskatze: there's a fixme to well, fix that, but right now it's not going to work, the data is effectively lost [16:33:13] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CentralAuth/+/master/maintenance/fixStuckGlobalRename.php#85 [16:33:32] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:34:16] 10Operations, 10netops: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10ayounsi) Interesting! This would be useful before doing maintenance on a whole router. I opened an issue upstream asking for a per AS option, see https://github.com/mwiget/bgp_graceful_shutdown/issues/1 My su... 
[16:34:28] RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 77553 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:34:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/17035/toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/517896 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [16:34:43] !log rebooting authdns2001 for kernel security update [16:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:58] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:36:02] legoktm: thanks :) So I guess the data will be lost for every rename in the GlobalRenameProgress page right? Unless there's a way to tell the jobqueue to pick those renames w/o using the script... [16:37:02] https://phabricator.wikimedia.org/T226109#5268921 says "So it looks like the jobs were not even being enqueued and are now lost." [16:37:37] !log rollback redirect ns1 to authdns1001 [16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:11] ouch :( [16:38:24] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) I thought I had a ticket for this open with HPE but it doesn't look that way. I will take care of it ASAP [16:39:30] !log redirect ns0 to authdns2001 [16:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:18] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Cmjohnson) Dell is sending me a new Raid card, cables and backplane. 
Sorry, it took so long, I had to call them after they denied my second request. [16:40:48] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [16:41:23] hauskatze: so yeah, if Reedy or someone else wants to run the script that would be appreciated [16:41:27] I'm not in a good place to do it right now [16:41:38] I already parsed the list to make a list of copy paste commands [16:42:11] legoktm: okay. Since I can't do it, if Reedy wants to continue it'd be good otherwise I'll create a task [16:42:46] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:42:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:56] hauskatze: Done [16:44:08] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress 8 still there atm [16:44:11] 5 [16:44:22] 4 [16:44:27] 3 [16:44:42] 2 [16:44:58] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) Great news! Thanks a lot!! [16:45:06] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:07] 1 [16:45:09] !log rebooting authdns1001 for kernel security update [16:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:01] and boom [16:46:06] thanks Sam :) [16:47:08] (03PS8) 10Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104 [16:49:14] (03PS9) 10Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104 [16:49:48] (03CR) 10Bstorm: "There! 
Got rid of the submodule merge conflict so this can actually rebase/deploy" [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm) [16:50:31] !log rollback redirect ns0 to authdns2001 [16:50:34] !log running racreset on multatuli [16:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:57] (03CR) 10Bstorm: [C: 03+2] dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm) [16:51:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: add etcd user to the puppet group [puppet] - 10https://gerrit.wikimedia.org/r/517905 (https://phabricator.wikimedia.org/T226098) [16:52:04] 10Operations, 10Core Platform Team, 10MassMessage, 10WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10CDanis) @Reedy manually ran the global renames that were never queued properly. [16:52:47] * legoktm hugs Reedy [16:53:59] oh cdanis also has a cat <3 [16:54:02] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Cmjohnson) a:05Cmjohnson→03RobH I updated the switch config to private1-d.....both servers are currently off and ready for installs. assigning to @robh to install [16:54:14] I have two, but only one of them will sit in my lap :( [16:54:19] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. 
- https://phabricator.wikimedia.org/T225704 (10Cmjohnson) [16:56:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: add etcd user to the puppet group [puppet] - 10https://gerrit.wikimedia.org/r/517905 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [16:56:19] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) [16:57:22] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) @Bstorm @ayounsi I will need very clear instructions on which racks/rows these servers can go in before I physically rack and cab... [16:57:38] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: use complete fqdn in node name [puppet] - 10https://gerrit.wikimedia.org/r/517906 (https://phabricator.wikimedia.org/T226098) [16:58:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: use complete fqdn in node name [puppet] - 10https://gerrit.wikimedia.org/r/517906 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [17:03:02] (03PS2) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) [17:03:19] (03CR) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [17:07:45] (03CR) 10Bstorm: [C: 03+1] "Where I'd be most concerned was labstores, and they come up as a noop in the compiler. Seems ok." [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [17:58:46] (03PS3) 10Bstorm: Add new terms normalized schema tables as public 1:1 views in labs. 
[puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: 10Alaa Sarhan) [18:04:35] cdanis, Reedy or anyone else, is it now safe to send MassMessages again? (re: https://phabricator.wikimedia.org/T226109 ) [18:04:49] AIUI it should be, Elitre [18:05:23] alright, will test and then try again then. fingers crossed. thanks for working on this ^_^ [18:09:59] !log added MatmaRex to extension-VisualEditor-staff Gerrit group [18:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:04] :o [18:31:47] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:33:03] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:33:48] (03PS1) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) [18:34:10] (03PS2) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) [18:34:25] (03PS3) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) [18:34:49] (03CR) 10jerkins-bot: [V: 04-1] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata) [18:39:35] Elitre: seems like it worked this time? 
[18:41:43] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [18:45:20] (03PS4) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) [18:45:59] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:46:01] (03CR) 10jerkins-bot: [V: 04-1] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata) [19:04:36] (03PS2) 10Bartosz Dziewoński: Ensure no lossy WTE→VE switching in public wikis (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516567 [19:04:38] (03PS1) 10Bartosz Dziewoński: Centralize enwiki's VisualEditor feedback page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517924 (https://phabricator.wikimedia.org/T224851) [19:06:39] (03PS5) 10Ottomata: camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) [19:23:11] (03PS1) 10Krinkle: peopleweb: Remove php module from httpd [puppet] - 10https://gerrit.wikimedia.org/r/517926 [19:23:12] hashar: Reedy: ^ [19:25:04] (03CR) 10Ottomata: [C: 03+2] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/517918 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata) [19:27:28] When I tried to do `git fetch` on ssh://vcs@git-ssh.wikimedia.org/source/tool-wmopbot.git I got "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!", where do I see the fingerprints? Should I just "meh"? 
[19:30:03] nvm, found https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/git-ssh.wikimedia.org [19:30:47] revi: https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fgit-ssh.wikimedia.org&type=revision&diff=1827965&oldid=1768511 [19:31:19] gotcha [19:34:23] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:35:37] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 77601 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:41:01] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [19:45:21] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:57:39] (03CR) 10Hashar: [C: 03+1] "There is indeed now PHP support in user home dirs due to:" [puppet] - 10https://gerrit.wikimedia.org/r/517926 (owner: 10Krinkle) [20:01:32] (03PS2) 10Effie Mouzeli: haproxy: Disable global logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) [20:02:21] (03CR) 10Effie Mouzeli: haproxy: Disable global logging to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) (owner: 10Effie Mouzeli) [20:17:51] (03PS3) 10Alaa Sarhan: Introduce config variables for new terms store in mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517819 (https://phabricator.wikimedia.org/T226086) [20:18:31] (03PS4) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517820 (https://phabricator.wikimedia.org/T225051) [20:23:16] (03PS4) 10Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 
(https://phabricator.wikimedia.org/T225051) [20:26:13] PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:26:43] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:27:21] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 77539 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:27:59] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 6.722 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:41] (03PS1) 10Awight: Configuration migration for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985) [20:40:59] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [20:45:17] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:12:17] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Aklapper) 05Stalled→03Open p:05Triage→03High [21:34:29] (03PS4) 10Bstorm: Add new terms normalized schema tables as public 1:1 views in labs. [puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: 10Alaa Sarhan) [21:35:23] (03CR) 10Bstorm: [C: 03+2] Add new terms normalized schema tables as public 1:1 views in labs. 
[puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: 10Alaa Sarhan) [21:37:55] (03PS1) 10Alaa Sarhan: Switch property terms migration stage to WRITE_BOTH on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517948 (https://phabricator.wikimedia.org/T226129) [21:41:46] (03Abandoned) 10Alaa Sarhan: Switch property terms migration stage to WRITE_BOTH on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517948 (https://phabricator.wikimedia.org/T226129) (owner: 10Alaa Sarhan) [22:04:03] (03CR) 10Alaa Sarhan: [C: 03+1] Set EntityUsageTable addUsage batch size to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517669 (https://phabricator.wikimedia.org/T225500) (owner: 10Ladsgroup) [22:11:18] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [22:15:36] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:41:36] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [22:45:52] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:47:12] PROBLEM - Check systemd state on es2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:41:44] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [23:46:00] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:55:41] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10PM3) The "some" can be removed from the caption. I am experiencing this problem since Tuesday (dew...