[00:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191211T0000). [00:00:05] tgr: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:03:57] (03CR) 10Jforrester: [C: 03+1] mediawiki: Remove unused HHVM files [puppet] - 10https://gerrit.wikimedia.org/r/556282 (https://phabricator.wikimedia.org/T229792) (owner: 10Krinkle) [00:05:05] o/ [00:05:24] I can do self-serve SWAT [00:10:05] (03CR) 10Gergő Tisza: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [00:11:01] (03Merged) 10jenkins-bot: Add growthexperiments dblist, for puppet usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [00:17:18] huh [00:17:41] Special:Notifications gives a completely different result on mwdebug vs. prod [00:18:22] which can't possibly be related to this patch [00:22:56] also I get HTTP 400 half the time [00:26:19] thcipriani: Which mwdebug? [00:26:22] Bah. [00:26:25] tgr: ^^^ [00:26:54] mwdebug1002 [00:27:01] 1001 seems to be working normally [00:27:12] 1002 is pretty much useless for testing [00:27:37] the different result because API and ResourceLoader requests also fail with a 400 randomly [00:28:20] tgr: Yes, no-one should ever use 1002 for anything. It's totally broken. Did you not see the e-mails warning about this? [00:28:33] apparently I did not [00:28:35] We should probably put an MOTD up, or just depool the damn thing. [00:28:51] From ~2 months ago now. [00:29:43] James_F: is there a scap dependency order for the dblist change? 
I'd assume not, given that it's not loaded by MediaWiki [00:30:17] tgr: The yaml files aren't read by anything in production, just the dblist. [00:30:38] and the dblist is not read since it's not added to multiversion [00:30:40] tgr: So do as you will. Personally, I'd not bother syncing the YAML files at all. [00:30:59] Yeah, it'll be readable by puppet from the maintenance servers though. [00:35:00] !log tgr@deploy1001 Synchronized dblists/growthexperiments.dblist: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 02s) [00:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:07] T208369: Welcome survey: anonymize data after one year - https://phabricator.wikimedia.org/T208369 [00:37:45] !log tgr@deploy1001 Synchronized wmf-config/config: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 01s) [00:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:46] I sleep better if everything is properly synced :) [00:39:14] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 00s) [00:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:34] * James_F grins. [00:42:47] James_F: I found Timo's email about Logstash being broken on mwdebug1002 (I saw that at some point and forgot) but nothing about MediaWiki itself being broken, which seemed to be the case now [00:43:08] is that a new bug? [00:43:24] or not really worth investigating if it's broken anyway? [00:44:38] It's unclear to me, but I've seen multiple blips and issues with mwdebug1002, and if anything it's worsened since it was re-imaged. 
[00:49:42] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/556200 (https://phabricator.wikimedia.org/T188917) (owner: Filippo Giunchedi) [00:50:05] (CR) Cwhite: [C: +1] monitoring: page on low HTTP global availability [puppet] - https://gerrit.wikimedia.org/r/555987 (https://phabricator.wikimedia.org/T186069) (owner: Filippo Giunchedi) [01:11:02] James_F: so apparently requests always go to debug1001, due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/549895 [01:11:10] and I can verify that it works [01:11:47] but using the X-WM-D: 1002 header I got frequent HTTP 400 errors, and for 1001 I did not [01:11:51] very weird [01:12:36] ugh, i'd thought that it was mwdebug1001 that was busted when testing stuff yesterday, and i think had this impression reinforced by some documentation that suggested using 1002 instead. [01:12:36] maybe I just got lucky? [01:12:51] (or preferring 1002, or something to that effect.) [01:14:11] I retract that, setting 1002 in the header still takes you to 1002 [01:14:23] I don't understand what that patch does then [01:14:44] anyway if a debug host is broken for a long time, we should probably remove it from the browser extension [01:35:11] (PS1) Gergő Tisza: Add MOTD to mwdebug1002 warning about T214734 [puppet] - https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) [01:37:18] (CR) Gergő Tisza: "Per James' comment on IRC that there should be a MOTD warning about this." 
[puppet] - https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: Gergő Tisza) [01:40:47] (CR) Brennen Bearnes: [C: +1] Add MOTD to mwdebug1002 warning about T214734 [puppet] - https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: Gergő Tisza) [01:42:14] (PS8) Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) [02:57:36] (PS1) Andrew Bogott: Add some comments about cloudvirts that are set aside for ceph testing. [puppet] - https://gerrit.wikimedia.org/r/556308 (https://phabricator.wikimedia.org/T225320) [02:58:56] (CR) Andrew Bogott: [C: +2] Add some comments about cloudvirts that are set aside for ceph testing. [puppet] - https://gerrit.wikimedia.org/r/556308 (https://phabricator.wikimedia.org/T225320) (owner: Andrew Bogott) [04:29:49] (PS4) Andrew Bogott: nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) [04:30:57] (CR) jerkins-bot: [V: -1] nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott) [04:32:35] (PS5) Andrew Bogott: nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) [04:37:01] (PS6) Andrew Bogott: nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) [04:43:43] (PS1) Andrew Bogott: nova-api: inject default user_data script for new VMs [puppet] - https://gerrit.wikimedia.org/r/556311 (https://phabricator.wikimedia.org/T181375) [04:45:05] (PS5) Andrew 
Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 [04:45:12] (03CR) 10jerkins-bot: [V: 04-1] nova-api: inject default user_data script for new VMs [puppet] - 10https://gerrit.wikimedia.org/r/556311 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [05:12:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:16:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:41:16] (03PS2) 10TechneSiyam: Added bnwikibooks,bnwikisource,ukwikivoyage under wiki hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556214 [05:45:16] !log Deploy schema change on dbstore1004:3314 [05:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:35] !log Compress cx_corpora on db2131 T240325 [06:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:41] T240325: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 [06:04:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:05:34] (03PS1) 10Marostegui: site.pp: Remove 
puppet references for db1062 [puppet] - https://gerrit.wikimedia.org/r/556317 (https://phabricator.wikimedia.org/T239188) [06:07:14] (PS1) Marostegui: wmnet: Remove production DNS for db1062 [dns] - https://gerrit.wikimedia.org/r/556318 (https://phabricator.wikimedia.org/T239188) [06:07:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:10] (CR) Marostegui: [C: +2] site.pp: Remove puppet references for db1062 [puppet] - https://gerrit.wikimedia.org/r/556317 (https://phabricator.wikimedia.org/T239188) (owner: Marostegui) [06:09:17] (CR) Marostegui: [C: +2] wmnet: Remove production DNS for db1062 [dns] - https://gerrit.wikimedia.org/r/556318 (https://phabricator.wikimedia.org/T239188) (owner: Marostegui) [06:20:00] (PS1) Marostegui: mariadb: Set db2070 to spare [puppet] - https://gerrit.wikimedia.org/r/556319 (https://phabricator.wikimedia.org/T239684) [06:20:34] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2070 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/556320 (https://phabricator.wikimedia.org/T239684) [06:21:08] (PS2) Effie Mouzeli: mediawiki: Remove unused HHVM files [puppet] - https://gerrit.wikimedia.org/r/556282 (https://phabricator.wikimedia.org/T229792) (owner: Krinkle) [06:21:47] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2070 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/556320 (https://phabricator.wikimedia.org/T239684) (owner: Marostegui) [06:22:19] !log Remove db2070 from tendril and zarcillo T239684 [06:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:25] T239684: Decommission db2070.codfw.wmnet - 
https://phabricator.wikimedia.org/T239684 [06:22:36] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2070 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/556320 (https://phabricator.wikimedia.org/T239684) (owner: Marostegui) [06:24:21] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2070 from config T239684 (duration: 01m 18s) [06:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:09] (CR) Marostegui: [C: +2] mariadb: Set db2070 to spare [puppet] - https://gerrit.wikimedia.org/r/556319 (https://phabricator.wikimedia.org/T239684) (owner: Marostegui) [06:25:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2070 from config T239684 (duration: 01m 08s) [06:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2070 from config as it will be decommissioned T239684', diff saved to https://phabricator.wikimedia.org/P9848 and previous config saved to /var/cache/conftool/dbconfig/20191211-062700-marostegui.json [06:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:39] !log Stop MySQL on db2070 - T239684 [06:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:44] T239684: Decommission db2070.codfw.wmnet - https://phabricator.wikimedia.org/T239684 [06:34:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:42:45] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:42:47] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:43:03] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:43:21] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:43:21] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:43:21] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: 
Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:43:31] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:43:35] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [06:44:22] ^checking [06:44:42] !log Stop mysql on db1124 for upgrade [06:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:17] !log restart graphoid on scb1001 [06:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:14] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:48:18] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:49:50] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [06:51:12] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [06:57:26] !log Upgrade x1 codfw [06:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:13] !log Compress cx_corpora on db2096 T240325 [06:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:18] T240325: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 [07:36:33] (03PS3) 10Alexandros Kosiaris: k8s: Introduce kubetcd[12]00[456], kubestagetcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/556202 (https://phabricator.wikimedia.org/T239838) [07:36:35] (03PS1) 10Alexandros Kosiaris: k8s: Add roles to new etcd hosts [puppet] - 10https://gerrit.wikimedia.org/r/556324 (https://phabricator.wikimedia.org/T239838) [07:50:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:51:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:51:40] !log Upgrade db2096 (x1 codfw master) [07:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:54:40] !log Compress cx_corpora on db1140:3320 T240325 [07:54:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 
6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:45] T240325: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 [07:56:53] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [08:02:40] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3055.esams.wmnet [08:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:34] !log powercycle cp3055 - down since hours ago, no ssh, no mgmt serial console usable [08:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:12] jouncebot: now [08:05:12] No deployments scheduled for the next 3 hour(s) and 54 minute(s) [08:05:15] jouncebot: next [08:05:15] In 3 hour(s) and 54 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191211T1200) [08:05:25] * Urbanecm is going to deploy last time throttle rule [08:05:38] (03PS1) 10Urbanecm: Add throttle rule for Czech student workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556330 [08:05:48] (03CR) 10Urbanecm: [C: 03+2] Add throttle rule for Czech student workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556330 (owner: 10Urbanecm) [08:06:27] (03CR) 10jerkins-bot: [V: 04-1] Add throttle rule for Czech student workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556330 (owner: 10Urbanecm) [08:06:46] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:06:46] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:06:50] RECOVERY - graphoid 
endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:07:00] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:07:08] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:07:14] (03PS2) 10Urbanecm: Add throttle rule for Czech student workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556330 [08:07:24] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:07:25] (03CR) 10Urbanecm: [C: 03+2] Add throttle rule for Czech student workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556330 (owner: 10Urbanecm) [08:07:26] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:07:32] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:07:36] RECOVERY - Host cp3055 is UP: PING OK - Packet loss = 0%, RTA = 83.38 ms [08:07:50] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:08:06] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [08:08:10] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [08:08:21] (03Merged) 10jenkins-bot: Add throttle rule for Czech student workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556330 (owner: 10Urbanecm) [08:08:22] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [08:10:12] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: f62edfe: Add throttle rule for Czech student workshop (duration: 01m 02s) [08:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:51] !log Clear signup throttle for IP 195.113.183.5 [08:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:16] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10elukey) Announced the shutdown of stat1007 for Thu Dec 12th 15:30 CET (more or less) since it is a more crowded and used node. Since stat1... [08:12:53] (03CR) 10Elukey: "Marcel, the change looks ok but as always it would be great to have some ensure => absent here and there to properly clean up, otherwise w" [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [08:17:58] 10Operations, 10Graphoid, 10serviceops: Graphoid "No graph found." failures - 11 Dec 2019 - https://phabricator.wikimedia.org/T240419 (10akosiaris) 05Open→03Resolved a:03akosiaris Graphoid is to be undeployed mid of next quarter. With that in mind, the alerts were because some one edited the monitored... 
[08:34:03] !log Upgrade db1140 [08:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:49] !log Compress cx_corpora on x1 master (db1120) - T240325 [08:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:54] T240325: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 [08:40:39] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [08:41:01] (03CR) 10Hashar: "recheck T240175" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: 10KartikMistry) [08:51:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Introduce kubetcd[12]00[456], kubestagetcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/556202 (https://phabricator.wikimedia.org/T239838) (owner: 10Alexandros Kosiaris) [09:02:42] (03PS1) 10Kosta Harlan: GrowthExperiments: Switch beta labs wikis to use local search/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556334 (https://phabricator.wikimedia.org/T235717) [09:03:54] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Switch beta labs wikis to use local search/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555932 (https://phabricator.wikimedia.org/T235717) (owner: 10Kosta Harlan) [09:04:51] !log running Translate/refresh-translatable-pages.php --jobqueue for Translate wikis - T235027 T235188 [09:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:58] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 [09:04:58] T235027: Translate does not update content page when saving units - https://phabricator.wikimedia.org/T235027 [09:11:23] (03PS2) 10Kosta Harlan: GrowthExperiments: Configure testwiki to 
use local search & config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555928 (https://phabricator.wikimedia.org/T235717) [09:13:57] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10ema) On 2019-12-10 cp3055 went down too: ` 19:33 <+icinga-wm> PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100% ` Depooled and power-cycled by @elukey on 2019-12-11T08:04. [09:14:05] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10ema) [09:14:47] !log repool cp3055 T238305 [09:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:52] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 [09:18:25] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: fail on single-quote Prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/556200 (https://phabricator.wikimedia.org/T188917) (owner: 10Filippo Giunchedi) [09:19:15] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10ema) [09:19:22] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10ema) p:05Triage→03Normal [09:19:42] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10ema) [09:20:23] (03CR) 10Ema: [C: 03+1] monitoring: page on low HTTP global availability [puppet] - 10https://gerrit.wikimedia.org/r/555987 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [09:23:24] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10jcrespo) See: T240177 T237730 backup2001 was updated to new bios last time it crashed. [09:24:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, make sure you check the network devices ACLs for labmon1001's address and ask netops to add cloudmetrics1002 address. 
+1 fro" [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [09:25:12] !log cp1075: depool ats-be to test set_server_resp_no_store https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556201/ T227432 [09:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:18] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:25:19] !log ema@cumin1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [09:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:11] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) Do we have somewhere to collect the kernel versions of the hosts and whether they were upgraded before/after the crash? I upgraded db2125's kernel when it crashed to: ` root@db2125:~# u... [09:26:30] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: add explicit IDs to plugins [puppet] - 10https://gerrit.wikimedia.org/r/556173 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [09:27:37] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) [09:27:54] (03PS1) 10RLazarus: New upstream version 1.12.2. [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/556336 [09:28:54] (03PS2) 10RLazarus: New upstream version 1.12.2. [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/556336 [09:29:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New upstream version 1.12.2. 
[debs/envoyproxy] - https://gerrit.wikimedia.org/r/556336 (owner: RLazarus) [09:30:17] Operations, DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (jcrespo) [09:33:35] !log roll-restart logstash in codfw/eqiad after https://gerrit.wikimedia.org/r/c/operations/puppet/+/556173 [09:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:34] (CR) Ema: [C: +2] ATS: use set_server_resp_no_store, do not hide CC [puppet] - https://gerrit.wikimedia.org/r/556201 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [09:39:29] (CR) Filippo Giunchedi: [C: +1] Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - https://gerrit.wikimedia.org/r/556036 (owner: Herron) [09:39:31] (PS1) Elukey: cdh::hadoop: remove ipv6 constraints [puppet] - https://gerrit.wikimedia.org/r/556337 (https://phabricator.wikimedia.org/T240255) [09:40:08] (CR) Filippo Giunchedi: [C: +1] Revert "Revert "dns: add kibana-next and logstash-next service addresses"" [dns] - https://gerrit.wikimedia.org/r/556035 (owner: Herron) [09:41:34] Operations, DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (jcrespo) [09:41:54] (CR) Elukey: "Added John to the party since I want to make sure that using augeas is ok (to avoid being hunted down when he upgrades to Puppet 6 for exa" [puppet] - https://gerrit.wikimedia.org/r/556337 (https://phabricator.wikimedia.org/T240255) (owner: Elukey) [09:43:41] (PS1) Jbond: admin: add kubectl zsh autocomplete [puppet] - https://gerrit.wikimedia.org/r/556338 [09:44:58] !log cp1075: repool ats-be after successful set_server_resp_no_store test P9849 T227432 [09:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:04] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:45:25] !log ema@cumin1001 conftool action 
: set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [09:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:45] Operations, DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (jcrespo) [09:50:25] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) [09:51:20] (CR) Hashar: "recheck T240423 (no more use debian/rules clean)" [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: KartikMistry) [09:52:22] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/556212 (owner: TechneSiyam) [09:52:43] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/556214 (owner: TechneSiyam) [09:53:23] (Abandoned) Urbanecm: modified initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/555981 (owner: TechneSiyam) [09:53:46] (Abandoned) Urbanecm: HD logos for wikiprojects [mediawiki-config] - https://gerrit.wikimedia.org/r/555980 (owner: TechneSiyam) [09:56:45] (PS1) Arturo Borrero Gonzalez: toolforge: k8s: include metrics-server [puppet] - https://gerrit.wikimedia.org/r/556340 (https://phabricator.wikimedia.org/T240402) [10:03:10] !log cp-ats: apply set_server_resp_no_store patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556201/ to all hosts T227432 [10:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:16] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:06:24] Operations, DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (jcrespo) Log and boot: https://drive.google.com/file/d/1YL-j3M9fMFGq9EkHxxOL6kVtOf4uyM-e/view?usp=sharing 
https://drive.google.com/file/d/1E-5dZ_fitSE5TW0RmrrYRZsGt4DTitFn/view?usp=sharing [10:06:36] (03PS11) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [10:18:28] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5292 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [10:20:01] jobqueue has had an increased amount of cirrus search jobs for about 10 minutes (been watching it for my script runs) [10:23:44] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [10:24:58] Nikerabbit: are these ElasticaWrite jobs? [10:25:50] cirrusSearchElasticaWrite yeah [10:26:22] Nikerabbit: this is being investigated: T224425 [10:26:22] T224425: MW Job consumers sometimes pause for several minutes - https://phabricator.wikimedia.org/T224425 [10:27:00] dcausse: okay, so I assume it's not caused by me [10:27:25] no I don't think so, you are reindexing ttm indices? 
[10:28:20] dcausse: no ttm should be unaffected, I am basically doing a refresh on all translation pages (PageName/langcode) to make sure they are up to date with latest translations [10:28:38] add comma after "no" [10:28:46] ok I see [10:31:32] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Switch beta labs wikis to use local search/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556334 (https://phabricator.wikimedia.org/T235717) (owner: 10Kosta Harlan) [10:32:22] (03Merged) 10jenkins-bot: GrowthExperiments: Switch beta labs wikis to use local search/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556334 (https://phabricator.wikimedia.org/T235717) (owner: 10Kosta Harlan) [10:33:22] (03PS1) 10Giuseppe Lavagetto: New envoy version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/556342 [10:34:14] !log Finished running Translate/refresh-translatable-pages.php --jobqueue for Translate wikis - T235027 T235188 [10:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:20] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 [10:34:21] T235027: Translate does not update content page when saving units - https://phabricator.wikimedia.org/T235027 [10:36:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:36:44] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:38:03] (03PS1) 10Ema: systemd: add icinga check for journal patterns [puppet] - 10https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) [10:39:13] !log draining kubernetes1001.eqiad.wmnet to restart calico-node [10:39:17] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3314 for schema change T233135', diff saved to https://phabricator.wikimedia.org/P9851 and previous config saved to /var/cache/conftool/dbconfig/20191211-104506-marostegui.json [10:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:13] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [10:45:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [10:45:52] !log Deploy schema change on db1103:3314 [10:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:39] !log draining kubernetes1002.eqiad.wmnet to restart calico-node [10:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:42] !log draining kubernetes1003.eqiad.wmnet to restart calico-node [10:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:52] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_43192: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:53:02] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_43192: Servers kubernetes1002.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:53:24] PROBLEM - LVS HTTP IPv4 on eventgate-logging-external.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.50 and port 43192: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:55:12] RECOVERY - LVS HTTP IPv4 on 
eventgate-logging-external.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 851 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:55:13] ^^ I'm guessing this is caused by restarting calico [10:55:30] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [10:56:26] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:58:20] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:58:30] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:58:54] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:59:04] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [11:00:16] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:00:40] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:03:03] (03PS1) 10Ema: ATS: add icinga check for logs skipped by trafficserver{,-tls} [puppet] - 10https://gerrit.wikimedia.org/r/556345 (https://phabricator.wikimedia.org/T237608) [11:05:21] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10MoritzMuehlenhoff) Some observations: - I'm pretty sure this is unrelated to the kernel, we've seen these crashes with both 4.9 and 4.19 - backup2001 had latest firmware when it crashed - backup200... [11:06:06] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10ema) >>! In T238305#5731093, @jcrespo wrote: > See: T240177 T237730 backup2001 was updated to new bios last time it crashed. cp3053 too (T239041) and has been running fine since, FWIW. [11:07:39] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10MoritzMuehlenhoff) Given that the firmware updates itself were still showing these symptons, this wouldn't hurt, but I doubt it's a complete fix, I wrote up some proposal at https://phabricator.wikimedia.org/T238305#5731421,... [11:09:33] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1001/19905/" [puppet] - 10https://gerrit.wikimedia.org/r/556345 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [11:10:58] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: page on low HTTP global availability [puppet] - 10https://gerrit.wikimedia.org/r/555987 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [11:11:00] (03CR) 10Urbanecm: [C: 04-1] "fyi, per https://wikitech.wikimedia.org/wiki/SWAT_deploys ,this is non-swattable. 
You need to separate this into several patches, so each " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [11:11:20] (03CR) 10Ema: "Not incredibly useful pcc here: https://puppet-compiler.wmflabs.org/compiler1001/19906/" [puppet] - 10https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [11:19:42] (03PS2) 10Ema: ATS: lookup cache for cookie requests [puppet] - 10https://gerrit.wikimedia.org/r/556217 (https://phabricator.wikimedia.org/T227432) [11:21:01] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10jcrespo) >>! In T238305#5731424, @ema wrote: >>>! In T238305#5731093, @jcrespo wrote: >> See: T240177 T237730 backup2001 was updated to new bios last time it crashed. > > cp3053 too (T239041) and... [11:21:55] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [11:22:09] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:22:10] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:22:27] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: 
Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [11:22:31] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:22:45] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:24:32] (03PS1) 10Giuseppe Lavagetto: Fix debian version in changelog [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/556350 [11:24:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix debian version in changelog [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/556350 (owner: 10Giuseppe Lavagetto) [11:25:29] again edits in the test page --^ [11:26:54] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) Moritz points at T238305#5731421 that maybe it is the same issue as: https://www.dell.com/community/PowerEdge-OS-Forum/Random-Reboot-R740/td-p/5169703/page/3 ` root@backup2001:~$ cat /sys/devices/system/... 
[11:27:42] !log draining kubernetes1006.eqiad.wmnet to restart calico-node [11:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:17] !log draining kubernetes1005.eqiad.wmnet to restart calico-node [11:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:33] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:33:08] <_joe_> jbond42: maybe drain one node at a time? [11:33:48] _joe_: i am draining one node at a time but i think i was not leaving enough time for the BGP session to re-establish when i did the first three [11:33:50] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:34:08] 10Operations, 10Icinga, 10observability, 10Patch-For-Review, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Tentatively resolving, we'll be paging if >= 1% of global traffic is 5xx [11:34:09] im now checking on cr1 before moving to the next [11:34:10] <_joe_> jbond42: yeah that's probably it [11:34:16] <_joe_> great, thanks [11:34:37] this is what im doing https://wikitech.wikimedia.org/wiki/Kubernetes#Restarting_calico-node [11:36:42] !log draining kubernetes1004.eqiad.wmnet to restart calico-node [11:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:25] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: 
/{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:45:27] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:49:34] 10Puppet, 10observability: puppetization of check_prometheus is not robust to the use of single quotes - https://phabricator.wikimedia.org/T188917 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Complete! `check_prometheus` will fail on queries with single quotes [11:50:03] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:51:23] (03CR) 10Ema: [C: 03+2] ATS: lookup cache for cookie requests [puppet] - 10https://gerrit.wikimedia.org/r/556217 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:52:15] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [11:52:53] !log draining kubernetes2001 to restart calico-node [11:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:23] (03PS3) 10Volans: Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919 (owner: 10Faidon Liambotis) [11:55:29] !log draining kubernetes2002 to restart 
calico-node [11:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:46] !log draining kubernetes2003 to restart calico-node [11:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:33] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_43192: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191211T1200). [12:00:04] kostajh: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] PROBLEM - LVS HTTP IPv4 on eventgate-logging-external.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.50 and port 43192: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:00:09] o/ [12:00:12] I can SWAT today! 
[12:00:13] !log installing git security updates [12:00:16] I'm deploying for UBN right now [12:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:23] \o [12:00:25] Amir1: go ahead and ping me once you're done [12:00:28] wait a minute [12:00:35] sure [12:01:03] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_43192: Servers kubernetes2004.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:01:09] confirming it's fixed on the mwdebug1001 [12:01:55] RECOVERY - LVS HTTP IPv4 on eventgate-logging-external.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 851 bytes in 1.155 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:02:32] Urbanecm: Andrew-WMDE and I can deploy the Cite patch, please ping when you're done. [12:02:35] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:02:46] awight there is no cite patch scheduled [12:02:56] i see one for morning swat [12:03:03] anyway, sure, i can ping you once done [12:03:07] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:03:21] Urbanecm: thanks! I dropped it into the wrong slot... fixing now.
[12:03:45] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [12:03:46] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/Wikibase/data-access: [[gerrit:556353|Fix idlookup dropping pageids (T236691 T240410)]] (duration: 01m 03s) [12:03:51] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:53] T236691: Implement alternative `SiteLinkLookup` for MediaInfo entities - https://phabricator.wikimedia.org/T236691 [12:03:53] T240410: Wikidata updates(?) are triggering "Call to a member function getPageLanguage() on null" on wmf.10 wikis - https://phabricator.wikimedia.org/T240410 [12:03:58] my deployment is done [12:04:03] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:06] thank you Amir1 [12:04:15] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:21] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:30] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555928 (https://phabricator.wikimedia.org/T235717) (owner: 10Kosta Harlan) [12:04:37] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:04:37] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [12:04:39] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:45] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:57] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:57] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:59] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:04:59] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:04:59] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:05:24] (03Merged) 10jenkins-bot: GrowthExperiments: Configure testwiki to use local search & config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555928 (https://phabricator.wikimedia.org/T235717) (owner: 10Kosta Harlan) [12:05:59] kostajh: could you please make sure it behaves as expected at mwdebug1001? [12:06:09] Urbanecm: yep [12:06:11] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:07:40] Urbanecm: LGTM [12:07:46] kostajh: thanks, syncing [12:08:33] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:09:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 7651c1a: GrowthExperiments: Configure testwiki to use local search & config (T235717) (duration: 01m 02s) [12:09:58] kostajh: done! [12:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:00] T235717: Newcomer tasks: non-HTTP-based ConfigurationLoader and TaskSuggester - https://phabricator.wikimedia.org/T235717 [12:10:01] awight: I'm done [12:10:29] Urbanecm: thanks! [12:10:49] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [12:13:27] awight and I are going to start our deployment [12:15:05] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:16:51] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:26:21] 10Operations, 10DNS, 10Domains, 10Traffic: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (10tomasz) [12:29:12] (03PS3) 10Jcrespo: admin: Add accraze to analytics-privadata-users [puppet] - 10https://gerrit.wikimedia.org/r/556168 (https://phabricator.wikimedia.org/T240243) [12:30:41] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce admin script to add x509 certs to k8s secrets [puppet] - 10https://gerrit.wikimedia.org/r/556363 (https://phabricator.wikimedia.org/T240402) [12:30:57] (03CR) 10Jcrespo: [C: 03+2] admin: Add accraze to analytics-privadata-users [puppet] - 10https://gerrit.wikimedia.org/r/556168 (https://phabricator.wikimedia.org/T240243) (owner: 10Jcrespo) [12:34:35] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) a:05jcrespo→03ACraze ` Notice: /Stage[main]/Admin/Admin::Hashuser[accraze]/Admin::User[accraze]/User[accraze]/ensure: created Notice:... 
[12:35:23] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) p:05Triage→03Normal [12:38:23] !log andrew-wmde@deploy1001 scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [12:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:55] 10Operations, 10Core Platform Team, 10Release-Engineering-Team, 10Wikimedia-Rdbms: WikiPage::updateCategoryCounts putting heavy load on commonswiki - https://phabricator.wikimedia.org/T240405 (10jcrespo) Please note the tendril results are rolling (the .5 is a relative range that is now meaningless). There... [12:45:43] 10Operations, 10Core Platform Team, 10Release-Engineering-Team, 10Wikimedia-Rdbms: WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10jcrespo) [12:46:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: introduce admin script to add x509 certs to k8s secrets [puppet] - 10https://gerrit.wikimedia.org/r/556363 (https://phabricator.wikimedia.org/T240402) (owner: 10Arturo Borrero Gonzalez) [12:47:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:48:08] 10Operations, 10Core Platform Team, 10Release-Engineering-Team, 10Wikimedia-Rdbms: WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10jcrespo) updating title as load was not really high, in fact there was lower loa... 
[12:54:17] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: include metrics-server [puppet] - 10https://gerrit.wikimedia.org/r/556340 (https://phabricator.wikimedia.org/T240402) [12:54:18] !log andrew-wmde@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/Cite: SWAT: [[gerrit:556351|Use messagelocalizer in CiteErrorReporter (T239988)]] (duration: 01m 04s) [12:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:24] T239988: Review feature: Reference errors are split by user interface language - https://phabricator.wikimedia.org/T239988 [12:55:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: include metrics-server [puppet] - 10https://gerrit.wikimedia.org/r/556340 (https://phabricator.wikimedia.org/T240402) (owner: 10Arturo Borrero Gonzalez) [13:01:21] (03CR) 10Phamhi: [C: 03+2] wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:03:04] We still need a few minutes... We have a revert patch which is stuck in CI... [13:09:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: metrics: include some hints and comments [puppet] - 10https://gerrit.wikimedia.org/r/556369 (https://phabricator.wikimedia.org/T237643) [13:10:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: metrics: include some hints and comments [puppet] - 10https://gerrit.wikimedia.org/r/556369 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [13:10:26] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10Marostegui) >>! In T240177#5731486, @jcrespo wrote: > Moritz points at T238305#5731421 that maybe it is the same issue as: https://www.dell.com/community/PowerEdge-OS-Forum/Random-Reboot-R740/td-p/5169703/page/3 >... 
[13:17:35] (03PS5) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) [13:19:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10User-greg: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) a:03greg Assigning to @greg for approval, as "service owner... [13:22:14] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) > You've changed it to `powersave` or it is originally set to `powersave`? Didn't change anything, I pasted it as it is now. Most servers, including non-crashing backup1001 seems to be in that mode. [13:25:16] !log andrew-wmde@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Cite: SWAT: [[gerrit:556367|Revert "Lazily fetch user interface language to prevent cache split" ()]] (duration: 01m 02s) [13:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] phamhi: hi, just making sure you've seen my comment re: network ACLs https://gerrit.wikimedia.org/r/c/operations/puppet/+/554844#message-c8ad2f9459f2d35317e3f8ced0835fc7f2e92d13 I realized I voted +1 but the comment says otherwise [13:26:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:27:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:32:20] (03PS5) 10Herron: Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - 10https://gerrit.wikimedia.org/r/556036 [13:32:40] Andrew-WMDE: fwiw, the errors you saw are all from canary servers. I still don't understand why there was a spike, but it looks to have passed. [13:36:14] (03CR) 10Herron: [C: 03+2] Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - 10https://gerrit.wikimedia.org/r/556036 (owner: 10Herron) [13:37:16] !log EU SWAT complete [13:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] (03PS2) 10Herron: Revert "Revert "dns: add kibana-next and logstash-next service addresses"" [dns] - 10https://gerrit.wikimedia.org/r/556035 [13:49:06] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 93 connections established with conf1004.eqiad.wmnet:4001 (min=94) https://wikitech.wikimedia.org/wiki/PyBal [13:49:26] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [13:49:33] (03CR) 10Herron: [C: 03+2] Revert "Revert "dns: add kibana-next and logstash-next service addresses"" [dns] - 10https://gerrit.wikimedia.org/r/556035 (owner: 10Herron) [13:50:40] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 51 connections established with conf2001.codfw.wmnet:2379 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [13:50:56] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [13:51:42] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.51:443]) 
https://wikitech.wikimedia.org/wiki/PyBal [13:51:48] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 51 connections established with conf2001.codfw.wmnet:2379 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [13:51:54] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [13:52:22] (03PS1) 10Volans: Fix some spelling issues [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/556375 [13:52:49] (03Abandoned) 10Volans: Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919 (owner: 10Faidon Liambotis) [13:52:52] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 61 connections established with conf1004.eqiad.wmnet:4001 (min=62) https://wikitech.wikimedia.org/wiki/PyBal [13:53:12] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/556375 (owner: 10Volans) [13:53:31] (03PS1) 10Herron: Revert "Revert "Revert "lvs: add entries for logstash-next and kibana-next""" [puppet] - 10https://gerrit.wikimedia.org/r/556376 [13:54:13] (03CR) 10Herron: [C: 03+2] Revert "Revert "Revert "lvs: add entries for logstash-next and kibana-next""" [puppet] - 10https://gerrit.wikimedia.org/r/556376 (owner: 10Herron) [13:54:33] (03PS1) 10Herron: Revert "Revert "Revert "dns: add kibana-next and logstash-next service addresses""" [dns] - 10https://gerrit.wikimedia.org/r/556377 [13:55:45] (03CR) 10Herron: [C: 03+2] Revert "Revert "Revert "dns: add kibana-next and logstash-next service addresses""" [dns] - 10https://gerrit.wikimedia.org/r/556377 (owner: 10Herron) [13:55:49] (03CR) 10Volans: [C: 03+2] Fix some spelling issues [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/556375 (owner: 10Volans) [13:58:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is CRITICAL: Compilation of file 
/srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [13:58:08] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [14:00:32] !log uploaded envoyproxy-1.12.2 to reprepro [14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:10] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 51 connections established with conf2001.codfw.wmnet:2379 (min=51) https://wikitech.wikimedia.org/wiki/PyBal [14:02:28] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:03:16] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:03:22] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 51 connections established with conf2001.codfw.wmnet:2379 (min=51) https://wikitech.wikimedia.org/wiki/PyBal [14:03:28] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:04:28] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 61 connections established with conf1004.eqiad.wmnet:4001 (min=61) https://wikitech.wikimedia.org/wiki/PyBal [14:05:14] (03PS6) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [14:06:22] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 93 connections established with conf1004.eqiad.wmnet:4001 (min=93) https://wikitech.wikimedia.org/wiki/PyBal [14:06:42] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:07:14] (03CR) 10jerkins-bot: [V: 04-1] Add image tracking 
support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [14:07:48] (03PS7) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [14:18:14] 10Operations, 10netops: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10Phamhi) [14:19:27] (03PS2) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [14:19:37] !log updating envoyproxy to 1.12.2 on mwmaint, restbase T238050 [14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:43] T238050: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 [14:22:56] (03CR) 10Volans: "Couple of nits inline, looks good otherwise." (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:36:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:39:36] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10MoritzMuehlenhoff) We're setting the governor via cpufrequtils class and cp/lvs hosts are already configured to use "performance", so I'd suggest to test that setting on one of the affected cp* hosts to gain addit...
[14:42:09] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (10Mathew.onipe) 05Open→03Resolved This is resolved now. I worked with Filippo to fix this via the following commands: ` sfdisk -d /dev/sda | sfdisk... [14:42:41] (03CR) 10Ottomata: [C: 03+1] "Elukey none of this stuff was included in production anywhere, and the cloud VPS nodes have been deleted, so just removing this stuff shou" [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [14:43:26] !log updating envoyproxy to 1.12.2 on all codfw T238050 [14:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:31] T238050: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 [14:43:38] (03CR) 10Ottomata: [C: 03+1] "Ah, except for the redirect site on thorium. I can delete that manually." [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [14:45:01] !log updating envoyproxy to 1.12.2 on all eqiad T238050 [14:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:53] 10Operations, 10Core Platform Team, 10Release-Engineering-Team, 10Wikimedia-Rdbms: WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10Anomie) I think that {8acea5491d} fixes this. That patch is in wmf.10, but wasn'... 
[14:46:07] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Rdbms, 10Core Platform Team Workboards (Clinic Duty Team): WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10Anomie) [14:48:54] <[1997kB]> cp5011 frontend, Varnish XID 299822413 [14:48:54] <[1997kB]> Error: 503, Backend fetch failed at Wed, 11 Dec 2019 14:47:58 GMT [14:48:57] (03PS3) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [14:49:01] (03CR) 10CDanis: dbctl: generate externalLoads (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:50:36] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:51:33] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) > disable C / C1E settings without changing to performance profile? Do you know by any chance if performance governor sets that automatically (only needs to be changed) or it is a (potential) requirement... 
[14:51:38] (03CR) 10jerkins-bot: [V: 04-1] dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [14:51:47] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) a:05Papaul→03jcrespo [14:52:40] (03PS4) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [14:55:10] (03CR) 10RLazarus: [C: 03+2] New envoy version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/556342 (owner: 10Giuseppe Lavagetto) [14:56:48] (03CR) 10RLazarus: [V: 03+2 C: 03+2] New envoy version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/556342 (owner: 10Giuseppe Lavagetto) [14:56:50] (03CR) 10Giuseppe Lavagetto: [V: 03+2] New envoy version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/556342 (owner: 10Giuseppe Lavagetto) [14:56:52] (03PS5) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [15:00:05] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10Phamhi) As part of the failover test from labmon1001 to cloudmetrics1002, we have discovered that cloudmetrics1002's 10.64.4.1... [15:03:27] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10MoritzMuehlenhoff) >>! In T240177#5732180, @jcrespo wrote: >> disable C / C1E settings without changing to performance profile? > > Do you know by any chance if performance governor sets that automatically (only...
[15:07:31] (03PS6) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [15:09:11] (03PS1) 10RLazarus: blubberoid: Specify Envoy version 1.12.2-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/556386 (https://phabricator.wikimedia.org/T238050) [15:10:52] (03PS1) 10BBlack: authdns: fix comment in scripts.pp [puppet] - 10https://gerrit.wikimedia.org/r/556387 (https://phabricator.wikimedia.org/T240285) [15:10:54] (03PS1) 10BBlack: authdns: remove stretch clause [puppet] - 10https://gerrit.wikimedia.org/r/556388 (https://phabricator.wikimedia.org/T240285) [15:11:39] (03CR) 10BBlack: [V: 03+2 C: 03+2] authdns: fix comment in scripts.pp [puppet] - 10https://gerrit.wikimedia.org/r/556387 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [15:11:45] (03CR) 10BBlack: [V: 03+2 C: 03+2] authdns: remove stretch clause [puppet] - 10https://gerrit.wikimedia.org/r/556388 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [15:12:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: Specify Envoy version 1.12.2-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/556386 (https://phabricator.wikimedia.org/T238050) (owner: 10RLazarus) [15:12:39] (03Merged) 10jenkins-bot: blubberoid: Specify Envoy version 1.12.2-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/556386 (https://phabricator.wikimedia.org/T238050) (owner: 10RLazarus) [15:13:03] (03PS7) 10CDanis: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) [15:13:14] (03PS1) 10Andrew Bogott: Add shinken group and contacts for the 'gratitude' cloud-vps project [puppet] - 10https://gerrit.wikimedia.org/r/556389 (https://phabricator.wikimedia.org/T238424) [15:13:28] (03CR) 10Volans: [C: 03+1] "LGTM! 
Ship-it" [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:14:34] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [15:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:53] (03CR) 10CDanis: [C: 03+2] dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:15:33] (03PS1) 10Phamhi: Revert "wmcs: make cloudmetrics1002 the primary instead of labmon1001" [puppet] - 10https://gerrit.wikimedia.org/r/556392 [15:15:38] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Rdbms, 10Core Platform Team Workboards (Clinic Duty Team): WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10jcrespo) @Anomie I can confirm it was commonswiki,... [15:17:43] (03Merged) 10jenkins-bot: dbctl: generate externalLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/556224 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:20:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/556392 (owner: 10Phamhi) [15:21:03] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[15:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:55] (03PS1) 10BBlack: authdns: move global monitoring to profile [puppet] - 10https://gerrit.wikimedia.org/r/556394 (https://phabricator.wikimedia.org/T240285) [15:22:57] (03PS1) 10BBlack: authdns: move host monitoring to profile [puppet] - 10https://gerrit.wikimedia.org/r/556395 (https://phabricator.wikimedia.org/T240285) [15:22:59] (03PS1) 10BBlack: authdns: refactor conf monitoring [puppet] - 10https://gerrit.wikimedia.org/r/556396 (https://phabricator.wikimedia.org/T240285) [15:23:02] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [15:23:05] (03PS5) 10Tchanders: Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) [15:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:09] (03PS1) 10Tchanders: Use wgCheckUserForceSummary instead of wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556397 [15:23:14] (03PS1) 10Tchanders: Remove unused wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556398 [15:25:23] (03CR) 10Tchanders: "Thanks Urbanecm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [15:28:15] (03PS2) 10BBlack: authdns: refactor conf monitoring [puppet] - 10https://gerrit.wikimedia.org/r/556396 (https://phabricator.wikimedia.org/T240285) [15:31:45] (03CR) 10BBlack: [C: 03+2] authdns: move global monitoring to profile [puppet] - 10https://gerrit.wikimedia.org/r/556394 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [15:31:51] (03CR) 10BBlack: [C: 03+2] authdns: move host monitoring to profile [puppet] - 10https://gerrit.wikimedia.org/r/556395 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [15:31:57] 
(03CR) 10BBlack: [C: 03+2] authdns: refactor conf monitoring [puppet] - 10https://gerrit.wikimedia.org/r/556396 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [15:36:46] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9625 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:38:38] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) regarding performance_governor, T225713 combined with this ticket seems unclear what is the best option. [15:45:39] (03PS2) 10CDanis: dbctl: also read externalLoads from dbctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) [15:45:47] (03CR) 10CDanis: [C: 04-2] "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:48:39] (03PS3) 10CDanis: dbctl: also read externalLoads from dbctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) [15:48:56] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata, 10Structured-Data-Backlog (Current Work): Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) {F31470388} generated via https://github.com/apergos/misc-wmf-crap/blob/master/sdc-growth/get_slo... [15:50:25] (03PS1) 10Bstorm: nginx-ingress: Have ingress pods request realistic resting resources [puppet] - 10https://gerrit.wikimedia.org/r/556404 (https://phabricator.wikimedia.org/T239405) [15:52:53] (03PS1) 10BBlack: XXX WIP discovery move [puppet] - 10https://gerrit.wikimedia.org/r/556406 [15:58:45] (03CR) 10Ladsgroup: "Nice! Thank you for doing it." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [15:59:16] (03CR) 10Ladsgroup: Enable CheckUser's Special:Investigate page on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [15:59:37] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10User-greg: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10brennen) [16:00:50] (03PS2) 10Bstorm: toolforge-calico: Set up yaml and config to use calicoctl as a pod [puppet] - 10https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) [16:01:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10User-greg: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10Jdforrester-WMF) Service owner is @thcipriani, not Greg. 
[16:02:54] (03CR) 10Bstorm: [C: 03+2] toolforge-calico: Set up yaml and config to use calicoctl as a pod [puppet] - 10https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) (owner: 10Bstorm) [16:03:46] (03PS1) 10Ottomata: Add intake-{logging,analytics}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/556411 (https://phabricator.wikimedia.org/T236386) [16:05:17] (03PS2) 10BBlack: authdns: move discovery to profile and refactor [puppet] - 10https://gerrit.wikimedia.org/r/556406 (https://phabricator.wikimedia.org/T240285) [16:05:22] (03Abandoned) 10Ottomata: Route all /produce/logging/* to eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/554318 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [16:08:17] I'm going to deploy these two: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/556403 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/556402 [16:15:48] (03PS3) 10BBlack: authdns: move discovery to profile and refactor [puppet] - 10https://gerrit.wikimedia.org/r/556406 (https://phabricator.wikimedia.org/T240285) [16:19:34] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10User-greg: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) > Service owner is @thcipriani, not Greg. Sorry about that.... 
[16:19:36] (03PS4) 10BBlack: authdns: move discovery to profile and refactor [puppet] - 10https://gerrit.wikimedia.org/r/556406 (https://phabricator.wikimedia.org/T240285) [16:19:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10User-greg: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) [16:22:34] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10Gehel) 05Open→03Resolved [16:22:39] (03PS1) 10Ottomata: Public routing from intake-logging.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/556413 (https://phabricator.wikimedia.org/T236386) [16:22:41] (03CR) 10BBlack: [C: 03+2] authdns: move discovery to profile and refactor [puppet] - 10https://gerrit.wikimedia.org/r/556406 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [16:22:43] (03PS1) 10Jforrester: Deploy DiscussionTools: Part I, extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556414 (https://phabricator.wikimedia.org/T240468) [16:22:45] (03PS1) 10Jforrester: Deploy DiscussionTools: Part II, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556415 (https://phabricator.wikimedia.org/T240468) [16:22:47] (03PS1) 10Jforrester: Deploy DiscussionTools: Part III, CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556416 (https://phabricator.wikimedia.org/T240468) [16:22:49] (03PS1) 10Jforrester: [BETA] Enable DiscussionTools on Beta English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556417 (https://phabricator.wikimedia.org/T240468) [16:24:30] (03CR) 10Phamhi: [C: 03+2] Revert "wmcs: make cloudmetrics1002 the primary instead of labmon1001" [puppet] - 10https://gerrit.wikimedia.org/r/556392 (owner: 10Phamhi) [16:24:40] (03PS2) 10Phamhi: Revert "wmcs: make cloudmetrics1002 the primary instead of 
labmon1001" [puppet] - 10https://gerrit.wikimedia.org/r/556392 [16:24:41] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [16:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:25] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10EBernhardson) IMO the important things to check beyond physical space: * Does the PSU have enough overhead to support GPU's (current wx 91... [16:28:20] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.075 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:28:36] !log cp1075 ats-be: temporarily switch to plain HTTP for api and appservers (apache directly instead of nginx) [16:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:51] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [16:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:04] (03PS2) 10Jbond: admin: add kubectl zsh autocomplete [puppet] - 10https://gerrit.wikimedia.org/r/556338 [16:38:00] (03CR) 10Jbond: [C: 03+2] admin: add kubectl zsh autocomplete [puppet] - 10https://gerrit.wikimedia.org/r/556338 (owner: 10Jbond) [16:38:21] cdanis: rlazarus: i think you both use zsh ^^^ may be of interest [16:38:32] 👀 [16:38:38] ooh, thanks [16:38:41] sweet, thanks!
[16:39:01] :) np [16:39:08] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Wikibase/lib/includes/Store/Sql/SqlEntityInfoBuilder.php: Consider any type of empty value as uncached in SqlEntityInfoBuilder (T237984) (duration: 01m 03s) [16:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:14] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [16:40:22] (03CR) 10Filippo Giunchedi: [C: 03+1] Add intake-{logging,analytics}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/556411 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [16:40:35] (03CR) 10Filippo Giunchedi: [C: 03+1] Public routing from intake-logging.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/556413 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [16:43:35] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/Wikibase/lib/includes/Store/Sql/SqlEntityInfoBuilder.php: Consider any type of empty value as uncached in SqlEntityInfoBuilder (T237984) (duration: 01m 03s) [16:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:11] (03CR) 10Jforrester: [C: 03+1] Add MOTD to mwdebug1002 warning about T214734 [puppet] - 10https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: 10Gergő Tisza) [16:49:33] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [16:57:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) a:05greg→03thcipriani Assigning to @thcipriani for approval, as "service... 
[17:04:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add intake-{logging,analytics}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/556411 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [17:04:03] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [17:25:26] (03PS1) 10Majavah: Enable SandboxLink extension on hywwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556422 (https://phabricator.wikimedia.org/T239387) [17:27:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10jcrespo) p:05Triage→03Normal [17:40:24] (03PS1) 10BBlack: authdns: move discovery check out of scripts [puppet] - 10https://gerrit.wikimedia.org/r/556423 (https://phabricator.wikimedia.org/T240285) [17:40:26] (03PS1) 10BBlack: authdns: invert require on exec [puppet] - 10https://gerrit.wikimedia.org/r/556424 (https://phabricator.wikimedia.org/T240285) [17:41:48] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10jcrespo) a:03ema Proposing merging this ticket into T238305 (or resolve it), unless there are some host-specific tasks pending for cp3055, like upgrading the firmware and assigning to someone that could do that (@robh remot...
[17:45:30] (03CR) 10BBlack: [C: 03+2] authdns: move discovery check out of scripts [puppet] - 10https://gerrit.wikimedia.org/r/556423 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [17:45:35] (03CR) 10BBlack: [C: 03+2] authdns: invert require on exec [puppet] - 10https://gerrit.wikimedia.org/r/556424 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [17:47:16] (03PS1) 10Alexandros Kosiaris: admin: mimic jbond [puppet] - 10https://gerrit.wikimedia.org/r/556428 [17:47:19] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10RobH) I was tagged into this, so I'm guessing the info is needed for firmware? The server is running the following: Bios 2.2.11 - this is very outdated, urgent flagged update currently is 2.4.8 ilom 3.34.34.34 - this is... [17:48:07] (03PS1) 10BBlack: authdns: move config to profile [puppet] - 10https://gerrit.wikimedia.org/r/556429 (https://phabricator.wikimedia.org/T240285) [17:48:09] (03PS1) 10BBlack: authdns: move update system to profile and refac [puppet] - 10https://gerrit.wikimedia.org/r/556430 (https://phabricator.wikimedia.org/T240285) [17:48:34] (03PS2) 10Alexandros Kosiaris: admin: mimic jbond [puppet] - 10https://gerrit.wikimedia.org/r/556428 [17:50:45] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10jcrespo) @Robh Indeed this would need owner confirmation and depooling, not asking you to do anything. Was tagging you just to confirm a remote upgrade was possible and reasonable for 3xxx datacenter, given its particular lo... 
[17:51:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: mimic jbond [puppet] - 10https://gerrit.wikimedia.org/r/556428 (owner: 10Alexandros Kosiaris) [17:51:34] 10Operations, 10Traffic: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183 (10jcrespo) Same comment as T240425#5732940 [17:54:58] 10Operations, 10vm-requests, 10Patch-For-Review: EQIAD+CODFW: (9) VM request for kubernetes etcd - https://phabricator.wikimedia.org/T239838 (10akosiaris) 05Open→03Resolved [17:55:11] 10Operations, 10Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (10RobH) >>! In T240425#5732979, @jcrespo wrote: > @Robh Indeed this would need owner confirmation and depooling, not asking you to do anything. Was tagging you just to confirm a remote upgrade was possible and reasonable for 3... [17:58:13] (03PS2) 10BBlack: authdns: move config to profile [puppet] - 10https://gerrit.wikimedia.org/r/556429 (https://phabricator.wikimedia.org/T240285) [17:58:15] (03PS2) 10BBlack: authdns: move update system to profile and refac [puppet] - 10https://gerrit.wikimedia.org/r/556430 (https://phabricator.wikimedia.org/T240285) [17:59:11] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Jdlrobson) This suggests it is being indexed?: https://www.google.com/search?safe=active&ei=Sy7xXaXmDcOt0PEP5I... 
[18:01:03] (03CR) 10Dmaza: [C: 03+1] Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [18:01:07] (03CR) 10Dmaza: [C: 03+1] Use wgCheckUserForceSummary instead of wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556397 (owner: 10Tchanders) [18:01:33] (03CR) 10Dmaza: [C: 03+1] Remove unused wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556398 (owner: 10Tchanders) [18:02:09] (03CR) 10BBlack: [C: 03+2] authdns: move config to profile [puppet] - 10https://gerrit.wikimedia.org/r/556429 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [18:03:59] (03CR) 10BBlack: [C: 03+2] authdns: move update system to profile and refac [puppet] - 10https://gerrit.wikimedia.org/r/556430 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [18:04:13] (03PS3) 10BBlack: authdns: move update system to profile and refac [puppet] - 10https://gerrit.wikimedia.org/r/556430 (https://phabricator.wikimedia.org/T240285) [18:08:55] (03PS1) 10BBlack: authdns: get rid of spec test stuff [puppet] - 10https://gerrit.wikimedia.org/r/556435 (https://phabricator.wikimedia.org/T240285) [18:08:57] (03PS1) 10BBlack: rename authdns module to gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/556436 [18:10:27] (03CR) 10BBlack: [C: 03+2] authdns: get rid of spec test stuff [puppet] - 10https://gerrit.wikimedia.org/r/556435 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [18:11:10] (03CR) 10BBlack: [C: 03+2] rename authdns module to gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/556436 (owner: 10BBlack) [18:15:42] (03PS1) 10Pmiazga: Enable Article and Discussion tabs for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556439 (https://phabricator.wikimedia.org/T232594) [18:16:27] (03CR) 10jerkins-bot: [V: 04-1] Enable Article and Discussion tabs for all 
logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556439 (https://phabricator.wikimedia.org/T232594) (owner: 10Pmiazga) [18:18:11] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Remove unused HHVM files [puppet] - 10https://gerrit.wikimedia.org/r/556282 (https://phabricator.wikimedia.org/T229792) (owner: 10Krinkle) [18:19:23] (03PS1) 10Pmiazga: Add History to article toolbar for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) [18:20:00] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Wikicology) @Jdlrobson, What does this suggest?: https://www.google.com/search?q=D%E1%BA%B9%CC%80j%E1%BB%8D+T... [18:20:19] (03CR) 10jerkins-bot: [V: 04-1] Add History to article toolbar for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) (owner: 10Pmiazga) [18:26:02] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Jdlrobson) That article was only created today and is likely not been picked up by search crawlers. Google's al... [18:30:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks for this. Sorry it took me so long to review. This is a good starting point, I 've left a number of inline comments." 
(0320 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [18:36:01] (03PS1) 10Herron: dns: add kibana-next and logstash-next service addresses [dns] - 10https://gerrit.wikimedia.org/r/556442 [18:36:56] (03PS1) 10Herron: lvs: add entries for logstash-next and kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/556443 [18:45:54] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10thcipriani) a:05thcipriani→03jcrespo Approved, thank you @jcrespo ! [18:45:58] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10brion) [18:47:11] (03PS2) 10Herron: lvs: add entries for logstash-next and kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/556443 [18:49:18] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10jcrespo) p:05Triage→03High @ayounsi could you have a look if it is something we could do something about? It was reported on IRC and more than one person said it was expe... [18:50:03] (03PS1) 10BBlack: p::dns::auth: split up the monolith [puppet] - 10https://gerrit.wikimedia.org/r/556447 (https://phabricator.wikimedia.org/T240285) [18:50:05] (03PS1) 10BBlack: p::dns::auth: remove create_resources use [puppet] - 10https://gerrit.wikimedia.org/r/556448 (https://phabricator.wikimedia.org/T240285) [18:52:47] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10jcrespo) I am told he is on vacations right now, maybe @akosiaris or @faidon can have a look? I didn't see anything obvious packet loss on grafana metrics or librenms. 
[18:53:19] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Wikicology) @Jdlrobson articles created from October till date are not indexed but older articles does. But I d... [18:53:27] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Fix no-JS warning message (T240210) (duration: 01m 02s) [18:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:34] T240210: Fallback text for non-JS-supporting clients on Special:SuggestedTags - https://phabricator.wikimedia.org/T240210 [18:56:15] twentyafterfour / mutante / paladox any idea why phabricator no longer threads my emails properly (it posts a separate email for every notification)? Did something change in the defaults that I need to update? [18:56:54] The only thing that changed recently was the server (moved from phab1003 back to phab1001) [18:57:16] Though i also think there was a phab update too (not sure if that would have do that) [18:58:57] (03CR) 10BBlack: [C: 03+2] p::dns::auth: split up the monolith [puppet] - 10https://gerrit.wikimedia.org/r/556447 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [18:59:01] (03CR) 10BBlack: [C: 03+2] p::dns::auth: remove create_resources use [puppet] - 10https://gerrit.wikimedia.org/r/556448 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [18:59:39] Jdlrobson: Mine are still threaded, as another data point. Interesting question! [19:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191211T1900). [19:00:04] Tchanders and awight: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:50] Tchanders: I can SWAT today! [19:01:11] Urbanecm: Hi again :-) I'm happy to test and monitor, or deploy my own patch. At your convenience. [19:01:43] Thanks Urbanecm! [19:02:22] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [19:02:57] awight: I've +2'ed your backport, once I'm done and backport is merge, I'll ping you, so you can self-deploy [19:03:17] (03Merged) 10jenkins-bot: Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: 10Tchanders) [19:03:23] Urbanecm: Right on, thanks. [19:04:52] Tchanders: could you test your patch at mwdebug1001, please? [19:05:08] Sure, just a moment... [19:06:06] Urbanecm: Looks great [19:06:25] great, syncing! [19:06:33] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556397 (owner: 10Tchanders) [19:07:22] (03Merged) 10jenkins-bot: Use wgCheckUserForceSummary instead of wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556397 (owner: 10Tchanders) [19:08:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9ca31f4: Enable CheckUser Special:Investigate page on testwiki (T239936) (duration: 01m 02s) [19:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:34] T239936: Enable Special:Investigate on testwiki - https://phabricator.wikimedia.org/T239936 [19:08:50] (03PS2) 10Pmiazga: Enable Article and Discussion tabs for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556439 (https://phabricator.wikimedia.org/T232594) [19:09:28] syncing the second one too, can't be tested [19:09:50] (03PS2) 10Pmiazga: Add History to article toolbar for all logged-in users 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) [19:10:49] !log urbanecm@deploy1001 sync-file aborted: SWAT: c8fe811: Use wgCheckUserForceSummary instead of wmgCheckUserForceSummary (T239936) (duration: 00m 02s) [19:10:51] awight: do you use gmail? [19:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:57] (03CR) 10jerkins-bot: [V: 04-1] Add History to article toolbar for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) (owner: 10Pmiazga) [19:11:17] Jdlrobson: /me hangs head. yes [19:11:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c8fe811: Use wgCheckUserForceSummary instead of wmgCheckUserForceSummary (duration: 01m 02s) [19:11:59] There are some headers which might be involved, like "Thread-Topic: PHID-TASK-3xsxamqxnfhs7h2g6pvj [19:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:24] but according to that announcement, gmail ignores these. [19:12:26] (03CR) 10Urbanecm: [C: 03+2] Remove unused wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556398 (owner: 10Tchanders) [19:13:18] (03Merged) 10jenkins-bot: Remove unused wmgCheckUserForceSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556398 (owner: 10Tchanders) [19:15:29] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: c8fe811: Use wgCheckUserForceSummary instead of wmgCheckUserForceSummary (duration: 01m 02s) [19:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:55] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Jdlrobson) I would hope not, but really this is up to Google. 
There is nothing wrong on our side and nothing we... [19:18:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: eaa4c2c: Remove unused wmgCheckUserForceSummary (duration: 01m 01s) [19:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:25] awight: air is yours [19:18:30] Tchanders: deployed [19:18:37] Urbanecm: ack, thank you [19:18:42] Urbanecm: Wonderful, thank you [19:18:51] Happy to help! [19:23:50] (03PS1) 10Jhedden: ceph: update firewall rules for peers and clients [puppet] - 10https://gerrit.wikimedia.org/r/556454 (https://phabricator.wikimedia.org/T239918) [19:24:08] !log awight@deploy1001 scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [19:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:47] (03PS2) 10Majavah: Enable SandboxLink extension on hywwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556422 (https://phabricator.wikimedia.org/T239387) [19:26:27] awight: how did you sync the change? [19:26:29] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1003/19925/" [puppet] - 10https://gerrit.wikimedia.org/r/556454 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [19:26:44] if order of those two files matters, it can be synced in the wrong order [19:26:47] and thus causing the error [19:27:34] Urbanecm: +1 I think the issue is that scap uses rsync, so it's likely that a request is served while files are in an inconsistent state. [19:27:53] I'm considering --force, but taking a closer look at the sources for a moment. [19:28:19] Unfortunately, this isn't possible to sync one file at a time; I should have built it that way but did not. 
[19:28:38] If I were you, I'd revert now and prepare two commits [19:28:42] that can somehow account for this [19:28:44] awight: Manually fiddle the code files on the deployment server to make them, and then sync out. [19:28:56] or what James_F says [19:29:01] Urbanecm: That's the procedure for normal code, not for UBNs. [19:29:17] For UBNs, the most important thing is to fix production in a way that doesn't make things worse. :-) [19:29:20] (03CR) 10Ammarpad: [C: 03+1] "Looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556422 (https://phabricator.wikimedia.org/T239387) (owner: 10Majavah) [19:29:23] * Urbanecm didn't note this is an UBN :) [19:29:34] *notice [19:29:39] There certainly is a rule that says "don't deploy multiple files that are dependent on each other", but I'm hoping to break that rule :-) [19:30:13] Yeah it's not quite UBN, but perhaps it should be: https://phabricator.wikimedia.org/T240426 [19:30:30] 10Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (10CDanis) [19:30:43] I suspect this bug has caused the ParserCache hit rate to drop by 10% over the last couple of days. [19:31:58] (03CR) 10Bstorm: "I think you need to make sure you address IPv6 in ferm, IIRC. Otherwise it will cause issues." [puppet] - 10https://gerrit.wikimedia.org/r/556454 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [19:32:01] awight: okay i think i might know what happened now - i changed the default inbox to "priority" at some point by accident [19:32:04] all is well now [19:32:13] thanks for pointing me in the direction of gmail settings :?) [19:32:22] Okay, I'm satisfied that the code will work after a little flurry of errors. [19:32:38] Jdlrobson: ah! So the AI is not so smart?? 
:p [19:33:15] !log Overriding scap canaries for T240426 [19:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:21] T240426: Cite extension is causing ParserCache split by interface language - https://phabricator.wikimedia.org/T240426 [19:34:04] !log awight@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Cite: SWAT: [[gerrit:556372|Lazily fetch user interface language to prevent cache split (take 2) (T240426, T239988)]] (duration: 00m 40s) [19:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:11] T239988: Review feature: Reference errors are split by user interface language - https://phabricator.wikimedia.org/T239988 [19:34:34] 10Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (10CDanis) [19:36:40] This is a good time to mention, logspam-watch doesn't show TypeErrors. Maybe I'm misunderstanding something, but these are fatal so I'd like to see. [19:38:21] awight: i can look into that; logspam-watch is fairly ad hoc (though i find myself using it a lot as an easier at-a-glance thing for me than logstash). [19:39:23] i definitely would not rely on it in place of the much-more-robust usual logstash sources. [19:39:27] brennen: Cool! I was wondering if watch "--differences" might be useful. Maybe not. [19:40:23] brennen: +1 I have a few logstash tabs open, but keep running into this TypeError thing so wanted to let other people know. It's a newish error, I think since PHP 7.2. [19:41:07] Yeah. [19:43:28] Urbanecm: Were you planning to deploy anything else in this window? [19:43:37] awight: no, you're the last one [19:43:40] :) [19:43:50] !log Morning SWAT complete [19:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:56] (re: --differences, interesting thought. you can run the base command with just `logspam` if you want to experiment with output.) 
[19:46:19] brennen: Good call, I'll play with that. Vaguely related, I was wondering if we have a repo for *client*-side commands for deployers, e.g. Lucas_WMDE's "swat" script? [19:46:50] awight: idts, maybe we should request one [19:48:01] Urbanecm: Here's a small contribution, to help motivate this repo: https://phabricator.wikimedia.org/P8845 [19:49:26] (03CR) 10Bstorm: [C: 03+1] "Seems ceph isn't dualstack aware, so whatevs." [puppet] - 10https://gerrit.wikimedia.org/r/556454 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [19:50:48] awight, brennen: `watch --differences -n 5 sh -c "logspam | sort -nr | head -25"` seems a moderate improvement, yes. [19:51:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3314 after schema change T233135', diff saved to https://phabricator.wikimedia.org/P9857 and previous config saved to /var/cache/conftool/dbconfig/20191211-195130-marostegui.json [19:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:37] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [19:52:27] (03PS2) 10Jhedden: ceph: update firewall rules for peers and clients [puppet] - 10https://gerrit.wikimedia.org/r/556454 (https://phabricator.wikimedia.org/T239918) [19:52:50] James_F: brennen: sorting by the filename gives it more stability. I'm enjoying: watch -n 10 -d 'logspam | sort -k 3' [19:53:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314 for schema change T233135', diff saved to https://phabricator.wikimedia.org/P9858 and previous config saved to /var/cache/conftool/dbconfig/20191211-195306-marostegui.json [19:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:18] ah I also like that sorting on file naturally groups by component [19:53:55] ugh nvm, one line moved and the whole thing went to hell. 
[19:54:38] (03CR) 10Jhedden: [C: 03+2] ceph: update firewall rules for peers and clients [puppet] - 10https://gerrit.wikimedia.org/r/556454 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [19:55:55] htop-style controls for sorting would be nice for this kind of thing. [19:56:00] It'd be nice to collapse "^/srv/mediawiki/php-1.35.0-wmf.(\d)/" to "…$1/", if we're going to start making wishlist requests for logspam. ;-) [19:56:25] brennen: That sounds like a lot of work, though. [19:57:02] probably, but a little cleanup on output certainly wouldn't be. [19:59:43] Hmm. `logspam | sed 's/\/srv\/mediawiki\/php-1.35.0-/…/' | sort -nr | head -25` "works", but chomps at the wrong length? [20:00:04] marxarelli and James_F: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191211T2000). [20:02:20] James_F: o/ [20:02:40] Hey. Good to go from my POV. [20:02:50] marxarelli: Did you want me to run the commands? :-) [20:03:20] James_F: g2g here as well. re: commands, sure! [20:03:57] * James_F carefully re-reads https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys [20:04:42] James_F: i'm not sure if `deploy-promote` will do the right thing here [20:04:51] (going from wmf.8 to wmf.10) [20:05:13] we shall see [20:05:16] Hmm. [20:05:30] it will prompt you y/n :) [20:05:40] Yeah, but it broke. [20:06:04] oh good [20:07:40] (03PS1) 10Jforrester: group1 wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556460 [20:07:43] (03CR) 10Jforrester: [C: 03+2] group1 wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556460 (owner: 10Jforrester) [20:08:13] Hmm, I thought php got re-pointed for group2, not group1? 
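Note on the path-collapsing wish raised at 19:56:00 and the `sed` attempt at 19:59:43: a minimal sketch, assuming each `logspam` line contains the full `/srv/mediawiki/php-1.35.0-wmf.N/` path somewhere in it. The function name `collapse_prefix` and the sample input line are hypothetical. Capturing the version number, rather than cutting a fixed prefix, is one guess at the intended output (`…$1/`); it is not a diagnosis of why the original command chomped at the wrong length.

```shell
#!/usr/bin/env bash
# Hypothetical helper: shorten "/srv/mediawiki/php-1.35.0-wmf.N/" to "…N/"
# wherever it appears in a line read from stdin.
collapse_prefix() {
  # -E enables extended regexes; '#' is used as the delimiter so the
  # slashes in the path do not need escaping.
  sed -E 's#/srv/mediawiki/php-1\.35\.0-wmf\.([0-9]+)/#…\1/#g'
}

# Sample line; the "<count> <type> <path>" layout is an assumption,
# not taken from real logspam output.
printf '%s\n' '12 Error /srv/mediawiki/php-1.35.0-wmf.10/extensions/Math/src/MathWikibaseConfig.php' \
  | collapse_prefix
```

If this behaves as intended, it could sit in front of the sort, for example `logspam | collapse_prefix | sort -nr | head -25`.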
[20:08:36] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556460 (owner: 10Jforrester) [20:08:39] marxarelli: Is that OK? [20:09:06] James_F: looks right [20:09:40] Cool, let's go. [20:09:41] did you give it explicit args? i.e. `deploy-promote group1 1.35.0-wmf.10` [20:10:21] No, just the raw command. [20:10:39] It "worked out" that I wanted group1 to go and offered (and I said yes). [20:10:47] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.10 [20:10:49] `Promote group1 from 1.35.0-wmf.8 to 1.35.0-wmf.10 [y/N] y` [20:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:56] great [20:11:32] Looks OK to me. [20:11:49] !log jforrester@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.10 (duration: 01m 02s) [20:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:32] James_F: is this new? "Error from line 147 of /srv/mediawiki/php-1.35.0-wmf.10/extensions/Math/src/MathWikibaseConfig.php: Class 'Wikibase\Client\WikibaseClient' not found" [20:13:47] marxarelli: It's "new" in wmf.10 but minor. [20:13:52] There's a task for it somehwere. [20:13:57] kk [20:14:06] T240458 [20:14:06] T240458: Math Extension: Class 'Wikibase\Client\WikibaseClient' not found on wikimania2016.wikimedia.org - https://phabricator.wikimedia.org/T240458 [20:14:08] otherwise, looks good to me [20:14:35] Yeah. Declaring good for now. [20:15:11] +1 [20:15:35] (03PS2) 10Arlolra: Make Parsoid/PHP cluster read-write to record lints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) [20:16:08] (re: watch logspam, it would make a nice wiki page. Refresh a RevisionSlider view every N seconds, showing a diff between any baseline sample and the the current error counts.) 
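Note on the baseline idea above (a diff between a baseline sample and the current error counts): a rough sketch of how that comparison might look outside of `watch`, assuming each sample line starts with a numeric count followed by the message text. That layout is a guess based on the `sort -nr` usage earlier in the log, and `count_delta` is a hypothetical name, not part of logspam.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare two samples of "<count> <message...>" lines
# and print "<signed delta> <message>" for every message whose count changed.
# Usage: count_delta BASELINE_FILE CURRENT_FILE
count_delta() {
  awk '
    {
      count = $1
      sub(/^[^ \t]+[ \t]+/, "")          # drop the count column, keep the message
    }
    FILENAME == ARGV[1] {                # first file: remember baseline counts
      base[$0] = count
      next
    }
    {                                    # second file: report changed or new messages
      seen[$0] = 1
      d = count - base[$0]
      if (d != 0) printf "%+d %s\n", d, $0
    }
    END {                                # messages that vanished since the baseline
      for (k in base) if (!(k in seen)) printf "%+d %s\n", -base[k], k
    }
  ' "$1" "$2"
}
```

Something like `logspam > baseline.txt` once, then `logspam > now.txt; count_delta baseline.txt now.txt` on each refresh, would show only what moved, which sidesteps the problem of one shifted line disturbing a side-by-side diff.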
[20:18:46] (03PS1) 10Mholloway: MachineVision: Update labeling job delay to 48 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556463 [20:18:49] (03PS1) 10Mholloway: MachineVision: Show UploadWizard CTA on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556464 [20:18:51] (03PS1) 10Mholloway: MachineVision: Remove testing group restriciton on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556465 [20:20:11] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Wikicology) Please, feel free to resolve this task [20:23:23] James_F: marxarelli: train all clear for today? [20:23:48] mdholloway: Yes, I think so. Go for it. [20:23:56] James_F: thanks! [20:27:44] 10Operations, 10Research, 10SRE-Access-Requests: Google Search Console access request -- Isaac - https://phabricator.wikimedia.org/T240501 (10Isaac) [20:28:51] (03CR) 10Mholloway: [C: 03+2] MachineVision: Update labeling job delay to 48 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556463 (owner: 10Mholloway) [20:29:45] (03Merged) 10jenkins-bot: MachineVision: Update labeling job delay to 48 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556463 (owner: 10Mholloway) [20:30:01] yay! 
[20:32:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:32:32] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Update labeling job delay to 48 hours (duration: 01m 05s) [20:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:51] (03CR) 10Mholloway: [C: 03+2] MachineVision: Show UploadWizard CTA on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556464 (owner: 10Mholloway) [20:32:52] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Tracking), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Jdlrobson) 05Open→03Resolved a:03Jdlrobson I'm sorry I can't be more help @Wikicology :( [20:33:43] (03Merged) 10jenkins-bot: MachineVision: Show UploadWizard CTA on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556464 (owner: 10Mholloway) [20:34:21] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10CDanis) I did some [[ https://atlas.ripe.net/measurements/23604772/#!probes | ICMP pings ]] and [[ https://atlas.ripe.net/measurements/23604785/#!probes | TCP port 443 tracer... 
[20:36:01] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Show UploadWizard CTA on commonswiki (duration: 01m 03s) [20:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:32] (03CR) 10Mholloway: [C: 03+2] MachineVision: Remove testing group restriciton on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556465 (owner: 10Mholloway) [20:37:22] (03Merged) 10jenkins-bot: MachineVision: Remove testing group restriciton on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556465 (owner: 10Mholloway) [20:39:30] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Remove testing group restriciton on commonswiki (duration: 01m 04s) [20:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:24] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10CDanis) We also have two probes on Comcast's network constantly performing pings towards our RIPE Atlas anchor in ulsfo. Their network performance looks relatively stable ov... [20:54:02] James_F: logspam | sort -k 3 > baseline-sample.`date +%Y%m%d`.txt; watch "!#:0-4 | diff -dsw -y -W 300 !#:6 -" [20:57:10] awight: That's a bit too wide, surely. [20:57:53] And you've lost my suggestion for compacting the invariant /srv/mediawiki/php-… stuff. [20:57:55] But it's neat. [21:00:04] cscott, arlolra, subbu, halfak, and accraze: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191211T2100). [21:05:32] (03PS1) 1020after4: Set vcs user password to '*' [puppet] - 10https://gerrit.wikimedia.org/r/556471 [21:06:39] (03CR) 1020after4: "I'm not sure why * works and ! 
does not, however, I've confirmed that this fixes the issue." [puppet] - 10https://gerrit.wikimedia.org/r/556471 (owner: 1020after4) [21:08:12] (03CR) 1020after4: [C: 03+1] Set vcs user password to '*' [puppet] - 10https://gerrit.wikimedia.org/r/556471 (owner: 1020after4) [21:12:14] marxarelli: James_F: hi we have an UBN in CentralNotice following the train deploy just now [21:13:18] AndyRussG: Do you need prod rolled back? [21:13:45] James_F: that might be nice. Though I don't know for sure if it's caused by the train just yet [21:14:20] James_F: this is the error I see: "Skipped unresolvable module ext.centralNotice.adminUi.bannerEditor" [21:14:29] James_F: for example, here: https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/B19_WMDE_EN_Desktop_03_2_ctrl [21:14:36] Is rollback feasible? [21:14:37] (03CR) 10Jdlrobson: [C: 04-1] "Wouldn't it be better to do this in the repo? Ideally InitialiseSettings should disable defaults and be kept as slim as possible." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) (owner: 10Pmiazga) [21:15:13] (03PS1) 10Jforrester: Revert "group1 wikis to 1.35.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556474 [21:15:15] (03CR) 10Jdlrobson: [C: 04-1] "(ideally the repo we checkout for development should match the one in production - the best way to do that is for the config in the repo t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) (owner: 10Pmiazga) [21:15:19] (03CR) 10Jforrester: [C: 03+2] Revert "group1 wikis to 1.35.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556474 (owner: 10Jforrester) [21:15:47] AndyRussG: Hi! Just to make sure, this only affects CentralNotice admins, not readers? [21:15:58] awight: I have no idea [21:16:09] * awight looks [21:16:14] AndyRussG: That URL you gave me works fine for me, FWIW. 
[21:16:18] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556474 (owner: 10Jforrester) [21:16:29] !log jforrester@deploy1001 sync-wikiversions aborted: group0 to 1.34.0-wmf.0 (duration: 00m 00s) [21:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:36] James_F: you may see different stuff if you don't have CN admin rights [21:16:43] Yeah. [21:17:28] (03CR) 1020after4: [C: 03+1] "no problems detected by puppet compiler: https://puppet-compiler.wmflabs.org/compiler1001/19926/" [puppet] - 10https://gerrit.wikimedia.org/r/556471 (owner: 1020after4) [21:17:51] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 back to 1.34.0-wmf.8 [21:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:01] AndyRussG: Train rolled back. Better? [21:18:55] James_F: not yet, but maybe we have to wait for a RL cache turnover? [21:19:03] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10leila) >>! In T240303#5726814, @BBlack wrote: > I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS ser... [21:19:22] AndyRussG: If that [21:19:26] AndyRussG: I found that extension is only enabled on meta.wmo, so this might not be a train blocker... [21:19:27] 's where your issue is, yes. [21:19:38] s/extension/RL module/ [21:19:40] AndyRussG: Can you file a UBN task? [21:19:54] James_F: awight: https://phabricator.wikimedia.org/T240505 [21:21:10] is it ok to be deploying parsoid now? [21:21:21] arlolra: Yes, should be. [21:21:28] thanks [21:21:54] !log arlolra@deploy1001 Started deploy [parsoid/deploy@5ba7506]: Updating Parsoid to af576d5 [21:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:21] James_F: awight: fixed!!!!!! 
[21:22:41] We can figure try to figure this out today so maybe the train can go everywhere tomorrow if u like? [21:26:06] AndyRussG: You'll need to stop referring to `jquery.ui.datepicker` and use `jquery.ui` in your RL manifest, that's all. [21:26:26] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) [21:26:50] (James_F: nice path sed for logspam!) [21:27:14] AndyRussG: For some reason CN didn't come up in codesearch. [21:28:46] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) [21:29:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:29:19] James_F: right yes that's it... thanks!!! Mmmm that module is defined in PHP, in CentralNoticeHooks.php [21:29:34] Yeah, trivial fix and back-port should do it. [21:29:42] I'm trying to work out why code search broke. [21:29:44] So maybe you only looked through extension.json [21:30:44] No, I looked everywhere. [21:31:07] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@5ba7506]: Updating Parsoid to af576d5 (duration: 09m 12s) [21:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:27] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) >>! In T236327#5729983, @Jclark-ctr wrote: > @RobH > Server New Rack Switchport > kafka-jumbo1001 a4 39 > kafka-jumbo1003 b2... 
[21:31:36] James_F: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CentralNotice/+/389d9bff78694c8ded51c98d4eb2b09323f56663/includes/CentralNoticeHooks.php#75 [21:32:10] James_F: my guess is that best would be to delay removing these modules until sometime in January [21:32:21] apologies also if we weren't on top of deprecation and removal schedules! [21:32:32] Krinkle: FYI. [21:36:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:36:34] James_F: it looks like the jquery stuff is still available, then, we just have to refer to it differently? [21:36:47] AndyRussG: Yes. One second, leaving meeting. [21:36:47] https://phabricator.wikimedia.org/T219604 [21:37:04] Krinkle: James_F: ok in that case we should be able to do that for tomorrow [21:39:10] James_F: thanks so much btw!!!!!!!!!!!! :) [21:40:32] !log Updated Parsoid to af576d5 (T237693, T238777, T237306, T239875, T240053) [21:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:42] T237306: PEG error: Expected end of input or tlb but "\r" found. 
- https://phabricator.wikimedia.org/T237306 [21:40:42] T239875: VE leaves around inline data-parsoid on edited content - https://phabricator.wikimedia.org/T239875 [21:40:42] T238777: Argument 1 passed to Parsoid\Wt2Html\TT\OnlyInclude::onTag() must be an instance of Parsoid\Tokens\Token, array given - https://phabricator.wikimedia.org/T238777 [21:40:43] T237693: Port templatedata mocha tests to phpunit - https://phabricator.wikimedia.org/T237693 [21:40:43] T240053: Detect potential selser corruption - https://phabricator.wikimedia.org/T240053 [21:41:14] AndyRussG: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/556477 [21:41:55] AndyRussG: for clarity, are you saying you want to delay group1 deployment until tomorrow to address https://phabricator.wikimedia.org/T240505 ? [21:42:41] AndyRussG: If you can merge, I'd like us to pull that to production and fix and continue the train rather than delay everything further. [21:43:08] James_F: ok sure, gimme 5 min to test locally? [21:43:18] marxarelli: ^ [21:43:27] AndyRussG: Sure. [21:46:28] James_F: minor snag, suddenly can't access Gerrit on the command line [21:46:48] AndyRussG: Ha, that doesn't help. :-) [21:48:35] James_F: https seems to work [21:50:10] awight: fish shell-style collapsing of path names would be nice, if we're pushing the boat out. [21:50:33] awight: So "wmf.8/i/l/o/MemcachedPeclBagOStuff.php" instead of "wmf.8/includes/libs/objectcache/MemcachedPeclBagOStuff.php" [21:50:56] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10TheDJ) [21:51:45] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10brion) `sudo mtr -z -s 1000 -T -P 443 phabricator.wikimedia.org` gives similar results: ` Host Loss% Snt... 
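The fish-shell-style path collapsing James_F suggests at 21:50 can be sketched roughly as below. This is only an illustration in Python, not the actual logspam tooling: the function name `collapse_path` is hypothetical, and the rule (keep the first component and the file name whole, cut every directory in between to one character) is inferred from the single example given in the chat.

```python
def collapse_path(path: str) -> str:
    """Collapse intermediate directories to their first character.

    Assumption (inferred from the one example in the chat): the first
    component (the branch directory, e.g. "wmf.8") and the file name
    are kept whole; only the directories in between are shortened.
    """
    parts = path.split("/")
    if len(parts) <= 2:
        # Nothing between the first component and the file name.
        return path
    head, middle, tail = parts[0], parts[1:-1], parts[-1]
    shortened = [component[:1] for component in middle]
    return "/".join([head, *shortened, tail])


if __name__ == "__main__":
    print(collapse_path(
        "wmf.8/includes/libs/objectcache/MemcachedPeclBagOStuff.php"
    ))
```

Run on the path quoted in the chat, this gives the collapsed form James_F wrote out by hand. The real fish shell also handles details such as dot-prefixed directory names, which this sketch ignores.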
[21:54:35] James_F: it fixes the UBN but I'm getting other issues on a different CN page [21:54:38] locally [21:54:42] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:55:11] James_F: would it be OK to delay the train until tomorrow so we can test more thoroughly? This is FR season still [21:55:20] AndyRussG: Oh, hmm. On the wmf_deploy branch? [21:55:34] on the patch that you just sent [21:55:43] Yes, but against-master or against-wmf-deploy? [21:55:47] ohhmm maybe I didn't get the last patch set [21:55:56] one sec [21:57:37] !log arlolra@deploy1001 Started deploy [parsoid/deploy@5ba7506]: (no justification provided) [21:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:47] James_F: A salty dehydration, I could row out with that. Or even better, something called "MemcachedPeclBagOStuff" stops calling our mobile... [21:58:03] James_F: on Special:CentralNotice I'm not getting the date picker widgets [21:58:06] locally, again [21:58:24] AndyRussG: https://phabricator.wikimedia.org/P9860 is the list of patches in master but not wmf_deploy in CentralNotice. Just i18n and CI changes. [22:00:01] James_F: I think those patches are unrelated to anything? It's normal for fr-tech to not merge to wmf_deploy in the giving season. [22:00:20] AndyRussG: If CN has unrelated issues and you'd rather not deploy, I'll back out the change from MW and we can roll the train without the performance improvements, instead. Does that work? [22:01:30] James_F: yes that would be better [22:01:53] James_F: there are not unrelated issues, but as awight pointed out, we don't normally deploy CN changes that accumulate in master at this time of year [22:02:07] James_F: to which mw change are you referring? 
[22:02:16] James_F: we could probably figure out what's going wrong and cherry-pick to wmf_deploy by tomorrow [22:02:17] marxarelli: fd0e2259605d2cc7846d3de1398c71843dacfa98 [22:02:31] I'll back it out of wmf.10 at least. [22:03:24] how can we be sure that won't cause other extensions to fail? [22:03:36] marxarelli: The revert? [22:03:53] ah, i see. they're simply aliases to jquery.ui? [22:03:54] marxarelli: The patch in MW that caused the issue is the removal of back-compat. [22:03:55] Yeah. [22:04:50] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@5ba7506]: (no justification provided) (duration: 07m 13s) [22:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:06] i see. makes sense logically. still, i'm a bit hesitant [22:05:59] James_F: since the promotion of group1 went otherwise smoothly, might it be better to let AndyRussG address outstanding CN issues before tomorrow's promotions? [22:06:00] This blew up because CentralNotice isn't indexed in code search as deployed code, so I assumed the lack of results showed it was no longer being used. [22:06:23] marxarelli: That'd be my preference, but it's FR-tech's call. [22:06:42] But I'd really rather we rolled group1 today rather than try to do both group1 and group2 on a Thursday. [22:06:45] marxarelli: James_F: looks like the date picker issue I thought was there is actually ok [22:06:59] Ah, good? [22:07:05] !log arlolra@deploy1001 Started deploy [parsoid/deploy@5ba7506]: (no justification provided) [22:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:41] James_F: yeah gimme just a few more minutes here...
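The fix discussed here (stop naming `jquery.ui.datepicker` and name `jquery.ui` instead, since the old names were only aliases) amounts to a small rewrite of a module's dependency list. A minimal sketch in Python, assuming that every name starting with `jquery.ui.` was such an alias. That assumption, the helper name `normalize_dependencies`, and the sample list are all illustrative; the real change was made in PHP in CentralNoticeHooks.php.

```python
from __future__ import annotations


def normalize_dependencies(deps: list[str]) -> list[str]:
    """Replace deprecated jquery.ui.* alias names with jquery.ui.

    Assumption: any name with the "jquery.ui." prefix is one of the
    removed back-compat aliases. Order is preserved and duplicates
    created by the rewrite are dropped.
    """
    seen: set[str] = set()
    result: list[str] = []
    for name in deps:
        if name.startswith("jquery.ui."):
            name = "jquery.ui"
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result


if __name__ == "__main__":
    # Hypothetical dependency list, for illustration only.
    print(normalize_dependencies(
        ["jquery.ui.datepicker", "mediawiki.util", "jquery.ui.dialog"]
    ))
```

Deduplicating matters because several old alias names collapse onto the same `jquery.ui` entry.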
[22:08:46] (03CR) 10CDanis: [C: 03+1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [22:08:57] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@5ba7506]: (no justification provided) (duration: 01m 51s) [22:08:59] AndyRussG: No worries, it's most important that your team are happy. :-) [22:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:21] well that stunk [22:10:32] arlolra: Unsuccessful [22:10:32] James_F: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/556477 [22:10:33] ? [22:10:48] yeah, had to rollback [22:10:59] AndyRussG: Yeah, should I prepare a cherry-pick to wmf_deploy? [22:11:09] James_F: ^ we can just wait for that to merge and land on the beta cluster, then cherry-pick to the wmf_deploy branch [22:11:18] * James_F nods. [22:11:50] thanks for verifying, AndyRussG [22:12:02] marxarelli: thank u and James_F! [22:12:25] Any idea why I'd suddenly be getting this with git/gerrit? sign_and_send_pubkey: signing failed: agent refused operation [22:12:26] James_F is pushing the buttons today. sorry, James_F :) [22:12:27] andyrussg@gerrit.wikimedia.org: Permission denied (publickey). [22:12:41] * James_F grins. [22:13:00] AndyRussG: No, sorry. [22:13:04] marxarelli: No worries. :-) [22:13:04] er, no. no idea [22:14:58] beta-code-update-eqiad is running now, but without my fix.
[22:18:12] * James_F goes for a quick coffee break [22:24:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:24:50] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [22:25:25] 10Operations, 10ops-codfw: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10RobH) As this is down and not calling into puppet, its been set to 'failed' in netbox. Please place back to active when its working again. [22:26:32] James_F: marxarelli: ejegg will do the cherry-pick to wmf_deploy after we see this on the beta cluster [22:26:48] cool cool [22:28:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:31:32] marxarelli: James_F: the fix works on the beta cluster [22:31:47] I'll let you know when it's merged to the wmf_deploy branch [22:33:51] marxarelli: James_F: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/556487 [22:34:00] I forgot you could also do this with the Gerrit UI [22:34:54] right on. ty! [22:35:23] oh fun.
test failure [22:35:48] (03PS1) 10Jhedden: ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) [22:36:05] AndyRussG: fails CI with "Declaration of ApiCentralNoticeChoiceDataTest::setUp() must be compatible with ApiTestCase::setUp()" [22:36:07] (03PS1) 10CDanis: mwdebug1002: add motd warning [puppet] - 10https://gerrit.wikimedia.org/r/556489 (https://phabricator.wikimedia.org/T214734) [22:36:31] Hmm. [22:38:15] (03CR) 10jerkins-bot: [V: 04-1] mwdebug1002: add motd warning [puppet] - 10https://gerrit.wikimedia.org/r/556489 (https://phabricator.wikimedia.org/T214734) (owner: 10CDanis) [22:39:00] marxarelli: I'll verify manually [22:39:32] Rebased onto https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/556490 to fix that. [22:39:53] (03PS2) 10Jhedden: ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) [22:39:55] (New-ish requirement of MediaWiki.) [22:41:46] James_F: that also didn't pass [22:41:56] (03CR) 10Urbanecm: [C: 03+1] Enable SandboxLink extension on hywwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556422 (https://phabricator.wikimedia.org/T239387) (owner: 10Majavah) [22:41:59] And of course that won't pass because it needs the parent patch. [22:42:04] But between them we can V+2. [22:42:50] (03PS1) 10BBlack: ferm: add destination support to services [puppet] - 10https://gerrit.wikimedia.org/r/556491 (https://phabricator.wikimedia.org/T240285) [22:42:52] (03PS1) 10BBlack: Refactor DNS ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556492 (https://phabricator.wikimedia.org/T240285) [22:43:17] AndyRussG: There, that passes. OK for me to deploy to prod? 
[22:43:37] (03PS2) 10CDanis: mwdebug1002: add motd warning [puppet] - 10https://gerrit.wikimedia.org/r/556489 (https://phabricator.wikimedia.org/T214734) [22:45:01] James_F: since https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/556487 incorporates both and passes, we can manually submit https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/556490 yeah? [22:45:29] James_F: marxarelli: I think this is OK, but normally I double-check all code going out to prod before we deploy [22:45:50] It looks like only code changes in CI and unit tests, is that correct? [22:46:10] Yes. [22:46:11] (other than the fix James_F submitted) [22:46:19] James_F: then yeah I think that's fine [22:46:23] Thanks for the help here! [22:46:28] It's a PHP 7.2 `: void` documentation fix, essentially. [22:46:30] Happy to help. [22:46:47] (03PS3) 10CDanis: mwdebug1002: add motd warning [puppet] - 10https://gerrit.wikimedia.org/r/556489 (https://phabricator.wikimedia.org/T214734) [22:48:45] (03CR) 10CDanis: [C: 03+2] "PCC lg: https://puppet-compiler.wmflabs.org/compiler1001/19930/" [puppet] - 10https://gerrit.wikimedia.org/r/556489 (https://phabricator.wikimedia.org/T214734) (owner: 10CDanis) [22:52:14] (03CR) 10CDanis: "Ah, sorry! I totally missed that this was an outstanding change, and did this myself with If9ca8c8." [puppet] - 10https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: 10Gergő Tisza) [22:52:35] James_F: merged! https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/556487 [22:53:06] OK, now I need to manually bump wmf.10's pointer, I think? 
[22:54:33] (03PS3) 10Jhedden: ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) [22:54:34] (03PS1) 10Jhedden: openstack: change cloudvirt1022 to ceph based virt role [puppet] - 10https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) [22:54:37] James_F: depends on whether there are security patches applied, but yes [22:55:50] marxarelli: Yeah, working on it. [22:55:58] * James_F twiddles thumbs [22:56:18] (03CR) 10CDanis: "(Also, I think this change doesn't actually work: role::debug_proxy is used by hassaleh/hassium, not the mwdebug hosts. PCC shows no-op: h" [puppet] - 10https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: 10Gergő Tisza) [22:58:10] right, debug_proxy is hassaleh/hassium, and they're also slated to be killed relatively soon [22:58:25] (ats-be doesn't go through debug_proxy, only the remaining varnish backends use those to reach mwdebug) [22:59:15] James_F: php-1.35.0-wmf.10/extensions/CentralNotice looks right [22:59:34] marxarelli: Yeah, I'm trying to remember the manual gerrit command in the absence of git review. [22:59:50] `git push origin HEAD:refs/for/[branch]` [22:59:53] `git push origin HEAD:/refs/for/wmf/1.35.0-wmf.10` or something? [22:59:58] Aha, thanks. [23:00:25] what are you pushing though? [23:00:35] The submodule update. [23:00:40] James_F: marxarelli: lmk if/when I should test anything on prod... [23:00:44] ah, gotcha [23:00:46] It's not automatic for CentralNotice, unlike anything else. [23:01:48] AndyRussG: would be great to have your eyeballs on it in a minute! [23:01:50] James_F: marxarelli: there are now CN release branches [23:02:03] AndyRussG: Yes, but they're broken by the use of wmf_deploy. 
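The manual push being recalled at 22:59 follows Gerrit's `refs/for/<branch>` convention: pushing `HEAD` to that ref opens a change for review on the named branch. A small sketch in Python that only assembles the command (it does not run it). The helper name `gerrit_push_command` is hypothetical, and the branch `wmf/1.35.0-wmf.10` is taken from James_F's guess in the chat rather than confirmed here.

```python
from __future__ import annotations


def gerrit_push_command(branch: str, remote: str = "origin") -> list[str]:
    """Build the argv for pushing HEAD to Gerrit for review on `branch`.

    The refspec has the form HEAD:refs/for/<branch>, with no slash
    before "refs", matching the form marxarelli gives in the chat.
    """
    if not branch or branch.startswith("/"):
        raise ValueError("branch must be a non-empty name without a leading slash")
    return ["git", "push", remote, f"HEAD:refs/for/{branch}"]


if __name__ == "__main__":
    # Branch name assumed from the chat; print instead of executing.
    print(" ".join(gerrit_push_command("wmf/1.35.0-wmf.10")))
```

Returning a list instead of a string means the result can be passed to `subprocess.run` without shell quoting concerns, should someone want to execute it.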
[23:02:04] so in fact you should cherry-pick to those [23:02:12] they worked until recently [23:02:19] Oh, right, I could just CP to wmf.10 [23:02:22] they are created from wmf_deploy [23:02:24] That's simpler than what I'm doing. [23:02:48] "recent" change, as of earlier this year [23:03:54] (03PS2) 10BBlack: ferm: add destination support to services [puppet] - 10https://gerrit.wikimedia.org/r/556491 (https://phabricator.wikimedia.org/T240285) [23:03:56] (03PS2) 10BBlack: Refactor DNS ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556492 (https://phabricator.wikimedia.org/T240285) [23:04:56] Thanks for the hint, AndyRussG. [23:06:10] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10Krinkle) This question isn't directly related but might help indirectly clear some confusion: Who will pay for the domain name when it expires? (Noting that DNS is where... [23:08:13] OK! [23:08:36] marxarelli: The only way to test this is to hand-move meta to wmf.10 on mwdebug1001, yes? [23:08:51] Or just do the sync and the whole train and check then? [23:08:57] (03CR) 10BBlack: [C: 03+2] ferm: add destination support to services [puppet] - 10https://gerrit.wikimedia.org/r/556491 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:09:01] (03CR) 10BBlack: [C: 03+2] Refactor DNS ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556492 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:09:23] James_F: it's up to AndyRussG i think [23:09:41] regarding wmf_deploy, the branch script simply uses origin/HEAD when branching, so gerrit has HEAD set to wmf_deploy for that repo [23:09:52] twentyafterfour: Yeah. [23:10:13] it's simpler than it used to be :) [23:10:16] AndyRussG: Is a few minutes' disruption OK? Hand-editing wikiversions feels risky. 
[23:13:14] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/CentralNotice/includes/CentralNoticeHooks.php: T240505 Remove CentralNotice's used of deprecated jquery.ui module aliases (duration: 01m 25s) [23:13:17] (03CR) 10Jhedden: [C: 04-1] "-1 until I can find stretch packages for ceph nautilus" [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [23:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:19] T240505: Unbreak now (please) : newly broken CentralNotice admin interface - https://phabricator.wikimedia.org/T240505 [23:13:42] James_F: should be ok [23:14:02] OK, let's doit. [23:15:24] PROBLEM - Check systemd state on dns4002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:41] dns4002 i me [23:15:43] (03PS1) 10BBlack: p::dns::recursor: bugfix for new ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556498 (https://phabricator.wikimedia.org/T240285) [23:15:44] dns4002 is me [23:15:53] (03PS1) 10Jforrester: group1 wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556499 [23:15:55] (03CR) 10Jforrester: [C: 03+2] group1 wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556499 (owner: 10Jforrester) [23:16:04] (03CR) 10BBlack: [V: 03+2 C: 03+2] p::dns::recursor: bugfix for new ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556498 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:16:45] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556499 (owner: 10Jforrester) [23:18:13] AndyRussG: Seems fixed (mwdebug1001). 
[23:18:43] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.10 [23:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:46] !log jforrester@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.10 (duration: 01m 02s) [23:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:46] RECOVERY - Check systemd state on dns4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:04] (03PS1) 10BBlack: p::dns::recursor: bugfix for new ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556500 (https://phabricator.wikimedia.org/T240285) [23:21:20] (03CR) 10BBlack: [V: 03+2 C: 03+2] p::dns::recursor: bugfix for new ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/556500 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:22:53] James_F: Hm.. how come that didn't show up in Codesearch? [23:22:58] (CN use of jqui) [23:23:08] Krinkle: Because CN isn't in code search. [23:23:12] Krinkle: *sighs* [23:23:21] I recall it using the "wrong" branch but at least included with master, no? [23:23:26] Apparently not. [23:23:37] crap, indeed [23:23:56] I don't understand why we don't just trim the auth list for master and use that? [23:24:00] James_F: cool! yes works for me :) [23:24:02] Rather than wmf_deploy. [23:24:05] AndyRussG: Excellent. [23:24:36] yay! [23:24:37] Krinkle: can it index origin/HEAD instead of origin/master? CN's origin/HEAD is wmf_deploy [23:24:37] both wmf_deploy and master had the same references [23:24:52] marxarelli: It can't, that's an outstanding request upstream. [23:25:02] i see [23:25:16] But given we now version CentralNotice again, why not just version it off master and lock down the list of people who can merge into that repo very seriously? 
[23:25:41] It's quite disruptive to have one special-snowflake extension alongside the 190 others that aren't. [23:26:04] (03CR) 10Jforrester: [C: 03+2] Deploy DiscussionTools: Part I, extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556414 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:26:29] (03PS1) 10CDanis: debian release 1.3.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/556501 [23:26:51] (03Merged) 10jenkins-bot: Deploy DiscussionTools: Part I, extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556414 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:27:18] (03PS2) 10Jforrester: Deploy DiscussionTools: Part II, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556415 (https://phabricator.wikimedia.org/T240468) [23:27:30] (03CR) 10Jforrester: [C: 03+2] Deploy DiscussionTools: Part II, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556415 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:28:06] (03PS2) 10Jforrester: Deploy DiscussionTools: Part III, CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556416 (https://phabricator.wikimedia.org/T240468) [23:28:19] (03Merged) 10jenkins-bot: Deploy DiscussionTools: Part II, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556415 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:29:17] (03CR) 10CDanis: [C: 03+2] debian release 1.3.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/556501 (owner: 10CDanis) [23:29:19] (03CR) 10Jforrester: [C: 03+2] Deploy DiscussionTools: Part III, CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556416 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:30:01] James_F: I think it'd be more disruptive to not be able to have stuff merge, like CI update and such, when we're not deploying [23:30:10] !log 
jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wmgUseDiscussionTools false everywhere T240468 (duration: 01m 03s) [23:30:11] James_F: that said, if you'd like to file a task, that's a fine place to discuss [23:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:16] T240468: Deploy DiscussionTools to beta cluster - https://phabricator.wikimedia.org/T240468 [23:30:25] I think everyone's open to hearing proposals that would make things smoother [23:30:36] (03PS2) 10Jforrester: [BETA] Enable DiscussionTools on Beta English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556417 (https://phabricator.wikimedia.org/T240468) [23:30:37] (03Merged) 10jenkins-bot: Deploy DiscussionTools: Part III, CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556416 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:31:02] and thanks again [23:33:40] marxarelli: hound-search hardcodes ref=master [23:34:50] ah i see [23:37:15] (03PS4) 10Jhedden: ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) [23:38:41] James_F: Can you share a codesearch query where CentralNotice should appear but doesn't? I wasn't able to reproduce that, myself. [23:40:12] awight: https://codesearch.wmflabs.org/deployed/?q=%22Adam%20Roses%20Wight%22&i=nope&files=&repos= ;-) [23:41:06] James_F: https://codesearch.wmflabs.org/search/?q=bannerEditor&i=nope&files=&repos= [23:41:17] It's the "deployed" selector, whatever that is. [23:41:25] awight: Yes, in the *everything* search it's missing. [23:41:28] Err [23:41:37] (03CR) 10Jhedden: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [23:41:48] In the *everything* search it's present, but it's missing from *deployed*, which is what we actually test against. 
[23:42:08] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Add ability to load the DiscussionTools extension, disabled everywhere T240468 (duration: 01m 02s) [23:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:15] T240468: Deploy DiscussionTools to beta cluster - https://phabricator.wikimedia.org/T240468 [23:42:25] (03CR) 10Jforrester: [C: 03+2] [BETA] Enable DiscussionTools on Beta English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556417 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:42:27] James_F: Thanks for explaining! [23:42:44] awight: Always. :-) [23:43:16] (03Merged) 10jenkins-bot: [BETA] Enable DiscussionTools on Beta English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556417 (https://phabricator.wikimedia.org/T240468) (owner: 10Jforrester) [23:46:56] James_F: I'm finally catching up: https://phabricator.wikimedia.org/diffusion/LCSH/browse/master/write_config.py$99 and of source, CentralNotice is the one remaining special_extension [23:47:10] err *of course [23:47:22] awight: Yeah. :-( [23:48:03] (03PS2) 10Jhedden: openstack: change cloudvirt1022 to ceph based virt role [puppet] - 10https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) [23:49:41] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Rdbms, 10Core Platform Team Workboards (Clinic Duty Team): WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10Jdforrester-WMF) Production Commons is now running... 
[23:50:35] I think extensions/VisualEditor/lib/ve is also missing: https://codesearch.wmflabs.org/deployed/?q=https%3A%2F%2Fgerrit.wikimedia.org%2Fr%2Fp%2FVisualEditor%2FVisualEditor.git&i=nope&files=&repos= [23:52:45] (03PS5) 10Jhedden: ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) [23:58:41] (03PS1) 10Jhedden: add fake ceph rbd client key [labs/private] - 10https://gerrit.wikimedia.org/r/556503 [23:59:11] (03CR) 10Jhedden: [V: 03+2 C: 03+2] add fake ceph rbd client key [labs/private] - 10https://gerrit.wikimedia.org/r/556503 (owner: 10Jhedden)