[00:15:57] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (Papaul) @Eevans Thank you. below tracking information for returned disk {F31783583}
[00:19:30] Operations, SRE-Access-Requests, LDAP: Add uid=srodlund,ou=people,dc=wikimedia,dc=org to cn=wmf,ou=groups,dc=wikimedia,dc=org - https://phabricator.wikimedia.org/T251163 (bd808) >>! In T251163#6087030, @Aklapper wrote: > See https://phabricator.wikimedia.org/project/profile/956/ which links to https:...
[01:04:31] (PS1) Ssingh: Add `setuptools_scm' to setup.py to manage the version number [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/592797
[01:05:41] (CR) Ssingh: [C: +2] Add `setuptools_scm' to setup.py to manage the version number [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/592797 (owner: Ssingh)
[01:13:14] PROBLEM - dump of es4 in codfw on db1115 is CRITICAL: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-04-28 00:00:01 is 189 GB, but previous one was 163 GB, a change of 15.9% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:15:42] PROBLEM - dump of es5 in codfw on db1115 is CRITICAL: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2020-04-28 00:00:01 is 167 GB, but previous one was 141 GB, a change of 18.4% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:26:56] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:28:56] PROBLEM - Host cp5012 is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:22] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:15:19] Updating cxserver..
[04:15:52] (PS2) KartikMistry: Update cxserver to 2020-04-27-061703-production [deployment-charts] - https://gerrit.wikimedia.org/r/592665 (https://phabricator.wikimedia.org/T249852)
[04:16:48] (CR) KartikMistry: [C: +2] Update cxserver to 2020-04-27-061703-production [deployment-charts] - https://gerrit.wikimedia.org/r/592665 (https://phabricator.wikimedia.org/T249852) (owner: KartikMistry)
[04:17:04] (Merged) jenkins-bot: Update cxserver to 2020-04-27-061703-production [deployment-charts] - https://gerrit.wikimedia.org/r/592665 (https://phabricator.wikimedia.org/T249852) (owner: KartikMistry)
[04:18:34] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
[04:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:28] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[04:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:09] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[04:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:41] !log Updated cxserver to 2020-04-27-061703-production (T249852)
[04:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:47] T249852: Enable Google Translate support in Content Translation for Amharic, Kyrgyz, Luxembourgish, Scots Gaelic and Xhosa - https://phabricator.wikimedia.org/T249852
[04:59:29] !log depool and powercycle cp5012
[04:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:58] RECOVERY - Host cp5012 is UP: PING OK - Packet loss = 0%, RTA = 231.21 ms
[05:19:27] Operations, ops-eqsin, Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (Vgutierrez)
[05:19:57] Operations, ops-eqsin, Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (Vgutierrez) p:Triage→Medium
[05:29:43] (PS1) Marostegui: Revert "install_server: Allow reimage of labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/592808
[05:30:01] (CR) jerkins-bot: [V: -1] Revert "install_server: Allow reimage of labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/592808 (owner: Marostegui)
[05:30:17] oh come on jenknis...
[05:30:23] (Abandoned) Marostegui: Revert "install_server: Allow reimage of labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/592808 (owner: Marostegui)
[05:33:07] (PS1) Marostegui: install_server: Do not reimage labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/592809
[05:34:19] (CR) Marostegui: [C: +2] install_server: Do not reimage labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/592809 (owner: Marostegui)
[05:34:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P11054 and previous config saved to /var/cache/conftool/dbconfig/20200428-053453-marostegui.json
[05:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:32] !log Deploy schema change on db1112
[05:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:34] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:40:14] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:42:13] !log Restart labsdb1011 with innodb_purge_threads set to 10 - T249188
[05:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:18] T249188: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188
[05:50:12] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[05:52:25] ^ me
[05:52:58] !log Reclone labsdb1011 from labsdb1012 - T249188
[05:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:04] T249188: Reimage labsdb1011 to
Buster and 10.4 - https://phabricator.wikimedia.org/T249188
[05:53:38] marostegui: how dare you :D
[05:53:46] elukey: I sent an email!
[05:53:51] ahaahha yes yes I am kidding
[05:54:27] elukey: I think it is time for you to open an account on the mariadb-dev jira, you guys are helping out to chase a bug down!
[05:55:03] marostegui: hard pass, I have enough upstream accounts to report bug to for the moment :D
[05:55:17] :_(
[05:56:54] elukey: I thought you were a mariadb lover
[05:57:12] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (elukey) @Cmjohnson thanks a lot for the work, if you have time to prioritize these two nodes during the next days it would be super helpful (the elastic cluster is a little bit under...
[05:57:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P11056 and previous config saved to /var/cache/conftool/dbconfig/20200428-055719-marostegui.json
[05:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:27] !log Deploy schema change on s4 codfw, this will generate lag on codfw - T250055
[06:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:32] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055
[06:14:10] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:19:06] about --^ there is a big reindex ongoing, and we are testing new GC settings, that don't seem to help much
[06:19:32] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:19:36]
ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known issue https://wikitech.wikimedia.org/wiki/HAProxy
[06:25:31] (PS1) Marostegui: install_server: Allow reimage db1105 [puppet] - https://gerrit.wikimedia.org/r/592810 (https://phabricator.wikimedia.org/T250666)
[06:35:19] !log Deploy schema change on s3 master with replication for the wikis at T250071#6051598 - T250071
[06:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:25] T250071: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071
[06:54:40] (PS1) Giuseppe Lavagetto: Revert "Enable LCStoreStaticArray on depooled mw1407 for benchmarking" [mediawiki-config] - https://gerrit.wikimedia.org/r/592867
[06:57:40] Operations, netops, observability: Upgrade LibreNMS to 1.63 - https://phabricator.wikimedia.org/T251222 (ayounsi) p:Triage→Low
[06:59:20] <_joe_> elukey: I can't say I find that alert very informative
[06:59:31] <_joe_> "WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS"
[07:00:07] <_joe_> (not blaming you, just noticing)
[07:00:17] (PS2) Giuseppe Lavagetto: Revert "Enable LCStoreStaticArray on depooled mw1407 for benchmarking" [mediawiki-config] - https://gerrit.wikimedia.org/r/592867
[07:03:04] _joe_ ah yes may need some rewording, it just checks the https ports of cloudelastic.wikimedia.org
[07:08:31] I think the intent is to indicate that MW (jobrunners?)
can't reach elastic for WMF cloud
[07:23:20] (CR) Jcrespo: [C: +1] install_server: Allow reimage db1105 [puppet] - https://gerrit.wikimedia.org/r/592810 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:23:36] (CR) Marostegui: [C: +2] install_server: Allow reimage db1105 [puppet] - https://gerrit.wikimedia.org/r/592810 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:24:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311 and 3312 for reimage', diff saved to https://phabricator.wikimedia.org/P11057 and previous config saved to /var/cache/conftool/dbconfig/20200428-072416-marostegui.json
[07:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:23] (PS1) Marostegui: db1105: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592871
[07:26:14] (CR) Marostegui: [C: +2] db1105: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592871 (owner: Marostegui)
[07:26:21] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/592641 (owner: Jbond)
[07:26:50] !log Reimage db1105
[07:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:03] Operations, ops-eqsin, Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (Vgutierrez)
[07:32:18] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: Jbond)
[07:36:51] @seen hashar
[07:36:51] mutante: I have never seen hashar
[07:39:11] (CR) Dzahn: [C: +2] Use noc@ not webmaster@ in lists.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/592734 (https://phabricator.wikimedia.org/T251005) (owner: Reedy)
[07:41:30] (CR) Dzahn: "There are only very few remaining places using the apache module.
After these are replaced with httpd module it should be dropped entirely" [puppet] - https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: Reedy)
[07:43:10] (CR) KartikMistry: [C: -1] "Please don't deploy this as of now. See: https://phabricator.wikimedia.org/T246383#6087778" [mediawiki-config] - https://gerrit.wikimedia.org/r/592479 (https://phabricator.wikimedia.org/T246383) (owner: VulpesVulpes825)
[07:43:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:46] (CR) Dzahn: [WIP] Use noc@ not webmaster@ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: Reedy)
[07:44:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:00] (PS1) Jcrespo: install_server: Revert reimage of db1105, fix reimage of db2102 [puppet] - https://gerrit.wikimedia.org/r/592872 (https://phabricator.wikimedia.org/T250666)
[07:46:22] mutante: enwp’s rss parses horribly but not invalid
[07:47:03] RhinosF1: how do you check that?
[07:47:21] mutante: see wm-bot’s logs of #ZppixBot
[07:47:22] btw, it happens with both RSS and Atom
[07:47:29] (PS2) Jcrespo: install_server: Revert reimage of db1105, fix reimage of db2102 [puppet] - https://gerrit.wikimedia.org/r/592872 (https://phabricator.wikimedia.org/T250666)
[07:47:53] mutante: http://wm-bot.wmflabs.org/browser/index.php?start=04%2F28%2F2020&end=04%2F28%2F2020&display=%23ZppixBot
[07:48:35] RhinosF1: ah, i see. and if you add a miraheze wiki then it fails as i reported? or it just changed since i reported?
[07:48:52] mutante: let me try
[07:49:29] (CR) Marostegui: [C: +1] install_server: Revert reimage of db1105, fix reimage of db2102 [puppet] - https://gerrit.wikimedia.org/r/592872 (https://phabricator.wikimedia.org/T250666) (owner: Jcrespo)
[07:49:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:05] (CR) Jcrespo: [C: +2] install_server: Revert reimage of db1105, fix reimage of db2102 [puppet] - https://gerrit.wikimedia.org/r/592872 (https://phabricator.wikimedia.org/T250666) (owner: Jcrespo)
[07:51:40] RhinosF1: i somehow doubt miraheze custom-hacked their feed, but who knows. or maybe it's the MediaWiki version
[07:51:53] or wm-bot got fixed meanwhile.. shrug
[07:51:56] mutante: we’re 1.34.1
[07:52:35] RhinosF1: either way, if you can't reproduce the bug anymore then feel free to close it as resolved
[07:52:38] <_joe_> !log running benchmarks on mw1407 (LCStoreStaticArray) and mw1409 (LCStoreCDB) for T99740: restart php-fpm, pool for 5 minutes to warmup caches, then depool both servers.
[07:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:45] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740
[07:53:24] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:53:25] mutante: closing, works fine
[07:53:35] RhinosF1: alright, thanks for checking
[07:53:36] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (jcrespo) @CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple: ` grub-install /dev/sdb ` (I am assuming sda already has it) fix the issue?
[07:53:40] Reception123: thanks for testing that and spamming logs
[07:53:53] no problem
[07:54:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:58:22] Operations, DNS, Traffic, WMF-Annual-Report (TransparencyReport): Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362 (Dzahn) @APalmer_WMF @Prtksxna Hi! I just got back to this ticket because we need to move the transparen...
[07:58:24] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for squid [puppet] - https://gerrit.wikimedia.org/r/592619 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:58:50] (PS1) Marostegui: Revert "db1105: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/592874
[08:02:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 85, down: 1, dormant: 0, excluded: 2, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:03:30] PROBLEM - PHP opcache health on mw1407 is CRITICAL: CRITICAL: opcache full. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:04:59] Operations, netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) >>! In T200277#6084500, @faidon wrote: > Interesting idea! Couple of notes: > - What do you mean by "virtual links" and Netbox not supporting them? Is that VLANs for our transports over the PtMP VPLS? Yes, both PtMP VP...
[08:05:40] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:06:11] !log installing qemu security updates
[08:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:55] (CR) Marostegui: [C: +2] Revert "db1105: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/592874 (owner: Marostegui)
[08:09:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repoo db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11058 and previous config saved to /var/cache/conftool/dbconfig/20200428-080920-marostegui.json
[08:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:49] (PS1) Kormat: db2124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592876 (https://phabricator.wikimedia.org/T250666)
[08:10:33] (CR) Marostegui: [C: +1] db2124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592876 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[08:11:14] (CR) Kormat: [C: +2] db2124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592876 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[08:11:55] (CR) Jcrespo: "> Patch Set 4:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/592247 (owner: Jcrespo)
[08:13:09] !log rsyncing transparency-report-private files from bromine to miscweb1002/2002. git-cloning was removed about a year ago but site still exists.
need to figure out if it should be deleted (T188362 T247650)
[08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:17] T188362: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362
[08:13:17] T247650: replace bromine and vega with buster VMs - https://phabricator.wikimedia.org/T247650
[08:13:45] !log reimaging db2124 to buster T250666
[08:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:52] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666
[08:17:38] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime
[08:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:20:06] Operations, netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (ayounsi)
[08:20:09] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:33] !log restarting blazegraph on wdqs1007 (T242453)
[08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:41] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453
[08:22:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:22:15] Operations, netops: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (ayounsi)
[08:22:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 87, down: 0,
dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:24:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repoo db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11059 and previous config saved to /var/cache/conftool/dbconfig/20200428-082420-marostegui.json
[08:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 85, down: 1, dormant: 0, excluded: 2, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:29:49] <_joe_> XioNoX: known? ^^
[08:29:57] nop
[08:30:05] RECOVERY - PHP opcache health on mw1407 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:30:20] <_joe_> !log restarting php-fpm on mw1407 and mw1409 again, then running traffic on them for 1 hour.
[08:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:04] _joe_: that's the main eqiad-esams link, things should fail over cleanly if it's not flapping
[08:31:31] Scheduled Maintenance Window #: 18305447-1
[08:31:45] <_joe_> ack :)
[08:34:28] (PS2) JMeybohm: * Add a script to simplify imports of new upstream versions * Remove override_dh_auto_configure * Drop patches no longer needed * Remove unneeded build dependencies * Install helm binary as helm2, use alternatives [debs/helm] - https://gerrit.wikimedia.org/r/592689
[08:35:57] (PS3) JMeybohm: Build with vendor, improve packaging [debs/helm] - https://gerrit.wikimedia.org/r/592689
[08:36:16] (Abandoned) JMeybohm: Clean up debian directory [debs/helm] - https://gerrit.wikimedia.org/r/592690 (owner: JMeybohm)
[08:36:20] !log deleting wikidatawiki_content_1587076410 from cloudelastic
[08:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11060 and previous config saved to /var/cache/conftool/dbconfig/20200428-084041-marostegui.json
[08:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:23] (PS1) Marostegui: db1105: Reimage to buster [puppet] - https://gerrit.wikimedia.org/r/592881
[08:42:41] (CR) Jbond: [C: +2] override manifests dir: allow passing manifest_dir to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/584934 (https://phabricator.wikimedia.org/T248689) (owner: Jbond)
[08:43:22] (CR) Marostegui: [C: +2] db1105: Reimage to buster [puppet] - https://gerrit.wikimedia.org/r/592881 (owner: Marostegui)
[08:44:27] <_joe_> jouncebot: next
[08:44:27] In 2 hour(s) and 15 minute(s): European Mid-day SWAT(Max 6 patches)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1100)
[08:46:10] (PS1) Jbond: 0.7.3: prepare release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/592882
[08:47:31] (CR) Jbond: [C: +2] 0.7.3: prepare release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/592882 (owner: Jbond)
[08:50:04] (PS1) Dzahn: httpbb: add tests for miscweb sites [puppet] - https://gerrit.wikimedia.org/r/592883 (https://phabricator.wikimedia.org/T247650)
[08:51:50] (PS1) Kormat: install_server: Allow reimage db2124 [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666)
[08:53:09] (PS3) QEDK: Enable VisualEditor for more namespaces on vecwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/592427 (https://phabricator.wikimedia.org/T250419)
[08:53:54] (CR) Volans: "To go for buster you have to also remove the override in our DHCP config for this host:" [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[08:54:37] (PS1) Muehlenhoff: Fix Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/592886
[08:55:39] !log re-set lost licenses on asw2-a/b-eqiad
[08:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:39] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/592886 (owner: Muehlenhoff)
[08:56:41] (CR) Marostegui: [C: +1] install_server: Allow reimage db2124 [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[08:56:51] (CR) Marostegui: [C: +1] "> To go for buster you have to also remove the override in our DHCP" [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[08:58:00] (PS2) Kormat: install_server: Allow reimage db2124 [puppet] - https://gerrit.wikimedia.org/r/592884
(https://phabricator.wikimedia.org/T250666)
[08:59:00] (CR) Muehlenhoff: [C: +2] Fix Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/592886 (owner: Muehlenhoff)
[08:59:23] (CR) Marostegui: "The idea of doing this in a single commit is to be able to revert this one by clicking just revert on gerrit, that's why we normally just " [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[09:01:31] (PS3) Kormat: install_server: Allow reimage db2124 [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666)
[09:02:28] (CR) Kormat: [C: +2] install_server: Allow reimage db2124 [puppet] - https://gerrit.wikimedia.org/r/592884 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[09:05:17] (PS1) Kormat: install_server: switch db2124 to buster [puppet] - https://gerrit.wikimedia.org/r/592887 (https://phabricator.wikimedia.org/T250666)
[09:05:51] (CR) Marostegui: [C: -1] "You can simply delete the line, by default it will pick buster" [puppet] - https://gerrit.wikimedia.org/r/592887 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[09:06:49] (PS1) Elukey: Analytics Presto: avoid query failures due to corrupted statistics [puppet] - https://gerrit.wikimedia.org/r/592888
[09:06:56] (PS2) Kormat: install_server: switch db2124 to buster [puppet] - https://gerrit.wikimedia.org/r/592887 (https://phabricator.wikimedia.org/T250666)
[09:08:12] (CR) Marostegui: [C: +1] install_server: switch db2124 to buster [puppet] - https://gerrit.wikimedia.org/r/592887 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[09:08:35] (CR) Kormat: [C: +2] install_server: switch db2124 to buster [puppet] - https://gerrit.wikimedia.org/r/592887 (https://phabricator.wikimedia.org/T250666) (owner: Kormat)
[09:08:52] (CR) Elukey: [C: +2] Analytics Presto: avoid query failures due to corrupted statistics
[puppet] - https://gerrit.wikimedia.org/r/592888 (owner: Elukey)
[09:10:01] (PS1) Dzahn: show program name in usage text [software/httpbb] - https://gerrit.wikimedia.org/r/592889
[09:10:46] elukey: hey. is it ok to puppet-merge your changes?
[09:10:53] kormat: yep! thanks!
[09:11:09] grand, done :)
[09:11:18] (CR) jerkins-bot: [V: -1] show program name in usage text [software/httpbb] - https://gerrit.wikimedia.org/r/592889 (owner: Dzahn)
[09:11:53] (CR) Jbond: "updated" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:12:01] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:12:39] !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers
[09:12:39] !log elukey@cumin1001 END (FAIL) - Cookbook sre.presto.roll-restart-workers (exit_code=99)
[09:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:49] !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers
[09:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:36] (CR) Jbond: [C: +2] mcrouter: add default timeout values [puppet] - https://gerrit.wikimedia.org/r/592641 (owner: Jbond)
[09:14:09] (PS1) Urbanecm: Allow bdwikimedia bureaucrats to revoke sysop flag [mediawiki-config] - https://gerrit.wikimedia.org/r/592890 (https://phabricator.wikimedia.org/T251078)
[09:14:45] (PS2) Dzahn: show program name in usage text [software/httpbb] - https://gerrit.wikimedia.org/r/592889
[09:15:43] (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/592890 (https://phabricator.wikimedia.org/T251078) (owner: Urbanecm)
[09:16:10] (CR) jerkins-bot: [V: -1] show program name in
usage text [software/httpbb] - https://gerrit.wikimedia.org/r/592889 (owner: Dzahn)
[09:18:35] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:18:43] (PS2) Dzahn: httpbb: add tests for miscweb sites [puppet] - https://gerrit.wikimedia.org/r/592883 (https://phabricator.wikimedia.org/T247650)
[09:20:53] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool db2124 T250666', diff saved to https://phabricator.wikimedia.org/P11063 and previous config saved to /var/cache/conftool/dbconfig/20200428-092052-kormat.json
[09:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:03] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666
[09:22:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
[09:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:59] (CR) Dzahn: "[cumin1001:/srv/deployment/httpbb-tests] $ httpbb /tmp/test_miscweb.yaml --hosts=vega.codfw.wmnet,bromine.eqiad.wmnet,miscweb1002.eqiad.wm" [puppet] - https://gerrit.wikimedia.org/r/592603 (https://phabricator.wikimedia.org/T247650) (owner: Dzahn)
[09:25:31] (CR) Dzahn: [C: +2] ATS: switch backends for the 3 transparency sites [puppet] - https://gerrit.wikimedia.org/r/592603 (https://phabricator.wikimedia.org/T247650) (owner: Dzahn)
[09:25:40] (PS4) Dzahn: ATS: switch backends for the 3 transparency sites [puppet] - https://gerrit.wikimedia.org/r/592603 (https://phabricator.wikimedia.org/T247650)
[09:30:09] (PS1) Ema: Retry requests in case of errors up to sendAttempts times [software/purged] - https://gerrit.wikimedia.org/r/592891 (https://phabricator.wikimedia.org/T249583)
[09:32:28] !log running puppet on cp-ats for backend config change
[09:32:32] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:55] !log depool wdqs1007 to catch up on lag a bit
[09:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:12] (PS1) Jbond: idp: refactopr to make idp_secondary optional [puppet] - https://gerrit.wikimedia.org/r/592892
[09:41:54] (PS3) Dzahn: show program name in usage text [software/httpbb] - https://gerrit.wikimedia.org/r/592889
[09:42:12] (CR) Elukey: profile::idp: add mcrouter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: Jbond)
[09:43:06] (PS1) Jbond: populatedb: log output of /dev/null manifest_dir [software/puppet-compiler] - https://gerrit.wikimedia.org/r/592893
[09:43:45] (CR) Dzahn: "for scholarships and iegreview i actually wanted better assertions but for some reason i don't get the same results with curl and httpbb. " [puppet] - https://gerrit.wikimedia.org/r/592883 (https://phabricator.wikimedia.org/T247650) (owner: Dzahn)
[09:46:25] (PS2) Muehlenhoff: cumin: Fix Python version for Buster and remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/584587
[09:46:36] (CR) Jbond: [C: +2] populatedb: log output of /dev/null manifest_dir [software/puppet-compiler] - https://gerrit.wikimedia.org/r/592893 (owner: Jbond)
[09:46:54] (PS2) Ema: Retry requests in case of errors up to sendAttempts times [software/purged] - https://gerrit.wikimedia.org/r/592891 (https://phabricator.wikimedia.org/T249583)
[09:48:03] PROBLEM - Check systemd state on an-presto1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:03] PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:28] this is me --^ [09:50:00] (03CR) 10Vgutierrez: [C: 03+1] Retry requests in case of errors up to sendAttempts times [software/purged] - 10https://gerrit.wikimedia.org/r/592891 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [09:50:03] RECOVERY - Check systemd state on an-presto1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:03] RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:28] !log starting branch cut for train [09:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:52] (03CR) 10Ema: [C: 03+2] Retry requests in case of errors up to sendAttempts times [software/purged] - 10https://gerrit.wikimedia.org/r/592891 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [09:55:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:43] (03PS1) 10Jbond: populate_puppetdb: add --host paramter to run for single host [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/592894 [09:56:08] (03CR) 10jerkins-bot: [V: 04-1] populate_puppetdb: add --host paramter to run for single host [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/592894 (owner: 10Jbond) [09:56:30] (03PS1) 10Ema: Release 0.9 [software/purged] - 10https://gerrit.wikimedia.org/r/592895 [09:57:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:49] (03PS2) 10Jbond: populate_puppetdb: add --host paramter to run for single host [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/592894 [10:00:01] (03CR) 10Jbond: [C: 03+2] 
populate_puppetdb: add --host paramter to run for single host [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/592894 (owner: 10Jbond) [10:00:38] 10Operations, 10LDAP-Access-Requests, 10LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Samwalton9) Hm. I have no idea what my password is for that account - is there somewhere I can reset it? [10:01:18] (03CR) 10Ema: [C: 03+2] Release 0.9 [software/purged] - 10https://gerrit.wikimedia.org/r/592895 (owner: 10Ema) [10:02:40] (03CR) 10Jbond: [C: 03+2] idp: refactopr to make idp_secondary optional [puppet] - 10https://gerrit.wikimedia.org/r/592892 (owner: 10Jbond) [10:04:19] 10Operations, 10LDAP-Access-Requests, 10LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10MoritzMuehlenhoff) >>! In T250189#6088157, @Samwalton9 wrote: > Hm. I have no idea what my password is for that account - is there somewhere I can reset it? You can use ht... [10:05:38] !log 1.35.0-wmf.30 was branched at ffc8e887573d7b288067b263c5b6047b2b2db081 for T249962 [10:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:44] T249962: 1.35.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T249962 [10:06:11] 10Puppet, 10Wikimedia Meet: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Dzahn) The longer we wait moving it the more it will be an issue to lose git history. Things that are temporary have a habit of becoming permanent. [10:07:35] 10Operations, 10LDAP-Access-Requests, 10LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Samwalton9) Per the above, I don't have a Wikitech account, just an LDAP login, so the Wikitech password reset form won't work. 
[10:08:01] !log upload purged 0.9 to buster-wikimedia [10:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:25] <_joe_> !log starting benchmarks for light page on mw140{7,9} [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:52] (03PS1) 10Muehlenhoff: Add separate repository sync definition for HP RAID tools on Buster [puppet] - 10https://gerrit.wikimedia.org/r/592900 [10:18:31] 10Puppet, 10Wikimedia Meet: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Dzahn) requested import on https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests [10:19:05] (03PS1) 10Kormat: Revert "install_server: Allow reimage db2124" [puppet] - 10https://gerrit.wikimedia.org/r/592901 (https://phabricator.wikimedia.org/T250666) [10:19:25] (03CR) 10Marostegui: [C: 03+1] Revert "install_server: Allow reimage db2124" [puppet] - 10https://gerrit.wikimedia.org/r/592901 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:20:51] (03CR) 10Jcrespo: [C: 03+1] Add separate repository sync definition for HP RAID tools on Buster [puppet] - 10https://gerrit.wikimedia.org/r/592900 (owner: 10Muehlenhoff) [10:21:07] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Allow reimage db2124" [puppet] - 10https://gerrit.wikimedia.org/r/592901 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:21:52] (03CR) 10Dzahn: [C: 03+1] "looks good to me and in compiler: https://puppet-compiler.wmflabs.org/compiler1001/22170/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/591322 (owner: 10Jbond) [10:23:54] (03CR) 10Ayounsi: [C: 03+2] Cleanup unused policy-statements [homer/public] - 10https://gerrit.wikimedia.org/r/591414 (owner: 10Ayounsi) [10:25:26] (03PS5) 10Jbond: profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [10:25:43] (03PS6) 10Jbond: apero_cas: alow ability to use 
memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [10:25:53] (03PS5) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) [10:26:41] (03CR) 10Jbond: "updated, ready for review" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [10:26:46] (03CR) 10Jbond: [C: 03+2] phabricator: remove srcaddr in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/591322 (owner: 10Jbond) [10:27:38] (03PS1) 10Jcrespo: mariadb: Disallow reimage of db2102, reimage new backup1002 [puppet] - 10https://gerrit.wikimedia.org/r/592903 (https://phabricator.wikimedia.org/T250816) [10:28:13] !log repool wdqs1007 (lag caught up) [10:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:35] !log liw@deploy1001 Pruned MediaWiki: 1.35.0-wmf.30 (duration: 01m 27s) [10:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:37] (03PS1) 10Jbond: phab: correct syntax error in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/592905 [10:32:39] (03CR) 10Jcrespo: "CCing in case it interacts with some of your deployments." 
[puppet] - 10https://gerrit.wikimedia.org/r/592903 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [10:33:57] (03PS1) 10Kormat: Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/592906 (https://phabricator.wikimedia.org/T250666) [10:34:13] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disallow reimage of db2102, reimage new backup1002 [puppet] - 10https://gerrit.wikimedia.org/r/592903 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [10:34:26] (03PS1) 10Muehlenhoff: Add a second IDP staging host [dns] - 10https://gerrit.wikimedia.org/r/592907 (https://phabricator.wikimedia.org/T233930) [10:34:28] (03CR) 10Kormat: [C: 03+2] Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/592906 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:34:40] <_joe_> !log running main_page test on mw1407,9 [10:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:09] (03CR) 10Dzahn: [C: 03+1] phab: correct syntax error in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/592905 (owner: 10Jbond) [10:36:21] (03CR) 10Jbond: [C: 03+2] phab: correct syntax error in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/592905 (owner: 10Jbond) [10:36:34] thanks mutante [10:37:10] yw, jbond42 [10:38:12] (03PS1) 10Ayounsi: Python 3.8 support [software/homer] - 10https://gerrit.wikimedia.org/r/592909 [10:38:47] <_joe_> !log running load.php test on mw1407,9 [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:43] (03CR) 10Volans: [C: 03+1] "LGTM if CI is happy :)" [software/homer] - 10https://gerrit.wikimedia.org/r/592909 (owner: 10Ayounsi) [10:39:45] !log cp-text: upgrade purged to 0.9 and restart [10:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:12] !log remove unused policy-statements from routers [10:40:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:31] (03CR) 10Jbond: "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/592907 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [10:41:51] (03CR) 10Ayounsi: [C: 03+2] Python 3.8 support [software/homer] - 10https://gerrit.wikimedia.org/r/592909 (owner: 10Ayounsi) [10:41:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2014.codfw.wmnet [10:41:56] (03PS1) 10Ema: cache: test purged on cp2030, part of cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/592911 (https://phabricator.wikimedia.org/T249583) [10:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:01] 10Operations, 10Traffic: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (10ema) 05Open→03Resolved a:03ema Fixed by moving to purged (T249583). [10:43:37] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase-backend,name=restbase2014.codfw.wmnet [10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase-ssl,name=restbase2014.codfw.wmnet [10:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:07] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1311 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:45:07] PROBLEM - Apache HTTP on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1311 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:45:46] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10ema) 05Open→03Resolved a:03ema The cache_text cluster has now been running purged i... [10:45:55] 10Operations, 10Traffic, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10ema) [10:46:11] (03CR) 10Muehlenhoff: [C: 03+2] cumin: Fix Python version for Buster and remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/584587 (owner: 10Muehlenhoff) [10:46:51] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reimaging to buster T250666', diff saved to https://phabricator.wikimedia.org/P11064 and previous config saved to /var/cache/conftool/dbconfig/20200428-104650-kormat.json [10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:58] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [10:47:58] (03CR) 10Ema: [C: 03+2] cache: test purged on cp2030, part of cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/592911 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [10:48:09] <_joe_> !log running heavy_page test on mw1407,9 [10:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:25] !log liw@deploy1001 Pruned MediaWiki: 1.35.0-wmf.27 (duration: 12m 37s) [10:48:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:24] (03CR) 10Muehlenhoff: [C: 03+2] Add a second IDP staging host [dns] - 10https://gerrit.wikimedia.org/r/592907 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [10:53:04] (03PS1) 10Muehlenhoff: Remove now obsolete check for jessie [puppet] - 10https://gerrit.wikimedia.org/r/592913 [10:55:43] (03PS1) 10Muehlenhoff: phragile: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/592914 [10:57:09] (03CR) 10Dzahn: [C: 03+1] "this is used in cloud, but looks like it is stretch meanwhile: https://openstack-browser.toolforge.org/project/phragile" [puppet] - 10https://gerrit.wikimedia.org/r/592914 (owner: 10Muehlenhoff) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1100). [11:00:04] Majavah and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:06] o/ [11:00:44] (03PS2) 10Majavah: Create a bunch of namespace aliases for thwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592645 (https://phabricator.wikimedia.org/T251118) [11:01:01] (03PS1) 10Lars Wirzenius: testwikis wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592915 [11:01:03] (03CR) 10Lars Wirzenius: [C: 03+2] testwikis wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592915 (owner: 10Lars Wirzenius) [11:01:59] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592915 (owner: 10Lars Wirzenius) [11:03:20] Urbanecm: ping? 
[11:04:02] !log liw@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.30 [11:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:49] (03PS5) 10Reedy: Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) [11:06:48] (03CR) 10Jbond: varnish: update varnish config to use the abuse_networks global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [11:08:59] Majavah: I can deploy your change [11:09:21] sure, that works too. [11:09:47] thought that Martin would do it as he also has a patch scheduled for this window [11:13:16] (03CR) 10Muehlenhoff: [C: 03+2] Add separate repository sync definition for HP RAID tools on Buster [puppet] - 10https://gerrit.wikimedia.org/r/592900 (owner: 10Muehlenhoff) [11:13:21] I’m a bit confused about that วิกิตำรา alias [11:13:26] or name [11:13:35] because it’s not in MediaWiki’s MessagesTh.php [11:13:58] but it does seem to be the Project namespace on-wiki, so I guess I just don’t fully understand how it works but it’s still correct ^^ [11:14:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592645 (https://phabricator.wikimedia.org/T251118) (owner: 10Majavah) [11:15:34] ah, yes, I’m an idiot – of course the project namespace wouldn’t be translated in MediaWiki >.< [11:15:38] since it varies by project ^^ [11:15:40] (03CR) 10Ema: [C: 03+1] "Nice! A couple of nits, otherwise good to go." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [11:15:42] (03Merged) 10jenkins-bot: Create a bunch of namespace aliases for thwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592645 (https://phabricator.wikimedia.org/T251118) (owner: 10Majavah) [11:16:03] the project namespace is set on InitializeSettings [11:16:14] yeah, I found it with codesearch [11:16:26] godog: ema: [11:16:28] sorry [11:16:45] hm, `scap pull` on mwdebug1001 is taking a while… [11:17:29] uh, why is there a “testwikis to wmf.30” log message above? [11:17:37] 10Operations: Add CI to the private repo - https://phabricator.wikimedia.org/T251247 (10Reedy) [11:17:40] liw: are you doing the train already? [11:18:43] Lucas_WMDE, deploying to testwikis [11:18:47] Majavah: the change should be on mwdebug1001 now, can you test it? [11:18:57] Lucas_WMDE: sure, a sec [11:19:02] liw: isn’t that supposed to be in two hours? [11:19:05] or am I reading the calendar wrong [11:19:37] Lucas_WMDE: working as expected [11:19:45] ok, then I’ll sync [11:20:10] ssacli/buster-wikimedia,now 4.15-6.0 amd64 [installed], nice, thanks [11:20:24] !log updated ssacli/ssaducli for buster-wikimedia's thirdparty/hwraid component to 4.15-6.0 [11:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:30] failed to acquire lock, owner is liw, reason ist testwikis to wmf.30 [11:20:46] Lucas_WMDE, group0 in two hours, testwikis happen before the train deployment window [11:20:59] I thought group0 = testwikis? [11:21:13] Lucas_WMDE, I don't think so [11:21:24] doing SWAT and train at the same time doesn’t seem like a great idea to me… [11:21:33] jynus: let's see if they renamed arguments etc :-) [11:21:46] moritzm: as far as check goes, it seems to work [11:21:57] and also tested some manual commands [11:22:16] great! [11:22:29] the actual problem is when they create an hpssnewclicmd new utility! 
[11:22:47] "only working for gen11 controllers" [11:22:49] Lucas_WMDE, sorry; I have to get this done before the train window or I'll be late; I hope this testwikis deployment is over in 10 mins or so [11:22:54] (03PS7) 10Muehlenhoff: profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 [11:23:16] ok, I guess then I’ll wait until I can sync [11:23:18] Somehow I missed the jouncebot ping... [11:23:21] Lucas_WMDE, I'm on the 3.2.1.10 step of https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys [11:23:27] * Lucas_WMDE looks [11:23:55] Ping me once I can do my patch [11:24:42] cc Lucas_WMDE Majavah [11:24:52] sure [11:26:01] “This can take well over an hour” [11:26:38] :/ [11:26:46] so there’s a phantom deployment window, which isn’t in the deployment calendar, but can block scap for over an hour, overlapping with the one-hour EU SWAT? [11:26:50] I am so confused [11:26:54] * Urbanecm too [11:26:54] has it always been like this and I’ve never noticed? [11:27:03] * Majavah is also confused [11:27:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. When you have a moment, please document this somewhere in wikitech, perhaps here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_V" [puppet] - 10https://gerrit.wikimedia.org/r/592750 (owner: 10Andrew Bogott) [11:27:28] ... and directly after the one-hour EU SWAT window is an one-hour no deploys window? [11:28:09] and testwiki seems to act as a group-1 which gets the new version before even the rest of group0 [11:28:38] (03PS1) 10Hnowlan: changeprop: enable more rules in kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) [11:28:55] t [11:28:58] ? [11:29:26] * Lucas_WMDE checks wikitech version history [11:29:28] the testwiki deployment is there so we know we can deploy at all, to anything. that step has failed before. 
[11:30:03] (03CR) 10Muehlenhoff: [C: 04-1] profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [11:30:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add wmcs-instancepurge.py and a timer job that uses it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592750 (owner: 10Andrew Bogott) [11:30:31] liw: is it possible to schedule it officially, so it doesn't conflict with any sort of windows? [11:31:13] it looks like it’s been in the wikitech instructions for a while, just with a shorter estimated duration [11:31:19] used to be more like half an hour, apparently [11:31:31] I guess that would’ve been under the “you don’t need to book a deployment window” threshold [11:31:58] again, I'm sorry I'm overlapping with EU SWAT; I didn't know I was interfereing with it; I started the pre-train window steps hours ago, and they always seem to take longer, and always break in new exciting ways [11:32:36] someone should probable file a ticket about booking an actual window for this [11:32:40] liw: I see. As we're in half of the window, do you think it's likely it will get unblocked, or should we cancel it and reschedule our patches? 
[11:32:46] didn’t mean to blame you specifically, sorry if it sounded like this [11:32:53] just very confused that this is how the process is supposed to be [11:33:03] needs a phabricator task or ops-l email, yeah [11:33:06] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [11:33:08] +1 [11:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:14] Urbanecm, re scheduling, maybe, I've made a note to bring it up with Tyler [11:33:55] Urbanecm, I have on idea how long this will still take; the estimate on the wiki page is way off [11:34:12] and scap doesn't give an ETA [11:34:16] * Urbanecm doesn't understand the second message [11:34:49] !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:34:51] Urbanecm, I don't know how long it will still take, sorry [11:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:16] * Lucas_WMDE is running `watch ls` on the lockfile scap printed out (/var/lock/scap.operations_mediawiki-config.lock) [11:36:14] Okay, ack. So I guess we should reschedule patches and make sure this gets fixed for next time? [11:36:35] probably makes sense to reschedule yours [11:36:39] Majavah’s is already merged [11:36:47] so I’m really hoping to sync that before the no-deploy window [11:36:56] yeah I was just about to ask what to do with my patch currently on mwdebug1001 [11:37:04] I'll give a shout once my scap finishes [11:37:13] thanks [11:37:32] thanks [11:37:37] * liw notes this is supposedly the biggest train, in terms of commits, in recent history: 900+ commits [11:37:50] have luck with the train anyway! [11:38:11] thanks [11:38:34] (03CR) 10Elukey: profile::idp: add mcrouter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:38:34] wow [11:38:43] 900+ MediaWiki core commits, I assume? 
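The `watch ls` on the scap lockfile mentioned above amounts to polling until the file disappears. A minimal sketch of that idea follows, assuming only that the lock is an ordinary file that is removed when the deployment finishes. The function name `wait_for_lock_release` and its parameters are hypothetical; this is not part of scap, and scap itself gives no ETA.

```python
import os
import time


def wait_for_lock_release(lock_path, poll_seconds=2.0, timeout_seconds=None,
                          exists=os.path.exists, sleep=time.sleep,
                          clock=time.monotonic):
    """Poll until lock_path no longer exists.

    Returns True once the file is gone, or False if timeout_seconds
    elapses first. The exists/sleep/clock hooks are only there so the
    loop can be exercised without real waiting; they are illustrative
    and not taken from any existing tool.
    """
    start = clock()
    while exists(lock_path):
        if timeout_seconds is not None and clock() - start >= timeout_seconds:
            return False
        sleep(poll_seconds)
    return True
```

Called with the path scap printed (/var/lock/scap.operations_mediawiki-config.lock) and no timeout, this would simply block until the other deployer's run releases the lock.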
[11:38:47] Lucas_WMDE: just realized that as my patch modifies namespace aliases it probably needs a namespaceDupes.php run. should we just revert and try again tomorrow? [11:39:01] I guess that’s due to no train last week + all the WikiPage/Article deprecations [11:39:16] Majavah: hm, good point [11:39:23] IIRC that script usually doesn’t take long to ru [11:39:25] *run [11:39:46] it's a small wiki but I have no idea about the runtime of that script [11:39:52] let’s wait a bit before reverting [11:45:02] !log Deploy schema change on s8 eqiad master with replication T250071 [11:45:06] If we can sync-file it, we can surely run the script - it should run just seconds [11:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:08] T250071: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 [11:52:55] !log liw@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.30 (duration: 48m 53s) [11:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:04] \o/ [11:53:07] my scap has finished [11:53:13] then I’ll try running mine [11:53:19] there should be enough time for that + maint script [11:53:19] just in time :D [11:54:32] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:592645|Create a bunch of namespace aliases for thwikibooks (T251118)]] (duration: 01m 05s) [11:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:39] T251118: Create an alias for some namespace in Thai Wikibooks - https://phabricator.wikimedia.org/T251118 [11:54:48] script without --fix says 493 links to fix, 493 were resolvable. 
[11:54:52] which looks good [11:55:22] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=thwikibooks --fix | tee T251118-fix [11:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:31] !log EU SWAT done [11:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:20] Lucas_WMDE: thanks, appers to be working without X-Wikimedia-Debug too [11:57:37] great [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1200) [12:01:20] "969 Changes in 193 repos by 81 authors" accodring to https://www.mediawiki.org/wiki/MediaWiki_1.35/wmf.30/Changelog [12:01:37] oh, I see [12:02:20] (03PS1) 10Ayounsi: Configure management-instance on routers [homer/public] - 10https://gerrit.wikimedia.org/r/592920 (https://phabricator.wikimedia.org/T247073) [12:03:22] (03CR) 10Elukey: [C: 04-1] "The idea looks good but I have some specific comments about the code changes, plus some more general ones:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [12:13:47] (03PS1) 10Jbond: network::parse_abuse_networks: update module function [puppet] - 10https://gerrit.wikimedia.org/r/592921 [12:16:26] (03PS1) 10Seddon: Uncoupling graphoid on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592924 [12:18:02] (03PS1) 10Muehlenhoff: Move idp-test to row C [dns] - 10https://gerrit.wikimedia.org/r/592925 (https://phabricator.wikimedia.org/T233930) [12:18:07] (03PS2) 10Jbond: network::parse_abuse_networks: update module function [puppet] - 10https://gerrit.wikimedia.org/r/592921 [12:20:27] (03PS3) 10Jbond: network::parse_abuse_networks: update module function [puppet] - 10https://gerrit.wikimedia.org/r/592921 [12:23:11] (03PS1) 10Ema: cache: move to purged [puppet] - 10https://gerrit.wikimedia.org/r/592928 (https://phabricator.wikimedia.org/T249583) 
[12:25:29] (03CR) 10Jbond: [C: 03+2] network::parse_abuse_networks: update module function [puppet] - 10https://gerrit.wikimedia.org/r/592921 (owner: 10Jbond) [12:26:47] (03PS9) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [12:27:16] (03CR) 10Ayounsi: "To be merged/deployed once cr3-esams gets upgraded." [homer/public] - 10https://gerrit.wikimedia.org/r/592920 (https://phabricator.wikimedia.org/T247073) (owner: 10Ayounsi) [12:34:13] (03PS10) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [12:35:46] !log Temporarily change query killer from 300 seconds to 3600 on labsdb1010 T249188 [12:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:54] T249188: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 [12:36:10] (03PS1) 10Jbond: abuse_nets: add comments [labs/private] - 10https://gerrit.wikimedia.org/r/592931 [12:36:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] abuse_nets: add comments [labs/private] - 10https://gerrit.wikimedia.org/r/592931 (owner: 10Jbond) [12:37:06] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [12:39:05] !log Deploy schema change on db1102:3314 [12:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:46] !log Deploy schema change on dbstore1004:3314 [12:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] profile::url_downloader: Add types and switch to lookup() (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [12:45:55] (03PS6) 10Jbond: profile::idp: add mcrouter [puppet] - 
10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [12:48:55] (03CR) 10Jbond: "Thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [12:49:10] (03PS7) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [12:49:21] (03PS6) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) [12:51:14] (03PS1) 10Elukey: admin: add krb flag for jmads [puppet] - 10https://gerrit.wikimedia.org/r/592935 (https://phabricator.wikimedia.org/T250560) [12:51:40] (03CR) 10Elukey: [C: 03+2] admin: add krb flag for jmads [puppet] - 10https://gerrit.wikimedia.org/r/592935 (https://phabricator.wikimedia.org/T250560) (owner: 10Elukey) [12:54:46] (03CR) 10Vgutierrez: [C: 03+1] VCL: clarify scripted requests error [puppet] - 10https://gerrit.wikimedia.org/r/592652 (owner: 10Ema) [12:55:11] (03PS11) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [12:58:36] (03PS10) 10Ayounsi: Netbox driven switch interfaces configuration [homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) [12:58:38] (03PS2) 10Ayounsi: Netbox driven routers disabled interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/592246 [12:58:40] (03PS2) 10Ayounsi: Chassis: more generic, add ae count [homer/public] - 10https://gerrit.wikimedia.org/r/592251 [12:58:42] (03PS1) 10Ayounsi: Add graceful-switchover multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) [12:58:47] (03CR) 10jerkins-bot: [V: 04-1] Netbox driven switch interfaces configuration [homer/public] - 
10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [12:58:56] (03PS2) 10Ayounsi: Add graceful-switchover to multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) [12:59:23] (03PS1) 10Hoo man: Add new properties to wmgWBRepoPreferredPageImagesProperties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592939 (https://phabricator.wikimedia.org/T249811) [13:00:04] liw and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1300). [13:00:38] (03PS1) 10Lars Wirzenius: group0 wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592940 [13:00:40] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592940 (owner: 10Lars Wirzenius) [13:01:35] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592940 (owner: 10Lars Wirzenius) [13:01:58] (03CR) 10Muehlenhoff: [C: 03+2] Move idp-test to row C [dns] - 10https://gerrit.wikimedia.org/r/592925 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [13:03:46] (03PS1) 10Jbond: abuse_nets: test multiline yaml [labs/private] - 10https://gerrit.wikimedia.org/r/592942 [13:03:56] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.30 [13:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] (03CR) 10Jbond: [V: 03+2 C: 03+2] abuse_nets: test multiline yaml [labs/private] - 10https://gerrit.wikimedia.org/r/592942 (owner: 10Jbond) [13:05:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, let's see how it works" [deployment-charts] - 10https://gerrit.wikimedia.org/r/589415 (https://phabricator.wikimedia.org/T248523) (owner: 10RLazarus) 
[13:05:35] (PS2) Alexandros Kosiaris: Change all helmfile_log_sal commands from prepare to presync hooks. [deployment-charts] - https://gerrit.wikimedia.org/r/589415 (https://phabricator.wikimedia.org/T248523) (owner: RLazarus) [13:06:11] Yey, I now have a commit in mediawiki core live in production. [13:07:18] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' . [13:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:30] (PS12) Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [13:07:33] RhinosF1, congratulations! [13:07:54] liw: Thanks for doing the train! [13:08:29] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [13:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:28] (PS2) Dzahn: ATS: switch backends for design and sitemaps static sites [puppet] - https://gerrit.wikimedia.org/r/592610 (https://phabricator.wikimedia.org/T247650) [13:12:37] Operations, Prod-Kubernetes, serviceops: `helmfile --interactive apply` logs to SAL even if cancelled - https://phabricator.wikimedia.org/T248523 (akosiaris) Open→Resolved a:akosiaris https://gerrit.wikimedia.org/r/589415 merged, run a test deploy and got https://sal.toolforge.org/sal/log...
[13:13:16] (PS13) Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [13:17:50] (CR) Dzahn: [C: +2] ATS: switch backends for design and sitemaps static sites [puppet] - https://gerrit.wikimedia.org/r/592610 (https://phabricator.wikimedia.org/T247650) (owner: Dzahn) [13:19:03] (PS3) Vgutierrez: Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - https://gerrit.wikimedia.org/r/591308 [13:19:14] (CR) jerkins-bot: [V: -1] Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - https://gerrit.wikimedia.org/r/591308 (owner: Vgutierrez) [13:19:29] jbond42: eh, the behaviour of puppet-merge is confusing me [13:19:51] (PS14) Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [13:20:18] first it shows me your change.. then i say "no", then it does not stop but pulls my change.. then i say "yes" and it claims "all done" but does not do the sync ? [13:20:35] the first change is a labs change right? [13:20:47] you need to say yes to both or it won't proceed with the merging [13:20:48] Puppet, User-jbond: puppet-merge: answering no to merging labs-private prevents puppet-merge from pushing to all puppet masters - https://phabricator.wikimedia.org/T251104 (jbond) [13:20:57] mutante: see ^^ [13:21:07] yeah, exactly like T251104 [13:21:07] T251104: puppet-merge: answering no to merging labs-private prevents puppet-merge from pushing to all puppet masters - https://phabricator.wikimedia.org/T251104 [13:21:18] i have merged my private one now and it has pushed yours [13:21:22] oooh, i see [13:21:29] thank you, ok [13:21:40] confusing part is that it says "all done" but isn't actually done [13:21:53] yes its a bug ill try to look at it later this week [13:22:01] ack!
thanks:) [13:22:11] np [13:24:11] (CR) Jbond: "updated thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: Jbond) [13:26:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:09] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:27:54] (PS1) Jbond: base::firewall: dont add desc as comments are cache specific [puppet] - https://gerrit.wikimedia.org/r/592945 [13:30:58] !log Restarting uppercaseTitlesForUnicodeTransition.php as part of T219279 for frwiki [13:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:04] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [13:32:11] (PS1) Muehlenhoff: Add idp-test1001 to site.pp/DHCP config [puppet] - https://gerrit.wikimedia.org/r/592946 [13:33:25] !log running puppet on cp-ats - switching backends of design.wikimedia.org and sitemaps.wikimedia.org [13:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:43] (PS3) Ottomata: refine.pp - Slight refactor to use new unified refine tranform functions [puppet] - https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) [13:34:29] (CR) Jbond: [C: +2] base::firewall: dont add desc as comments are cache specific [puppet] - https://gerrit.wikimedia.org/r/592945 (owner: Jbond) [13:40:02] (CR) Ottomata: "Looks ok!
https://puppet-compiler.wmflabs.org/compiler1001/22184/an-launcher1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata) [13:41:07] (CR) Ppchelko: [C: +2] changeprop: enable more rules in kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [13:42:52] Operations, Puppet, User-jbond: Add CI to the private repo - https://phabricator.wikimedia.org/T251247 (jbond) [13:43:09] (CR) Muehlenhoff: [C: +2] Add idp-test1001 to site.pp/DHCP config [puppet] - https://gerrit.wikimedia.org/r/592946 (owner: Muehlenhoff) [13:43:10] Operations, Puppet, User-jbond: Add CI to the private repo - https://phabricator.wikimedia.org/T251247 (jbond) p:Triage→Medium [13:43:43] !log enabling Kafka TLS for eventgate-main [13:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:54] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:43:54] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[13:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:49] (CR) Jbond: [C: +1] Add analytics-product system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga) [13:48:10] !log holger@mwmaint1002 end (frwiki=success) [13:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] Operations, MediaWiki-General, serviceops, Core Platform Team Workboards (Clinic Duty Team), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (holger.knust) @Framawiki Merci beaucoup! [13:50:20] (CR) Elukey: [C: -1] Add analytics-product system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga) [13:53:17] !log update ATS 8.1 on cp4026 - T249335 [13:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:23] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:54:31] !log installing idp-test2001.wikimedia.org [13:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:48] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:59:48] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[13:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:06] (CR) Jbond: Add analytics-product system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga) [14:01:26] Operations, LDAP-Access-Requests, LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (CDanis) You should be able to log in and change your password at https://toolsadmin.wikimedia.org/profile/settings/accounts/ [14:05:00] Operations, LDAP-Access-Requests, LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (Samwalton9) Oh perfect, that was the domain I needed for my password manager to locate the right login details for me 😁 Confirmed I have access to Superset with that login. [14:08:18] (PS1) Ottomata: eventgate and eventstreams - Specificy kafka ssl cipher settings [deployment-charts] - https://gerrit.wikimedia.org/r/592951 (https://phabricator.wikimedia.org/T250149) [14:10:26] (PS2) Ottomata: eventgate and eventstreams - Specify kafka ssl cipher settings [deployment-charts] - https://gerrit.wikimedia.org/r/592951 (https://phabricator.wikimedia.org/T250149) [14:11:42] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (CDanis) >>! In T215183#6087811, @jcrespo wrote: > @CDanis backup2002 was recently installed into buster (apparently, wrongly), but it already contains data. Would a simple: > ` > grub-install /dev/sdb > ` > (I am...
[14:12:31] (CR) Elukey: [C: +1] eventgate and eventstreams - Specify kafka ssl cipher settings [deployment-charts] - https://gerrit.wikimedia.org/r/592951 (https://phabricator.wikimedia.org/T250149) (owner: Ottomata) [14:12:50] (PS2) Hnowlan: changeprop: enable more rules in kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) [14:13:04] (CR) Hnowlan: [V: +2] changeprop: enable more rules in kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [14:13:23] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (MoritzMuehlenhoff) >>! In T215183#6089160, @CDanis wrote: > I'm not sure how backup2002 wound up in that state -- AFAIK the partman files should have been fixed since Buster was available, and I see that you reim... [14:15:27] (CR) Ottomata: [C: +2] eventgate and eventstreams - Specify kafka ssl cipher settings [deployment-charts] - https://gerrit.wikimedia.org/r/592951 (https://phabricator.wikimedia.org/T250149) (owner: Ottomata) [14:15:36] (CR) Hnowlan: [V: +2 C: +2] changeprop: enable more rules in kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [14:18:03] (PS1) Ottomata: eventstreams - kafka ssl.ca.location [deployment-charts] - https://gerrit.wikimedia.org/r/592961 [14:18:37] (PS3) Hnowlan: changeprop: enable more rules in kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) [14:18:57] (CR) Hnowlan: [V: +2 C: +2] changeprop: enable more rules in kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [14:19:15] (Merged) jenkins-bot: changeprop: enable more rules in kubernetes
[deployment-charts] - https://gerrit.wikimedia.org/r/592918 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [14:19:27] (CR) Ottomata: [C: +2] eventstreams - kafka ssl.ca.location [deployment-charts] - https://gerrit.wikimedia.org/r/592961 (owner: Ottomata) [14:19:36] (PS2) Ottomata: eventstreams - kafka ssl.ca.location [deployment-charts] - https://gerrit.wikimedia.org/r/592961 [14:19:38] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (CDanis) Ah, I see, `no-srv-format.cfg` doesn't set up any SW RAID at all, which means debian-installer of course won't know to install grub on both disks. I'm guessing you set up the RAID for / manually? So the... [14:19:44] (CR) Ottomata: [V: +2 C: +2] eventstreams - kafka ssl.ca.location [deployment-charts] - https://gerrit.wikimedia.org/r/592961 (owner: Ottomata) [14:20:42] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:20:42] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [14:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:34] !log restarting KDC on krb1001 to pick up openssl update [14:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:18] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:23:18] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:49] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:37] (CR) RLazarus: [C: +1] "Thanks!" [software/httpbb] - https://gerrit.wikimedia.org/r/592889 (owner: Dzahn) [14:25:06] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:25:10] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [14:25:11] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [14:25:11] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[14:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:46] (CR) Jhedden: [C: +1] Add wmcs-instancepurge.py and a timer job that uses it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592750 (owner: Andrew Bogott) [14:26:08] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [14:26:48] (PS1) Seddon: inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - https://gerrit.wikimedia.org/r/592964 [14:27:58] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (jcrespo) `no-srv-format.cfg` is the (wrong) recipe we use for forcing a failure so stateful services don't get accidentally reimaged (as it happened once) :-D. Backup hosts use normally `custom/backup-format.cfg... [14:28:02] Operations, ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (Jclark-ctr) @elukey Dimm is on site Ping me on IRC i am on site right now if you are available to change it [14:28:26] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [14:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:25] (CR) Andrew Bogott: [C: +2] Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 (owner: Andrew Bogott) [14:33:48] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[14:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] (PS1) Muehlenhoff: Enable base::service_auto_restart for KDC [puppet] - https://gerrit.wikimedia.org/r/592966 (https://phabricator.wikimedia.org/T135991) [14:35:21] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (CDanis) Ah okay. It does look like `backup-format.cfg` contains the necessary incantations for replicated GRUB. [14:35:26] Operations, ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (elukey) @hnowlan @Eevans can you sync with @Jclark-ctr ? [14:35:52] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:35:52] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:12] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for KDC [puppet] - https://gerrit.wikimedia.org/r/592966 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [14:37:36] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:37:36] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:03] (CR) Reedy: [C: -1] inline comment update and fix to allow beta graphs to use prod files (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/592964 (owner: Seddon) [14:38:27] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (jcrespo) I've corrected manually backup2002, db1115, db2093 and will ask Manuel about the proxies, some of those are just being decommissioned or will be reimaged soon. Arguably we will not have lots of servers a... [14:39:09] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:39:09] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:33] (PS1) JMeybohm: Add debian directory, .gitreview [debs/helm3] - https://gerrit.wikimedia.org/r/592967 [14:40:22] Operations, LDAP-Access-Requests, LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (CDanis) Open→Resolved Great, glad to hear it! I also updated SRE's docs with some of what @bd808 said, as I don't think that was widely understood on the SRE team. [14:41:34] (CR) Elukey: [C: +1] mcrouter: enable the gutter pool everywhere. [puppet] - https://gerrit.wikimedia.org/r/592520 (https://phabricator.wikimedia.org/T244852) (owner: Giuseppe Lavagetto) [14:42:46] <_joe_> alea iacta est [14:42:51] (CR) Giuseppe Lavagetto: [C: +2] mcrouter: enable the gutter pool everywhere.
[puppet] - https://gerrit.wikimedia.org/r/592520 (https://phabricator.wikimedia.org/T244852) (owner: Giuseppe Lavagetto) [14:43:33] 👀 [14:43:50] (PS1) Andrew Bogott: wmcs-instancepurge.py: Add a bit more info to the help message. [puppet] - https://gerrit.wikimedia.org/r/592969 (https://phabricator.wikimedia.org/T251152) [14:45:19] 🎉 [14:45:26] (CR) Andrew Bogott: [C: +2] wmcs-instancepurge.py: Add a bit more info to the help message. [puppet] - https://gerrit.wikimedia.org/r/592969 (https://phabricator.wikimedia.org/T251152) (owner: Andrew Bogott) [14:45:28] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:45:28] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:51] (PS1) Andrew Bogott: sre-sandbox hiera: added a link to the project request [puppet] - https://gerrit.wikimedia.org/r/592971 (https://phabricator.wikimedia.org/T247517) [14:49:34] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:49:34] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:28] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[14:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:41] (CR) Andrew Bogott: [C: +2] "docs added at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#wmcs-instancepurge" [puppet] - https://gerrit.wikimedia.org/r/592750 (owner: Andrew Bogott) [14:52:17] Operations, Cloud-VPS (Project-requests), Patch-For-Review, cloud-services-team (Kanban): Create vm-reaper job to manage lifespan of VMs "wmcs-instancepurge" - https://phabricator.wikimedia.org/T251152 (Andrew) [14:52:28] Operations, Cloud-VPS (Project-requests), Patch-For-Review, cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (Andrew) [14:52:31] Operations, Cloud-VPS (Project-requests), Patch-For-Review, cloud-services-team (Kanban): Create vm-reaper job to manage lifespan of VMs "wmcs-instancepurge" - https://phabricator.wikimedia.org/T251152 (Andrew) Open→Resolved https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maint... [14:52:38] (CR) Andrew Bogott: [C: +2] sre-sandbox hiera: added a link to the project request [puppet] - https://gerrit.wikimedia.org/r/592971 (https://phabricator.wikimedia.org/T247517) (owner: Andrew Bogott) [14:54:13] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:54:13] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:23] Operations, Cloud-VPS (Project-requests), Patch-For-Review, cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (Andrew) Open→Resolved a:Andrew This project has now been created.
@jbond, you are the initial projectad... [14:55:55] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:55:56] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:45] Operations, ops-eqiad, Discovery-Search (Current work): (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (Jclark-ctr) a:Jclark-ctr→Cmjohnson relforge1003 A2 U34 Port 34 relforge1004 B2 U31... [14:58:05] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:58:06] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:58:06] Operations, ops-eqiad, Discovery-Search (Current work): (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (Jclark-ctr) [14:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:04] (CR) JMeybohm: [C: +1] "LGTM. I guess it's not a problem to clean this up when jessie is no longer used for Docker stuff" [puppet] - https://gerrit.wikimedia.org/r/592913 (owner: Muehlenhoff) [14:59:47] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:44] fdans: standup? [15:00:58] ottomata: wrong channel but sure!
[15:01:29] we do only sitdowns in this channel :-P [15:01:47] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 9.649e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [15:02:00] (CR) JMeybohm: Add debian directory, .gitreview (1 comment) [debs/helm3] - https://gerrit.wikimedia.org/r/592967 (owner: JMeybohm) [15:02:04] (PS1) Muehlenhoff: Update acmechief config for second IDP staging host [puppet] - https://gerrit.wikimedia.org/r/592975 [15:02:06] (PS1) Muehlenhoff: Enable idp-test1001 as second IDP staging server [puppet] - https://gerrit.wikimedia.org/r/592976 (https://phabricator.wikimedia.org/T233930) [15:02:30] (PS1) Jbond: ferm ferm-status: filter out dynamic iptables rules for docker and k8s [puppet] - https://gerrit.wikimedia.org/r/592977 [15:03:55] (CR) Andrew Bogott: [C: +2] Horizon: replace nutcracker with mcrouter [puppet] - https://gerrit.wikimedia.org/r/592276 (owner: Andrew Bogott) [15:04:46] Operations, serviceops, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Joe) [15:05:37] (PS2) JMeybohm: Add debian directory, .gitreview [debs/helm3] - https://gerrit.wikimedia.org/r/592967 [15:09:06] (PS1) Hnowlan: changeprop: Increase number of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/592979 (https://phabricator.wikimedia.org/T248677) [15:10:36] (CR) Ppchelko: [C: +2] "maybe we'd need even more, but let's be conservative."
[deployment-charts] - https://gerrit.wikimedia.org/r/592979 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [15:11:07] (Merged) jenkins-bot: changeprop: Increase number of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/592979 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [15:11:21] (PS2) Jbond: ferm ferm-status: filter out dynamic iptables rules for docker and k8s [puppet] - https://gerrit.wikimedia.org/r/592977 [15:11:42] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/592975 (owner: Muehlenhoff) [15:13:41] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:54] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [15:13:54] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [15:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:08] (CR) Volans: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592977 (owner: Jbond) [15:15:17] (CR) Muehlenhoff: "Some comments inline" (6 comments) [debs/helm3] - https://gerrit.wikimedia.org/r/592967 (owner: JMeybohm) [15:15:29] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:46] OAuth seems to be having some issues.
I get E004, others can't even press allow [15:16:47] (CR) Jbond: [C: +2] ferm ferm-status: filter out dynamic iptables rules for docker and k8s [puppet] - https://gerrit.wikimedia.org/r/592977 (owner: Jbond) [15:23:45] (PS4) JMeybohm: Build with vendor, improve packaging [debs/helm] - https://gerrit.wikimedia.org/r/592689 [15:26:58] Operations, MediaWiki-Cache, Traffic, serviceops, and 5 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (Krinkle) [15:27:06] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@2b87a75]: Switch off rules moved to k8s T248677 [15:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] T248677: Finalise changeprop migration to k8s - https://phabricator.wikimedia.org/T248677 [15:28:26] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@2b87a75]: Switch off rules moved to k8s T248677 (duration: 01m 20s) [15:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:36] !log rolling restart of ats-tls on cp[3050,3052,3054,3056] - T249335 [15:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:44] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [15:35:12] (PS1) Jbond: ferm-status: minor optimisations [puppet] - https://gerrit.wikimedia.org/r/592984 [15:35:59] (CR) Vgutierrez: [C: +1] Update acmechief config for second IDP staging host [puppet] - https://gerrit.wikimedia.org/r/592975 (owner: Muehlenhoff) [15:36:55] (PS2) Ema: VCL: clarify scripted requests error [puppet] - https://gerrit.wikimedia.org/r/592652 [15:38:01] (CR) Ema: [C: +2] VCL: clarify scripted requests error [puppet] - https://gerrit.wikimedia.org/r/592652 (owner: Ema) [15:38:25] (CR) CDanis: [C: +1] Chassis: more generic, add ae count [homer/public] - https://gerrit.wikimedia.org/r/592251 (owner:
Ayounsi) [15:39:45] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/592976 (https://phabricator.wikimedia.org/T233930) (owner: Muehlenhoff) [15:40:05] (CR) Volans: [C: -1] "I think there is a small typo, looks good otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592984 (owner: Jbond) [15:42:03] (PS2) Jbond: ferm-status: minor optimisations [puppet] - https://gerrit.wikimedia.org/r/592984 [15:42:05] (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592984 (owner: Jbond) [15:42:53] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/592984 (owner: Jbond) [15:47:21] Operations, ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (Jclark-ctr) @elukey @hnowlan @Eevans Time restricted in data center leaving now will be on site Thursday 2pm-4pm utc please ping me on irc if your able to assist Thursday. [15:47:50] (PS1) Hnowlan: changeprop: enable all rules in k8s [deployment-charts] - https://gerrit.wikimedia.org/r/592987 (https://phabricator.wikimedia.org/T248677) [15:48:23] (CR) Jbond: [C: +2] ferm-status: minor optimisations [puppet] - https://gerrit.wikimedia.org/r/592984 (owner: Jbond) [15:48:35] (CR) Ppchelko: [C: +2] changeprop: enable all rules in k8s [deployment-charts] - https://gerrit.wikimedia.org/r/592987 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [15:48:56] (Merged) jenkins-bot: changeprop: enable all rules in k8s [deployment-charts] - https://gerrit.wikimedia.org/r/592987 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan) [15:49:53] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[15:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:31] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis CenturyLink ticket #18622981 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:50:31] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 85, down: 1, dormant: 0, excluded: 2, unused: 0: CDanis CenturyLink ticket #18622981 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:58:47] (03PS1) 10Jbond: ferm-status: add KUBE to list of ignored chains [puppet] - 10https://gerrit.wikimedia.org/r/592988 [16:00:04] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:12] (03PS1) 10Urbanecm: GrowthExperiments: cswiki: Change manual of style to 5 pillars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) [16:01:30] (03PS3) 10Andrew Bogott: mcrouter: update example code [puppet] - 10https://gerrit.wikimedia.org/r/589743 [16:02:22] (03CR) 10Dvorapa: [C: 03+1] GrowthExperiments: cswiki: Change manual of style to 5 pillars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) (owner: 10Urbanecm) [16:02:31] (03CR) 10Andrew Bogott: [C: 03+2] mcrouter: update example code [puppet] - 10https://gerrit.wikimedia.org/r/589743 (owner: 10Andrew Bogott) [16:02:49] (03CR) 10Dvorapa: [C: 03+1] GrowthExperiments: cswiki: Change manual of style to 5 pillars (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) (owner: 10Urbanecm) [16:03:10] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) >>! In T245183#5974497, @Krinkle wrote: > Here is another mysterious mis-call ([Logstash single document... [16:03:37] (03CR) 10Urbanecm: GrowthExperiments: cswiki: Change manual of style to 5 pillars (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) (owner: 10Urbanecm) [16:05:08] (03CR) 10Jbond: [C: 03+2] ferm-status: add KUBE to list of ignored chains [puppet] - 10https://gerrit.wikimedia.org/r/592988 (owner: 10Jbond) [16:06:10] liw: I feel that https://gerrit.wikimedia.org/r/592982 should probably get deployed to wmf.30; OK for me to do that? 
[16:07:41] sec [16:09:14] James_F, I find myself incompetent to read the PHP, but the issue seems worth fixing, and it doesn't seem like a big, risky change, so go ahead; are you going to be around to revert if it causes a nasty surprise? [16:10:35] 10Operations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10colewhite) [16:10:44] Yes. [16:12:03] (03PS1) 10Cwhite: smart: set facter timeout to three minutes [puppet] - 10https://gerrit.wikimedia.org/r/592993 (https://phabricator.wikimedia.org/T199236) [16:13:29] 10Puppet, 10Wikimedia Meet: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Ladsgroup) Thanks! [16:15:02] James_F, good answer :) [16:15:23] * James_F grins. [16:15:27] Merging now. [16:20:06] * James_F sighs at packagist flakiness. [16:21:25] (03CR) 10Cwhite: [C: 03+2] smart: set facter timeout to three minutes [puppet] - 10https://gerrit.wikimedia.org/r/592993 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [16:26:51] (03PS1) 10Andrew Bogott: Openstack packaging: set the stage for Rocky on Buster [puppet] - 10https://gerrit.wikimedia.org/r/592997 (https://phabricator.wikimedia.org/T251294) [16:26:51] (03PS12) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [16:32:03] (03CR) 10RhinosF1: [C: 03+1] GrowthExperiments: cswiki: Change manual of style to 5 pillars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) (owner: 10Urbanecm) [16:33:55] (03PS1) 10Jhedden: Ceph: Cleanup comments in config module [puppet] - 10https://gerrit.wikimedia.org/r/592998 [16:35:05] (03CR) 10Jhedden: [C: 03+2] Ceph: Cleanup comments in config module [puppet] - 10https://gerrit.wikimedia.org/r/592998 (owner: 10Jhedden) [16:36:21] !log volker-e@deploy1001 Started deploy [design/style-guide@335122b]: Deploy design/style-guide: [16:36:27] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:29] !log volker-e@deploy1001 Finished deploy [design/style-guide@335122b]: Deploy design/style-guide: (duration: 00m 08s) [16:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:13] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.30/includes/Revision/RevisionStore.php: Follow-up If770120: Fix bad combination of type cast and ?? operator (duration: 01m 06s) [16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:25] OK, all done from my end. [16:45:01] (03PS10) 10CRusnov: netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) [16:45:37] 10Operations, 10observability: run nic_saturation_exporter on all physical hosts - https://phabricator.wikimedia.org/T250401 (10CDanis) p:05Triage→03Medium [16:51:55] (03PS2) 10Nray: Add Config for Growth Study Quick Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) [16:52:16] (03PS2) 10Seddon: inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592964 [16:52:27] (03PS3) 10Nray: Add Config for Growth Study Quick Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) [16:53:19] 10Operations, 10Puppet: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10crusnov) p:05Triage→03Medium [16:54:02] (03CR) 10jerkins-bot: [V: 04-1] inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592964 (owner: 10Seddon) [17:00:04] halfak and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1700). [17:10:11] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@678fb8e]: Update mobileapps to ff88022a [17:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:35] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@678fb8e]: Update mobileapps to ff88022a (duration: 03m 23s) [17:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:13] (03PS1) 10Volans: homer: add a diff check that sends and email [puppet] - 10https://gerrit.wikimedia.org/r/593007 (https://phabricator.wikimedia.org/T249224) [17:17:06] (03CR) 10CRusnov: [C: 03+2] interface_automation: Restrict mgmt creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/589048 (https://phabricator.wikimedia.org/T250287) (owner: 10CRusnov) [17:20:58] (03CR) 10CRusnov: [C: 03+2] netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [17:21:06] (03CR) 10Volans: "compiler results at:" [puppet] - 10https://gerrit.wikimedia.org/r/593007 (https://phabricator.wikimedia.org/T249224) (owner: 10Volans) [17:25:07] (03Abandoned) 10CRusnov: templates/wmnet: Remove most mgmt.eqiad entries to test generated zones [dns] - 10https://gerrit.wikimedia.org/r/575650 (owner: 10CRusnov) [17:25:18] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:26:49] 10Operations, 10serviceops, 10Kubernetes: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [17:27:54] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/593007 (https://phabricator.wikimedia.org/T249224) (owner: 10Volans) [17:30:05] (03PS3) 10JMeybohm: Add debian directory, .gitreview [debs/helm3] - 10https://gerrit.wikimedia.org/r/592967 (https://phabricator.wikimedia.org/T251305) [17:31:44] (03CR) 10JMeybohm: Add debian directory, .gitreview (036 comments) [debs/helm3] - 10https://gerrit.wikimedia.org/r/592967 (https://phabricator.wikimedia.org/T251305) (owner: 10JMeybohm) [17:34:18] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 22696 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:37:14] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:28] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [17:37:30] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:37:38] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: 
/{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:38:28] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:02] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:14] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:24] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:42] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:40:28] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:41:04] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:41:16] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:41:30] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:42:18] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:42:36] (03PS2) 10Ayounsi: add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: 10CDanis) [17:43:08] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:43:14] (03CR) 10Phuedx: "This LGTM. Waiting on open questions about the dependency to be resolved." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) (owner: 10Nray) [17:44:42] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:45:26] (03PS3) 10Seddon: inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592964 [17:50:56] 10Operations, 10ops-eqsin, 10Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (10wiki_willy) a:03Cmjohnson Checked Netbox and the server looks like it's still under warranty until October of this year. [17:51:00] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:51:30] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.30/extensions/OAuth: T251306 (duration: 01m 06s) [17:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:39] T251306: OAuth throwing E004 on wikis with 1.35wmf30 - https://phabricator.wikimedia.org/T251306 [17:55:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:56:13] (03PS3) 10Gehel: Re-enable OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/591028 (https://phabricator.wikimedia.org/T249086) (owner: 
10MSantos) [17:57:37] (03CR) 10Gehel: [C: 03+2] Re-enable OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/591028 (https://phabricator.wikimedia.org/T249086) (owner: 10MSantos) [17:57:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:58:04] mateusbs17: ^^ late, but finally! [17:58:28] gehel: Perfect timing [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1800). [18:00:04] Urbanecm: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:18] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:00:18] I'll do the SWAT then [18:00:35] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: cswiki: Change manual of style to 5 pillars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) (owner: 10Urbanecm) [18:01:46] (03Merged) 10jenkins-bot: GrowthExperiments: cswiki: Change manual of style to 5 pillars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592989 (https://phabricator.wikimedia.org/T251290) (owner: 10Urbanecm) [18:02:55] (03PS2) 10Urbanecm: Allow bdwikimedia bureaucrats to revoke sysop flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592890 (https://phabricator.wikimedia.org/T251078) [18:03:05] (03CR) 10Urbanecm: [C: 03+2] Allow bdwikimedia bureaucrats to revoke sysop flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592890 (https://phabricator.wikimedia.org/T251078) (owner: 10Urbanecm) [18:03:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 07c28d1: GrowthExperiments: cswiki: Change manual of style to 5 pillars (T251290) (duration: 01m 05s) [18:03:58] (03Merged) 10jenkins-bot: Allow bdwikimedia bureaucrats to revoke sysop flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592890 (https://phabricator.wikimedia.org/T251078) (owner: 10Urbanecm) [18:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:02] T251290: [cswiki] Change Manual of Style GrowthExperiments link to the Five pillars - https://phabricator.wikimedia.org/T251290 [18:05:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:07:01] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0639d9f: Allow bdwikimedia bureaucrats to revoke sysop flag (T251078) (duration: 01m 05s) [18:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:08] T251078: Let the bureaucrats remove administrator flag on the chapterwiki of wmbd - https://phabricator.wikimedia.org/T251078 [18:07:10] * Urbanecm done [18:07:36] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:07:44] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:07:48] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:10:26] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:10:40] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:10:54] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:12:10] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:12:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:13:36] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:13:40] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:15:13] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:16:32] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:18:12] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:18:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:21:35] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) Fixed error in Netbox for flerovium-array2. @jclark-ctr - once you have msw-a2-eqiad added into Julianne's spreadsheet (at the top in line 8), then you can close out this req... [18:23:48] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:24:30] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:24:54] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:26:12] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:30:16] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:31:50] 
PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:32:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:32:18] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:32:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:33:50] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:34:32] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:34:44] (03PS1) 10CRusnov: netbox-script-proxy: Fix uwsgi configuration [puppet] - 10https://gerrit.wikimedia.org/r/593031 [18:35:44] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:35:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:35:55] (03PS4) 10Bstorm: Yet another package rename mega patch [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/585963 (https://phabricator.wikimedia.org/T249079) (owner: 10BryanDavis) [18:35:58] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:36:02] (03PS1) 10CRusnov: custom_script_proxy: Provide `application` variable for uwsgi [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/593032 [18:36:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:36:18] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:36:30] 
10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) [18:36:40] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:36:43] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/593031 (owner: 10CRusnov) [18:36:51] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/593032 (owner: 10CRusnov) [18:36:54] (03CR) 10CRusnov: [C: 03+2] custom_script_proxy: Provide `application` variable for uwsgi [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/593032 (owner: 10CRusnov) [18:37:54] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:17] what is all of this recommendation api spam ^^ ? 
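One way to answer the question above is to tally PROBLEM and RECOVERY notifications per check and host. The following is only a minimal sketch: it assumes the entry shape seen in this log (`[HH:MM:SS] PROBLEM - <check> on <host> is CRITICAL: ...`), the function name `count_flaps` is hypothetical, and the sample entries are abbreviated from ones in this log.

```python
import re
from collections import Counter

# Assumed entry shape, taken from the alerts in this log:
#   [HH:MM:SS] PROBLEM - <check> on <host> is CRITICAL: ...
#   [HH:MM:SS] RECOVERY - <check> on <host> is OK: ...
# Entries may be run together on one physical line, so split on timestamps.
ENTRY_START = re.compile(r"(?=\[\d\d:\d\d:\d\d\] )")
ALERT = re.compile(
    r"\[(\d\d:\d\d:\d\d)\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is (?:CRITICAL|OK)\b"
)


def count_flaps(log_text):
    """Return {(check, host): Counter({'PROBLEM': n, 'RECOVERY': m})}."""
    counts = {}
    for entry in ENTRY_START.split(log_text):
        m = ALERT.match(entry)
        if m is None:
            continue  # not an alert entry (chat, gerrit, !log, ...)
        _, kind, check, host = m.groups()
        counts.setdefault((check, host), Counter())[kind] += 1
    return counts


# Message bodies shortened to "..." for illustration.
sample = (
    "[18:07:36] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: ... "
    "[18:10:40] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy "
    "[18:30:16] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: ... "
    "[18:32:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: ..."
)
flaps = count_flaps(sample)
for (check, host), c in sorted(flaps.items()):
    print(f"{host}: {check}: {c['PROBLEM']} problem(s), {c['RECOVERY']} recovery(ies)")
```

The non-greedy check group matters because some check names contain " on " themselves (for example "High average GET latency for mw requests on appserver in eqiad"); the host is taken as the token immediately before " is CRITICAL" or " is OK".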
[18:41:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:14] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:43:24] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:36] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:58] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:02] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:47:43] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:18] PROBLEM - recommendation_api 
endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:14] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:50:08] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:52:12] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:52:34] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:52:34] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:52:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:52:56] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:53:48] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:53:56] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:04] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:06] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:08] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:22] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:55:54] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:58:08] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [18:58:08] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:05] (03PS1) 10Andrew Bogott: wmcs pdns: use a 'hosts' list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [18:59:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:59:40] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:44] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:58] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:04] liw and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T1900). 
[19:01:30] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:44] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:52] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:02:31] (03PS2) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:02:33] (03CR) 10jerkins-bot: [V: 04-1] wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:03:16] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:03:38] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:12] (03PS3) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:04:14] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (10wiki_willy) 05Open→03Resolved The remaining 10x Netbox errors (across all reports) will be handled via the following tasks per site: ulsfo - T250408 and T249287 eq... [19:05:08] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:07:32] (03PS4) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:07:34] (03CR) 10jerkins-bot: [V: 04-1] wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:08:54] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:09:00] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:09:12] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:09:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[19:10:30] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:10:36] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:10:44] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:10:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:12:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:13:00] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:13:09] (03PS1) 10Jforrester: Add lazy-loading to Wikimedia Foundation powered-by icon [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) [19:13:38] (03PS5) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:13:52] (03CR) 10Subramanya Sastry: "cscott: ping" [puppet] - 10https://gerrit.wikimedia.org/r/577656 (owner: 10C. Scott Ananian) [19:14:06] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:26] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:44] (03CR) 10Jforrester: "Maybe this should be moved into WikimediaMessages given the other footer fiddles live there? Eh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) (owner: 10Jforrester) [19:16:12] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:16:32] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:16:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:17:58] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:20:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:20:36] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:20:56] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:21:17] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Krinkle) [19:21:30] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:21:30] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Krinkle) 05Open→03Resolved a:03Krinkle [19:21:36] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:21:38] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Krinkle) a:05Krinkle→03None [19:21:46] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:23:16] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:24] (03CR) 10Krinkle: [C: 03+1] Add lazy-loading to Wikimedia Foundation powered-by icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) (owner: 10Jforrester) [19:23:36] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:54] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:25:24] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:26:22] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:27:04] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:27:21] Krinkle: I was going to wait to deploy that next week alongside the train with the MW one, but I could sling it out now if you want? 
[19:27:22] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10Krinkle) [19:27:41] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10Krinkle) [19:27:48] 10Operations, 10SRE-OnFire, 10Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10Krinkle) [19:27:51] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10Krinkle) [19:28:09] James_F: would prefer to see both together for better analysis [19:28:26] and preferably not in the train, e.g. swat together some time between now and next week [19:29:10] Sure. [19:29:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:30:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:31:16] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:16] PROBLEM - 
recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:20] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:33:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:33:02] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:33:02] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:33:03] (03PS1) 10Jhedden: cloudvps: metricsinfra add project label and default alert rules [puppet] - 10https://gerrit.wikimedia.org/r/593042 (https://phabricator.wikimedia.org/T250206) [19:33:06] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:35:26] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} 
(article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:36:14] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:36:18] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:36:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:37:14] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:04] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:24] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:24] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - 
good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:42] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:42] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:39:40] (03CR) 10Jhedden: [C: 03+2] cloudvps: metricsinfra add project label and default alert rules [puppet] - 10https://gerrit.wikimedia.org/r/593042 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [19:39:50] (03PS6) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:40:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:40:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:41:58] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation sug [19:41:58] ut before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received [19:41:58] .wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:42:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation sug [19:42:48] ut before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received [19:42:48] 
ticle/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:40] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:44] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:54] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:56] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:44:03] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:44:12] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [19:44:12] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:44:14] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[19:44:38] ^ anybody know what's up with these? [19:44:58] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:30] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:46:02] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:46:26] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:49:18] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:49:28] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[19:49:48] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:02] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:28] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:44] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:50] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:53:08] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for Indic Wikisource - https://phabricator.wikimedia.org/T251339 (10jayantanth) [19:53:30] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:54:15] (03PS7) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:54:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 87, down: 0, dormant: 0, 
excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:33] (03PS1) 10Jhedden: cloudvps: enable project monitoring for the metricsinfra project [puppet] - 10https://gerrit.wikimedia.org/r/593048 (https://phabricator.wikimedia.org/T250206) [19:56:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:57:36] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [19:57:36] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:57:48] (03CR) 10Jhedden: [C: 03+2] cloudvps: enable project monitoring for the metricsinfra project [puppet] - 10https://gerrit.wikimedia.org/r/593048 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [19:57:54] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption 
translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [19:57:54] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:58:06] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:58:28] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [19:58:28] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:58:46] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [19:58:46] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:58:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:59:06] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [19:59:06] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:10] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [19:59:10] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:34] (03PS8) 10Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 
10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [19:59:56] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:00:14] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:18] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:18] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:18] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:28] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:44] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:00:50] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:52] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:01:30] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:01:42] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:01:56] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:02:08] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:02:08] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:02:18] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints 
are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:02:35] (03CR) 10jerkins-bot: [V: 04-1] wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - 10https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [20:02:38] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:03:02] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:04:00] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:04:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:30] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:48] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:04:58] RECOVERY - 
recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:05:06] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:05:18] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:05:32] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:05:42] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:05:54] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:00] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:03] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:03] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:13] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:26] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:30] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:07:38] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:08:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:09:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:10:34] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:18] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:11:26] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:40] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:40] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:50] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:12:00] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:12:00] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: 
/{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:12:06] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:48] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:52] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:14:30] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:02] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:16] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:16:00] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:16:46] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:17:28] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:18:46] (03PS1) 10Ssingh: Add a version parameter to cescout (improves 313d63de) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/593052 [20:19:02] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:20:13] (03CR) 10Ssingh: [C: 03+2] Add a version parameter to cescout (improves 313d63de) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/593052 (owner: 10Ssingh) [20:22:20] brennen: the recommendation api stuff or? 
[20:23:08] yeah [20:23:40] my gut feeling is mediawiki is slow, and the recommendation api seems to call a wikibase api module, so then i guess the recommendation api gets slow / fails [20:23:54] i dug around and found the api calls https://github.com/wikimedia/mediawiki-services-recommendation-api/blob/60741bb02a73f4542617d55c8236abb68beda533/lib/article.creation.morelike.js#L204 [20:24:07] but can't really find any metrics about what is happening inside the recommendation api [20:24:26] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:24:37] i guess all of these alarms were also going off last night [20:24:43] (03PS1) 10Jhedden: cloudvps: enable monitoring for projects using shinken [puppet] - 10https://gerrit.wikimedia.org/r/593054 (https://phabricator.wikimedia.org/T250206) [20:25:26] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:35] About the same time last night too: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=9&fullscreen&orgId=1&from=now-2d&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-role=All [20:26:16] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:26:26] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:26:46] I suspect that all of this comes down to the load on db1111 for s8, and responses from that end up making everything slow [20:26:56] https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?panelId=21&fullscreen&orgId=1&from=now-2d&to=now [20:27:39] looking at this last month, the last 2 days / nights have actually been the worse [20:27:45] *worst [20:27:52] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:28:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:28:17] also causes elevated exceptions due to timeouts [20:28:30] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10Jclark-ctr) @JHedden Reached out to dell today opened another service request 1023973621 Request new Raid Adapter [20:28:32] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:28:50] that's interesting. db1114 and 1104 depooled? 
[20:28:51] and general slowness https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-2d&to=now [20:29:04] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:18] shdubsh: not sure [20:29:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:29:36] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:50] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:50] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:30:18] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:18] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:31:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:31:48] shdubsh: it looks like they might be per https://noc.wikimedia.org/dbconfig/eqiad.json [20:31:52] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:59] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for Indic Wikisource - https://phabricator.wikimedia.org/T251339 (10Aklapper) Hi @jayantanth. Please see https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list for required information. Thanks! [20:32:48] addshore: indeed. perhaps s8 is overloaded? [20:33:43] 1104 is depooled for https://phabricator.wikimedia.org/T232446#6083381 [20:33:54] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:21] not sure about 1114 [20:35:33] 1114 was also removed in https://phabricator.wikimedia.org/P11039 not sure if that was intended [20:36:03] 10Operations, 10LDAP-Access-Requests, 10LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10bd808) >>! In T250189#6089282, @CDanis wrote: > > I also updated SRE's docs with some of what @bd808 said, as I don't think that was widely understood on the SRE team. On... [20:36:35] if commonswiki is on s1, how would load on s8 affect it? 
[20:36:52] most things talk to s8 [20:37:14] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:18] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:30] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:38] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:55] I was talking to perf about how to add some observability to where the load is actually coming from within mediawiki / wikibase [20:38:07] as right now it is fairly impossible to tell [20:39:10] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:39:12] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:39:34] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:40:10] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:40:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22729 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:40:53] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:41:44] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:20] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:42:38] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:12] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:12] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:44:24] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:44:46] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:45:00] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:45:00] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:45:00] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:46:24] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:48:27] (PS9) Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [20:49:14] Operations, Release-Engineering-Team, Core Platform Team Workboards (Clinic Duty Team), Performance Issue, Wikimedia-database-error: WikiPage::updateCategoryCounts causing replication lag due to
long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (CCicalese_WMF) ... [20:49:22] Operations, MediaWiki-API, Traffic, Wikidata, and 2 others: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (CCicalese_WMF) Open→Resolved a:CCicalese_WMF Marking as Resolved as it is in the Done column. Feel free to reope... [20:49:38] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:49:58] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:50:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:50:36] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:50:36] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:50:36] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:52:12] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:54:24] (PS10) Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [20:56:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:56:06] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:56:16] (PS11) Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [20:57:18] RECOVERY -
recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:57:32] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:57:40] !log cdanis@cumin1001 dbctl commit (dc=all): 's8 weights: -db1111, +db1099,db1101', diff saved to https://phabricator.wikimedia.org/P11069 and previous config saved to /var/cache/conftool/dbconfig/20200428-205739-cdanis.json [20:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:45] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (JHedden) Great! Thanks for the update. This host is currently out of service and can be taken offline anytime. [20:59:09] (PS2) CRusnov: reports cables: Add extra regexp to support more active interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/589750 [20:59:15] (CR) jerkins-bot: [V: -1] wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [20:59:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:59:27] Operations, Parsoid, RESTBase, Traffic, Core Platform Team Workboards (Clinic Duty Team): HTTP 400 Error when trying to save an edit on English Wikipedia: Error contacting the Parsoid/RESTBase server - https://phabricator.wikimedia.org/T250815 (Pchelolo) p:Medium→High RESTBase correct...
[20:59:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:59:48] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:59:59] (CR) CRusnov: "Note the regexes were tested against currently in use network interfaces names and seems to be correct." [software/netbox-extras] - https://gerrit.wikimedia.org/r/589750 (owner: CRusnov) [21:00:10] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:00:28] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:00:50] (CR) CRusnov: "> Patch Set 10:" (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [21:01:16] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:01:16] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:02:32] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:04:02] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:04:55] (PS12) Andrew Bogott: wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) [21:08:12] (CR) jerkins-bot: [V: -1] wmcs pdns: use a hiera list for auth resolvers and recursors [puppet] - https://gerrit.wikimedia.org/r/593035 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [21:18:16] (PS4) Nray: Add Config for Growth Study Quick Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) [21:18:48] (CR) Cwhite: [C: +2] smart: abstract parsing from data gathering and add tests [puppet] - https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) (owner: Cwhite) [21:20:16] (PS10) Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [21:30:36] (PS1) Ppchelko: Rerender mobile-html on wikidata description changes [deployment-charts] - https://gerrit.wikimedia.org/r/593066 (https://phabricator.wikimedia.org/T250209) [21:32:39] Operations, MediaWiki-extensions-CodeReview, Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of
CodeReview - https://phabricator.wikimedia.org/T243056 (MaxSem) Mhm, maybe put it in a subdomain of mediawiki.org? [21:33:59] Operations, SRE-Access-Requests: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (MBinder_WMF) [21:42:23] (CR) Joewalsh: [C: +1] Rerender mobile-html on wikidata description changes [deployment-charts] - https://gerrit.wikimedia.org/r/593066 (https://phabricator.wikimedia.org/T250209) (owner: Ppchelko) [21:47:24] Operations, Android-app-Bugs, Page Content Service, Product-Infrastructure-Team-Backlog, and 4 others: Incorrect language variant returned for PCS endpoints - https://phabricator.wikimedia.org/T249284 (Pchelolo) Tagging #traffic - seems like Varnish/ATS layer no longer unsets 'cache-control: no-c... [21:49:51] (CR) Mholloway: [C: +1] Rerender mobile-html on wikidata description changes [deployment-charts] - https://gerrit.wikimedia.org/r/593066 (https://phabricator.wikimedia.org/T250209) (owner: Ppchelko) [21:53:58] (CR) Volans: [C: +1] "If you've tested it and that's how it works go for it :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/593031 (owner: CRusnov) [22:10:30] (CR) Volans: "The change looks sane and the commit reference correct. I didn't inspect the tar.gz though." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/592721 (owner: CRusnov) [22:44:54] (CR) Bstorm: [C: +2] "Ok, I beat on this a fair bit in toolsbeta and couldn't kill it with the basic tests at least. I'll merge it along!"
[software/tools-webservice] - https://gerrit.wikimedia.org/r/585963 (https://phabricator.wikimedia.org/T249079) (owner: BryanDavis) [22:45:38] (Merged) jenkins-bot: Yet another package rename mega patch [software/tools-webservice] - https://gerrit.wikimedia.org/r/585963 (https://phabricator.wikimedia.org/T249079) (owner: BryanDavis) [23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200428T2300) [23:00:04] nray: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:06] o/ I'm around and ready [23:03:00] I'll do the SWAT [23:03:21] thanks RoanKattouw ! [23:07:27] nray: So your cherry-pick is for wmf.30, but enwiki is still on wmf.28 and won't get wmf.30 until Thursday. Is that known/intended? [23:07:34] (see also https://tools.wmflabs.org/versions/ ) [23:09:01] oh shoot, no we want to be able to test it tomorrow. It should be wmf.28 then . I can fix it [23:09:11] OK [23:09:59] Also, for future reference, please avoid adding i18n messages in SWATted patches, it requires rebuilding the entire i18n cache which is pretty slow [23:11:12] In this case you're the only SWAT customer so it's fine, and I know things are urgent sometimes, but it's the kind of thing that we do sparingly and that people get annoyed with you for if you do it too often [23:12:45] Sorry, thanks for letting me know. We can punt on this if necessary.
The correct cherry pick is at 593081 [23:13:03] I just corrected the deployment schedule page link as well [23:14:41] Thanks [23:14:57] (PS5) Catrope: Add Config for Growth Study Quick Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) (owner: Nray) [23:14:58] I've also amended your config patch to match quote style with the rest of the file (single quotes instead of double quotes) [23:15:46] thank you, looks good [23:17:15] I've started the process by +2ing your wmf.28 patch, so now we're in for a bunch of waiting. It'll take 10-15 minutes for CI to merge it, and then deploying it will probably take another 20-30 minutes [23:18:52] okay, thank you. I'll be on standby [23:20:26] PROBLEM - snapshot of s8 in eqiad on db1115 is CRITICAL: Last snapshot for s8 at eqiad (db1116.eqiad.wmnet:3318) taken on 2020-04-28 20:37:54 is 1033 GB, but previous one was 1564 GB, a change of 33.9% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [23:30:04] Oh wait I also have to merge the wmf.30 patch, otherwise your messages will disappear on Thursday [23:30:34] Sorry for that brain fart [23:32:46] No problem, I'll correct the deployment schedule page to add wmf.30 as well [23:36:25] Thanks for catching that! [23:48:21] !log catrope@deploy1001 Started scap: Update WikimediaMessages with new i18n messages for T248421 [23:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:28] T248421: Deploy Quicksurveys extension on all Wikipedias (for a Growth study) - https://phabricator.wikimedia.org/T248421 [23:48:57] OK, both patches have merged and I've downloaded them on the deployment host, now starting the actual deployment [23:49:37] sounds good [23:50:03] After this step, the messages should be in production, but the survey won't yet be.
That's very quick though, config patches usually take about 30 seconds to merge and 60-90 seconds to deploy [23:50:43] Meaning, the step after this one will be quick, this one won't be. I'm going to make myself a late lunch / early dinner / whatever this meal is in the meantime [23:52:08] +1 [23:55:34] (PS6) BryanDavis: Replace pykube with a custom API client [software/tools-webservice] - https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930)