[00:00:00] would special:import work from wikitech to mw.org?
[00:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T0000).
[00:00:08] probably not, right?
[00:00:23] Krinkle: Use file export.
[00:00:32] If you don't have that, I can grant you the rights…
[00:00:42] right, xml import is still a thing, sometimes
[00:01:39] sure :)
[00:01:45] Also, should the pending section have the pending additions.
[00:03:30] There's MediaModeration (T247943) and PushNotifications (T246718) planned, plus StopForumSpam (T181217) to make Reedy happy, I guess. ;-)
[00:03:31] T181217: Deploy StopForumSpam to the Beta Cluster - https://phabricator.wikimedia.org/T181217
[00:03:32] T247943: Deploy MediaModeration Extension to Wikimedia Production - https://phabricator.wikimedia.org/T247943
[00:03:32] T246718: Deployment of the PushNotifications extension - https://phabricator.wikimedia.org/T246718
[00:04:19] RECOVERY - snapshot of s8 in codfw on db1115 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-04-29 20:43:38 (1090 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:05:59] (PS4) Jforrester: Move mobile-labs into CommonSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/586405
[00:06:00] (PS4) Jforrester: Move mobile into CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/586406
[00:07:59] https://www.mediawiki.org/wiki/Special:Userrights/Krinkle :)
[00:12:33] James_F: Krinkle: are you still deploying stuff? getting ready to release a phabricator update
[00:13:18] * Krinkle is not
[00:15:47] Operations, MediaWiki-extensions-CodeReview, Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (Krinkle) >>! In T243056#6095566, @Jdforrester-WMF wrote: > Why not just put it in Labs where other toy projects go?...
[00:15:55] !log deploying phabricator update: https://phabricator.wikimedia.org/project/view/4620/
[00:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:26] (CR) Krinkle: [C: +1] Move mobile-labs into CommonSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/586405 (owner: Jforrester)
[00:16:34] (CR) Krinkle: [C: +1] "Scap trap, careful, etc. :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/586405 (owner: Jforrester)
[00:21:39] (PS1) Kaldari: Adding upload_by_url user right to all registered users on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/593357 (https://phabricator.wikimedia.org/T251474)
[00:26:53] !log phabricator update finished
[00:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:01] (CR) 20after4: [C: +1] admins: new admin group to manage bulk jobs on Phabricator [puppet] - https://gerrit.wikimedia.org/r/593166 (https://phabricator.wikimedia.org/T251349) (owner: Dzahn)
[00:37:59] (PS1) AntiCompositeNumber: engine.ghostscript: use -sstdout=%stderr with gs [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/593358 (https://phabricator.wikimedia.org/T236240)
[00:38:01] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/593358 (https://phabricator.wikimedia.org/T236240) (owner: AntiCompositeNumber)
[00:40:46] (PS2) AntiCompositeNumber: engine.ghostscript: use -sstdout=%stderr with gs [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/593358 (https://phabricator.wikimedia.org/T236240)
[01:38:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:40:25] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:49:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:51:31] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:23:17] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:23:25] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:20:51] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:22:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:32:01] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:32:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:40:57] PROBLEM - Host graphite2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:42:27] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:44:11] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 2.178 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:07:59] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (idp-test1001), Fresh: 92 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[04:08:51] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:09:27] PROBLEM - carbon-frontend-relay metric drops on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:10:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:13:07] RECOVERY - carbon-frontend-relay metric drops on graphite1004 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:20:29] PROBLEM - carbon-frontend-relay metric drops on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:22:19] RECOVERY - carbon-frontend-relay metric drops on graphite1004 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:31:35] PROBLEM - carbon-frontend-relay metric drops on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:33:23] RECOVERY - carbon-frontend-relay metric drops on graphite1004 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:38:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3314', diff saved to https://phabricator.wikimedia.org/P11088 and previous config saved to /var/cache/conftool/dbconfig/20200430-043803-marostegui.json
[04:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:51] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:48:31] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:52:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11089 and previous config saved to /var/cache/conftool/dbconfig/20200430-045159-marostegui.json
[04:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:35] PROBLEM - carbon-frontend-relay metric drops on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[04:59:03] RECOVERY - carbon-frontend-relay metric drops on graphite1004 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[05:00:04] marostegui: (Dis)respected human, time to deploy x1 database master restart (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T0500). Please do the needful.
[05:00:18] !log Restart x1 master (db1120) - T250701
[05:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:25] T250701: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701
[05:02:21] !log Restart x1 master finished - T250701
[05:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:03:41] ^ those are the errors from the restart
[05:04:33] PROBLEM - carbon-frontend-relay metric drops on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[05:05:12] Operations, DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (Marostegui) You can use db1077 for this
[05:05:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:06:26] <_joe_> ah ok
[05:06:35] <_joe_> I was like "good morning joe"
[05:06:45] The carbon errors though have been there for a while
[05:07:21] <_joe_> one might ask what "frontend drops" means
[05:08:16] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:08:19] Operations, Cognate, ContentTranslation, DBA, and 10 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (Marostegui) Open→Resolved This was done. The server was unaccessible from 05:00:41 to 05:02:02
[05:08:51] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 93 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[05:08:54] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318', diff saved to https://phabricator.wikimedia.org/P11090 and previous config saved to /var/cache/conftool/dbconfig/20200430-051329-marostegui.json
[05:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:42] <_joe_> uh and what about cloudelastic
[05:14:55] Good morning everyone
[05:14:59] I'm working on T251371
[05:15:00] T251371: Create Awadhi Wikipedia - https://phabricator.wikimedia.org/T251371
[05:15:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P11091 and previous config saved to /var/cache/conftool/dbconfig/20200430-051506-marostegui.json
[05:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:12] composer buildDBLists no works for me
[05:15:29] I've created awawiki.yaml in wmf-config/config per documentation
[05:16:35] https://prnt.sc/s8e405
[05:16:37] <_joe_> Zoranzoki21: wrong channel to ask aout it
[05:16:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092', diff saved to https://phabricator.wikimedia.org/P11092 and previous config saved to /var/cache/conftool/dbconfig/20200430-051637-marostegui.json
[05:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:49] _joe_: Where I should ask? :D
[05:16:58] <_joe_> we're debugging production issues here, please move your inquiries elsewhere
[05:17:14] <_joe_> that's a mediawiki bug, not a production issue
[05:17:20] <_joe_> so I don't know, #mediawiki-core?
[05:17:26] _joe_: Not mediawiki issue
[05:17:45] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[05:17:48] It is for operations/mediawiki-config
[05:17:50] <_joe_> marostegui: can you access graphite2003?
[05:17:55] ^ that must be labsdb1011
[05:18:02] _joe_: checking
[05:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092', diff saved to https://phabricator.wikimedia.org/P11093 and previous config saved to /var/cache/conftool/dbconfig/20200430-051818-marostegui.json
[05:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:23] _joe_: No..
[05:18:31] <_joe_> marostegui: ok, that's what that alert is saying
[05:18:33] It looks down
[05:18:37] <_joe_> graphite2003 is down :P
[05:19:04] I am checking its serial console
[05:19:53] <_joe_> Zoranzoki21: let me rephrase: it's a software problem that has to do with mediawiki, not a production error
[05:20:19] <_joe_> also the wiki creation is not under the responsibility of SRE, which is the team that you can usually interact with here
[05:20:50] <_joe_> I asked, politely, to stop talking about it here while we're debugging an issue
[05:20:50] Ok, I understand. So, where I should ask?
[05:21:09] RECOVERY - carbon-frontend-relay metric drops on graphite1004 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[05:21:14] <_joe_> phabricator seems the right place, also, again, I'm busy
[05:21:44] I am going to create a task about graphite2003
[05:21:47] with its hardware logs
[05:22:20] I asked already there, but I thinked that I will get faster response here.. Ok, I will leave you to work on resolving production issue :)
[05:23:06] <_joe_> marostegui: :/ so we lost redundancy of graphite?
[05:23:28] It doesn't look good, no
[05:23:33] I am going to try to power it back on
[05:25:16] Operations, observability, Graphite: graphite2003 crashed - https://phabricator.wikimedia.org/T251479 (Marostegui) p:Triage→Medium
[05:26:18] Operations, observability, Graphite: graphite2003 crashed - https://phabricator.wikimedia.org/T251479 (Marostegui) I have powered it back on and it started fine - I saw no errors on its boot up process.
[05:26:23] The host is back
[05:26:27] RECOVERY - Host graphite2003 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[05:26:49] Operations, observability, Graphite: graphite2003 crashed - https://phabricator.wikimedia.org/T251479 (Marostegui) ` [05:26:27] <+icinga-wm> RECOVERY - Host graphite2003 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms `
[05:32:01] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui labsdb1011 crashed https://wikitech.wikimedia.org/wiki/HAProxy
[06:33:28] Operations, ops-eqiad, DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (Marostegui)
[06:33:37] Operations, ops-eqiad, DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (Marostegui) p:Triage→High
[06:34:30] Operations, ops-eqiad, DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (wiki_willy) a:Jclark-ctr
[06:36:19] Operations, ops-eqiad, DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (wiki_willy) @Jclark-ctr mentioned he was going to be onsite on Thursday, so assigning this over to him, to look into tomorrow. Thanks, Willy
[06:38:14] (PS1) Marostegui: install_server: Reimage db2089 [puppet] - https://gerrit.wikimedia.org/r/593410 (https://phabricator.wikimedia.org/T250666)
[06:38:46] Operations, DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (jcrespo) If you accept some input, making `partman/custom/no-srv-format.cfg` a recipe that works but doesn't touch the /srv lvm partition would solve most of our problems (combined with the dyna...
[06:40:25] (PS1) Marostegui: install_server: Reimage db2089 to buster [puppet] - https://gerrit.wikimedia.org/r/593432 (https://phabricator.wikimedia.org/T250666)
[06:42:12] (CR) RhinosF1: [C: +1] Adding upload_by_url user right to all registered users on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/593357 (https://phabricator.wikimedia.org/T251474) (owner: Kaldari)
[06:44:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2089', diff saved to https://phabricator.wikimedia.org/P11094 and previous config saved to /var/cache/conftool/dbconfig/20200430-064450-marostegui.json
[06:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P11095 and previous config saved to /var/cache/conftool/dbconfig/20200430-065008-marostegui.json
[06:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P11096 and previous config saved to /var/cache/conftool/dbconfig/20200430-065044-marostegui.json
[06:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:23] (CR) Jcrespo: [C: +1] install_server: Reimage db2089 [puppet] - https://gerrit.wikimedia.org/r/593410 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:04:50] (PS1) Majavah: Enable transwiki import from wikidata, frwikisource and hiwikibooks in hiwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/593455 (https://phabricator.wikimedia.org/T251485)
[07:06:08] (CR) Marostegui: [C: +2] install_server: Reimage db2089 [puppet] - https://gerrit.wikimedia.org/r/593410 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:06:23] (CR) Marostegui: [C: +2] install_server: Reimage db2089 to buster [puppet] - https://gerrit.wikimedia.org/r/593432 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:08:11] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[07:08:36] Operations, ops-eqiad, DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (Marostegui)
[07:08:47] Operations, ops-eqiad, DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (Marostegui) Thank you, for the record: The RAID configuration needs to be: **RAID 10 strip size 256KB**
[07:09:19] (CR) Muehlenhoff: [C: +1] "Any reason this is only for buster-wikimedia, though? mtail is also used on jessie (although we can ignore jessie at this point)" [puppet] - https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[07:13:25] Operations, netops, Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (ayounsi) Thanks. Manual action is better here to prevent flapping. > If all good, change the alert target so it notifies the whole of SRE This is done too. And I added the alert to...
[07:14:15] (PS1) Elukey: superset: correct X-Forwarded-Proto from httpd to superset [puppet] - https://gerrit.wikimedia.org/r/593456
[07:14:26] (CR) Muehlenhoff: mtail: add flag to install mtail apt component (3 comments) [puppet] - https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[07:19:03] (CR) Elukey: [C: +2] superset: correct X-Forwarded-Proto from httpd to superset [puppet] - https://gerrit.wikimedia.org/r/593456 (owner: Elukey)
[07:21:33] (CR) Dzahn: [C: +2] "approved by langcom https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Awadhi" [dns] - https://gerrit.wikimedia.org/r/593280 (https://phabricator.wikimedia.org/T251371) (owner: Zoranzoki21)
[07:21:45] (PS3) Dzahn: Add Awadhi (awa) lang [dns] - https://gerrit.wikimedia.org/r/593280 (https://phabricator.wikimedia.org/T251371) (owner: Zoranzoki21)
[07:26:04] (CR) Dzahn: [C: +2] "removing per https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Bulletin_des_administrateurs/2020/Semaine_18#Virer_un_blog_de_planet.wikimedia" [puppet] - https://gerrit.wikimedia.org/r/592438 (https://phabricator.wikimedia.org/T251001) (owner: Dereckson)
[07:26:16] (PS2) Dzahn: Prune JMT blog from fr.planet.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/592438 (https://phabricator.wikimedia.org/T251001) (owner: Dereckson)
[07:27:12] (CR) Giuseppe Lavagetto: Make configuration of envoy a ConfigMap (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/582777 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:29:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:18] (PS1) Ema: ATS: stop unconditionally cache 404s [puppet] - https://gerrit.wikimedia.org/r/593458 (https://phabricator.wikimedia.org/T250815)
[07:36:11] (PS2) Ema: ATS: stop unconditionally caching 404s [puppet] - https://gerrit.wikimedia.org/r/593458 (https://phabricator.wikimedia.org/T250815)
[07:36:58] (CR) Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: Urbanecm)
[07:40:09] (PS1) Marostegui: Revert "install_server: Reimage db2089" [puppet] - https://gerrit.wikimedia.org/r/593459
[07:42:33] (CR) Vgutierrez: [C: +1] ATS: stop unconditionally caching 404s [puppet] - https://gerrit.wikimedia.org/r/593458 (https://phabricator.wikimedia.org/T250815) (owner: Ema)
[07:43:32] (CR) Ema: [C: +2] ATS: stop unconditionally caching 404s [puppet] - https://gerrit.wikimedia.org/r/593458 (https://phabricator.wikimedia.org/T250815) (owner: Ema)
[07:43:40] Operations, MediaWiki-extensions-CodeReview, Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (Dzahn) Has it been considered to put it on https://dumps.wikimedia.org/ along with other dumps? Maybe https://dum...
[07:46:56] (CR) Marostegui: [C: +2] Revert "install_server: Reimage db2089" [puppet] - https://gerrit.wikimedia.org/r/593459 (owner: Marostegui)
[07:47:37] (CR) ZPapierski: [C: +1] "It's been years since I saw Tomcat config, but LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/584638 (https://phabricator.wikimedia.org/T233950) (owner: Jbond)
[07:48:13] Operations, Traffic, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (Joe) >>! In T133821#6094125, @BBlack wrote: >>>! In T133821#6092865, @Joe wrote: >> - Define a schema for a "url purge mes...
[07:48:33] Operations, Traffic, serviceops, Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (Joe)
[07:50:03] Operations, MediaWiki-extensions-CodeReview, Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (Dzahn) We have currently about 9.4GB left on those servers. So while 4GB kind of works for now.. it will not for a...
[07:50:41] Operations, MediaWiki-extensions-CodeReview, Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (MoritzMuehlenhoff) Agreed with what Daniel said, miscweb is for small static webservices, if the dump is already 4....
[07:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2089', diff saved to https://phabricator.wikimedia.org/P11097 and previous config saved to /var/cache/conftool/dbconfig/20200430-075211-marostegui.json
[07:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:45] (CR) Dzahn: [C: +2] LVS/icinga: avoid duplicate service definitions [puppet] - https://gerrit.wikimedia.org/r/593230 (https://phabricator.wikimedia.org/T211692) (owner: Dzahn)
[07:57:50] (PS4) Dzahn: prometheus/icinga: avoid duplicate service definitions [puppet] - https://gerrit.wikimedia.org/r/593228 (https://phabricator.wikimedia.org/T211692)
[07:58:04] (CR) Dzahn: [C: +2] prometheus/icinga: avoid duplicate service definitions [puppet] - https://gerrit.wikimedia.org/r/593228 (https://phabricator.wikimedia.org/T211692) (owner: Dzahn)
[07:58:53] Operations, Discovery-Search, SDC General, Structured Data Engineering, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (Gehel)
[07:58:55] Operations, SDC General, Structured Data Engineering, Structured-Data-Backlog, and 2 others: Refactor Puppet WDQS module to make it usable for wdqs and cqs - https://phabricator.wikimedia.org/T232297 (Gehel)
[08:02:06] Operations, Discovery-Search, SDC General, Structured Data Engineering, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (Gehel)
[08:10:21] (CR) Dzahn: [C: +2] show program name in usage text [software/httpbb] - https://gerrit.wikimedia.org/r/592889 (owner: Dzahn)
[08:12:42] (PS1) Ema: Add license and copyright notices [software/purged] - https://gerrit.wikimedia.org/r/593461
[08:17:59] (CR) Dzahn: [C: -1] "as suggested by Alex let me replace the process check with a check_http check" [puppet] - https://gerrit.wikimedia.org/r/589608 (owner: Dzahn)
[08:20:17] (CR) Jbond: "After the discussion in the SRE Foundations meeting yesterday, I'm starting to lean towards having no whitelist and leaving things as is" [homer/public] - https://gerrit.wikimedia.org/r/593250 (https://phabricator.wikimedia.org/T226742) (owner: Ayounsi)
[08:21:06] Operations, Parsoid, RESTBase, Traffic, and 2 others: HTTP 400 Error when trying to save an edit on English Wikipedia: Error contacting the Parsoid/RESTBase server - https://phabricator.wikimedia.org/T250815 (ema) >>! In T250815#6094356, @Pchelolo wrote: > #traffic This seems like a borderline UB...
[08:22:05] (CR) Jbond: "> Patch Set 1:" [homer/public] - https://gerrit.wikimedia.org/r/593250 (https://phabricator.wikimedia.org/T226742) (owner: Ayounsi)
[08:22:38] (PS2) Ema: Add license and copyright notices [software/purged] - https://gerrit.wikimedia.org/r/593461
[08:30:02] (PS2) Ayounsi: Change blackhole term scope [homer/public] - https://gerrit.wikimedia.org/r/593250 (https://phabricator.wikimedia.org/T226742)
[08:30:37] Operations: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (MoritzMuehlenhoff)
[08:31:50] (CR) Jbond: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/593250 (https://phabricator.wikimedia.org/T226742) (owner: Ayounsi)
[08:33:37] (CR) Ayounsi: [C: +2] Change blackhole term scope [homer/public] - https://gerrit.wikimedia.org/r/593250 (https://phabricator.wikimedia.org/T226742) (owner: Ayounsi)
[08:36:19] !log change blackhole term scope on all routers - T226742
[08:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:07] (PS2) Giuseppe Lavagetto: Make configuration of envoy a ConfigMap [deployment-charts] - https://gerrit.wikimedia.org/r/582777 (https://phabricator.wikimedia.org/T244843)
[08:40:09] (PS2) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843)
[08:49:43] (CR) Jbond: [C: +2] tomcat: create new tomcat module intended for use with apereo cas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/584638 (https://phabricator.wikimedia.org/T233950) (owner: Jbond)
[09:00:11] (CR) Lars Wirzenius: "I'm afraid this falls far outside anything I've been involved with or have an opinion on. Best of luck! Be safe." [mediawiki-config] - https://gerrit.wikimedia.org/r/592427 (https://phabricator.wikimedia.org/T250419) (owner: QEDK)
[09:01:16] Operations: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (jbond) p:Triage→Medium
[09:02:28] (PS1) Jbond: java: update java.security [puppet] - https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493)
[09:02:57] (CR) Jbond: [C: -1] "_1 unlitt this has been reviewed" [puppet] - https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: Jbond)
[09:04:14] (CR) Ema: "updated pcc here: https://puppet-compiler.wmflabs.org/compiler1003/22200/" [puppet] - https://gerrit.wikimedia.org/r/584902 (owner: Ema)
[09:09:17] (CR) Giuseppe Lavagetto: [C: -1] "Fix the services definition, but otherwise LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/584902 (owner: Ema)
[09:10:33] Operations, DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (Marostegui)
[09:10:44] Operations, DBA: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (Marostegui)
[09:13:46] (CR) RhinosF1: [C: +1] Enable transwiki import from wikidata, frwikisource and hiwikibooks in hiwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/593455 (https://phabricator.wikimedia.org/T251485) (owner: Majavah)
[09:14:01] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 79.7 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[09:14:15] (PS5) Ema: cache: use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/584902
[09:14:36] (CR) Jbond: [C: -1] "had a quick first pass" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: Jbond)
[09:14:54] (CR) Ema: cache: use profile::lvs::realserver (2 comments) [puppet] - https://gerrit.wikimedia.org/r/584902 (owner: Ema)
[09:15:41] Operations, DBA: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui)
[09:16:02] Operations, DBA: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui) p:Triage→Medium
[09:16:25] Operations, DBA: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui)
[09:18:34] Operations, Patch-For-Review: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (jbond)
[09:22:03] Operations, DBA: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (Marostegui) Day before: - Install the 10.1.43-2 package on both masters Maintenance day: - Silence all hosts in s5 and s6 - Set read only on s5 and s6: ` dbctl --scope eqiad section...
[09:23:53] Operations, observability, Graphite: graphite2003 crashed - https://phabricator.wikimedia.org/T251479 (fgiunchedi) Open→Stalled Thanks @Marostegui ! Host seems indeed fine once it came back. Looks like it might have been a non recoverable ECC, from `ipmi-sel` ` 13 | Apr-30-2020 | 03:36:11 |...
[09:28:37] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 183.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [09:32:17] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 41.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [09:34:58] (03CR) 10JMeybohm: [C: 03+2] Add debian directory and .gitreview [debs/helm3] - 10https://gerrit.wikimedia.org/r/592967 (https://phabricator.wikimedia.org/T251305) (owner: 10JMeybohm) [09:35:06] 10Operations, 10DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10Kormat) @jcrespo : i'm happy to work on that, but i'd like to do the proposed change in this task first. partman is voodoo anytime i've touched it, so it will take some time and some care to cha... [09:38:37] 10Operations, 10DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10jcrespo) I see now, sorry, I didn't understood the proposed scope of work first time I read it. +10000 for me. 
[09:39:51] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:46:59] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22737 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:47:17] !log reimaging db1077 for testing purposes T251392 [09:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:24] T251392: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 [09:48:32] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Joe) Looking at our existing event schemas, [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/event-schemas/+/master/jsonschema/resource_change/1.0.0.ya... [09:49:15] (03PS1) 10Kormat: install_server: Allow reimage of db1077 [puppet] - 10https://gerrit.wikimedia.org/r/593471 (https://phabricator.wikimedia.org/T251392) [09:49:47] (03CR) 10Marostegui: [C: 03+1] "Good luck!" 
[puppet] - 10https://gerrit.wikimedia.org/r/593471 (https://phabricator.wikimedia.org/T251392) (owner: 10Kormat) [09:49:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] cache: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/584902 (owner: 10Ema) [09:50:19] (03CR) 10Kormat: [C: 03+2] install_server: Allow reimage of db1077 [puppet] - 10https://gerrit.wikimedia.org/r/593471 (https://phabricator.wikimedia.org/T251392) (owner: 10Kormat) [09:51:26] (03CR) 10Ema: [C: 03+2] cache: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/584902 (owner: 10Ema) [09:53:15] !log imported helm3 3.2.0-1+deb10u1 to main for buster-wikimedia [09:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:07] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 119.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [10:05:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/593218 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [10:08:21] (03PS1) 10Volans: junos: do not commit check on empty diff [software/homer] - 10https://gerrit.wikimedia.org/r/593475 [10:09:22] !log imported helm 2.12.2-4 to main for buster-wikimedia [10:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:14] (03PS2) 10Volans: junos: do not commit check on empty diff [software/homer] - 10https://gerrit.wikimedia.org/r/593475 [10:15:31] (03PS1) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: small refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) [10:17:38] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.hosts.rotate-pdu-password: 
small refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:17:43] jbond42: love the small! +74 -62 :-p [10:17:51] :D [10:23:47] (03PS3) 10Seddon: Uncoupling graphoid on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592924 [10:24:14] (03PS2) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: small refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) [10:28:57] (03PS2) 10Dzahn: ci: remove integration.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/591338 [10:32:16] (03CR) 10Dzahn: [C: 03+2] ci: remove integration.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/591338 (owner: 10Dzahn) [10:34:06] (03CR) 10Dzahn: "puppet removed the site and reloaded httpd. apache2ctl -S already does not show it anymore without manual clean up" [puppet] - 10https://gerrit.wikimedia.org/r/591338 (owner: 10Dzahn) [10:34:33] (03PS2) 10Dzahn: delete integration.mediawiki.org [dns] - 10https://gerrit.wikimedia.org/r/591340 [10:36:33] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:37:14] (03CR) 10Dzahn: [C: 03+2] "httpd config removed from contint*" [dns] - 10https://gerrit.wikimedia.org/r/591340 (owner: 10Dzahn) [10:40:39] (03PS1) 10Dzahn: Revert "delete integration.mediawiki.org" [dns] - 10https://gerrit.wikimedia.org/r/593478 [10:41:20] (03CR) 10Dzahn: [C: 03+2] "decided it's better to keep the existing redirect on the cluster" [dns] - 10https://gerrit.wikimedia.org/r/593478 (owner: 10Dzahn) [10:41:57] (03PS4) 10Dzahn: remove https://transparency-private.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/593181 (https://phabricator.wikimedia.org/T188362) [10:43:55] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is 
OK: HTTP OK: HTTP/1.0 200 OK - 22732 bytes in 7.444 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:44:51] (03PS1) 10Dzahn: remove transparency-private.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/593479 [10:45:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: cleanup unused openstack components [puppet] - 10https://gerrit.wikimedia.org/r/593223 (owner: 10Arturo Borrero Gonzalez) [10:45:58] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22232/" [puppet] - 10https://gerrit.wikimedia.org/r/593181 (https://phabricator.wikimedia.org/T188362) (owner: 10Dzahn) [10:46:44] arturo: feel free to merge both [10:47:13] mutante: ? [10:47:40] you refer to puppet.git? I just merged a patch of mine, but I only saw that one [10:47:51] arturo: when running puppet-merge i saw it was locked by you. but it's all good now. i got it [10:48:01] (03CR) 10Elukey: java: update java.security (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [10:48:02] cool! thanks [10:48:24] (03CR) 10Elukey: "Adding also Gehel since this affects ES too. Should be fine but better to give an heads up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [10:48:48] jbond42: thanks a lot for --^ [10:51:03] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:51:09] elukey: np although moritzm deserves most the thanks ;) [10:51:47] !log bromine,vega,miscweb[12]002: rm -rf /srv/org/wikimedia/TransparencyReport-private [10:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:56] (03CR) 10Dzahn: [C: 03+2] remove transparency-private.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/593479 (owner: 10Dzahn) [10:53:00] (03PS2) 10Dzahn: remove transparency-private.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/593479 [10:54:12] jbond42: he loves jvms so I think it is a pleasure for him to work on this :D [10:54:36] heheh :DDD [10:58:31] (03CR) 10Volans: "Just an initial and partial pass, I didn't look at the changes after mid-file for now. Replying to some comment inline." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [11:00:01] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22728 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1100). [11:00:04] Majavah: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:25] I can SWAT today! 
[11:00:33] great, thanks [11:00:57] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593455 (https://phabricator.wikimedia.org/T251485) (owner: 10Majavah) [11:01:02] (03CR) 10Gehel: [C: 03+1] "LGTM, I don't see anything that should break elasticsearch (but there might be things going on behind the scene that I don't know about). " [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [11:01:22] CptViraj: are you available to test the hiwikisource import patch? [11:02:01] (03Merged) 10jenkins-bot: Enable transwiki import from wikidata, frwikisource and hiwikibooks in hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593455 (https://phabricator.wikimedia.org/T251485) (owner: 10Majavah) [11:02:34] (03CR) 10Muehlenhoff: java: update java.security (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [11:02:38] Majavah: I can do that as a steward [11:02:47] sure, works as well [11:03:30] seems to work, syncing [11:03:46] o/ [11:03:47] (03CR) 10Ayounsi: [C: 03+1] "Tested with diff/no-diff/commit and can't find nits in the code." [software/homer] - 10https://gerrit.wikimedia.org/r/593475 (owner: 10Volans) [11:03:59] hey Seddon [11:04:22] Hey Urbanecm :D [11:05:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 6572e25: Enable transwiki import from wikidata, frwikisource and hiwikibooks in hiwikisource (T251485) (duration: 01m 12s) [11:05:36] (03CR) 10Volans: [C: 03+2] junos: do not commit check on empty diff [software/homer] - 10https://gerrit.wikimedia.org/r/593475 (owner: 10Volans) [11:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:38] ;Majavah done! 
[11:05:39] T251485: Enable import from more wikis on hiwikisource - https://phabricator.wikimedia.org/T251485 [11:05:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1084', diff saved to https://phabricator.wikimedia.org/P11098 and previous config saved to /var/cache/conftool/dbconfig/20200430-110539-marostegui.json [11:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:52] Urbanecm: thanks as always [11:07:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1091', diff saved to https://phabricator.wikimedia.org/P11099 and previous config saved to /var/cache/conftool/dbconfig/20200430-110721-marostegui.json [11:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:40] !log Deploy schema change on db1091 [11:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:22] Urbanecm: I didn't save the wiki page to add my patch to the list. [11:08:49] Can I save and it be done or do I need to wait for the next window? [11:09:10] Seddon: sure :) [11:09:13] (03Merged) 10jenkins-bot: junos: do not commit check on empty diff [software/homer] - 10https://gerrit.wikimedia.org/r/593475 (owner: 10Volans) [11:09:37] Urbanecm: Done! [11:09:50] https://gerrit.wikimedia.org/r/#/c/592924/ [11:10:51] Seddon: is that associated with a Phabricator task? 
[11:11:08] https://phabricator.wikimedia.org/T242855 [11:12:12] that seems to be missing the "Bug: " line on commit message [11:13:45] Thanks seddon [11:14:10] (03PS4) 10Urbanecm: Uncoupling graphoid on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592924 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [11:14:20] (03PS5) 10Urbanecm: Uncoupling graphoid on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592924 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [11:14:34] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592924 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [11:14:48] Seddon: can you test this change once it's at mwdebug1001? [11:15:14] Urbanecm: Yep [11:15:23] (03Merged) 10jenkins-bot: Uncoupling graphoid on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592924 (https://phabricator.wikimedia.org/T242855) (owner: 10Seddon) [11:15:23] Great [11:16:33] Seddon: available at mwdebug1001, lmk [11:20:12] Urbanecm: see comments in T249643 [11:20:13] T249643: Restore the "reviewer" group for fawiki - https://phabricator.wikimedia.org/T249643 [11:21:35] (03CR) 10Jbond: "thanks updated" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [11:22:08] (03CR) 10Volans: [C: 03+1] "The change looks ok to me and the script nicer, thanks for the fixes. I didn't tested it though." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589076 (owner: 10RLazarus) [11:24:24] Seddon: are you looking at i [11:24:26] T? 
[11:27:26] Urbanecm: I'd just sync it [11:27:47] I need to be better prepared for testing next time [11:28:53] okay, let's try it [11:29:23] 10Operations, 10ops-eqiad, 10DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (10Jclark-ctr) @Marostegui deleted current Raid, Recreated raid 10 with stripe size 256 [11:29:38] syncing [11:30:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 83e1475: Uncoupling graphoid on testwiki (T242855) (duration: 01m 06s) [11:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:41] T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 [11:30:56] (03CR) 10Volans: "replies inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [11:31:08] Seddon: done [11:31:40] Urbanecm: I've got the extension installed and tested as you requested as well and all seems good [11:31:44] 10Operations, 10ops-eqiad, 10DC-Ops: Recreate RAID on labsdb1011 - https://phabricator.wikimedia.org/T251481 (10Marostegui) 05Open→03Resolved Excellent - Thank you, I see the host is now in the installer menu! I will take it from here! Thank you! [11:31:59] Sorry for the holdup on that. 
First time doing this [11:33:20] Seddon: I'm glad it works now :) [11:34:01] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 9 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (10Johan) [11:35:23] PROBLEM - Host restbase1025 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:59] ^ this is me, DIMM repairs [11:37:29] PROBLEM - Host restbase1025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:38:15] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10Jclark-ctr) replaced failed Dimm [11:40:57] RECOVERY - Host restbase1025 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [11:43:21] RECOVERY - Host restbase1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [11:43:52] 10Operations, 10DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1077.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202004301143_... 
[11:48:11] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10hnowlan) 05Open→03Resolved [11:54:08] !log running `aborrero@apt1001:~ $ sudo -i reprepro --delete clearvanished` to clean unused openstack components and packages (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/593223) [11:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:21] !log updating tiff on stretch [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:46] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1200) [12:02:17] !log disable puppet in apt1001 to briefly test a reprepro pull filter before merging a proper patch [12:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:04] 10Operations, 10DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1077.eqiad.wmnet'] ` and were **ALL** successful. 
[12:06:53] 10Operations, 10LDAP-Access-Requests: LDAP-Access-Requests for Superset - https://phabricator.wikimedia.org/T251516 (10PDas) [12:09:00] !log rolling restart of thumbor service [12:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:51] PROBLEM - PHP opcache health on wtp1032 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:11:53] PROBLEM - PHP opcache health on wtp1042 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:11:59] PROBLEM - PHP opcache health on wtp1033 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:11] PROBLEM - PHP opcache health on wtp1047 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:11] PROBLEM - PHP opcache health on wtp1045 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:11] PROBLEM - PHP opcache health on wtp1048 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:43] PROBLEM - PHP opcache health on wtp1037 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:47] PROBLEM - PHP opcache health on wtp1036 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:49] PROBLEM - PHP opcache health on wtp1025 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:49] PROBLEM - PHP opcache health on wtp1029 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:12:49] PROBLEM - PHP opcache health on wtp1039 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:13:17] PROBLEM - PHP opcache health on wtp1031 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:13:57] ^^ this was caused by me i did a rolling restart of php-fpm on the wtp servers [12:13:58] uhm... why all the old parsoid servers at once [12:14:01] aha :) [12:14:03] PROBLEM - PHP opcache health on wtp1043 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:14:05] thanks Bsadowski1 [12:14:10] thanks jbond42 [12:14:19] RECOVERY - PHP opcache health on wtp1036 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:14:21] RECOVERY - PHP opcache health on wtp1029 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:14:21] RECOVERY - PHP opcache health on wtp1039 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:14:33] sorry i should have done this more gradully [12:14:44] 10Operations, 10DBA: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (10Kormat) Success: using `db1007) ;; \` in `netboot.cfg` achieved the (short-term) goal of allowing us to use manual partitioning. 
[12:14:51] RECOVERY - PHP opcache health on wtp1042 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:14:52] np, the recoveries look good [12:14:59] RECOVERY - PHP opcache health on wtp1033 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:15:01] ok great and phew [12:15:13] RECOVERY - PHP opcache health on wtp1047 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:15:13] RECOVERY - PHP opcache health on wtp1045 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:15:37] RECOVERY - PHP opcache health on wtp1043 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:15:49] PROBLEM - PHP opcache health on wtp1034 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:16:23] RECOVERY - PHP opcache health on wtp1032 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:16:23] RECOVERY - PHP opcache health on wtp1031 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:16:45] RECOVERY - PHP opcache health on wtp1048 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:22:15] (03PS1) 10Kormat: Revert "install_server: Allow reimage of db1077" [puppet] - 10https://gerrit.wikimedia.org/r/593490 (https://phabricator.wikimedia.org/T251392) [12:25:39] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Allow reimage of db1077" [puppet] - 10https://gerrit.wikimedia.org/r/593490 (https://phabricator.wikimedia.org/T251392) (owner: 10Kormat) [12:27:15] 
what did I do, mutante [12:27:30] ? [12:27:31] :P [12:28:23] Bsadowski1: sorry, it was just my mistake with autocomplete [12:29:59] Bsadowski1: thanks for all your work on anti-vandalism and commons uploading :) [12:32:59] RECOVERY - PHP opcache health on wtp1034 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:35:02] !log rolling restart of php7.2-fpm on mw1* servers [12:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:27] 10Operations, 10SRE-tools, 10observability: cookbook sre.hosts.downtime: add feature to remove downtimes - https://phabricator.wikimedia.org/T251519 (10Dzahn) [12:38:59] (03PS1) 10Marostegui: install_server: Reimage labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/593493 (https://phabricator.wikimedia.org/T249188) [12:39:28] 10Operations, 10SRE-tools, 10observability: cookbook sre.hosts.downtime: add feature to remove downtimes - https://phabricator.wikimedia.org/T251519 (10Dzahn) p:05Triage→03Medium [12:40:21] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/593493 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [12:41:47] RECOVERY - PHP opcache health on wtp1037 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:43:15] (03PS5) 10Elukey: Refactor the exporter to support metrics specs via config file [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/592261 [12:43:24] (03PS1) 10Marostegui: labsdb1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/593494 [12:43:59] (03PS2) 10Marostegui: labsdb1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/593494 (https://phabricator.wikimedia.org/T249188) [12:44:09] !log re-enable puppet in apt1001 [12:44:09] (03CR) 10Elukey: "@Busecolak, I added some metrics to the conf example files, nothing more. 
I'll try to package this code and deploy to one of my hosts, and" [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/592261 (owner: 10Elukey) [12:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:37] 10Operations, 10SRE-tools, 10observability: cookbook sre.hosts.downtime: add feature to remove downtimes - https://phabricator.wikimedia.org/T251519 (10Volans) @Dzahn In any cookbook you can use the context manager [[ https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.Icing... [12:44:38] (03CR) 10Marostegui: [C: 03+2] labsdb1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/593494 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [12:45:45] 10Operations, 10LDAP-Access-Requests: LDAP-Access-Requests for Superset - https://phabricator.wikimedia.org/T251516 (10Aklapper) @PDas: Hi! Is that username correct? https://wikitech.wikimedia.org/wiki/User:Pdas says that it does not exist? Can you please clarify? See https://phabricator.wikimedia.org/project/... [12:46:11] RECOVERY - PHP opcache health on wtp1025 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:47:16] 10Operations, 10LDAP-Access-Requests: LDAP-Access-Requests for Superset - https://phabricator.wikimedia.org/T251516 (10Majavah) >>! In T251516#6096966, @Aklapper wrote: > @PDas: Hi! Is that username correct? https://wikitech.wikimedia.org/wiki/User:Pdas says that it does not exist? Can you please clarify? See... [12:50:26] 10Operations, 10LDAP-Access-Requests: LDAP-Access-Requests for Superset - https://phabricator.wikimedia.org/T251516 (10Aklapper) Majavah: Thanks. Maybe one day I'll learn our confusing LDAP concept... 
[12:53:57] PROBLEM - PHP opcache health on mw2310 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:54:15] PROBLEM - PHP opcache health on mw2312 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:54:21] 10Operations, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10DZierten) Approved thank you! [12:54:29] (03CR) 10Thiemo Kreuz (WMDE): "Oh. I think some people might not feel good having this number around. ;-) Where did the 660 come from?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592761 (owner: 10Ladsgroup) [12:54:29] PROBLEM - PHP opcache health on mw2315 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:54:29] PROBLEM - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:54:53] PROBLEM - PHP opcache health on mw2224 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:54:53] PROBLEM - PHP opcache health on mw2228 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:54:59] PROBLEM - PHP opcache health on mw2240 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:05] PROBLEM - PHP opcache health on mw2165 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:07] PROBLEM - PHP opcache health on mw2168 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:12] these mw alerts are also me, i restarted mw2 php-fpm 20 mins ago [12:55:23] PROBLEM - PHP opcache health on mw2235 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:25] PROBLEM - PHP opcache health on mw2237 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:27] PROBLEM - PHP opcache health on mw2273 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:32] mw1* is being slowly restarted now [12:55:35] godog: volans: [12:55:45] PROBLEM - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:45] PROBLEM - PHP opcache health on mw2170 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:47] PROBLEM - PHP opcache health on mw2226 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:47] PROBLEM - PHP opcache health on mw2276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:55:47] PROBLEM - PHP opcache health on mw2271 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:17] PROBLEM - PHP opcache health on mw2163 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:17] PROBLEM - PHP opcache health on mw2239 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:17] PROBLEM - PHP opcache health on mw2167 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:17] PROBLEM - PHP opcache health on mw2180 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:17] PROBLEM - PHP opcache health on mw2181 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:18] jbond42: I take it you meant /go volans :) [12:56:21] PROBLEM - PHP opcache health on mw2178 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:25] PROBLEM - PHP opcache health on mw2164 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:25] PROBLEM - PHP opcache health on mw2166 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:27] PROBLEM - PHP opcache health on mw2256 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:27] PROBLEM - PHP opcache health on mw2174 is CRITICAL: 
CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:31] yes sorry godog [12:56:33] PROBLEM - PHP opcache health on mw2272 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:33] PROBLEM - PHP opcache health on mw2182 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:33] :) [12:56:35] PROBLEM - PHP opcache health on mw2171 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:37] PROBLEM - PHP opcache health on mw2177 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:43] i will update that script at some point [12:56:49] PROBLEM - PHP opcache health on mw2169 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:49] PROBLEM - PHP opcache health on mw2257 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:51] PROBLEM - PHP opcache health on mw2185 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:55] RECOVERY - PHP opcache health on mw2168 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:57] PROBLEM - PHP opcache health on mw2173 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:10] 
PROBLEM - PHP opcache health on mw2195 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:13] PROBLEM - PHP opcache health on mw2172 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:35] PROBLEM - PHP opcache health on mw2176 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:35] PROBLEM - PHP opcache health on mw2175 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:39] PROBLEM - PHP opcache health on mw2196 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:47] PROBLEM - PHP opcache health on mw2191 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:53] PROBLEM - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:55] PROBLEM - PHP opcache health on mw2194 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:57:57] PROBLEM - PHP opcache health on mw2184 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:58:07] PROBLEM - PHP opcache health on mw2192 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:58:13] PROBLEM - 
PHP opcache health on mw2188 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:58:15] PROBLEM - PHP opcache health on mw2186 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:58:23] PROBLEM - PHP opcache health on mw2197 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:59:27] RECOVERY - PHP opcache health on mw2170 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:59:29] RECOVERY - PHP opcache health on mw2310 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:00:04] liw and brennen: (Dis)respected human, time to deploy Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1300). Please do the needful. 
[13:00:07] RECOVERY - PHP opcache health on mw2166 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:00:57] RECOVERY - PHP opcache health on mw2273 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:01:01] (03PS1) 10Lars Wirzenius: all wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593498 [13:01:03] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593498 (owner: 10Lars Wirzenius) [13:01:17] RECOVERY - PHP opcache health on mw2175 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:01:19] RECOVERY - PHP opcache health on mw2276 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:01:34] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593498 (owner: 10Lars Wirzenius) [13:01:35] RECOVERY - PHP opcache health on mw2268 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:01:47] RECOVERY - PHP opcache health on mw2163 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:01:49] RECOVERY - PHP opcache health on mw2167 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:02:03] RECOVERY - PHP opcache health on mw2272 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:02:19] PROBLEM - PHP opcache health on mw2327 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:02:23] PROBLEM - PHP 
opcache health on mw2150 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:02:33] PROBLEM - PHP opcache health on mw2152 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:02:43] PROBLEM - PHP opcache health on mw2160 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:07] PROBLEM - PHP opcache health on mw2158 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:21] PROBLEM - PHP opcache health on mw2325 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:21] PROBLEM - PHP opcache health on mw2331 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:21] PROBLEM - PHP opcache health on mw2309 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:41] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: kubeadm-k8s: create versioned components [puppet] - 10https://gerrit.wikimedia.org/r/593499 (https://phabricator.wikimedia.org/T250866) [13:03:43] RECOVERY - PHP opcache health on mw2315 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:43] RECOVERY - PHP opcache health on mw2180 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:47] RECOVERY - PHP opcache health on mw2188 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:49] RECOVERY - PHP opcache health on mw2186 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:49] RECOVERY - PHP opcache health on mw2164 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:51] RECOVERY - PHP opcache health on mw2256 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:53] RECOVERY - PHP opcache health on mw2174 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:57] RECOVERY - PHP opcache health on mw2182 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:57] RECOVERY - PHP opcache health on mw2197 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:59] RECOVERY - PHP opcache health on mw2171 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:00] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.30 [13:04:01] PROBLEM - PHP opcache health on mw2243 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:03] RECOVERY - PHP opcache health on mw2177 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:07] RECOVERY - PHP opcache health on mw2224 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:07] RECOVERY - 
PHP opcache health on mw2228 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:13] RECOVERY - PHP opcache health on mw2327 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:13] RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:13] RECOVERY - PHP opcache health on mw2169 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:15] 10Operations, 10SRE-tools, 10observability: cookbook sre.hosts.downtime: add feature to remove downtimes - https://phabricator.wikimedia.org/T251519 (10Dzahn) Thanks volans! It's great news there is already a method for it. For now just a quick comment on the latter part. If hosts are not "ready to be che... [13:04:17] RECOVERY - PHP opcache health on mw2257 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:17] RECOVERY - PHP opcache health on mw2150 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:17] RECOVERY - PHP opcache health on mw2185 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:19] RECOVERY - PHP opcache health on mw2165 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:23] RECOVERY - PHP opcache health on mw2173 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:37] RECOVERY - PHP opcache health on mw2195 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:37] (03PS1) 10Volans: 
CHANGELOG: add changelogs for release v0.2.1 [software/homer] - 10https://gerrit.wikimedia.org/r/593501 [13:04:39] RECOVERY - PHP opcache health on mw2235 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:39] RECOVERY - PHP opcache health on mw2172 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:04:40] RECOVERY - PHP opcache health on mw2237 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:01] RECOVERY - PHP opcache health on mw2277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:01] RECOVERY - PHP opcache health on mw2176 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:03] RECOVERY - PHP opcache health on mw2226 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:03] RECOVERY - PHP opcache health on mw2271 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:07] RECOVERY - PHP opcache health on mw2196 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:15] RECOVERY - PHP opcache health on mw2191 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:15] RECOVERY - PHP opcache health on mw2325 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:15] RECOVERY - PHP opcache health on mw2309 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:15] RECOVERY - PHP opcache health on mw2331 is OK: 
OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:21] RECOVERY - PHP opcache health on mw2312 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:21] RECOVERY - PHP opcache health on mw2194 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:23] RECOVERY - PHP opcache health on mw2184 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:29] PROBLEM - PHP opcache health on mw2247 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:33] RECOVERY - PHP opcache health on mw2239 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:33] RECOVERY - PHP opcache health on mw2192 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:35] RECOVERY - PHP opcache health on mw2236 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:35] RECOVERY - PHP opcache health on mw2181 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:39] RECOVERY - PHP opcache health on mw2178 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:43] PROBLEM - PHP opcache health on mw2153 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:47] PROBLEM - PHP opcache health on mw2246 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:53] RECOVERY - PHP opcache health on mw2243 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:55] PROBLEM - PHP opcache health on mw2279 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:55] PROBLEM - PHP opcache health on mw2159 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:07] PROBLEM - PHP opcache health on mw2259 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:15] PROBLEM - PHP opcache health on mw2250 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:19] PROBLEM - PHP opcache health on mw2260 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:19] PROBLEM - PHP opcache health on mw2248 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:30] PROBLEM - PHP opcache health on mw2161 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:33] PROBLEM - PHP opcache health on mw2267 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:53] PROBLEM - PHP opcache health on mw2264 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:07:09] 10Operations, 10SRE-tools, 10observability: cookbook sre.hosts.downtime: add feature to remove downtimes - https://phabricator.wikimedia.org/T251519 (10Volans) >>! In T251519#6096994, @Dzahn wrote: > Thanks volans! It's great news there is already a method for it. > > For now just a quick comment on the l... [13:08:03] RECOVERY - PHP opcache health on mw2250 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:09:50] group2 at wmf.30 [13:11:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:55] PROBLEM - PHP opcache health on mw2315 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:12:59] PROBLEM - PHP opcache health on mw2313 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:13:03] PROBLEM - PHP opcache health on mw2311 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:13:35] PROBLEM - PHP opcache health on mw2316 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:13:37] PROBLEM - PHP opcache health on mw2314 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:13:52] (03CR) 10Ayounsi: [C: 03+1] CHANGELOG: add changelogs for release v0.2.1 [software/homer] - 10https://gerrit.wikimedia.org/r/593501 (owner: 10Volans) [13:14:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) [13:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:39] PROBLEM - PHP opcache health on mw2224 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:14:39] PROBLEM - PHP opcache health on mw2228 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:14:39] PROBLEM - PHP opcache health on mw2198 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:14:45] PROBLEM - PHP opcache health on mw2169 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:14:45] PROBLEM - PHP opcache health on mw2240 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:14:51] PROBLEM - PHP opcache health on mw2165 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:00] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.1 [software/homer] - 10https://gerrit.wikimedia.org/r/593501 (owner: 10Volans) [13:15:09] PROBLEM - PHP opcache health on mw2231 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:11] PROBLEM - PHP opcache health on mw2235 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:13] PROBLEM - PHP opcache health on mw2172 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:15] PROBLEM - PHP opcache health on mw2237 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:15] PROBLEM - PHP opcache health on mw2273 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:37] PROBLEM - PHP opcache health on mw2177 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:41] PROBLEM - PHP opcache health on mw2258 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:49] PROBLEM - PHP opcache health on mw2257 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:52] PROBLEM - PHP opcache health on mw2185 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:57] PROBLEM - PHP opcache health on mw2168 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:57] PROBLEM - PHP opcache health on mw2173 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:16:11] RECOVERY - PHP opcache health on mw2231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:16:11] PROBLEM - PHP opcache health on mw2195 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:17:05] RECOVERY - PHP opcache health on mw2314 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:17:39] RECOVERY - PHP opcache health on mw2279 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:18:18] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.2.1 [software/homer] - 10https://gerrit.wikimedia.org/r/593501 (owner: 10Volans) [13:18:39] RECOVERY - PHP opcache health on mw2159 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:18:43] PROBLEM - PHP opcache health on mw2184 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:18:43] PROBLEM - PHP opcache health on mw2196 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:18:51] PROBLEM - PHP opcache health on mw2226 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:19:45] RECOVERY - PHP opcache health on mw2258 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:19:51] PROBLEM - PHP opcache health on mw2171 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:19:55] PROBLEM - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:20:09] RECOVERY - PHP opcache health on mw2248 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:20:59] PROBLEM - PHP opcache health on mw2188 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:21:09] RECOVERY - PHP opcache health on mw2247 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:21:32] RECOVERY - PHP opcache health on mw2267 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:09] RECOVERY - PHP opcache health on mw2275 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:11] RECOVERY - PHP opcache health on mw2259 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:12] PROBLEM - PHP opcache health on mw2167 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:15] PROBLEM - PHP opcache health on mw2180 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:15] PROBLEM - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:15] PROBLEM - PHP opcache health on mw2272 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:27] RECOVERY - PHP opcache health on mw2260 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:35] RECOVERY - PHP opcache health on mw2160 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:22:39] RECOVERY - PHP opcache health on mw2161 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:23:19] PROBLEM - PHP opcache health on mw2312 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:23:23] PROBLEM - PHP opcache health on mw2191 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:23:35] RECOVERY - PHP opcache health on mw2152 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:25:29] PROBLEM - PHP opcache health on mw2230 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:25:41] RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:25:45] RECOVERY - PHP opcache health on mw2158 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:26:55] RECOVERY - PHP opcache health on mw2153 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:27:29] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:29:01] PROBLEM - PHP opcache health on mw2164 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:29:01] PROBLEM - PHP opcache health on mw2175 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:29:25] RECOVERY - PHP opcache health on mw2264 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:29:35] PROBLEM - PHP opcache health on mw2327 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:29:47] RECOVERY - PHP opcache health on mw2268 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:29:51] PROBLEM - PHP opcache health on mw2225 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:30:11] RECOVERY - PHP opcache health on mw2273 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:30:19] RECOVERY - PHP opcache health on mw2175 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:30:21] PROBLEM - PHP opcache health on mw2190 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:30:49] PROBLEM - PHP opcache health on mw2331 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:30:55] RECOVERY - PHP opcache health on mw2188 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:32:05] PROBLEM - PHP opcache health on mw2367 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:32:09] RECOVERY - PHP opcache health on mw2224 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:32:25] PROBLEM - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:32:27] PROBLEM - PHP opcache health on mw2359 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:32:37] PROBLEM - PHP opcache health on mw2375 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:32:39] PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:33:07] PROBLEM - PHP opcache health on mw2369 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:33:24] (CR) Dzahn: httpbb: add tests for miscweb sites (3 comments) [puppet] - https://gerrit.wikimedia.org/r/592883 (https://phabricator.wikimedia.org/T247650) (owner: Dzahn) [13:33:32] PROBLEM - PHP opcache health on mw2371 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:33:32] PROBLEM - PHP opcache health on mw2365 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:33:39] PROBLEM - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:33:45] PROBLEM - PHP opcache health on mw2355 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:34:04] (PS5) Dzahn: httpbb: add tests for miscweb sites [puppet] - https://gerrit.wikimedia.org/r/592883 (https://phabricator.wikimedia.org/T247650) [13:34:05] RECOVERY - PHP opcache health on mw2272 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:34:21] PROBLEM - PHP opcache health on mw2229 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:34:21] RECOVERY - PHP opcache health on mw2246 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:34:24] (PS3) Jbond: cookbooks sre.hosts.rotate-pdu-password: small refactor [cookbooks] - https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) [13:35:09] PROBLEM - PHP opcache health on mw2269 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:35:19] PROBLEM - PHP opcache health on mw2318 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:35:22] PROBLEM - PHP opcache health on mw2217 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:35:35] PROBLEM - PHP opcache health on mw2320 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:35:47] PROBLEM - PHP opcache health on mw2182 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:35:49] PROBLEM - PHP opcache health on 
mw2276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:35:49] (CR) Jbond: [C: -1] "self -1 untill we get rid of the protected function call" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: Jbond) [13:35:57] PROBLEM - PHP opcache health on mw2321 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:03] RECOVERY - PHP opcache health on mw2237 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:19] PROBLEM - PHP opcache health on mw2325 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:35] PROBLEM - PHP opcache health on mw2322 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:39] PROBLEM - PHP opcache health on mw2308 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:41] PROBLEM - PHP opcache health on mw2137 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:55] PROBLEM - PHP opcache health on mw2323 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:57] PROBLEM - PHP opcache health on mw2209 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:59] PROBLEM - PHP opcache health on mw2200 is 
CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:59] PROBLEM - PHP opcache health on mw2146 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:07] PROBLEM - PHP opcache health on mw2223 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:07] RECOVERY - PHP opcache health on mw2180 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:07] PROBLEM - PHP opcache health on mw2205 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:11] PROBLEM - PHP opcache health on mw2208 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:14] (PS1) Volans: README: fix syntax for PyPI [software/homer] - https://gerrit.wikimedia.org/r/593511 [13:37:17] PROBLEM - PHP opcache health on mw2202 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:17] PROBLEM - PHP opcache health on mw2143 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:23] PROBLEM - PHP opcache health on mw2140 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:27] PROBLEM - PHP opcache health on mw2354 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:27] PROBLEM - PHP opcache health on mw2360 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:29] PROBLEM - PHP opcache health on mw1407 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:31] PROBLEM - PHP opcache health on mw2207 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:35] PROBLEM - PHP opcache health on mw2319 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:42] PROBLEM - PHP opcache health on mw2332 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:37:57] PROBLEM - PHP opcache health on mw2135 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:11] PROBLEM - PHP opcache health on mw2141 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:15] PROBLEM - PHP opcache health on mw2370 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:15] PROBLEM - PHP opcache health on mw2138 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:15] PROBLEM - PHP opcache health on mw2136 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:17] PROBLEM - PHP opcache health on mw2142 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:17] PROBLEM - PHP opcache health on mw2144 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:22] RECOVERY - PHP opcache health on mw2230 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:22] PROBLEM - PHP opcache health on mw2201 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:25] PROBLEM - PHP opcache health on mw2328 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:29] PROBLEM - PHP opcache health on mw2214 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:35] PROBLEM - PHP opcache health on mw2221 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:41] PROBLEM - PHP opcache health on mw2358 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:47] RECOVERY - PHP opcache health on mw2315 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:38:57] PROBLEM - PHP opcache health on mw2294 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:01] RECOVERY - PHP opcache health on mw2316 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:01] PROBLEM - PHP opcache health on mw2374 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:15] PROBLEM - PHP opcache health on mw2204 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:15] PROBLEM - PHP opcache health on mw2251 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:19] PROBLEM - PHP opcache health on mw2286 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:21] PROBLEM - PHP opcache health on mw2352 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:33] PROBLEM - PHP opcache health on mw2304 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:45] PROBLEM - PHP opcache health on mw2216 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:45] PROBLEM - PHP opcache health on mw2212 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:47] PROBLEM - PHP opcache health on mw2222 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:47] PROBLEM - PHP opcache health on mw2309 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:47] PROBLEM - PHP opcache health on mw2317 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:49] (CR) Volans: [C: +2] README: fix syntax for PyPI [software/homer] - https://gerrit.wikimedia.org/r/593511 (owner: Volans) [13:39:55] RECOVERY - PHP opcache health on mw2228 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:55] RECOVERY - PHP opcache health on mw2269 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:59] PROBLEM - PHP opcache health on mw2293 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:39:59] PROBLEM - PHP opcache health on mw2245 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:03] PROBLEM - PHP opcache health on mw2295 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:09] PROBLEM - PHP opcache health on mw2376 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:09] PROBLEM - PHP opcache health on mw2368 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:13] PROBLEM - PHP opcache health on 
mw2210 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:13] RECOVERY - PHP opcache health on mw2169 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:15] PROBLEM - PHP opcache health on mw2291 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:15] PROBLEM - PHP opcache health on mw2253 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:15] PROBLEM - PHP opcache health on mw2287 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:17] PROBLEM - PHP opcache health on mw2372 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:19] PROBLEM - PHP opcache health on mw2297 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:23] PROBLEM - PHP opcache health on mw2256 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:33] PROBLEM - PHP opcache health on mw2302 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:33] PROBLEM - PHP opcache health on mw2299 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:33] PROBLEM - PHP opcache health on mw2261 is CRITICAL: CRITICAL: opcache 
cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:35] PROBLEM - PHP opcache health on mw2292 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:37] PROBLEM - PHP opcache health on mw2298 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:51] PROBLEM - PHP opcache health on mw2300 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:52] PROBLEM - PHP opcache health on mw2306 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:52] PROBLEM - PHP opcache health on mw2284 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:53] PROBLEM - PHP opcache health on mw2362 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:57] RECOVERY - PHP opcache health on mw2172 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:57] RECOVERY - PHP opcache health on mw2311 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:40:57] PROBLEM - PHP opcache health on mw2285 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:07] RECOVERY - PHP opcache health on mw2164 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:11] PROBLEM - PHP opcache health on mw2296 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:11] PROBLEM - PHP opcache health on mw2289 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:12] PROBLEM - PHP opcache health on mw2283 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:35] PROBLEM - PHP opcache health on mw2252 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:39] PROBLEM - PHP opcache health on mw2262 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:55] PROBLEM - PHP opcache health on mw2356 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:57] RECOVERY - PHP opcache health on mw2167 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:41:59] PROBLEM - PHP opcache health on mw2218 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:01] PROBLEM - PHP opcache health on mw2192 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:01] PROBLEM - PHP opcache health on mw2197 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:01] PROBLEM - PHP opcache health on mw2211 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:03] PROBLEM - PHP opcache health on mw2288 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:03] PROBLEM - PHP opcache health on mw2290 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:03] PROBLEM - PHP opcache health on mw2330 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:15] PROBLEM - PHP opcache health on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:37] RECOVERY - PHP opcache health on mw2235 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:43:05] (Merged) jenkins-bot: README: fix syntax for PyPI [software/homer] - https://gerrit.wikimedia.org/r/593511 (owner: Volans) [13:43:11] PROBLEM - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:43:39] PROBLEM - PHP opcache health on mw2244 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:21] PROBLEM - PHP opcache health on mw2206 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:21] 
PROBLEM - PHP opcache health on mw2220 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:22] PROBLEM - PHP opcache health on mw2307 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:23] PROBLEM - PHP opcache health on mw2326 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:23] PROBLEM - PHP opcache health on mw2334 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:23] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:23] PROBLEM - PHP opcache health on mw2366 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:23] PROBLEM - PHP opcache health on mw2364 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:44:51] (CR) Ema: [C: +2] Add license and copyright notices [software/purged] - https://gerrit.wikimedia.org/r/593461 (owner: Ema) [13:44:53] Operations, Traffic, serviceops, Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (BBlack) I like the root event timestamp info. We could potentially put in future rules to help by ignoring ancient purges, in some cases (e.g. if we can guarante... 
[13:45:11] RECOVERY - PHP opcache health on mw2184 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:46:04] (PS1) Jcrespo: mariadb: Fix bacula alerts documentation urls [puppet] - https://gerrit.wikimedia.org/r/593513 (https://phabricator.wikimedia.org/T234900) [13:46:43] PROBLEM - PHP opcache health on mw2203 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:46:47] PROBLEM - PHP opcache health on mw2373 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:46:49] RECOVERY - PHP opcache health on mw2141 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:47:09] PROBLEM - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:47:11] RECOVERY - PHP opcache health on mw2313 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:48:30] PROBLEM - PHP opcache health on mw2176 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:48:30] PROBLEM - PHP opcache health on mw2178 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:48:32] PROBLEM - PHP opcache health on mw2215 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:48:34] PROBLEM - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:48:34] PROBLEM - PHP opcache health on mw2324 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:49:16] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [13:51:13] (03PS1) 10Volans: Upstream release v0.2.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/593514 [13:52:12] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:52:26] RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:46] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:06] RECOVERY - PHP opcache health on mw2364 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:53:18] RECOVERY - PHP opcache health on mw2365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:53:50] RECOVERY - PHP opcache health on mw2226 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:53:52] RECOVERY - PHP opcache health on mw2372 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:54:29] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10KartikMistry) [13:56:54] RECOVERY - PHP 
opcache health on mw2308 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:58:14] RECOVERY - PHP opcache health on mw2295 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:58:48] RECOVERY - PHP opcache health on mw2220 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:58:50] RECOVERY - PHP opcache health on mw2325 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:59:14] RECOVERY - PHP opcache health on mw2362 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:00:04] RECOVERY - PHP opcache health on mw2135 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:00:26] RECOVERY - PHP opcache health on mw2203 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:00:30] RECOVERY - PHP opcache health on mw2191 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:01:37] !log upgrade trafficserver to version 8.0.7-1wm2 on cp4025 and cp4031 [14:01:40] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:26] RECOVERY - PHP opcache health on mw2287 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:03:04] RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:08] RECOVERY - PHP opcache health on mw2195 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:03:08] RECOVERY - PHP opcache health on mw2197 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:03:16] RECOVERY - PHP opcache health on mw2136 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:03:24] RECOVERY - PHP opcache health on mw2198 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:03:48] RECOVERY - PHP opcache health on mw2356 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:03:50] RECOVERY - PHP opcache health on mw2257 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:04:55] (03PS12) 10Krinkle: Switch "wait for replica" method to use GTIDs for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [14:05:04] RECOVERY - PHP opcache health on mw2277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:05:04] RECOVERY - PHP opcache health on mw2214 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:05:11] (03PS1) 10Ema: ATS: disabling transaction_active_timeout_in for test eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/593517 (https://phabricator.wikimedia.org/T242767) [14:05:23] marostegui: just a minute, looking at opcache stuff [14:05:32] RECOVERY - PHP opcache health on mw2354 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:05:32] RECOVERY - PHP opcache health on mw2334 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:05:46] PROBLEM - PHP opcache health on wtp2010 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:05:46] RECOVERY - PHP opcache health on mw2366 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:06:07] Krinkle: np [14:06:24] _joe_: that's a lot of opcache health at once.. [14:06:44] PROBLEM - PHP opcache health on wtp2012 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:06:52] all codfw I think.. [14:07:00] RECOVERY - PHP opcache health on mw2185 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:20] RECOVERY - PHP opcache health on mw2289 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:32] RECOVERY - PHP opcache health on mw1407 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:33] 10Operations, 10Parsoid, 10RESTBase, 10Traffic, and 2 others: HTTP 400 Error when trying to save an edit on English Wikipedia: Error contacting the Parsoid/RESTBase server - https://phabricator.wikimedia.org/T250815 (10Pchelolo) 05Open→03Resolved a:03Pchelolo >>! In T250815#6096169, @ema wrote: >>>!... 
[14:07:34] RECOVERY - PHP opcache health on mw2190 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:34] RECOVERY - PHP opcache health on mw2182 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:36] RECOVERY - PHP opcache health on mw2288 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:37] Krinkle: was a restart of php-fpm [14:07:38] RECOVERY - PHP opcache health on mw2307 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:38] PROBLEM - PHP opcache health on wtp2018 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:41] across the fleet [14:07:53] volans: oh its cachehit, not cache size [14:07:58] I didn't know we had an alert for that [14:08:00] RECOVERY - PHP opcache health on mw2318 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:01] ok, carryon :D [14:08:02] PROBLEM - PHP opcache health on wtp2003 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:10] RECOVERY - PHP opcache health on mw2291 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:17] marostegui: alright, wanna go ahead? [14:08:26] Krinkle: let's go for db-codfw [14:08:26] I'm prepping some test urls and Logstash now [14:08:44] cool I will do some testing with intermediate masters too [14:08:47] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (10ema) >>! 
In T248736#6094878, @Ottomata wrote: > Is there a more permanent fix? Any idea why ATS was leaking the socket FDs? Nope, we'll try to reproduce next week in isolation with h... [14:08:48] PROBLEM - PHP opcache health on wtp2007 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:50] RECOVERY - PHP opcache health on mw2236 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:51] marostegui: want me to split the patch or sync-file is ok? [14:08:54] PROBLEM - PHP opcache health on wtp2002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:54] PROBLEM - PHP opcache health on wtp2016 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:08:58] RECOVERY - PHP opcache health on mw2256 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:00] Krinkle: we can just sync-file db-codfw.php [14:09:04] RECOVERY - PHP opcache health on mw2370 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:04] PROBLEM - PHP opcache health on wtp2009 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:05] ok [14:09:10] (03CR) 10Krinkle: [C: 03+2] Switch "wait for replica" method to use GTIDs for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [14:09:28] RECOVERY - PHP opcache health on mw2324 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:36] RECOVERY 
- PHP opcache health on mw2146 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:42] RECOVERY - PHP opcache health on mw2312 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:48] 10Operations, 10DNS, 10Traffic: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (10Reedy) [14:09:50] RECOVERY - PHP opcache health on mw2320 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:09:57] (03CR) 10Jcrespo: [C: 03+2] mariadb: Fix bacula alerts documentation urls [puppet] - 10https://gerrit.wikimedia.org/r/593513 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:10:10] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22732 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:10:11] (03Merged) 10jenkins-bot: Switch "wait for replica" method to use GTIDs for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [14:10:12] PROBLEM - PHP opcache health on wtp2008 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:10:14] PROBLEM - PHP opcache health on wtp2006 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:10:26] RECOVERY - PHP opcache health on mw2216 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:10:35] <_joe_> I have a few things to tune further in that alert it seems [14:10:42] PROBLEM - PHP opcache health on wtp2019 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 
99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:10:56] RECOVERY - PHP opcache health on mw2328 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:11:06] RECOVERY - PHP opcache health on mw2368 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:11:46] !log rolling restart of ats-tls on text@esams - T249335 [14:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:52] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [14:12:02] RECOVERY - PHP opcache health on mw2285 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:04] RECOVERY - PHP opcache health on mw2223 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:04] RECOVERY - PHP opcache health on mw2229 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:06] RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:10] RECOVERY - PHP opcache health on mw2206 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:20] RECOVERY - PHP opcache health on mw2284 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:26] RECOVERY - PHP opcache health on mw2142 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:27] (03PS6) 10Elukey: Refactor the exporter to support metrics specs via config file [software/druid_exporter] - 
10https://gerrit.wikimedia.org/r/592261 [14:12:48] RECOVERY - PHP opcache health on mw2332 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:56] RECOVERY - PHP opcache health on mw2253 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:13:06] marostegui: I have it staged on mwdebug2001 to start with. [14:13:12] PROBLEM - PHP opcache health on wtp2005 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:13:36] Krinkle: going there! [14:13:48] RECOVERY - PHP opcache health on mw2299 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:13:50] RECOVERY - PHP opcache health on mw2360 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:13:50] !log Stop slave on es2020 for testing [14:13:52] RECOVERY - PHP opcache health on mw2374 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:10] RECOVERY - PHP opcache health on mw2192 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:14:12] RECOVERY - PHP opcache health on mw2225 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:14:21] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251523 (10soworu) [14:14:42] RECOVERY - PHP opcache health on mw2286 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health 
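A note on the opcache alerts flooding this part of the log: the check fires when the cache-hit ratio is below 99.85%, and a php-fpm restart empties the cache, so each host starts out with misses and should recover by itself as hits accumulate. Below is a rough Python sketch of that arithmetic. It assumes the ratio is simply hits / (hits + misses); the function names and sample counters are invented for illustration, and the real check may compute things differently.

```python
# Threshold quoted in the alert text: 99.85% == 9985 / 10000.
THRESHOLD_NUM = 9985
THRESHOLD_DEN = 10000


def hit_ratio_pct(hits: int, misses: int) -> float:
    """Cache-hit ratio as a percentage (100.0 when nothing was requested yet)."""
    total = hits + misses
    return 100.0 if total == 0 else 100.0 * hits / total


def is_critical(hits: int, misses: int) -> bool:
    """True when hits / (hits + misses) is below the threshold.

    Integer arithmetic avoids floating-point surprises right at the boundary.
    """
    return hits * THRESHOLD_DEN < THRESHOLD_NUM * (hits + misses)


def hits_needed(misses: int) -> int:
    """Smallest hit count that reaches the threshold for a fixed miss count.

    From hits / (hits + misses) >= NUM / DEN it follows that
    hits >= NUM * misses / (DEN - NUM); the result is rounded up.
    """
    gap = THRESHOLD_DEN - THRESHOLD_NUM
    return -(-THRESHOLD_NUM * misses // gap)  # ceiling division


# Hypothetical host shortly after a restart: 3000 scripts compiled (misses),
# only 1000 requests served from cache so far.
print(is_critical(1000, 3000))   # still alerting
print(hits_needed(3000))         # hits required before the check clears
```

With a threshold this strict, a few thousand cold-start misses need on the order of millions of hits to wash out, which would fit the recoveries trickling in over many minutes.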
[14:14:46] RECOVERY - PHP opcache health on mw2205 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:14:48] (03PS2) 10Ppchelko: EventBus: Switch to namespaced class names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/590437 [14:15:10] RECOVERY - PHP opcache health on mw2276 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:15:22] (03PS1) 10Kormat: admin: Add /home comforts for kormat. [puppet] - 10https://gerrit.wikimedia.org/r/593519 [14:15:28] PROBLEM - PHP opcache health on wtp2020 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:15:28] PROBLEM - PHP opcache health on wtp2011 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:15:31] Krinkle: I have created a lock on deploy1001 to avoid any issues with unexpected deployments [14:15:40] (03PS2) 10Ema: ATS: disable transaction_active_timeout_in for test eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/593517 (https://phabricator.wikimedia.org/T242767) [14:15:48] RECOVERY - PHP opcache health on mw2293 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:15:48] RECOVERY - PHP opcache health on mw2245 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:15:48] (03CR) 10Ppchelko: "This changes this depends on has been merged, this can now be merged and deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/590437 (owner: 10Ppchelko) [14:15:56] 10Operations, 10LDAP-Access-Requests: LDAP-Access-Requests for Superset - https://phabricator.wikimedia.org/T251516 (10PDas) My bad. The username was Praveen Das. 
Thanks for your help -- Best, Praveen Das (he/him) Partnerships Manager, South Asia Wikimedia Foundation pdas@wikimedia.org Mob: +91 - 8951323603 [14:15:59] marostegui: ok. I had one as well I think [14:16:06] RECOVERY - PHP opcache health on mw2321 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:08] RECOVERY - PHP opcache health on mw2217 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:08] ah ok [14:16:19] $ touch /var/lock/scap-global-lock [14:16:21] is what I usually do [14:16:22] RECOVERY - PHP opcache health on mw2358 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:25] anyway, no worries [14:16:27] Krinkle: same :) [14:16:28] RECOVERY - PHP opcache health on mw2204 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:30] RECOVERY - PHP opcache health on mw2298 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:30] RECOVERY - PHP opcache health on mw2251 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:30] RECOVERY - PHP opcache health on mw2290 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:30] RECOVERY - PHP opcache health on mw2306 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:54] RECOVERY - PHP opcache health on mw2208 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:16:56] RECOVERY - PHP opcache health on mw2261 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:17:18] RECOVERY - PHP opcache health on mw2375 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:17:24] RECOVERY - PHP opcache health on mw2296 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:17:26] RECOVERY - PHP opcache health on mw2138 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:17:32] marostegui: browsing pages, and parsing wikitext, / parser cache hit seems to work fine [14:17:37] trying FLow now [14:17:43] tracking at https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [14:17:47] ok, I haven't seen anything weird on eswiki either [14:17:52] RECOVERY - PHP opcache health on mw2376 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:17:52] RECOVERY - PHP opcache health on mw2323 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:18:00] RECOVERY - PHP opcache health on mw2244 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:18:16] RECOVERY - PHP opcache health on mw2292 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:18:36] RECOVERY - PHP opcache health on mw2283 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:18:38] RECOVERY - PHP opcache health on mw2168 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:18:38] RECOVERY - PHP opcache health on mw2173 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:18:40] RECOVERY - PHP opcache health on mw2218 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:00] RECOVERY - PHP opcache health on mw2317 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:04] RECOVERY - PHP opcache health on mw2322 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:16] PROBLEM - PHP opcache health on wtp2015 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:16] RECOVERY - PHP opcache health on mw2252 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:26] RECOVERY - PHP opcache health on mw2300 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:42] RECOVERY - PHP opcache health on mw2352 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:00] RECOVERY - PHP opcache health on mw2140 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:04] RECOVERY - PHP opcache health on mw2294 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:04] (03CR) 10Kormat: [C: 03+2] admin: Add /home comforts for kormat. 
[puppet] - 10https://gerrit.wikimedia.org/r/593519 (owner: 10Kormat) [14:20:14] PROBLEM - PHP opcache health on wtp2004 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:14] PROBLEM - PHP opcache health on wtp2014 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:30] RECOVERY - PHP opcache health on mw2304 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:50] PROBLEM - PHP opcache health on wtp2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:54] PROBLEM - PHP opcache health on wtp2013 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:56] RECOVERY - PHP opcache health on mw2330 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:20:59] marostegui: reads seem fine, next I'll make some Flow edits via eqiad, and try to make my codfw/mwdebug profile await them [14:21:12] RECOVERY - PHP opcache health on mw2201 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:21:12] Krinkle: sounds good [14:21:34] Krinkle: once we deployed eqiad one, I will try to generate some lag on an es4 slave [14:21:36] RECOVERY - PHP opcache health on mw2200 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:22:20] RECOVERY - PHP opcache health on mw2211 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:22:30] PROBLEM - PHP opcache 
health on wtp2017 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:22:35] (03CR) 10Muehlenhoff: "Some comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593499 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [14:22:38] RECOVERY - PHP opcache health on mw2212 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:22:42] RECOVERY - PHP opcache health on mw2373 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:23:19] (03CR) 10Muehlenhoff: "Also adding Alex and Janis for comments, not sure if kubeadm might also be used for the prod k8s setup going forward." [puppet] - 10https://gerrit.wikimedia.org/r/593499 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [14:24:25] !log upgrade trafficserver to 8.0.7-1wm2 on cp[5006,5011] [14:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:45] 10Operations, 10DNS: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (10Reedy) [14:25:36] RECOVERY - PHP opcache health on mw2302 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:25:57] marostegui: lgtm. [14:26:07] marostegui: shall I stage it on mwdebug1001 now? 
[14:26:22] RECOVERY - PHP opcache health on mw2371 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:26:26] Krinkle: let's go for that [14:26:46] (done) [14:28:57] (03CR) 10Muehlenhoff: apereo_cas: add more timeout values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587515 (owner: 10Jbond) [14:29:38] RECOVERY - PHP opcache health on mw2165 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:30:56] (03PS1) 10Elukey: profile::statistics::explorer::misc_jobs: get more jobs from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/593521 (https://phabricator.wikimedia.org/T249754) [14:31:41] marostegui: all works fine, although I do feel that writes are *very* slow, but that could just be mwdebug. [14:31:50] Let me try the other mwdebug in eqiad where I didn't stage yet [14:31:57] ok [14:32:22] RECOVERY - PHP opcache health on mw2215 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:32:23] meh, it's the same, ok. [14:32:26] marostegui: so next? [14:32:47] (03PS2) 10Elukey: profile::statistics::explorer::misc_jobs: get more jobs from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/593521 (https://phabricator.wikimedia.org/T249754) [14:32:51] Krinkle: let's deploy db-codfw entirely? [14:32:55] marostegui: [14:32:56] "No active GTIDs in 171966572-171966572-896177597,171974667-171974667-815842459,180355159-180355159-115369055 share a domain with those in db1120-bin.870/913658331" [14:33:10] that's x1 [14:33:11] let me check [14:33:44] It's possible that this is because I switched between two eqiad servers where one has it and one does not.
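For readers puzzling over the quoted error: MariaDB GTIDs have the form domain-server-sequence, and two positions can only be compared when they share a replication domain. One side of the message, db1120-bin.870/913658331, looks like a binlog file/offset rather than a GTID list, so no domain can match; that would be consistent with hopping between a server that has the new config and one that does not. Below is a small Python sketch of such a domain check. The helper names are hypothetical and this is not MediaWiki's actual code.

```python
import re

# MariaDB GTID text form: <domain_id>-<server_id>-<sequence_number>
GTID_RE = re.compile(r"^(\d+)-(\d+)-(\d+)$")


def parse_gtids(position: str) -> dict:
    """Map domain id -> (server id, sequence) for a comma-separated GTID list.

    Entries that are not GTIDs (for example a binlog file/offset pair)
    are skipped, so such a position yields an empty mapping.
    """
    gtids = {}
    for part in position.split(","):
        match = GTID_RE.match(part.strip())
        if match:
            domain, server, seq = (int(x) for x in match.groups())
            gtids[domain] = (server, seq)
    return gtids


def shared_domains(a: str, b: str) -> set:
    """Replication domains present in both positions."""
    return parse_gtids(a).keys() & parse_gtids(b).keys()


# Values copied from the error message in the log.
session_pos = (
    "171966572-171966572-896177597,"
    "171974667-171974667-815842459,"
    "180355159-180355159-115369055"
)
replica_pos = "db1120-bin.870/913658331"

print(shared_domains(session_pos, replica_pos))  # empty: nothing to wait on
```

When both sides are GTID lists from the same cluster, the intersection is non-empty and a wait on the matching domain becomes possible.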
[14:33:56] I'll see if it is recurring if I just browse within the new one [14:33:56] but that is the master [14:34:38] what I mean is, it is possible that the new server stored that my session has GTIDs and then the old server tries to wait for them but isn't allowed to query them per the old config. [14:36:26] yeah, it only happens on the "old" eqiad server [14:36:28] RECOVERY - PHP opcache health on mw2297 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:36:43] marostegui: ok, syncing db-codfw [14:36:45] Krinkle: yeah, I cannot see anything wrong with those particular GTIDs [14:36:46] cool [14:37:21] thx [14:37:28] RECOVERY - PHP opcache health on mw2207 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:38:04] RECOVERY - PHP opcache health on mw2359 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:38:13] !log krinkle@deploy1001 Synchronized wmf-config/db-codfw.php: I46d2b811f6287689 (duration: 00m 57s) [14:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] (03PS2) 10RLazarus: maintenance: Migrate purge_checkuser to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589377 (https://phabricator.wikimedia.org/T211250) [14:38:23] * Krinkle re-locks [14:38:38] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate purge_checkuser to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589377 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [14:38:40] * marostegui checking errors [14:38:58] is there an easy way to filter in Logstash for only host:mw2* or something like that?
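On the host:mw2* question: the trailing * is a wildcard, so the pattern keeps every host whose name starts with mw2, which in this log appears to mean the codfw app servers. Below is a minimal Python sketch of the same prefix match applied to log records locally. fnmatch only approximates the query syntax Logstash/Kibana uses, and the sample records are invented for illustration.

```python
from fnmatch import fnmatchcase


def filter_by_host(records, pattern):
    """Return the records whose 'host' field matches a shell-style wildcard."""
    return [r for r in records if fnmatchcase(r.get("host", ""), pattern)]


# Made-up records using host names that appear in this log.
records = [
    {"host": "mw2220", "message": "example A"},
    {"host": "mw1407", "message": "example B"},
    {"host": "wtp2010", "message": "example C"},
    {"host": "mwdebug2001", "message": "example D"},
]

codfw_mw = filter_by_host(records, "mw2*")
print([r["host"] for r in codfw_mw])
```

Note that mwdebug2001 is not matched by mw2*, so debug hosts would need their own pattern.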
[14:39:26] RECOVERY - PHP opcache health on mw2144 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:39:38] I guess something like that works :D [14:39:41] nice [14:39:58] (03CR) 10Bstorm: "We should chat about this approach so I understand the reasoning. I have concerns about it. We might be better off with version pinning or" [puppet] - 10https://gerrit.wikimedia.org/r/593499 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [14:40:35] Krinkle: I don't see anything relevant [14:41:42] RECOVERY - PHP opcache health on mw2209 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:41:47] marostegui: https://logstash.wikimedia.org/goto/ae3b0ae0c6231c3a68a060590df655f1 [14:42:16] (03PS3) 10RLazarus: maintenance: Migrate purge_abusefilter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589369 (https://phabricator.wikimedia.org/T211250) [14:42:21] the only thing there is "Async set op failed" which is pre-existing [14:42:23] suspicious though [14:42:29] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate purge_abusefilter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589369 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [14:42:31] it means it was unable to write to one of the parser cache backends [14:42:39] is that normal? [14:42:46] presumably memc is writable but maybe pc is not? [14:43:10] I thought pc was active-active [14:43:20] Krinkle: No, parsercache is only writable in eqiad [14:43:34] marostegui: ah okay. [14:43:43] in theory, it should be active-active, I think we may disable it out of fear [14:43:48] But that error isn't related to this deployment [14:43:52] yeah [14:44:01] e.g. codfw testing breaks eqiad by replication [14:44:03] we have the bi-di replicatin in place but don't accept writes yet in codfw, is that right? 
[14:44:14] I am not sure, can be checked [14:44:24] anyway, good to know it's normal [14:44:27] Krinkle: Yes (although for maintenance we don't have codfw -> eqiad replication enabled) [14:44:29] But we could, yes [14:44:32] codfw is RO [14:44:46] so there you have it [14:44:52] I mean the pc hosts have some magic bi-di replication right? [14:44:56] not for other hosts of course [14:45:03] Krinkle: In general we only enable codfw -> eqiad replication when we are about to do a DC switchover [14:45:06] I think we could have it rw, as long as unidirectional replication is setup [14:45:08] RECOVERY - PHP opcache health on mw2176 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:45:12] RECOVERY - PHP opcache health on mw2327 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:45:18] RECOVERY - PHP opcache health on mw2221 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:45:22] ok, np. thught it was done already! [14:45:26] RECOVERY - PHP opcache health on wtp2012 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:45:43] marostegui: artificial lag? [14:46:18] Krinkle: In codfw? I can do that, but I doubt we'll see any errors, no? [14:46:24] (03PS2) 10RLazarus: maintenance: Migrate purged_expired_userrights to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589378 (https://phabricator.wikimedia.org/T211250) [14:46:28] RECOVERY - PHP opcache health on mw2177 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:46:33] marostegui: you wanted to do that somewhere, s4 I think? 
[14:46:41] let me know what you wnat me to test :) [14:46:46] RECOVERY - PHP opcache health on mw2262 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:46:48] Krinkle: es4, yeah, but I guess we need to deploy db-eqiad [14:46:48] (03CR) 10Bstorm: aptrepo: kubeadm-k8s: create versioned components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/593499 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [14:46:51] * Krinkle got a new chair this morning [14:46:54] XDD [14:46:58] RECOVERY - PHP opcache health on mw2178 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:46:59] * Krinkle feels like he is now in jynus 's chair. [14:47:02] RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:47:04] RECOVERY - PHP opcache health on wtp2003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:47:09] Krinkle: eh? [14:47:14] Krinkle: you are still missing the best camera ever, like he does [14:47:23] jynus: it reminds me of yours. I don't know many people with a good gaming-style chair like that. [14:47:28] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate purged_expired_userrights to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589378 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [14:47:30] Krinkle: anyways, let's go for db-eqiad.php and I can generate some seconds of lag in one es4 and see how it is handled? [14:47:59] I will send a patch for a proposal of pc2*, unrelated to what you are doing [14:48:04] marostegui: ah okay, Ithought you wanted to do that first. the patch is applied on mwdebug in eqiad . 
[14:48:06] RECOVERY - PHP opcache health on mw2196 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:48:14] Krinkle: Ah ok! [14:48:32] (03CR) 10Bstorm: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/593499 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [14:48:57] marostegui: i'm okay either way, I'll prepare a revert in that case locally just in case [14:48:57] Krinkle: Then let me do that, but it would need to be the coincidence that you are browsing and using that special slave, so not sure how easy that whole coincidence can happen :) [14:49:03] ok, I'll wait. [14:49:16] ah good point. [14:50:04] RECOVERY - PHP opcache health on mw2319 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:50:30] RECOVERY - PHP opcache health on mw2171 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:51:10] Krinkle: Let's deploy db-eqiad.php and prepare the revert quickly just in case [14:51:14] RECOVERY - PHP opcache health on mw2202 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:52:22] RECOVERY - PHP opcache health on mw2355 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:52:40] RECOVERY - PHP opcache health on mw2210 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:53:08] RECOVERY - PHP opcache health on mw2369 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:53:10] syncing.. 
[14:53:25] ok [14:53:41] !log krinkle@deploy1001 Synchronized wmf-config/db-eqiad.php: I46d2b811f6287689 (duration: 00m 57s) [14:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:51] checking errors [14:54:00] RECOVERY - PHP opcache health on wtp2009 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:54:57] [efc439ed-d1a8-4f60-9d13-53a17610a91d] /w/api.php MediaWiki\Storage\BlobAccessException from line 261 of /srv/mediawiki/php-1.35.0-wmf.30/includes/Storage/SqlBlobStore.php: The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. [14:55:16] marostegui: expected? [14:55:25] Not really, let me check [14:55:48] This is my view: https://logstash.wikimedia.org/goto/9bc5139e1ec1fd51b3f735a9654ac701 [14:55:54] (03PS1) 10Jcrespo: mariadb: enable read_only monitoring in parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) [14:56:25] I am checking fatal dashboard just in case [14:56:59] No active GTIDs in 171978862-171978862-49423052 share a domain with those in es1023-bin.189/821373123 [14:57:03] This type of message seems new [14:57:19] but if it recovers and is from the race condiino then that should recover within a minute or so [14:57:27] Looks like those messages I pasted are new too: https://logstash.wikimedia.org/goto/df184d10ef8c9719d4a0d227eb9f5424 [14:57:50] RECOVERY - PHP opcache health on mw2309 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:58:02] RECOVERY - PHP opcache health on mw2305 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:58:32] I can edit just fine [14:58:33] marostegui: you lagged a server though,right? 
[14:58:38] not yet [14:58:55] I did a few minutes ago, but not since we deployed [14:59:15] the "No Active GTIDs" message was non-fatal and has disappeared since [14:59:18] so all good there [14:59:19] This seems stable https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&refresh=5m&from=now-3h&to=now [14:59:51] The "database is locked" error is also gone [14:59:52] querying "The database is currently locked to new entries and other modifications" on mediawiki-errors shows it happened 30 times 5min ago only. [14:59:58] RECOVERY - PHP opcache health on wtp2015 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:00:00] may've been random lag [15:00:01] yep, it is gone [15:00:03] (03CR) 10Jcrespo: [C: 04-1] "I am not too happy about this, but I think monitoring should be in place, in some way or form (not sure about paging and status of codfw)." [puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [15:00:08] possibly fake lag per known issue [15:00:38] marostegui: checking app server latencies as well [15:00:50] ok, once we're happy, I will lag the server to see how it goes [15:01:05] (03CR) 10Jcrespo: [C: 04-1] "Also what happens with standby hosts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [15:01:35] (03CR) 10RLazarus: httpbb: add tests for miscweb sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592883 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [15:02:50] Krinkle: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-role=All there is an strange increase on reads there [15:02:54] (03PS3) 10Elukey: profile::statistics::explorer::misc_jobs: get more jobs from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/593521 (https://phabricator.wikimedia.org/T249754) [15:03:12] RECOVERY - PHP opcache health on mw2143 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:03:35] Krinkle: wow https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=es1021&var-port=9104 [15:04:00] kinda matches the deployment time [15:04:11] could be non-worrying metadata? [15:04:12] marostegui: is the new method itself using rows to communicate? [15:04:15] could be [15:04:19] I don't know how it works exactly [15:04:27] but pt-hearbeat is a table and the seconds thing, I guess not. [15:04:33] so read_rnd_next can be full table scans [15:04:40] but if the table has 1 row is non-issue [15:04:42] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10Papaul) I think it is better to do it when the new msw1 is in place. No need to do it now on the old msw1-eqiad [15:05:00] https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=es1021&var-port=9104 [15:05:03] Weird [15:05:08] do we hae select queries separately from rows read? 
[15:05:10] let me check sys [15:05:25] yep https://grafana.wikimedia.org/d/000000273/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=es1021&var-port=9104 [15:05:27] that hasn't changed [15:05:43] temp tables.. [15:05:48] is that expected to change? [15:05:54] I am checking the sys db [15:06:19] (03CR) 10Elukey: "http://puppet-compiler.wmflabs.org/22234/" [puppet] - 10https://gerrit.wikimedia.org/r/593521 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [15:06:47] https://phabricator.wikimedia.org/P11103 [15:07:22] marostegui: what does that metric mean? rows_full_scanned [15:07:43] events in last time that a query caused a full table scan? [15:08:11] yeah, so it is pt-heartbeat, no worries then [15:08:18] Krinkle: the number of rows that were scanned by full scans [15:08:18] (03PS2) 10RLazarus: maintenance: Migrate purge_old_cx_drafts to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589379 (https://phabricator.wikimedia.org/T211250) [15:08:21] if that makes any sense :) [15:08:32] but yeah, it looks like it just heartbeat [15:08:46] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate purge_old_cx_drafts to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589379 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:08:52] I think we don't notice on the other servers because the activity here is much smaller [15:09:07] Krinkle: so, now that the dust has settled, let's create some lag to see what happens [15:09:24] We better create some controlled lag than finding out in the middle of the night [15:09:26] marostegui: ah total number of rows read by all table scan queries together, not for 1 or avg. [15:09:35] Krinkle: yep [15:09:42] hm.. that seens strangely non-useful? but maybe that's dev perspective only [15:09:47] ha ha [15:09:57] Krinkle: it is perfect to blame devs! [15:09:58] I'd want to know how much 1 table scan had to do. 
[15:10:01] :-) [15:10:15] anyway, glad its okay [15:10:18] that's also available [15:10:28] ok, let me create some lag then? [15:10:40] could temp tables be avoided if we change how the pt heartbeat query works? E.g. is it causing some kind of trx that we can skip? [15:10:42] sure [15:11:01] or is temp tables result of table scan [15:11:24] RECOVERY - PHP opcache health on mw2303 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:11:39] !log Create lag on es1021 [15:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:51] Krinkle: I wouldn't worry- the metrics have to be taken in perspective of the table size [15:11:54] RECOVERY - PHP opcache health on mw2331 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:12:00] this is a 4-row table [15:12:00] (03CR) 10Muehlenhoff: "@Gehel: This puppetised java.security is applied via the java::security puppet class, which only gets applied to Hadoop, Kafka (main/Jumbo" [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [15:12:04] RECOVERY - PHP opcache health on mw2222 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:13:09] (03CR) 10Gehel: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [15:13:17] jynus: ok :) - I can imagine maybe it would be bad for the server to be creating millinos of temp tables as an extra duty now even if they are small. 
but if that's cheap then no problem [15:13:27] Krinkle: I think we should plot the size of those rather than the number of events [15:13:45] Krinkle: not if they are on memory and 200-byte sized [15:13:47] (03PS2) 10RLazarus: maintenance: Migrate purge_securepoll to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589384 (https://phabricator.wikimedia.org/T211250) [15:13:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. One thing to consider; some of these new features are only introduced in u252, so we'll first need to roll out the OpenJ" [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [15:14:08] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate purge_securepoll to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589384 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:14:25] or let me rephrase, it wouldn't be clear than an index would have practical advantage [15:14:49] jynus: right, and there isn't an arbitrary limit that says if there are more than N temp tables in memory, then crash the server or start doing other weird stuff [15:15:05] (like php would) [15:15:06] it all about the size [15:15:17] 1TB memory table bad :-D [15:15:23] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [15:15:55] malloc significant but only when on MB sizes [15:15:57] Krinkle: I have created a series of 15 seconds lag, so far so good, no? 
[15:16:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:16:10] RECOVERY - PHP opcache health on wtp2019 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:16:19] (03CR) 10Ayounsi: [C: 03+1] Upstream release v0.2.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/593514 (owner: 10Volans) [15:16:43] marostegui: which section? [15:16:47] Krinkle: es4 [15:16:57] funny to see this on the error logs /w/index.php?title=Anexo:Cronolog%C3%ADa_de_la_pandemia_de_enfermedad_por_coronavirus_de_2019-2020&action=submit [15:17:24] Krinkle: es4, es1021 to be more precise [15:17:29] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v0.2.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/593514 (owner: 10Volans) [15:18:13] (03PS2) 10Jbond: java: update java.security [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) [15:18:23] RECOVERY - PHP opcache health on mw2326 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:18:50] (03CR) 10Jbond: java: update java.security (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [15:20:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:36] Krinkle: is the scap lock still needed/ [15:20:38] ? [15:21:25] volans: clear now [15:22:00] thanks, I guess that we don't have a way to lock only mw-related repos :D [15:22:32] Krinkle: do you want me to generate any lag on a core section (sX) although this patch shouldn't change the behaviour on those, no? 
[15:24:09] marostegui: Yeah, I think we're good there. [15:24:21] So I think we are good then [15:24:28] marostegui: how long did the lag exist roughly? [15:24:43] I generated 4 series of 15 seconds [15:24:48] but it probably lasted very short [15:24:54] although replication was stopped for 15 seconds [15:25:19] es doesn't have many writes, so lag recovers super fast on these new hosts [15:26:12] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:26:41] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) 05Open→03Stalled Stalling the task until we either: * can start doing more intrusive testing to see if it works as expected * msw1-eqiad is replaced with T225121 [15:26:42] RECOVERY - PHP opcache health on mw2367 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:27:19] (03PS1) 10JMeybohm: package-builder: Use GBP_PBUILDER_ variables [puppet] - 10https://gerrit.wikimedia.org/r/593542 (https://phabricator.wikimedia.org/T233020) [15:27:42] !log volans@deploy1001 Started deploy [homer/deploy@56506db]: Release v0.2.1 [15:27:46] marostegui: app server latency spikd very briefly after the sync, but all back to normal [15:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:57] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1588256983499&to=1588260143211 [15:28:03] !log volans@deploy1001 Finished deploy [homer/deploy@56506db]: Release v0.2.1 (duration: 00m 21s) [15:28:03] so, I guess we're done? 
[15:28:05] could be the same errors we saw about blocked DB maybe [15:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:10] Krinkle: I believe so yeah [15:28:15] Edit rate is also stable [15:28:22] (03PS1) 10Volans: netbox: add support for RW and RO tokens [software/spicerack] - 10https://gerrit.wikimedia.org/r/593543 [15:28:24] (03PS1) 10Volans: netbox: expose the pynetbox API object [software/spicerack] - 10https://gerrit.wikimedia.org/r/593544 [15:28:48] RECOVERY - PHP opcache health on wtp2011 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:29:10] RECOVERY - PHP opcache health on wtp2005 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:29:39] marostegui: the fatal monitor dashboard btw, I'm not sure it works correctly anymore. [15:29:45] merged into https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors a while ago [15:29:53] oh [15:30:19] it is odd though that it has so few messages, it used to work fine [15:30:23] I'll just delete it then [15:30:32] better, yeah [15:30:33] thank you [15:30:38] I am going to call it a day then [15:31:18] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22712 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:31:38] yeah, please, delete it or at least put a deprecation text on top [15:31:44] I was still using it [15:32:19] same :( [15:32:25] Anyways, changed my bookmarks now [15:32:39] Going offline! Thanks for a smooth deploy Krinkle! 
[15:32:54] there is one related thing I want to ask Krinkle [15:32:56] RECOVERY - PHP opcache health on mw2361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:33:00] RECOVERY - PHP opcache health on wtp2010 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:33:02] if I can [15:33:02] * Krinkle left a redirect notice on the board for now [15:33:05] sure [15:33:06] RECOVERY - PHP opcache health on wtp2014 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:33:09] 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Inconsistent caching/staleness of mobile-html responses for certain articles - https://phabricator.wikimedia.org/T249770 (10Pchelolo) [15:33:18] I know all these are warnings [15:33:22] RECOVERY - PHP opcache health on wtp2018 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:33:25] but feel like a lot [15:33:44] https://logstash.wikimedia.org/app/kibana#/dashboard/DBReplication [15:33:59] I may have asked you this before [15:34:07] so sorry if that is the case [15:34:17] (03CR) 10Elukey: [C: 03+2] profile::statistics::explorer::misc_jobs: get more jobs from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/593521 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [15:34:24] they seem to be coming mostly from the jobrunner [15:34:30] RECOVERY - PHP opcache health on wtp2020 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:34:57] but they are not "spikes", but a constant rate [15:35:12] I would expect spikes of lag, but not a constant rate [15:35:18] "waitForMasterPos: timed out waiting on" right? 
[15:35:22] yes [15:35:45] [0.001667s] [15:35:57] That's quite a low amount to send a warning for.. [15:36:01] so timed out after microseconds? [15:36:06] *milliseconds [15:36:16] I have no idea what this warning means or when it is logged exactly [15:36:24] ok then [15:36:33] RECOVERY - PHP opcache health on mw2137 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:36:44] maybe it was a "yeah, we enabled on purpose extra logging for X reasons" [15:36:44] I suspect maybe there is some conditional block and that sometimes it does not need to wait [15:36:46] RECOVERY - PHP opcache health on wtp2007 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:36:52] but maybe the log message is sent after that condition block [15:36:54] if you don't know that is ok [15:36:56] RECOVERY - PHP opcache health on wtp2008 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:37:00] yeah [15:37:02] RECOVERY - PHP opcache health on wtp2006 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:37:03] I will keep digging [15:37:21] and run into again today [15:37:23] In general I ignore all DB* channels unless it becomes an exception or higher-level timeout. [15:37:35] yeah, I do too [15:37:46] but yeah noisy is usually reason to be suspect [15:37:51] maybe file a task :) [15:37:59] yeah, I may [15:38:04] probably a good one to play with for on-boarding CPT with rdbms further. 
[15:38:16] RECOVERY - PHP opcache health on wtp2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:38:20] at least to understand why it is logged and if it works as intended [15:38:22] only asked in case it was a "yeah, I know exactly what it is, don't worry" [15:38:28] :-D [15:39:04] RECOVERY - PHP opcache health on wtp2016 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:42:51] 10Operations, 10netops: Peer with SFMIX at ulsfo - https://phabricator.wikimedia.org/T251536 (10faidon) p:05Triage→03Medium [15:43:54] RECOVERY - PHP opcache health on wtp2004 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:02] RECOVERY - PHP opcache health on wtp2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:18] RECOVERY - PHP opcache health on wtp2017 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:50:11] 10Operations, 10Traffic: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 (10ema) [15:50:25] 10Operations, 10Traffic: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 (10ema) [15:58:41] 10Operations, 10Discovery-Search: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540 (10MoritzMuehlenhoff) [15:59:06] (03CR) 10Muehlenhoff: "Created https://phabricator.wikimedia.org/T251540 for java::security on the Elastic clusters" [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [16:00:04] godog and _joe_: (Dis)respected human, 
time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:27] 10Operations, 10netops: Peer with SFMIX at ulsfo - https://phabricator.wikimedia.org/T251536 (10faidon) I just submitted their form. [16:00:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [16:01:13] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Joe) After a discussion on the patch, it was clearer to me that some information can't be removed from the message, and that makes `resource_change` the perfect f... [16:01:21] RECOVERY - PHP opcache health on wtp2013 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:46] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) [16:09:43] 10Operations, 10Core Platform Team, 10Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10Krinkle) [16:12:25] 10Operations, 10Core Platform Team, 10Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10Krinkle) [16:12:56] 10Operations, 10Core Platform Team, 10Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10Krinkle) > The kafka topic mediawiki.job.cdnPurge is currently receiving many (most?) purge messages. Maybe most by volume, but it's semantically very diferrent and a rather internal... 
[16:15:51] 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Inconsistent caching/staleness of mobile-html responses for certain articles - https://phabricator.wikimedia.org/T249770 (10ema) The specific issues described in this ticket sho... [16:19:27] (03CR) 10Elukey: [C: 03+1] "The big map with default settings was not appealing at first, but I see how slick it becomes when configuring multiple instances of camus," [puppet] - 10https://gerrit.wikimedia.org/r/593288 (owner: 10Ottomata) [16:20:59] I'm signing off, train has reached all sites, and so far everything looks OK; brennen will roll back while I'm asleep [16:22:22] hopefully IFF needed. :) [16:24:12] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:24:29] (03PS1) 10Hnowlan: changeprop: allow additional configuration of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/593559 (https://phabricator.wikimedia.org/T251176) [16:27:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:54] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:28:23] (03CR) 10Ppchelko: [C: 03+2] changeprop: allow additional configuration of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/593559 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [16:28:40] 
(03Merged) 10jenkins-bot: changeprop: allow additional configuration of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/593559 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [16:28:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:29:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:35:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:40:39] i'm now thinking i should likely roll back for T251457 unless this is purely an artifact of database timeouts that are happening otherwise and not its own distinct issue. advice sought. [16:40:40] T251457: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 [16:47:19] (03CR) 10Ladsgroup: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592761 (owner: 10Ladsgroup) [16:48:59] (03CR) 10Ladsgroup: "If there's no objection by Monday, I'll deploy this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/592761 (owner: 10Ladsgroup) [16:55:36] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7777 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:59:42] (03PS5) 10Ottomata: refine.pp - Slight refactor to use new unified refine tranform functions [puppet] - 10https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) [16:59:59] (03CR) 10CDanis: [C: 03+1] "+1 as long as is rolled out carefully on multi-REs" [homer/public] - 10https://gerrit.wikimedia.org/r/592920 (https://phabricator.wikimedia.org/T247073) (owner: 10Ayounsi) [17:00:04] halfak and accraze: How many deployers does it take to do Services – Graphoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1700). [17:01:33] (03CR) 10Ottomata: [V: 03+2 C: 03+2] refine.pp - Slight refactor to use new unified refine tranform functions [puppet] - 10https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:02:41] (03CR) 10Reedy: "Might make sense to wait till .31 starts next week, as otherwise if we have a rollback to .29, this needs reverting too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/590437 (owner: 10Ppchelko) [17:15:48] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:15:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:15:48] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:16:06] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [17:19:16] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP CDanis Telia ticket# 01154507 - The acknowledgement expires at: 2020-05-01 18:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:19:16] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia ticket# 01154507 - The acknowledgement expires at: 2020-05-01 18:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:16] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia ticket# 01154507 - The acknowledgement expires at: 2020-05-01 18:00:00. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:21] jouncebot: now [17:19:21] For the next 0 hour(s) and 40 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1700) [17:19:24] jouncebot: next [17:19:24] In 0 hour(s) and 40 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1800) [17:19:29] (03PS4) 10Reedy: inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592964 (owner: 10Seddon) [17:21:50] (03CR) 10Reedy: [C: 03+2] inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592964 (owner: 10Seddon) [17:22:05] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:22:49] (03Merged) 10jenkins-bot: inline comment update and fix to allow beta graphs to use prod files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592964 (owner: 10Seddon) [17:24:40] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: labs only (duration: 00m 58s) [17:24:41] (03PS2) 10Bartosz Dziewoński: Load DiscussionTools on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592630 (https://phabricator.wikimedia.org/T249376) (owner: 10Esanders) [17:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:21] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [17:26:00] (03CR) 10Volans: "updates inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 
(https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [17:26:25] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [17:26:27] We're going to revert the train. [17:26:33] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 572 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:26:46] James_F: yeah, editpage, or maybe something in Wikibase [17:26:53] * James_F nods. [17:27:03] ^ cc: marostegui [17:27:24] Maybe RevisionRecord has a really expensive edge case where it recurses the constructor/sets up services again or something? [17:27:35] brennen: +1 to revert [17:28:32] reverting momentarily. [17:29:49] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:30:03] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:31:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:37:41] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 and group2 wikis to 1.35.0-wmf.28" [17:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:48] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) [17:37:53] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds 
to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22711 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:39:17] fun: [17:39:22] Traceback (most recent call last): [17:39:24] File "/usr/lib/python2.7/multiprocessing/queues.py", line 268, in _feed [17:39:26] send(obj) [17:39:28] IOError: [Errno 32] Broken pipe [17:39:35] ^ from scap. [17:40:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1091', diff saved to https://phabricator.wikimedia.org/P11104 and previous config saved to /var/cache/conftool/dbconfig/20200430-174057-marostegui.json [17:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:45] 10Operations, 10Parsoid, 10RESTBase, 10Traffic, and 2 others: HTTP 400 Error when trying to save an edit on English Wikipedia: Error contacting the Parsoid/RESTBase server - https://phabricator.wikimedia.org/T250815 (10matmarex) [17:43:59] (03PS7) 10Cwhite: smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [17:44:03] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 47 probes of 638 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:46:15] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [17:46:56] brennen: but it still worked as expected (ie: we reverted)? 
[17:48:15] (03CR) 10Cwhite: [C: 03+2] smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:48:37] (03CR) 10Thiemo Kreuz (WMDE): "Uh, fascinating: https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/mediawiki-config+wgmemoryLimit. The limit was raised " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592761 (owner: 10Ladsgroup) [17:48:52] greg-g: yeah, as far as i can tell. it reached `rebuilt and synchronized wikiversions files: Revert "group1 and group2 wikis to 1.35.0-wmf.28"`, and https://tools.wmflabs.org/versions/ checks out. [17:49:05] * greg-g nods [17:49:07] (03PS4) 10Cwhite: smart: move metrics registry and metrics init to global [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) [17:49:49] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 30 probes of 638 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:49:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:50:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:51:10] (03CR) 10Kaldari: [C: 03+1] Load DiscussionTools on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592630 (https://phabricator.wikimedia.org/T249376) (owner: 10Esanders) [17:53:37] (03PS1) 10Ottomata: Fix refine_event table_blacklist_regex and remove absented mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/593573 (https://phabricator.wikimedia.org/T238230) 
[17:56:54] (03CR) 10jerkins-bot: [V: 04-1] Fix refine_event table_blacklist_regex and remove absented mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/593573 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:57:43] (03PS2) 10Ottomata: Refine - fix table_blacklist_regex and remove mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/593573 (https://phabricator.wikimedia.org/T238230) [17:59:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Refine - fix table_blacklist_regex and remove mediawiki_events refine job [puppet] - 10https://gerrit.wikimedia.org/r/593573 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:59:39] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1800). [18:00:04] kaldari, qedk, and MatmaRex: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] I'm here! [18:02:16] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10RLazarus) [18:03:00] hi [18:05:07] FYI, SWAT window folks, we have currently rolled wmf.30 back to group0. [18:06:01] Ah [18:06:18] That explains a lot :) [18:06:41] see T251457 [18:06:42] T251457: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 [18:07:13] hmm. 
i think i should remove my backport then [18:07:29] it depends on a patch that is in wmf.30 [18:08:31] (03CR) 10Bartosz Dziewoński: [C: 04-1] "wmf.30 was rolled back" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592630 (https://phabricator.wikimedia.org/T249376) (owner: 10Esanders) [18:08:41] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 22689 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:10:17] brennen: My SWAT patch is dependent on wmf.30, so no deployment for me for now. [18:12:09] kk [18:16:00] Is the swat window closed brennen ? [18:16:12] i have a patch relating to icons that I'd love to throw in now if that's possible [18:16:20] if not i can wait till PM today [18:18:23] 10Operations, 10WMF-JobQueue, 10Sustainability: Upgrade jobrunners to redis 2.8 - https://phabricator.wikimedia.org/T97909 (10Krinkle) [18:18:48] 10Operations, 10WMF-JobQueue, 10Sustainability: Upgrade jobrunners to redis 2.8 - https://phabricator.wikimedia.org/T97909 (10Krinkle) 05Open→03Declined I'm assuming this is obsolete given there are no longer JobQueue redis instances. [18:19:21] Jdlrobson: production is currently clear - i'm just awaiting a fix for a known issue before we can roll forward again - so as long as you're good with both .28 and .30 still being running, i don't see any reason you can't go for it. [18:21:04] perfect okay ill update wikitech [18:22:26] brennen: hm.. it is not committed to gerrit/ [18:22:55] Krinkle: gah - my bad, pushing commit now [18:23:19] got distracted by scap breakage. 
[18:24:02] (03PS1) 10Brennen Bearnes: Group0 only to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593578 (https://phabricator.wikimedia.org/T251457) [18:24:04] (03CR) 10Brennen Bearnes: [C: 03+2] Group0 only to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593578 (https://phabricator.wikimedia.org/T251457) (owner: 10Brennen Bearnes) [18:24:52] (03Merged) 10jenkins-bot: Group0 only to 1.35.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593578 (https://phabricator.wikimedia.org/T251457) (owner: 10Brennen Bearnes) [18:25:07] (03PS1) 10Jdlrobson: Logo wordmarks should not define fill color - opacity will be used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593579 (https://phabricator.wikimedia.org/T251135) [18:25:23] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/593579 Logo wordmarks should not define fill color - opacity will be used < brennen that's the change [18:25:23] 10Operations, 10Wikimedia-Mailing-lists: The Wikiml-l not archiving mail from August 2019 - https://phabricator.wikimedia.org/T251554 (10jayantanth) [18:28:50] 10Operations, 10Wikimedia-Mailing-lists: The Wikiml-l is not archiving mail from August 2019 - https://phabricator.wikimedia.org/T251554 (10jayantanth) [18:29:08] 10Operations, 10Analytics, 10Traffic: Remove North Korea from data quality traffic entropy reports - https://phabricator.wikimedia.org/T251546 (10Nuria) [18:30:02] 10Operations, 10Analytics, 10Traffic: Remove North Korea from data quality traffic entropy reports - https://phabricator.wikimedia.org/T251546 (10Nuria) I would remove it from daily/hourly jobs both: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/data_quality_stats/hourly/queries/traffic_e... [18:34:09] Jdlrobson: apologies, i haven't actually got any experience SWATing to static/ prior to now, slightly unclear on procedure. [18:34:46] just +2 and scap pull on an mwdebug, i suppose?
[18:39:23] 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Inconsistent caching/staleness of mobile-html responses for certain articles - https://phabricator.wikimedia.org/T249770 (10Pchelolo) 05Open→03Resolved Seems like all the my... [18:42:37] oh don't worry then brennen - i'll wait till later today, unless MatmaRex is able to help? [18:42:52] (i don't have deploy rights) [18:43:11] sorry, no idea how that's done, and i don't have deploy rights either [18:43:25] thanks! maybe Krinkle ? :) [18:43:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: Remove North Korea from data quality traffic entropy reports - https://phabricator.wikimedia.org/T251546 (10mforns) a:03mforns [18:45:30] (03PS3) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) [18:46:27] k rescheduled for 4pm :) [18:46:28] (03CR) 10jerkins-bot: [V: 04-1] smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [18:46:46] Jdlrobson: ack, thanks.
[18:48:31] (03PS2) 10RLazarus: maintenance: Migrate db_lag_stats_reporter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589672 (https://phabricator.wikimedia.org/T211250) [18:48:35] (03PS4) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) [18:48:41] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate db_lag_stats_reporter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589672 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [18:51:35] (03PS3) 10Jdlrobson: Add project taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591160 (https://phabricator.wikimedia.org/T249047) [18:51:39] brennen: For the future, you sync static and then manually purge the files from Varnish/ATS. [18:52:10] brennen: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Purging [18:52:26] James_F: thx [18:52:53] But technically SWATs/etc. are off until the train is fixed, per policy. :-( [18:53:01] (03PS3) 10RLazarus: maintenance: Migrate cirrussearch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589680 (https://phabricator.wikimedia.org/T211250) [18:56:28] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate cirrussearch to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589680 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [18:56:51] (03PS2) 10RLazarus: maintenance: Migrate generatecaptcha to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589688 (https://phabricator.wikimedia.org/T211250) [19:00:04] liw and brennen: Dear deployers, time to do the Mediawiki train - European+American Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T1900). [19:00:23] * brennen looks at jouncebot like that. [19:00:45] * James_F grins. 
[19:02:16] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate generatecaptcha to periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589688 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:02:32] (03PS2) 10RLazarus: maintenance: Migrate pageassessments to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589695 (https://phabricator.wikimedia.org/T211250) [19:03:56] (03PS1) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:06:18] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate pageassessments to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589695 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:06:40] (03PS2) 10RLazarus: maintenance: Migrate readinglists to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589706 (https://phabricator.wikimedia.org/T211250) [19:07:05] (03CR) 10jerkins-bot: [V: 04-1] Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:07:27] (03PS2) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:09:20] (03PS3) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:10:27] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate readinglists to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589706 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [19:10:54] (03CR) 10Reedy: [C: 04-1] "As has now happened! 
:(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/590437 (owner: 10Ppchelko) [19:12:18] (03PS4) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:15:33] (03PS5) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:19:09] (03PS6) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:20:56] (03PS7) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:24:01] (03PS8) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:25:09] !log test mtail rc35 upgrade on fermium - T251466 [19:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:15] T251466: Upgrade mtail to 3.0.0-rc35 - https://phabricator.wikimedia.org/T251466 [19:26:17] (03PS1) 10Cmjohnson: Adding production dns (ipv4 & ipv6) backup1002 [dns] - 10https://gerrit.wikimedia.org/r/593596 (https://phabricator.wikimedia.org/T250816) [19:27:36] (03CR) 10jerkins-bot: [V: 04-1] Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:33:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10Cmjohnson) [19:34:06] 10Operations, 10SRE-Access-Requests: Revoke production access for jmorgan - https://phabricator.wikimedia.org/T251560 (10leila) [19:34:50] (03PS9) 10Ottomata: Factor 
out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:38:21] (03CR) 10jerkins-bot: [V: 04-1] Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:40:11] (03PS10) 10Ottomata: Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) [19:42:57] !log reboot cloudvirt1024 for NIC firmware updates T241884 [19:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:05] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 [19:45:11] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/22243/an-launcher1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:45:23] (03CR) 10Ottomata: [C: 03+2] Factor out RefineFailuresChecker into the refine_job define [puppet] - 10https://gerrit.wikimedia.org/r/593594 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:52:30] (03PS7) 10Ottomata: Refactor and DRY camus module templates [puppet] - 10https://gerrit.wikimedia.org/r/593288 [19:52:57] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Refactor and DRY camus module templates [puppet] - 10https://gerrit.wikimedia.org/r/593288 (owner: 10Ottomata) [19:56:23] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T251563 (10ops-monitoring-bot) [19:59:28] (03PS1) 10Ottomata: Remove absented failed_flags_ refine::jobs [puppet] - 10https://gerrit.wikimedia.org/r/593605 (https://phabricator.wikimedia.org/T238230) [19:59:45] (03CR) 10RLazarus: [C: 03+2] "> if this looks good to you, I'll test it by renewing the certs again before merging" [puppet] - 
10https://gerrit.wikimedia.org/r/589076 (owner: 10RLazarus) [20:03:08] (03CR) 10Ottomata: [C: 03+2] Remove absented failed_flags_ refine::jobs [puppet] - 10https://gerrit.wikimedia.org/r/593605 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:03:37] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T251563 (10Peachey88) [20:04:49] !log Disabling puppet on all mcrouter hosts for cert renewal. This isn't strictly needed, as the certs from last time are still fine -- just testing the renewal script. [20:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:25] (03PS1) 10Cmjohnson: Adding backup1002 mac to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/593606 (https://phabricator.wikimedia.org/T250816) [20:05:48] !log cloudvirt1024 upgrade iDRAC firmware from 2.4.8 to 2.5.4 T241884 [20:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:54] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 [20:06:20] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns (ipv4 & ipv6) backup1002 [dns] - 10https://gerrit.wikimedia.org/r/593596 (https://phabricator.wikimedia.org/T250816) (owner: 10Cmjohnson) [20:08:00] (03CR) 10Cmjohnson: [C: 03+2] Adding backup1002 mac to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/593606 (https://phabricator.wikimedia.org/T250816) (owner: 10Cmjohnson) [20:08:54] (03PS2) 10Cwhite: aptrepo: add mtail component for controlled mtail upgrade [puppet] - 10https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) [20:10:25] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10Cmjohnson) a:05Cmjohnson→03jcrespo @jcrespo This server is just about ready for install, Can you do the raid cfg and update netboot.cfg. 
After that, it's ready for inst... [20:10:36] !log mcrouter certs re-renewed on puppetmaster1001, puppet enabled on mcrouter hosts [20:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:15] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) Used the BIOS versions in that last log message, the correct iDRAC versions and log output are below ` # bash ./iDRAC-with-Lifecycle-Contr... [20:13:04] !log test mtail rc35 upgrade on logstash1007 - T251466 [20:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:09] T251466: Upgrade mtail to 3.0.0-rc35 - https://phabricator.wikimedia.org/T251466 [20:17:07] (03CR) 10Cwhite: "The package now installs on jessie, stretch, and buster. Thanks Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [20:17:33] (03PS2) 10Cwhite: mtail: add flag to install mtail apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) [20:19:40] 10Operations, 10ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10Cmjohnson) [20:19:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [20:20:06] ema: looks like solved already, but fwiw https://grafana.wikimedia.org/d/000000066/resourceloader?panelId=46&fullscreen&orgId=1&from=now-30d&to=now [20:20:12] expected spike after major deploy [20:20:42] bit stronger than usual, keeping an eye on it [20:22:44] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) [20:22:57] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 
(10Papaul) p:05Triage→03Medium [20:29:22] (03CR) 10RLazarus: "> Patch Set 5: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/589076 (owner: 10RLazarus) [20:30:53] (03PS1) 10Ottomata: [WIP] Add eventlogging_legacy job to refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [20:31:05] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add eventlogging_legacy job to refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [20:36:11] (03PS3) 10Cwhite: mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) [20:36:50] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [20:37:07] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) p:05Triage→03High [20:39:41] (03CR) 10jerkins-bot: [V: 04-1] mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [20:43:47] (03PS4) 10Cwhite: mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) [20:44:57] (03CR) 10Muehlenhoff: mtail: add flag to install mtail from apt component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [20:47:01] (03CR) 10jerkins-bot: [V: 04-1] mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 
(https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [20:53:23] (03PS1) 10Ottomata: camus::job - Allow for false when setting boolean parameters [puppet] - 10https://gerrit.wikimedia.org/r/593612 [20:55:43] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22244/an-launcher1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/593612 (owner: 10Ottomata) [20:55:53] (03PS5) 10Cwhite: mtail: add flag to install mtail from apt component [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) [20:57:17] (03PS1) 10Cmjohnson: Adding netboot.cfg and dhcpd for cloudelastic100[56] [puppet] - 10https://gerrit.wikimedia.org/r/593613 (https://phabricator.wikimedia.org/T249062) [20:58:06] (03PS2) 10Cmjohnson: Adding netboot.cfg and dhcpd for cloudelastic100[56] [puppet] - 10https://gerrit.wikimedia.org/r/593613 (https://phabricator.wikimedia.org/T249062) [21:02:53] (03CR) 10Cwhite: mtail: add flag to install mtail from apt component (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593327 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:04:26] (03CR) 10Urbanecm: "This was approved by Legal, removing my -2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) (owner: 10Urbanecm) [21:05:19] (03PS2) 10Urbanecm: Assign oathauth-verify-user to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) [21:05:26] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:07:02] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:07:08] RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 200 OK - 76087 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:12:09] (03CR) 10DannyS712: "LGTM, just want to double check that stewards are being told about this and told to be careful?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) (owner: 10Urbanecm) [21:13:14] (03CR) 10Urbanecm: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) (owner: 10Urbanecm) [21:13:41] (03CR) 10DannyS712: [C: 03+1] "> > Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) (owner: 10Urbanecm) [21:16:04] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22722 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:18:48] PROBLEM - PHP7 rendering on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:20:32] RECOVERY - PHP7 rendering on mw1352 is OK: HTTP OK: HTTP/1.1 200 OK - 76087 bytes in 0.574 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:34:25] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10Andrew) [21:41:08] PROBLEM - PHP opcache health on mw2145 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:44:34] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Andrew) a:05Papaul→03Andrew [21:45:39] 10Operations, 10ops-codfw, 10DC-Ops: host rename: labtestservices2003.wikimedia.org -> cloudservices2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T251576 (10Andrew) [21:49:49] (03PS1) 10Andrew Bogott: pdns: update mysql-client package name for Buster [puppet] - 
10https://gerrit.wikimedia.org/r/593621 (https://phabricator.wikimedia.org/T251294) [21:49:51] (03PS1) 10Andrew Bogott: Initial role assignment for cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/593622 (https://phabricator.wikimedia.org/T251294) [21:53:31] (03CR) 10Andrew Bogott: [C: 03+2] pdns: update mysql-client package name for Buster [puppet] - 10https://gerrit.wikimedia.org/r/593621 (https://phabricator.wikimedia.org/T251294) (owner: 10Andrew Bogott) [21:53:53] (03CR) 10Andrew Bogott: [C: 03+2] Initial role assignment for cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/593622 (https://phabricator.wikimedia.org/T251294) (owner: 10Andrew Bogott) [21:59:26] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban), 10Upstream: Jenkins job builder ignores BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) [22:01:18] RECOVERY - PHP opcache health on mw2145 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:02:38] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) I'm unable to upgrade the SATA because of the failed drive state: `name=ERROR Serial ATA firmware # bash ./Serial-ATA_Firmware_V141M_LN_DL... 
[22:03:02] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T251579 (10ops-monitoring-bot) [22:04:30] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:06:12] RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 200 OK - 76085 bytes in 0.592 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:08:12] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:11:26] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) I'd also like to point out that we have another system purchased in the same batch T192119, and 6 more with the same configuration T201352... [22:11:48] RECOVERY - PHP7 rendering on mw1355 is OK: HTTP OK: HTTP/1.1 200 OK - 76053 bytes in 0.381 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:12:20] PROBLEM - SSH on db2082.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:13:28] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5310 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:16:20] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) I've cleared the foreign configuration on drives 4 and 9 again, once the rebuild completes I'll attempt the SATA firmware and system BIOS u... 
[22:17:21] (03PS7) 10BryanDavis: Replace pykube with a custom API client [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) [22:18:20] (03PS1) 10Andrew Bogott: keystone: make the max_active_keys a bit smarter [puppet] - 10https://gerrit.wikimedia.org/r/593626 [22:18:22] (03PS1) 10Andrew Bogott: Add fernet key numbers for cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/593627 [22:20:17] (03CR) 10Andrew Bogott: "noop victory!" [puppet] - 10https://gerrit.wikimedia.org/r/593626 (owner: 10Andrew Bogott) [22:25:50] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10wiki_willy) >>! In T241884#6098941, @JHedden wrote: > I'd also like to point out that we have another system purchased in the same batch T192119, a... [22:28:53] (03CR) 10BryanDavis: "Cherry picked PS7 to toolsbeta.test@toolsbeta-sgebastion-04:qa/tools-webservice." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) (owner: 10BryanDavis) [22:43:02] PROBLEM - PHP opcache health on mw2219 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:45:16] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:46:58] RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 200 OK - 76085 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:52:31] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10faidon) For completely unrelated to this reasons I was looking today at LDAP and the Google Group integration. JumpClou... [22:53:04] Zuul is being slow :( [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200430T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:09] I can SWAT today! [23:00:36] Here [23:00:44] Thanks Urbanecm ! [23:01:14] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/593589 is for wmf.30, which is, as of now, only on group0 (testwikis). Is it urgent? 
[23:01:35] (03PS3) 10Urbanecm: Assign oathauth-verify-user to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) [23:01:41] It needs to go out in that [23:01:43] (03CR) 10Urbanecm: [C: 03+2] Assign oathauth-verify-user to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) (owner: 10Urbanecm) [23:01:50] That branch [23:02:19] okay, as you want [23:02:31] Thanks for asking [23:02:41] +2'ed, let's wait for CI [23:02:51] meanwhile, going to +2 the config patches [23:03:01] (03Merged) 10jenkins-bot: Assign oathauth-verify-user to stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593286 (https://phabricator.wikimedia.org/T251447) (owner: 10Urbanecm) [23:03:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593579 (https://phabricator.wikimedia.org/T251135) (owner: 10Jdlrobson) [23:04:16] (03Merged) 10jenkins-bot: Logo wordmarks should not define fill color - opacity will be used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593579 (https://phabricator.wikimedia.org/T251135) (owner: 10Jdlrobson) [23:05:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: cf5f7ff: Assign oathauth-verify-user to stewards (T251447) (duration: 01m 05s) [23:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:48] T251447: Assign oathauth-verify-user to stewards at metawiki - https://phabricator.wikimedia.org/T251447 [23:06:13] Jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/593579 is pulled onto mwdebug1001, could you have a look please? [23:06:46] Looking!
[23:07:11] https://wikipedia.org/static/images/mobile/copyright/commons-wordmark-en.svg lgtm [23:07:20] (03PS2) 10Andrew Bogott: Add fernet key numbers for cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/593627 [23:07:22] perfect, syncing [23:07:22] (03PS2) 10Andrew Bogott: keystone: make the max_active_keys a bit smarter [puppet] - 10https://gerrit.wikimedia.org/r/593626 [23:08:10] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591160 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [23:08:13] (03CR) 10Andrew Bogott: [C: 03+2] Add fernet key numbers for cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/593627 (owner: 10Andrew Bogott) [23:08:52] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: ae1424a: Logo wordmarks should not define fill color - opacity will be used (T251135) (duration: 01m 05s) [23:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:01] T251135: Make mobile wordmark gray again - https://phabricator.wikimedia.org/T251135 [23:10:20] (03PS4) 10Urbanecm: Add project taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591160 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [23:10:24] (03CR) 10Urbanecm: Add project taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591160 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [23:10:27] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591160 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [23:11:19] (03Merged) 10jenkins-bot: Add project taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591160 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [23:11:46] Jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/591160 is available at mwdebug1001 for testing too. Let me know! 
[23:12:07] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Andrew) 05Resolved→03Open a:05Andrew→03Papaul Just noticed that this is already closed :) thanks @Papaul [23:12:55] https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-eu.svg 💪 [23:13:18] nice [23:13:27] I get a 404 [23:14:23] Reedy: I don't? (just on mwdebug now, through scap syncs the static files) [23:14:25] RECOVERY - PHP opcache health on mw2219 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:14:36] ohh [23:14:48] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 05s) [23:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:56] T249047: [Site Config] Make new logos available in production in preparation for T246170 - https://phabricator.wikimedia.org/T249047 [23:15:10] Yeh you need to use xdebug [23:15:29] ahh, wrong path... [23:15:32] syncing again [23:17:14] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 04s) [23:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:38] xdebug != x-wikimedia-debug [23:17:39] ;) [23:18:01] That is correct. Sorry!
[23:19:09] W00t new logo https://en.wikipedia.beta.wmflabs.org/w/index.php?title=AHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH&oldid=417650&mobileaction=toggle_view_desktop [23:19:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 04s) [23:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:14] Urbanecm on 2nd thoughts we can skip the opacity minerva change [23:20:24] I'll get it done monday [23:20:32] okay [23:20:42] After seeing the logos in black on mobile rather than opaque it's not terrible [23:20:52] Thanks for suggesting i hold back i appreciate it. [23:21:01] And thanks for swatting these exciting changes [23:21:09] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 04s) [23:21:10] happy to help! [23:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:15] T249047: [Site Config] Make new logos available in production in preparation for T246170 - https://phabricator.wikimedia.org/T249047 [23:21:19] so that should be all Jdlrobson :) [23:21:39] Thank you!! [23:21:56] happy to help! [23:23:09] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/593543 (owner: 10Volans) [23:24:01] (03CR) 10CRusnov: [C: 03+1] "Looks good, I really appreciate the Todo item in the comment. THat's good." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/593544 (owner: 10Volans) [23:24:08] * Urbanecm done [23:29:31] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [23:32:41] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 53.9 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [23:40:31] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:42:09] RECOVERY - PHP7 rendering on mw1355 is OK: HTTP OK: HTTP/1.1 200 OK - 76089 bytes in 9.401 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:45:57] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 3.051 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [23:48:09] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 1.017 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [23:48:12] Clicking "show preview" on metawiki is opening the preview in a new window - is this intentional? [23:51:07] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 797 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:52:24] 10Operations, 10ops-codfw, 10DC-Ops: host rename: labtestservices2003.wikimedia.org -> cloudservices2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T251576 (10Papaul) p:05Triage→03Medium [23:57:20] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Papaul) 05Open→03Resolved