[00:27:07] (03CR) 10Bstorm: sssd: Add a whole duplicate hierarchy of sssd images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [00:29:14] (03CR) 10Krinkle: [C: 03+1] Lower gzip threshold for SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [00:43:41] !log db2060 - remove PXE flag boot override - set Boot Device to none [00:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:57] 10Operations, 10fundraising-tech-ops, 10netops: possible routing issue between eqiad and Maxmind network - https://phabricator.wikimedia.org/T233672 (10Jgreen) [01:12:06] 10Operations, 10fundraising-tech-ops, 10netops: possible routing issue between eqiad and Maxmind network - https://phabricator.wikimedia.org/T233672 (10Jgreen) p:05Triage→03Unbreak! Flipping this to "Unbreak Now!" since it's a timely issue, and service outage interfering with the donation pipeline. We do... [01:47:21] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:03:13] 10Operations, 10fundraising-tech-ops, 10netops: possible routing issue between eqiad and Maxmind network - https://phabricator.wikimedia.org/T233672 (10ayounsi) [02:03:48] 10Operations, 10fundraising-tech-ops, 10netops: possible routing issue between eqiad and Maxmind network - https://phabricator.wikimedia.org/T233672 (10ayounsi) a:03ayounsi All those IPs are behind Cloudflare. Opened a ticket with them. 
[02:23:50] (03CR) 10CRusnov: "> Patch Set 1: -Code-Review" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [02:30:35] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1024.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:36:52] (03PS7) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) [02:37:06] (03CR) 10CRusnov: "Thanks for review!" (035 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [02:37:37] (03CR) 10jerkins-bot: [V: 04-1] Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [02:45:56] (03PS4) 10CRusnov: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) [02:46:30] (03CR) 10CRusnov: "Cool, test fix in latest PS." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [02:47:59] 10Operations, 10fundraising-tech-ops, 10netops: possible routing issue between eqiad and Maxmind network - https://phabricator.wikimedia.org/T233672 (10ayounsi) 05Open→03Resolved Resolved by Cloudflare. 
[02:50:11] (03CR) 10CRusnov: "> Patch Set 14: Verified-1" [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [02:50:13] (03CR) 10jerkins-bot: [V: 04-1] netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [03:19:55] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:20:13] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:21:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:21:51] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:35:05] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:39:25] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:45:58] !log crusnov@cumin1001 START - Cookbook sre.hosts.downtime [03:45:58] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [03:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:02] !log rebooted netboxdb[12]001 for kernel upgrade [03:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:47] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:09:29] (03PS5) 10CRusnov: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) [04:13:01] !log Start pre switchover steps - T230783 [04:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:05] T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 [04:20:17] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [04:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set weight 0 to db1123 T230783', 
diff saved to https://phabricator.wikimedia.org/P9156 and previous config saved to /var/cache/conftool/dbconfig/20190924-042121-marostegui.json
[04:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:26] T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783
[04:21:57] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[04:22:13] (PS3) Marostegui: mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/538522 (https://phabricator.wikimedia.org/T230783)
[04:22:36] (CR) Marostegui: mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/538522 (https://phabricator.wikimedia.org/T230783) (owner: Marostegui)
[04:27:58] (CR) Marostegui: [C: +2] mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/538522 (https://phabricator.wikimedia.org/T230783) (owner: Marostegui)
[04:50:30] In 10 minutes we'll start the s3 switchover
[04:53:06] Operations, ops-eqiad, DBA, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (jcrespo) > whether that needs changing on the desired thresholds is a different discussion. The director of SRE was the person who decided that at the time becaus...
[05:00:04] marostegui and jynus: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for s3 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T0500).
[05:00:08] jynus: ready?
[05:00:11] ok
[05:00:14] !log Starting s3 failover from db1075 to db1123 - T230783
[05:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:18] T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783
[05:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s3 as read-only for maintenance T230783', diff saved to https://phabricator.wikimedia.org/P9157 and previous config saved to /var/cache/conftool/dbconfig/20190924-050034-marostegui.json
[05:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:55] mmm
[05:00:58] I can still edit
[05:01:07] me too
[05:01:20] it is not s3
[05:01:24] it is DEFAULT or something
[05:01:24] is it DEFAULT?
[05:01:28] let's see
[05:01:43] argh
[05:01:48] section not found
[05:01:51] for default
[05:01:53] this is a dbctl code bug, I am very sorry
[05:02:13] cdanis: is it easy to hotfix? we have a 30-minute window
[05:02:22] asking cause wikitech was an easy one :)
[05:02:26] maybe let's do it old style?
[05:02:27] for the failover we could edit the files in etcd by hand
[05:02:37] you could also do a config push yeah
[05:02:47] Operations, ops-eqiad, DBA, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Joe) >>! In T233534#5518243, @jcrespo wrote: >> whether that needs changing on the desired thresholds is a different discussion. > > The director of SRE was the p...
[05:03:20] jynus: old style as in setting read_only on the master?
[05:03:26] <_joe_> is it?
[05:03:49] cdanis: not sure I am following you with that
[05:04:28] <_joe_> cdanis: so the problem is with ReadOnlyBySection?
[05:04:39] well, I was thinking of editing the php files
[05:04:41] _joe_: yes, it knows to translate to DEFAULT in the sectionLoads output, but not in the readOnlyBySection output
[05:04:50] compare https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/config.py#99 vs line 126 _joe_
[05:04:53] <_joe_> yeah we can just push the data to etcd I agree
[05:04:55] but setting read only on db should be detected automatically too
[05:05:06] I am going to remove the "ro" with dbctl to leave it as it was for now
[05:05:09] although less clean
[05:05:26] <_joe_> marostegui: no wait
[05:05:30] ok
[05:05:46] or cannot we just write DEFAULT on the edit to etcd manually?
[05:05:49] jynus: the switchover script would fail, as I believe the check for RO on the database will fail
[05:06:22] yeah, I meant just letting the script do its thing
[05:06:30] or using read only master option
[05:06:35] both would work
[05:06:36] ah, I see what you mean
[05:06:51] I would wait for joe's suggestion
[05:06:56] let's wait for cdanis and _joe_ to see if it can be fixed easily
[05:07:00] I would only do that if we were in an emergency
[05:07:04] _joe_: I am ready to edit etcd data by hand
[05:07:12] I have a confctl invocation with an editor open :)
[05:07:29] _joe_: +1 from you?
[05:07:41] <_joe_> cdanis: using confctl edit?
[05:07:45] yes
[05:07:47] <_joe_> on that key?
eek
[05:07:51] <_joe_> ok go on
[05:08:22] <_joe_> then I think we can do a bugfix release in a few minutes, btw
[05:09:02] <_joe_> I'll work on it
[05:09:16] ok sounds good
[05:09:29] I have prepped also a dbctl config restore invocation in another shell, Just In Case(tm)
[05:09:47] <_joe_> I was about to suggest that
[05:10:14] !log T230783 mark DEFAULT not s3 as readonly in etcd dbconfig data
[05:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:17] T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783
[05:10:44] read only confirmed
[05:10:51] I am going to proceed with the topology change
[05:10:58] please do
[05:11:02] marostegui: my ok
[05:11:16] cdanis: done, you can remove read only
[05:11:25] wait
[05:11:38] waiting
[05:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1123 to s3 master and remove read-only from s3 T230783', diff saved to https://phabricator.wikimedia.org/P9158 and previous config saved to /var/cache/conftool/dbconfig/20190924-051147-marostegui.json
[05:11:51] now you can
[05:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:54] new master promoted
[05:11:54] <_joe_> ahahahah
[05:12:01] <_joe_> I would wait a few seconds
[05:12:06] <_joe_> for mediawiki to catch up
[05:12:11] <_joe_> it has up to 15 seconds
[05:12:43] from the DB side we are good to go
[05:12:46] failing with config vs failing with db, not much of a difference
[05:13:06] job queue may take half an hour to update
[05:13:07] <_joe_> cdanis: want me to edit it back?
[05:13:08] !log cdanis@cumin1001 dbctl commit (dc=all): 're-do T230783 master promotion and set read-write', diff saved to https://phabricator.wikimedia.org/P9159 and previous config saved to /var/cache/conftool/dbconfig/20190924-051307-cdanis.json
[05:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:14] _joe_: already done
[05:13:17] and dumps hours or days
[05:13:19] I can edit now
[05:13:34] me too
[05:13:46] (CR) Marostegui: wmnet: Update s3-master alias to point to db1123 [dns] - https://gerrit.wikimedia.org/r/538004 (https://phabricator.wikimedia.org/T230783) (owner: Marostegui)
[05:13:49] PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:13:57] uh oh
[05:14:10] that is expected
[05:14:44] _joe_: we need a better way (i.e. any way at all) to bail out of a confctl edit invocation that you then don't want to commit
[05:14:47] PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:15:10] they are only double digit ongoing transactions
[05:15:19] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:29] gone now
[05:15:29] RECOVERY - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:15:33] (PS3) Abijeet Patro: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780)
[05:15:54] <_joe_> cdanis: yeah probably, the edit interface is very barebones
[05:16:07] Everything looks good I think now
[05:16:25] RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:16:29] (CR) Abijeet Patro: Fix incorrect channel name for TranslationNotifications extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: Abijeet Patro)
[05:16:31] _joe_: I had started a confctl edit before marostegui applied https://phabricator.wikimedia.org/P9158 and then un-did it (but fortunately very easy to fix)
[05:16:42] <_joe_> heh
[05:17:52] <_joe_> cdanis: go to bed, I'll write the fix, the tests, and make a patch to the current version so it's fixed
[05:18:02] I'm just filing a task or two and then doing exactly that :)
[05:18:28] cdanis: thank you :)
[05:18:37] this also reminds me that I kept meaning to add the etcd optimistic concurrency support (checking for a certain previous value of a write)
[05:18:59] <_joe_> we do compare-and-swap
[05:19:15] I don't think we do? I stomped on marostegui's edit just now
[05:19:17] <_joe_> unless something changed in the meantime, maybe not for edit?
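The optimistic-concurrency idea raised in this exchange (refuse a write when the stored value has changed since the writer read it) can be illustrated with a small sketch. This is a toy in-memory model written under assumptions, not conftool's or etcd's actual API: `VersionedStore`, `StaleWriteError`, `expected_version` and `demo` are hypothetical names, and the stored dictionaries are simplified stand-ins for the real dbconfig data.

```python
class StaleWriteError(Exception):
    """Raised when the stored value changed after the caller read it."""


class VersionedStore:
    """Toy key/value store that keeps a modification counter per key."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version=None):
        """Store value; if expected_version is given, reject stale writes."""
        _, current = self._data.get(key, (None, 0))
        if expected_version is not None and expected_version != current:
            raise StaleWriteError(
                f"{key}: expected version {expected_version}, found {current}"
            )
        self._data[key] = (value, current + 1)
        return current + 1


def demo():
    """Replay a long-lived edit racing against a quick commit."""
    store = VersionedStore()
    store.write("s3", {"master": "db1075", "readonly": True})

    # Editor A opens an edit session and leaves it open for a while.
    value_a, version_a = store.read("s3")

    # Meanwhile editor B reads, changes the master, and commits.
    value_b, version_b = store.read("s3")
    store.write("s3", {**value_b, "master": "db1123"}, expected_version=version_b)

    # Editor A finally saves a change built on the stale read.
    try:
        store.write("s3", {**value_a, "readonly": False}, expected_version=version_a)
        outcome = "overwrote"
    except StaleWriteError:
        outcome = "rejected"
    return outcome, store.read("s3")[0]
```

In this sketch the version is captured when the edit session opens and checked when it is saved, so the stale save fails instead of silently reverting the other change. Whether a real implementation should compare a revision counter or the previous value itself is a design choice this sketch does not settle.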
[05:19:18] (PS3) Marostegui: wmnet: Update s3-master alias to point to db1123 [dns] - https://gerrit.wikimedia.org/r/538004 (https://phabricator.wikimedia.org/T230783)
[05:19:26] <_joe_> yeah I'm a bit perplexed
[05:19:39] we certainly don't pass the expected previous value down to the etcd driver
[05:19:46] <_joe_> oh right it works only for point-in-time writes
[05:19:53] <_joe_> not for edits
[05:19:57] cdanis: can you add me as a subscriber to the task you create?
[05:19:59] <_joe_> which can last longer
[05:20:02] right
[05:20:15] <_joe_> so yeah for edits we need a better CAS logic
[05:20:18] dbctl config commits are like that as well, I think; they can stomp on each other because user is presented a diff and such
[05:20:24] (CR) Marostegui: [C: +2] wmnet: Update s3-master alias to point to db1123 [dns] - https://gerrit.wikimedia.org/r/538004 (https://phabricator.wikimedia.org/T230783) (owner: Marostegui)
[05:21:51] (PS2) Elukey: profile::analytics::refinery::job::druid_load: add dims to netflow [puppet] - https://gerrit.wikimedia.org/r/538603 (https://phabricator.wikimedia.org/T229682)
[05:22:32] marostegui: o/ morningggg - can I puppet-merge or is there still work in progress? Don't want to slow-down/interfere
[05:22:39] elukey: go for it!
[05:22:41] thanks for asking
[05:23:47] marostegui: subscribed you to T233679. don't think you will care about T233680 or about T233681 which have more to do with internals
[05:23:48] T233681: compare-and-swap writes for confctl edit and for dbctl commit - https://phabricator.wikimedia.org/T233681
[05:23:48] T233680: No way to cancel a confctl edit invocation - https://phabricator.wikimedia.org/T233680
[05:23:48] T233679: dbctl doesn't always correctly translate section names in its output - https://phabricator.wikimedia.org/T233679
[05:24:25] cdanis: thanks!
I wanted to mention it as a follow up because our RO was longer than usual (still quite good) but I wanted to mention that we encountered some issues and those will be followed up shortly
[05:24:36] ofc :)
[05:24:56] sorry to have broken the streak of under 90 seconds
[05:25:15] the last one was under 60 seconds!
[05:25:16] (CR) Elukey: [C: +2] profile::analytics::refinery::job::druid_load: add dims to netflow [puppet] - https://gerrit.wikimedia.org/r/538603 (https://phabricator.wikimedia.org/T229682) (owner: Elukey)
[05:25:42] cdanis: Thanks for being online for us! :)
[05:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give weight 100 to db1075', diff saved to https://phabricator.wikimedia.org/P9160 and previous config saved to /var/cache/conftool/dbconfig/20190924-052545-marostegui.json
[05:25:47] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:38] <3
[05:26:40] goodnight
[05:27:02] cdanis: night! :)
[05:27:28] cdanis: o/
[05:30:38] would someone be willing to give a 3 line summary of what happened? the scrollback is a bit tedious to dig through
[05:31:12] apergos: IIUC it was a planned failover of the s3 master
[05:31:28] <_joe_> apergos: TL;DR the failover went well, but we found one place where dbctl didn't manage to translate 's3' to 'DEFAULT' for mediawiki
[05:31:35] aaaahhh
[05:31:57] thanks _joe_
[05:32:26] I remember something like that happening recently, what was it?
[05:32:52] <_joe_> like what?
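The section-name problem summarized above (the default section has to appear as DEFAULT in every map MediaWiki reads, not only in sectionLoads) can be sketched roughly as follows. This is an illustration under assumptions, not the conftool code: `DEFAULT_SECTION`, `mw_section_name`, `build_config` and the `loads` / `readonly_reason` fields are hypothetical, and the output shapes are simplified.

```python
DEFAULT_SECTION = "s3"  # assumption for this sketch: s3 is the default section


def mw_section_name(section):
    """Map an internal section name to the key MediaWiki looks up."""
    return "DEFAULT" if section == DEFAULT_SECTION else section


def build_config(sections):
    """Build both output maps, translating the section name in each one.

    `sections` maps a section name to a dict holding a `loads` mapping and
    an optional `readonly_reason` string.
    """
    config = {"sectionLoads": {}, "readOnlyBySection": {}}
    for name, section in sections.items():
        key = mw_section_name(name)
        config["sectionLoads"][key] = section["loads"]
        reason = section.get("readonly_reason")
        if reason:
            # Writing under `name` instead of `key` here would reproduce the
            # symptom from the log: the flag lands under "s3" and is ignored.
            config["readOnlyBySection"][key] = reason
    return config


if __name__ == "__main__":
    example = {
        "s3": {"loads": {"db1075": 0, "db1123": 100}, "readonly_reason": "maintenance"},
        "s4": {"loads": {"db1081": 0}},
    }
    print(build_config(example))
```

Routing every output through a single translation helper is one way to keep the maps from drifting apart, since a new output cannot be added without choosing a key for it.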
[05:33:03] something related to s3 default on dbctl
[05:33:12] jynus: it was with wikitech
[05:33:16] ah
[05:33:18] that
[05:33:21] so not related
[05:33:52] I think the issue with wikitech is that it wasn't listed as a section
[05:33:58] yeah
[05:34:07] or it has a different name
[05:34:11] *had
[05:34:19] No, it wasn't listed
[05:35:37] https://gerrit.wikimedia.org/r/#/c/operations/software/conftool/+/534153/
[05:35:57] thanks
[05:36:36] <_joe_> yeah we needed to just add it to the allowed sections in the schema.
[05:37:39] <_joe_> I think it wasn't considered a normal section back when I first wrote dbctl, and it changed in between, so that did slip through the cracks at the time
[05:37:55] yeah, it should eventually disappear
[05:39:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1075', diff saved to https://phabricator.wikimedia.org/P9161 and previous config saved to /var/cache/conftool/dbconfig/20190924-053919-marostegui.json
[05:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:54] Operations, DBA, Patch-For-Review: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui) This was done successfully. read only start: 05:10:14 UTC AM read only stop: 05:13:08 UTC AM total read only time: 2 minutes 54 s...
[05:39:57] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (Marostegui)
[05:40:00] Operations, DBA, Patch-For-Review: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui) Open→Resolved
[05:50:09] Operations, ops-eqiad, DBA, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Marostegui) >>!
In T233534#5517306, @Krenair wrote: > I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed th... [05:59:06] (03PS1) 10Giuseppe Lavagetto: Translate the default section name to DEFAULT in ReadOnlyBySection too [software/conftool] - 10https://gerrit.wikimedia.org/r/538732 (https://phabricator.wikimedia.org/T233679) [06:02:26] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Joe) >>! In T233534#5518359, @Marostegui wrote: >>>! In T233534#5517306, @Krenair wrote: >> I'm wondering if an entry should be added under "Where did we get lucky... [06:02:35] <_joe_> marostegui: this ^^ will fix the bug we found this morning [06:05:42] _joe_: oh, that was fast! [06:06:16] <_joe_> that was easy, most importantly :D [06:06:45] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) >>! In T233534#5518388, @Joe wrote: >>>! In T233534#5518359, @Marostegui wrote: >>>>! In T233534#5517306, @Krenair wrote: >>> I'm wondering if an entry... [06:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1075', diff saved to https://phabricator.wikimedia.org/P9162 and previous config saved to /var/cache/conftool/dbconfig/20190924-061943-marostegui.json [06:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:38] (03CR) 10Volans: [C: 03+1] "LGTM, good catch!" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/538732 (https://phabricator.wikimedia.org/T233679) (owner: 10Giuseppe Lavagetto) [06:24:31] 10Operations, 10DBA: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Marostegui) a:03Marostegui [06:29:21] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10Marostegui) [06:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1075', diff saved to https://phabricator.wikimedia.org/P9163 and previous config saved to /var/cache/conftool/dbconfig/20190924-063002-marostegui.json [06:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Translate the default section name to DEFAULT in ReadOnlyBySection too [software/conftool] - 10https://gerrit.wikimedia.org/r/538732 (https://phabricator.wikimedia.org/T233679) (owner: 10Giuseppe Lavagetto) [06:35:11] (03PS1) 10Marostegui: mariadb: Decommission db1066 [puppet] - 10https://gerrit.wikimedia.org/r/538739 (https://phabricator.wikimedia.org/T233071) [06:35:58] (03Merged) 10jenkins-bot: Translate the default section name to DEFAULT in ReadOnlyBySection too [software/conftool] - 10https://gerrit.wikimedia.org/r/538732 (https://phabricator.wikimedia.org/T233679) (owner: 10Giuseppe Lavagetto) [06:36:55] !log Remove db1066 from tendril and zarcillo T233071 [06:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:58] T233071: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 [06:36:58] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1066 [puppet] - 10https://gerrit.wikimedia.org/r/538739 (https://phabricator.wikimedia.org/T233071) (owner: 10Marostegui) [06:37:30] !log Stop MySQL on db1066 - T233071 [06:37:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:55] 10Operations, 10ops-eqiad, 10decommission: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (10Marostegui) a:05Marostegui→03RobH [06:39:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (10Marostegui) This host is ready for #dc-ops to decommission [06:40:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [06:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:47] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2044.codfw.wmnet` - db2044.codfw.wmnet (**PASS**) - Downtimed host on Ic... 
[06:41:33] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 6 others: RFC: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Nikerabbit) [06:43:10] (03PS1) 10Marostegui: mariadb: Remove db2044 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538740 (https://phabricator.wikimedia.org/T230761) [06:43:36] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [06:44:58] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db2044 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538740 (https://phabricator.wikimedia.org/T230761) (owner: 10Marostegui) [06:45:27] (03PS1) 10Marostegui: wmnet: Remove db2044 production entries [dns] - 10https://gerrit.wikimedia.org/r/538741 (https://phabricator.wikimedia.org/T230761) [06:46:32] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove db2044 production entries [dns] - 10https://gerrit.wikimedia.org/r/538741 (https://phabricator.wikimedia.org/T230761) (owner: 10Marostegui) [06:47:29] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Marostegui) a:05RobH→03Papaul [06:47:52] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Marostegui) Host ready for @Papaul to finish the last steps after running the decommissioning script. 
[06:55:22] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [07:04:31] (03PS1) 10Elukey: Set Debian Buster for krb1001's PXE boot [puppet] - 10https://gerrit.wikimedia.org/r/538746 (https://phabricator.wikimedia.org/T233141) [07:06:48] (03CR) 10Elukey: [C: 03+2] Set Debian Buster for krb1001's PXE boot [puppet] - 10https://gerrit.wikimedia.org/r/538746 (https://phabricator.wikimedia.org/T233141) (owner: 10Elukey) [07:12:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/538048 (https://phabricator.wikimedia.org/T233189) (owner: 10Volans) [07:16:36] (03PS3) 10Volans: admin: move Papaul from datacenter-ops to ops group [puppet] - 10https://gerrit.wikimedia.org/r/538048 (https://phabricator.wikimedia.org/T233189) [07:18:09] (03PS1) 10Marostegui: mariadb: Promote db1138 to master [puppet] - 10https://gerrit.wikimedia.org/r/538747 (https://phabricator.wikimedia.org/T230784) [07:18:22] !log swift eqiad-prod: continue ms-be1027 decom T233289 [07:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:26] T233289: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 [07:19:13] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/538747 (https://phabricator.wikimedia.org/T230784) (owner: 10Marostegui) [07:19:50] (03CR) 10Volans: [C: 03+2] admin: move Papaul from datacenter-ops to ops group [puppet] - 10https://gerrit.wikimedia.org/r/538048 (https://phabricator.wikimedia.org/T233189) (owner: 10Volans) [07:20:26] (03PS1) 10Marostegui: wmnet: Update s4-master to point to db1138 [dns] - 10https://gerrit.wikimedia.org/r/538748 (https://phabricator.wikimedia.org/T230784) [07:21:07] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/538748 
(https://phabricator.wikimedia.org/T230784) (owner: 10Marostegui) [07:24:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [07:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:01] !log uploaded openjdk-8 8u222-b10-1~deb10u2 to buster-wikimedia component/jdk8 T233604 [07:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:04] T233604: Create OpenJDK 8 packages for Buster - https://phabricator.wikimedia.org/T233604 [07:29:51] (03PS1) 10Effie Mouzeli: Convert 100% of API servers to only serve PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/538749 [07:30:25] 10Operations: Create OpenJDK 8 packages for Buster - https://phabricator.wikimedia.org/T233604 (10MoritzMuehlenhoff) 05Open→03Resolved This is uploaded to apt.wikimedia.org and ready to use. 
(I've Installed the packages on stat1005 for a quick installability test) [07:30:37] (03PS2) 10Effie Mouzeli: Convert 100% of API servers to only serve PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/538749 (https://phabricator.wikimedia.org/T219150) [07:32:32] moritzm: \o/ [07:32:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [07:33:04] (03PS1) 10Elukey: Remove Python 2 packages from Analytics Client nodes [puppet] - 10https://gerrit.wikimedia.org/r/538750 (https://phabricator.wikimedia.org/T204734) [07:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:07] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2049.codfw.wmnet` - db2049.codfw.wmnet (**PASS**) - Downtimed host on Ic... 
[07:34:34] (03PS1) 10Marostegui: mariadb: Decommission db2049 [puppet] - 10https://gerrit.wikimedia.org/r/538751 (https://phabricator.wikimedia.org/T230721) [07:34:58] (03PS1) 10Marostegui: wmnet: Remove db2049 production entries [dns] - 10https://gerrit.wikimedia.org/r/538752 (https://phabricator.wikimedia.org/T230721) [07:35:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2049 [puppet] - 10https://gerrit.wikimedia.org/r/538751 (https://phabricator.wikimedia.org/T230721) (owner: 10Marostegui) [07:36:23] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove db2049 production entries [dns] - 10https://gerrit.wikimedia.org/r/538752 (https://phabricator.wikimedia.org/T230721) (owner: 10Marostegui) [07:37:12] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) a:05RobH→03Papaul [07:37:47] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) Host ready for @Papaul to decommission after running the decommission script. 
[07:38:10] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:38:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:46] doing some tests with debmonitor, it might be temporarily unavailable (downtimed it) [07:39:09] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/538626 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [07:42:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [07:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:06] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2046.codfw.wmnet` - db2046.codfw.wmnet (**PASS**) - Downtimed host on Ic... [07:44:12] (03CR) 10Elukey: "elukey@cumin1001:~$ sudo cumin 'c:profile::analytics::cluster::packages::common or c:profile::analytics::cluster::packages::statistics' 'l" [puppet] - 10https://gerrit.wikimedia.org/r/538750 (https://phabricator.wikimedia.org/T204734) (owner: 10Elukey) [07:45:24] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:32] (03PS1) 10Marostegui: mariadb: Decommission db2046 [puppet] - 10https://gerrit.wikimedia.org/r/538754 (https://phabricator.wikimedia.org/T231767) [07:47:06] (03CR) 10Muehlenhoff: "You can use debdeploy to check reverse dependencies, e.g." [puppet] - 10https://gerrit.wikimedia.org/r/538750 (https://phabricator.wikimedia.org/T204734) (owner: 10Elukey) [07:47:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Ops Group for papaul@ - https://phabricator.wikimedia.org/T233189 (10Volans) 05Open→03Resolved >>! In T233189#5512761, @faidon wrote: > This is approved. Patch merged, this is now effective. Resolving the task. @Papaul feel fr... [07:48:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2046 [puppet] - 10https://gerrit.wikimedia.org/r/538754 (https://phabricator.wikimedia.org/T231767) (owner: 10Marostegui) [07:48:49] (03PS1) 10Marostegui: wmnet: Remove db2046 production entries [dns] - 10https://gerrit.wikimedia.org/r/538755 (https://phabricator.wikimedia.org/T231767) [07:49:42] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove db2046 production entries [dns] - 10https://gerrit.wikimedia.org/r/538755 (https://phabricator.wikimedia.org/T231767) (owner: 10Marostegui) [07:50:39] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Marostegui) a:05RobH→03Papaul [07:50:58] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Marostegui) Host ready for @Papaul to decommission after running the decommission script. 
[07:51:17] !log depool wdqs1006 to clear HTTP too many request error [07:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:54] (03PS3) 10Effie Mouzeli: Convert 100% of API servers to only serve PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/538749 (https://phabricator.wikimedia.org/T219150) [07:54:15] (03CR) 10Volans: [C: 03+1] "LGTM, one possible small improvement inline" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [07:58:24] (03PS1) 10Elukey: Add AAAA/PTR records for krb1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/538799 (https://phabricator.wikimedia.org/T233141) [07:59:34] (03PS2) 10Elukey: Add AAAA/PTR records for krb1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/538799 (https://phabricator.wikimedia.org/T233141) [08:00:24] (03PS4) 10Effie Mouzeli: Convert 100% of API servers to only serve PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/538749 (https://phabricator.wikimedia.org/T219150) [08:02:35] (03CR) 10Elukey: [C: 03+2] Add AAAA/PTR records for krb1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/538799 (https://phabricator.wikimedia.org/T233141) (owner: 10Elukey) [08:05:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:14] telia transport down with eqord --^, seems planned maintenance [08:07:05] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 6.148e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:07:43] looking [08:08:42] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - 
https://phabricator.wikimedia.org/T233534 (10Krenair) [08:09:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:20] and this is the other side :) --^ [08:16:49] (03PS1) 10Ammarpad: Update logo for mx.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538834 (https://phabricator.wikimedia.org/T233670) [08:18:23] !log Deploy schema change on s5 master with replication - T231172 [08:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:27] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [08:19:21] !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml stable/zotero [namespace: zotero, clusters: staging] [08:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:31] !log mvolz@deploy1001 scap-helm zotero cluster staging completed [08:19:31] !log mvolz@deploy1001 scap-helm zotero finished [08:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:32] (03CR) 10Effie Mouzeli: [V: 03+1] "Looks ok https://puppet-compiler.wmflabs.org/compiler1001/18525/" [puppet] - 10https://gerrit.wikimedia.org/r/538749 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [08:23:54] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1006 is CRITICAL: 6.164e+04 ge 4.32e+04 Mathew.onipe node is depooled for lag to catch up - The acknowledgement expires at: 2019-09-25 18:21:50. 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:29:12] !log Disable puppet on api cluster and restart php-fpm to finish php7 migration - T219150 [08:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:16] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [08:30:15] !log Deploy schema change on s4 master with replication - T231172 [08:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:18] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [08:30:33] (03PS1) 10Filippo Giunchedi: prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) [08:31:40] (03PS9) 10Jcrespo: backups: Change file owner of bacula storage&director config [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) [08:31:42] (03PS1) 10Jcrespo: mariadb: make core_test hosts not page on replication/process issues [puppet] - 10https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782) [08:32:18] (03CR) 10Urbanecm: [C: 04-1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [08:32:51] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [08:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:56] (03PS2) 10Jcrespo: mariadb: make core_test hosts not page on replication/process issues [puppet] - 10https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782) [08:32:59] !log 
mvolz@deploy1001 scap-helm zotero cluster eqiad completed [08:32:59] !log mvolz@deploy1001 scap-helm zotero finished [08:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:27] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10fgiunchedi) I'm done with both graphite200[12], good to go on my end [08:33:50] !log installed expat security updates on remaining mw* servers [08:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:02] (03CR) 10Marostegui: [C: 03+1] mariadb: make core_test hosts not page on replication/process issues [puppet] - 10https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782) (owner: 10Jcrespo) [08:36:03] !log stop db1114 mariadb process for some time [08:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:24] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [08:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:26] !log mvolz@deploy1001 scap-helm zotero cluster codfw completed [08:37:26] !log mvolz@deploy1001 scap-helm zotero finished [08:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:44] !log Deploy schema change on s8 master with replication - T231172 [08:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:47] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [08:46:36] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] Convert 100% of API servers to only serve PHP7.2 [puppet] - 
10https://gerrit.wikimedia.org/r/538749 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [08:47:14] (03PS5) 10Effie Mouzeli: Convert 100% of API servers to only serve PHP7.2 [puppet] - 10https://gerrit.wikimedia.org/r/538749 (https://phabricator.wikimedia.org/T219150) [08:47:45] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:10] \o/ [08:52:37] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10jcrespo) [08:53:24] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) [08:53:47] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) p:05Triage→03Normal [08:54:24] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) @Cmjohnson or @Jclark-ctr let me know when it is a moment to power drain this host and I will have it ready (aka I will depool it) [08:54:38] (03PS1) 10Effie Mouzeli: hiera: revert change to mw2206.yaml [puppet] - 10https://gerrit.wikimedia.org/r/538841 [08:57:33] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: revert change to mw2206.yaml [puppet] - 10https://gerrit.wikimedia.org/r/538841 (owner: 10Effie Mouzeli) [09:00:18] (03PS1) 10Muehlenhoff: Switch IDP logs to day-based log files [puppet] - 10https://gerrit.wikimedia.org/r/538843 [09:01:38] (03PS1) 10Elukey: profile::java::analytics: deploy openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) [09:06:36] (03CR) 10Muehlenhoff: 
profile::java::analytics: deploy openjdk-8 on Buster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [09:07:27] (03PS2) 10Elukey: profile::java::analytics: deploy openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) [09:09:06] (03CR) 10Elukey: profile::java::analytics: deploy openjdk-8 on Buster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [09:09:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:48] (03CR) 10Muehlenhoff: profile::java::analytics: deploy openjdk-8 on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [09:11:27] (03PS3) 10Elukey: profile::java::analytics: deploy openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) [09:13:11] (03CR) 10Volans: "> Patch Set 14:" [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [09:13:27] (03CR) 10Filippo Giunchedi: [C: 04-1] "The idea LGTM, see inline" (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/538711 (owner: 10Cwhite) [09:14:52] !log Drop table archive_save on frwiki T233187 [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:55] T233187: Drop frwiki.archive_save table - https://phabricator.wikimedia.org/T233187 [09:17:03] jouncebot next [09:17:03] In 1 hour(s) and 42 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1100) [09:17:30] (03PS2) 10Awight: FileImporter: limited default 
deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) [09:17:32] (03PS2) 10Awight: FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) [09:18:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/18530/" [puppet] - 10https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [09:21:38] (03PS8) 10Filippo Giunchedi: Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [09:22:13] (03CR) 10Filippo Giunchedi: [C: 03+2] Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [09:26:24] !log Upgrade to php 7.2.22 on deploy* - T230024 [09:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:28] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [09:27:23] 10Operations, 10Citoid, 10Release Pipeline, 10Services: Migrate citoid and zotero services to helm ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (10Mvolz) [09:30:16] !log Deploy schema change on s2 master with replication - T231172 [09:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:20] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [09:32:19] (03PS1) 10Filippo Giunchedi: Revert "Set up scap target for deploying the phatality plugin into kibana" [puppet] - 10https://gerrit.wikimedia.org/r/538848 [09:32:47] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Set up scap target for deploying the phatality plugin into kibana" [puppet] - 
10https://gerrit.wikimedia.org/r/538848 (owner: 10Filippo Giunchedi) [09:32:55] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "Set up scap target for deploying the phatality plugin into kibana" [puppet] - 10https://gerrit.wikimedia.org/r/538848 (owner: 10Filippo Giunchedi) [09:34:52] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18529/" [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [09:36:34] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 3 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Joe) 05Open→03Resolved [09:37:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] backups: Change file owner of bacula storage&director config (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:38:10] (03PS4) 10Jdrewniak: Enable alternate mobile link for ar,zh,hi wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [09:40:20] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 9: Code-Review-1" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:41:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [09:42:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:42] (03PS4) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [09:45:44] (03PS6) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [09:45:46] (03PS1) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [09:47:31] 10Operations, 10Citoid, 10Release Pipeline, 10Services: Migrate citoid and zotero services to helm ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (10Mvolz) [09:48:46] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:51:55] !log Upgrade to php 7.2.22 on mwmaint* - T230024 [09:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:59] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [09:52:57] PROBLEM - Varnish HTCP daemon on cp1075 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (vhtcpd), 
args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish [09:55:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:25] !log Deploy schema change on labswiki (wikitech) and labtestwiki T231172 [09:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:28] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [09:59:27] (03CR) 10Volans: "Thanks for writing a cookbook! Some comments inline." (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [10:03:26] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [10:03:31] !log Deploy schema change on s1 master with replication - T231172 [10:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:35] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [10:06:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:31] !log Deploy schema change on s7 (centralauth and wikis) master with replication - T231172 [10:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:35] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [10:21:23] (03CR) 10Volans: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [10:22:28] (03CR) 1020after4: "Weird, ok I'll remove the 
dependency." [puppet] - 10https://gerrit.wikimedia.org/r/538848 (owner: 10Filippo Giunchedi) [10:23:57] (03CR) 10Volans: "> Patch Set 4:" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [10:24:33] twentyafterfour: thanks! also do you know if we need to add the repo to hieradata/role/common/deployment_server.yaml ? [10:25:16] the empty repo I'd assumed it would make scap barf [10:28:29] (03CR) 10Thiemo Kreuz (WMDE): Lower gzip threshold for SVGs served by MediaWiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [10:29:22] !log mobrovac@deploy1001 Started deploy [restbase/deploy@19d0f44]: Expose the key_value buckets to production IPs - T223953 [10:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:26] T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 [10:34:49] (03PS2) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [10:35:24] jouncebot next [10:35:24] In 0 hour(s) and 24 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1100) [10:41:47] (03PS1) 10Giuseppe Lavagetto: mediawiki: move all appservers to PHP7 as well. [puppet] - 10https://gerrit.wikimedia.org/r/538855 (https://phabricator.wikimedia.org/T219150) [10:43:44] <_joe_> waiting for pcc then we can just merge if all's good [10:43:55] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:44:20] <_joe_> akosiaris: is that wikifeeds?
^^ [10:44:49] not sure, I don't know if the functionality has been deployed yet to restbase [10:45:02] it could be mobileapps as well that is [10:45:11] * akosiaris checking [10:45:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18533/ shows the patch applies cleanly on one canary, one non-canary, and one previously " [puppet] - 10https://gerrit.wikimedia.org/r/538855 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [10:45:27] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:04] godog: I think we do need to add the repo to deployment_server [10:46:11] it's been a while since I messed with scap so I'm rusty [10:47:09] _joe_: aside from a puppet config that tells restbase where to find wikifeeds doesn't look like anything changing how restbase responds to /v1/feed requests has changed [10:47:28] twentyafterfour: yeah same here, but yeah I too think will be needed [10:47:42] <_joe_> akosiaris: ack I had no idea if something changed in the meanwhile [10:50:54] <_joe_> !log converting mw1261 to full-php7 [10:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:41] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@19d0f44]: Expose the key_value buckets to production IPs - T223953 (duration: 22m 20s) [10:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:45] T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 [10:52:40] akosiaris: _joe_: rb is not using wikifeeds yet [10:52:49] i plan to deploy that change later today [10:53:16] but first, akosiaris, you should try restrouter in staging now, it should work (TM) [10:54:16] <_joe_> !log converting all appservers to php7, T219150 [10:54:19] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:19] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [10:55:05] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@7857639]: Bump CirrusSearchLinksUpdate concurrency to clear the queue - T233584 [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:08] T233584: Re-adjust cirrusSearchLinksUpdate vs cirrusSearchLinksUpdatePrioritized concurrency - https://phabricator.wikimedia.org/T233584 [10:56:04] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@7857639]: Bump CirrusSearchLinksUpdate concurrency to clear the queue - T233584 (duration: 01m 00s) [10:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1100). [11:00:04] awight, kostajh, and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] o/ [11:00:15] great! [11:00:25] I can SWAT today! 
[11:00:29] I/ [11:00:36] Thank you :-) [11:00:53] starting with awight's patches [11:01:06] (03PS3) 10Urbanecm: FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:01:11] (03PS3) 10Urbanecm: FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:01:17] (03CR) 10Urbanecm: [C: 03+2] FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:01:21] hi [11:02:21] (03Merged) 10jenkins-bot: FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:02:44] hi kostajh [11:03:01] I've +2'ed your backport kostajh [11:03:06] Urbanecm: thx! 
[11:03:26] (03CR) 10jenkins-bot: FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:03:40] (03CR) 10Urbanecm: [C: 03+2] FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:04:26] awight: syncing the VariantSettings.php patch [11:04:36] (03Merged) 10jenkins-bot: FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:05:17] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 8a89652: FileImporter: limited default deployment (1/2; T232539) (duration: 01m 03s) [11:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:20] T232539: [Deployment] FileExporter as a default feature to first wikis, 2019-09-24 - https://phabricator.wikimedia.org/T232539 [11:05:33] 10Operations, 10serviceops: Remove PHP 7.0 from production application servers - https://phabricator.wikimedia.org/T220600 (10jijiki) 05Open→03Invalid After discussing with @MoritzMuehlenhoff, since we are planning to reimage all mw* servers after finishing PHP7 migration, we can mark this as invalid. [11:06:17] Urbanecm: thnx for swatting https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MassMessage/+/538742/ [11:06:26] yw mobrovac [11:06:37] (03CR) 10jenkins-bot: FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [11:06:50] is it possible to test it? 
[11:06:53] PROBLEM - PHP opcache health on mw2223 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:07:01] Urbanecm: yes, it would be good to test first. [11:07:02] 10Operations, 10serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (10jijiki) @Dzahn phabricator servers are still on php7.2.8, should they be upgraded to 7.2.22 ? [11:07:04] (03PS1) 10Vgutierrez: Release 8.0.5-1wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/538857 (https://phabricator.wikimedia.org/T233667) [11:07:17] awight: your patch is on mwdebug1002 [11:07:48] Urbanecm: looks happy! [11:07:57] cool, I'll sync then awight [11:07:59] (03PS1) 1020after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) [11:08:03] ack [11:09:18] (03CR) 1020after4: "@filippo: This should be better than my previous attempt, however, I have to be AFK for a while this morning so if you aren't comfortable " [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [11:09:28] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: a14b772: FileImporter: limited default deployment (2/2; T232539) (duration: 00m 56s) [11:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:34] awight: done! [11:09:42] thanks :-) [11:09:49] (03CR) 10Urbanecm: [C: 03+2] Enable alternate mobile link for ar,zh,hi wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:10:07] RECOVERY - PHP opcache health on mw2223 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:10:23] jan_drewniak: your patch is next [11:10:24] (03PS5) 10Urbanecm: Enable alternate mobile link for ar,zh,hi wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:10:30] <_joe_> !log all wikis (including API) are now served by PHP7 T219150 [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:34] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [11:10:41] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/538857 (https://phabricator.wikimedia.org/T233667) (owner: 10Vgutierrez) [11:10:41] PROBLEM - PHP opcache health on mw2202 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:10:42] (03CR) 10Urbanecm: [C: 03+2] Enable alternate mobile link for ar,zh,hi wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:10:51] Urbanecm: sounds good [11:11:07] ^ should recover soon [11:11:39] (03Merged) 10jenkins-bot: Enable alternate mobile link for ar,zh,hi wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:11:45] PROBLEM - PHP opcache health on mw2217 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:11:50] <_joe_> effie: uhm not sure? 
[11:11:57] (03CR) 10jenkins-bot: Enable alternate mobile link for ar,zh,hi wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:12:03] <_joe_> those servers are in codfw, were they restarted? [11:12:21] PROBLEM - PHP opcache health on mw2207 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:12:23] I restarted the api ones way long ago [11:12:43] <_joe_> it's expected on codfw, kindof [11:12:45] jan_drewniak: your patch is on mwdebug1002, can you test? [11:14:48] Urbanecm: yup looks fine [11:14:53] Urbanecm: I snuck one more backport into the queue, FYI. [11:14:58] thanks jan_drewniak [11:14:59] RECOVERY - PHP opcache health on mw2217 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:15:35] awight: acknowledged [11:16:14] awight: that's an i18n change, we'd have to run full scap to make it happen. Do you think it is worth it? [11:16:27] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 8bf6aae: Enable alternate mobile link for ar,zh,hi wikis (T206497) (duration: 00m 54s) [11:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:31] T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias - https://phabricator.wikimedia.org/T206497 [11:16:31] jan_drewniak: synced [11:17:02] Urbanecm: great, thanks! 
[11:17:05] yw [11:17:28] Urbanecm: good point, let's not do this patch, thank you [11:17:43] okay, thanks awight [11:20:13] (03PS2) 10Urbanecm: Follow-up 8f3f0705baed: add missing namespace for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538709 (https://phabricator.wikimedia.org/T233562) (owner: 10MarcoAurelio) [11:20:17] (03CR) 10Urbanecm: [C: 03+2] Follow-up 8f3f0705baed: add missing namespace for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538709 (https://phabricator.wikimedia.org/T233562) (owner: 10MarcoAurelio) [11:20:43] (03CR) 10Urbanecm: [C: 03+2] Update logo for mx.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538834 (https://phabricator.wikimedia.org/T233670) (owner: 10Ammarpad) [11:21:33] (03Merged) 10jenkins-bot: Follow-up 8f3f0705baed: add missing namespace for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538709 (https://phabricator.wikimedia.org/T233562) (owner: 10MarcoAurelio) [11:21:43] (03Merged) 10jenkins-bot: Update logo for mx.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538834 (https://phabricator.wikimedia.org/T233670) (owner: 10Ammarpad) [11:21:49] (03CR) 10jenkins-bot: Follow-up 8f3f0705baed: add missing namespace for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538709 (https://phabricator.wikimedia.org/T233562) (owner: 10MarcoAurelio) [11:21:59] kostajh: your backport is on mwdebug1002, please test and let me know [11:22:10] Urbanecm: looking [11:22:53] Urbanecm: looks good [11:22:59] kostajh: syncing [11:23:45] RECOVERY - PHP opcache health on mw2207 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:23:46] (03CR) 10jenkins-bot: Update logo for mx.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538834 (https://phabricator.wikimedia.org/T233670) (owner: 10Ammarpad) [11:24:35] !log urbanecm@deploy1001 Synchronized 
php-1.34.0-wmf.23/extensions/GrowthExperiments/modules/homepage/ext.growthExperiments.Homepage.less: SWAT: d4c64a7: Fix broken display of mobile overlay headings (T233163) (duration: 00m 57s) [11:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:39] T233163: [mobile] Homepage modules overlays displayed cutoff - https://phabricator.wikimedia.org/T233163 [11:24:55] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&v [11:24:55] g-eqiad&var-topic=All&var-consumer_group=All [11:25:13] Urbanecm: thanks again [11:25:20] you're welcome kostajh [11:25:21] RECOVERY - PHP opcache health on mw2202 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:26:10] !log urbanecm@deploy1001 Synchronized static/images/project-logos/mxwikimedia.png: SWAT: 246b352: Update logo for mx.wikimedia (T233670) (duration: 00m 54s) [11:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:13] T233670: Change Wikimedia México logo - https://phabricator.wikimedia.org/T233670 [11:27:58] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/MassMessage/: SWAT: ba9b209: Provide deduplication info to MassMessageJob (T232379) (duration: 00m 57s) [11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:01] T232379: MassMessage problems - multiple deliveries and missing deliveries - https://phabricator.wikimedia.org/T232379 [11:30:26] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: b6947c5: Follow-up 8f3f0705baed: add 
missing namespace for eswiki (T233562) (duration: 00m 56s) [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:29] T233562: Update wgNamespaceRobotPolicies on eswiki - https://phabricator.wikimedia.org/T233562 [11:30:32] !log Purge https://en.wikipedia.org/static/images/project-logos/mxwikimedia.png (T233670) [11:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:54] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Joe) [11:33:59] (03PS1) 10Urbanecm: Set wgArticleCountMethod to any for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538862 (https://phabricator.wikimedia.org/T233673) [11:34:04] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [11:34:47] (03PS1) 10Giuseppe Lavagetto: lvs: do not check hhvm/php7 at the same time anymore. 
[puppet] - 10https://gerrit.wikimedia.org/r/538864 (https://phabricator.wikimedia.org/T219127) [11:35:52] (03CR) 10Urbanecm: [C: 03+2] Set wgArticleCountMethod to any for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538862 (https://phabricator.wikimedia.org/T233673) (owner: 10Urbanecm) [11:36:31] (03CR) 10jenkins-bot: Set wgArticleCountMethod to any for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538862 (https://phabricator.wikimedia.org/T233673) (owner: 10Urbanecm) [11:37:24] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 9eaa4f8: Set wgArticleCountMethod to any for napwikisource (T233673) (duration: 00m 56s) [11:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:27] T233673: change ArticleCountMethod to 'any' for napwikisource - https://phabricator.wikimedia.org/T233673 [11:37:32] (03CR) 10Urbanecm: [C: 03+2] Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 104nn1l2) [11:37:40] (03PS3) 10Urbanecm: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 104nn1l2) [11:37:45] (03CR) 10Urbanecm: [C: 03+2] Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 104nn1l2) [11:38:52] (03Merged) 10jenkins-bot: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 104nn1l2) [11:39:11] (03CR) 10jenkins-bot: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 104nn1l2) [11:39:21] !log Run mwscript initSiteStats.php --wiki=napwikisource --update (T233673) [11:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:35] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [11:41:53] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 11a48f8: Add support for some languages on Commons and stop support for nys on Wikidata (T230480) (duration: 00m 56s) [11:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:56] T230480: Synchronize wmgExtraLanguageNames setting between Wikidata and Commons, and remove Nyungar (nys) from Wikidata - https://phabricator.wikimedia.org/T230480 [11:42:24] (03CR) 10Volans: [C: 03+2] sre.hosts.decomission: improve logging in the console [cookbooks] - 10https://gerrit.wikimedia.org/r/538047 (owner: 10Volans) [11:43:15] !log EU SWAT done [11:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:44] (03PS1) 10KartikMistry: Use ContentTranslationEnableMT to disable MT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538867 (https://phabricator.wikimedia.org/T232986) [11:44:04] thnx Urbanecm for driving today's swat! 
[11:44:15] you're welcome mobrovac [11:44:15] (03Merged) 10jenkins-bot: sre.hosts.decomission: improve logging in the console [cookbooks] - 10https://gerrit.wikimedia.org/r/538047 (owner: 10Volans) [11:44:40] (03CR) 10jerkins-bot: [V: 04-1] Use ContentTranslationEnableMT to disable MT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538867 (https://phabricator.wikimedia.org/T232986) (owner: 10KartikMistry) [11:44:51] do mine eyes deceive me? are we exclusively using PHP7 in production now? https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-uteb2bhn3ztrqcy/ [11:45:17] (03CR) 10Jbond: "> Patch Set 4:" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [11:45:22] !log mobrovac@deploy1001 Started deploy [restbase/deploy@87eea26]: Start using the wikifeeds service for v1/feed - T170455 [11:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:25] T170455: Extract the feed endpoints from PCS into a new wikifeeds service - https://phabricator.wikimedia.org/T170455 [11:45:55] Lucas_WMDE: they are not, we are! 
[11:46:10] \o/ 🎉 [11:46:30] (03CR) 10Jbond: [C: 03+2] IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [11:46:52] now let’s get rid of all those pesky PHPCS rules that prevent us from using PHP7 features like primitive type hints ;) [11:47:03] (03PS1) 10Muehlenhoff: Fix path name [puppet] - 10https://gerrit.wikimedia.org/r/538868 [11:47:57] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@87eea26]: Start using the wikifeeds service for v1/feed - T170455 (duration: 02m 35s) [11:47:59] (03CR) 10Jbond: [C: 03+2] IPMI: add support for channel 2 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [11:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:56] (03PS3) 10Jbond: base::puppet: add preferred_serialization_format = pson [puppet] - 10https://gerrit.wikimedia.org/r/538682 (https://phabricator.wikimedia.org/T233643) [11:49:47] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:51:47] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18535/idp1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/538868 (owner: 10Muehlenhoff) [11:51:56] (03CR) 10Jbond: [C: 03+2] base::puppet: add preferred_serialization_format = pson [puppet] - 10https://gerrit.wikimedia.org/r/538682 (https://phabricator.wikimedia.org/T233643) (owner: 10Jbond) [11:52:36] (03Merged) 10jenkins-bot: IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [11:54:03] (03CR) 10jenkins-bot: IPMI: add support for 
channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [11:54:19] (03CR) 10Jbond: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/538868 (owner: 10Muehlenhoff) [11:55:53] re restbase above ^, known, deployed [11:58:05] (03PS2) 10Muehlenhoff: Fix path name [puppet] - 10https://gerrit.wikimedia.org/r/538868 [11:59:38] (03CR) 10Muehlenhoff: [C: 03+2] Fix path name [puppet] - 10https://gerrit.wikimedia.org/r/538868 (owner: 10Muehlenhoff) [12:02:46] (03CR) 10Jbond: Switch IDP logs to day-based log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538843 (owner: 10Muehlenhoff) [12:04:11] (03PS4) 10Elukey: profile::java::analytics: deploy openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) [12:06:43] (03CR) 10Jbond: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [12:08:12] (03PS1) 10Muehlenhoff: Write the service ID to the JSON service defition [puppet] - 10https://gerrit.wikimedia.org/r/538870 [12:09:18] (03CR) 10Elukey: [C: 03+2] profile::java::analytics: deploy openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/538844 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [12:11:14] (03PS1) 10DCausse: [cirrus] temp disable sanity check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538872 (https://phabricator.wikimedia.org/T233584) [12:12:07] jouncebot: now [12:12:07] No deployments scheduled for the next 3 hour(s) and 47 minute(s) [12:12:49] I need to deploy a mw-config patch, please let me know if you have objections [12:14:07] (03PS1) 10Elukey: profile::java::analytics: correct relationships with Exec [puppet] - 10https://gerrit.wikimedia.org/r/538873 [12:14:51] (03CR) 10DCausse: [C: 03+2] [cirrus] temp disable sanity check [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/538872 (https://phabricator.wikimedia.org/T233584) (owner: 10DCausse) [12:15:20] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Cmjohnson) The ticket was created with Dell. I am waiting on their approval and then for the Dell tech to coordinate a day/time to swap the board out [12:15:50] (03Merged) 10jenkins-bot: [cirrus] temp disable sanity check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538872 (https://phabricator.wikimedia.org/T233584) (owner: 10DCausse) [12:15:57] (03CR) 10Marostegui: [C: 03+1] "> @Marostegui So Gerrit is on m2 and i see in puppet mariadb" [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [12:16:06] (03CR) 10Muehlenhoff: Switch IDP logs to day-based log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538843 (owner: 10Muehlenhoff) [12:16:20] !log mobrovac@deploy1001 Started deploy [restbase/deploy@19d0f44]: REVERT (due to wikifeeds problems): Start using the wikifeeds service for v1/feed - T170455 [12:16:21] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Marostegui) Excellent! Thank you! 
[12:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:24] (03CR) 10Elukey: [C: 03+2] profile::java::analytics: correct relationships with Exec [puppet] - 10https://gerrit.wikimedia.org/r/538873 (owner: 10Elukey) [12:16:24] T170455: Extract the feed endpoints from PCS into a new wikifeeds service - https://phabricator.wikimedia.org/T170455 [12:16:28] (03CR) 10jenkins-bot: [cirrus] temp disable sanity check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538872 (https://phabricator.wikimedia.org/T233584) (owner: 10DCausse) [12:18:55] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@19d0f44]: REVERT (due to wikifeeds problems): Start using the wikifeeds service for v1/feed - T170455 (duration: 02m 35s) [12:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:01] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:20:30] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T233584 [cirrus] temp disable sanity check (duration: 00m 55s) [12:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:34] T233584: Re-adjust cirrusSearchLinksUpdate vs cirrusSearchLinksUpdatePrioritized concurrency - https://phabricator.wikimedia.org/T233584 [12:20:34] (03PS2) 10Muehlenhoff: Switch IDP logs to day-based log files [puppet] - 10https://gerrit.wikimedia.org/r/538843 [12:22:37] !log remove systemd-sysv from jessie-wikimedia/openstack-mitaka-jessie in install1002 (T231793) [12:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:40] T231793: Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 [12:26:17] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10MoritzMuehlenhoff) Once the bpo package is 
dropped from the component, after an "apt-get update", please also align the following systems to use systemd-sysv=232-25+deb9u12: ` lou... [12:28:03] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10aborrero) @JHedden I hijacked this part of the task. This is what I did for the record: I confirmed we had the right setting here: modules/aptrepo/files/reprepro-update-filter-wmcs... [12:31:11] 10Operations, 10DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Trizek-WMF) [12:31:14] 10Operations, 10DBA: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Trizek-WMF) [12:31:16] 10Operations, 10DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (10Trizek-WMF) [12:31:20] 10Operations, 10DBA, 10Patch-For-Review: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Trizek-WMF) [12:33:27] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10MoritzMuehlenhoff) >>! In T231793#5519349, @MoritzMuehlenhoff wrote: > Once the bpo package is dropped from the component, after an "apt-get update", please also align the following... 
[12:40:22] (03PS2) 10KartikMistry: Use ContentTranslationEnableMT to disable MT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538867 (https://phabricator.wikimedia.org/T232986) [12:42:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/538870 (owner: 10Muehlenhoff) [12:45:15] (03CR) 10Jbond: Switch IDP logs to day-based log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538843 (owner: 10Muehlenhoff) [12:46:30] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:47:58] (03PS2) 10Muehlenhoff: Write the service ID to the JSON service defition [puppet] - 10https://gerrit.wikimedia.org/r/538870 [12:48:42] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "aborrero@cloudcontrol2003-dev:~ $ aptitude why python-keystone" [puppet] - 10https://gerrit.wikimedia.org/r/538445 (owner: 10Andrew Bogott) [12:50:11] (03CR) 10Muehlenhoff: [C: 03+2] Write the service ID to the JSON service defition [puppet] - 10https://gerrit.wikimedia.org/r/538870 (owner: 10Muehlenhoff) [12:50:24] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:58:13] (03CR) 10Muehlenhoff: Switch IDP logs to day-based log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538843 (owner: 10Muehlenhoff) [13:00:55] (03PS1) 10Marostegui: mariadb: Remove db2036 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538874 (https://phabricator.wikimedia.org/T223885) [13:01:25] (03PS1) 10Marostegui: wmnet: Remove db2036 production entries [dns] - 10https://gerrit.wikimedia.org/r/538875 (https://phabricator.wikimedia.org/T223885) [13:02:09] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[13:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:59] (03PS3) 10Muehlenhoff: Switch IDP logs to day-based log files [puppet] - 10https://gerrit.wikimedia.org/r/538843 [13:05:12] (03CR) 10Jbond: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/538843 (owner: 10Muehlenhoff) [13:08:00] (03PS4) 10Muehlenhoff: Switch IDP logs to day-based log files [puppet] - 10https://gerrit.wikimedia.org/r/538843 [13:12:34] (03CR) 10Muehlenhoff: [C: 03+2] Switch IDP logs to day-based log files [puppet] - 10https://gerrit.wikimedia.org/r/538843 (owner: 10Muehlenhoff) [13:16:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "oh, I did not comment on the directories being already 0444 and being made 0555. that was well done. But some directories were not, hence " [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [13:18:56] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [13:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:29] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=True) [13:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:35] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `db2036.codfw.wmnet` - db2036.codfw.wmnet (**FAIL**) - Host steps raised exceptio... [13:33:55] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:08] this is us dealing with a kafka -> HDFS issue --^ [13:36:12] (03PS1) 10Volans: sre.hosts.decommission: fix name conflict [cookbooks] - 10https://gerrit.wikimedia.org/r/538881 [13:38:08] (03Abandoned) 10Andrew Bogott: Keystone/newton: install python-keystone [puppet] - 10https://gerrit.wikimedia.org/r/538445 (owner: 10Andrew Bogott) [13:38:19] ACKNOWLEDGEMENT - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. ottomata manually running a camus job https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:33] (03PS1) 10Mobrovac: RESTRouter: Add missing back-end svc URIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/538882 (https://phabricator.wikimedia.org/T223953) [13:38:38] (03PS2) 10Volans: sre.hosts.decommission: fix name conflict [cookbooks] - 10https://gerrit.wikimedia.org/r/538881 [13:38:44] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [13:40:14] <_joe_> !log uploaded conftool 1.1.4-3 to stretch-wikimedia, T233679 [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:18] T233679: dbctl doesn't always correctly translate section names in its output - https://phabricator.wikimedia.org/T233679 [13:41:35] (03PS4) 10Jbond: puppetmaster::frontend: add locale backend and promote puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/538686 (https://phabricator.wikimedia.org/T233203) [13:41:38] (03CR) 10Volans: [C: 03+2] "Merging to unblock broken cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/538881 (owner: 10Volans) [13:44:18] (03PS2) 10Marostegui: mariadb: Remove db2036 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538874 (https://phabricator.wikimedia.org/T223885) [13:45:11] <_joe_> !log installing 
the new conftool version on the cumin hosts [13:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (and identical to the two patches from yesterday)" [puppet] - 10https://gerrit.wikimedia.org/r/538686 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [13:46:52] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix name conflict [cookbooks] - 10https://gerrit.wikimedia.org/r/538881 (owner: 10Volans) [13:48:41] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: add locale backend and promote puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/538686 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [13:49:04] (03CR) 10CDanis: [C: 03+1] Translate the default section name to DEFAULT in ReadOnlyBySection too [software/conftool] - 10https://gerrit.wikimedia.org/r/538732 (https://phabricator.wikimedia.org/T233679) (owner: 10Giuseppe Lavagetto) [13:49:11] !log promote puppetmaster1003 to a real puppetmaster backend https://gerrit.wikimedia.org/r/c/operations/puppet/+/538686 [13:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:46] <_joe_> cdanis: I was thinking, we should try to set s3 read-only in codfw [13:49:53] <_joe_> without committing even [13:50:01] sure, we can just look at the diff [13:50:29] (03PS3) 10Elukey: Add ferm rules to allow bacula backups for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) [13:50:40] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [13:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:53] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [13:50:53] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [13:50:55] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:00] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `db2036.codfw.wmnet` - db2036.codfw.wmnet (**PASS**) - Downtimed host on Icinga... [13:51:12] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db2036 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538874 (https://phabricator.wikimedia.org/T223885) (owner: 10Marostegui) [13:51:23] (03PS3) 10Marostegui: mariadb: Remove db2036 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538874 (https://phabricator.wikimedia.org/T223885) [13:51:33] (03PS1) 10Muehlenhoff: Remove tmpreaper from mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) [13:52:42] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove db2036 production entries [dns] - 10https://gerrit.wikimedia.org/r/538875 (https://phabricator.wikimedia.org/T223885) (owner: 10Marostegui) [13:53:22] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Marostegui) [13:53:48] (03PS2) 10Muehlenhoff: Remove tmpreaper from mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) [13:53:53] 10Operations, 10ops-codfw, 10decommission: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Marostegui) a:05RobH→03Papaul Host ready for @Papaul to finish the last decommissioning steps [13:56:46] (03PS4) 10Elukey: Add ferm rules to allow bacula backups for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) [13:59:27] (03CR) 10Elukey: "would it be worth to extend the tmpreaper class to take a ensure parameter? So the cleanup of conf/package/etc.. 
will be handled by puppet" [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) (owner: 10Muehlenhoff) [14:00:13] (03CR) 10Elukey: [C: 03+2] Add ferm rules to allow bacula backups for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [14:00:36] (03CR) 10Muehlenhoff: "All the mw servers will get reimaged, which addresses this (and a dozen of other leftovers from HHVM)" [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) (owner: 10Muehlenhoff) [14:00:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Comments inline, bump the version in Chart.yaml as well" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/538882 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:03:32] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10Andrew) We're precariously close to upgrading to Newton, so maybe this is moot? [14:04:41] (03PS1) 10Jcrespo: mariadb backups: Include extra valid sections on checking script [puppet] - 10https://gerrit.wikimedia.org/r/538885 (https://phabricator.wikimedia.org/T231208) [14:06:41] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10MoritzMuehlenhoff) >>! In T231793#5519559, @Andrew wrote: > We're precariously close to upgrading to Newton, so maybe this is moot? But the servers will remain at Stretch, so it's... 
[14:07:29] (03PS2) 10Mobrovac: RESTRouter: Add missing back-end svc URIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/538882 (https://phabricator.wikimedia.org/T223953) [14:07:44] (03PS1) 10Elukey: Fix ferm rules for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538887 [14:07:55] PROBLEM - Check systemd state on matomo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:16] this is me --^ [14:08:33] (03CR) 10Elukey: [C: 03+2] Fix ferm rules for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538887 (owner: 10Elukey) [14:09:22] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [14:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:32] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:09:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:09:33] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:50] (03PS1) 10Jbond: add dns name for cas icinga vhost [dns] - 10https://gerrit.wikimedia.org/r/538888 [14:09:58] !log rebooting cloudvirt1021 for kernel update [14:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:22] (03CR) 10Jbond: [C: 03+2] add dns name for cas icinga vhost [dns] - 10https://gerrit.wikimedia.org/r/538888 (owner: 10Jbond) [14:11:09] RECOVERY - Check systemd state on matomo1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] RESTRouter: Add missing back-end svc URIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/538882 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:13:20] (03Merged) 10jenkins-bot: RESTRouter: Add missing back-end svc URIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/538882 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:13:25] (03CR) 10Herron: [C: 03+1] prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [14:14:31] (03CR) 10Herron: "friendly ping on this" [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [14:14:58] 10Operations, 10serviceops: Make the parsoid cluster to support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Joe) [14:15:19] (03PS1) 10Alexandros Kosiaris: Publish restrouter 0.0.4 chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/538891 [14:15:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Publish restrouter 0.0.4 chart 
version [deployment-charts] - 10https://gerrit.wikimedia.org/r/538891 (owner: 10Alexandros Kosiaris) [14:16:06] (03Merged) 10jenkins-bot: Publish restrouter 0.0.4 chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/538891 (owner: 10Alexandros Kosiaris) [14:16:19] (03CR) 10Gilles: "Thanks for this detailed research, Thiemo! I'll switch the threshold to 150." [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [14:17:00] 10Operations, 10serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (10Joe) [14:17:58] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [14:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:32] (03PS3) 10Gilles: Lower gzip threshold for SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) [14:18:52] (03PS4) 10Gilles: Lower gzip threshold for SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) [14:19:40] (03PS2) 10Gehel: wdqs: cleanup logging config after switching to new pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537642 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [14:22:06] (03PS1) 10Alexandros Kosiaris: restrouter: Skip probes for the first 60 seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/538894 (https://phabricator.wikimedia.org/T223953) [14:22:25] (03CR) 10Gehel: [C: 03+2] wdqs: cleanup logging config after switching to new pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537642 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [14:22:53] onimisionipe: ^ [14:23:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Skip probes for the first 60 seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/538894 
(https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:23:03] gehel: Thanks! [14:23:21] (03Merged) 10jenkins-bot: restrouter: Skip probes for the first 60 seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/538894 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:24:21] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [14:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:25] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [14:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:12] (03CR) 10Gehel: [C: 04-1] "Looks mostly good, minor comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [14:28:58] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[14:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:08] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [14:32:10] (03PS1) 10Jbond: SSO: add ssl cert for icinga::cas [puppet] - 10https://gerrit.wikimedia.org/r/538896 [14:32:12] (03PS1) 10Jbond: icinga::cas: use unique ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/538897 [14:33:46] (03CR) 10jerkins-bot: [V: 04-1] icinga::cas: use unique ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/538897 (owner: 10Jbond) [14:34:56] (03PS1) 10Muehlenhoff: Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 [14:35:00] (03PS1) 10Alexandros Kosiaris: restrouter: Skip using https for mwapi_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/538899 (https://phabricator.wikimedia.org/T223953) [14:35:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538896 (owner: 10Jbond) [14:36:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Skip using https for mwapi_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/538899 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:36:35] (03Merged) 10jenkins-bot: restrouter: Skip using https for mwapi_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/538899 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:38:54] (03PS2) 10Jbond: SSO: add ssl cert for icinga::cas [puppet] - 10https://gerrit.wikimedia.org/r/538896 [14:39:17] (03PS2) 10Jbond: icinga::cas: use unique ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/538897 [14:40:22] (03CR) 10Jbond: [C: 03+2] SSO: add ssl cert for icinga::cas [puppet] - 10https://gerrit.wikimedia.org/r/538896 (owner: 10Jbond) [14:40:45] !log @ helmfile [STAGING] Ran 'sync' command on 
namespace 'restrouter' for release 'staging' . [14:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538897 (owner: 10Jbond) [14:43:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538898 (owner: 10Muehlenhoff) [14:44:01] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [14:44:14] (03PS2) 10Filippo Giunchedi: prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) [14:44:22] (03PS2) 10Muehlenhoff: Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 [14:44:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:02] chaomodus: netbox_dump_run.service FYI ^^^ [14:45:22] (03PS1) 10Jbond: Revert "SSO: add ssl cert for icinga::cas" [puppet] - 10https://gerrit.wikimedia.org/r/538900 [14:46:28] (03CR) 10Jbond: [C: 03+2] Revert "SSO: add ssl cert for icinga::cas" [puppet] - 10https://gerrit.wikimedia.org/r/538900 (owner: 10Jbond) [14:47:11] (03PS3) 10Muehlenhoff: Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 [14:48:02] (03PS3) 10Cwhite: change EndpointMetrics from static to instance variable [software/service-checker] - 10https://gerrit.wikimedia.org/r/538711 [14:48:20] (03PS3) 10Filippo Giunchedi: prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) [14:48:24] (03PS1) 10Jhedden: openstack: configure haproxy for eqiad1 APIs [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) [14:48:50] (03CR) 10Cwhite: [C: 03+1] prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [14:48:52] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: tweak widespread puppet failures thresholds [puppet] - 10https://gerrit.wikimedia.org/r/538836 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [14:48:54] (03CR) 10Muehlenhoff: [C: 03+2] Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 (owner: 10Muehlenhoff) [14:49:48] jbond42, godog: I'll puppet-merge your patches along? [14:50:17] actually, no mine got stuck in FF again, sigh [14:50:35] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash pipeline crashes on non-UTF8 log messages. 
- https://phabricator.wikimedia.org/T233662 (10herron) p:05Triage→03Normal IMO mitigation in the logging pipeline is where we should focus, as there are varying applications that could produce malf... [14:50:50] (03PS1) 10Jbond: icinga::cas: add ssl cert for temp vhost [puppet] - 10https://gerrit.wikimedia.org/r/538903 [14:50:52] (03PS4) 10Muehlenhoff: Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 [14:51:01] (03PS5) 10Muehlenhoff: Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 [14:51:35] (03CR) 10Vgutierrez: [C: 03+1] icinga::cas: add ssl cert for temp vhost [puppet] - 10https://gerrit.wikimedia.org/r/538903 (owner: 10Jbond) [14:52:15] (03CR) 10Jbond: [C: 03+2] icinga::cas: add ssl cert for temp vhost [puppet] - 10https://gerrit.wikimedia.org/r/538903 (owner: 10Jbond) [14:52:18] 10Operations, 10serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (10herron) p:05Triage→03Normal [14:52:26] (03PS2) 10Jbond: icinga::cas: add ssl cert for temp vhost [puppet] - 10https://gerrit.wikimedia.org/r/538903 [14:52:29] haha I'll merge moritzm jbond42 [14:52:38] sounds good?
[14:52:41] 10Operations, 10Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (10herron) p:05Triage→03Normal [14:52:56] {{done}} [14:53:04] well, puppet-merge is running [14:53:06] (03CR) 10Muehlenhoff: [C: 03+2] Make accessStrategy configurable in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/538898 (owner: 10Muehlenhoff) [14:53:15] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18540/" [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:54:06] (03PS3) 10Jbond: icinga::cas: add ssl cert for temp vhost [puppet] - 10https://gerrit.wikimedia.org/r/538903 [14:56:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [software/service-checker] - 10https://gerrit.wikimedia.org/r/538711 (owner: 10Cwhite) [14:57:14] (03CR) 10Jbond: [C: 03+2] icinga::cas: use unique ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/538897 (owner: 10Jbond) [14:57:22] (03PS1) 10Ottomata: Disable specultative exec for camus jobs; increase map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [14:57:27] (03CR) 10Thiemo Kreuz (WMDE): Lower gzip threshold for SVGs served by MediaWiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [14:57:38] (03PS3) 10Jbond: icinga::cas: use unique ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/538897 [14:58:16] (03CR) 10jerkins-bot: [V: 04-1] Disable specultative exec for camus jobs; increase map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [14:58:45] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - 
https://phabricator.wikimedia.org/T233636 (10herron) p:05Triage→03Normal Hello! Once you have confirmed the details of the request, could you please create su... [14:59:49] (03PS2) 10Ottomata: Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [14:59:53] #/go vgutierrez [15:00:04] uh? :) [15:00:18] I don't /go without /beer [15:01:25] lol ill have to update my plugins see if they can help with that (https://github.com/irssi/scripts.irssi.org/blob/master/scripts/go.pl) [15:03:08] (03PS1) 10CDanis: check return value of $EDITOR in EditActions [software/conftool] - 10https://gerrit.wikimedia.org/r/538906 (https://phabricator.wikimedia.org/T233680) [15:06:09] 10Operations, 10ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (10Papaul) [15:06:55] (03CR) 10jerkins-bot: [V: 04-1] check return value of $EDITOR in EditActions [software/conftool] - 10https://gerrit.wikimedia.org/r/538906 (https://phabricator.wikimedia.org/T233680) (owner: 10CDanis) [15:08:29] (03PS1) 10Muehlenhoff: Set IDP access strategy for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538907 [15:09:12] 11:04:34 TypeError: reset_mock() got an unexpected keyword argument 'return_value' [15:09:22] python 3.4 😡 [15:09:35] (03CR) 10jerkins-bot: [V: 04-1] Set IDP access strategy for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538907 (owner: 10Muehlenhoff) [15:09:59] (03PS2) 10CDanis: check return value of $EDITOR in EditActions [software/conftool] - 10https://gerrit.wikimedia.org/r/538906 (https://phabricator.wikimedia.org/T233680) [15:10:19] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10Bstorm) >>! In T231793#5519565, @MoritzMuehlenhoff wrote: >>>! 
In T231793#5519559, @Andrew wrote: >> We're precariously close to upgrading to Newton, so maybe this is moot? > > But... [15:13:21] <_joe_> cdanis: yeah python 3.4 makes us sad [15:13:22] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/538907 (owner: 10Muehlenhoff) [15:15:29] (03CR) 10jerkins-bot: [V: 04-1] Set IDP access strategy for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538907 (owner: 10Muehlenhoff) [15:15:38] _joe_: volans: salve dottori! por favore, voi avete un minuto? https://gerrit.wikimedia.org/r/538906 [15:16:43] <_joe_> cdanis: you sound like a used car dealer [15:16:50] ahahahaha [15:17:39] aahahhaha [15:18:14] too formal? [15:18:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] check return value of $EDITOR in EditActions [software/conftool] - 10https://gerrit.wikimedia.org/r/538906 (https://phabricator.wikimedia.org/T233680) (owner: 10CDanis) [15:18:44] (03PS3) 10Ottomata: Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [15:18:57] (03PS2) 10Muehlenhoff: Set IDP access strategy for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538907 [15:18:58] <_joe_> a weird mix of glorifying one's titles and being colloquial [15:20:52] ahahaha [15:21:01] (03PS4) 10Ottomata: Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [15:21:40] grazie mille [15:21:43] (03CR) 10jerkins-bot: [V: 04-1] Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [15:21:47] <_joe_> cdanis: you sound like a used car dealer <-- ROTFL [15:22:05] <_joe_> he sold me a used patchset instead [15:22:11] (03PS1) 10Ayounsi: Fastnetmon notify, update Turnilo URL [puppet] 
- 10https://gerrit.wikimedia.org/r/538909 (https://phabricator.wikimedia.org/T229682) [15:22:17] _joe_: and you sold me a used confctl! [15:22:20] (03PS5) 10Ottomata: Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [15:22:33] <_joe_> cdanis: fair enough, you got the short end of the stick [15:23:08] (03CR) 10jerkins-bot: [V: 04-1] Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [15:26:26] (03CR) 10Ayounsi: [C: 03+2] Fastnetmon notify, update Turnilo URL [puppet] - 10https://gerrit.wikimedia.org/r/538909 (https://phabricator.wikimedia.org/T229682) (owner: 10Ayounsi) [15:26:36] (03PS2) 10Ayounsi: Fastnetmon notify, update Turnilo URL [puppet] - 10https://gerrit.wikimedia.org/r/538909 (https://phabricator.wikimedia.org/T229682) [15:27:32] (03CR) 10CDanis: [C: 03+2] check return value of $EDITOR in EditActions [software/conftool] - 10https://gerrit.wikimedia.org/r/538906 (https://phabricator.wikimedia.org/T233680) (owner: 10CDanis) [15:27:53] (03CR) 10Gehel: [C: 04-1] "see comments inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [15:28:59] (03PS1) 10Jbond: idp: update service regex [puppet] - 10https://gerrit.wikimedia.org/r/538910 [15:29:09] (03CR) 10BBlack: [C: 03+1] lvs: do not check hhvm/php7 at the same time anymore. 
[puppet] - 10https://gerrit.wikimedia.org/r/538864 (https://phabricator.wikimedia.org/T219127) (owner: 10Giuseppe Lavagetto) [15:30:37] (03PS6) 10Ottomata: Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [15:30:52] (03Merged) 10jenkins-bot: check return value of $EDITOR in EditActions [software/conftool] - 10https://gerrit.wikimedia.org/r/538906 (https://phabricator.wikimedia.org/T233680) (owner: 10CDanis) [15:31:23] (03CR) 10Ottomata: [C: 03+2] Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [15:31:33] (03PS7) 10Ottomata: Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) [15:31:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Disable specultative exec for camus; bump map.tasks to 24 for analytics events [puppet] - 10https://gerrit.wikimedia.org/r/538905 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [15:31:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538910 (owner: 10Jbond) [15:35:24] (03CR) 10Andrew Bogott: [C: 03+1] "This looks good -- the haproxy profile is nicely organized :)" [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:37:58] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:04] (03PS1) 10Urbanecm: Typo: Add a slash to szlwiki for wgMinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) [15:40:21] (03CR) 10Jbond: [C: 03+2] idp: update service regex [puppet] - 
10https://gerrit.wikimedia.org/r/538910 (owner: 10Jbond) [15:40:23] (03PS2) 10Jbond: idp: update service regex [puppet] - 10https://gerrit.wikimedia.org/r/538910 [15:42:07] (03CR) 10Jhedden: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:44:13] (03CR) 10Andrew Bogott: [C: 03+1] "> There are no changes needed for the catalog or endpoint definitions." [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:44:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:15] ACKNOWLEDGEMENT - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Cas Rusnov Known issue, being debugged. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:34] ACKNOWLEDGEMENT - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Cas Rusnov Known issue, being debugged. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:34] (03CR) 10Jhedden: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:56:09] (03CR) 10Jbond: ipmi: use run instead of checkouput (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [15:57:02] (03CR) 10Ayounsi: "> Patch Set 5:" (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:57:39] (03PS6) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) [15:57:41] (03CR) 10Jbond: ipmi: use run instead of checkouput (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [15:59:46] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1600). Please do the needful. [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:02:06] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:02:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:02:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:02:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:03:08] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:03:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:03:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text 
site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:03:44] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:03:55] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) [16:04:24] (03CR) 10Jdlrobson: Typo: Add a slash to szlwiki for wgMinervaCustomLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [16:04:44] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:04:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:04:55] (03PS7) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) [16:05:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:05:45] 150 rps of 50x [16:05:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:06:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:06:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:06:50] cp1087 maxed out on connections to appserver backends (both api and normal) [16:08:35] not sure why, could be misbehavior at appserver or at traffic layer [16:09:56] (03CR) 10Volans: "as discussed on IRC" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [16:12:58] cdanis: I know I already asked but what dashboard did you check for that? 
[16:14:24] ah failed fetches yes [16:14:33] elukey: I looked at these three: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=5 and https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X and [16:14:34] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-1h&to=now [16:14:54] yes yes sorry I remembered only 2 mins after asking :D [16:14:56] thanks :) [16:15:29] np! [16:15:48] cp1083 has a big mailbox lag but probably not causing any issue now (and it will be restarted at some point) [16:16:01] interestingly there's just as much associated with cp1077 in the logstash 50x console [16:16:07] but it didn't look so bad in failed fetches [16:17:35] (03PS1) 10Jbond: apereo_cas update samlValidate [puppet] - 10https://gerrit.wikimedia.org/r/538914 [16:25:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538914 (owner: 10Jbond) [16:28:52] (03PS1) 10CDanis: release 1.2.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/538915 [16:30:29] (03PS2) 10Ayounsi: Kafkatee, mask default (package provided) systemd service [puppet] - 10https://gerrit.wikimedia.org/r/536645 [16:32:02] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18541/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/536645 (owner: 10Ayounsi) [16:38:01] (03PS1) 10Jbond: icnga::cas lower http::site priority and correct error path [puppet] - 10https://gerrit.wikimedia.org/r/538918 [16:38:27] (03CR) 10CRusnov: [C: 03+2] netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [16:38:52] (03CR) 10Jbond: [C: 03+2] apereo_cas update samlValidate [puppet] 
- 10https://gerrit.wikimedia.org/r/538914 (owner: 10Jbond) [16:39:04] (03PS2) 10Jbond: apereo_cas update samlValidate [puppet] - 10https://gerrit.wikimedia.org/r/538914 [16:39:06] (03CR) 10jerkins-bot: [V: 04-1] icnga::cas lower http::site priority and correct error path [puppet] - 10https://gerrit.wikimedia.org/r/538918 (owner: 10Jbond) [16:39:20] (03CR) 10CDanis: [C: 03+1] icnga::cas lower http::site priority and correct error path [puppet] - 10https://gerrit.wikimedia.org/r/538918 (owner: 10Jbond) [16:40:02] 12:38:53 Syntax error at 'priority' at /srv/workspace/puppet/modules/icinga/manifests/cas.pp:31:9 on node foobar.example.com [16:40:13] oh lol jbond42 you forgot a comma [16:40:16] on line 30 [16:41:27] (03PS2) 10Jbond: icnga::cas lower http::site priority and correct error path [puppet] - 10https://gerrit.wikimedia.org/r/538918 [16:41:30] yep just got it cheers [16:41:58] (03PS3) 10Jbond: icnga::cas lower http::site priority and correct error path [puppet] - 10https://gerrit.wikimedia.org/r/538918 [16:42:05] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10Papaul) [16:43:17] (03CR) 10Jbond: [C: 03+2] icnga::cas lower http::site priority and correct error path [puppet] - 10https://gerrit.wikimedia.org/r/538918 (owner: 10Jbond) [16:44:42] (03PS6) 10CRusnov: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) [16:56:20] (03CR) 10jenkins-bot: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [16:56:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 57.71 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:58:24] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.26 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:58:59] re: traffic drop -- looks like eqsin had a small spike of traffic between 16:20 and 16:35, which went away, and is now going off-peak as normal [17:00:04] cscott, arlolra, subbu, halfak, and accraze: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1700). Please do the needful. [17:03:59] (03CR) 10Jdlrobson: [C: 04-1] Typo: Add a slash to szlwiki for wgMinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:06:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:00] (03PS1) 10Paladox: Gerrit: Support java 8 under buster [software/gerrit] - 10https://gerrit.wikimedia.org/r/538924 [17:08:15] (03Abandoned) 10Paladox: Gerrit: Support java 8 under buster [software/gerrit] - 10https://gerrit.wikimedia.org/r/538924 (owner: 10Paladox) [17:08:32] (03PS1) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [17:15:17] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia-bd-regional mailing list - https://phabricator.wikimedia.org/T233742 (10NahidSultan) [17:15:27] (03PS2) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [17:16:39] (03CR) 10Paladox: Gerrit: Support java 8 under buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) 
[17:17:13] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [17:18:53] 10Operations, 10ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (10Papaul) [17:26:24] (03PS1) 10Jdlrobson: Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 [17:27:04] (03PS2) 10Jdlrobson: Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) [17:36:54] (03PS1) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 [17:37:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:39:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:39:39] (03CR) 10jerkins-bot: [V: 04-1] logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (owner: 10Herron) [17:40:53] (03PS2) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 [17:41:49] (03PS2) 10Urbanecm: Typo: Add a slash to szlwiki for wgMinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) [17:44:50] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) 05Open→03Resolved https://gerrit.wikimedia.org/r/c/operations/puppet/+/509140 was the last thing to do and it has been merged some time ago. [17:44:53] 10Operations, 10netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (10ayounsi) [17:46:38] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:09] (03CR) 10Krinkle: logstash: throttle duplicate normalized_message with level:ERR* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (owner: 10Herron) [17:49:11] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10Mathew.onipe) We should talk to elastic to see how we can move this forward. Current... [17:54:25] (03CR) 10Jdlrobson: [C: 04-1] "Talking to Alex hollender, the SVG fill color should not be pure black - it should be #54595D." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:54:49] 10Operations, 10netops: BGP session down for AS4739 on cr4-ulsfo - https://phabricator.wikimedia.org/T230005 (10ayounsi) 05Open→03Resolved Sessions are now established. Thanks! [17:55:33] (03CR) 10Arturo Borrero Gonzalez: "Some comments inline. In general, I think this is great work! Thanks!" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [17:58:35] (03CR) 10Arturo Borrero Gonzalez: "Awesome work! Thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1800) [18:00:28] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests, 10User-Elukey: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10ayounsi) [18:00:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:01:35] 10Operations, 10netbox: Netbox racks consistency report - https://phabricator.wikimedia.org/T212878 (10ayounsi) [18:04:47] (03PS3) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [18:05:44] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:08:25] (03PS2) 1020after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) [18:08:45] (03CR) 10Bstorm: toolforge-kubernetes: restructure pod security policies (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [18:09:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Refresh switch ports descriptions for recently renamed cloud servers - https://phabricator.wikimedia.org/T201444 (10ayounsi) [18:10:51] (03Abandoned) 10Andrew Bogott: neutron-l3-agent: forward our routing hacks to Newton [puppet] - 10https://gerrit.wikimedia.org/r/538707 (https://phabricator.wikimedia.org/T233665) (owner: 10Andrew Bogott) [18:11:59] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:13:11] (03PS1) 10Andrew Bogott: Cloudvirt1021: add to normal scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/538937 [18:15:03] 10Operations, 10netops: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689 (10ayounsi) 05Open→03Resolved No new updates since March 2018, feel free to reopen if the issue is still there. [18:15:34] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:47] (03PS1) 10Hashar: contint: ensure docker-ce is available [puppet] - 10https://gerrit.wikimedia.org/r/538938 [18:19:19] (03CR) 10Masumrezarock100: "I thought Mediawiki software renders SVG files as PNG." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [18:33:02] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:42] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:48] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1021: add to normal scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/538937 (owner: 10Andrew Bogott) [18:45:55] 10Operations, 10Cloud-Services, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671 (10brion) 05Open→03Resolved Haven't encountered this in a while; Comcast etc may have improved the intermediate routes. Closing out as resol... [18:48:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:36] (03CR) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (owner: 10Herron) [18:49:04] (03CR) 10Dzahn: "thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:52:56] 10Operations, 10Core Platform Team, 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10daniel) [18:55:21] (03CR) 10Jdlrobson: "> I thought Mediawiki software renders SVG files as PNG." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [19:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train - American version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T1900). [19:04:28] (03CR) 10Krinkle: "Should this remove the faulty file as well to avoid confusion?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [19:13:14] (03PS3) 10Jdlrobson: Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) [19:13:28] (03PS4) 10Jdlrobson: Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) [19:15:21] (03CR) 10Masumrezarock100: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [19:19:46] (03CR) 10Krinkle: "(I'd normally worry about HTML cache, but the current config made it point at https://szl.m.wikipedia.org/wiki/static/images/mobile/copyri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [19:19:50] (03CR) 10Krinkle: [C: 03+1] Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [19:20:56] !log branching 1.34.0-wmf.24 refs T220749 [19:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:00] T220749: 1.34.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T220749 [19:24:37] (03CR) 10Masumrezarock100: "I could change colors of only two fonts by manually updating the code. 
https://upload.wikimedia.org/wikipedia/commons/archive/b/bd/2019092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [19:27:30] (03CR) 10Urbanecm: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [19:29:53] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) [19:32:59] (03PS2) 10CDanis: release 1.2.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/538915 [19:33:24] (03PS3) 10CDanis: release 1.2.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/538915 [19:34:42] (03PS4) 10CDanis: release 1.2.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/538915 [19:34:50] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:06] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) a:03ayounsi Keeping the default damping settings (per the doc) here is what I think we should push to our routers: `lang=diff [edit protocols bgp group IX4] + damping; [edit protocols... 
[19:38:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:08] !log disable asw2-d-eqiad:ge-5/0/41 excessive flapping [19:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:30] (03PS4) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [19:42:48] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org, 10Wikimedia-production-error: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Krinkle) Based on the stacktrace being `/srv/mediawiki/w/…`, I guess this isn't product... [19:42:59] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Krinkle) [19:50:34] (03CR) 10CDanis: [C: 03+2] release 1.2.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/538915 (owner: 10CDanis) [19:53:31] (03Merged) 10jenkins-bot: release 1.2.0-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/538915 (owner: 10CDanis) [19:59:28] zuul seems to have something stuck in its throat [20:00:09] not a lot of jobs in the mediawiki queue but they've been waiting for up to 2.5 hrs [20:07:01] eww [20:07:16] ejegg: I'll take a look. gerrit is running very slowly for me, not sure if it's related [20:09:12] thanks! [20:09:52] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10hashar) + #cloud-services-team for the hosts: cloudvirt[1001-1009,1012-1013,1019-1020].eqiad.wmnet cloudvirt1014 has already been updated and cloudvirt1013 has the... [20:10:17] (03CR) 10Jdlrobson: [C: 03+1] "Is anyone free to deploy this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [20:13:19] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10Varnent) 05Open→03Resolved Appears they got it working - thank you!! :) [20:17:21] (03CR) 10Masumrezarock100: "> Could you change the whole SVG, please?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [20:19:10] !log restarting gerrit due to unreasonably high garbage collection times and sluggish performance in general. [20:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:19] (03CR) 10Nuria: [C: 03+1] Rsync analytics mediawiki history dumps to dumps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [20:21:51] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:52] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:25:28] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27490 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [20:33:27] (03PS1) 10Ottomata: Temporarily disabling camus import of api-request [puppet] - 10https://gerrit.wikimedia.org/r/538966 (https://phabricator.wikimedia.org/T233718) [20:34:50] RECOVERY - ensure kvm processes are running on cloudvirt1024 is OK: PROCS OK: 1 process with regex 
args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:35:06] (03CR) 10Ottomata: [C: 03+2] Temporarily disabling camus import of api-request [puppet] - 10https://gerrit.wikimedia.org/r/538966 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [20:37:17] (03PS1) 10Ottomata: Temporiarily disable monitoring of camus api-request import [puppet] - 10https://gerrit.wikimedia.org/r/538968 (https://phabricator.wikimedia.org/T233718) [20:37:28] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:43] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Temporiarily disable monitoring of camus api-request import [puppet] - 10https://gerrit.wikimedia.org/r/538968 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [20:37:54] (03PS2) 10Jhedden: openstack: configure haproxy for eqiad1 APIs [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) [20:37:56] (03CR) 10Mforns: Rsync analytics mediawiki history dumps to dumps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [20:41:28] (03CR) 10Jhedden: [C: 03+2] openstack: configure haproxy for eqiad1 APIs [puppet] - 10https://gerrit.wikimedia.org/r/538901 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [20:45:32] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:58] (03CR) 10Ottomata: [C: 03+1] "+1 with one comment :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [20:47:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:28] ye [20:54:45] (03CR) 10Ammarpad: "I added it to next SWAT window (< 2 hours) in case it's not reverted by then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [21:02:43] (03PS1) 10Jhedden: openstack: Update backend API ports for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/538971 (https://phabricator.wikimedia.org/T223907) [21:04:51] (03CR) 10jerkins-bot: [V: 04-1] openstack: Update backend API ports for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/538971 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [21:07:07] (03PS2) 10Jhedden: openstack: Update backend API ports for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/538971 (https://phabricator.wikimedia.org/T223907) [21:07:40] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556151793(78.66666666666667gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [21:09:26] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [21:09:53] (03CR) 10Jhedden: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/18544/" [puppet] - 10https://gerrit.wikimedia.org/r/538971 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [21:11:08] 10Operations, 10Traffic: 
rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10Dzahn) These hosts showed up as special cases as part of T147074 because commands over remote IPMI did not work. ssh login works. Feels like IPMI over LAN might be disabled in BIOS. [21:17:34] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.16e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:19:06] !log ganeti4001 - racadm racreset - attempt to fix IPMI [21:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:29] 10Operations, 10ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (10Papaul) [21:24:51] 10Operations, 10ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (10Papaul) 05Open→03Resolved This is complete [21:31:14] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) [21:37:40] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:01] (03CR) 1020after4: [C: 03+1] Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [21:38:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:25] 10Operations, 10netops: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10ayounsi) FYI I tested the policy from the description successfully during T226422 and T226424. 
[21:44:00] (03PS1) 1020after4: testwikis wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538975 [21:44:02] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538975 (owner: 1020after4) [21:45:02] (03Merged) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538975 (owner: 1020after4) [21:45:11] (03CR) 1020after4: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/18546/" [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [21:45:20] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.34.0-wmf.24 refs T220749 [21:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:24] T220749: 1.34.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T220749 [21:45:42] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:16] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:29] (03CR) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538975 (owner: 1020after4) [21:53:02] !log restbase1024 - enable IPMI over LAN which wasn't working before [21:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:08] 10Operations, 10netops, 10Wikimedia-Incident: asw2-d2-eqiad crash - https://phabricator.wikimedia.org/T233645 (10ayounsi) The logs rolled over the weekend... And neither the ones shipped to central logging nor the RSI had any useful information according to JTAC. 
From there we can: 1/ replace the switch wi... [22:08:05] (03PS1) 10Cwhite: hiera: update ores to pass statsd through statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) [22:12:29] 10Operations, 10netops, 10observability: Netflow Collector Project - https://phabricator.wikimedia.org/T83119 (10ayounsi) 05Open→03Resolved a:03ayounsi https://wikitech.wikimedia.org/wiki/Netflow [22:16:43] (03PS1) 10Alex Monk: Fix maintain_dbusers class lookup [puppet] - 10https://gerrit.wikimedia.org/r/538979 [22:20:40] (03CR) 10Jhedden: [C: 03+1] "Nice catch!" [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [22:25:59] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.34.0-wmf.24 refs T220749 (duration: 40m 38s) [22:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:04] T220749: 1.34.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T220749 [22:32:35] (03CR) 10Alex Monk: "(https://puppet.com/docs/puppet/4.10/hiera_use_function.html was the docs for it under 4.x, this one is 4.10 as it doesn't look like they " [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [22:35:47] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) a:05jijiki→03RobH @Robh I think all this needs is a quick check if switch port label and physical label are mw1298 and if it is this can be closed. This is... [22:36:05] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) [22:36:34] (03CR) 10Andrew Bogott: [C: 03+1] "I'm convinced that this is correct (and confirmed that it looks like a no-op to the pcc). 
Might be of interest to jbond to poke at things" [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [22:38:50] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) So to be clear. Despite what the checkboxes say we do NOT need to disable puppet on this host and take it down. It is only about confirming this host, WMF693... [22:39:22] 10Operations, 10ops-eqiad: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10RobH) a:05RobH→03Cmjohnson >>! In T215332#5520988, @Dzahn wrote: > @Robh I think all this needs is a quick check if switch port label and physical label are mw1298 and if it is this can be closed. This is back... [22:39:49] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Dwisehaupt) After updating our stretch boot environment, I was able to boot and install the OS on this host. Additionally it has been tied into puppet and gotten its updates. Verif... [22:40:13] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Dwisehaupt) [22:44:49] (03CR) 10Cwhite: "Tested on deployment-prep and it doesn't appear to break things." 
[puppet] - 10https://gerrit.wikimedia.org/r/538642 (https://phabricator.wikimedia.org/T233662) (owner: 10Cwhite) [22:47:51] 10Operations, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Dzahn) [22:48:00] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [22:48:57] (03PS2) 10MaxSem: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) [22:52:48] PROBLEM - Check systemd state on labsdb1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:12] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [22:54:20] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190924T2300). [23:00:04] Ammarpad: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:02:31] RECOVERY - Check systemd state on labsdb1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:22] (03PS3) 10MaxSem: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) [23:11:51] (03CR) 10Paladox: [C: 03+1] gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [23:13:13] (03CR) 10Paladox: [C: 03+1] gerrit: get LDAP server from ldap_config, use ro server, simplify (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [23:21:10] (03CR) 10Dzahn: "check out hieradata/common.yaml starting line 1505 and hieradata/eqiad.yaml starting line 43. copy those to project Hiera in Horizon or ma" [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [23:22:58] (03PS14) 10Dzahn: gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [23:25:11] (03CR) 10Krinkle: [C: 03+1] Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [23:26:29] (03CR) 10Dzahn: [C: 03+2] gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [23:28:58] (03PS1) 10Bstorm: wiki replicas: depool lasbdb1011 just in case of issues [puppet] - 10https://gerrit.wikimedia.org/r/538987 (https://phabricator.wikimedia.org/T233766) [23:30:01] !log switching LDAP servers used by Gerrit to readonly replicas. stop using so called "labs" config for LDAP backend. 
[23:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:02] mutante login seems to still work on gerrit2001 [23:31:05] (03PS2) 10Bstorm: wiki replicas: depool lasbdb1011 just in case of issues [puppet] - 10https://gerrit.wikimedia.org/r/538987 (https://phabricator.wikimedia.org/T233766) [23:31:19] paladox: yea, but it did not switch without restart [23:31:21] so far [23:31:25] oh [23:31:27] watching it with tcpdump [23:32:03] talks to seaborgium [23:32:18] (03CR) 10Bstorm: [C: 03+2] wiki replicas: depool lasbdb1011 just in case of issues [puppet] - 10https://gerrit.wikimedia.org/r/538987 (https://phabricator.wikimedia.org/T233766) (owner: 10Bstorm) [23:33:49] !log gerrit2001 - restarting gerrit service [23:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:14] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [23:35:36] !log wiki-replicas depooled labsdb1011 [23:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:49] paladox: on 2001 i can now see it use ldap-ro [23:35:55] it does need service restart [23:36:09] great :) [23:36:17] was that you? [23:37:23] yes [23:37:26] i logged in :) [23:37:30] good [23:38:05] !log gerrit service restart to switch LDAP backend [23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:31] back for me [23:39:40] logs out and back in [23:40:20] works for me. confirmed using the new ldap server name [23:40:35] :) [23:42:07] (03CR) 10Dzahn: "this needed a service restart, first on gerrit2001 then on cobalt. confirmed both are using ldap-ro servers now. (tcpdump dst port 636). N" [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn)