[01:49:32] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:49:11] (03PS1) 10Marostegui: sanitarium_multiinstance.my: Expire logs days set to 30 [puppet] - 10https://gerrit.wikimedia.org/r/595358 (https://phabricator.wikimedia.org/T249188) [04:54:12] (03PS2) 10Marostegui: sanitarium_multiinstance.my: Expire logs days set to 30 [puppet] - 10https://gerrit.wikimedia.org/r/595358 (https://phabricator.wikimedia.org/T249188) [04:56:24] (03CR) 10Marostegui: [C: 03+2] sanitarium_multiinstance.my: Expire logs days set to 30 [puppet] - 10https://gerrit.wikimedia.org/r/595358 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [05:10:56] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:47] good :) [05:14:19] (03PS1) 10Elukey: profile::analytics::refinery:job::refine: bump event refine exec memory to 4g [puppet] - 10https://gerrit.wikimedia.org/r/595359 [05:15:16] 10Operations, 10ops-eqiad: dumpsdata1001 power supply failure - https://phabricator.wikimedia.org/T252361 (10Marostegui) [05:15:45] ACKNOWLEDGEMENT - IPMI Sensor Status on dumpsdata1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Marostegui T252361 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [05:15:51] 10Operations, 10ops-eqiad: dumpsdata1001 power supply failure - https://phabricator.wikimedia.org/T252361 (10Marostegui) p:05Triage→03Medium [05:20:12] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/22441/an-launcher1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/595359 (owner: 10Elukey) [05:45:17] 10Operations, 10SRE-Access-Requests: Revoke 
production access for jmorgan - https://phabricator.wikimedia.org/T251560 (10elukey) [05:56:50] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:32] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:53] !log restart wikimedia-discovery-golden on stat1007 - apparently killed by no memory left to allocate on the system [06:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:00] very weird, there is memory now [06:05:45] but there was not a lot some mins ago [06:05:46] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=4&fullscreen&orgId=1&refresh=5m&var-server=stat1007&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics [06:05:55] also the script keeps logging errors, opened a task [06:05:57] sigh [06:21:05] (03CR) 10QEDK: [C: 03+1] "/pokes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [06:50:05] (03PS1) 10Jcrespo: Revert "backup1002: Update NIC address for card with link" [puppet] - 10https://gerrit.wikimedia.org/r/595464 (https://phabricator.wikimedia.org/T250816) [06:51:07] (03PS2) 10Jcrespo: Revert "backup1002: Update NIC address for card with link" [puppet] - 10https://gerrit.wikimedia.org/r/595464 (https://phabricator.wikimedia.org/T250816) [06:52:37] (03CR) 10Jcrespo: [C: 03+2] Revert "backup1002: Update NIC address for card with link" [puppet] - 10https://gerrit.wikimedia.org/r/595464 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [07:01:36] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10jcrespo) a:05Cmjohnson→03jcrespo 
Thanks, that made it boot. Thank you! Now I am only blocked by pending update of buster installer to la... [07:05:04] ^ moritz jbond42, AndrewB and I are blocked on reimaging to buster by update to latest point release for buster installer, but I don't want to do it on my own without your knowledge [07:08:42] <_joe_> elukey: I'm commenting out the ferm rule now [07:08:49] ack [07:08:59] <_joe_> !log dropping requests to mc1020 via a firewall rule T251378 [07:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:03] T251378: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 [07:10:25] <_joe_> elukey: heh what I did wasn't ok, those machines have a default accept policy AFAICS [07:11:22] _joe_ so we should explicitly block traffic towards 11211 [07:11:39] <_joe_> yeah doing so [07:11:53] <_joe_> I'm going with DROP [07:19:00] PROBLEM - Memcached on mc1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [07:19:08] <_joe_> expected ^^ [07:20:05] so mc-gp1002 took over the load [07:21:05] <_joe_> all went there? 
[07:21:12] <_joe_> that's vaguely surprising [07:21:12] !log updated buster netboot images to 10.4 (updated to latest point release) [07:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:23] <_joe_> or not [07:23:47] get hit ratio for gp1002 is now 0.78, that looks good [07:25:50] <_joe_> keep also in mind we're capping TTL at 10 minute [07:25:54] <_joe_> *minutes [07:26:56] yep [07:34:39] (03PS1) 10RhinosF1: Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 [07:36:15] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['backup1002.eqiad.wmnet'] ` The log can be found in `/var/... [07:38:51] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (10WMDE-leszek) Thanks everyone! @Dzahn Yes, we need to update the staff page. Things are a bit slow with updating in the recent times. [07:41:02] (03PS2) 10RhinosF1: Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) [07:41:55] (03CR) 10jerkins-bot: [V: 04-1] Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) (owner: 10RhinosF1) [07:48:14] (03PS3) 10RhinosF1: Create Gapura (Portal) namespace for jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) [07:48:30] (03CR) 10RhinosF1: "> Main test build failed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595469 (https://phabricator.wikimedia.org/T252343) (owner: 10RhinosF1) [07:54:51] !log installing squid security updates [07:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:18] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10ema) >>! In T237993#6111924, @fgiunchedi wrote: > Rephrasing to make sure I understand: the major problem is making sure that the mapping from `struct kafka.Stats` to... [08:03:14] RECOVERY - Check systemd state on ms-fe2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:26] (03PS1) 10Ema: Release 0.6 [software/atskafka] - 10https://gerrit.wikimedia.org/r/595472 (https://phabricator.wikimedia.org/T237993) [08:08:02] (03PS1) 10Filippo Giunchedi: site: assign thanos::frontend to thanos-fe2* [puppet] - 10https://gerrit.wikimedia.org/r/595473 (https://phabricator.wikimedia.org/T233956) [08:09:24] (03CR) 10Filippo Giunchedi: [C: 03+2] site: assign thanos::frontend to thanos-fe2* [puppet] - 10https://gerrit.wikimedia.org/r/595473 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [08:16:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos_query site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:17:25] (03PS1) 10Dzahn: site: fix typo in role for new restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/595475 (https://phabricator.wikimedia.org/T241784) [08:17:51] (03PS1) 10Filippo Giunchedi: hieradata: add Thanos cluster and thanos::frontend role data [puppet] - 10https://gerrit.wikimedia.org/r/595476 [08:18:13] (03CR) 10jerkins-bot: [V: 04-1] hieradata: add Thanos cluster and thanos::frontend role data [puppet] - 
10https://gerrit.wikimedia.org/r/595476 (owner: 10Filippo Giunchedi) [08:18:45] (03CR) 10Dzahn: [C: 03+2] "This should be why the wmf-auto-reimage failed." [puppet] - 10https://gerrit.wikimedia.org/r/595475 (https://phabricator.wikimedia.org/T241784) (owner: 10Dzahn) [08:19:01] (03PS1) 10Muehlenhoff: Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 [08:19:25] (03CR) 10jerkins-bot: [V: 04-1] Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 (owner: 10Muehlenhoff) [08:20:33] (03PS1) 10KartikMistry: Revert limit adjustment for Chinese translations with ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595478 (https://phabricator.wikimedia.org/T252371) [08:20:44] (03PS2) 10Muehlenhoff: Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 [08:20:48] (03CR) 10Ema: [C: 03+2] Release 0.6 [software/atskafka] - 10https://gerrit.wikimedia.org/r/595472 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [08:20:53] (03PS2) 10Filippo Giunchedi: hieradata: add Thanos cluster and thanos::frontend role data [puppet] - 10https://gerrit.wikimedia.org/r/595476 [08:21:13] (03CR) 10jerkins-bot: [V: 04-1] Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 (owner: 10Muehlenhoff) [08:21:50] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add Thanos cluster and thanos::frontend role data [puppet] - 10https://gerrit.wikimedia.org/r/595476 (owner: 10Filippo Giunchedi) [08:22:22] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Dzahn) Fixed typo above, this should have been why the reimage script above failed. Y... [08:24:15] <_joe_> elukey: let's recover? 
[08:24:28] (03PS3) 10Muehlenhoff: Remove jessie support from squid classes [puppet] - 10https://gerrit.wikimedia.org/r/595477 [08:24:34] <_joe_> or do you want to gather more data? [08:25:08] _joe_ nope I am fine [08:27:52] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) [08:30:03] <_joe_> !log removing the iptables DROP rule on mc1020 T251378 [08:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:06] T251378: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 [08:30:56] RECOVERY - Memcached on mc1020 is OK: TCP OK - 0.000 second response time on 10.64.0.81 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [08:30:57] !log cp3050: upgrade atskafka to 0.6 T237993 [08:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:59] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [08:32:35] !log rsynced data from contint1001 to contint2001 - paths per T224591#6039192 for the migration later today [08:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:38] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [08:36:11] moritzm: is there a way to manually tell mirrors.wmf that it's a good time to sync? [08:36:48] mm. actually it _just_ sync'd now. let's see if that fixes things. 
[08:37:01] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['backup1002.eqiad.wmnet'] ` [08:37:20] kormat: [sodium:~] $ sudo -u mirror ftpsync [08:37:24] for next time [08:37:43] and yea, it normally fixes it [08:39:18] 10Operations, 10serviceops: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (10Joe) 05Open→03Resolved a:03Joe We ran this test, and it passed with flying colors: - A transient peak of memcached errors, lasting less than 1 minute - The g... [08:39:23] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10Joe) [08:40:38] !log rsyncing /var/lib/jenkins from contint1001 to contint2001 with --delete [08:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:09] (03PS1) 10Paladox: phabricator: Disable/Enable dumps using hiera [puppet] - 10https://gerrit.wikimedia.org/r/595479 [08:42:16] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Disable/Enable dumps using hiera [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [08:44:50] (03PS2) 10Paladox: phabricator: Disable/Enable dumps using hiera [puppet] - 10https://gerrit.wikimedia.org/r/595479 [08:44:56] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Disable/Enable dumps using hiera [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [08:44:58] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [08:46:06] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [08:46:46] !log bounce ferm on kubernetes1007 to resolve icinga UNKNOWN [08:46:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:11] hurm, no. still no joy [08:52:08] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [08:55:12] (03PS1) 10Vgutierrez: Release 8.0.7-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595484 (https://phabricator.wikimedia.org/T249335) [08:55:25] !log contint2001 stopping zuul-merger , permission problem [08:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:56] (03PS2) 10Vgutierrez: Release 8.0.7-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595484 (https://phabricator.wikimedia.org/T249335) [08:58:47] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [08:58:51] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [08:59:15] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/595479 (owner: 10Paladox) [09:00:39] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:22] ^^ I have stopped it [09:05:51] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10Kormat) [09:05:58] !log contint2001 - mkdir /srv/jenkins [09:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:32] !log contint1001 - rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/ (T224591) [09:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:35] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [09:11:40] (03PS1) 10Paladox: phabricator: Change mail alias only on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/595488 [09:13:09] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10Kormat) From d-i syslog: ` May 11 09:11:07 anna[5770]: WARNING **: no packages matching running kernel 4.19.0-8-amd64 in archive ` [09:13:15] (03PS2) 10Paladox: phabricator: Change mail alias only on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/595488 [09:13:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:14:35] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10Kormat) It looks like maybe initrd.gz got updated, but not the kernel? ` root@puppetmaster1001:/var/lib/puppet/volatile/tftpboot/buster-installer/debian-installer/amd64# ls -l total 119816 -rw-r--r... 
[09:22:31] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10MoritzMuehlenhoff) This seems caused by the separation of apt1001 and the new buster-based install servers; puppet updates /srv/tftpboot on install1003/2003, but probably the reimage by Kormat rece... [09:22:43] (03PS1) 10Filippo Giunchedi: wmnet: allocate thanos-query.svc addresses [dns] - 10https://gerrit.wikimedia.org/r/595489 (https://phabricator.wikimedia.org/T233956) [09:25:02] (03PS1) 10Filippo Giunchedi: conftool-data: add thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/595491 (https://phabricator.wikimedia.org/T233956) [09:29:26] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [09:29:58] (03PS2) 10Gehel: icinga: Add rkemper to wdqs-admins, sms [puppet] - 10https://gerrit.wikimedia.org/r/595059 (https://phabricator.wikimedia.org/T251572) (owner: 10Ryan Kemper) [09:30:47] (03CR) 10Gehel: [C: 03+2] icinga: Add rkemper to wdqs-admins, sms [puppet] - 10https://gerrit.wikimedia.org/r/595059 (https://phabricator.wikimedia.org/T251572) (owner: 10Ryan Kemper) [09:31:08] !log contint2001 started zuul-merger again (had permission issues in /var/lib/zuul ) [09:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:37] (03CR) 10Ema: [C: 03+1] Release 8.0.7-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595484 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [09:39:18] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:42] (03PS1) 10Filippo Giunchedi: hieradata: add thanos-query to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/595493 (https://phabricator.wikimedia.org/T233956) [09:41:46] (03PS1) 10Filippo Giunchedi: 
thanos: add lvs addresses to frontend [puppet] - 10https://gerrit.wikimedia.org/r/595494 (https://phabricator.wikimedia.org/T233956) [09:42:33] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.7-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595484 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [09:44:18] !log contint2001 - find /var/lib/jenkins -user statsite -exec chown jenkins:jenkins {} \; [09:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:20] (03CR) 10Gehel: [C: 04-1] "minor syntax errors, see inline comment" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/595061 (https://phabricator.wikimedia.org/T206951) (owner: 10Ryan Kemper) [09:51:42] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10Kormat) I can confirm that /srv/tftpboot on apt1001 is stale: ` kormat@apt1001:/srv/tftpboot/buster-installer/debian-installer/amd64(0:0)$ ls -l total 119796 -r--r--r-- 1 root root 1322936 Jun 27... [09:52:04] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10MoritzMuehlenhoff) To unbreak current Buster installs it should be sufficient to replace /srv/tftpboot/buster-install on apt1001.wikimedia.org with a version from install1003 or install2003. To fi... [09:57:52] (03CR) 10Jbond: [C: 03+1] "LGTM ping me after you deploy as i also want to check the IP ACL is working with tomcat" [puppet] - 10https://gerrit.wikimedia.org/r/595159 (owner: 10Muehlenhoff) [10:01:05] kormat: /srv/tftpboot on install1003 [10:02:13] ahh, right. 
will fix in a few, thanks [10:06:18] (03CR) 10Giuseppe Lavagetto: Add recommendation-api chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [10:15:10] !log upload trafficserver 8.0.7-1wm3 to apt.wm.o (buster) - T242767 T249335 [10:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:15] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [10:15:15] T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 [10:18:57] (03CR) 10Dzahn: [C: 04-1] phabricator: Change mail alias only on wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595488 (owner: 10Paladox) [10:20:34] (03PS7) 10Jbond: apereo_cas: support staging environment [puppet] - 10https://gerrit.wikimedia.org/r/595150 [10:21:56] (03PS3) 10Paladox: phabricator: Use $::domain in mail alias [puppet] - 10https://gerrit.wikimedia.org/r/595488 [10:22:14] (03CR) 10Paladox: phabricator: Use $::domain in mail alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595488 (owner: 10Paladox) [10:23:32] 10Operations, 10serviceops: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) [10:28:00] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595498 (https://phabricator.wikimedia.org/T128546) [10:28:38] (03PS8) 10Jbond: apereo_cas: support staging environment [puppet] - 10https://gerrit.wikimedia.org/r/595150 [10:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T1030). 
[10:32:41] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10elukey) 05Open→03Resolved We added SWAP to all stat100x hosts, and set a deprecation of the notebooks for June 2020. In theory we shouldn't receive any more alarms, closing. [10:33:45] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [10:34:48] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595150 (owner: 10Jbond) [10:36:08] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Elitre) 05Open→03Stalled [10:36:10] 10Operations, 10Goal: FY2020-2021 Q1 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10Elitre) [10:38:53] (03PS1) 10Elukey: role::eventlogging::analytics: remove mysql config [puppet] - 10https://gerrit.wikimedia.org/r/595499 (https://phabricator.wikimedia.org/T245238) [10:46:19] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Remove references to m4-master - https://phabricator.wikimedia.org/T245238 (10elukey) [10:47:00] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Remove references to m4-master - https://phabricator.wikimedia.org/T245238 (10elukey) [10:51:57] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595498 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:53:03] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595498 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:54:20] (03PS2) 10KartikMistry: Revert limit adjustment for Chinese translations with ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595478 (https://phabricator.wikimedia.org/T252371) [10:55:57] (03PS1) 10Hnowlan: 
changeprop: make changeprop settings their own dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) [10:56:29] (03PS2) 10Hnowlan: changeprop: make changeprop settings their own dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) [10:56:55] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:595498| Bumping portals to master (595498)]] (duration: 01m 07s) [10:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:02] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:595498| Bumping portals to master (595498)]] (duration: 01m 06s) [10:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/595150 (owner: 10Jbond) [10:59:08] (03CR) 10Jbond: [C: 03+2] apereo_cas: support staging environment [puppet] - 10https://gerrit.wikimedia.org/r/595150 (owner: 10Jbond) [11:00:03] (03PS1) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T1100). [11:00:05] chiborg and kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:20] * kart_ is here. [11:00:25] (03CR) 10jerkins-bot: [V: 04-1] purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (owner: 10Giuseppe Lavagetto) [11:02:49] Who can deploy chiborg's patches? [11:02:59] I can deploy my patch after that. 
[11:05:48] chiborg: around? [11:05:58] here [11:07:57] chiborg: I'm deploying my patch first and see if we can find someone to deploy your patches. If you can't find anyone, I can help there too. [11:08:08] sorry I’m late [11:08:10] I can SWAT deploy [11:08:14] alright, thanks [11:08:50] Lucas_WMDE: I'm doing my patch now. Give me few minutes :) [11:08:55] ok :) [11:08:55] (03CR) 10KartikMistry: [C: 03+2] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595478 (https://phabricator.wikimedia.org/T252371) (owner: 10KartikMistry) [11:09:14] Lucas_WMDE: probably +2 on chiborg's patches can be done meanwhile? [11:09:29] * Lucas_WMDE reviews [11:09:45] (03Merged) 10jenkins-bot: Revert limit adjustment for Chinese translations with ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595478 (https://phabricator.wikimedia.org/T252371) (owner: 10KartikMistry) [11:14:55] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|595478|Revert limit adjustment for Chinese translation with ContentTranslation (T252371)]] (duration: 01m 09s) [11:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:59] T252371: Revert limit adjustment for Chinese translations with Content translation - https://phabricator.wikimedia.org/T252371 [11:15:30] Lucas_WMDE: I'm done. [11:15:34] ok [11:16:00] chiborg: let’s try the wmf.31 backport first [11:16:31] ok [11:17:13] it’s on mwdebug1001 [11:17:18] (03PS2) 10Jbond: interactive: add get_secret function [software/spicerack] - 10https://gerrit.wikimedia.org/r/594988 [11:18:51] site doesn’t seem to be horrendously broken so far [11:19:44] chiborg: have you found a way to test the change or should I just sync it and watch for errors? 
[11:20:00] (03PS1) 10JMeybohm: parsoid: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/595505 (https://phabricator.wikimedia.org/T235411) [11:20:58] haven't found a way yet, but this schema change will only break a SMDE banner if applied wrongly, so should be fine [11:21:03] WMDE [11:21:21] ok [11:23:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/WikimediaEvents/: SWAT: [[gerrit:594694|Update Banner Interaction Schema (T250791, wmf.31)]] (duration: 01m 07s) [11:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:28] T250791: Use EventLogging to log banner interactions - https://phabricator.wikimedia.org/T250791 [11:23:57] ok, onto wmf.30 [11:24:35] also on mwdebug1001 [11:24:46] (03CR) 10JMeybohm: parsoid: Add TLS termination support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595505 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [11:25:46] (03PS3) 10Jbond: interactive: add get_secret function [software/spicerack] - 10https://gerrit.wikimedia.org/r/594988 [11:26:35] everything looks clear, syncing… [11:27:39] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... [11:27:47] (03CR) 10Jbond: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/594988 (owner: 10Jbond) [11:28:20] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... 
[11:29:07] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... [11:30:12] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.30/extensions/WikimediaEvents/: SWAT: [[gerrit:594693|Update Banner Interaction Schema (T250791, wmf.30)]] (duration: 01m 08s) [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:16] T250791: Use EventLogging to log banner interactions - https://phabricator.wikimedia.org/T250791 [11:30:59] (03PS1) 10Dzahn: aptrepo: populate /srv/tftpboot from volatile also on APT_repo servers [puppet] - 10https://gerrit.wikimedia.org/r/595507 (https://phabricator.wikimedia.org/T252382) [11:32:32] !log EU SWAT done [11:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:34:20] (03CR) 10Jbond: "There are many better then me to review this :) however looks good to me although im not sure /srv/ seems like the best place to install t" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [11:35:24] (03PS2) 10Dzahn: aptrepo: populate /srv/tftpboot from volatile also on APT_repo servers [puppet] - 10https://gerrit.wikimedia.org/r/595507 (https://phabricator.wikimedia.org/T252382) [11:35:35] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/22449/apt1001.wikimedia.org/index.html" [puppet] - 
10https://gerrit.wikimedia.org/r/595507 (https://phabricator.wikimedia.org/T252382) (owner: 10Dzahn) [11:39:14] (03CR) 10Muehlenhoff: "Ack on /srv. It's just a WIP, since I wanted to quickly test the approach in general, I'll upgrade this to follow the new Tomcat approach " [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [11:40:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Update TOU link in exim warnings [puppet] - 10https://gerrit.wikimedia.org/r/595246 (owner: 10BryanDavis) [11:42:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [11:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [11:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [11:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['backup1002.eqiad.wmnet'] ` The log can be found in `/var/... [11:45:09] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:13] 10Operations, 10ops-eqiad: Netbox report PuppetDB PhysicalHosts critical - https://phabricator.wikimedia.org/T251725 (10Cmjohnson) 05Open→03Resolved The restbase errors were only temporary and were related to initial imaging of the host. 
no errors at this time. [11:47:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:23] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1029.eqiad.wmnet'] ` and were **ALL** succ... [11:50:47] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1028.eqiad.wmnet'] ` and were **ALL** succ... [11:50:54] (03PS3) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) [11:51:29] 10Operations, 10Patch-For-Review: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10Dzahn) >>! In T252382#6123941, @MoritzMuehlenhoff wrote: > To fix this for good we can either > - have /srv/tftpboot on apt1001 be populated from the volatile directory I did... [11:52:00] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10akosiaris) 05Stalled→03Resolved a:03akosiaris Indeed @Aklapper . Thanks. [11:52:15] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1030.eqiad.wmnet'] ` and were **ALL** succ... 
[11:54:41] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) [11:55:22] 10Operations, 10Patch-For-Review: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10jcrespo) I tested it on backup1002 and this worked well. This can be closed - but I wonder if we should have a working group in improving the install and deb service, when it... [11:55:29] (03CR) 10JMeybohm: "IMO this has been superseded by I5d9ab4069f1087fa41e0d8a5290789cee1d434d8" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554834 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [11:56:55] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) 05Open→03Resolved @mobrovac these servers are ready for service implementation. I am resolvin... 
[11:57:37] (03Abandoned) 10JMeybohm: eventgate: convert to use the common tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/554834 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [11:58:09] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['backup1002.eqiad.wmnet'] ` [11:59:34] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [11:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Due Date: ASAP) rack/setup/install replacement msw-c6-eqiad - https://phabricator.wikimedia.org/T251616 (10Cmjohnson) [12:02:05] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:28] 10Operations, 10ops-eqiad, 10DC-Ops: (Due Date: ASAP) rack/setup/install replacement msw-c6-eqiad - https://phabricator.wikimedia.org/T251616 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson This has been completed, the spare msw didn't even make it to the spares list. It was used for msw-a6 replacement.... [12:02:36] mutante: so eventually I am here [12:02:54] hashar: here.. the rsync finished _exactly_ on time, heh [12:02:58] making coffee [12:02:59] \o/ [12:03:59] starts the rsync one more time for /srv/jenkins. 
we have to check permissions though [12:04:01] (03PS1) 10Jcrespo: backups: Add backup1002 as a spare system, enough to prepare RAID [puppet] - 10https://gerrit.wikimedia.org/r/595509 (https://phabricator.wikimedia.org/T250816) [12:04:26] (03PS2) 10Jcrespo: backups: Add backup1002 as a spare system, enough to prepare RAID [puppet] - 10https://gerrit.wikimedia.org/r/595509 (https://phabricator.wikimedia.org/T250816) [12:04:50] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [12:06:23] (03CR) 10Jcrespo: [C: 03+2] backups: Add backup1002 as a spare system, enough to prepare RAID [puppet] - 10https://gerrit.wikimedia.org/r/595509 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [12:08:03] hashar: done. now fixing permissions on /srv/jenkins [12:10:20] hold on [12:10:36] I am going to stop the services and we need another rsync to catch up with changes that happened [12:10:45] ok [12:11:00] maybe we can drop the chroot parameter in rsync or at least force disable the numeric ids parameter?
[12:11:08] that will save the time needed to re-fix the ownerships afterwards [12:11:20] let's worry about that for next time [12:11:27] all i have to do is cursor up now [12:11:42] (03CR) 10Hnowlan: changeprop: add cpjobqueue configuration switching (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [12:12:25] /srv/jenkins is easy because it's all jenkins:jenkins [12:12:31] cool [12:12:39] shutting down stuff [12:12:39] (03PS3) 10JMeybohm: mathoid: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:14:28] (03CR) 10Alexandros Kosiaris: parsoid: Add TLS termination support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595505 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [12:14:44] (03CR) 10JMeybohm: "Rebase to master, bump version and add chart package" [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:14:47] !log shutting down Zuul and Jenkins for system switch # T224591 [12:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:50] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [12:15:35] and I have masked both services on contint1001 [12:15:56] mutante: you can rsync /srv/jenkins and /var/lib/jenkins again ;) [12:16:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Aside from a minor comments, rest LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:16:10] what about /var/lib/zuul ?
[12:16:28] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache on Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/595164 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:16:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment, but rest LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:17:32] mutante: hmm, I guess I just need the state file, and it is not that important I can just retrigger the few events [12:17:37] so lets ignore /var/lib/zuul [12:17:40] !log contint1001 - rsync -avz --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins- [12:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:42] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/595133 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:19:00] !log contint1001 - rsync -avz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/ [12:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:45] !log contint2001 - chown -R jenkins:jenkins /srv/jenkins/* [12:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:54] hopefully it is fast enough ;) [12:21:28] (03PS1) 10Aklapper: phabricator weekly changes email: Fix links to project pages [puppet] - 10https://gerrit.wikimedia.org/r/595513 [12:21:41] !log contint2001 - find /var/lib/jenkins/ -user statsite -exec chown jenkins {} \; [12:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:10] (03PS2) 10Hashar: switch contint from 1001 to 2001 [dns] - 10https://gerrit.wikimedia.org/r/594480 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:23:23] (03CR) 10Hashar: "Rebased ;)" [dns] - 
10https://gerrit.wikimedia.org/r/594480 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:24:07] !log contint2001 - find /var/lib/jenkins/ -group bacula -exec chown jenkins:jenkins {} \; [12:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:25] (03PS4) 10Dzahn: contint: switch jenkins/zuul/gearman to contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/594477 (https://phabricator.wikimedia.org/T224591) [12:25:28] (03CR) 10Hashar: [C: 03+1] contint: switch jenkins/zuul/gearman to contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/594477 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:26:06] mutante: and I have rebased the dns change ;) [12:26:09] hashar: rsync done. permissions look ok to me [12:26:13] DNS first? [12:26:18] yes [12:26:49] (03CR) 10Dzahn: [C: 03+2] switch contint from 1001 to 2001 [dns] - 10https://gerrit.wikimedia.org/r/594480 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:26:51] and it has a 5 mins TTL , so I guess up to 10 minutes for the world to be updated [12:26:56] here it goes. TTL is 5 m [12:27:31] (03PS1) 10Aklapper: Phabricator monthly email: Explicitly list number of stalled tasks [puppet] - 10https://gerrit.wikimedia.org/r/595514 [12:27:35] (03CR) 10Dzahn: [V: 03+2 C: 03+2] switch contint from 1001 to 2001 [dns] - 10https://gerrit.wikimedia.org/r/594480 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:27:39] duh, why am i waiting for V :) [12:27:50] muscle memory or something [12:28:14] operations/dns has jenkins [12:28:15] hashar: done. and the puppet one right away, ack ? 
[12:28:22] yeah [12:28:28] topic branch: https://gerrit.wikimedia.org/r/q/topic:%22contint-buster%22+(status:open%20OR%20status:merged) [12:28:31] I have manually masked the services [12:28:55] (03CR) 10Dzahn: [V: 03+2 C: 03+2] contint: switch jenkins/zuul/gearman to contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/594477 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:29:04] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [12:29:12] running puppet on contint1001 [12:29:25] too quick. not merged on master [12:29:36] ;D [12:29:57] syncing .. and NOW go ahead [12:30:03] (03PS1) 10Volans: icinga: fix passive Icinga meta-monitoring for VO [puppet] - 10https://gerrit.wikimedia.org/r/595515 (https://phabricator.wikimedia.org/T252401) [12:30:19] contint.wikimedia.org is an alias for contint2001.wikimedia.org. [12:31:04] contint1001 services look fine (zuul/jenkins are masked) [12:31:42] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache on dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/595171 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:31:55] jenkins.model.InvalidBuildsDir: /srv/jenkins/builds/${ITEM_FULL_NAME} does not exist and probably cannot be created [12:31:59] in Jenkins on contint2001 :/ [12:32:37] eh.. but how.. 
we just synced that [12:32:43] let's count the files [12:33:02] 1185864 [12:33:15] the number of files under /srv/jenkins/builds is identical on both sides [12:33:52] and they are all owned jenkins:jenkins [12:33:58] yeah not sure what happened [12:34:01] I restarted jenkins [12:35:04] (03PS1) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [12:35:12] wmf-insecte, #wikimedia-analytics, Cannot join channel (+r) - you need to be identified with services [12:35:15] duh? [12:35:24] (03PS2) 10JMeybohm: termbox: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:35:47] can look at that later [12:36:02] i like to see "SSH Launch of integration-agent-docker...." lines [12:36:19] https://integration.wikimedia.org/zuul/ yields a 404 :/ [12:36:55] hashar: can't confirm for me yet [12:37:00] i see a status page [12:37:33] Last reconfigured: Mon May 11 2020 08:31:53 GMT-0400 (Eastern Daylight Time) [12:37:43] (03CR) 10Muehlenhoff: [C: 03+2] Reenable the ssoSessions endpoint on the staging IDP [puppet] - 10https://gerrit.wikimedia.org/r/595159 (owner: 10Muehlenhoff) [12:37:58] (03PS4) 10JMeybohm: mathoid: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:38:04] but https://integration.wikimedia.org/zuul/status.json works ... [12:38:17] which is the proxied thing [12:38:22] hashar: for me it works.. 
i open it and i see logs on 2001 [12:38:25] so I guess the docroot is missing [12:38:53] (03CR) 10JMeybohm: mathoid: add TLS termination (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:39:01] hashar: yea, the docroot exists but empty [12:39:08] bah [12:39:25] some cache clusters give 200, other cache clusters give 404 [12:39:36] how does it get filled, hashar? manual? [12:39:49] eqiad and esams give 200, the rest give 404 [12:39:53] from a clone of integration/docroot.git to /srv/ iirc [12:40:12] cdanis: yeah i guess it is dns split brained [12:40:17] hashar: try again now [12:40:33] !log contint1001 - rsync -avz --delete /srv/org/wikimedia/integration/ rsync://contint2001.wikimedia.org/ci--srv-/org/wikimedia/integration/ [12:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:47] (03PS1) 10Jcrespo: insetup: Disable notifications to "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) [12:40:57] it's worse than that I think, the clusters serving 404s report the 404 as a cache hit [12:41:12] (03PS2) 10Jcrespo: insetup: Disable notifications to "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) [12:41:21] 👍cdanis@evebox ~ 🕣☕ for DC in eqiad codfw esams ulsfo eqsin; do curl -v $(RESOLVE https://integration.wikimedia.org/zuul/ text-lb.$DC.wikimedia.org) -o/dev/null |& egrep '< HTTP/2 |< x-cache: '; echo ; done [12:41:22] the split-brain is puppet runs on ATS, which we did not force yet [12:41:23] < HTTP/2 200 [12:41:25] < x-cache: cp1077 miss, cp1079 hit/7 [12:41:27] < HTTP/2 404 [12:41:29] < x-cache: cp2035 miss, cp2029 hit/6 [12:41:31] < HTTP/2 200 [12:41:33] < x-cache: cp3054 miss, cp3056 hit/14 [12:41:35] < HTTP/2 404 [12:41:37] < x-cache: cp4030 miss, cp4029 hit/6 [12:41:39] < HTTP/2 404 [12:41:41] (03PS3) 
10Jcrespo: insetup: Disable notifications for "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) [12:41:41] < x-cache: cp5010 miss, cp5007 hit/4 [12:41:43] not sure how long we cache a 404 for [12:42:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Adding Thomas and Leszek so they know this chart is gaining TLS support. LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:42:07] mutante: oh we are missing /srv/.git on contint2001 [12:42:16] cdanis: yeah I think it is just transient [12:42:20] 10Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (10ayounsi) As 1001 and 2002 are gone this task might be good to close? [12:42:32] (03PS1) 10ArielGlenn: in page content fixup script, check for truncation, move into place if good [dumps] - 10https://gerrit.wikimedia.org/r/595518 [12:43:26] (03CR) 10Jcrespo: "I think notifications should be disabled automatically for host being setup (could be overridden for specific hosts), the same than spare " [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [12:43:36] hashar: synced .git as well [12:43:52] 10Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Ack, in the current scheme of things, the apt repo now lives on apt1001, a dedicated machine. [12:43:54] !log contint1001 - rsync -avz --delete /srv/.git/ rsync://contint2001.wikimedia.org/ci--srv-/org/.git/ [12:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:01] argg. 
hold on :) [12:44:01] mutante: and bunch of permissions gotta be fixed now ;) [12:44:47] hm [12:45:07] !log contint1001 - rsync -avz --delete /srv/.git/ rsync://contint2001.wikimedia.org/ci--srv/.git/ [12:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:08] 10Operations, 10Traffic: Puppet cleanup after purged transition - https://phabricator.wikimedia.org/T251374 (10ema) 05Open→03Resolved [12:45:24] (03PS1) 10Jcrespo: backups: Set backup1002 as an "in setup" system and disable notif. [puppet] - 10https://gerrit.wikimedia.org/r/595519 (https://phabricator.wikimedia.org/T250816) [12:46:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Adding Moritz for his information, as he is the author of this. We are proceeding with adding TLS (encryption) support to the chart via an" [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [12:46:14] (03PS2) 10Jcrespo: backups: Set backup1002 as an "in setup" system and disable notif. 
[puppet] - 10https://gerrit.wikimedia.org/r/595519 (https://phabricator.wikimedia.org/T250816) [12:46:14] !log contint2001 - chown -R jenkins-slave:jenkins-slave /srv/.git [12:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:24] hashar: ^ jenkins-slave owned [12:46:33] good [12:46:37] fixing some missing files now [12:47:15] please log if you can [12:47:18] PROBLEM - LVS HTTP eqiad IPv4 #page on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:47:18] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10Gehel) [12:47:44] yo [12:47:48] * volans here [12:47:53] <_joe_> uhm [12:48:00] I'm here too [12:48:01] * jbond42 here [12:48:06] gilles too ^ [12:48:17] RECOVERY - LVS HTTP eqiad IPv4 #page on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 368 bytes in 9.964 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:48:19] as it could be software I guess [12:48:23] mm [12:48:32] network as it recovered so fast? [12:48:53] that was fast indeed [12:48:58] it returned 503 [12:49:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10Gehel) [12:49:14] <_joe_> not sure tbh, it seems to be in good shape from grafana [12:49:21] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10Gehel) [12:49:22] not really [12:49:23] Phab timed out a few minutes ago as well, working now though. 
[12:49:30] https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?panelId=35&fullscreen&orgId=1 [12:49:34] mutante: I cleaned a few more files ;) [12:49:35] I'm not sure it is recovered yet, https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1 says a lot of 5xx ("haproxy") [12:49:44] <_joe_> indeed [12:49:46] <_joe_> https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?panelId=35&fullscreen&orgId=1 [12:49:50] seems like a spike of something started 30ish minutes ago [12:49:51] it is ongoign [12:49:57] also 404s [12:50:01] did volume of request change? [12:50:11] https://integration.wikimedia.org/ works [12:50:15] it seems so jynus [12:50:15] <_joe_> yes, so, can someone look at swift? [12:50:21] good morning 👋 [12:50:26] !log Pointing CI Jenkins to contint2001 Gearman server T224591 [12:50:26] if I interpret this correctly [12:50:26] https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?panelId=6&fullscreen&orgId=1 [12:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:30] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [12:50:36] <_joe_> good morning rzl :P [12:51:25] <_joe_> interestingly, around the same time the latency on codfw went down somehow [12:51:45] thumbor - swift traffic actually seems to have gone down [12:51:54] with lower latency [12:52:18] I'm taking a look at swift, looks fine so far [12:52:36] <_joe_> i see a ton of requests for thumbs at 330px [12:52:39] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) @HMarcus Thanks for the information i have been trying to play with the api today however im not sure i have the... [12:53:31] <_joe_> 70% of requests are for those [12:53:38] elevated 404s and 499s from Swift starting ~12:10 [12:54:01] mutante: digging. 
Jenkins does not start jobs for some reason [12:54:07] <_joe_> can someone look into the cdn logs to see who's requesting those 330px thumbnails? [12:54:22] seems like ghostscript renders are the only time that didn't go down in QPS. A lot of PDF or DJVU thumbnails being requested maybe? [12:54:31] only type [12:55:02] <_joe_> to be clear, the servers are not overloaded [12:55:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:56:09] right, it's just that there could be a lot of thumbnails taking a while to render keeping the processes busy, leaving other requests in the queue for longer [12:56:46] I see a spike of djvus that are more expensive than usual in the timeframe [12:57:08] hashar: ok, making patch to add missing /srv/jenkins [12:57:29] also a spike on VIPS requests taking longer (huge images) [12:57:40] (03PS1) 10Dzahn: jenkins: add missing /srv/jenkins dir [puppet] - 10https://gerrit.wikimedia.org/r/595521 (https://phabricator.wikimedia.org/T224591) [12:58:03] https://usercontent.irccloud-cdn.com/file/Zsgr4wyW/Screenshot%202020-05-11%20at%2014.57.48.png [12:58:50] there are lots of throttling mechanisms, though, one IP shouldn't be able to clog things up for everyone [12:59:00] mutante: sure. 
I am restarting Jenkins again [12:59:54] ah, I see a spike of original load errors on swift [13:00:08] on eqiad [13:00:19] https://usercontent.irccloud-cdn.com/file/ggXARlCD/Screenshot%202020-05-11%20at%2015.00.02.png [13:01:17] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10AMooney) a:03tstarling [13:02:26] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/595521 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:03:08] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T221259#6118361 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:03:32] mutante: it is broken somehow :(((( [13:03:39] i checked sampled-1000.json on weblog and couldn't see any common ips or user agents requesting 330px [13:04:41] mutante: zuul does seem to work, it does try to launch jobs but they are never run by Jenkins [13:05:02] hashar: Gearman errors in jenkins.log [13:05:11] receieved IOException while registering functions [13:05:16] oh [13:06:07] yeah that would be it [13:06:26] maybe it cant reach 127.0.0.1:4730 [13:07:37] hashar: iptables checked.. both have rules for both servers [13:07:58] unless there was an extra ACL on network gear for letting the cloud instances talk to prod [13:10:12] hashar: it is looking better now in logs [13:11:04] the master definitely manages to ssh to the cloud instances [13:11:24] ack, i see "SSH lauch... completed" [13:11:37] then "interruped while waiting for okay to send grab job" [13:11:48] but not that IOException anymore [13:11:57] yeah I am digging into that one. 
No clue what those IOException are [13:12:04] (03CR) 10Muehlenhoff: "+1 on the notification part, see inline comment on the cluster variable" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [13:14:49] (03PS4) 10Jcrespo: insetup: Disable notifications for "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) [13:16:20] hashar: found why the integration docroot was empty. puppet is told to git::clone but to /srv/docroot but that isn't a thing. it is /srv/org/wikimedia/integration [13:17:55] that stuff still comes from the split off doc.wikimedia.org [13:18:44] (03CR) 10Ppchelko: [C: 03+2] changeprop: make changeprop settings their own dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [13:18:58] mutante: at least zuul seems to work ;) [13:19:38] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.08149 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:20:26] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 40.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:20:32] !log Upgrade mysql package on s4 master in preparation for tomorrow's maintenance T251502 [13:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:36] T251502: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 [13:21:07] the puppet alerts are on cp hosts [13:21:09] 10Operations, 10DBA, 10User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - 
https://phabricator.wikimedia.org/T251502 (10Marostegui) Package has been upgraded on db1138 [13:21:58] mutante: I don't get what is going so I am afraid I will call a rollback. I am just going to hack and switch to java 8 and see whether that works [13:22:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 104.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:23:37] mutante: ah there is no java8 of course :D [13:24:26] 10Operations, 10DBA, 10User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (10Marostegui) Maintenance day: - Silence all hosts in s4 - Set read only on s4: ` dbctl --scope eqiad section s4 ro "Maintenance on s4 T251502" && dbct... [13:25:54] hashar: hmm.ok. could be java version i guess..yea [13:26:13] (03PS1) 10Dzahn: contint: fix git cloning of docroot for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/595525 [13:26:14] I see the jobs are in the queue [13:26:28] but none of them are requested by the executors [13:26:53] JobOffer[integration-agent-docker-1001 #2] rejected hashar-pinger: Waiting for next available executor on ‘integration-agent-docker-1001’ [13:27:00] so the executors are all busy somehow [13:28:11] (03CR) 10Marostegui: [C: 03+1] role::eventlogging::analytics: remove mysql config [puppet] - 10https://gerrit.wikimedia.org/r/595499 (https://phabricator.wikimedia.org/T245238) (owner: 10Elukey) [13:32:20] (03CR) 10Ottomata: [C: 03+1] role::eventlogging::analytics: remove mysql config [puppet] - 10https://gerrit.wikimedia.org/r/595499 (https://phabricator.wikimedia.org/T245238) (owner: 10Elukey) [13:34:32] mutante: so yeah lets revert. 
Basically rollback the puppet change and the dns change [13:34:35] and I guess that will do it [13:35:20] (03PS1) 10Dzahn: Revert "contint: switch jenkins/zuul/gearman to contint2001" [puppet] - 10https://gerrit.wikimedia.org/r/595526 [13:35:29] (03PS1) 10Dzahn: Revert "switch contint from 1001 to 2001" [dns] - 10https://gerrit.wikimedia.org/r/595527 [13:35:49] (03CR) 10Hashar: [C: 03+1] Revert "contint: switch jenkins/zuul/gearman to contint2001" [puppet] - 10https://gerrit.wikimedia.org/r/595526 (owner: 10Dzahn) [13:35:53] (03CR) 10Hashar: [C: 03+1] Revert "switch contint from 1001 to 2001" [dns] - 10https://gerrit.wikimedia.org/r/595527 (owner: 10Dzahn) [13:36:05] I guess it is an issue related to the java8 > java11 update [13:36:24] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "contint: switch jenkins/zuul/gearman to contint2001" [puppet] - 10https://gerrit.wikimedia.org/r/595526 (owner: 10Dzahn) [13:36:31] !log Rolling back CI system switch to previous known state # T224591 [13:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [13:37:09] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "switch contint from 1001 to 2001" [dns] - 10https://gerrit.wikimedia.org/r/595527 (owner: 10Dzahn) [13:37:29] hashar: you can run puppet on both [13:37:30] which obviously I haven't tested :( [13:40:03] (03CR) 10Dzahn: [C: 03+1] insetup: Disable notifications for "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [13:40:16] (03CR) 10jerkins-bot: [V: 04-1] insetup: Disable notifications for "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [13:40:23] (03PS1) 10Muehlenhoff: Add a Ferm rule for Prometheus metrics when running CAS on Tomcat [puppet] - 10https://gerrit.wikimedia.org/r/595528 [13:40:38] (03CR) 
10jerkins-bot: [V: 04-1] Add a Ferm rule for Prometheus metrics when running CAS on Tomcat [puppet] - 10https://gerrit.wikimedia.org/r/595528 (owner: 10Muehlenhoff) [13:41:32] mutante: ^^ it is back [13:42:26] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [13:42:41] (03PS2) 10Dzahn: contint: fix git cloning of docroot for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/595525 [13:42:51] (03PS3) 10Dzahn: contint: fix git cloning of docroot for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) [13:43:06] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/595521 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:43:12] hashar: good :) [13:43:36] so I am tempted to blame java8 [13:43:40] vs java11 [13:44:03] (03CR) 10jerkins-bot: [V: 04-1] contint: fix git cloning of docroot for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:44:15] !log upgrade ATS to 8.0.7-1wm4 in cp4032 - T249335 [13:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:18] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:44:33] hashar: there is a Java 8 component for buster [13:44:46] ahhh good to know [13:45:10] we're using it for e.g. 
Hadoop as it needs the same JRE across all distros [13:45:55] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Jclark-ctr) [13:46:10] 10Operations, 10Performance-Team, 10Thumbor: cwebp chokes on YCCK JPGs - https://phabricator.wikimedia.org/T226707 (10Gilles) [13:46:12] deb http://apt.wikimedia.org/wikimedia buster-wikimedia component/jdk8 [13:46:19] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson name rack_name position switchport db1141 A3 1 7 db1142 A5 36 36 db1143 B3 32 26 db1144 B8 7 13 db1145 C5 9 8 db1146 C5 33... [13:46:41] mutante: I am announcing the rollback [13:47:09] hashar: ok, maybe let's add wikitech? [13:47:20] for sure [13:47:25] I did ops and wikitech lists [13:48:00] I will reply and cc the two others [13:48:05] or reply on each [13:48:05] cool, i am commenting on that ticket people opened against pywikibot [13:50:04] icinga is changing accordingly, checks for zuul/jenkins/gearman are PENDING [13:50:39] (03CR) 10Ottomata: [C: 03+2] profile::analytics::refinery:job::refine: bump event refine exec memory to 4g [puppet] - 10https://gerrit.wikimedia.org/r/595359 (owner: 10Elukey) [13:51:41] (03CR) 10Elukey: [C: 03+2] role::eventlogging::analytics: remove mysql config [puppet] - 10https://gerrit.wikimedia.org/r/595499 (https://phabricator.wikimedia.org/T245238) (owner: 10Elukey) [13:52:24] 10Operations, 10observability, 10Graphite: graphite2003 crashed - https://phabricator.wikimedia.org/T251479 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi Boldly resolving [13:53:27] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Remove references to m4-master - https://phabricator.wikimedia.org/T245238 (10elukey) [13:53:35] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Remove references to m4-master - 
https://phabricator.wikimedia.org/T245238 (10elukey) 05Open→03Resolved a:03elukey [13:53:57] (03CR) 10Krinkle: contint: fix git cloning of docroot for integration.wm.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:55:42] moritzm: thank you for the hint ;) [13:57:26] (03CR) 10Jbond: [C: 04-1] "as mentioned on irc i dont think this is required" [puppet] - 10https://gerrit.wikimedia.org/r/595528 (owner: 10Muehlenhoff) [14:00:49] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.0006317 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:01:59] (03Abandoned) 10Muehlenhoff: Add a Ferm rule for Prometheus metrics when running CAS on Tomcat [puppet] - 10https://gerrit.wikimedia.org/r/595528 (owner: 10Muehlenhoff) [14:02:22] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) The upgrade itself went well: * the rs... [14:03:56] mutante: at least the upgrade process went fine :] [14:05:12] 10Operations, 10ops-eqiad, 10netops: upgrade row d to have 3 10G switches - https://phabricator.wikimedia.org/T196487 (10ayounsi) [14:05:52] (03CR) 10Dzahn: contint: fix git cloning of docroot for integration.wm.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [14:06:29] hashar: yea, some issues but they should be fixed for next time. and we can now separate migration from java version change [14:06:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Sorry for taking so long to review this. I think I have a last round of comments and we should be close to merging." 
(033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [14:06:43] just like for gerrit [14:09:31] (03PS1) 10Muehlenhoff: Initially use Java 8 for contint on Buster [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) [14:09:36] (03CR) 10Ppchelko: changeprop: make changeprop settings their own dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:09:56] (03CR) 10Ppchelko: [C: 03+2] "there seem o have been some CI issue, poking this again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:10:11] ^ in case this cannot be fixed/reprod with Java 11 [14:10:14] (03Merged) 10jenkins-bot: changeprop: make changeprop settings their own dict [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:11:47] (03CR) 10Ppchelko: changeprop: make changeprop settings their own dict (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595501 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:20:27] (03PS4) 10Mholloway: wikifeeds: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:21:10] (03CR) 10Jdlrobson: [C: 03+1] "When this patch lands https://en.m.wikipedia.org/wiki/Main_Page will transform into https://en.m.wikipedia.org/wiki/Main_Page?debug=true&m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [14:21:15] (03CR) 10Mholloway: [C: 03+2] wikifeeds: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:21:33] 
(03Merged) 10jenkins-bot: wikifeeds: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:24:07] (03PS1) 10Vgutierrez: Relese 8.0.7-1wm4 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595533 (https://phabricator.wikimedia.org/T249335) [14:29:56] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [14:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:07] 10Operations, 10DBA, 10User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (10Marostegui) Window reserved on the Deployment's calendar [14:37:55] (03PS1) 10Muehlenhoff: Enable staging IDP site for graphite [puppet] - 10https://gerrit.wikimedia.org/r/595538 [14:42:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [14:45:45] (03PS1) 10Bearloga: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) [14:45:53] (03PS5) 10Jcrespo: insetup: Disable notifications for "in setup" hosts [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) [14:47:22] (03CR) 10Filippo Giunchedi: "See inline for command name, LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595515 (https://phabricator.wikimedia.org/T252401) (owner: 10Volans) [14:47:24] (03CR) 10Jcrespo: [C: 03+2] insetup: Disable notifications for "in setup" hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595517 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [14:47:38] (03PS2) 10Ema: Release 8.0.7-1wm4 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595533 
(https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [14:47:52] (03CR) 10Bearloga: "The configuration in data.yaml is per John's comments in I25e9fa413acb3493f27752a8a98ea78626097343" [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [14:48:01] (03PS1) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) [14:48:03] (03PS1) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Wikidata and Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) [14:48:20] (03PS1) 10Lucas Werkmeister (WMDE): Anchor RegExp for Data Bridge in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 [14:48:51] (03CR) 10Lucas Werkmeister (WMDE): "I just noticed this while working on I734d16f7bf. Any objections?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 (owner: 10Lucas Werkmeister (WMDE)) [14:48:54] (03CR) 10jerkins-bot: [V: 04-1] DNM: Enable Data Bridge on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) (owner: 10Lucas Werkmeister (WMDE)) [14:49:01] (03CR) 10Mholloway: [C: 03+2] "I just attempted to deploy this in stating a few minutes ago, but `helmfile apply` timed out:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:49:03] (03CR) 10jerkins-bot: [V: 04-1] DNM: Enable Data Bridge on Wikidata and Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) (owner: 10Lucas Werkmeister (WMDE)) [14:50:15] (03PS3) 10Ema: Release 8.0.7-1wm4 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/595533 (https://phabricator.wikimedia.org/T249335) (owner: 
10Vgutierrez) [14:50:39] 10Operations, 10netops, 10Sustainability (Incident Prevention): D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 (10ayounsi) The only downside to removing the link fully is that it `D1` is 3 hops away `D8`, which doesn't seem to have been an issue since May 2nd. Upside is that it brings u... [14:51:44] (03CR) 10JMeybohm: "> Patch Set 4:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:52:52] (03PS1) 10Ottomata: Add Horizon webproxies that end in -beta.wmflabs.org to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) [14:53:01] 10Operations: buster reimaging broken with "No kernel modules found" - https://phabricator.wikimedia.org/T252382 (10MoritzMuehlenhoff) >>! In T252382#6124358, @jcrespo wrote: > I tested it on backup1002 and this worked well. This can be closed > > - but I wonder if we should have a working group in improving th... [14:53:49] (03PS2) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) [14:53:51] (03PS2) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Wikidata and Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) [14:57:47] (03CR) 10Ottomata: "Does this even work? I'm not sure if partial subdomain wildcards like '*-beta' work with CSP." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) (owner: 10Ottomata) [14:58:04] (03CR) 10Jbond: [C: 03+1] Enable staging IDP site for graphite [puppet] - 10https://gerrit.wikimedia.org/r/595538 (owner: 10Muehlenhoff) [14:58:30] (03PS2) 10Ottomata: Add Horizon webproxies that end in -beta.wmflabs.org to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) [15:01:20] !log installing puma security updates [15:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:15] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:08:14] (03PS1) 10Hnowlan: changeprop: Fix ores config location [deployment-charts] - 10https://gerrit.wikimedia.org/r/595548 (https://phabricator.wikimedia.org/T220399) [15:11:24] (03CR) 10Ppchelko: [C: 03+2] changeprop: Fix ores config location [deployment-charts] - 10https://gerrit.wikimedia.org/r/595548 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:11:26] (03Merged) 10jenkins-bot: changeprop: Fix ores config location [deployment-charts] - 10https://gerrit.wikimedia.org/r/595548 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:13:33] (03CR) 10Muehlenhoff: [C: 03+2] Enable staging IDP site for graphite [puppet] - 10https://gerrit.wikimedia.org/r/595538 (owner: 10Muehlenhoff) [15:14:17] jynus: I'll merge your insetup patch along, ok? [15:15:39] did that now [15:15:52] harmless anyway [15:16:19] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:16:26] 10Operations, 10ops-eqiad: dumpsdata1001 power supply failure - https://phabricator.wikimedia.org/T252361 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson power cable was loose, fixed and resolving [15:18:15] (03CR) 10Krinkle: Add Horizon webproxies that end in -beta.wmflabs.org to CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) (owner: 10Ottomata) [15:20:35] (03PS4) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) [15:22:00] DannyS712: around? [15:22:06] yes [15:22:45] 10Operations, 10ops-eqiad: Check patch cable between analytics1052 and asw2-a-eqiad - https://phabricator.wikimedia.org/T252325 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson that's pretty typical for a bad cable, replaced and good to go May 11 15:18:09 analytics1052 kernel: [110515.574435] tg3 0000:01:0... [15:22:55] time-zone appropriate greetings. going to merge backports for T252179. [15:22:55] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179 [15:23:28] is there any testing that can practically be done for those on a mwdebug? [15:23:33] 10Operations, 10ops-eqiad: dumpsdata1001 power supply failure - https://phabricator.wikimedia.org/T252361 (10Marostegui) Thank you - looking good! ` ------------------------------------------------------------------------------- Record: 16 Date/Time: 05/11/2020 16:11:17 Source: system Severity:... [15:23:54] (or needs to be, i guess i should also ask) [15:24:02] Greetings to you as well. 
I don't think there is any testing that would work on mwdebug, if you backport and then roll out only to mediawiki.org I can easily test [15:24:03] (03PS1) 10Muehlenhoff: Fix name for httpd::site when using the staging flag [puppet] - 10https://gerrit.wikimedia.org/r/595550 [15:24:55] Back ports are all at https://gerrit.wikimedia.org/r/#/q/topic:revert-pageupdater-wmf/1.35.0-wmf.31+(status:open+OR+status:merged) [15:25:00] DannyS712: ack. merging patches. [15:27:40] 10Operations, 10ops-eqiad: failed PSU on sodium - https://phabricator.wikimedia.org/T252419 (10Cmjohnson) [15:27:43] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [15:27:55] 10Operations, 10ops-eqiad: failed PSU on sodium - https://phabricator.wikimedia.org/T252419 (10Cmjohnson) 05Open→03Resolved Swapped the psu-all green and resolving [15:29:11] RECOVERY - IPMI Sensor Status on dumpsdata1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:30:47] RECOVERY - IPMI Sensor Status on sodium is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:32:12] (03PS2) 10Muehlenhoff: Fix name for httpd::site and Icinga defs when using the staging flag [puppet] - 10https://gerrit.wikimedia.org/r/595550 [15:32:55] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10JHedden) These servers should mimic the network configuration we have in production: eth0 (ens2f0np0) on `public1-b-codfw` net... 
[15:38:24] (03CR) 10Jbond: [C: 03+1] Fix name for httpd::site and Icinga defs when using the staging flag [puppet] - 10https://gerrit.wikimedia.org/r/595550 (owner: 10Muehlenhoff) [15:38:33] moritzm: sorry, got distracted [15:38:41] (03PS2) 10Volans: icinga: fix passive Icinga meta-monitoring for VO [puppet] - 10https://gerrit.wikimedia.org/r/595515 (https://phabricator.wikimedia.org/T252401) [15:39:04] (03CR) 10Volans: "addressed comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595515 (https://phabricator.wikimedia.org/T252401) (owner: 10Volans) [15:39:15] (03PS3) 10Jcrespo: backups: Set backup1002 as an "in setup" system and disable notif. [puppet] - 10https://gerrit.wikimedia.org/r/595519 (https://phabricator.wikimedia.org/T250816) [15:39:44] (03CR) 10Jcrespo: [C: 03+2] backups: Set backup1002 as an "in setup" system and disable notif. [puppet] - 10https://gerrit.wikimedia.org/r/595519 (https://phabricator.wikimedia.org/T250816) (owner: 10Jcrespo) [15:39:55] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: fix passive Icinga meta-monitoring for VO [puppet] - 10https://gerrit.wikimedia.org/r/595515 (https://phabricator.wikimedia.org/T252401) (owner: 10Volans) [15:40:22] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) ` [edit interfaces interface-range vlan-public1-b-codfw] member ge-1/0/13 { ... } + member ge-1/0/4; [edit inte... [15:40:43] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) [15:41:17] All backports merged and ready for deployment [15:42:12] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[15:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:28] !log syncing backports to 1.35.0-wmf.31 (T249963) for T252179 [15:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:31] T249963: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 [15:42:31] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179 [15:43:20] (03PS1) 10Muehlenhoff: Revert "Enable staging IDP site for graphite" [puppet] - 10https://gerrit.wikimedia.org/r/595560 [15:44:17] (03PS1) 10Hnowlan: changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595561 [15:44:41] (03CR) 10Hnowlan: [C: 03+2] changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595561 (owner: 10Hnowlan) [15:44:44] (03PS1) 10Cmjohnson: Adding mgmt dns for db1141-48 [dns] - 10https://gerrit.wikimedia.org/r/595563 (https://phabricator.wikimedia.org/T251614) [15:44:59] (03Merged) 10jenkins-bot: changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595561 (owner: 10Hnowlan) [15:45:06] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[15:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:10] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for db1141-48 [dns] - 10https://gerrit.wikimedia.org/r/595563 (https://phabricator.wikimedia.org/T251614) (owner: 10Cmjohnson) [15:45:39] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Enable staging IDP site for graphite" [puppet] - 10https://gerrit.wikimedia.org/r/595560 (owner: 10Muehlenhoff) [15:48:46] (03PS1) 10Hnowlan: changeprop: Release 0.9.36 with eventservice URI corrected [deployment-charts] - 10https://gerrit.wikimedia.org/r/595571 [15:49:14] !log cdanis@cumin1001 conftool action : set/ttl=300; selector: dnsdisc=eventgate-analytics.* [15:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:30] (03PS2) 10Hnowlan: changeprop: Release 0.9.36 with eventservice URI corrected [deployment-charts] - 10https://gerrit.wikimedia.org/r/595571 [15:49:56] (03CR) 10Hnowlan: [C: 03+2] changeprop: Release 0.9.36 with eventservice URI corrected [deployment-charts] - 10https://gerrit.wikimedia.org/r/595571 (owner: 10Hnowlan) [15:50:16] (03Merged) 10jenkins-bot: changeprop: Release 0.9.36 with eventservice URI corrected [deployment-charts] - 10https://gerrit.wikimedia.org/r/595571 (owner: 10Hnowlan) [15:50:33] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10thc... 
[15:50:37] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10thcipriani) [15:50:41] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [15:51:05] (03PS2) 10Cmjohnson: Adding mgmt dns for db1141-48 [dns] - 10https://gerrit.wikimedia.org/r/595563 (https://phabricator.wikimedia.org/T251614) [15:51:53] (03CR) 10Dzahn: [C: 03+1] "looks consistent to me, but i don't know if these actually all need public IPs, wmcs would know better" [dns] - 10https://gerrit.wikimedia.org/r/595212 (owner: 10Papaul) [15:52:26] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:53] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:38] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10thc... 
[15:54:42] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [15:54:44] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10thcipriani) [15:54:48] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:56] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10thcipriani) [15:55:00] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10thc... 
[15:55:20] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10Ottomata) 05Open→03Declined [15:56:08] 10Operations, 10Analytics, 10observability: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10Ottomata) a:03Ottomata [15:56:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10observability: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10Ottomata) [15:56:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:56:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:51] (03PS1) 10Alexandros Kosiaris: admin: Increase CPU LimitRange to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595574 [15:57:18] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10thc... 
[15:57:21] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10thcipriani) [15:58:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Increase CPU LimitRange to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595574 (owner: 10Alexandros Kosiaris) [15:58:35] (03Merged) 10jenkins-bot: admin: Increase CPU LimitRange to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/595574 (owner: 10Alexandros Kosiaris) [15:59:06] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [16:01:34] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) [16:01:41] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 (owner: 10Papaul) [16:01:49] (03PS4) 10Papaul: DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 [16:02:44] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@82276cb]: Enabling consumption of purges topic [16:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:54] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) [16:03:00] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/Translate: [[gerrit:595135|Revert "Remove uses of WikiPage::doEditContent"]] (duration: 01m 08s) [16:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:41] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/Babel: [[gerrit:595077|Revert "Remove use of WikiPage::doEditContent"]] (duration: 01m 07s) [16:04:42] !log hnowlan@deploy1001 Finished deploy 
[changeprop/deploy@82276cb]: Enabling consumption of purges topic (duration: 01m 58s) [16:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:52] (03CR) 10Ottomata: Add Horizon webproxies that end in -beta.wmflabs.org to CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) (owner: 10Ottomata) [16:05:48] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/UploadWizard: [[gerrit:595078|Revert "Remove use of WikiPage::doEditContent"]] (duration: 01m 06s) [16:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:42] cccccckfurjedfuifrkbeebdknujkvrbullfibvkciuv [16:06:51] wow, that was *my cat* triggering my yubikey [16:06:55] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/WikimediaMaintenance: [[gerrit:595076|Revert "Remove use of WikiPage::doEditContent"]] (duration: 01m 06s) [16:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:57] what is this, a crossover episode? [16:07:07] cdanis: quality. [16:07:41] @brennen it looks like all backports merged properly - time to roll to mediawikiwiki and test? [16:07:53] *merged and synced [16:07:56] DannyS712: yep, syncing wikiversions.json momentarily. [16:08:35] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10Ottomata) > Set a log.message.timestamp.difference.max.ms value to reject logs with significant timestamp skew (andrew has an open Q to the kafka user mailing list re: what happens... 
[16:09:34] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1007.eqiad.wmnet ` The log can be found... [16:10:11] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [16:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:46] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: mediawikiwiki to 1.35.0-wmf.31 (T249963) for testing T252179 [16:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:50] T249963: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 [16:13:51] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179 [16:14:03] DannyS712: mediawiki.org is on .31 [16:14:40] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [16:14:57] testing [16:17:38] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [16:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:08] @brennen so far so good - going through some translations on mediawiki and autopatrol is applied properly. 
Next up: make sure it isn't applied when it shouldn't be [16:21:27] DannyS712: cool [16:21:29] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10thcipriani) [16:24:02] @brennen confirmed to work that it isn't applied improperly - I think this can be rolled out [16:24:20] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T220397 (10thcipriani) [16:24:23] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 2 others: TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10thcipriani) [16:24:38] brennen: There's a maintenance-script-only wmf.31 patch outstanding (for AbuseFilter); what are you thinking re. timing? [16:27:37] James_F: ready to roll out at any point, apart from that [16:28:04] Cool. I think we're ready for group0 and then 1. [16:28:05] 10Operations, 10serviceops: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (10thcipriani) [16:28:07] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10thcipriani) [16:28:16] 10Operations, 10serviceops: Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (10thcipriani) [16:28:42] James_F: ack. doing group0. [16:29:02] * James_F crosses everything. 
[16:29:50] (03PS1) 10Brennen Bearnes: group0 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595582 [16:29:52] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595582 (owner: 10Brennen Bearnes) [16:30:32] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595582 (owner: 10Brennen Bearnes) [16:32:00] (03PS5) 10Ppchelko: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:34:10] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.31 [16:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:16] (03PS1) 10Andrew Bogott: Add cloudcontrol2001-dev and 2003-dev back to openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/595583 (https://phabricator.wikimedia.org/T252121) [16:34:32] (03CR) 10jerkins-bot: [V: 04-1] Add cloudcontrol2001-dev and 2003-dev back to openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/595583 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [16:35:17] (03PS2) 10Andrew Bogott: Add cloudcontrol2001-dev and 2003-dev back to openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/595583 (https://phabricator.wikimedia.org/T252121) [16:36:21] (03CR) 10Andrew Bogott: [C: 03+2] Add cloudcontrol2001-dev and 2003-dev back to openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/595583 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [16:36:27] James_F, DannyS712: doubt we'll see anything on group0 since we didn't last go around. readying group1 patch. [16:36:42] Yeah. SGTM. 
[16:36:57] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10thcipriani) 05Open→03Invalid Untangling task trees for completed quarters: separated open subtasks, closed completed subtasks. [16:36:59] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595584 [16:37:01] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595584 (owner: 10Brennen Bearnes) [16:37:40] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595584 (owner: 10Brennen Bearnes) [16:38:07] T252179 can be marked as Resolved, I assume? [16:38:07] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179 [16:38:08] once group 1 is deployed I'll run through the same checks on metawiki [16:38:16] * James_F nods. [16:40:43] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [16:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:26] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.31 [16:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:45] Testing now [16:42:53] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:57] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[16:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:05] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kafka-jumbo1007.eqiad.wmnet'] ` [16:47:05] Is wmf.31 going to be rolled to group2 today? [16:47:09] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.31 (duration: 04m 43s) [16:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:35] Daimona: Probably. We'll see. [16:48:13] Ok, thanks -- I was asking because of the maint script [16:49:41] (03CR) 10Ppchelko: [C: 04-1] "few comments. One more thing: all the mappings in statsd-prometheus mapping have 'changeprop' in the names. Can we replace it with `$1` to " (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:49:48] Yeah. 
[16:52:16] (03PS3) 10Cmjohnson: Adding mgmt dns for db1141-49 [dns] - 10https://gerrit.wikimedia.org/r/595563 (https://phabricator.wikimedia.org/T251614) [16:53:15] (03PS4) 10Cmjohnson: Adding mgmt dns for db1141-49 [dns] - 10https://gerrit.wikimedia.org/r/595563 (https://phabricator.wikimedia.org/T251614) [16:53:47] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for db1141-49 [dns] - 10https://gerrit.wikimedia.org/r/595563 (https://phabricator.wikimedia.org/T251614) (owner: 10Cmjohnson) [16:56:36] @brennen confirmed to work properly on meta now [16:57:05] Declaring T252179 resolved [16:57:06] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179 [16:57:43] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) I had to abort the wmf reimage script because it wasn't getting to the point of running puppet, then I accepted manually the new puppet cert and ran... [16:57:58] DannyS712, James_F: cool. i'm not seeing any other breakage; shall we go to all wikis? [16:58:33] sure [16:59:10] +1 [16:59:26] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595587 [16:59:28] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595587 (owner: 10Brennen Bearnes) [16:59:41] (03PS2) 10ArielGlenn: in page content fixup script, check for truncation, move into place if good [dumps] - 10https://gerrit.wikimedia.org/r/595518 [17:00:04] gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T1700). 
[17:00:10] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595587 (owner: 10Brennen Bearnes) [17:01:11] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1008.eqiad.wmnet ` The log can be found... [17:04:19] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:04:30] Oh, bugger. [17:04:49] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31 [17:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:54] 👀 [17:05:15] brennen: There was a should-be UBN (T251521) that was fixed in master but not back-ported. [17:05:16] T251521: Regression: Vector skin did not populate all variants option in the variant drop-down menu - https://phabricator.wikimedia.org/T251521 [17:05:22] brennen: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/595588 [17:05:51] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:05:53] James_F: quick backport or rollback? [17:05:58] Backport. [17:06:12] Want to take it or should I? [17:06:23] James_F: if you don't mind... [17:06:29] Of course. 
:-) [17:06:36] * brennen staring at logs [17:07:41] (03PS1) 10JMeybohm: New upstream version 2.16.7 [debs/helm] - 10https://gerrit.wikimedia.org/r/595591 (https://phabricator.wikimedia.org/T252428) [17:14:05] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [17:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:24] (03CR) 10JMeybohm: "2.16.7 upstream already imported" [debs/helm] - 10https://gerrit.wikimedia.org/r/595591 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [17:20:10] @hashar I left some notes at T249964 to hopefully avoid a repeat of this train [17:20:11] T249964: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 [17:21:48] DannyS712, thank you for being proactive and productive [17:23:25] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1008.eqiad.wmnet'] ` and were **ALL** successful. [17:23:35] No problem. The big patch I was hoping to have merged before tomorrow (T250761) hasn't merged yet, so that is next week's problem [17:23:35] T250761: DifferenceEngine $mNewRev and $mOldRev are Revision objects - https://phabricator.wikimedia.org/T250761 [17:23:37] so far the only revision-related errors i'm seeing since deploy to all wikis are the known CategoryMembershipChangeJob ones from T212428. [17:23:38] T212428: includes/Revision/RevisionStore.php: Main slot of revision (number) not found in database! 
- https://phabricator.wikimedia.org/T212428 [17:24:08] Suggest backporting https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/595147/ [17:24:15] (03PS1) 10Cmjohnson: Adding production dns for db1141-1149 [dns] - 10https://gerrit.wikimedia.org/r/595594 (https://phabricator.wikimedia.org/T251614) [17:24:32] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.31/skins/Vector/includes/VectorTemplate.php: T251521 Correctly populate the language variants drop-down rather than breaking early (duration: 00m 59s) [17:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:35] T251521: Regression: Vector skin did not populate all variants option in the variant drop-down menu - https://phabricator.wikimedia.org/T251521 [17:25:06] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [17:27:08] (03CR) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [17:27:29] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.30/skins/Vector/includes/VectorTemplate.php: T251521 Correctly populate the language variants drop-down rather than breaking early (duration: 00m 59s) [17:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:54] (03PS1) 10Cmjohnson: Adding db1141-1149 mac addresses to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/595595 (https://phabricator.wikimedia.org/T251614) [17:30:06] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for db1141-1149 [dns] - 10https://gerrit.wikimedia.org/r/595594 (https://phabricator.wikimedia.org/T251614) (owner: 10Cmjohnson) [17:30:06] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/AbuseFilter/maintenance/updateVarDumps.php: updateVarDumps: wait for replication after each batch (duration: 00m 58s) 
[17:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:09] DannyS712: Agreed, that seems like a good idea. [17:30:19] brennen: OK for me to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/595596 too? [17:30:27] Daimona: Deployed the script fix. [17:30:41] (03CR) 10Cmjohnson: [C: 03+2] Adding db1141-1149 mac addresses to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/595595 (https://phabricator.wikimedia.org/T251614) (owner: 10Cmjohnson) [17:31:22] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10Gehel) [17:31:25] James_F: yeah, sounds good to me. [17:32:15] (03PS1) 10Jdlrobson: Enable modern Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) [17:32:25] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Cmjohnson) [17:33:02] James_F: Cool, thank you. When would you be available to run it? [17:33:26] (03PS2) 10Jdlrobson: Enable modern Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) [17:36:18] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1009.eqiad.wmnet ` The log can be found... [17:40:08] Daimona: Maybe in half an hour's time? 
[17:40:34] (03PS3) 10Jdlrobson: Enable modern Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) [17:41:43] (03PS4) 10Jdlrobson: Enable modern Vector on officewiki and reveal preference on test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) [17:41:54] (03PS1) 10Marostegui: install_server: db114[1-9] need to be installed with buster [puppet] - 10https://gerrit.wikimedia.org/r/595600 (https://phabricator.wikimedia.org/T251614) [17:41:54] James_F: I might have to leave before the script finishes, but WFM [17:42:03] Perhaps just group1, or group0 [17:42:32] Any chance someone can review https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594807/ to add some tests? Probably would have caught some of the errors in the last train, hopefully will prevent errors with future changes [17:42:35] If needed I can leave you a way to contact me if something explodes [17:42:39] (03CR) 10Marostegui: [C: 03+2] install_server: db114[1-9] need to be installed with buster [puppet] - 10https://gerrit.wikimedia.org/r/595600 (https://phabricator.wikimedia.org/T251614) (owner: 10Marostegui) [17:42:59] Daimona: I was first going to run it just on aawiki. :-) [17:43:44] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Marostegui) @Cmjohnson I have ammended your patch for the DHCP to make sure they use the Buster installer. [17:44:07] Even better, yes :) [17:44:45] But first, let's finish the backports. 
[17:47:05] Sure, ping me when ready [17:47:12] I should be available in roughly 20 mins [17:48:54] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1148.eqiad.wmnet ` The log can be... [17:48:57] (03PS1) 10Ppchelko: Preserve datetime field of the purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/595601 (https://phabricator.wikimedia.org/T252127) [17:49:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [17:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:12] Krinkle: so for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/595545 do you think I should just set my webproxy domains explicitly? [17:55:59] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1149.eqiad.wmnet ` The log can be... [17:56:22] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1009.eqiad.wmnet'] ` and were **ALL** successful. [17:56:36] (03PS1) 10Dave Pifke: Decommission old ArcLamp HHVM pipeline [puppet] - 10https://gerrit.wikimedia.org/r/595602 (https://phabricator.wikimedia.org/T233884) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T1800). [18:00:04] Zoranzoki21 and Jdlrobson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:48] (Still backporting.) [18:00:59] James_F: OK let me know when you're done [18:01:13] o/ [18:01:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1141.eqiad.wmnet ` The log can be... [18:01:32] * James_F is mostly twiddling thumbs waiting for CI, sadly. [18:01:43] stashbot: now [18:01:43] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [18:01:47] oops [18:02:07] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1142.eqiad.wmnet ` The log can be... [18:02:18] jouncebot: now [18:02:18] For the next 0 hour(s) and 57 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T1800) [18:02:22] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1148.eqiad.wmnet ` The log can be... [18:02:24] jouncebot: next [18:02:24] In 1 hour(s) and 57 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T2000) [18:03:04] SWAT is now :D [18:03:38] Zoranzoki21: UBNs always take priority. 
[18:03:42] Can I still add something to SWAT? [18:03:53] Sure, if Ro's OK with that. [18:03:54] there are some ongoing issues on s6 after the last deployment [18:04:10] jynus: From the train? [18:04:17] DannyS712: Yes feel free [18:04:19] https://grafana.wikimedia.org/d/000000273/mysql?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1113&var-port=13316&from=1589209454600&to=1589220254600 [18:04:23] James_F: I thought I was late because I had to upgrade the system on my laptop [18:04:23] But we'll have to wait for James to be done first [18:04:31] lots of inserts causing lag [18:04:38] Hmm. [18:04:58] which in any case is causing log spam [18:05:08] s6 is only frwiki/jawiki/ruwiki? [18:05:33] Why'd that be so much higher? Nothing particularly odd about those wikis in config IIRC. [18:05:43] https://logstash.wikimedia.org/goto/4c54a3b3ae512cb822436dd174647e4e [18:05:55] not necessarily the deployment, but the times match [18:06:01] the relation is a guess [18:06:07] (03PS8) 10DannyS712: Remove "Create a book" link on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683) [18:06:13] * James_F nods. [18:06:19] No, I'd suspect the deployment too. [18:06:27] may I ask you to keep an eye on it? [18:06:32] not necessarily you, ofc [18:06:33] Of course. [18:06:51] someone should keep an eye on it, maybe try to debug some weirdness [18:06:52] (03CR) 10Krinkle: Add Horizon webproxies that end in -beta.wmflabs.org to CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) (owner: 10Ottomata) [18:06:59] as I was about to finish my day [18:07:10] From LinksUpdate jobs it seems? [18:07:22] Maybe someone edited a mega template there. [18:07:28] that's an rc replica, it is the first that could show signs of load issues [18:07:39] as they have fewer resources than other replicas [18:07:45] Ack. 
[18:07:52] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) 05Open→03Resolved [18:08:00] again, if temporary no issue, just keep an eye [18:08:06] Right. [18:08:30] (03Abandoned) 10Ottomata: Add Horizon webproxies that end in -beta.wmflabs.org to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595545 (https://phabricator.wikimedia.org/T252417) (owner: 10Ottomata) [18:08:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:43] 10Operations, 10ops-eqiad: Degraded RAID on kafka-jumbo1001 - https://phabricator.wikimedia.org/T251586 (10Cmjohnson) 05Open→03Resolved the disk was replaced and is rebuilding...resolving the task, if it fails please let me know. The tracking number for the return disk is USPS 9202 3946 5301 2445 77393... [18:11:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:51] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.31/includes/Revision/RevisionStore.php: T252156 T212428 RevisionStore: fall back to master db if main slot is missing (duration: 00m 58s) [18:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:54] T212428: includes/Revision/RevisionStore.php: Main slot of revision (number) not found in database! - https://phabricator.wikimedia.org/T212428 [18:11:54] T252156: Increase in "Main slot of revision [number] not found in database!" 
after deploy of 1.35.0-wmf.31 to all wikis - https://phabricator.wikimedia.org/T252156 [18:12:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1143.eqiad.wmnet ` The log can be found in `/var/log/wmf... [18:14:02] sorry, James_F it is actually a dump/vslow host, so it is of very low importance [18:14:11] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10Cmjohnson) I just attempted to use ge-1/0/6 and it did not work [18:14:12] Aha, OK, that makes me feel better. [18:14:20] it could be that just at that time maintenance was running and made it lag [18:14:23] jynus: Thanks for bringing it up though. [18:14:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:14:25] but it shouldn't affect [18:14:25] * James_F nods. [18:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:30] webrequests [18:14:32] sorry [18:14:46] I mistook the role of that server [18:14:46] Better to worry about ghosts than ignore real issues. [18:15:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:27] whew. [18:15:27] RoanKattouw: Ready for SWAT. [18:15:36] * James_F will be around if needed. [18:15:38] yay [18:15:47] cool [18:15:53] brennen: Let's call the train deployed and you can go have some tea? 
:-) [18:16:33] (03PS2) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) [18:16:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:03] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1149.eqiad.wmnet'] ` and were **ALL** successful. [18:17:23] James_F: sounds a plan. i will in fact take a few for tea, but keeping an eye on it rest of the day. [18:17:27] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1144.eqiad.wmnet ` The log can be found in `/var/log/wmf... [18:17:37] * James_F grins. [18:17:48] Alright! Let's do Zoranzoki21's patches first [18:18:08] (03CR) 10Catrope: [C: 03+2] Drop scowiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595141 (https://phabricator.wikimedia.org/T252048) (owner: 10Zoranzoki21) [18:18:10] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10Cmjohnson) [18:18:22] (03CR) 10Catrope: [C: 03+2] Drop itwiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595142 (https://phabricator.wikimedia.org/T252065) (owner: 10Zoranzoki21) [18:18:24] Cool.. 
I'm ready, Linux Mint is fully operational [18:18:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10Cmjohnson) 05Open→03Resolved the ops-eqiad portion of this task has been completed. Thank you for finishing the install @jcrespo/@marostegui [18:19:04] (03Merged) 10jenkins-bot: Drop scowiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595141 (https://phabricator.wikimedia.org/T252048) (owner: 10Zoranzoki21) [18:19:07] (03Merged) 10jenkins-bot: Drop itwiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595142 (https://phabricator.wikimedia.org/T252065) (owner: 10Zoranzoki21) [18:19:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:16] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:49] Which IRC clients you use? :) [18:22:03] I use IRCcloud [18:22:46] Zoranzoki21: Your scowiki+itwiki patches are now on mwdebug1002, please test [18:23:01] I figured I'd combine those two, because they're the same change just on two different wikis [18:23:33] (for my question) Cool, I think that I should use it also [18:23:37] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1141.eqiad.wmnet'] ` and were **ALL** successful. [18:23:48] Will test patches [18:23:52] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1142.eqiad.wmnet'] ` and were **ALL** successful. 
[18:23:57] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1145.eqiad.wmnet ` The log can be found in `/var/log/wmf... [18:24:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1146.eqiad.wmnet ` The log can be found in `/var/log/wmf... [18:25:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:28] 10Operations, 10DBA, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) last host needed, backup1002 is finally fully setup, HW and OS-wise and ready to implement the last part of external storage backups (cross-dc redundancy). [18:25:32] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1148.eqiad.wmnet'] ` and were **ALL** successful. [18:25:42] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1147.eqiad.wmnet ` The log can be found in `/var/log/wmf... [18:26:27] RoanKattouw: Itwiki looks good to me [18:26:31] Jdlrobson: One wiki less :) [18:26:37] Testing scowiki now... [18:26:54] w00t [18:26:55] thanks Zoranzoki21 ! 
[18:27:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:05] RoanKattouw: Scowiki is excellent, let's go [18:28:46] !log catrope@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: Drop mainpage special casing for scowiki and itwiki (T252048, T252065) (duration: 00m 58s) [18:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:51] T252065: Turn off main page special casing on itwiki - https://phabricator.wikimedia.org/T252065 [18:28:51] T252048: Disable $wgMFSpecialCaseMainPage for Scots Wikipedia - https://phabricator.wikimedia.org/T252048 [18:30:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:31] Jdlrobson: Should I close tasks as resolved? [18:31:20] RoanKattouw: Changes are deployed as I see, and I checked production via phone, everything is okay as should be :) [18:31:40] (03CR) 10DannyS712: wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [18:32:21] (03CR) 10Jcrespo: "I saw the patch, but didn't have the time to read it. Add me to "reviewer" (jcrespo) when you want me to start the review so I don't miss " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [18:32:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1143.eqiad.wmnet'] ` and were **ALL** successful. 
[18:32:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:55] (03PS2) 10Catrope: Add tw-photomedia.de in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595146 (https://phabricator.wikimedia.org/T252141) (owner: 10Zoranzoki21) [18:35:55] Zoranzoki21: feel free to resolve yes :) [18:36:11] Jdlrobson: Will do [18:36:19] (03PS3) 10Catrope: Add tw-photomedia.de in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595146 (https://phabricator.wikimedia.org/T252141) (owner: 10Zoranzoki21) [18:36:26] (03CR) 10Catrope: [C: 03+2] Add tw-photomedia.de in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595146 (https://phabricator.wikimedia.org/T252141) (owner: 10Zoranzoki21) [18:37:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:09] (03Merged) 10jenkins-bot: Add tw-photomedia.de in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595146 (https://phabricator.wikimedia.org/T252141) (owner: 10Zoranzoki21) [18:38:13] (03CR) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [18:38:32] James_F: I'm almost off for today, so perhaps it's better to postpone. However, I just realized that no-one ever did a dry-run in prod. 
If you're able to do that later, I can review it tomorrow [18:38:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:38] 10Operations, 10Privacy Engineering, 10Research, 10Traffic, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10leila) @bmansurov thanks for the heads up. those removals are fine. (and btw, I expect James Fishback t... [18:38:45] Zoranzoki21: The copyuploads patch is now on mwdebug1002, please test [18:38:59] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1144.eqiad.wmnet'] ` and were **ALL** successful. [18:39:02] RoanKattouw: It should work, you can deploy it [18:39:16] We usually deploy it to production directly [18:39:26] As I know :) [18:39:29] OK [18:39:33] Daimona: I was going to do a dry run first. :-) But sure. 
[18:39:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:43] Yeah I mean, you can do that without me [18:40:04] So we don't have to do that next time [18:40:20] (03PS5) 10Catrope: Enable modern Vector on officewiki and reveal preference on test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) (owner: 10Jdlrobson) [18:40:24] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add tw-photometa.de to $wgCopyUploadsDomains (T252141) (duration: 00m 58s) [18:40:25] (03CR) 10Catrope: [C: 03+2] Enable modern Vector on officewiki and reveal preference on test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) (owner: 10Jdlrobson) [18:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:26] T252141: Add tw-photomedia.de to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T252141 [18:40:37] Jdlrobson: You're up next [18:40:55] I'm not very available lately... [18:41:04] Daimona: No worries. 
[18:41:19] (03Merged) 10jenkins-bot: Enable modern Vector on officewiki and reveal preference on test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595597 (https://phabricator.wikimedia.org/T251285) (owner: 10Jdlrobson) [18:41:28] Sure, just trying to make things simpler (hopefully) :-) [18:41:30] RoanKattouw: I will have two patches more for wgCopyUploadDomains [18:41:58] Zoranzoki21: OK, I can do those after Jdlrobson's patch and DannyS712's patch [18:42:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] Jdlrobson: Your Vector-on-officewiki patch is on mwdebug1002, please test [18:42:19] RoanKattouw: Yes, until it I'll be ready [18:42:19] sweeeett [18:42:24] And I will add patches in calendar [18:42:41] RoanKattouw: sync please :D [18:44:16] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable modern Vector on officewiki, reveal preference on testwiki (T251285) (duration: 00m 58s) [18:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:20] T251285: Deploy new header to officewiki and testwiki - https://phabricator.wikimedia.org/T251285 [18:44:52] (03PS1) 10Zoranzoki21: wgCopyUploadDomains: Partial revert of I30a4b8c9bb9c1240d7e7422446af55ad50c41e70 to make upload working from bollywoodhungama.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595610 (https://phabricator.wikimedia.org/T235415) [18:45:13] thanks RoanKattouw ! [18:45:18] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1146.eqiad.wmnet'] ` and were **ALL** successful. 
[18:45:35] (03PS2) 10Zoranzoki21: wgCopyUploadDomains: Partial revert of I30a4b8c9bb9c1240d7e7422446af55ad50c41e70 to make upload working from bollywoodhungama.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595610 (https://phabricator.wikimedia.org/T235415) [18:46:55] 10Operations, 10Security-Team, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) [18:47:01] (03CR) 10Privacybatm: "Please review this when you have free time. I am currently working on the refactoring issue :)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [18:47:04] (03PS9) 10Catrope: Remove "Create a book" link on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683) (owner: 10DannyS712) [18:47:41] (03CR) 10Catrope: [C: 03+2] Remove "Create a book" link on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683) (owner: 10DannyS712) [18:47:50] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1147.eqiad.wmnet'] ` and were **ALL** successful. 
[18:47:57] DannyS712: You're up next [18:48:06] Ready to test when called upon [18:48:26] (03Merged) 10jenkins-bot: Remove "Create a book" link on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683) (owner: 10DannyS712) [18:49:04] DannyS712: It's on mwdebug1002 now, please test [18:49:17] (03PS1) 10Zoranzoki21: wgCopyUploadDomains: Add *.britishmuseum.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595614 (https://phabricator.wikimedia.org/T251882) [18:49:39] Confirmed to work [18:50:40] OK, syncing [18:50:47] (03PS2) 10Zoranzoki21: wgCopyUploadDomains: Add *.britishmuseum.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595614 (https://phabricator.wikimedia.org/T251882) [18:51:14] RoanKattouw: Cool, and I'm ready [18:51:32] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1145.eqiad.wmnet ` The log can be found in `/var/log/wmf... 
[18:51:32] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove "Create a book" link on enwiki (T241683) (duration: 00m 57s) [18:51:34] 595610 and 595614 [18:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:35] T241683: Remove "Create a book" link from sidebar on English Wikipedia - https://phabricator.wikimedia.org/T241683 [18:51:48] (03PS3) 10Zoranzoki21: wgCopyUploadDomains: Partial revert of I30a4b8c9bb9c1240d7e7422446af55ad50c41e70 to make upload working from bollywoodhungama.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595610 (https://phabricator.wikimedia.org/T235415) [18:51:57] (03PS3) 10Zoranzoki21: wgCopyUploadDomains: Add *.britishmuseum.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595614 (https://phabricator.wikimedia.org/T251882) [18:53:21] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1145.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1145.eqiad.wmnet'] ` [18:54:05] Krinkle: in your ATS performance.wikimedia.beta.wmflabs.org rule, is there a way to specify the port? [18:54:19] or do I need to put the port into the configured endpoint URL the client will use? 
[18:54:25] the backend port* [18:54:36] (03PS4) 10Catrope: wgCopyUploadDomains: Partial revert of I30a4b8c9bb9c1240d7e7422446af55ad50c41e70 to make upload working from bollywoodhungama.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595610 (https://phabricator.wikimedia.org/T235415) (owner: 10Zoranzoki21) [18:54:47] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Cmjohnson) [18:57:30] (03CR) 10Catrope: [C: 03+2] wgCopyUploadDomains: Partial revert of I30a4b8c9bb9c1240d7e7422446af55ad50c41e70 to make upload working from bollywoodhungama.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595610 (https://phabricator.wikimedia.org/T235415) (owner: 10Zoranzoki21) [18:57:43] ottomata: I defer to SRE for that. I don't know. You might be able to cargocult/copy something based on examples in prod puppet.git hiera files for that key [18:58:06] so for your performance example, it just does port 80 [18:58:09] 80 -> 80?
[18:58:22] (03Merged) 10jenkins-bot: wgCopyUploadDomains: Partial revert of I30a4b8c9bb9c1240d7e7422446af55ad50c41e70 to make upload working from bollywoodhungama.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595610 (https://phabricator.wikimedia.org/T235415) (owner: 10Zoranzoki21) [18:59:02] OH i do see some examples there [18:59:03] ottomata: ish, it doesn't specify incoming port there, that's implicit afaik (it's 443 I guess) [18:59:04] yes i can give port [18:59:05] cool [18:59:19] but yeah, for us the webperf host is port 80 [18:59:46] (03PS4) 10Catrope: wgCopyUploadDomains: Add *.britishmuseum.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595614 (https://phabricator.wikimedia.org/T251882) (owner: 10Zoranzoki21) [19:00:06] (03CR) 10Catrope: [C: 03+2] wgCopyUploadDomains: Add *.britishmuseum.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595614 (https://phabricator.wikimedia.org/T251882) (owner: 10Zoranzoki21) [19:00:59] (03Merged) 10jenkins-bot: wgCopyUploadDomains: Add *.britishmuseum.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595614 (https://phabricator.wikimedia.org/T251882) (owner: 10Zoranzoki21) [19:01:09] hopefully https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/master/deployment-prep/deployment-cache-text.yaml#99 will do me [19:02:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add *.bollywoodhungama.in and *.britishmuseum.org to $wgCopyUploadDomains (T235414, T251882) (duration: 00m 57s) [19:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:41] T235414: Deploy core version of watchlist for AMC users - https://phabricator.wikimedia.org/T235414 [19:02:41] T251882: Whitelist for upload by url https://www.britishmuseum.org/api/ - https://phabricator.wikimedia.org/T251882 [19:03:23] !log T235414 is wrong task number, T235415 is correct [19:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:03:27] T235415: Copy uploads not working for https://www.bollywoodhungama.com - https://phabricator.wikimedia.org/T235415 [19:05:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:11:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:14:10] (03CR) 10Jeena Huneidi: parsoid: Add TLS termination support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/595505 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [19:14:14] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1145.eqiad.wmnet'] ` and were **ALL** successful. 
[19:18:53] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Cmjohnson) [19:19:26] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Cmjohnson) 05Open→03Resolved These are all yours @Marostegui [19:22:08] (03PS1) 10Ottomata: Set intake-logging,analytics beta URLs to use ATS defined endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595620 (https://phabricator.wikimedia.org/T252417) [19:24:16] (03CR) 10Ottomata: [C: 03+2] Set intake-logging,analytics beta URLs to use ATS defined endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595620 (https://phabricator.wikimedia.org/T252417) (owner: 10Ottomata) [19:29:04] (03PS1) 10Ottomata: jdlrobson - set krb: present [puppet] - 10https://gerrit.wikimedia.org/r/595621 (https://phabricator.wikimedia.org/T252222) [19:29:22] (03CR) 10QEDK: [C: 03+1] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [19:30:07] (03CR) 10Ottomata: [C: 03+2] jdlrobson - set krb: present [puppet] - 10https://gerrit.wikimedia.org/r/595621 (https://phabricator.wikimedia.org/T252222) (owner: 10Ottomata) [19:39:38] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Marostegui) Thank you! They look good: ` _____FORMATTED_OUTPUT_____ db1141.eqiad.wmnet: Filesystem Type Size Used Avail Use% Mounted on db1141.eqiad.wmne... 
[19:39:45] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (13) node(s) change every puppet run: cloudservices2003-dev.wikimedia.org, db1142.eqiad.wmnet, kafka-jumbo1008.eqiad.wmnet, db1144.eqiad.wmnet, db1149.eqiad.wmnet, kafka-jumbo1007.eqiad.wmnet, db1148.eqiad.wmnet, db1146.eqiad.wmnet, db1143.eqiad.wmnet, db1145.eqiad.wmnet, db1147.eqiad.wmnet, db1141.eqiad.wmn [19:39:45] 09.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [20:00:04] halfak and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T2000). [20:00:27] (03PS1) 10Ottomata: Configure wgEventLoggingSchemas overrides in beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) [20:03:29] (03CR) 10Krinkle: "It seems the labs setting might not be working as intended. View the "diffConfig" build output to see, but all the labs wikis only have "T" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:08:54] pssh hm [20:10:33] 10Operations, 10ops-eqiad, 10DC-Ops: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856 (10Cmjohnson) 05Open→03Resolved Most of these decom'd already. 
[20:10:35] 10Operations: eqiad out of warranty spares to decommission - approval request - https://phabricator.wikimedia.org/T120679 (10Cmjohnson) [20:12:12] (03PS1) 10Ppchelko: Change-prop: add produce metric mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/595637 [20:12:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: scb1001: Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T250482 (10Cmjohnson) 05Stalled→03Declined Since we are not replacing, just setup a decom task when ready [20:13:59] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.esams.wikimedia.org, port=443): Read timed out.,): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [20:15:21] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:16:09] I think maybe +default doesn't work [20:16:15] i think that only works on specific wikis and tags.... i think [20:19:18] !log cdanis@cumin1001 START - Cookbook sre.network.cf [20:19:18] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [20:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:25] (03CR) 10Ottomata: "From reading some code, I _think_ that + merging does not work with 'default'. Trying..."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:23:42] (03PS2) 10Ottomata: Configure wgEventLoggingSchemas overrides in beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) [20:28:01] (03CR) 10Ottomata: "Hm, I think https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/2370/console is right. It does" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:31:08] (03CR) 10Daniel Kinzler: wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [20:32:05] (03CR) 10Jdlrobson: [C: 03+1] "Perfect! Just seen - https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#Main_Page_on_mobile" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [20:33:00] (03CR) 10Ottomata: "Will merge this tomorrow morning and test some tests!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:35:33] 10Operations, 10ops-eqiad, 10serviceops: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10wiki_willy) Hi @elukey or @Dzahn - just wanted to follow up on this, to see if it's worth buying parts to keep this server online, especially with all the previous issues... 
[20:43:37] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 54.47 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:44:33] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 22.58 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:47:19] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 77.79 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:50:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 98.81 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:52:37] (03PS1) 10Ayounsi: Generate blackhole prefix-list from private list [homer/public] - 10https://gerrit.wikimedia.org/r/595647 [20:54:35] (03CR) 10Ayounsi: "Tested with:" [homer/public] - 10https://gerrit.wikimedia.org/r/595647 (owner: 10Ayounsi) [20:55:06] (03CR) 10CDanis: [C: 03+1] Generate blackhole prefix-list from private list [homer/public] - 10https://gerrit.wikimedia.org/r/595647 (owner: 10Ayounsi) [21:00:04] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T2100). 
[21:00:36] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [21:00:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [21:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:09] (03PS1) 10Ottomata: systemd::timer::job - add ability to syslog to file based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) [21:02:03] (03CR) 10Ayounsi: [C: 03+2] Generate blackhole prefix-list from private list [homer/public] - 10https://gerrit.wikimedia.org/r/595647 (owner: 10Ayounsi) [21:02:11] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job - add ability to syslog to file based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [21:02:28] (03Merged) 10jenkins-bot: Generate blackhole prefix-list from private list [homer/public] - 10https://gerrit.wikimedia.org/r/595647 (owner: 10Ayounsi) [21:02:35] (03PS1) 10Ryan Kemper: sre.wdqs.data-transfer: fix syntax, simplify rule [cookbooks] - 10https://gerrit.wikimedia.org/r/595649 (https://phabricator.wikimedia.org/T206951) [21:06:30] 10Operations, 10Sustainability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) [21:09:07] 10Operations, 10Sustainability (MediaWiki-MultiDC), 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) [21:09:39] 10Operations, 10Analytics, 10Tools, 10Wikimedia-IRC-RC-Server, 10Code-Stewardship-Reviews: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (10Krinkle) [21:10:40] (03PS3) 10Ryan Kemper: sre.wdqs.data-transfer: fix syntax, simplify rule [cookbooks] - 10https://gerrit.wikimedia.org/r/595061 
(https://phabricator.wikimedia.org/T206951) [21:16:49] (03PS1) 10Andrew Bogott: role::puppetmaster::standalone: explicitly include httpd [puppet] - 10https://gerrit.wikimedia.org/r/595650 [21:17:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:36] Hey all - was going to deploy updated change to PS.php for T250887 unless there are objections... [21:19:45] (03PS1) 10CDanis: prepend {es,kn}ams [homer/public] - 10https://gerrit.wikimedia.org/r/595651 [21:22:45] (03PS1) 10Papaul: Add cloudceph200[1-3]-dev MAC address, partman with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/595652 (https://phabricator.wikimedia.org/T250846) [21:23:37] (03CR) 10Ayounsi: [C: 03+2] prepend {es,kn}ams [homer/public] - 10https://gerrit.wikimedia.org/r/595651 (owner: 10CDanis) [21:26:17] (03PS2) 10Papaul: Add cloudceph200[1-3]-dev MAC address with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/595652 (https://phabricator.wikimedia.org/T250846) [21:28:18] (03CR) 10Papaul: [C: 03+2] Add cloudceph200[1-3]-dev MAC address with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/595652 (https://phabricator.wikimedia.org/T250846) (owner: 10Papaul) [21:30:51] (03PS4) 10Ryan Kemper: sre.wdqs.data-transfer: fix syntax, simplify rule [cookbooks] - 10https://gerrit.wikimedia.org/r/595061 (https://phabricator.wikimedia.org/T206951) [21:31:25] PROBLEM - SSH on dumpsdata1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[21:31:41] (03PS5) 10Ryan Kemper: sre.wdqs.data-transfer: fix syntax, simplify rule [cookbooks] - 10https://gerrit.wikimedia.org/r/595061 (https://phabricator.wikimedia.org/T206951) [21:46:07] (03PS5) 10Brennen Bearnes: logspam-watch: add time & sortable columns, improve formatting [puppet] - 10https://gerrit.wikimedia.org/r/593936 (https://phabricator.wikimedia.org/T242882) [21:52:32] (03CR) 10Hashar: "Thanks for the component!" [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [21:52:57] (03PS2) 10Hashar: Initially use Java 8 for contint on Buster [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [21:53:21] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [21:53:23] (03CR) 10jerkins-bot: [V: 04-1] Initially use Java 8 for contint on Buster [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [21:55:22] PROBLEM - MariaDB Slave Lag: s5 on db2113 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 702.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:55:24] PROBLEM - MariaDB Slave Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 706.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:57:33] (03PS3) 10Hashar: Initially use Java 8 for contint on Buster [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [21:57:34] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [21:57:36] (03CR) 10jerkins-bot: [V: 04-1] Initially use Java 8 for contint on Buster [puppet] - 
https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: Muehlenhoff)
[21:58:40] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 901.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[21:58:40] PROBLEM - MariaDB Slave Lag: s5 on db2128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 901.78 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:00:04] gehel and maryum: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T2200).
[22:05:34] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1314.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:10:36] Operations, SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (diego)
[22:12:46] PROBLEM - MariaDB Slave Lag: s5 on db2111 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1747.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:18:38] PROBLEM - MariaDB Slave SQL: s5 on db2123 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 185.107.95.230-0-0-0 for key ipb_address_unique on query. Default database: dewiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:23:46] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2408.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:29:22] PROBLEM - MariaDB Slave Lag: s5 on db2123 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2742.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:32:06] RECOVERY - SSH on dumpsdata1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:33:28] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2990.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:48:30] Operations, Wikimedia-Mailing-lists: Mailing-list sending notifications for inexistent spam messages - https://phabricator.wikimedia.org/T251816 (colewhite) Open→Resolved Closing as duplicate of parent task.
[22:48:33] Operations, Mail, Wikimedia-Mailing-lists, Patch-For-Review: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (colewhite)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200511T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:07:56] (PS1) Bstorm: puppetmaster: fix standalone puppetmasters [puppet] - https://gerrit.wikimedia.org/r/595701
[23:12:30] (CR) Bstorm: "Trying to determine if this will be a sufficient fix or not quickly." [puppet] - https://gerrit.wikimedia.org/r/595701 (owner: Bstorm)
[23:16:20] (PS2) Bstorm: puppetmaster: fix standalone puppetmasters [puppet] - https://gerrit.wikimedia.org/r/595701
[23:20:03] (PS1) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/595706 (https://phabricator.wikimedia.org/T244153)
[23:20:29] (CR) jerkins-bot: [V: -1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/595706 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[23:20:51] (CR) Bstorm: "I see we were looking at the same problem https://gerrit.wikimedia.org/r/c/operations/puppet/+/595701" [puppet] - https://gerrit.wikimedia.org/r/595650 (owner: Andrew Bogott)
[23:21:30] (PS11) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153)
[23:21:42] (Abandoned) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/595706 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[23:21:52] (CR) jerkins-bot: [V: -1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[23:22:55] (PS12) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153)
[23:23:17] (CR) jerkins-bot: [V: -1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[23:25:22] (CR) Bstorm: [C: +2] puppetmaster: fix standalone puppetmasters [puppet] - https://gerrit.wikimedia.org/r/595701 (owner: Bstorm)
[23:30:16] (PS1) CRusnov: netbox: Change scripts.cfg group to www-data so that external scripts can access [puppet] - https://gerrit.wikimedia.org/r/595709
[23:30:55] (CR) CRusnov: [C: +2] "Self-merging to unblock since it is trivial. This seems to have been the generally agreed upon solution." [puppet] - https://gerrit.wikimedia.org/r/595709 (owner: CRusnov)
[23:34:28] (CR) CRusnov: "Poke to recheck this. I have tested the scrape address and it spits out the expected Prometheus output." [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[23:43:54] (PS4) CRusnov: prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927)
[23:44:56] (CR) jerkins-bot: [V: -1] prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[23:46:45] (PS5) CRusnov: prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927)
[23:48:56] (CR) CRusnov: "Okay I have renamed the variable as suggested (and fixed a merge issue)." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)