[00:00:04] (CR) jerkins-bot: [V: -1] Define a shared profile to remove duplication in roles [puppet] - https://gerrit.wikimedia.org/r/598885 (owner: EBernhardson)
[00:00:21] (CR) EBernhardson: "Kinda fishing for ideas here. Note also that this is not the limit of the duplication to deal with. profile::query_service::category is a " [puppet] - https://gerrit.wikimedia.org/r/598886 (owner: EBernhardson)
[00:21:09] Operations, ORES, Release Pipeline (Blubber), Scoring-platform-team (Current): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (ACraze) A couple of questions here so far: * Does the base image need to come from the wmf docker registry? If so, then it might make sense for...
[00:33:36] Operations, ORES, Release Pipeline (Blubber), Scoring-platform-team (Current): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (Jdforrester-WMF) >>! In T210268#6167598, @ACraze wrote: > A couple of questions here so far: > > * Does the base image need to come from the wmf...
[00:45:19] Operations, ORES, Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (ACraze) I'm wondering about pod size limits and what that means for our current architecture. I believe I've heard there is a 2GB limit, is that correct? Currently, I'm abl...
[01:03:39] (PS1) Jforrester: Stop loading PerformanceInspector on any wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598891 (https://phabricator.wikimedia.org/T253689)
[01:03:41] (PS1) Jforrester: Stop defining wmgUsePerformanceInspector, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/598892 (https://phabricator.wikimedia.org/T253689)
[01:03:43] (PS1) Jforrester: Stop loading i18n for PerformanceInspector, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/598893 (https://phabricator.wikimedia.org/T253689)
[02:05:41] Operations, ops-codfw, DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (Papaul)
[02:16:55] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[03:27:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:29:19] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 55 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:31:13] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:35:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:15:56] (PS1) Marostegui: dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/598909 (https://phabricator.wikimedia.org/T249188)
[04:18:53] (CR) Marostegui: [C: +2] dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/598909 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[04:20:53] !log Depool labsdb1011 - T249188
[04:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:53] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:21:07] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[04:22:29] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:26:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:31:47] PROBLEM - puppet last run on an-tool1006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:33:59] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:34:17] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one
or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:19] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:34:23] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:35:17] <_joe_> uh
[04:35:45] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[04:36:23] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[04:36:23] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[04:37:45] <_joe_> md/raid1:md1: Disk failure on sdd2, disabling device.
[04:39:26] <_joe_> !log restarting cassandra instances on restbase2009, has a broken disk
[04:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:22] <_joe_> !log cassandra cannot start on restbase2009, one of the disks has failed.
[04:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:33] <_joe_> hnowlan: ^^ for when you're around
[04:44:25] PROBLEM - MD RAID on restbase2009 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:44:26] ACKNOWLEDGEMENT - MD RAID on restbase2009 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:44:29] Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (ops-monitoring-bot)
[04:46:23] (PS1) Marostegui: dashboard.sql: Create tendril_purge_global_status_log event [software/tendril] - https://gerrit.wikimedia.org/r/598910 (https://phabricator.wikimedia.org/T252331)
[04:49:17] RECOVERY - puppet last run on an-tool1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:49:51] (CR) Marostegui: [V: +2 C: +2] dashboard.sql: Create tendril_purge_global_status_log event [software/tendril] - https://gerrit.wikimedia.org/r/598910 (https://phabricator.wikimedia.org/T252331) (owner: Marostegui)
[05:16:15] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:17:10] !log Remove tmp_3 key from enwiki.recentchanges on db1099:3311 - T206103
[05:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:14] T206103: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103
[05:21:45] PROBLEM - cassandra-c service on restbase2009 is
CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:36:18] (PS1) Giuseppe Lavagetto: jobrunner: use envoy by default [puppet] - https://gerrit.wikimedia.org/r/598912
[05:36:20] (PS1) Giuseppe Lavagetto: mediawiki::webserver: use envoy only for TLS [puppet] - https://gerrit.wikimedia.org/r/598913
[05:37:28] (CR) jerkins-bot: [V: -1] mediawiki::webserver: use envoy only for TLS [puppet] - https://gerrit.wikimedia.org/r/598913 (owner: Giuseppe Lavagetto)
[05:38:28] Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (Joe) p:Triage→High Setting priority to "high" as the failed disk was also used in JBOD configuration for cassandra, which is now failing to start.
[05:51:26] (PS2) Giuseppe Lavagetto: mediawiki::webserver: use envoy only for TLS [puppet] - https://gerrit.wikimedia.org/r/598913
[05:55:20] (PS1) Elukey: superset: set default gunicorn app name [puppet] - https://gerrit.wikimedia.org/r/598914 (https://phabricator.wikimedia.org/T249495)
[06:04:50] (CR) Elukey: [C: +1] "Left a couple of comments just to verify that my understanding is correct, if so LGTM!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: CDanis)
[06:05:25] (CR) Elukey: [C: +2] superset: set default gunicorn app name [puppet] - https://gerrit.wikimedia.org/r/598914 (https://phabricator.wikimedia.org/T249495) (owner: Elukey)
[06:09:13] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@369a2dd]: Upgrade Superset to 0.36 - second attempt
[06:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:09] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@369a2dd]: Upgrade Superset to 0.36 - second attempt (duration: 00m 57s)
[06:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:09] (CR) Ayounsi: [C: +1] "lgtm, no specific comment and looks sane." [puppet] - https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: CDanis)
[06:16:35] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:22:05] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:30:10] <_joe_> uhm we should probably mask cassandra there
[06:30:22] <_joe_> but I'll wait for hnowlan to be here :)
[06:31:36] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: use envoy by default [puppet] - https://gerrit.wikimedia.org/r/598912 (owner: Giuseppe Lavagetto)
[06:53:37] (CR) Elukey: [C: +2] Import lexeme ttl dumps to HDFS [puppet] - https://gerrit.wikimedia.org/r/598681 (owner: DCausse)
[06:57:18] !log update matomo on stretch-wikimedia to 3.13.5
[06:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:17] !log matomo upgraded to 3.13.5 on matomo1001 - T252741
[07:04:21] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:22] T252741: Upgrade matomo to the latest upstream - https://phabricator.wikimedia.org/T252741
[07:04:41] heads-up: soon there will be planned downtime of CI/jenkins
[07:05:57] Operations, netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down - TTN-0004110251 - https://phabricator.wikimedia.org/T253610 (ayounsi) Current status as of 30min ago: > Zayo Technician has cleaned Fibers connections. Zayo is waiting for the circuit to take Errors before changing out the Card. We curren...
[07:07:00] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::webserver: use envoy only for TLS [puppet] - https://gerrit.wikimedia.org/r/598913 (owner: Giuseppe Lavagetto)
[07:12:18] (PS1) Dzahn: Revert "Revert "switch contint from 1001 to 2001"" [dns] - https://gerrit.wikimedia.org/r/598953
[07:13:05] (PS1) Dzahn: Revert "Revert "contint: switch jenkins/zuul/gearman to contint2001"" [puppet] - https://gerrit.wikimedia.org/r/598954
[07:13:55] mutante: guess we can rsync the data
[07:14:04] ACKNOWLEDGEMENT - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-06-20 07:01:41 +0000 (expires in 23 days) Ayounsi https://phabricator.wikimedia.org/T251726 https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[07:14:04] ACKNOWLEDGEMENT - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-06-20 07:01:41 +0000 (expires in 23 days) Ayounsi https://phabricator.wikimedia.org/T251726 https://phabricator.wikimedia.org/tag/phabricator/
[07:14:20] which based on the task should be contint1001 - rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/
[07:14:24] Operations, Traffic, serviceops, Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (ayounsi) ACKing the alerts again with that task
as comment.
[07:14:37] XioNoX: thanks for the ACK, that's because i reverted the change to lower thresholds for reasons
[07:14:49] yep, I saw, no pb!
[07:15:22] hashar: bonjour.. so.. i looked at other pending patches besides the actual "switch over" ones (2), and i think they are either for Gerrit or something we need to do but not today.. right?
[07:15:29] hashar: yes, will do rsync now., brb
[07:16:48] and the one for /var/lib/jenkins: contint1001: rsync -avpn --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
[07:17:59] !log installing bind security updates (only client-side tools/libraries in use)
[07:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:41] hashar: var-lib-zuul, var-lib-jenkins and ci-srv
[07:18:53] ci-srv/jenkins
[07:20:27] /var/lib/zuul after zuul has been stopped, and maybe we can even skip that one but it is small
[07:20:41] /var/lib/jenkins which has all the configurations
[07:20:49] /srv/jenkins/ which has the build artifacts
[07:21:12] /var/lib/zuul - done (can repeat of course)
[07:21:22] /var/lib/jenkins - running
[07:22:39] Operations: Add Prometheus machine metric to track core dumps - https://phabricator.wikimedia.org/T165323 (MoritzMuehlenhoff) The use case back then was HHVM dumping core when processing some rare API traffic involving Chinese Wikipedia. We can revisit this at some point, but it's really low prio.
[07:24:08] hashar: done - starting /srv/jenkins
[07:24:19] 110G total
[07:24:46] I forgot about that part, we could have done a preliminary one yesterday bah
[07:24:52] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[07:25:01] !log contint1001:~# rsync -avp --delete /var/lib/zuul/ rsync://contint2001.wikimedia.org/ci--var-lib-zuul-
[07:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:40] !log contint1001:~# rsync -avp --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
[07:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:58] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[07:26:03] !log contint1001:~# rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/
[07:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:17] !log contint2001 - chown -R zuul:zuul /var/lib/zuul/
[07:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:23] !log contint2001 - find /var/lib/jenkins -user statsite -exec chown jenkins {} \;
[07:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:59] mutante: so the sync of /var/lib/jenkins has completed ?
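Editor's aside on the logged find commands: the ownership fix above ends its -exec clause with `\;`, while the later symlink fix in this log ends it with `+`. As a minimal sketch, assuming a scratch directory with three made-up files (nothing here touches /var/lib/jenkins), the difference is that `\;` runs the command once per file and `+` batches the matches into as few invocations as possible:

```shell
#!/bin/sh
# Hypothetical scratch directory with three placeholder files.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# "\;" spawns the command once for every matched file.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# "+" appends many matches to a single command line.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "per_file=$per_file batched=$batched"
rm -rf "$demo"
```

On a tree with as many small files as a Jenkins home, the batched form should mean far fewer chown processes, though either form gives the same ownership result.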
[07:32:12] hashar: yes
[07:32:18] that was fast
[07:32:26] so I guess we can stop the services on contint1001
[07:32:38] rsync again /var/lib/jenkins for eventual catchup
[07:32:42] hashar: /srv/jenkins is running and takes much longer
[07:32:44] and do the dns switch
[07:32:46] ah
[07:33:01] also fixing permissions
[07:34:22] since we run with --delete it is now deleting a bunch of old builds on contint2001
[07:34:53] cool
[07:35:21] !log contint2001 - find /var/lib/jenkins -group bacula -user jenkins -exec chown jenkins:jenkins {} \;
[07:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:33] hashar: comparing the permissions of files under /var/lib/jenkins on both servers now looks ok to me after the find commands above
[07:37:55] (CR) Giuseppe Lavagetto: [C: +1] admin: update tiller to 2.16.7-wmf1 [deployment-charts] - https://gerrit.wikimedia.org/r/598501 (https://phabricator.wikimedia.org/T252428) (owner: JMeybohm)
[07:38:23] some things are jenkins:jenkins, some are jenkins:nogroup and some are jenkins:adm but it is generally the same as on the active server
[07:38:50] yeah that sounds good
[07:39:25] fun fact, find doesn't proceed symlinks
[07:39:52] or maybe that is chmod
[07:40:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:40:45] unless you specify -L ? what links do you see?
[07:42:22] find doesn't follow symlinks by default no
[07:42:28] the lastStable / lastSuccessful links
[07:42:49] so the chown jenkins command needs to be passed either -h or --no-dereference
[07:43:00] to have it act on the symlink instead of the targeted file
[07:43:40] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:43:56] sudo find /var/lib/jenkins -not -user jenkins -exec chmod -h jenkins:jenkins {} +
[07:44:31] bah s/chmod/chown/
[07:44:32] sudo find /var/lib/jenkins -not -user jenkins -exec chown -h jenkins:jenkins {} +
[07:44:33] (PS2) Ayounsi: Anycast: introduce new "deterministic" variable [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666)
[07:44:47] i can't confirm. the first random "lastStable" link is already owned by jenkins:jenkins
[07:44:54] yeah I fixed them
[07:44:56] can you give an example which one isn't
[07:45:15] ok, let's log
[07:45:31] !log contint2001 also fixing symlink permissions: sudo find /var/lib/jenkins -not -user jenkins -exec chown -h jenkins:jenkins {} +
[07:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:39] we should really fix that rsync configuration ;D
[07:45:45] we should really fix the UIDs for next time
[07:45:58] so that no find will be needed at all
[07:46:28] not the rsync config, just the users having the same UID on both sides
[07:46:47] (PS1) Filippo Giunchedi: thanos: don't enable compact when not needed [puppet] - https://gerrit.wikimedia.org/r/598956 (https://phabricator.wikimedia.org/T252186)
[07:47:19] yeah I tracked it down to rsync config using a chroot which defaults to set 'numeric-ids' (which skips name to uid mapping)
[07:47:51] (CR) Filippo Giunchedi: [C: +2] thanos: don't enable compact when not needed [puppet] - https://gerrit.wikimedia.org/r/598956 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[07:49:43] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/22804/" [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: Ayounsi)
[07:49:44] yes, but we have been there multiple times before and afair it wasn't simply flipping a switch
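Editor's aside on the symlink discussion: the point that chown needs -h (or --no-dereference) to act on a link rather than its target can be reproduced without root. The following is a minimal sketch under assumptions: a scratch directory and made-up link names stand in for the Jenkins lastStable / lastSuccessful links, and a dangling link is used only because it makes the two behaviours easy to tell apart.

```shell
#!/bin/sh
# Hypothetical scratch directory; the names below are placeholders.
demo=$(mktemp -d)
touch "$demo/build-42"
ln -s build-42 "$demo/lastStable"        # valid symlink
ln -s missing-build "$demo/lastFailed"   # dangling symlink

# Without -L, find reports the symlinks themselves, not their targets.
link_count=$(find "$demo" -type l | wc -l | tr -d ' ')

# Plain chown dereferences the link, so it fails when the target is missing.
if chown "$(id -u)" "$demo/lastFailed" 2>/dev/null; then deref_ok=yes; else deref_ok=no; fi

# chown -h changes the link itself, so the missing target does not matter.
if chown -h "$(id -u)" "$demo/lastFailed" 2>/dev/null; then nofollow_ok=yes; else nofollow_ok=no; fi

echo "links=$link_count deref=$deref_ok nofollow=$nofollow_ok"
rm -rf "$demo"
```

With a valid link the plain form would succeed, but it would change the target's owner and leave the link's owner untouched, which is presumably why the first round of find commands left the links looking wrong.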
[07:51:19] (PS2) Dzahn: Revert "Revert "contint: switch jenkins/zuul/gearman to contint2001"" [puppet] - https://gerrit.wikimedia.org/r/598954
[07:51:22] (PS2) Dzahn: Revert "Revert "switch contint from 1001 to 2001"" [dns] - https://gerrit.wikimedia.org/r/598953
[07:51:32] preparing the reverts
[07:54:36] too many revert ;D
[07:54:50] Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Aklapper) >>! In T205619#6165085, @Jidanni wrote: > So how about providing an 1990's back to bare bones basics oldest fashioned HTML 3 or wha...
[07:54:51] !log test new bird conf on dns4001 - T253666
[07:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:54] T253666: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666
[08:00:12] hashar: /var/lib/zuul/git is root owned on 1001 but zuul:zuul on 2001 but it is an empty dir anyways
[08:00:19] (PS2) Ema: vcl: apply mobileaction/useformat ttl cap to cacheable responses [puppet] - https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783)
[08:01:47] changing it anyways
[08:01:49] (CR) Ema: vcl: apply mobileaction/useformat ttl cap to cacheable responses (1 comment) [puppet] - https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783) (owner: Ema)
[08:02:15] !log contint2001 - chown root:root /var/lib/zuul/git
[08:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:17] Operations, Traffic, Wikimedia-Apache-configuration, Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (Nintendofan885) >>! In T249648#6050626, @Bugreporter wrote: > Note sco.wiktionary.org/wiki/ and sco...
[08:03:09] (PS2) JMeybohm: admin: update tiller to 2.16.7-wmf1 [deployment-charts] - https://gerrit.wikimedia.org/r/598501 (https://phabricator.wikimedia.org/T252428)
[08:03:18] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:04:03] Operations, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:05:09] (CR) JMeybohm: [C: +2] admin: update tiller to 2.16.7-wmf1 [deployment-charts] - https://gerrit.wikimedia.org/r/598501 (https://phabricator.wikimedia.org/T252428) (owner: JMeybohm)
[08:05:32] (Merged) jenkins-bot: admin: update tiller to 2.16.7-wmf1 [deployment-charts] - https://gerrit.wikimedia.org/r/598501 (https://phabricator.wikimedia.org/T252428) (owner: JMeybohm)
[08:05:33] mutante: not sure from where /var/lib/zuul/git comes from
[08:06:04] ok, no worries about it. it's the same now
[08:06:09] oh that is the default dir for zuul::merger class
[08:06:14] but we have it set to /srv/zuul/git
[08:06:53] (PS1) Muehlenhoff: Remove further package leftovers after stretch->buster upgrades [puppet] - https://gerrit.wikimedia.org/r/598959
[08:08:53] !log contint1001 / contint2001 : deleted unused /var/lib/zuul/git (the real one is /srv/zuul/git )
[08:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:13] Operations, Security-Team, serviceops, vm-requests, PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (Dzahn) @chasemp Yes, i can create the VM. But I would ask you to please add your new cluster name, description and contact on h...
[08:09:22] hashar: ah, ok :)
[08:09:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:46] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:46] PROBLEM - Bird Internet Routing Daemon on dns4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:11:03] mutante: and the /var/lib/jenkins rsync is still ongoing isn't it ?
[08:11:08] !log updated admin tiller (namespace: kube-system) to 2.16.7-wmf1 in clusters: staging, codfw, eqiad
[08:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:10] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:11:14] XioNoX: ^
[08:11:17] @hashar please see https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/598960/
[08:11:25] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:11:31] mutante: yep, that's me, it's coming back up
[08:11:35] PROBLEM - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:36] hashar: yes
[08:11:36] DannyS712: that was fast :]
[08:11:38] XioNoX: ack, thx
[08:12:05] my "quick" test lasted a bit longer than the icinga timeouts :)
[08:12:13] it's the same as a previous bug that I quick patched, and then the fail safe logic was removed because it was believed to be no longer needed
[08:12:23] I just restored the fail safe from the previous patch
[08:12:49] (PS2) Gilles: Optimise all static PNGs losslessly [mediawiki-config] - https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108)
[08:13:30] (Abandoned) Dzahn: introduce sectools1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/595892 (owner: Dzahn)
[08:13:53] DannyS712: who knows? ;) maybe the cache is corrupted somehow
[08:14:00] (CR) Gilles: "@Krinkle the problematic images you were pointing to had been deleted since and are now gone from this rebased version of the patch: https" [mediawiki-config] - https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[08:14:20] idk, but it's better to have a check that shouldn't be needed than not have one that should be
[08:16:04] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/598959 (owner: Muehlenhoff)
[08:16:27] (CR) Muehlenhoff: [C: +2] Remove further package leftovers after stretch->buster upgrades [puppet] - https://gerrit.wikimedia.org/r/598959 (owner: Muehlenhoff)
[08:16:55] DannyS712: thanks for the quick fix, i will let parser folks to approve that one and cherry pick it.
It only happens on a single talk page so it is probably not a big deal right now ;]]
[08:16:56] (CR) Ema: [C: +2] vcl: apply mobileaction/useformat ttl cap to cacheable responses [puppet] - https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783) (owner: Ema)
[08:17:58] (PS1) Gehel: maps: remove useless hiera config [puppet] - https://gerrit.wikimedia.org/r/598962 (https://phabricator.wikimedia.org/T249086)
[08:18:00] (PS1) Gehel: maps: maps2003 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598963 (https://phabricator.wikimedia.org/T249086)
[08:18:02] (PS1) Gehel: maps: maps2002 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086)
[08:18:04] (PS1) Gehel: maps: maps2001 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598965 (https://phabricator.wikimedia.org/T249086)
[08:18:45] (CR) jerkins-bot: [V: -1] maps: maps2002 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086) (owner: Gehel)
[08:18:58] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) Hi, @wiki_willy I just want to ping you so your team is aware that the maintenance here didn't complete correctly and that we need more onsite help (I don't need this fast, jus...
[08:19:02] (PS1) Dzahn: introduce peek2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/598966 (https://phabricator.wikimedia.org/T252210)
[08:20:16] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:20:30] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:20:32] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:40] RECOVERY - Check systemd state on dns4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:41] (PS2) Gehel: maps: maps2002 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086)
[08:20:42] RECOVERY - Bird Internet Routing Daemon on dns4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:20:42] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:43] (PS2) Gehel: maps: maps2001 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598965 (https://phabricator.wikimedia.org/T249086)
[08:21:09] (CR) Dzahn: [C: +2] introduce peek2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/598966 (https://phabricator.wikimedia.org/T252210) (owner: Dzahn)
[08:21:25] @hashar just want to make sure you saw my note at T253022#6167984 - hopefully there won't be any more bugs found though
[08:21:26] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022
[08:22:59] DannyS712: don't worry, we would get other folks available to fix them hopefully :]
[08:24:18] (PS1) Ema: vcl: use duration instead of int for
ttl cap [puppet] - https://gerrit.wikimedia.org/r/598967 (https://phabricator.wikimedia.org/T247783)
[08:25:00] (PS1) Dzahn: delete people2001.codfw.wmnet ? [dns] - https://gerrit.wikimedia.org/r/598968
[08:25:10] (CR) Ema: [C: +2] vcl: use duration instead of int for ttl cap [puppet] - https://gerrit.wikimedia.org/r/598967 (https://phabricator.wikimedia.org/T247783) (owner: Ema)
[08:26:34] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01702 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:26:37] Operations, Security-Team, serviceops, vm-requests, and 2 others: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (Dzahn) Ah, wait, so i was about to create it and already added peek2001.codfw.wmnet to DNS but then noticed it asks for e...
[08:27:19] (PS1) Vgutierrez: prometheus: Fetch atstls.mtail metrics [puppet] - https://gerrit.wikimedia.org/r/598969
[08:27:34] PROBLEM - DPKG on labstore1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:28:11] (PS3) Ayounsi: Anycast: introduce new "deterministic" variable [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666)
[08:32:54] hashar: size of /srv/jenkins on contint2001 is actually shrinking ..deleting builds again
[08:33:15] but overall still smaller than on the source
[08:33:50] I am not sure why rsync is so slow to sync those though
[08:33:55] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/22806/" [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: Ayounsi)
[08:33:59] maybe because there are ton of files
[08:34:21] yes, because there are a ton of files
[08:34:26] maybe it would have been faster to just delete everything on
contint2001 and just blindly transfer the ~120G of files in one go [08:34:30] and they apparently change quickly [08:34:37] (03PS1) 10Filippo Giunchedi: thanos: monitor compact metrics only for enabled host [puppet] - 10https://gerrit.wikimedia.org/r/598971 (https://phabricator.wikimedia.org/T252186) [08:34:44] maybe,, yea.. not sure [08:35:04] I found some high traffic jobs holding 30 days of artifacts and changed them to just 7 days ; [08:35:09] so that would be less files in the future [08:35:19] there is a gazillion workspace/files/foo.tmp [08:35:32] workspace-files/something.tmp [08:35:43] oh [08:35:43] (03PS4) 10Ayounsi: Anycast: introduce new "deterministic" variable [puppet] - 10https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) [08:35:51] no clue what those files are [08:35:58] builds/operations-puppet-wmf-style-guide/9224/workspace-files/3a583cfa.tmp [08:37:18] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: monitor compact metrics only for enabled host [puppet] - 10https://gerrit.wikimedia.org/r/598971 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:37:54] seems that some error being logged bah [08:38:54] (03CR) 10Ema: [C: 03+1] prometheus: Fetch atstls.mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/598969 (owner: 10Vgutierrez) [08:42:19] !log starting again db2097 db instances T252492 [08:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:23] T252492: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 [08:42:46] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Fetch atstls.mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/598969 (owner: 10Vgutierrez) [08:43:05] (03PS1) 10Marostegui: mariadb: Place db1146 into s2 and s4 [puppet] - 10https://gerrit.wikimedia.org/r/598972 (https://phabricator.wikimedia.org/T252512) [08:43:07] (03PS2) 10Vgutierrez: prometheus: Fetch atstls.mtail metrics [puppet] - 
10https://gerrit.wikimedia.org/r/598969 [08:43:21] (03PS1) 10JMeybohm: admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 [08:43:40] !log running apt-get autoremove on labstore1006 [08:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:51] (03PS1) 10Jcrespo: Revert "icinga: Disable notifications for db2097 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/598975 [08:46:07] (03PS2) 10Jcrespo: Revert "icinga: Disable notifications for db2097 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/598975 [08:46:44] !log removing more old packages in labstore1006 (all packages in 'rc' state) [08:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:58] (03PS2) 10Marostegui: mariadb: Place db1146 into s2 and s4 [puppet] - 10https://gerrit.wikimedia.org/r/598972 (https://phabricator.wikimedia.org/T252512) [08:47:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3312 db1103:3314 to clone db1146 T252512', diff saved to https://phabricator.wikimedia.org/P11308 and previous config saved to /var/cache/conftool/dbconfig/20200527-084713-marostegui.json [08:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:17] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [08:47:46] (03CR) 10Jcrespo: [C: 03+2] Revert "icinga: Disable notifications for db2097 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/598975 (owner: 10Jcrespo) [08:48:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:48:38] !log Stop MySQL on db1103 [08:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:49] mutante: those tmp files are actual errors and I have filed a task for it :] [08:50:16] hashar: ah:) [08:50:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Place db1146 into s2 and s4 [puppet] - 10https://gerrit.wikimedia.org/r/598972 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [08:51:35] we can just purge them [08:52:56] !log contint1001: find /srv/jenkins/builds/operations-puppet-wmf-style-guide -type f -name '*.tmp' -delete # T253729 [08:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:01] T253729: operations-puppet-wmf-style-guide spurts warning trying to copy files to master - https://phabricator.wikimedia.org/T253729 [08:58:22] RECOVERY - DPKG on labstore1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:59:20] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002522 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:01:25] (03CR) 10Giuseppe Lavagetto: admin: Fix cluster-helmfile.sh (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 (owner: 10JMeybohm) [09:03:38] 10Operations, 10Traffic, 10netops: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) p:05Triage→03Low [09:03:42] (03CR) 10Jbond: [C: 03+1] "im unfamiliar with dnsdist.conf but otherwise looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [09:03:52] hashar: you deleted stuff while rsync is running on both sides? [09:04:15] mutante: yeah I purged a few files. 
So rsync would probably report some warnings :/ [09:04:31] (03PS1) 10Filippo Giunchedi: swift: use default to https for dispersion [puppet] - 10https://gerrit.wikimedia.org/r/598979 (https://phabricator.wikimedia.org/T252186) [09:05:21] (03CR) 10Elukey: [C: 04-1] web::fetches::analytics::job: do not rsync mediawiki if missing source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: 10Mforns) [09:05:32] really I should have remembered to get the rsync done yesterday :-\ [09:05:54] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: use default to https for dispersion [puppet] - 10https://gerrit.wikimedia.org/r/598979 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:06:07] (03PS2) 10Filippo Giunchedi: swift: use default to https for dispersion [puppet] - 10https://gerrit.wikimedia.org/r/598979 (https://phabricator.wikimedia.org/T252186) [09:07:05] (03PS2) 10Ema: prometheus: job definition for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/598464 (https://phabricator.wikimedia.org/T253551) [09:10:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598464 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [09:11:32] (03CR) 10Dzahn: "The reason for hardcoding the path in the module was the assumption that the certificates will be generated by acme_chief and hence will be" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [09:12:32] hashar: things keep changing at the source. so we have to start over i guess [09:12:57] yeah we would have to start again after jenkins is stopped [09:13:15] .. 
it will be a long time [09:13:24] to delete gigabytes of small files [09:14:52] well, one way to find out, we will see [09:17:03] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) > Option B via MEDs sounds like a good path forward for now, though! https://gerrit.wikimedia.org/r/598836 has been tested and is ready to be merged.... [09:17:38] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) [09:17:41] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [09:17:50] 10Operations, 10Traffic, 10netops: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) [09:17:55] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [09:18:40] (03PS2) 10JMeybohm: admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 [09:19:49] (03CR) 10JMeybohm: admin: Fix cluster-helmfile.sh (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 (owner: 10JMeybohm) [09:21:51] (03PS3) 10JMeybohm: admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 [09:22:07] (03CR) 10Kormat: [C: 03+1] dbproxy1018: Add labsdb1010 with reduced weight [puppet] - 10https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [09:28:50] hashar: this is coming up again, not sure if related to migration or not https://phabricator.wikimedia.org/T252310#6168492 [09:29:28] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 
(https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:29:43] (03PS10) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [09:29:55] 10Operations, 10Traffic, 10netops: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) pmtud send the packets to the broadcast MAC address, which mean it only works within the same subnet. While we have hosts on different subnets (rows) in the core DCs. How... [09:33:39] mutante: not sure [09:33:58] hashar: sent 38,665,347,398 bytes received 80,524,992 bytes 5,216,542.90 bytes/sec [09:34:02] :p starting the second time [09:35:36] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=prometheus2003.codfw.wmnet [09:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:24] (03CR) 10Kormat: base: Add some small quality-of-life packages. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [09:36:31] hashar: good news, second run was quite fast [09:36:47] ah cool :] [09:36:56] so we can now stop zuul / jenkins on contint1001 [09:37:01] rsync again /var/lib/jenkins just in case [09:37:06] and do the switch [09:37:21] hashar: contint1001: 88347820 contint2001: 88312544 [09:37:30] (bytes in /srv/jenkins) [09:37:43] it was much more before you deleted stuff, so good [09:38:04] yeah some high traffic jobs were keeping 30 days of artifacts [09:38:16] I did a cleanup a month or so ago but only looked at disk size [09:38:23] hashar: ok, can you do the service stop/mask [09:38:26] this time I looked at disk count and caught a few more offenders [09:38:26] (03PS2) 10Kormat: base: Add some small quality-of-life packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/598752 [09:39:11] !log Stopping Zuul and Jenkins CI for scheduled maintenance # T224591 [09:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:14] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [09:39:15] hashar: if desired we can also do the masking with puppet, remember we added the option [09:39:24] stopped [09:39:43] !log repeated rsync -avp --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins- [09:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:46] done [09:39:57] I have masked them both manually on contint1001 [09:40:08] !log contint1001: masked jenkins and zuul [09:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:21] !log repeated rsync -avp --delete /var/lib/zuul/ rsync://contint2001.wikimedia.org/ci--var-lib-zuul- [09:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:11] (03CR) 10Jbond: "lgtm some minor nits" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: 10Ayounsi) [09:41:20] hashar: should i go ahead with the switch? both changes together? 
[09:41:25] yes [09:41:33] (03CR) 10Hashar: [C: 03+1] Revert "Revert "contint: switch jenkins/zuul/gearman to contint2001"" [puppet] - 10https://gerrit.wikimedia.org/r/598954 (owner: 10Dzahn) [09:41:39] (03CR) 10Hashar: [C: 03+1] Revert "Revert "switch contint from 1001 to 2001"" [dns] - 10https://gerrit.wikimedia.org/r/598953 (owner: 10Dzahn) [09:41:46] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "contint: switch jenkins/zuul/gearman to contint2001"" [puppet] - 10https://gerrit.wikimedia.org/r/598954 (owner: 10Dzahn) [09:41:54] then puppet all the way [09:41:56] wait a bit for dns [09:42:25] !log switching CI backend from contint1001 to contint2001 [09:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:43] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=prometheus2003.codfw.wmnet [09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:48] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "switch contint from 1001 to 2001"" [dns] - 10https://gerrit.wikimedia.org/r/598953 (owner: 10Dzahn) [09:42:51] (03PS3) 10Dzahn: Revert "Revert "switch contint from 1001 to 2001"" [dns] - 10https://gerrit.wikimedia.org/r/598953 [09:43:02] (03CR) 10Muehlenhoff: base: Add some small quality-of-life packages. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [09:43:03] running puppet on contint* [09:43:13] (03CR) 10Muehlenhoff: [C: 03+1] base: Add some small quality-of-life packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [09:43:22] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused https://www.mediawiki.org/wiki/Continuous_integration/Zuul [09:43:30] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul [09:43:44] PROBLEM - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [09:44:01] moritzm: ty <3 [09:44:22] PROBLEM - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [09:44:22] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance={127.0.0.1:9900,127.0.0.1:9904} job=prometheus site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:44:30] known ^ [09:44:32] running puppet on both hosts [09:44:33] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "Revert "switch contint from 1001 to 2001"" [dns] - 10https://gerrit.wikimedia.org/r/598953 (owner: 10Dzahn) [09:44:36] ACKNOWLEDGEMENT - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn migration ongoing https://wikitech.wikimedia.org/wiki/Jenkins [09:44:36] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to 
address 127.0.0.1 and port 4730: Connection refused daniel_zahn migration ongoing https://www.mediawiki.org/wiki/Continuous_integration/Zuul [09:44:36] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server daniel_zahn migration ongoing https://www.mediawiki.org/wiki/Continuous_integration/Zuul [09:45:03] (03CR) 10Muehlenhoff: [C: 03+2] Integrate hardened java.security into profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598682 (owner: 10Muehlenhoff) [09:45:05] hashar: puppet already done. DNS right now [09:45:12] syncing [09:45:37] PROBLEM - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [09:45:54] gosh, ok I'll downtime those alerts [09:45:56] hashar: note that puppet did fix ownerships under /var/lib/zuul/.ssh/ [09:46:05] ah [09:46:17] PROBLEM - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [09:46:23] and also some under /var/lib/jenkins [09:46:41] arr yea, of course we need to repeat the find commands again [09:46:59] May 27 09:46:26 contint2001 zuul-merger[11968]: IOError: [Errno 13] Permission denied: '/var/lib/zuul/test' [09:47:00] since rsync ran a second time [09:47:01] :D [09:47:15] sigh [09:47:18] cause /var/lib/zuul are owned by envoy [09:47:19] fixing [09:47:38] better [09:47:53] hashar: i don't think the command line "sudo find /var/lib/jenkins -not -user jenkins -exec chown -h jenkins:jenkins {} +" was correct [09:48:03] :-\ 
[09:48:28] not everything not owned by jenkins is automatically jenkins:jenkins [09:48:39] that's why i did that separately for user and group before [09:48:56] !log contint2001: unmasked jenkins and started it [09:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:13] (03PS1) 10Muehlenhoff: Remove obsolete pdns3hack class [puppet] - 10https://gerrit.wikimedia.org/r/598983 [09:49:22] (03CR) 10DCausse: [C: 03+1] maps: remove useless hiera config [puppet] - 10https://gerrit.wikimedia.org/r/598962 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [09:49:39] (03CR) 10jerkins-bot: [V: 04-1] Remove obsolete pdns3hack class [puppet] - 10https://gerrit.wikimedia.org/r/598983 (owner: 10Muehlenhoff) [09:49:49] !log contint2001 - find /var/lib/jenkins -group bacula -user statsite -exec chown jenkins:jenkins {} \; [09:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:43] ah yeah [09:51:08] (03CR) 10Jbond: "> Patch Set 5:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [09:51:30] !log roll restart prometheus on the fleet to apply I0e2fe8af [09:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:50] and now i should do that again with the -h for chown.. 
doing [09:52:21] !log contint2001 - find /var/lib/jenkins -user statsite -exec chown -h jenkins:jenkins {} \; [09:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:14] hashar: better, but stuff that used to be jenkins:nogroup is now messed up [09:53:22] (03PS3) 10Muehlenhoff: Switch the IDPs to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) [09:53:48] (03CR) 10jerkins-bot: [V: 04-1] Switch the IDPs to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [09:54:47] (03CR) 10DCausse: [C: 03+1] maps: maps2003 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598963 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [09:54:52] (03CR) 10DCausse: [C: 03+1] maps: maps2002 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [09:54:58] !log contint1001 / contint2001 : deleted obsolete files /var/lib/jenkins/.git and /var/lib/jenkins/jobs/_shared/ [09:55:00] (03CR) 10DCausse: [C: 03+1] maps: maps2001 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598965 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [09:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:42] I am starting jenkins [09:55:48] !log contint2001: starting jenkins [09:55:50] hashar: wait, stuff is broken [09:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:16] oh ? :( [09:56:30] (03CR) 10Kormat: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [09:56:52] (03CR) 10jerkins-bot: [V: 04-1] base: Add some small quality-of-life packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [09:56:55] hashar: caused by " -not -user jenkins -exec chown -h jenkins:jenkins" [09:57:33] and /srv/jenkins is misowned yeah ;( [09:57:48] hashar: that is a separate issue [09:57:55] a [09:58:05] looks like we have to start over one more time [09:58:23] on which directory? [09:58:25] manually trying to revert that stuff is tedious [09:59:12] /var/lib/jenkins [09:59:18] do not run the same find command again please [09:59:31] let me rsync it and re-fix it [09:59:42] well those files are all owned by jenkins so that sounds right? [09:59:50] no, that is wrong [10:00:16] ok I stopped zuul and jenkins on contint2001 [10:00:47] ok, rsynced /var/lib/jenkins, fixing permissions [10:02:40] !log repeated rsync of /var/lib/jenkins with -p ; find /var/lib/jenkins -group bacula -user statsite -exec chown -h jenkins:jenkins {} \; [10:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:09] !log contint2001 - find /var/lib/jenkins -user statsite -exec chown -h jenkins:jenkins {} \; [10:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:07] hashar: it is still not correct.. [10:04:09] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [10:04:13] what is incorrect? 
[10:04:56] all the things are owned jenkins:jenkins that should be jenkins:nogroup [10:05:05] and jenkins:adm [10:05:05] PROBLEM - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused https://www.mediawiki.org/wiki/Continuous_integration/Zuul [10:05:21] ACKNOWLEDGEMENT - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn migration https://wikitech.wikimedia.org/wiki/Jenkins [10:05:21] ACKNOWLEDGEMENT - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused daniel_zahn migration https://www.mediawiki.org/wiki/Continuous_integration/Zuul [10:05:21] ACKNOWLEDGEMENT - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server daniel_zahn migration https://www.mediawiki.org/wiki/Continuous_integration/Zuul [10:05:27] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:05:55] mutante: I am not sure why we have some nogroup or adm files though [10:08:25] ah from puppet adm group owned: /var/lib/jenkins/hudson.plugins.ircbot.IrcPublisher.xml /var/lib/jenkins/logs /var/lib/jenkins/secrets [10:08:31] RECOVERY - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [10:08:39] RECOVERY - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [10:09:00] hashar: ok, i got it fixed, look now [10:09:18] (03PS8) 10Jbond: docker build: update the build process to us docker [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) [10:09:42] yeah better [10:10:06] (03PS1) 10Privacybatm: [WIP] wmfmariadbpy: Package wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) [10:10:07] hashar: now about /srv/jenkins, it is root:root on 1001 [10:10:29] yeah that is a mount point [10:10:32] isn't it ? [10:10:51] ah no /srv is [10:11:01] hashar: when you just said /srv/jenkins is misowned, what did you mean? [10:11:39] that part looks like it is actually all jenkins:jenkins. but let me check [10:12:22] so /srv/jenkins/builds/ and /srv/jenkins/workspace were owned by something else due to rsync [10:12:27] and I have made them owned by jenkins:jenkins [10:12:31] hashar: yea, i can't find any file _not_ owned by jenkins on either server under /srv/jenkins [10:12:58] so i guess it is fine now [10:13:17] ok, if you did not run that on /var/lib/jenkins then it should be fine now [10:13:23] let's try [10:13:29] RECOVERY - jenkins_service_running on contint2001 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [10:13:30] starting jenkins [10:13:36] icinga is too fast [10:14:07] and jenkins does run with java 8 [10:14:10] polled the agents [10:14:54] great [10:15:02] it is running some jobs for some reasons bah [10:15:04] started zuul [10:15:08] !log contint2001: started jenkins [10:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:13] !log contint2001: starting zuul [10:15:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:15] RECOVERY - zuul_gearman_service on contint2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 https://www.mediawiki.org/wiki/Continuous_integration/Zuul [10:17:26] mutante: seems like jobs trigger properly now [10:17:52] (03CR) 10Privacybatm: [WIP] wmfmariadbpy: Package wmfmariadbpy (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [10:18:24] hashar: great [10:18:29] hashar: i see this though: May 27 10:18:06 contint2001 git-daemon[21413]: fatal: the remote end hung up unexpectedly [10:18:56] yeah known issue [10:19:05] it is merely spam log [10:19:48] o [10:19:53] ok [10:21:53] mutante: so that looks good so far :] [10:22:12] I apologize for the long delay due to rsync :-\ [10:22:19] I should have thought about running it yesterday [10:23:02] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) ` 2020-05-27 10:15 hashar: contint... [10:23:12] mutante: I will monitor it and report on the task whether anything goes wrong [10:23:18] but I guess we can have a lunch break now [10:23:20] ! [10:23:55] hashar: i am glad it is looking good. great! [10:24:01] (03CR) 10Kormat: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [10:24:44] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/598968 (owner: 10Dzahn) [10:25:25] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) I forgot about having the data synchro... 
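Editor's note on the ownership problem discussed between 09:47 and 10:09: the command `find /var/lib/jenkins -not -user jenkins -exec chown -h jenkins:jenkins {} +` rewrites the group as well as the user, which is how files that should have been jenkins:nogroup or jenkins:adm ended up as jenkins:jenkins. The log does not show the exact commands used for the final fix, so what follows is only a minimal sketch of the "separately for user and group" approach mentioned at 09:48:39. It assumes the stray account names are `statsite` and `bacula`, as in the logged commands; `remap_owner` and `DRY_RUN` are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: remap a wrong user and a wrong group independently, so that
# mixed ownerships such as jenkins:nogroup or jenkins:adm are left intact.
# remap_owner and DRY_RUN are illustrative names, not from the log.

# Usage: remap_owner DIR WRONG_USER RIGHT_USER WRONG_GROUP RIGHT_GROUP
remap_owner() {
    dir=$1 wrong_user=$2 right_user=$3 wrong_group=$4 right_group=$5

    # Dry run is the default: print the commands instead of executing them.
    run=
    if [ "${DRY_RUN:-1}" = 1 ]; then
        run=echo
    fi

    # Change only the user of files owned by the wrong user (group untouched).
    $run find "$dir" -user "$wrong_user" -exec chown -h "$right_user" {} +
    # Change only the group of files in the wrong group (user untouched).
    $run find "$dir" -group "$wrong_group" -exec chgrp -h "$right_group" {} +
}

# Print what would be run for the case discussed in the log.
remap_owner /var/lib/jenkins statsite jenkins bacula jenkins
```

With `DRY_RUN=0` and root privileges the same call would apply the changes. Matching on the specific wrong owner, rather than on "anything not jenkins", is what keeps files with an intentionally different group from being touched.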
[10:25:34] will retrigger bunch of stuff [10:25:50] hashar: have we seen jenkins-bot vote yet ? [10:26:04] ok [10:26:21] there is one :) nice [10:26:22] yes https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/598752/ :] [10:26:30] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [10:26:36] confirmed on 598968 [10:26:51] (03CR) 10Kormat: [C: 03+2] base: Add some small quality-of-life packages. [puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [10:27:12] (03CR) 10Hashar: "recheck" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [10:27:18] (03CR) 10Hashar: "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [10:27:46] (03PS3) 10Kormat: base: Add some small quality-of-life packages. [puppet] - 10https://gerrit.wikimedia.org/r/598752 [10:28:55] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598983 (owner: 10Muehlenhoff) [10:30:01] (03CR) 10Muehlenhoff: [C: 03+2] Switch the IDPs to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [10:30:16] mutante: thank you :] [10:31:08] hashar: you're welcome. i could have also thought of rsync last night.. did not expect it to be that extreme though since we had already done it last time [10:31:21] expected a smaller diff [10:31:35] RECOVERY - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [10:31:58] checking icinga on contint* and removing downtimes if any [10:32:16] all green except a long running screen [10:32:19] the builds get almost all discarded after a few days [10:33:01] terminated the screen on contint1001 [10:34:19] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) p:05Medium→03High [10:34:24] (03PS1) 10Volans: scripts offline_device: fix logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598989 [10:36:16] (03PS1) 10Muehlenhoff: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 [10:37:20] (03CR) 10Volans: [C: 03+2] "Merging as it fixes the logging, happy to address any comment post-merge." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598989 (owner: 10Volans) [10:37:36] (03CR) 10jerkins-bot: [V: 04-1] ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [10:38:01] ACKNOWLEDGEMENT - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:01] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T93886 [10:38:01] ACKNOWLEDGEMENT - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed daniel_zahn https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:38:01] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T93886 [10:38:01] ACKNOWLEDGEMENT - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed daniel_zahn https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:38:01] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T93886 [10:38:01] ACKNOWLEDGEMENT - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed daniel_zahn https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:40:12] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [10:42:15] 10Operations, 10ops-eqiad, 10decommission: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Dzahn) Is this still stalled nowadays? 
[10:42:35] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01135 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:42:52] (03CR) 10Ema: [C: 03+2] prometheus: job definition for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/598464 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [10:43:29] "widespread" puppet failures = cloudservices2003-dev [10:43:48] "dev" server causing prod alerts [10:43:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:45:10] oh, there are more [10:45:17] it's probably the change to base packages [10:45:20] yea, it is [10:46:24] kormat: the change to add colordiff is causing puppet failure on (many but maybe not all?) hosts [10:46:38] but enough to trigger the "widespread failures" alert [10:46:53] f.e. "Duplicate declaration: Package[colordiff] is already declared .." [10:47:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Code LGTM but you need 0.14. Sorry about that :/" (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/598472 (owner: 10Ema) [10:47:35] not sure colordiff is still needed these days [10:47:37] (03PS1) 10Muehlenhoff: Drop colordiff from gerrit role, now part of standard packages [puppet] - 10https://gerrit.wikimedia.org/r/598992 [10:47:39] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [10:47:55] diff --color=always works ;] [10:47:57] hashar: https://gerrit.wikimedia.org/r/c/operations/puppet/+/598752 [10:48:00] at least on buster [10:49:26] ah icdiff is quite nice ;) [10:49:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:50:43] (03CR) 10Dzahn: [C: 03+2] Drop colordiff from gerrit role, now part of standard packages [puppet] - 10https://gerrit.wikimedia.org/r/598992 (owner: 10Muehlenhoff) [10:51:43] (03PS4) 10Ema: 0.14: Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 [10:53:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] 0.14: Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 (owner: 10Ema) [10:54:47] (03CR) 10Ema: [C: 03+2] 0.14: Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 (owner: 10Ema) [10:56:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [10:59:28] (03PS1) 10Jbond: aptrepo: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/598993 [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200527T1100). [11:00:05] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] * kart_ is here and proceeding for SWAT. [11:00:44] (03PS2) 10KartikMistry: Enable ContentTranslation in Galician Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598678 (https://phabricator.wikimedia.org/T250355) [11:00:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] clean out old repo [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596778 (owner: 10Jbond) [11:00:58] (03CR) 10Jbond: [V: 03+2 C: 03+2] docker build: update the build process to us docker [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [11:02:23] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005675 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:02:26] (03CR) 10KartikMistry: [C: 03+2] Enable ContentTranslation in Galician Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598678 (https://phabricator.wikimedia.org/T250355) (owner: 10KartikMistry) [11:03:00] (03CR) 10Muehlenhoff: aptrepo: add mcrouter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598993 (owner: 10Jbond) [11:03:13] (03Merged) 10jenkins-bot: Enable ContentTranslation in Galician Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598678 (https://phabricator.wikimedia.org/T250355) (owner: 10KartikMistry) [11:06:48] mutante: ah, crap.
this is what i get for not knowing puppet - i had no idea that it being listed twice for a given machine would be an issue [11:07:00] hashar: please show me the `--color=always` flag for `dbctl config diff` ;) [11:07:15] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:07:30] kart_: could I add a patch to this window? [11:07:55] kostajh: Can you self deploy or want me to deploy? :) [11:08:13] kart_: I can't deploy, you'd have to do it. It could wait until the next window if you'd prefer. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/598995 [11:08:31] kostajh: cool. Please add and self deploy :) [11:08:40] kostajh: I'll ping when I'm done. [11:09:09] kart_: I don't have access, nor knowledge of how to deploy :) maybe one of the other deployers is around? [11:09:24] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|598678|Enable ContentTranslation in Galician Wikipedia as a default tool (T250355)]] (duration: 01m 18s) [11:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:28] T250355: Enable Content Translation in Galician Wikipedia as a default tool - https://phabricator.wikimedia.org/T250355 [11:09:28] kostajh: oh. Let's check. [11:09:55] (03PS1) 10Jbond: licence: remove the mcrouter/debian/licence [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/598996 [11:11:21] Urbanecm / Amir1 / awight are one of you around? [11:11:33] (03Abandoned) 10Jbond: aptrepo: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/598993 (owner: 10Jbond) [11:11:40] kostajh: Sure :) [11:11:48] thanks Urbanecm [11:12:26] kostajh: sorry. I have to be around for some prescheduled work at home :/ [11:12:28] kormat: not a big deal. unlike i originally thought it was not affecting a lot of hosts, only gerrit [11:12:39] and that was already fixed now [11:13:02] kart_: no worries! 
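The "Duplicate declaration: Package[colordiff] is already declared" failure discussed above can be illustrated with a small sketch. This is a toy Python model, not Puppet's actual implementation: the names `Catalog`, `declare`, and `ensure` are hypothetical and exist only for this example. It assumes what the error message implies, namely that a host's catalog may contain a given resource such as `Package[colordiff]` only once, so a second plain declaration fails the run. The lenient variant loosely mirrors the idea behind `ensure_packages` from puppetlabs-stdlib, which adds a package resource only if the catalog does not already have it.

```python
class DuplicateDeclarationError(Exception):
    """Raised when the same resource is declared twice in one catalog."""


class Catalog:
    """Toy model of one host's compiled catalog (hypothetical, not Puppet internals)."""

    def __init__(self):
        # Maps (resource type, title) to the place that first declared it.
        self.resources = {}

    def declare(self, rtype, title, source):
        """Strict declaration: a second declaration of the same resource is an error."""
        key = (rtype, title)
        if key in self.resources:
            raise DuplicateDeclarationError(
                f"Duplicate declaration: {rtype}[{title}] is already declared "
                f"in {self.resources[key]}; cannot redeclare in {source}"
            )
        self.resources[key] = source

    def ensure(self, rtype, title, source):
        """Lenient declaration: add the resource only if it is not declared yet."""
        self.resources.setdefault((rtype, title), source)


if __name__ == "__main__":
    catalog = Catalog()
    catalog.declare("Package", "colordiff", "base packages")
    try:
        # Same package listed again by a role applied to the same host.
        catalog.declare("Package", "colordiff", "gerrit role")
    except DuplicateDeclarationError as err:
        print(err)
    # The lenient form is a no-op here and does not fail.
    catalog.ensure("Package", "colordiff", "gerrit role")
```

In this model the fix merged above (dropping colordiff from the gerrit role) corresponds to removing the second `declare` call, so the package is declared in exactly one place.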
[11:13:11] a bunch of other hosts showed up on grafana as failed.. but manually running puppet showed they were actually ok on next run [11:13:59] kart_: I'm now +2'ing the backport, are you still deploying? [11:14:21] Urbanecm: oh. Sorry. I'm done. [11:14:28] Please go ahead. [11:14:39] (03PS6) 10Ssingh: dnsdist: add parameters for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [11:14:43] no problem, thank you kart_ [11:15:00] kormat: yea, duplicate definition is a quite common issue. there is also "ensure_packages" which is different from require_package or package{} in that it only installs them if they don't already exist [11:15:59] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: add parameters for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:16:14] off for lunch etc [11:17:09] !log purged 0.14 uploaded to buster-wikimedia [11:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (commit message mentions debian/licence instead of debian/copyright, though)" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/598996 (owner: 10Jbond) [11:17:26] (03PS7) 10Ssingh: dnsdist: add parameters for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [11:17:55] (03PS2) 10Jbond: licence: remove the mcrouter/debian/copyright [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/598996 [11:18:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] licence: remove the mcrouter/debian/copyright [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/598996 (owner: 10Jbond) [11:18:16] !log cp2027: upgrade purged to 0.14 [11:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598822 
(https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:20:57] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22810/ thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:21:59] (03CR) 10Ssingh: [C: 03+2] dnsdist: add parameters for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:24:41] kostajh: available at mwdebug1001 for testing :) [11:24:49] Urbanecm: looking [11:27:07] Urbanecm: all good! [11:27:13] great, syncing [11:28:21] (though it seems the dialog is bigger than it used to be, but that's surely a totally different thing) [11:28:56] Urbanecm: that shouldn't have changed; the dialog size does depend on the viewport though [11:28:59] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.34/extensions/GrowthExperiments/: SWAT: 983eda5: Mentorship dialog: Swap panel to ask-help on open (T253692) (duration: 01m 06s) [11:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:03] T253692: [regression-wmf.34] Help panel displayed instead of Mentor dialog - https://phabricator.wikimedia.org/T253692 [11:29:06] (03PS1) 10Arturo Borrero Gonzalez: toolforge: shell_environ: drop byobu package declaration [puppet] - 10https://gerrit.wikimedia.org/r/598998 [11:29:08] synced [11:29:37] kostajh: I've tried to open the dialog at both cswiki and testwiki [11:29:48] testwiki https://usercontent.irccloud-cdn.com/file/vbwTEjQA/image.png [11:30:00] cswiki https://usercontent.irccloud-cdn.com/file/fsm9wB5y/image.png [11:30:43] Urbanecm: oh, the height. yes, we'll probably come back to that in a follow-up.
[11:31:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: shell_environ: drop byobu package declaration [puppet] - 10https://gerrit.wikimedia.org/r/598998 (owner: 10Arturo Borrero Gonzalez) [11:31:30] okay, cool :) [11:33:19] (03PS1) 10Urbanecm: wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) [11:34:03] (03PS2) 10Urbanecm: wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) [11:34:11] !log EU SWAT done [11:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:37] (03CR) 10CDanis: nfacctd: various increases to buffer sizes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: 10CDanis) [11:52:21] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:54:54] (03PS1) 10Ema: prometheus: filter interesting rdkafka metrics in purged job [puppet] - 10https://gerrit.wikimedia.org/r/599002 [11:57:41] 10Operations, 10SRE-tools: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) [12:03:25] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: fix some inconsistencies in the worker upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/599003 (https://phabricator.wikimedia.org/T246122) [12:07:35] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10hnowlan) I've disabled the cassandra service and removed the nodes running on restbase2009 so vnodes should be reallocated and this disk can be replaced. 
I will reprovision when the disk is replaced [12:08:16] (03PS3) 10CDanis: nfacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) [12:08:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: fix some inconsistencies in the worker upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/599003 (https://phabricator.wikimedia.org/T246122) (owner: 10Arturo Borrero Gonzalez) [12:08:55] (03PS2) 10Ema: prometheus: filter interesting rdkafka metrics in purged job [puppet] - 10https://gerrit.wikimedia.org/r/599002 [12:10:56] (03PS1) 10Jbond: admin::home::jbond: updated to use apt for reprepro [puppet] - 10https://gerrit.wikimedia.org/r/599004 [12:11:39] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: filter interesting rdkafka metrics in purged job [puppet] - 10https://gerrit.wikimedia.org/r/599002 (owner: 10Ema) [12:13:11] (03CR) 10Jbond: [C: 03+2] admin::home::jbond: updated to use apt for reprepro [puppet] - 10https://gerrit.wikimedia.org/r/599004 (owner: 10Jbond) [12:14:35] (03CR) 10Ema: [C: 03+2] prometheus: filter interesting rdkafka metrics in purged job [puppet] - 10https://gerrit.wikimedia.org/r/599002 (owner: 10Ema) [12:15:13] ema: ok to merge? [12:15:17] jbond42: y!
[12:15:34] merging [12:24:40] jouncebot: now [12:24:40] No deployments scheduled for the next 5 hour(s) and 35 minute(s) [12:25:47] PROBLEM - Prometheus bast5001/ops restarted: beware possible monitoring artifacts on bast5001 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [12:25:57] *sigh* downtime expired [12:30:06] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade: improve a bit output reading [puppet] - 10https://gerrit.wikimedia.org/r/599006 (https://phabricator.wikimedia.org/T250867) [12:31:11] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jbond) [12:32:05] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:32:51] (03PS1) 10Addshore: Add awa to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599008 (https://phabricator.wikimedia.org/T252870) [12:33:35] (03CR) 10CDanis: "> Patch Set 1:" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) (owner: 10CDanis) [12:34:20] (03PS2) 10CDanis: dbctl: diffs: when available, prefer icdiff interactively [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) [12:34:29] (03CR) 10MSantos: [C: 03+1] maps: remove useless hiera config [puppet] - 10https://gerrit.wikimedia.org/r/598962 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [12:36:02] (03CR) 10MSantos: [C: 03+1] maps: maps2003 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598963 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [12:36:20] (03CR) 10MSantos: [C: 03+1] maps: 
maps2002 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [12:36:29] (03CR) 10MSantos: [C: 03+1] maps: maps2001 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598965 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [12:45:20] (03Abandoned) 10Addshore: Add awa to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599008 (https://phabricator.wikimedia.org/T252870) (owner: 10Addshore) [12:45:50] (03PS3) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [12:47:10] (03CR) 10jerkins-bot: [V: 04-1] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (owner: 10Muehlenhoff) [12:47:25] RECOVERY - Prometheus bast5001/ops restarted: beware possible monitoring artifacts on bast5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [12:50:53] (03PS4) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [12:52:15] (03CR) 10jerkins-bot: [V: 04-1] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (owner: 10Muehlenhoff) [12:56:02] 10Operations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10jbond) > I think I'm leaning towards a few stable anchors in similar geographic locations to our PoPs. Maybe also a few root servers as well even though they're less apples-to... 
[12:57:07] (03PS5) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [12:58:31] (03CR) 10jerkins-bot: [V: 04-1] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (owner: 10Muehlenhoff) [12:58:49] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) >>! In T252210#6168163, @Dzahn wrote: > Ah, wait, so i was about to create it and already added peek2001.codfw.wmnet t... [12:59:13] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) [12:59:35] 10Operations, 10SRE-tools: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10Volans) [13:00:21] (03PS1) 10Marostegui: db1146: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/599010 (https://phabricator.wikimedia.org/T252512) [13:02:44] (03PS1) 10Elukey: Improve performance of datapoint ingestion [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 [13:03:08] (03PS2) 10Gehel: maps: remove useless hiera config [puppet] - 10https://gerrit.wikimedia.org/r/598962 (https://phabricator.wikimedia.org/T249086) [13:03:33] (03CR) 10Volans: [C: 03+1] "Change LGTM." 
(031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) (owner: 10CDanis) [13:03:35] (03PS6) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [13:03:48] (03CR) 10Marostegui: [C: 03+2] db1146: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/599010 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [13:04:17] (03CR) 10Gehel: [C: 03+2] maps: remove useless hiera config [puppet] - 10https://gerrit.wikimedia.org/r/598962 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [13:04:19] (03PS3) 10CDanis: dbctl: diffs: when available, prefer icdiff interactively [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) [13:04:44] (03CR) 10CDanis: [C: 03+2] "thanks for the review!" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) (owner: 10CDanis) [13:04:56] (03CR) 10jerkins-bot: [V: 04-1] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (owner: 10Muehlenhoff) [13:06:48] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [13:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:00] (03Merged) 10jenkins-bot: dbctl: diffs: when available, prefer icdiff interactively [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) (owner: 10CDanis) [13:07:17] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) Creating VM peek2001.codfw.wmnet in cluster ganeti01.svc.codfw.wmnet with row=B vcpus=1 memory=2GB disk=20GB link=privat... 
[13:07:27] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) 05Stalled→03Open [13:08:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1146:3312 and db1146:3314 to dbctl T252512', diff saved to https://phabricator.wikimedia.org/P11312 and previous config saved to /var/cache/conftool/dbconfig/20200527-130820-marostegui.json [13:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:24] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [13:10:51] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [13:11:24] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) I had a meeting on my cal for this AM. I was the only one there :) I made all the decisions, dont' worry. Th... [13:12:33] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) >>! In T252210#6168983, @Dzahn wrote: > Creating VM peek2001.codfw.wmnet in cluster ganeti01.svc.codfw.wmnet with row=... [13:13:23] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC effectively noop https://puppet-compiler.wmflabs.org/compiler1002/22813/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:13:32] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) @chasemp the meeting is scheduled for 14:00 UTC (i.e. 
45 mins) in my calendar [13:14:57] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) >>! In T244792#6169016, @jbond wrote: > @chasemp the meeting is scheduled for 14:00 UTC (i.e. 45 mins) in my c... [13:15:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11313 and previous config saved to /var/cache/conftool/dbconfig/20200527-131515-marostegui.json [13:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:33] (03PS1) 10CDanis: release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 [13:18:59] !log Kill /usr/local/bin/mwscriptwikiset updateSpecialPages.php s8.dblist --override --only=Fewestrevisions T238199 [13:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:03] T238199: SpecialFewestRevisions::reallyDoQuery takes more than 9h to run - https://phabricator.wikimedia.org/T238199 [13:19:33] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 (owner: 10CDanis) [13:19:38] (03PS7) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [13:20:50] (03PS1) 10Dzahn: site/DHCP/partman: add peek2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/599016 (https://phabricator.wikimedia.org/T252210) [13:21:39] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, and 2 others: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) a:03Dzahn [13:21:40] !log cp: upgrade 
purged to 0.14 [13:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:43] !log repool maps2004 / depool maps2003 [13:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:27] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01387 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:22:46] (03CR) 10Dzahn: [C: 03+2] site/DHCP/partman: add peek2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/599016 (https://phabricator.wikimedia.org/T252210) (owner: 10Dzahn) [13:23:11] (03PS2) 10Dzahn: site/DHCP/partman: add peek2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/599016 (https://phabricator.wikimedia.org/T252210) [13:23:34] (03CR) 10Marostegui: [C: 03+2] mediawiki: Disable fewestrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/596172 (https://phabricator.wikimedia.org/T238199) (owner: 10ArielGlenn) [13:24:23] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 (owner: 10CDanis) [13:24:28] (03CR) 10Marostegui: mediawiki: Disable fewestrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/596172 (https://phabricator.wikimedia.org/T238199) (owner: 10ArielGlenn) [13:25:22] (03PS2) 10ArielGlenn: mediawiki: Disable fewestrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/596172 (https://phabricator.wikimedia.org/T238199) [13:27:03] (03CR) 10ArielGlenn: [C: 03+2] mediawiki: Disable fewestrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/596172 (https://phabricator.wikimedia.org/T238199) (owner: 10ArielGlenn) [13:27:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 (owner: 10JMeybohm) [13:27:16] (03PS2) 10Gehel: maps: maps2003 rejoins master [puppet] - 
10https://gerrit.wikimedia.org/r/598963 (https://phabricator.wikimedia.org/T249086) [13:27:18] (03PS2) 10CDanis: release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 [13:27:20] (03PS1) 10CDanis: 'upstream' changes for release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599017 [13:28:05] (03CR) 10Gehel: [C: 03+2] maps: maps2003 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598963 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [13:30:01] (03CR) 10CDanis: [C: 03+2] 'upstream' changes for release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599017 (owner: 10CDanis) [13:31:13] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) Note @wiki_willy one of the new server i received yesterday just after plugging it in i am getting the error below ` 66 Detailed Description: The system hardware or cablin... [13:32:11] (03Merged) 10jenkins-bot: 'upstream' changes for release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599017 (owner: 10CDanis) [13:32:25] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 (owner: 10CDanis) [13:34:32] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init [13:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:42] !log gehel@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [13:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:53] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init [13:36:55] who owns the foreachwikiindblist script? [13:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:29] 10Operations, 10Release-Engineering-Team: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. 
in production when running populateSitesTable for aawikibooks with foreachwikiindblist - https://phabricator.wikimedia.org/T253756 (10Addshore) [13:39:20] (03CR) 10Jcrespo: "Not sure about the context of this patch, but we don't want to package all the wmfmariadbpy repo, only transfer.py and its dependencies." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [13:42:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:28] (03CR) 10Jcrespo: "See comment below- I think it will be able to answer your question." (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [13:44:35] mutante: do you have more hosts to decom? [13:44:42] (03CR) 10CDanis: [C: 03+2] release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 (owner: 10CDanis) [13:44:53] I have a patch to merge and would be nice to be able to verify it all still works [13:45:23] (03CR) 10Jcrespo: "I am guessing you want to package python3-transferpy ckass and transferpy cli on separate debian packages. 
That is ok, but we shouldn't in" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [13:45:30] volans: not at the moment, that was just a VM where i created it in the same row [13:45:38] the wrong row [13:45:42] k [13:45:42] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, and 2 others: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `peek2001.codfw.wmnet` - pee... [13:46:03] (03PS1) 10Ottomata: eventgate-* - upgrade to version 2020-05-27-130632-production to get /stream-configs route [deployment-charts] - 10https://gerrit.wikimedia.org/r/599030 (https://phabricator.wikimedia.org/T251609) [13:47:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11316 and previous config saved to /var/cache/conftool/dbconfig/20200527-134704-marostegui.json [13:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:14] (03Merged) 10jenkins-bot: release 1.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/599015 (owner: 10CDanis) [13:47:16] (03CR) 10Ottomata: [C: 03+2] eventgate-* - upgrade to version 2020-05-27-130632-production to get /stream-configs route [deployment-charts] - 10https://gerrit.wikimedia.org/r/599030 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:47:57] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [13:47:57] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [13:47:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:11] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [13:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:59] (03CR) 10Jcrespo: "This looks ok in terms of content (as a first version, obviously there are things that are not completely finished), but please read the p" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:50:54] 10Operations, 10User-MoritzMuehlenhoff: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610 (10MoritzMuehlenhoff) [13:51:29] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:51:29] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[13:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:23] (03PS2) 10Elukey: Improve performance of datapoint ingestion [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 [13:52:29] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005044 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:52:47] 10Operations, 10User-MoritzMuehlenhoff, 10User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10MoritzMuehlenhoff) [13:56:31] (03PS1) 10Muehlenhoff: Enable adduser/sysusers config for role(test) [puppet] - 10https://gerrit.wikimedia.org/r/599032 (https://phabricator.wikimedia.org/T235162) [13:57:01] (03CR) 10jerkins-bot: [V: 04-1] Enable adduser/sysusers config for role(test) [puppet] - 10https://gerrit.wikimedia.org/r/599032 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [13:58:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade: improve a bit output reading [puppet] - 10https://gerrit.wikimedia.org/r/599006 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [13:59:56] (03CR) 10Privacybatm: "> Patch Set 1:" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [14:00:07] (03PS2) 10Muehlenhoff: Enable adduser/sysusers config for role(test) [puppet] - 10https://gerrit.wikimedia.org/r/599032 (https://phabricator.wikimedia.org/T235162) [14:01:14] (03PS1) 10Dzahn: DHCP: update MAC for peek2001 [puppet] - 
10https://gerrit.wikimedia.org/r/599035 (https://phabricator.wikimedia.org/T252210) [14:01:38] (03CR) 10Dzahn: [V: 03+2 C: 03+2] DHCP: update MAC for peek2001 [puppet] - 10https://gerrit.wikimedia.org/r/599035 (https://phabricator.wikimedia.org/T252210) (owner: 10Dzahn) [14:01:48] (03PS2) 10Dzahn: DHCP: update MAC for peek2001 [puppet] - 10https://gerrit.wikimedia.org/r/599035 (https://phabricator.wikimedia.org/T252210) [14:03:09] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:03:09] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11317 and previous config saved to /var/cache/conftool/dbconfig/20200527-140442-marostegui.json [14:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:47] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:02] (03CR) 10Privacybatm: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [14:07:14] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:07:14] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:35] (03CR) 10Muehlenhoff: [C: 03+2] Enable adduser/sysusers config for role(test) [puppet] - 10https://gerrit.wikimedia.org/r/599032 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [14:09:22] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) @Dzahn Please to not resolve yet. I still have mgmt DNS and switch port to remove. Thanks [14:13:42] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:13:42] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:18] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) @chasemp The VM has been created and I installed the OS and signed the puppet cert request. It is in site.pp with the "i... 
[14:16:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11318 and previous config saved to /var/cache/conftool/dbconfig/20200527-141635-marostegui.json [14:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) 05Open→03Resolved >>! In T252210#6169212, @Dzahn wrote: > @chasemp The VM has been created and I installed the OS... [14:16:53] (03CR) 10Jcrespo: "Don't worry, I got your question, ;-D I just want to give feedback early so you don't work a lot in a bad direction." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [14:17:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:17:47] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01261 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:19:31] (03PS1) 10Filippo Giunchedi: prometheus: rename Thanos jobs [puppet] - 10https://gerrit.wikimedia.org/r/599040 [14:20:11] (03PS2) 10Filippo Giunchedi: prometheus: rename Thanos jobs [puppet] - 10https://gerrit.wikimedia.org/r/599040 (https://phabricator.wikimedia.org/T233956) [14:20:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] add golang 1.13-1 builder image based on golang 1.13 using wikimedia-buster base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 (owner: 10Cwhite) [14:20:19] (03PS1) 10Papaul: DNS: Add production and 
mgmt DNS for rdb200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/599041 [14:21:34] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10Dzahn) Alright, done. Your SSH user exists now. [14:21:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: rename Thanos jobs [puppet] - 10https://gerrit.wikimedia.org/r/599040 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:21:48] (03PS1) 10Muehlenhoff: Enable managed adduser config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/599042 (https://phabricator.wikimedia.org/T235162) [14:21:50] (03CR) 10Cwhite: [C: 03+2] add golang 1.13-1 builder image based on golang 1.13 using wikimedia-buster base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 (owner: 10Cwhite) [14:21:54] (03PS7) 10Cwhite: add golang 1.13-1 builder image based on golang 1.13 using wikimedia-buster base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 [14:22:10] (03CR) 10Cwhite: [V: 03+2 C: 03+2] add golang 1.13-1 builder image based on golang 1.13 using wikimedia-buster base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 (owner: 10Cwhite) [14:25:01] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: change testjob to thumbnailRender [deployment-charts] - 10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:25:22] (03CR) 10Elukey: [C: 03+2] Decommission Maps & Search metrics legacy dashboards [puppet] - 10https://gerrit.wikimedia.org/r/596221 (https://phabricator.wikimedia.org/T252365) (owner: 10Bearloga) [14:26:27] (03CR) 10JMeybohm: [C: 03+2] admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 (owner: 10JMeybohm) [14:26:49] (03PS3) 10Hnowlan: changeprop-jobqueue: change testjob to thumbnailRender [deployment-charts] - 
10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) [14:29:44] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop-jobqueue: change testjob to thumbnailRender [deployment-charts] - 10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:29:56] (03PS4) 10JMeybohm: admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 [14:30:08] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:12] (03Merged) 10jenkins-bot: changeprop-jobqueue: change testjob to thumbnailRender [deployment-charts] - 10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:30:14] 10Operations, 10ORES, 10Release Pipeline (Blubber), 10Scoring-platform-team (Current): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10akosiaris) >>! In T210268#6167607, @Jdforrester-WMF wrote: >>>! In T210268#6167598, @ACraze wrote: >> A couple of questions here so far: >> > >... [14:32:43] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:32:43] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[14:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] (03PS1) 10Ssingh: dnsdist: add parameter to limit number of queries [puppet] - 10https://gerrit.wikimedia.org/r/599045 (https://phabricator.wikimedia.org/T252132) [14:34:58] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22814/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/599045 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:36:04] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:36:04] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:26] (03PS5) 10JMeybohm: admin: Fix cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 [14:38:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/599045 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:40:27] !log reprepro: upload conftool_1.3.1-1{,+deb10u1} to {stretch,buster}-wikimedia [14:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:59] (03PS1) 10Muehlenhoff: Create debmonitor-client system user using systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/599049 [14:41:45] (03PS2) 10Muehlenhoff: Create debmonitor-client system user using systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/599049 [14:42:20] (03CR) 10Ssingh: [C: 03+2] dnsdist: add parameter to limit number of queries [puppet] - 10https://gerrit.wikimedia.org/r/599045 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:43:06] 
!log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:43:06] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:33] !log cumin2001: upgrading python3-conftool and python3-conftool-dbctl [14:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:08] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:50:08] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:11] (03PS2) 10ArielGlenn: enable dumps of structured data from commons [puppet] - 10https://gerrit.wikimedia.org/r/598787 (https://phabricator.wikimedia.org/T221917) [14:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:26] (03PS3) 10Elukey: Improve performance of datapoint ingestion [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 [14:51:00] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003781 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:51:13] (03CR) 10ArielGlenn: [C: 03+2] enable dumps of structured data from commons [puppet] - 10https://gerrit.wikimedia.org/r/598787 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [14:51:28] !log cumin1001: upgrading python3-conftool and python3-conftool-dbctl [14:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:22] !log 
otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:56] whines about puppet on snapshot1008 will be fixed momentarily [14:52:58] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10wiki_willy) @Papaul - thanks for the heads up. Let me know what the cause ends up being (loose connection, bad part, etc) and I'll relay the information alon... [14:53:21] (03PS4) 10Elukey: Improve performance of datapoint ingestion [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 [14:53:37] (03CR) 10Elukey: "This is clearly not the right version, uploading a new one.." [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 (owner: 10Elukey) [14:54:31] !log volans@cumin1001 START - Cookbook sre.dns.netbox [14:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:09] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10wiki_willy) Thanks @jcrespo . I don't think @Jclark-ctr has been onsite at the data center since the last update, but I'll follow up with him on this when he's out there this week. T... [14:56:34] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:56:34] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:57] (03PS1) 10Esanders: Enable DiscussionTools as beta on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) [14:57:45] (03CR) 10jerkins-bot: [V: 04-1] Enable DiscussionTools as beta on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: 10Esanders) [14:58:03] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:29] (03PS1) 10ArielGlenn: fix up cron name of commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/599052 (https://phabricator.wikimedia.org/T221917) [14:58:48] !log updated tiller to 2.16.7-wmf1 for all services in cluster: staging [14:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:23] (03CR) 10ArielGlenn: [C: 03+2] fix up cron name of commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/599052 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [14:59:38] 10Operations, 10ORES, 10Release Pipeline (Blubber), 10Scoring-platform-team (Current): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani) >>! In T210268#6167598, @ACraze wrote: > * For the production image, we need to start a container to run the uwsgi service and also s... 
[14:59:51] (03CR) 10Jbond: Enable managed adduser config for ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599042 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [15:01:45] fixed, sorry about that [15:02:48] !log eqiad-prod: decom ms-be101[678] - T252008 [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:51] T252008: Decom ms-be101[678] - https://phabricator.wikimedia.org/T252008 [15:04:29] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) Hi all, Thanks again for taking the time to meet. @jbond in addition to enabling the Admin SDK API, I have en... [15:04:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/599049 (owner: 10Muehlenhoff) [15:05:52] (03CR) 10Volans: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/599049 (owner: 10Muehlenhoff) [15:18:53] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [15:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] (03CR) 10Papaul: [C: 03+2] DNS: Add production and mgmt DNS for rdb200[7-8] [dns] - 10https://gerrit.wikimedia.org/r/599041 (owner: 10Papaul) [15:26:26] 10Operations, 10conftool: dbctl gives user-hostile diffs - https://phabricator.wikimedia.org/T253025 (10CDanis) 05Open→03Resolved We seem to have successfully worked-around difflib's bug, and that fix is released. Despite the other patch, we don't actually use icdiff, as I didn't realize while writing/tes... 
[15:27:13] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) [15:28:42] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10AMooney) [15:28:52] (03PS5) 10Elukey: Improve performance of datapoint ingestion [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 [15:38:32] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Jdforrester-WMF) Sorry, my mistake, not two we... [15:38:52] (03PS4) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) [15:38:56] (03CR) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) (owner: 10Muehlenhoff) [15:39:29] (03PS1) 10Filippo Giunchedi: grafana: provision Thanos datasource [puppet] - 10https://gerrit.wikimedia.org/r/599059 (https://phabricator.wikimedia.org/T233956) [15:42:26] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10akosiaris) >>! In T182331#6167624, @ACraze wrote: > I'm wondering about pod size limits and what that means for our current architecture. I believe I've heard there is a 2GB... [15:43:32] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:14] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) Hi Harry, This has got me a bit further in the google API's however im still getting a 403 ` HttpError: (03PS2) 10Jforrester: Stop loading PerformanceInspector on any wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598891 (https://phabricator.wikimedia.org/T253689) [15:46:59] (03CR) 10Jforrester: [C: 03+2] Stop loading PerformanceInspector on any wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598891 (https://phabricator.wikimedia.org/T253689) (owner: 10Jforrester) [15:47:51] (03Merged) 10jenkins-bot: Stop loading PerformanceInspector on any wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598891 (https://phabricator.wikimedia.org/T253689) (owner: 10Jforrester) [15:49:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop loading PerformanceInspector on any wiki T253689 (duration: 01m 06s) [15:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:35] T253689: Undeploy PerformanceInspector from Beta Cluster - https://phabricator.wikimedia.org/T253689 [15:49:42] (03PS2) 10Jforrester: Stop defining wmgUsePerformanceInspector, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598892 (https://phabricator.wikimedia.org/T253689) [15:49:52] (03CR) 10Jforrester: [C: 03+2] Stop defining wmgUsePerformanceInspector, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598892 (https://phabricator.wikimedia.org/T253689) (owner: 10Jforrester) [15:50:52] (03Merged) 10jenkins-bot: Stop defining wmgUsePerformanceInspector, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598892 (https://phabricator.wikimedia.org/T253689) (owner: 10Jforrester) [15:51:46] (03PS2) 10Jforrester: Stop loading i18n 
for PerformanceInspector, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598893 (https://phabricator.wikimedia.org/T253689) [15:51:48] (03CR) 10Volans: "@ebernhardson: as a follow up of our chat on IRC, some quick suggestion inline. But I didn't had a full check of all the puppet classes in" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598886 (owner: 10EBernhardson) [15:51:57] (03CR) 10Jforrester: [C: 03+2] Stop loading i18n for PerformanceInspector, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598893 (https://phabricator.wikimedia.org/T253689) (owner: 10Jforrester) [15:52:01] !log updated tiller to 2.16.7-wmf1 for all services in kubernetes cluster: codfw [15:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:04] !log gehel@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [15:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:24] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) Ah, I wasn't sure if that was needed or not. I have enabled the option, let me know if you still get stuck. [15:52:41] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop defining wmgUsePerformanceInspector, unread T253689 (duration: 01m 04s) [15:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:45] (03Merged) 10jenkins-bot: Stop loading i18n for PerformanceInspector, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598893 (https://phabricator.wikimedia.org/T253689) (owner: 10Jforrester) [15:53:08] (03CR) 10Volans: [C: 03+1] "Change looks sane." 
[software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 (owner: 10Elukey) [15:54:09] (03PS1) 10Hnowlan: cpjobqueue: use https to talk to jobrunner and videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/599062 (https://phabricator.wikimedia.org/T220399) [15:55:34] (03PS2) 10Muehlenhoff: Enable managed adduser config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/599042 (https://phabricator.wikimedia.org/T235162) [15:55:41] (03CR) 10Muehlenhoff: Enable managed adduser config for ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599042 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [15:56:06] (03PS1) 10Ssingh: wikidough: allow traffic to tcp/443 (DoH port) [puppet] - 10https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) [15:57:36] (03PS1) 10Jforrester: Undeploy InterwikiSorting - I: Disable everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599064 (https://phabricator.wikimedia.org/T253764) [15:57:38] (03PS1) 10Jforrester: Undeploy InterwikiSorting - II: Drop loading ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599065 (https://phabricator.wikimedia.org/T253764) [15:57:40] (03PS1) 10Jforrester: Undeploy InterwikiSorting - III: Drop InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599066 (https://phabricator.wikimedia.org/T253764) [15:57:42] (03PS1) 10Jforrester: Undeploy InterwikiSorting - IV: Drop all config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599067 (https://phabricator.wikimedia.org/T253764) [15:57:44] (03PS1) 10Jforrester: Undeploy InterwikiSorting - V: Stop loading i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599068 (https://phabricator.wikimedia.org/T253764) [15:58:02] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22815/" [puppet] - 10https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:58:53] (03CR) 
10jerkins-bot: [V: 04-1] Undeploy InterwikiSorting - IV: Drop all config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599067 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [15:59:53] (03CR) 10Jforrester: [C: 04-2] "Before we deploy this, let's have Addshore and me carefully think through the implications. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599064 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [16:03:15] (03CR) 10Ppchelko: [C: 03+1] "Please self merge when needed for deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/599062 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:04:16] (03CR) 10Hnowlan: [C: 03+2] cpjobqueue: use https to talk to jobrunner and videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/599062 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:04:44] (03Merged) 10jenkins-bot: cpjobqueue: use https to talk to jobrunner and videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/599062 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:05:59] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) hi harry, Im still getting an error, my reading makes me believe that with the last permission i can now imperso... [16:06:23] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:25] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[16:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:31] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:22] (03CR) 10Alexandros Kosiaris: "Wow, thanks for this. I was planning to see if I can get rid of this shell script and rely just on helmfile, but I am meeting some issues " [deployment-charts] - 10https://gerrit.wikimedia.org/r/598973 (owner: 10JMeybohm) [16:12:00] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) This host isn't using JBOD so this bad disk can be replaced at any point. [16:16:10] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:07] !log hnowlan@deploy1001 Started deploy [cpjobqueue/deploy@c8c653e]: Disabling ThumbnailRender as a test of k8s cpjobqueue [16:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:04] !log hnowlan@deploy1001 Finished deploy [cpjobqueue/deploy@c8c653e]: Disabling ThumbnailRender as a test of k8s cpjobqueue (duration: 01m 57s) [16:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:05] 10Operations, 10ORES, 10Release Pipeline (Blubber), 10Scoring-platform-team (Current): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani) >>! In T210268#6167598, @ACraze wrote: > * Does the base image need to come from the wmf docker registry? If so, then it might make s... 
[16:34:01] Operations, observability: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (colewhite) [16:40:30] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:48:40] (PS1) Hnowlan: cpjobqueue: fix service name and metrics name [deployment-charts] - https://gerrit.wikimedia.org/r/599080 (https://phabricator.wikimedia.org/T220399) [16:49:14] (CR) Ppchelko: [C: +1] cpjobqueue: fix service name and metrics name [deployment-charts] - https://gerrit.wikimedia.org/r/599080 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan) [16:51:25] (CR) Hnowlan: [C: +2] cpjobqueue: fix service name and metrics name [deployment-charts] - https://gerrit.wikimedia.org/r/599080 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan) [16:51:51] (Merged) jenkins-bot: cpjobqueue: fix service name and metrics name [deployment-charts] - https://gerrit.wikimedia.org/r/599080 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan) [16:52:10] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:41] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:34] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[16:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:56] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:59:09] (PS2) Jforrester: Undeploy InterwikiSorting - IV: Drop all config [mediawiki-config] - https://gerrit.wikimedia.org/r/599067 (https://phabricator.wikimedia.org/T253764) [16:59:11] (PS2) Jforrester: Undeploy InterwikiSorting - V: Stop loading i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/599068 (https://phabricator.wikimedia.org/T253764) [16:59:13] (PS1) Jforrester: Provide wgULSCompactLinksPrepend as wgInterwikiSortingSortPrepend is going away [mediawiki-config] - https://gerrit.wikimedia.org/r/599081 [17:01:24] fyi: gerrit loading super slow for me atm [17:11:33] (CR) Cwhite: [C: +2] mtail: update varnishrls compatibility with rc35 [puppet] - https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite) [17:13:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:18:02] (PS11) Rush: peek: security team PM tooling [puppet] - https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [17:18:37] (PS1) Jdlrobson: Stop special casing the main page [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) [17:19:40] (CR) jerkins-bot: [V: -1] Stop special casing the main page [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) (owner: 
Jdlrobson) [17:24:34] (PS1) Rush: peek: add role for peek2001 [puppet] - https://gerrit.wikimedia.org/r/599086 (https://phabricator.wikimedia.org/T251784) [17:27:03] (CR) Rush: [C: +2] peek: add role for peek2001 [puppet] - https://gerrit.wikimedia.org/r/599086 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [17:27:06] (PS2) Jdlrobson: Stop special casing the main page [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) [17:30:36] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:32:26] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.34/extensions/Translate/: Deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Translate/+/599027/ to wmf.34 refs T253748 and T253022 (duration: 01m 07s) [17:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:31] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [17:32:31] T253748: Fatal: Class 'MessageIndexException' not found - https://phabricator.wikimedia.org/T253748 [17:40:04] (PS3) Gehel: maps: maps2002 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086) [17:40:20] !log repool maps2003 [17:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:26] (PS1) Rush: peek: add dummy lookup values [labs/private] - https://gerrit.wikimedia.org/r/599087 (https://phabricator.wikimedia.org/T251784) [17:40:58] (CR) Gehel: [C: +2] maps: maps2002 rejoins master [puppet] - https://gerrit.wikimedia.org/r/598964 (https://phabricator.wikimedia.org/T249086) (owner: Gehel) [17:41:10] (PS12) Rush: peek: security team PM tooling 
[puppet] - https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [17:42:39] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:48] (CR) Esanders: "Signed off by product in today's meeting." [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:44:33] (CR) Rush: [V: +2 C: +2] peek: add dummy lookup values [labs/private] - https://gerrit.wikimedia.org/r/599087 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [17:45:50] (CR) MarcoAurelio: [C: -1] Enable DiscussionTools as beta on mw.org (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:47:39] (PS2) Jforrester: Enable DiscussionTools as beta on mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:48:25] (PS3) Esanders: Enable DiscussionTools as beta on mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) [17:48:48] (PS4) Jforrester: Enable DiscussionTools as beta on mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:49:34] (CR) MarcoAurelio: [C: +1] Enable DiscussionTools as beta on mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:51:01] (CR) Jforrester: [C: +2] Enable DiscussionTools as beta on mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:51:53] (Merged) jenkins-bot: Enable DiscussionTools as beta on mw.org [mediawiki-config] - 
https://gerrit.wikimedia.org/r/599051 (https://phabricator.wikimedia.org/T251208) (owner: Esanders) [17:53:48] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable DiscussionTools as beta on mediawiki.org T251208 (duration: 01m 05s) [17:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:52] T251208: Deploy DiscussionTools to mw.org as a Beta Feature - https://phabricator.wikimedia.org/T251208 [17:54:06] Operations: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (CDanis) [17:54:36] * hauskatze enables DT on mediawiki [17:55:29] James_F: for some reason DiscussionTools ain't there yet. Caching maybe? [17:56:45] !log updated tiller to 2.16.7-wmf1 for all services in kubernetes cluster: eqiad [17:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:41] hauskatze: No. [17:57:42] (PS1) Jforrester: Follow-up 17c4f07a3: Also enable DiscussionTools at all on MW.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599091 [17:58:15] wmgUseDiscussionTools yeah, makes sense :) [17:58:34] (CR) Jforrester: [C: +2] Follow-up 17c4f07a3: Also enable DiscussionTools at all on MW.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599091 (owner: Jforrester) [17:59:44] (Merged) jenkins-bot: Follow-up 17c4f07a3: Also enable DiscussionTools at all on MW.org [mediawiki-config] - https://gerrit.wikimedia.org/r/599091 (owner: Jforrester) [18:00:04] twentyafterfour and James_F: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200527T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200527T1800). 
[18:00:04] Jdlrobson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:04] here! [18:06:27] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable DiscussionTools as beta on mediawiki.org, part II T251208 (duration: 01m 05s) [18:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:31] T251208: Deploy DiscussionTools to mw.org as a Beta Feature - https://phabricator.wikimedia.org/T251208 [18:07:29] PROBLEM - Check systemd state on ms-be1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:40] (PS1) Hashar: Merge tag 'debian/1.8.18-1_exp1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/599094 [18:13:54] James_F: would you be able to do SWAT or should I punt my SWAT to another day? [18:16:28] Jdlrobson: Pre-train deploy meeting, sorry. [18:16:59] Jdlrobson: This is just the main page stuff? [18:22:30] (PS5) Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [18:23:15] (CR) jerkins-bot: [V: -1] Write documentation using Sphinx [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: Privacybatm) [18:25:02] Operations, Continuous-Integration-Config, Developer Productivity, Doxygen, and 3 others: Update Doxygen in CI to 1.8.17 or greater - https://phabricator.wikimedia.org/T242155 (hashar) Doxygen 1.8.17 has a bunch of issues, notably on Wikibase. T253723 has some details and gdb stacktraces. 1.8.1... 
[18:27:03] Operations, Continuous-Integration-Infrastructure, Doxygen, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar) [18:27:38] Operations, Continuous-Integration-Infrastructure, Doxygen, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar) [18:28:24] (PS1) Bstorm: icinga: set all SMS user types for WMCS to email type [puppet] - https://gerrit.wikimedia.org/r/599097 [18:28:54] (PS2) Hashar: Merge tag 'debian/1.8.18-1_exp1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/599094 (https://phabricator.wikimedia.org/T253723) [18:29:35] Operations, Continuous-Integration-Infrastructure, Doxygen, Patch-For-Review, and 2 others: Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar) [18:30:21] (PS2) EBernhardson: Rename role::wdqs to role::wdqs::public [puppet] - https://gerrit.wikimedia.org/r/598884 [18:30:23] (PS2) EBernhardson: Define a shared profile to remove duplication in roles [puppet] - https://gerrit.wikimedia.org/r/598885 [18:32:32] (CR) Privacybatm: "I have added sphinx as a requirement for tox, Still, jenkins-bot failed?!" 
(2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: Privacybatm) [18:34:19] (CR) Andrew Bogott: [C: +1] icinga: set all SMS user types for WMCS to email type [puppet] - https://gerrit.wikimedia.org/r/599097 (owner: Bstorm) [18:34:53] (PS13) Rush: peek: security team PM tooling [puppet] - https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [18:35:23] RECOVERY - Check systemd state on ms-be1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:20] Operations, netops, Patch-For-Review: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (CDanis) One more of these today: `May 27 18:33:01 netflow3001 nfacctd[28442]: INFO ( default_kafka/kafka ): [/etc/pmacct/librdkafka.conf] Reading librdkafka glob... [18:36:25] (CR) Privacybatm: "I don't understand why!" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: Privacybatm) [18:36:50] (CR) Rush: [C: +2] peek: security team PM tooling [puppet] - https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [18:37:51] (PS2) Bstorm: icinga: set all SMS user types for WMCS to email type [puppet] - https://gerrit.wikimedia.org/r/599097 [18:38:48] (CR) Bstorm: "I'd missed one of andrew's mentions in the first patch, and I had an unsaved change that I needed to add. This one is complete." 
[puppet] - https://gerrit.wikimedia.org/r/599097 (owner: Bstorm) [18:41:06] (PS1) Rush: peek: add profile to role [puppet] - https://gerrit.wikimedia.org/r/599098 (https://phabricator.wikimedia.org/T251784) [18:41:24] James_F: it's just main page yeh so not urgent [18:41:37] but it would be one less thing off my plate if someone was able to swat today. no big deal. [18:41:40] !log joal@deploy1001 Started deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train [8a3dcb3] [18:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:07] (CR) Rush: [C: +2] peek: add profile to role [puppet] - https://gerrit.wikimedia.org/r/599098 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [18:43:54] Jdlrobson: I can sling it out later today, no worries. [18:44:25] Operations, netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down - TTN-0004110251 - https://phabricator.wikimedia.org/T253610 (ayounsi) As of 16min ago: > Zayo has opened a case against your service. TTN-0004116026 > We are investigating a possible interruption, which may be impacting your service, and... 
[18:48:40] (PS6) WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) [18:49:54] (PS1) Rush: peek: fix require resource definition for template dir [puppet] - https://gerrit.wikimedia.org/r/599099 (https://phabricator.wikimedia.org/T251784) [18:50:52] (PS7) WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) [18:50:54] (CR) Rush: [C: +2] peek: fix require resource definition for template dir [puppet] - https://gerrit.wikimedia.org/r/599099 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [18:51:01] (CR) WMDE-leszek: Wikidata: Define entity sources configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [18:51:39] (PS6) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) [18:52:31] (CR) jerkins-bot: [V: -1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [18:54:23] (PS7) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) [18:54:43] (CR) WMDE-leszek: Wikidata client wikis: Define entity sources configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [18:55:19] (CR) jerkins-bot: [V: -1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 
(https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [18:58:47] (PS1) Rush: peek: update remote repo resource [puppet] - https://gerrit.wikimedia.org/r/599101 (https://phabricator.wikimedia.org/T251784) [19:00:04] twentyafterfour and James_F: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200527T1900). [19:00:15] (CR) Rush: [C: +2] peek: update remote repo resource [puppet] - https://gerrit.wikimedia.org/r/599101 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [19:02:53] twentyafterfour: Landing fixes for T253725, hopefully. [19:02:53] T253725: Call to a member function getUser() on boolean ( CoreParserFunctions::revisionuser ?) - https://phabricator.wikimedia.org/T253725 [19:03:01] !log joal@deploy1001 Finished deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train [8a3dcb3] (duration: 21m 20s) [19:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:22] !log joal@deploy1001 Started deploy [analytics/refinery@8a3dcb3] (thin): Analytics regular weekly train THIN [8a3dcb3] [19:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:31] !log joal@deploy1001 Finished deploy [analytics/refinery@8a3dcb3] (thin): Analytics regular weekly train THIN [8a3dcb3] (duration: 00m 08s) [19:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:01] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:31] (PS3) Jforrester: Stop special casing the main page [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [19:05:33] Hi team - Errors from an-launcher1001 are taken care of by ottomata and myself (mostly ottomata :) [19:05:43] Given I'm waiting, will deploy Jdlrobson's config change from earlier. [19:05:55] (CR) Krinkle: [C: +1] "Based on a very rudimentary search, it seems there are no instances of WANCache interaction in Wikibase-related code that bypasses makeKey" [mediawiki-config] - https://gerrit.wikimedia.org/r/598852 (owner: Aaron Schulz) [19:06:03] (PS4) Jforrester: Stop special casing the main page on mobile for twelve wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [19:06:12] (CR) Jforrester: [C: +2] Stop special casing the main page on mobile for twelve wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [19:06:19] !log joal@deploy1001 Started deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train (an-launcher1001 only) [8a3dcb3] [19:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:47] Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) >>! In T205619#6168027, @Aklapper wrote: > That has always and ever existed at https://commons.wikimedia.org/wiki/Special:Upload ...... 
[19:07:18] (Merged) jenkins-bot: Stop special casing the main page on mobile for twelve wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599085 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [19:09:24] !log jforrester@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: T32405 Stop special casing the main page on mobile for twelve wikis (duration: 01m 05s) [19:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:28] T32405: [EPIC] MobileFrontend extension should stop special-casing main page - https://phabricator.wikimedia.org/T32405 [19:09:48] Jdlrobson: Deployed. Looks OK to me on a few I checked. [19:09:52] thanks James_F [19:10:09] and yep working great [19:10:20] Happy to help. [19:11:29] (CR) WMDE-leszek: [C: -1] Wikidata client wikis: Define entity sources configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [19:12:26] !log joal@deploy1001 Finished deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train (an-launcher1001 only) [8a3dcb3] (duration: 06m 07s) [19:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:40] (PS8) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) [19:14:01] (PS6) WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) [19:14:35] (CR) jerkins-bot: [V: -1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [19:14:55] (CR) jerkins-bot: [V: -1] Commons: Define entity sources configuration [mediawiki-config] - 
https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [19:16:50] (PS7) WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) [19:17:00] (CR) WMDE-leszek: Commons: Define entity sources configuration (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [19:17:56] (CR) jerkins-bot: [V: -1] Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [19:19:28] (PS1) Rush: WIP peek: update dependencies [puppet] - https://gerrit.wikimedia.org/r/599103 (https://phabricator.wikimedia.org/T251784) [19:19:36] (CR) WMDE-leszek: "I have to admit that this test failure is a bit of mystery to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [19:20:07] (CR) jerkins-bot: [V: -1] WIP peek: update dependencies [puppet] - https://gerrit.wikimedia.org/r/599103 (https://phabricator.wikimedia.org/T251784) (owner: Rush) [19:21:14] * James_F sighs at CI taking so long. [19:24:27] (PS3) EBernhardson: Rename role::wdqs to role::wdqs::public [puppet] - https://gerrit.wikimedia.org/r/598884 [19:30:23] (CR) Andrew Bogott: [C: +1] "looks right to me! Although it looked right to me before also :)" [puppet] - https://gerrit.wikimedia.org/r/599097 (owner: Bstorm) [19:34:36] (PS3) EBernhardson: Define a shared profile to remove duplication in roles [puppet] - https://gerrit.wikimedia.org/r/598885 [19:45:41] twentyafterfour: OK, T253725 fix syncing now, works in mwdebug1001. Train should be unblocked, hopefully. 
[19:45:42] T253725: Call to a member function getUser() on boolean ( CoreParserFunctions::revisionuser ?) - https://phabricator.wikimedia.org/T253725 [19:45:55] James_F: wow awesome [19:46:25] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.34/includes/parser/CoreParserFunctions.php: T253725 Partially revert 'Fix impedance mismatch with Parser::getRevisionRecordObject()' (duration: 01m 05s) [19:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:29] There we go. [19:50:03] * twentyafterfour deploys to group1 [19:51:09] (PS1) 20after4: group1 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - https://gerrit.wikimedia.org/r/599110 [19:51:11] (CR) 20after4: [C: +2] group1 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - https://gerrit.wikimedia.org/r/599110 (owner: 20after4) [19:51:59] (Merged) jenkins-bot: group1 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - https://gerrit.wikimedia.org/r/599110 (owner: 20after4) [19:56:05] !log twentyafterfour@deploy1001 scap failed: average error rate on 4/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details) [19:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:12] lovely [19:57:38] LogicException from line 74 of /srv/mediawiki/php-1.35.0-wmf.34/extensions/Wikibase/lib/includes/Store/Sql/Terms/TermStoreWriterFactory.php: Local entity source does not have items. [19:57:56] That was happening before, wasn't it? [19:58:06] ISTR seeing it on wikidata. [19:58:06] Eh. 
[19:58:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:59:39] (CR) Jforrester: Wikidata client wikis: Define entity sources configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek) [20:00:04] halfak and accraze: Dear deployers, time to do the Services – Graphoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200527T2000). [20:00:20] https://phabricator.wikimedia.org/T253804 [20:00:36] Fun. [20:00:43] Sorry for giving you hope [20:00:47] (PS1) 20after4: group1 wikis to 1.35.0-wmf.32 refs T253022 [mediawiki-config] - https://gerrit.wikimedia.org/r/599111 [20:00:49] (CR) 20after4: [C: +2] group1 wikis to 1.35.0-wmf.32 refs T253022 [mediawiki-config] - https://gerrit.wikimedia.org/r/599111 (owner: 20after4) [20:00:57] !log gehel@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [20:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:48] (Merged) jenkins-bot: group1 wikis to 1.35.0-wmf.32 refs T253022 [mediawiki-config] - https://gerrit.wikimedia.org/r/599111 (owner: 20after4) [20:03:22] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.32 refs T253022 [20:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:26] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [20:04:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:04:27] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.32 refs T253022 (duration: 01m 04s) [20:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:31] (PS1) Joal: Add page_restrictions to analytics sqooped tables [puppet] - https://gerrit.wikimedia.org/r/599112 (https://phabricator.wikimedia.org/T251749) [20:07:25] deployment done - gone for tonight team [20:08:18] !log mbsantos@deploy1001 Started deploy [mobileapps/deploy@9dc827f]: Update mobileapps to b3b9214c (T253648) [20:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] T253648: Remove title and page description from Main Page and non-mainspace pages in mobile-html - https://phabricator.wikimedia.org/T253648 [20:10:59] (PS1) Reedy: Remove PHP version if around $wgOverrideUcfirstCharacters [mediawiki-config] - https://gerrit.wikimedia.org/r/599113 [20:11:41] (PS2) Reedy: Remove PHP version if around $wgOverrideUcfirstCharacters [mediawiki-config] - https://gerrit.wikimedia.org/r/599113 [20:11:49] !log mbsantos@deploy1001 Finished deploy [mobileapps/deploy@9dc827f]: Update mobileapps to b3b9214c (T253648) (duration: 03m 31s) [20:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:30] ^ joewalshwmf [20:12:42] mateusbs17: thanks! 
[20:16:27] Operations, DNS, Domains, Traffic: Create diff.wikimedia.org subdomain - https://phabricator.wikimedia.org/T253807 (CKoerner_WMF) [20:22:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:23:27] (CR) EBernhardson: "Thanks! I've been toying with this today, the general concept helps but I'm still running into conceptual issues. First off we don't have " [puppet] - https://gerrit.wikimedia.org/r/598886 (owner: EBernhardson) [20:24:40] PROBLEM - MariaDB read only s4 #page on db1138 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 187s, 1553.58 QPS, connection latency: 0.001497s, query latency: 0.000270s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:25:00] around, looking [20:25:03] here [20:25:06] here [20:25:06] hey [20:25:14] I'm around, but it's getting late [20:25:21] woot? [20:25:24] here but also etting late [20:25:25] * volans around [20:25:29] hello marostegui ! [20:25:30] here, no expertise but can take direction. [20:25:32] yeah same here, I'm around too (just finished watching the aborted launch) [20:25:58] mysql:root@localhost [(none)]> show global status like 'Uptime'; [20:25:58] +---------------+-------+ [20:25:58] | Variable_name | Value | [20:25:58] +---------------+-------+ [20:25:58] | Uptime | 262 | [20:25:59] +---------------+-------+ [20:25:59] 1 row in set (0.00 sec) [20:26:07] s4 master [20:26:07] <_joe_> o/ [20:26:12] marostegui: how can I help? 
[20:26:13] spike of MW fatals, peaked around 2022Z, not sure if related [20:26:31] May 27 20:19:45 db1138 mysqld[163407]: 200527 20:19:45 [ERROR] mysqld got signal 7 ; [20:26:33] <_joe_> rzl: probably related given the db just came back up [20:26:43] s4 master got restarted [20:26:47] mysqld I mean [20:26:52] looks like the process crashed [20:27:03] [33134188.617181] Memory failure: 0x7d3c38f: Killing mysqld:163407 due to hardware memory corruption [20:27:03] [33134188.626049] Memory failure: 0x7d3c38f: recovery action for dirty LRU page: Recovered [20:27:12] SIGBUS [20:27:23] marostegui: do you need any extra hands or should we go away and let you do DBA stuff to it? [20:27:31] May 27 20:21:00 db1138 kernel: [33134263.297543] MCE: Killing mysqld:163468 due to hardware memory corruption fault at 7feced3dc580 [20:27:36] Going to reduce pool size, restart mysqld again and then we can handle this with onsite [20:27:38] oh sorry, I'm late [20:27:47] <_joe_> volans: oh joy [20:27:56] also around [20:28:02] * jayme here - if that helps :) [20:28:32] Uncorrected hardware memory error in user-access at 7d3c38f580 [20:28:47] !log Decrease innodb poolsize on s4 master and restart mysql [20:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:11] * shdubsh here [20:29:21] do we have already a task or doc? [20:29:25] to paste stuff [20:31:20] <_joe_> everything's normal on the application side, except for a certain number of errors [20:31:24] ok, we should be good now [20:31:26] 05/27/2020 20:20:26 Critical: "Multi-bit memory errors detected on a memory device at location(s) DIMM_B4." in SEL on db1138 [20:31:53] can someone check if recentchanges on commons is getting new stuff? 
[20:31:58] RECOVERY - MariaDB read only s4 #page on db1138 is OK: Version 10.1.43-MariaDB, Uptime 139s, read_only: False, 1284.21 QPS, connection latency: 0.001735s, query latency: 0.000269s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:32:10] <_joe_> errors are down [20:32:14] marostegui: it's gotten some edits in the past minute [20:32:16] <_joe_> marostegui: yes [20:32:18] cool [20:32:22] yeah, commons is writable again [20:32:24] * apergos looks in [20:32:26] I will create a task to follow this up and do a master failover [20:32:36] marostegui: looks like it [20:33:05] ugh, I did not even register the first page, sorry about that [20:33:18] also, we might have been able to catch it one earlier before it eventually bailed it seems: from SEL before: 05/27/2020 01:33:44 Critical: Correctable memory error rate exceeded for DIMM_B4. [20:33:55] yes [20:33:58] I was looking at the same [20:34:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:34:40] FYI: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Unable_to_delete reports recovery [20:34:47] 10Operations, 10ops-eqiad, 10DBA: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) [20:35:15] AntiComposite: thank you :] [20:37:13] 10Operations, 10ops-eqiad, 10DBA: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10CDanis) p:05Triage→03High [20:37:49] so I gather this was entirely hardware and not related to any new software deployment [20:37:58] twentyafterfour: correct [20:38:13] for what we know so far [20:38:20] the train for group1 died on errors for commons during the canary checks and I rolled back group1 to wmf.32 [20:38:25] where did you get the SEL logs from, moritzm || volans? From the management interfaces directly? [20:38:50] jayme: yes, either 'racadm getsel' or 'racadm lclog view' [20:39:11] in the mgmt console [20:39:23] or "show /system/log" on HP machines [20:39:36] 10Operations, 10ops-eqiad, 10DBA: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) I am running a compare from this host to its candidate master (db1081) to make sure we are good for Friday. 
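The SEL entries quoted above follow a `MM/DD/YYYY HH:MM:SS Severity: message` shape. As a minimal sketch (not a tool used in this channel), the snippet below parses lines in that shape and pulls out the memory-related events per DIMM, which is roughly what an early warning on correctable ECC errors would need. The function names and the regular expressions are assumptions made for illustration; raw `racadm getsel` output may be formatted differently and would need its own pattern.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the entries pasted in the channel:
#   05/27/2020 01:33:44 Critical: Correctable memory error rate exceeded for DIMM_B4.
SEL_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (?P<severity>\w+): (?P<message>.*)$"
)
DIMM = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel_line(line):
    """Return a dict for one SEL-style line, or None if it does not match."""
    match = SEL_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%m/%d/%Y %H:%M:%S"),
        "severity": match["severity"],
        "message": match["message"],
        "dimms": DIMM.findall(match["message"]),
    }


def memory_events(lines):
    """Keep only entries that mention memory and name a DIMM, oldest first."""
    events = []
    for line in lines:
        entry = parse_sel_line(line)
        if entry and entry["dimms"] and "memory" in entry["message"].lower():
            events.append(entry)
    return sorted(events, key=lambda entry: entry["time"])


SAMPLE = [
    # The two entries below are the ones quoted in the log.
    '05/27/2020 20:20:26 Critical: "Multi-bit memory errors detected on a memory device at location(s) DIMM_B4."',
    "05/27/2020 01:33:44 Critical: Correctable memory error rate exceeded for DIMM_B4.",
    # Hypothetical non-memory entry, included only to show filtering.
    "05/27/2020 20:21:10 Normal: Example unrelated event.",
]

events = memory_events(SAMPLE)
for event in events:
    print(event["time"].isoformat(), event["dimms"], event["message"])
```

Sorting by timestamp makes the point raised in the channel visible: the correctable-error entry for DIMM_B4 precedes the multi-bit failure by many hours, so a check on the first kind of entry could have flagged the DIMM before the crash.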
[20:39:43] that reminds me that we should add those in the pages where missing in https://wikitech.wikimedia.org/wiki/Platform-specific_documentation [20:40:24] This is good, we can get this master failed over so I can finish a schema change on the image table \o/ [20:40:32] lol [20:40:32] Anyways, I am off to sleep [20:40:36] (03PS3) 10Gehel: maps: maps2001 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598965 (https://phabricator.wikimedia.org/T249086) [20:40:37] Call me if needed [20:41:23] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) [20:41:33] (03CR) 10Gehel: [C: 03+2] maps: maps2001 rejoins master [puppet] - 10https://gerrit.wikimedia.org/r/598965 (https://phabricator.wikimedia.org/T249086) (owner: 10Gehel) [20:41:45] yeah volans - I haven't even found those commands in https://wikitech.wikimedia.org/wiki/Management_Interfaces [20:42:16] that page is all about remote IPMI [20:42:22] and local IPMI [20:42:35] `sudo ipmi-sel` is another route to see SEL logs from within the OS [20:43:10] cit. "There should be one-- and preferably only one --obvious way to do it." [20:43:13] :-P [20:43:26] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init [20:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:47] hah this quote clearly not written by a sysadmin [20:44:01] someone who hates Perl? [20:44:35] oh - I thought that those are IPMI commands. Got you wrong then [20:44:37] gehel: you can't speak, running a postgresql init cookbook right after a master mariadb died... it's clearly a snub! [20:44:54] jayme: no those you run after ssh $host.mgmt.$dc.wmnet [20:45:03] *after you ssh into [20:45:03] gehel is clearly trolling us :-) [20:45:23] indeed [20:45:34] nah... 
that postgres server has been disconnected from its master for 4 month, I can't really brag [20:45:44] maybe only 2, but still :/ [20:46:33] ACKNOWLEDGEMENT - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel tilerator is disabled during data reload https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:41] jayme: for some servers you need to force legacy SSH options when connecting to the mgmt, some of the sshd on there are so old that they don't offer the default kexes requested by SSH 7.x (unless you make it ask for legacy crypto) [20:47:17] typically oKexAlgorithms to d-h-group14-sha1 does the trick [20:47:39] jayme: added https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Show_logs [20:47:45] and similar to the other pages [20:48:13] pretty sure it is reasonable to say that The Zen of Python was written by a PERL hater, or at least someone who escaped from loving PERL [20:48:34] volans: <3 [20:49:03] (03PS8) 10Jforrester: Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [20:49:05] (03PS9) 10Jforrester: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [20:49:07] (03PS1) 10Jforrester: testNoAmbiguouslyTaggedSettings: Re-work to identify the dblists and values at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599122 [20:49:56] (03CR) 10jerkins-bot: [V: 04-1] Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [20:50:02] (03CR) 10jerkins-bot: [V: 04-1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [20:50:05] (03CR) 10jerkins-bot: [V: 04-1] testNoAmbiguouslyTaggedSettings: Re-work to identify the dblists and values at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599122 (owner: 10Jforrester) [20:51:10] (03CR) 10Bstorm: [C: 03+2] "I think this should be good to merge overall. I just want to at very least keep the reviewers informed at this point." [puppet] - 10https://gerrit.wikimedia.org/r/599097 (owner: 10Bstorm) [20:53:19] 10Operations, 10observability: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10MoritzMuehlenhoff) [20:55:11] (03PS2) 10Jforrester: testNoAmbiguouslyTaggedSettings: Re-work to identify the dblists and values at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599122 [20:55:41] (03PS9) 10Jforrester: Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [20:58:10] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) ` [edit interfaces interface-range vlan-private1-a-codfw] member ge-6/0/23 { ... } + member ge-5/0/8; [edit interfaces interface-range disabled] - member ge-5/0/... [20:59:01] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) [21:12:59] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10wiki_willy) a:03Jclark-ctr [21:15:15] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10wiki_willy) @Marostegui - will do, Papaul and John are working on pulling the TSR right now for the RMA. 
Thanks, Willy [21:34:18] (03PS1) 10Jforrester: [DNM] Split Wikidata and Commons out of group1, into groupD and C respectively [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599125 (https://phabricator.wikimedia.org/T223410) [21:35:21] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Split Wikidata and Commons out of group1, into groupD and C respectively [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599125 (https://phabricator.wikimedia.org/T223410) (owner: 10Jforrester) [21:38:32] (03PS2) 10Jforrester: [DNM] Split Wikidata and Commons out of group1, into groupD and C respectively [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599125 (https://phabricator.wikimedia.org/T223410) [21:51:25] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Upgrade Netbox to v2.8.4-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/595717 (owner: 10CRusnov) [21:57:09] !log crusnov@deploy1001 Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.1 (part1) [21:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:10] !log crusnov@deploy1001 Finished deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.1 (part1) (duration: 01m 01s) [21:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:52] !log crusnov@deploy1001 Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part2) [21:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:23] !log crusnov@deploy1001 deploy aborted: Netbox Upgrade to 2.8.4 (part2) (duration: 01m 31s) [22:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:27] !log crusnov@deploy1001 Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part3) [22:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:56] !log crusnov@deploy1001 Finished deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part3) (duration: 01m 29s) [22:01:58] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:00] !log crusnov@deploy1001 Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part4) [22:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:11] !log crusnov@deploy1001 Finished deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part4) (duration: 00m 10s) [22:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:21] Are zuul and jenkins being upgraded? [22:04:43] See -releng. Just restarting jenkins. [22:07:21] * RhinosF1 hopes SWAT is still going in an hour [22:08:42] 10Operations, 10SRE-Access-Requests: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10MBinder_WMF) I can confirm that changing those lines in the config to match the private key file name let me log in. Thanks... [22:11:01] (03PS5) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [22:29:30] (03PS2) 10Andrew Bogott: profile::openstack::base::designate::service: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/598112 (https://phabricator.wikimedia.org/T251604) [22:29:32] (03PS1) 10Andrew Bogott: codfw1dev: add cloudservices2003-dev to the designate host list [puppet] - 10https://gerrit.wikimedia.org/r/599133 (https://phabricator.wikimedia.org/T253780) [22:29:49] (03PS2) 10Andrew Bogott: codfw1dev: add cloudservices2003-dev to the designate host list [puppet] - 10https://gerrit.wikimedia.org/r/599133 (https://phabricator.wikimedia.org/T253780) [22:29:51] (03PS3) 10Andrew Bogott: profile::openstack::base::designate::service: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/598112 (https://phabricator.wikimedia.org/T251604) [22:30:45] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: add 
cloudservices2003-dev to the designate host list [puppet] - 10https://gerrit.wikimedia.org/r/599133 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [22:37:01] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:57:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 49 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:58:14] !log gehel@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [22:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200527T2300). [23:00:05] RhinosF1: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] * RhinosF1 waves [23:00:43] I'm here, I can do the SWAT in a minute [23:01:12] any order [23:10:36] (03PS4) 10Catrope: Add thwiki's draft namespace to wmgExemptFromUserRobotsControlExtra and enable VE. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/597029 (https://phabricator.wikimedia.org/T252959) (owner: 10RhinosF1) [23:10:56] (03CR) 10Catrope: [C: 03+2] Add thwiki's draft namespace to wmgExemptFromUserRobotsControlExtra and enable VE. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597029 (https://phabricator.wikimedia.org/T252959) (owner: 10RhinosF1) [23:11:09] (03PS4) 10Catrope: Add autoreviewrestore into the rollbacker group in hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986) (owner: 10RhinosF1) [23:11:42] (03Merged) 10jenkins-bot: Add thwiki's draft namespace to wmgExemptFromUserRobotsControlExtra and enable VE. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597029 (https://phabricator.wikimedia.org/T252959) (owner: 10RhinosF1) [23:14:02] RoanKattouw: thwiki working [23:15:13] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::base::designate::service: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/598112 (https://phabricator.wikimedia.org/T251604) (owner: 10Andrew Bogott) [23:15:25] (03PS1) 10Andrew Bogott: M5 grants: allow designate access via ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/599137 (https://phabricator.wikimedia.org/T253780) [23:16:27] (03CR) 10Catrope: [C: 03+2] Add autoreviewrestore into the rollbacker group in hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986) (owner: 10RhinosF1) [23:16:32] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add thwiki Draft namespace to wmgExemptFromUserRobotsControlExtra and enable VE there (T252959) (duration: 01m 06s) [23:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:36] T252959: Set additional config for draft/draft talk namespace on thwiki - https://phabricator.wikimedia.org/T252959 [23:17:20] (03Merged) 10jenkins-bot: Add autoreviewrestore into the rollbacker group 
in hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986) (owner: 10RhinosF1) [23:18:53] RhinosF1: hiwiki now ready on mwdebug1002 [23:19:27] RoanKattouw: working [23:20:55] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add autoreviewrestore right to rollbacker group on hiwiki (T252986) (duration: 01m 05s) [23:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:58] T252986: Add autoreviewrestore into the rollbacker group in hiwiki - https://phabricator.wikimedia.org/T252986 [23:21:21] RoanKattouw: ty [23:21:31] * RhinosF1 sleep [23:25:03] 10Puppet, 10Puppet-infrastructure-modernization, 10cloud-services-team (Kanban): broken puppet on codfw1dev VMs - https://phabricator.wikimedia.org/T253817 (10Andrew) [23:26:56] (03PS1) 10BryanDavis: dynamicproxy: Short-circuit urlproxy lookups against canonical_domain [puppet] - 10https://gerrit.wikimedia.org/r/599139 (https://phabricator.wikimedia.org/T253816) [23:27:34] (03CR) 10Krinkle: [C: 03+1] Optimise all static PNGs losslessly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [23:31:14] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Jclark-ctr) Sent TSR report to Dell Confirmed: Service Request 1025886499 was successfully submitted. [23:33:21] (03PS1) 10Ryan Kemper: debian/changelog: Use spaces between maintainer and date [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/599140 [23:36:10] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10wiki_willy) a:03Cmjohnson @Cmjohnson - looks like we're right on the border with the warranty for this one. Netbox shows May 12, 2017 as the install date. Can you see if the HP site allows us to RMA... 
[23:49:58] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10MBeat33) Thank you @Dzahn for the substantial update. I really appreciate it and am learning this is a pretty layered question. I am meeting with Josephine tomorrow to see if we can push forward any of the... [23:51:07] (03PS2) 10Ryan Kemper: debian/changelog: Make spaces match previous entry [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/599140 [23:57:26] 10Operations, 10Privacy Engineering, 10Research, 10Traffic, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10bmansurov) @JFishback_WMF, issues mentioned at T251732#6158467 have been addressed.