[00:01:50] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Later), 10Services (next): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) a:03holger.knust We need to locally test whether the `bynuan-syslog-udp` actually wor... [00:08:19] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:19] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:19] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:21] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:39] PROBLEM - recommendation_api endpoints health on 
scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:39] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:47] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:08:53] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:09:15] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) [00:30:03] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:57:09] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [01:20:55] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 60.16 seconds [01:59:31] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:26:49] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [03:29:03] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:40:45] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Tgr) [03:56:27] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [04:58:45] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:25:59] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [05:34:03] (03CR) 10CRusnov: [V: 03+2] Properly detect connected ports. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487599 (owner: 10CRusnov) [06:31:31] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:43:09] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.26 seconds [06:57:55] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:29:11] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:37:17] ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Vgutierrez T201366 [07:43:40] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy (noop in upload, expected change in text): https://puppet-compiler.wmflabs.org/compiler1002/14504/" [puppet] - 10https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [07:50:16] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.8 to changelog [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485014 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [07:52:13] (03Merged) 10jenkins-bot: debian: Add release 0.8 to changelog [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485014 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [07:54:01] (03CR) 10jenkins-bot: debian: Add release 0.8 to changelog [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485014 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [07:56:44] !log uploaded certcentral 0.8 to apt.wikimedia.org (stretch) - T209980 T213820 T213301 [07:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:56] T213301: Avoid inter-hosts puppet dependencies on certificate deployment - https://phabricator.wikimedia.org/T213301 [07:56:56] T209980: certcentral crashes on network errors - https://phabricator.wikimedia.org/T209980 [07:56:57] T213820: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820 [07:57:29] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [07:57:51] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 47.38 seconds [08:05:26] (03PS2) 10Giuseppe Lavagetto: Add a prune action [docker-images/docker-pkg] - 
10https://gerrit.wikimedia.org/r/485499 (https://phabricator.wikimedia.org/T207703) [08:05:28] (03PS1) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 [08:06:51] (03CR) 10jerkins-bot: [V: 04-1] Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [08:17:53] (03CR) 10Vgutierrez: [C: 03+1] "LGTM after Iafcea7f606663b9dbf42faa3ee87717150c90288" [puppet] - 10https://gerrit.wikimedia.org/r/485823 (https://phabricator.wikimedia.org/T212251) (owner: 10Alexandros Kosiaris) [08:48:40] !log fixing dbstore1002 x1 replication [08:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:53] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:59:33] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:01:09] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27847 MB (5% inode=99%) [09:02:27] RECOVERY - Disk space on elastic1017 is OK: DISK OK [09:17:19] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table enwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000331, end_log_pos 437043378 [09:26:43] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [09:32:59] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:03:25] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:13:15] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:16:18] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27644 MB (5% inode=99%) [10:20:13] RECOVERY - Disk space on elastic1017 is OK: DISK OK [10:20:30] ^shards are relocating. Should be fine soon. wikidata wiki seems to be growing at an amazing rate [10:28:59] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1030). [10:56:13] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [11:03:22] (03PS2) 10Jbond: Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486464 [11:04:15] !log installing ghostscript security updates [11:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:53] (03PS2) 10Muehlenhoff: disabling user mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/486111 (https://phabricator.wikimedia.org/T214498) (owner: 10RobH) [11:12:51] (03CR) 10Muehlenhoff: [C: 03+2] disabling user mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/486111 (https://phabricator.wikimedia.org/T214498) (owner: 10RobH) [11:13:27] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [11:15:32] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: remove shell access for mkroetzsch on 2019-01-26 - https://phabricator.wikimedia.org/T214498 (10MoritzMuehlenhoff) 05Open→03Resolved [11:24:37] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup] [11:27:31] (03PS1) 10Arturo Borrero Gonzalez: toolforge: bastion: introduce apt pinning for systemd [puppet] - 10https://gerrit.wikimedia.org/r/487823 (https://phabricator.wikimedia.org/T215154) [11:28:23] (03CR) 10Jbond: [C: 03+2] Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486464 (owner: 10Jbond) [11:28:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: bastion: introduce apt pinning for systemd [puppet] - 10https://gerrit.wikimedia.org/r/487823 (https://phabricator.wikimedia.org/T215154) (owner: 10Arturo Borrero Gonzalez) [11:28:46] (03PS3) 10Jbond: Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486464 [11:29:55] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:30:22] (03PS4) 10Jbond: Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486464 [11:38:13] PROBLEM - puppet last run on db1113 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:39:15] (03PS1) 10Jbond: Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/487825 [11:39:55] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:39:57] 10Operations: parsoid-vd - "no such file or directory, open '/srv/visualdiff/testreduce/testrun.ids" - https://phabricator.wikimedia.org/T215049 (10GTirloni) Related T201366 [11:40:03] ^^ db1113 is related to my change, reverting now [11:40:14] :-/ [11:40:36] (03CR) 10Jbond: [C: 03+2] Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/487825 (owner: 10Jbond) [11:41:37] what was the error?
a manual puppet run worked fine for me [11:42:21] 'ould not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Class[Base::Puppet]: has no parameter named 'puppet_major_version' at /etc/puppet/modules/profile/manifests/base.pp:42:5 on node db1113.eqiad.wmne' [11:43:29] RECOVERY - puppet last run on db1113 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:44:36] ah, ok. I might have run at the time when the revert was already merged [11:48:51] 10Operations, 10Analytics, 10Product-Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10aborrero) >>! In T212824#4922929, @Dzahn wrote: >>>! In T212824#4922713, @elukey wrote: >> @aborrero has already done a similar thing for the tool-forg... [11:51:40] (03CR) 10Vgutierrez: [C: 03+2] "pcc looking good: https://puppet-compiler.wmflabs.org/compiler1002/14507/" [puppet] - 10https://gerrit.wikimedia.org/r/483728 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [11:52:09] (03PS2) 10Vgutierrez: certcentral: Set authorized_hosts or regexes for every cert [puppet] - 10https://gerrit.wikimedia.org/r/483728 (https://phabricator.wikimedia.org/T213301) [11:55:21] (03PS2) 10Jbond: Revert "Revert "Add apt pinning for buster"" [puppet] - 10https://gerrit.wikimedia.org/r/486461 (owner: 10Jcrespo) [11:56:06] (03CR) 10Jbond: [C: 03+2] "checked db1113 with puppet-compile, believe last error was due to a race condition with rsync" [puppet] - 10https://gerrit.wikimedia.org/r/486461 (owner: 10Jcrespo) [12:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1200).
[12:00:05] Amir1 and Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] o/ [12:00:41] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:01:03] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:01:31] (03PS1) 10Jbond: Revert "Revert "Revert "Add apt pinning for buster""" [puppet] - 10https://gerrit.wikimedia.org/r/487827 [12:02:12] I have to wait until :15 to deploy my patch, is that okay? [12:02:33] Amir1 is up first anyways [12:02:36] (03CR) 10Jbond: [C: 03+2] Revert "Revert "Revert "Add apt pinning for buster""" [puppet] - 10https://gerrit.wikimedia.org/r/487827 (owner: 10Jbond) [12:02:39] PROBLEM - puppet last run on cloudvirtan1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:02:46] it's fine for me. [12:02:52] who's SWATing today? [12:03:25] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:03:31] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:03:31] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:04:07] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:04:21] PROBLEM - puppet last run on mc1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:04:29] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:04:33] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:04:45] ^^ sorry, these errors are related to my change: i pushed an old version of my patch set which included puppet-common. it has already been reverted [12:04:57] PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:04:59] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:01] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:15] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:18] Okay, I think I SWAT today then [12:05:29] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:30] jbond42: is it okay to move forward or should we wait? [12:05:31] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:32] jbond42: ^ [12:05:33] PROBLEM - puppet last run on mc2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:33] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:33] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:53] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:05:57] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:59] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:00] Amir1: please go ahead, i will wait until you have completed before progressing [12:06:03] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:17] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:23] PROBLEM - puppet last run on cloudvirtan1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:23] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:24] should we wait for the alerts to recover first? [12:06:35] before swat ? [12:06:41] PROBLEM - puppet last run on mc2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:51] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:11] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:13] PROBLEM - puppet last run on restbase2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:29] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:07:33] 10Operations, 10Gerrit, 10Icinga, 10monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (10hashar) Icinga does not monitor Gerrit CPU usage / system load. We would need to add the `check_load` plugin mentioned above by @Dzahn. [12:07:37] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:42] if you give me 5 mins i'll go and re-run puppet on the alerting nodes [12:08:00] It's fine for me, today's SWAT is small and fast [12:08:07] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:08:08] yeah, let’s take the time :) [12:08:47] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:08:47] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:09:00] ok i'll let you know when it's clear [12:09:05] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:05] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:23] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:25] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:09:25] PROBLEM - puppet last run on heze is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:37] PROBLEM - puppet last run on analytics1068 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:09:37] RECOVERY - puppet last run on mc1028 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:09:45] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:09:51] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:57] PROBLEM - puppet last run on thumbor2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:37] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:12:07] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:15:05] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:15:29] RECOVERY - puppet last run on aluminium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:15:33] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:15:35] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:15:49] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:15:59] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:16:05] RECOVERY - puppet last run on mc2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:16:05] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:16:07] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is 
currently enabled, last run 3 minutes ago with 0 failures [12:16:07] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:16:27] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:16:33] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:16:37] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:16:49] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:16:57] RECOVERY - puppet last run on cloudvirtan1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:16:57] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:17:15] RECOVERY - puppet last run on mc2019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:17:47] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:17:47] RECOVERY - puppet last run on restbase2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:18:05] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:18:13] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:18:43] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:19:39] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:19:57] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 3 
minutes ago with 0 failures [12:19:57] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:20:09] RECOVERY - puppet last run on analytics1068 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:20:25] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:20:31] RECOVERY - puppet last run on thumbor2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:23:51] RECOVERY - puppet last run on cloudvirtan1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:23:51] echo $? [12:24:33] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:27:13] everything okay now? [12:27:16] Amir1: sorry for the delay, should be all clear now [12:27:27] thank you! [12:27:33] Let's move forward [12:27:50] Lucas_WMDE: yes everything should be good now, may get one more recovery from labstore1007 but everything else is clear [12:27:55] okay thanks [12:28:49] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: 10Ladsgroup) [12:29:55] (03Merged) 10jenkins-bot: Populate wmgWikibaseRepoSpecialSiteLinkGroups for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: 10Ladsgroup) [12:30:11] (03CR) 10jenkins-bot: Populate wmgWikibaseRepoSpecialSiteLinkGroups for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: 10Ladsgroup) [12:30:13] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:32:32] (03PS6) 10Lucas Werkmeister (WMDE): Fix Wikidata base URI in client config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) [12:32:44] tested it on mwdebug1002, worked fine. Moving forward [12:34:01] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: 10Lucas Werkmeister (WMDE)) [12:34:12] Lucas_WMDE: is your patch testable on mwdebug1002? [12:34:18] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:487076|Populate wmgWikibaseRepoSpecialSiteLinkGroups for commonswiki (T213975)]] (duration: 00m 51s) [12:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] T213975: Link page from wikimedia.Commons with page exisiting on en.wikipedia with no existing item on wikidata fails - https://phabricator.wikimedia.org/T213975 [12:34:43] Amir1: I’m not sure how to test it, but I’ll still try it out just to make sure [12:34:51] okay noted [12:34:52] please don’t deploy for me, I’m supposed to be demonstrating this process :) [12:35:04] (03Merged) 10jenkins-bot: Fix Wikidata base URI in client config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: 10Lucas Werkmeister (WMDE)) [12:35:09] oh okay [12:35:15] I just merged it [12:35:17] sorry [12:35:20] SWAT is yours [12:36:41] change is on mwdebug1002, testing [12:37:43] (03PS1) 10Jbond: Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/487830 [12:37:56] seems to work, proceeding [12:40:05] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:477522|Fix Wikidata base URI in client config (T198946)]] (duration: 00m 46s) [12:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:08] T198946: Add Schema property 'sameAs' pointing to Wikidata entries - https://phabricator.wikimedia.org/T198946 [12:40:35] (03CR) 
10jenkins-bot: Fix Wikidata base URI in client config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: 10Lucas Werkmeister (WMDE)) [12:40:58] okay, I think I’m done [12:41:11] anything else [12:41:13] ? [12:41:28] !log EU SWAT done [12:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:48] (03CR) 10Jbond: [C: 03+2] Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/487830 (owner: 10Jbond) [12:46:29] (03PS1) 10Jbond: Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/487834 [12:58:53] (03Abandoned) 10Jbond: Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/487834 (owner: 10Jbond) [12:59:19] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:26:33] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [13:30:51] (03PS6) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [13:31:36] (03CR) 10Hashar: "Fix the test for DockerImageBuilder.scan in docker_pkg/tests/test_cli.py . 
The Mock was wrong :)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [13:31:57] (03CR) 10jerkins-bot: [V: 04-1] scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [13:33:36] (03CR) 10Muehlenhoff: admin: create new system groups for cloudelastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [13:36:19] (03CR) 10Mathew.onipe: admin: create new system groups for cloudelastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [13:40:58] (03PS1) 10Hashar: test python3.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487843 [13:42:21] (03CR) 10jerkins-bot: [V: 04-1] test python3.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487843 (owner: 10Hashar) [13:44:58] (03PS2) 10Muehlenhoff: keyholder: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/486446 [13:46:20] (03CR) 10Muehlenhoff: [C: 03+2] keyholder: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/486446 (owner: 10Muehlenhoff) [13:50:01] (03PS2) 10Muehlenhoff: Remove unused statsite::decommission class [puppet] - 10https://gerrit.wikimedia.org/r/486447 [13:56:55] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:19] (03PS2) 10Hashar: test python3.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487843 [14:09:16] (03PS7) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [14:09:31] (03Abandoned) 10Hashar: test python3.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487843 (owner: 10Hashar) [14:10:47] (03CR) 10Hashar: "Fixed python3.4 :)" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 
10Hashar) [14:12:35] RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms [14:18:39] jynus: thanks for fixing dbstore1002, just noticed it :) [14:20:55] (03PS1) 10Arturo Borrero Gonzalez: toolforge: bastion: also apt pin udev [puppet] - 10https://gerrit.wikimedia.org/r/487847 (https://phabricator.wikimedia.org/T215154) [14:22:32] (03CR) 10Muehlenhoff: [C: 03+1] toolforge: bastion: also apt pin udev [puppet] - 10https://gerrit.wikimedia.org/r/487847 (https://phabricator.wikimedia.org/T215154) (owner: 10Arturo Borrero Gonzalez) [14:23:13] ^ I've reopened T204567 for the ms-be2030 reboot [14:23:14] T204567: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 [14:23:59] (03PS1) 10Hashar: cli.defaults was altered by read_config() [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487848 [14:24:28] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10MoritzMuehlenhoff) 05Resolved→03Open I'm reopening the task, the server went down again today: ` [13:56] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100% ` I was looking... [14:24:35] PROBLEM - Disk space on stat1007 is CRITICAL: DISK CRITICAL - free space: / 3419 MB (3% inode=97%) [14:25:57] (03CR) 10Hashar: "I had that issue when iterating on another patch. Running a single test worked while running the full suite eventually failed." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487848 (owner: 10Hashar) [14:28:31] (03CR) 10Dzahn: "@Mathew.onipe The next unused UID is 809. This is how i checked: grep "gid:" data.yaml | cut -d: -f2 | uniq | sort -n" [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [14:28:47] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
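Note on the pipeline quoted at 14:28:31 (`grep "gid:" data.yaml | cut -d: -f2 | uniq | sort -n`): `uniq` only collapses adjacent duplicates, so running it before `sort` can leave repeated values in the output. The sketch below does the same lookup in Python. It assumes, without confirmation from the log, that `data.yaml` holds one `gid: <number>` entry per line; the sample text, the group names and the helper names `collect_gids` and `next_unused_gid` are hypothetical.

```python
import re


def collect_gids(yaml_text):
    """Return the sorted, de-duplicated gid values found in yaml_text.

    Assumption: each gid sits on its own line as "gid: <digits>".
    """
    gids = set()
    for line in yaml_text.splitlines():
        match = re.match(r"\s*gid:\s*(\d+)\s*$", line)
        if match:
            gids.add(int(match.group(1)))
    return sorted(gids)


def next_unused_gid(gids, start):
    """Return the smallest integer >= start that is not in gids."""
    taken = set(gids)
    candidate = start
    while candidate in taken:
        candidate += 1
    return candidate


# Hypothetical sample, not taken from the real data.yaml.
SAMPLE = """\
groups:
  ops:
    gid: 700
  deploy:
    gid: 701
  stats:
    gid: 701
  wikidev:
    gid: 500
"""

gids = collect_gids(SAMPLE)
print(gids)
print(next_unused_gid(gids, 700))
```

Using a set makes the result independent of the order in which entries appear in the file, which is the property the shell pipeline only gets with `sort -n | uniq` or `sort -nu`.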
[14:29:12] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [14:36:25] (03PS1) 10Hashar: Never try apt_pkg when parsing control [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487849 [14:37:11] (03CR) 10Hashar: "The warnings can be seen in the test suite output ;)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487849 (owner: 10Hashar) [14:47:37] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on db1068 is CRITICAL: 7.001 ge 4 Muehlenhoff T213664 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [14:49:23] RECOVERY - Disk space on stat1007 is OK: DISK OK [14:56:11] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [14:56:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused statsite::decommission class [puppet] - 10https://gerrit.wikimedia.org/r/486447 (owner: 10Muehlenhoff) [14:58:24] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10CDanis) 05Resolved→03Open Seems like it is happening again [15:07:17] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10jcrespo) `name=[times are CET] [15:34] there's an EDAC Icinga alert for db1068, system is OOW, known/worth opening a Phab task? (sometimes we ha... 
[15:09:01] if one apt guru has some spare time, I could use jenkins-debian-glue deb package to be backported from Debian testing to our jessie/stretch apt repo ( https://phabricator.wikimedia.org/T212774#4896831 ) [15:09:33] and I got the 0.20.0 version to build properly under jessie/stretch using package_builder module \o/ [15:10:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10MoritzMuehlenhoff) [15:15:10] (03PS2) 10Muehlenhoff: hhvm: Remove support for pre stretch [puppet] - 10https://gerrit.wikimedia.org/r/486449 [15:16:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10Andrew) Since this host is empty we should rebuild it with Stretch before putting any real VMs back on it. Maybe b... [15:18:20] (03CR) 10Muehlenhoff: [C: 03+2] hhvm: Remove support for pre stretch [puppet] - 10https://gerrit.wikimedia.org/r/486449 (owner: 10Muehlenhoff) [15:18:45] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [15:21:14] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [15:25:31] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [15:26:17] (03CR) 10Anomie: [C: 03+1] "Seems sane. Haven't tested." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) (owner: 10Gilles) [15:28:00] 10Operations: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10MoritzMuehlenhoff) [15:30:58] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [15:33:49] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [15:37:00] elukey: I can think about T214275 a bit this morning if you're around. I can have a look at getting the proper packages installed on those boxes… actually fixing the config appropriately would require me to learn a lot more. [15:37:01] T214275: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 [15:41:27] (03PS1) 10Vgutierrez: certcentral: Render new authorized_(hosts|regexes) parameters [puppet] - 10https://gerrit.wikimedia.org/r/487860 (https://phabricator.wikimedia.org/T213301) [15:42:11] (03PS1) 10Tim Eulitz: Disable confirmation prompt on rollback by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487861 (https://phabricator.wikimedia.org/T215019) [15:44:17] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Andrew) My short answer is: I don't object to moving wikitech from nutracker to mcrouter but I'm cur... 
[15:45:54] (03CR) 10Tim Eulitz: "Related change: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/487862" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487861 (https://phabricator.wikimedia.org/T215019) (owner: 10Tim Eulitz) [15:46:48] andrewbogott: o/ [15:47:31] I am around if you want! This is not super urgent, didn't want to push you but only to get a list of next steps.. I can definitely help! [15:48:12] elukey: I added a comment on the task about where the changes need to be made. If you've ever set up mcrouter before then you might be a better choice to actually make the changes :) [15:48:44] Every time something like this comes up I think "Well, maybe wikitech will be merged into the main cluster by the time this matters" but it never comes true :( [15:48:52] :( [15:49:14] so we could add mcrouter, test that it works, and then flip the config (I guess via mw-config?) [15:49:39] yep, sounds right [15:49:51] Is 'adding mcrouter' just a question of including an additional class in the profile? [15:50:19] andrewbogott: am I right to say that the two labweb hosts have a memcached instance each, and also a proxy config? I am checking nutcracker's config now, and it lists the two labwebs [15:50:34] I think that's right [15:51:01] which, btw, I believe the nutcracker is used by other services there besides wikitech... [15:51:39] ah! good to know :) [15:51:43] but I can catch up with that after we have mcrouter working for wikitech but before we turn off nutcracker. I assume that mcrouter is like nutcracker in that it masquerades as memcached, just on a different port? 
[15:52:07] exactly [15:52:26] ok, so we can just run nutcracker/mcrouter side by side and move services over by adjusting the port [15:52:28] it behaves a bit differently in some stuff, but overall it is a proxy like nutcracker [15:52:34] yep [15:53:12] we are currently running both on the mw servers [15:55:38] (03PS2) 10Vgutierrez: certcentral: Render new authorized_(hosts|regexes) parameters [puppet] - 10https://gerrit.wikimedia.org/r/487860 (https://phabricator.wikimedia.org/T213301) [15:58:36] (03PS3) 10Vgutierrez: certcentral: Render new authorized_(hosts|regexes) parameters [puppet] - 10https://gerrit.wikimedia.org/r/487860 (https://phabricator.wikimedia.org/T213301) [16:01:48] (03Abandoned) 10Vgutierrez: certcentral: Render new authorized_(hosts|regexes) parameters [puppet] - 10https://gerrit.wikimedia.org/r/487860 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [16:05:41] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:01] elukey: so is my next step to wait for a patch from you to review? 
:) [16:06:51] (03PS4) 10Vgutierrez: Ditch certcentral config template, configure in puppet [puppet] - 10https://gerrit.wikimedia.org/r/468604 (owner: 10Alex Monk) [16:07:38] andrewbogott: yep I'll try to send one in a bit :) basically adding mcrouter + config to the labweb hosts [16:07:45] ok [16:11:32] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Received replacement server {F28116675} [16:15:17] (03CR) 10Vgutierrez: "pcc looking good, including new options: https://puppet-compiler.wmflabs.org/compiler1002/14513/" [puppet] - 10https://gerrit.wikimedia.org/r/468604 (owner: 10Alex Monk) [16:16:17] (03PS2) 10Muehlenhoff: graphite: Remove support for trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/486232 [16:20:35] (03CR) 10Muehlenhoff: [C: 03+2] graphite: Remove support for trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/486232 (owner: 10Muehlenhoff) [16:20:53] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) update Netbox with new serial number [16:22:13] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms [16:27:49] (03PS1) 10Papaul: DHCP: Update MAC address for ms-be2047 [puppet] - 10https://gerrit.wikimedia.org/r/487873 (https://phabricator.wikimedia.org/T209921) [16:28:32] (03PS1) 10Muehlenhoff: role::graphite::base: Unconditionally use systemd [puppet] - 10https://gerrit.wikimedia.org/r/487874 [16:30:07] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Removed old puppet cert for ms-be2047.codfw.wmnet [16:35:09] (03PS1) 10Zppix: Account creation cap for Women Activists edit-a-thon at Simmons University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) [16:35:13] 10Operations, 10Discovery, 
10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563 (10Jana_Hentschke) Just a note that the same discussion came up among linked library data producers at [[ http://swib.org/swib18/ | SWIB18... [16:38:47] (03PS1) 10Muehlenhoff: statsite: Unconditionally use systemd [puppet] - 10https://gerrit.wikimedia.org/r/487878 [16:41:47] 10Operations, 10Scoring-platform-team (Current), 10User-Ladsgroup: Spec out migrating ORES to kubernetes - https://phabricator.wikimedia.org/T210109 (10Halfak) @Ladsgroup is there some reason you moves this to the "review" column? I can't figure out what to review here. [16:41:49] (03PS2) 10Zppix: Lift Account creation cap for Women Activists edit-a-thon at Simmons University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) [16:42:27] (03PS1) 10Zppix: Remove past throttles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487881 [16:42:58] (03PS2) 10Zppix: Remove past throttles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487881 [16:43:06] (03PS1) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [16:43:16] 10Operations, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10colewhite) [16:43:18] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10colewhite) [16:43:24] 10Operations, 10Scoring-platform-team (Current), 10User-Ladsgroup: Spec out migrating ORES to kubernetes - https://phabricator.wikimedia.org/T210109 (10Ladsgroup) The review is if {T182331} are good enough to call this task done or they need more subtasks, etc. 
[16:43:29] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [16:43:46] (03PS2) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [16:46:11] (03PS1) 10Muehlenhoff: varnishkafka: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/487883 [16:46:38] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (owner: 10Jbond) [16:50:30] !log deployed patch for T212118 [16:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:04] (03PS4) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 (https://phabricator.wikimedia.org/T214068) [16:51:10] (03PS1) 10Marostegui: phabricator.my.cnf.erb: Disable local_infile [puppet] - 10https://gerrit.wikimedia.org/r/487884 (https://phabricator.wikimedia.org/T214248) [16:51:17] (03PS1) 10Eevans: WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 [16:52:38] (03CR) 10jerkins-bot: [V: 04-1] WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 (owner: 10Eevans) [16:53:23] (03CR) 10Paladox: [C: 03+1] phabricator.my.cnf.erb: Disable local_infile [puppet] - 10https://gerrit.wikimedia.org/r/487884 (https://phabricator.wikimedia.org/T214248) (owner: 10Marostegui) [16:53:33] (03PS3) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [16:54:20] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (owner: 10Jbond) [16:54:43] (03PS2) 10Eevans: WIP: initial (strawman) configuration for session storage [puppet] 
- 10https://gerrit.wikimedia.org/r/487885 [16:54:46] (03CR) 10Paladox: Improve CI checks to ensure a basic catalouge compiles on all supported OS's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (owner: 10Jbond) [16:55:05] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10Papaul) a:05Papaul→03RobH Can you please update this disk with which disk failed? Thanks [16:55:20] (03CR) 10jerkins-bot: [V: 04-1] WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 (owner: 10Eevans) [16:56:30] (03PS3) 10Eevans: WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 [16:57:26] (03PS4) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [16:59:45] (03CR) 10jerkins-bot: [V: 04-1] WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 (owner: 10Eevans) [16:59:53] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:00:12] (03CR) 10Andrew Bogott: [C: 03+1] "Refactor looks good! Are these roles mostly used on VMs? 
I don't see where this code is applied in production but maybe I'm failing at g" [puppet] - 10https://gerrit.wikimedia.org/r/487482 (owner: 10Arturo Borrero Gonzalez) [17:00:48] (03CR) 10Andrew Bogott: [C: 03+1] graphite: refactor into role/profile [puppet] - 10https://gerrit.wikimedia.org/r/487481 (owner: 10Arturo Borrero Gonzalez) [17:01:27] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (owner: 10Jbond) [17:01:33] (03PS1) 10Arturo Borrero Gonzalez: toolforge: bastion: split resource control puppet code [puppet] - 10https://gerrit.wikimedia.org/r/487886 (https://phabricator.wikimedia.org/T215154) [17:03:52] (03CR) 10Andrew Bogott: "I don't feel strongly about the wmcs prefix... if it makes arturo happy then I'm find with it :)" [puppet] - 10https://gerrit.wikimedia.org/r/487368 (owner: 10Arturo Borrero Gonzalez) [17:04:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: bastion: split resource control puppet code [puppet] - 10https://gerrit.wikimedia.org/r/487886 (https://phabricator.wikimedia.org/T215154) (owner: 10Arturo Borrero Gonzalez) [17:05:12] (03PS1) 10Muehlenhoff: imagemagick: Unconditionally use /etc/ImageMagick-6/ [puppet] - 10https://gerrit.wikimedia.org/r/487888 [17:05:14] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10RobH) a:05RobH→03Papaul Ok, here are the full commands (so you can also run in future as needed): ` robh@thumbor2002:~$ cat /proc/mdstat Personalities : [raid1] md2 :... 
[17:07:34] (03PS1) 10Elukey: role::wmcs::openstack::main::labweb: add mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/487889 (https://phabricator.wikimedia.org/T214275) [17:07:40] (03CR) 10Dzahn: [C: 03+2] DHCP: Update MAC address for ms-be2047 [puppet] - 10https://gerrit.wikimedia.org/r/487873 (https://phabricator.wikimedia.org/T209921) (owner: 10Papaul) [17:07:54] (03PS2) 10Dzahn: DHCP: Update MAC address for ms-be2047 [puppet] - 10https://gerrit.wikimedia.org/r/487873 (https://phabricator.wikimedia.org/T209921) (owner: 10Papaul) [17:10:00] !log revert ospf metrics to normal values on esams-eqiad Level3 link [17:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:17] (03CR) 10Arturo Borrero Gonzalez: "Perhaps is even a good idea to separate this code into a .deb package instead and decouple from puppet." [puppet] - 10https://gerrit.wikimedia.org/r/487368 (owner: 10Arturo Borrero Gonzalez) [17:14:04] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "3" [puppet] - 10https://gerrit.wikimedia.org/r/487873 (https://phabricator.wikimedia.org/T209921) (owner: 10Papaul) [17:14:31] (03PS3) 10Dzahn: DHCP: Update MAC address for ms-be2047 [puppet] - 10https://gerrit.wikimedia.org/r/487873 (https://phabricator.wikimedia.org/T209921) (owner: 10Papaul) [17:15:31] (03CR) 10BryanDavis: "> Perhaps is even a good idea to separate this code into a .deb" [puppet] - 10https://gerrit.wikimedia.org/r/487368 (owner: 10Arturo Borrero Gonzalez) [17:16:01] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10RobH) In checking dc spares tracking, it shows 11 500GB SATA disks in codfw spare hardware. If this isn't right, please update task and update the tracking sheet. Thanks! [17:16:02] !zuul [17:16:02] a python daemon which acts as a gateway between Gerrit and Jenkins. 
[17:17:01] (03PS2) 10Arturo Borrero Gonzalez: toolforge: bastion: also apt pin udev [puppet] - 10https://gerrit.wikimedia.org/r/487847 (https://phabricator.wikimedia.org/T215154) [17:22:29] (03CR) 10jerkins-bot: [V: 04-1] toolforge: bastion: also apt pin udev [puppet] - 10https://gerrit.wikimedia.org/r/487847 (https://phabricator.wikimedia.org/T215154) (owner: 10Arturo Borrero Gonzalez) [17:23:01] (03CR) 10Andrew Bogott: "Making things more libary-like sounds great; I think we can do that without actually moving things out of the puppet repo." [puppet] - 10https://gerrit.wikimedia.org/r/487368 (owner: 10Arturo Borrero Gonzalez) [17:23:41] (03PS5) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [17:24:39] (03PS3) 10Arturo Borrero Gonzalez: toolforge: bastion: also apt pin udev [puppet] - 10https://gerrit.wikimedia.org/r/487847 (https://phabricator.wikimedia.org/T215154) [17:24:41] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (owner: 10Jbond) [17:25:00] (03PS2) 10Elukey: role::wmcs::openstack::main::labweb: add mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/487889 (https://phabricator.wikimedia.org/T214275) [17:25:14] !log powering down thumbor2002 for disk replacement [17:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: bastion: also apt pin udev [puppet] - 10https://gerrit.wikimedia.org/r/487847 (https://phabricator.wikimedia.org/T215154) (owner: 10Arturo Borrero Gonzalez) [17:27:07] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [17:29:03] PROBLEM - Host thumbor2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:29:26] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add 
alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10RStallman-legalteam) @alaa_wmde - Happy to create an NDA for you. I couldn't find contact info on the WMDE site. Could you share your email... [17:32:44] (03CR) 10Bstorm: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/487482 (owner: 10Arturo Borrero Gonzalez) [17:32:59] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10Dzahn) @WMDE-leszek Does https://wikimedia.de/de/menschen/mitarbeitende need an update? [17:34:22] (03PS3) 10Dzahn: Disallow local_infile for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/487459 (https://phabricator.wikimedia.org/T214248) (owner: 1020after4) [17:34:27] RECOVERY - Host thumbor2002 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [17:36:55] PROBLEM - Host thumbor2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:02] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10WMDE-leszek) @Dzahn @RStallman-legalteam WMDE website is not yet updated indeed, apologies for that. For NDA please email address as regist... [17:38:05] (03CR) 10Bstorm: [C: 04-1] "Please add the changes to the site.pp for labmon (see my reply to Andrew)." 
[puppet] - 10https://gerrit.wikimedia.org/r/487482 (owner: 10Arturo Borrero Gonzalez) [17:38:53] (03CR) 10Dzahn: [C: 03+2] Disallow local_infile for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/487459 (https://phabricator.wikimedia.org/T214248) (owner: 1020after4) [17:40:04] 10Puppet, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (10Ladsgroup) [17:41:12] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10Dzahn) p:05Triage→03Normal [17:46:51] (03PS6) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [17:47:44] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (owner: 10Jbond) [17:48:23] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10Papaul) Disk with serial number WMAYP0E607DT has been replaced. Server can not find boot device. Server can not boot to OS after disk replacement. [17:49:32] (03PS2) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part I – Create it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484801 (https://phabricator.wikimedia.org/T213504) [17:50:17] (03CR) 10Elukey: [C: 04-1] "Currently not working due to the mediawiki mcrouter config stated in a high priority hiera namespace, going to follow up." [puppet] - 10https://gerrit.wikimedia.org/r/487889 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [17:50:32] I'm going to land some config clean-up patches. Shout if that's not OK. 
[17:51:07] (03PS2) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part II – Use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484802 (https://phabricator.wikimedia.org/T213504) [17:51:09] jouncebot: next [17:51:09] In 0 hour(s) and 8 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1800) [17:51:14] (03PS2) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part III – Stop using it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484803 (https://phabricator.wikimedia.org/T213504) [17:51:24] (03PS7) 10Jbond: Improve CI checks to ensure a basic catalouge compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 [17:51:27] (03PS2) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part IV – Delete it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484804 (https://phabricator.wikimedia.org/T213504) [17:51:32] (03PS1) 10Muehlenhoff: haproxy: Remove Ubuntu support [puppet] - 10https://gerrit.wikimedia.org/r/487895 [17:51:50] mutante: Yeah, the WDQS stuff is rather distinct. [17:51:57] mutante: no deploy today. [17:52:03] Even better. :-) [17:52:05] So you can proceed [17:52:15] Thanks. [17:52:19] (03CR) 10Jforrester: [C: 03+2] dblists: Rename 'wikidatarepo' to 'wikibaserepo' part I – Create it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484801 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [17:52:23] Yw! 
[17:52:34] ok, thanks both [17:53:26] (03Merged) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part I – Create it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484801 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [17:56:44] !log jforrester@deploy1001 Synchronized dblists/wikibaserepo.dblist: T213504: Create the new wikibaserepo dblist (duration: 00m 47s) [17:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:46] T213504: Rename the 'wikidatarepo' dblist to 'wikibaserepo' which it meant - https://phabricator.wikimedia.org/T213504 [17:57:19] PROBLEM - Host thumbor2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:57:52] 10Operations, 10Gerrit, 10Icinga, 10monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (10CDanis) If the thing we want to monitor is "Gerrit is responding slowly / not at all", IMO that is the thing we should check. High CPU load is just... [17:58:28] (03CR) 10Jforrester: [C: 03+2] dblists: Rename 'wikidatarepo' to 'wikibaserepo' part II – Use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484802 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [17:58:53] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T213504: Tell CommonSettings about the new wikibaserepo dblist (duration: 00m 47s) [17:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:42] (03Merged) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part II – Use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484802 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:00:08] gehel and onimisionipe: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1800). 
[18:00:50] (03PS3) 10Arturo Borrero Gonzalez: graphite: refactor into role/profile [puppet] - 10https://gerrit.wikimedia.org/r/487481 [18:00:52] (03PS3) 10Arturo Borrero Gonzalez: wmcs: monitoring: refactor code into roles/profiles [puppet] - 10https://gerrit.wikimedia.org/r/487482 [18:01:12] (03PS1) 10Muehlenhoff: memcached: Unconditionally use systemd [puppet] - 10https://gerrit.wikimedia.org/r/487898 [18:02:26] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T213504: Configure wikibaserepo dblist just like the wikidatarepo one (duration: 00m 46s) [18:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:29] T213504: Rename the 'wikidatarepo' dblist to 'wikibaserepo' which it meant - https://phabricator.wikimedia.org/T213504 [18:02:41] RECOVERY - Host thumbor2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.67 ms [18:02:49] (03CR) 10Jforrester: [C: 03+2] dblists: Rename 'wikidatarepo' to 'wikibaserepo' part III – Stop using it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484803 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:03:52] (03CR) 10Jcrespo: "The patch is ok, but I didn't send it because we may want to apply it to all or most servers. However, that needs more testing." 
[puppet] - 10https://gerrit.wikimedia.org/r/487884 (https://phabricator.wikimedia.org/T214248) (owner: 10Marostegui) [18:04:06] (03Merged) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part III – Stop using it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484803 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:04:41] (03CR) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part I – Create it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484801 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:04:43] (03CR) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part II – Use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484802 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:04:45] (03CR) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part III – Stop using it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484803 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:05:44] !log manually rotate log file wtmp on csw2-esams [18:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:14] 10Operations: sw raid1 doesnt install grub on sdb - https://phabricator.wikimedia.org/T215183 (10RobH) p:05Triage→03Normal [18:06:27] RECOVERY - Host thumbor2002 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [18:06:29] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T213504: Unconfigure the wikidatarepo dblist (duration: 00m 46s) [18:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:31] (03CR) 10Jforrester: [C: 03+2] dblists: Rename 'wikidatarepo' to 'wikibaserepo' part IV – Delete it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484804 (https://phabricator.wikimedia.org/T213504) (owner: 10Jforrester) [18:07:47] (03Merged) 10jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part IV – Delete it 
[mediawiki-config] - https://gerrit.wikimedia.org/r/484804 (https://phabricator.wikimedia.org/T213504) (owner: Jforrester) [18:08:14] (CR) Jcrespo: [C: +1] "This looks about right, although we should be careful on deployment (disable puppet on deployment and apply it on a passive host first) as" [puppet] - https://gerrit.wikimedia.org/r/487895 (owner: Muehlenhoff) [18:08:36] Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Dzahn) >>! In T215033#4925098, @CDanis wrote: > Would it be difficult to use `check_http`, .. pointed at a few key Gerrit URLs? We already check i... [18:09:01] (PS1) Muehlenhoff: vagrant: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/487899 [18:09:25] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 4, Working: 4, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215185 [18:09:29] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T215185 (ops-monitoring-bot) [18:09:33] PROBLEM - Check whether ferm is active by checking the default input chain on thumbor2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [18:09:50] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T213504: Stop telling CommonsSettings about the wikidatarepo dblist (duration: 00m 45s) [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:01] T213504: Rename the 'wikidatarepo' dblist to 'wikibaserepo' which it meant - https://phabricator.wikimedia.org/T213504 [18:10:23] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:11:29] !log jforrester@deploy1001 Synchronized dblists/: T213504: Finally, drop the wikidatarepo dblist (duration: 00m 45s) [18:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:07] (PS2) Muehlenhoff: statsite/statsd: Unconditionally use systemd [puppet] - https://gerrit.wikimedia.org/r/487878 [18:12:51] (PS2) Jforrester: Clean-up: Drop B/C checking for $wgEchoConfig, not used since 2016 [mediawiki-config] - https://gerrit.wikimedia.org/r/486117 [18:12:59] (CR) Jforrester: [C: +2] Clean-up: Drop B/C checking for $wgEchoConfig, not used since 2016 [mediawiki-config] - https://gerrit.wikimedia.org/r/486117 (owner: Jforrester) [18:14:24] (Merged) jenkins-bot: Clean-up: Drop B/C checking for $wgEchoConfig, not used since 2016 [mediawiki-config] - https://gerrit.wikimedia.org/r/486117 (owner: Jforrester) [18:14:26] (PS2) Jforrester: Clean-up: Drop reading for wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486118 [18:14:29] (PS2) Jforrester: Clean-up: Drop writing to wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486119 [18:14:32] (PS2) Jforrester: Clean-up: Stop setting $wgFlowEventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/486120 [18:14:35] (PS2) Jforrester: Clean-up: Stop setting $wgParsoidWikiPrefix, unused since the Parsoid extension [mediawiki-config] - https://gerrit.wikimedia.org/r/486121 [18:15:37] (CR) BryanDavis: [C: +1] vagrant: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/487899 (owner: Muehlenhoff) [18:16:55] (PS1) Muehlenhoff: shinken: Remove trusty support [puppet] - https://gerrit.wikimedia.org/r/487900 [18:16:58] (CR) jenkins-bot: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part IV – Delete it [mediawiki-config] - https://gerrit.wikimedia.org/r/484804 (https://phabricator.wikimedia.org/T213504) (owner: Jforrester) 
[18:17:00] (CR) jenkins-bot: Clean-up: Drop B/C checking for $wgEchoConfig, not used since 2016 [mediawiki-config] - https://gerrit.wikimedia.org/r/486117 (owner: Jforrester) [18:17:02] gehel, onimisionipe: WDQS deploy today is happening? [18:17:28] SMalyshev: No. [18:17:41] James_F: why not? [18:17:55] I don't know. [18:18:01] I see :) [18:18:07] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Clean-up: Stop setting wgEchoConfig, unused since 2016 (duration: 00m 48s) [18:18:10] "09:51:57 mutante: no deploy today." [18:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:25] SMalyshev: just out of the plane, so don't count on me ;) [18:18:46] gehel: ok then, onimisionipe - do you want to do the deploy or should I do it? [18:19:02] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Papaul) I put back the bad disk and boot the system and the system boot into OS with no problem. it looks like what @jcrespo and other mentioned on IRC the grub is installed o... [18:19:20] (CR) Jforrester: [C: +2] Clean-up: Drop reading for wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486118 (owner: Jforrester) [18:19:24] gehel: I expected you to be still traveling, pinged you just in case :) [18:19:29] SMalyshev: sorry. I didn't see anything in the deployment window then..but now I see something [18:19:44] onimisionipe: yes I added it late, sorry... [18:19:47] I can stop deploying, just clean-up patches. [18:19:57] Give me 60 seconds. :-) [18:20:18] You can proceed please. 
I'm in cab, going home :) [18:20:27] (Merged) jenkins-bot: Clean-up: Drop reading for wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486118 (owner: Jforrester) [18:22:16] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Clean-up: Drop reading for wgEcho*FooterNotice*, unread (duration: 00m 46s) [18:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:20] Clear. [18:22:28] SMalyshev: Go ahead. [18:22:49] anyone seen legoktm? [18:22:50] (PS1) Dzahn: gerrit: add icinga https check for dashboard content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [18:23:33] Krenair: Not since Friday. Is it something with which I could help? [18:23:41] (CR) jerkins-bot: [V: -1] gerrit: add icinga https check for dashboard content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [18:23:41] no, but thanks [18:23:47] Kk. [18:27:02] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/14518/" [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [18:28:23] (CR) jenkins-bot: Clean-up: Drop reading for wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486118 (owner: Jforrester) [18:28:35] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:28:42] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (jijiki) Ack, I will do it tomorrow, thank you @Papaul ! 
[18:30:02] (PS2) Dzahn: gerrit: add icinga https check for dashboard content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [18:31:31] !log adding Papaul to root@wiki [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:45] Operations, Discovery, Discovery-Search, Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037 (EBernhardson) [18:40:17] Operations, Discovery, Discovery-Search, Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037 (EBernhardson) @Gehel Should we close this? It seems the current `search.svc..wmnet` certs serve basically the same purpose. But ther... [18:42:09] Operations, Discovery, Discovery-Search, Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037 (Gehel) We want to keep this open. This is a step on the way to have an etcd / conftool based way to switch traffic. [18:42:58] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (CDanis) The timeouts on that check_ssl invocation make little sense to me -- a warning after 60 seconds, but critical after 30?... [18:46:23] (PS3) Dzahn: gerrit: add icinga https check for dashboard content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [18:50:07] (CR) GTirloni: [C: +2] shinken: Remove trusty support [puppet] - https://gerrit.wikimedia.org/r/487900 (owner: Muehlenhoff) [18:50:59] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Dzahn) >>! 
In T215033#4925229, @CDanis wrote: > a warning after 60 seconds, but critical after 30? Those seem backwards Agre... [18:51:16] Operations: sw raid1 doesnt install grub on sdb - https://phabricator.wikimedia.org/T215183 (CDanis) I know very little about debian-installer, but here's a guess based on what I found in the puppet repo: `% git grep grub-installer/bootdev modules/install_server/files/autoinstall/common.cfg:d-i grub-install... [18:51:26] (CR) Paladox: gerrit: add icinga https check for dashboard content (1 comment) [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [18:53:12] (PS3) Herron: rsyslog::kafka_shipper: raise rsyslog MaxMessageSize from 8k to 64k [puppet] - https://gerrit.wikimedia.org/r/480793 (https://phabricator.wikimedia.org/T205849) [18:54:19] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Paladox) Im thinking we need the health check plugin. That way we just need to check the http status code. [18:55:20] (PS4) Dzahn: gerrit: add icinga https check for dashboard content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [18:56:03] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [18:57:06] (CR) Dzahn: "wait with review.. 
this translates to: root@icinga1001:/etc/icinga# /usr/lib/nagios/plugins/check_http -H gerrit.wikimedia.org -I gerrit.w" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [18:59:01] (CR) Herron: [C: +1] elasticsearch: exit the JVM on OutOfMemoryError [puppet] - https://gerrit.wikimedia.org/r/487787 (https://phabricator.wikimedia.org/T76090) (owner: Gehel) [19:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1900) [19:00:04] Daimona: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:13] o/ [19:00:36] (PS1) Urbanecm: Milestone lobo for atjwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/487907 (https://phabricator.wikimedia.org/T215122) [19:00:47] * Urbanecm would like to get one patch deployed in Morning SWAT as well :) [19:03:49] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Dzahn) >>! In T215033#4925229, @CDanis wrote: > The timeouts on that check_ssl invocation make little sense to me Actually i... [19:04:34] Uhm who's SWATting? [19:05:07] mutante: that latest one you just posted still looks like the timeouts are backwards between warning and critical? [19:05:55] cdanis: yes, it does. ACK [19:06:21] Ah I guess jouncebot forgot to ping deployers [19:06:33] no, wait.. those are days before expiry [19:06:41] addshore, Antoine (hashar), Katie (aude), Max (MaxSem), Mukunda (twentyafterfour), Roan (RoanKattouw), Sébastien (Dereckson), Tyler (thcipriani), Niharika (Niharika), or Željko (zeljkof) [19:06:42] ^ [19:06:58] cdanis: 7 days before expiry = warn .. 3 days before expiry = crit ? seems right? 
[19:07:04] (PS1) Cdentinger: rename frmon to frmon.frdev, slight retab [dns] - https://gerrit.wikimedia.org/r/487909 (https://phabricator.wikimedia.org/T215182) [19:07:08] i was going back and forth myself [19:07:19] haha [19:07:22] oh, well then [19:07:24] seems right [19:07:35] I was going to ask, if those were indeed seconds of timeout, why they didn't trigger on this outage [19:07:51] and ruminating that maybe gerrit was only being slow on the JSON calls and not on the probably-static content of the HTML itself [19:08:02] yea.. it's for the cert expiry in this case :) [19:08:03] )which is maybe still true!) [19:08:21] gerrit generates the content on the fly (GWTUI) [19:08:24] Actually the ping didn't work, trying again: addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, zeljkof [19:08:26] it's polygerrit that's static [19:08:39] paladox: are you saying the /r/ URL will change with the skin? [19:08:46] i hope not [19:08:51] No, /r/ should still work [19:09:01] but gwtui generates the content on the fly [19:09:10] there's no plain .html files for GWTUI [19:09:13] ok, so that's the one i used for a new check and string "Wikimedia Code Review" in it [19:09:14] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@8b2f078]: Weekly GUI deploy [19:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:17] which is the case when not logged in [19:09:44] (CR) Herron: "> http_status needs conversion to number first" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/480943 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi) [19:09:46] we should install https://gerrit-review.googlesource.com/admin/repos/plugins/healthcheck [19:10:07] They have a branch for stable-2.15 [19:10:22] paladox: installing health check plugin sounds like a good idea [19:10:27] yup [19:10:57] (PS16) Cwhite: prometheus: upgrade to node-exporter 0.17 [puppet] - 
https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [19:11:23] Firing my last bullet: ping greg-g [19:11:39] ? [19:12:04] jouncebot: now [19:12:05] For the next 0 hour(s) and 47 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1900) [19:12:34] greg-g No one seems to be around for SWAT [19:12:49] (PS2) Cwhite: aptrepo: add prometheus-node-exporter components for all dists [puppet] - https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) [19:13:22] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Dzahn) P.S. The --warning and --critical values are not backwards because they are "days until expiry". [19:13:53] Reedy might be around for SWAT [19:14:09] Daimona: yeah, the monday after all hands is a very slow day, not many people working due to travel&recovery. [19:14:31] if no one responds quickly, just move it to the next day that works for you. Sorry for the inconvenience. [19:14:37] greg-g Oh right, I keep forgetting about all hands :/ [19:14:44] Yeah sure, thanks :) [19:14:50] np [19:15:06] In case someone is available for SWAT, just please ping me [19:15:36] I can SWAT. [19:15:42] Daimona: Ping [19:16:04] Also ping Urbanecm. :_0 [19:16:05] lol [19:16:10] James_F cool, thanks [19:16:19] Hi James_F [19:16:21] (PS8) Jforrester: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy) [19:16:26] (CR) Jforrester: [C: +2] Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy) [19:16:33] (you know, there should probably be some explanation of it somewhere. 
because just about the only public page about the event is https://commons.wikimedia.org/wiki/Category:Wikimedia_Foundation_All_Hands) [19:16:34] (PS3) MarcoAurelio: Fix typo 'neccessary' [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) [19:17:07] MatmaRex: It did say on Deployments last week that nothing was going to be deployed because of it, but of course that note has now been archived. :-) [19:17:36] My meeting with a few people from WMF staff was canceled due to all hands and even that, I forgot about all hands, so just a public people probably wouldn't help :D [19:17:48] * James_F grins. [19:18:10] *public page [19:18:36] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) [19:18:53] * hauskatze has a patch for mutante :) [19:19:01] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@8b2f078]: Weekly GUI deploy (duration: 09m 47s) [19:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:45] Operations, Gerrit, Icinga, monitoring, Patch-For-Review: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) Technically "investigate why it did not alert" has been resolved. But of course we also... [19:19:49] (Merged) jenkins-bot: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy) [19:20:19] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) [19:20:42] Daimona: Live on mwdebug1002; testable? 
[19:20:51] I don't think it is [19:21:01] OK, let's just push it and hope. [19:21:03] But for sure it is after deployment [19:21:06] Yup [19:22:21] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T191039 Enable wgAbuseFilterRuntimeProfile on all wikis (duration: 00m 47s) [19:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:27] T191039: Re-enable filter profiling on every wiki - https://phabricator.wikimedia.org/T191039 [19:22:30] Daimona: Done. Please check. [19:22:43] We'll know if we see some performance server exploding :) [19:22:51] (CR) Jforrester: [C: +2] Milestone lobo for atjwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/487907 (https://phabricator.wikimedia.org/T215122) (owner: Urbanecm) [19:23:04] * Daimona is checking [19:23:31] typo in the commit message, but not sure it matters much [19:23:41] (fwiw lobo = wolf :-) ) [19:23:51] (CR) Dzahn: [C: +1] "nice! i was going to ask whether it needs the DB access first but i see that has been resolved" [puppet] - https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn) [19:23:54] (Merged) jenkins-bot: Milestone lobo for atjwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/487907 (https://phabricator.wikimedia.org/T215122) (owner: Urbanecm) [19:24:08] (CR) Mobrovac: "It's a good first draft. I left some comments and thoughts in-line." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/487885 (owner: Eevans) [19:24:13] Urbanecm: I guess this can't really be tested either. ;-) [19:24:33] James_F, just checked if the new logo is there :) [19:24:44] Daimona: Nothing exploding on fatalmonitor. I'm going to proceed. 
[19:24:52] Uhm on logstash I see 16 warning for "undefined index", but I think that's just something due the creation of new stats [19:24:56] So yes, you can go on [19:25:14] (CR) jenkins-bot: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy) [19:25:16] (CR) jenkins-bot: Milestone lobo for atjwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/487907 (https://phabricator.wikimedia.org/T215122) (owner: Urbanecm) [19:25:19] This probably has no impact and I'll fix it later in AF codebase [19:26:03] !log jforrester@deploy1001 Synchronized static/images/project-logos/atjwiki-2x.png: SWAT: Milestone lobo for atjwiki T215122, 2x (duration: 00m 44s) [19:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:10] T215122: Change logo for 6 months of atj.wp to celebrate 1000 articles - https://phabricator.wikimedia.org/T215122 [19:26:34] Confirming the statement above, it's some cache data missing for edits stashed before profiling was enabled, and saved after [19:27:10] !log jforrester@deploy1001 Synchronized static/images/project-logos/atjwiki-1.5x.png: SWAT: Milestone lobo for atjwiki T215122, 1.5x (duration: 00m 45s) [19:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:17] (PS1) Paladox: Add healthcheck plugin [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/487913 [19:27:39] (PS2) Paladox: Add healthcheck plugin [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/487913 (https://phabricator.wikimedia.org/T214326) [19:28:15] !log jforrester@deploy1001 Synchronized static/images/project-logos/atjwiki.png: SWAT: Milestone lobo for atjwiki T215122, 1x (duration: 00m 46s) [19:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:33] Urbanecm: Well, it's pushed, but the files 
seem cached/ [19:28:47] That's true, you'll have to purge the cache [19:29:15] echo "https://en.wikipedia.org/static/images/project-logos/atjwiki.png" | mwscript purgeList.php should purge atjwiki.png [19:29:18] similar for the others :) [19:29:43] James_F, ^^, also see the docs [19:30:15] Operations, netops: Fix codfw x-connect 65373 - https://phabricator.wikimedia.org/T215193 (ayounsi) p:Triage→Normal [19:30:20] Urbanecm: Lovely. [19:31:55] Urbanecm: And yup, seems purged for me. [19:32:15] Yes, indeed, thanks James_F [19:32:20] !log Manually purged atjwiki*.png logos for T215122. [19:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:23] T215122: Change logo for 6 months of atj.wp to celebrate 1000 articles - https://phabricator.wikimedia.org/T215122 [19:32:28] OK, SWAT done. [19:32:37] Now to push a few more of my patches. ;-) [19:32:37] RECOVERY - Host ms-be2047 is UP: PING WARNING - Packet loss = 28%, RTA = 36.65 ms [19:33:10] (PS3) Jforrester: Clean-up: Drop writing to wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486119 [19:33:18] (CR) Jforrester: [C: +2] Clean-up: Drop writing to wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486119 (owner: Jforrester) [19:33:44] (CR) Andrew Bogott: "A couple of comments inline. 
Also, if there aren't any deployment-specific args for the new profile it will be slightly less repetitive t" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/487889 (https://phabricator.wikimedia.org/T214275) (owner: Elukey) [19:34:28] (Merged) jenkins-bot: Clean-up: Drop writing to wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486119 (owner: Jforrester) [19:36:26] (CR) Andrew Bogott: [C: +1] "There's the bit I was missing :)" [puppet] - https://gerrit.wikimedia.org/r/487482 (owner: Arturo Borrero Gonzalez) [19:36:33] (CR) jenkins-bot: Clean-up: Drop writing to wgEcho*FooterNotice*, unread [mediawiki-config] - https://gerrit.wikimedia.org/r/486119 (owner: Jforrester) [19:38:33] (PS3) Jforrester: Clean-up: Stop setting $wgFlowEventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/486120 [19:39:00] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Clean-up: Stop setting values for wgEcho*FooterNotice*, unread (duration: 00m 46s) [19:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:47] (CR) Jforrester: [C: +2] Clean-up: Stop setting $wgFlowEventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/486120 (owner: Jforrester) [19:41:57] (Merged) jenkins-bot: Clean-up: Stop setting $wgFlowEventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/486120 (owner: Jforrester) [19:45:02] (PS3) Jforrester: Clean-up: Stop setting $wgParsoidWikiPrefix, unused since the Parsoid extension [mediawiki-config] - https://gerrit.wikimedia.org/r/486121 [19:45:22] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Clean-up: Stop setting wgFlowEventLogging, unread (duration: 00m 45s) [19:45:24] (CR) Jforrester: [C: +2] Clean-up: Stop setting $wgParsoidWikiPrefix, unused since the Parsoid extension [mediawiki-config] - https://gerrit.wikimedia.org/r/486121 (owner: 
Jforrester) [19:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:33] (Merged) jenkins-bot: Clean-up: Stop setting $wgParsoidWikiPrefix, unused since the Parsoid extension [mediawiki-config] - https://gerrit.wikimedia.org/r/486121 (owner: Jforrester) [19:48:02] (CR) jenkins-bot: Clean-up: Stop setting $wgFlowEventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/486120 (owner: Jforrester) [19:48:04] (CR) jenkins-bot: Clean-up: Stop setting $wgParsoidWikiPrefix, unused since the Parsoid extension [mediawiki-config] - https://gerrit.wikimedia.org/r/486121 (owner: Jforrester) [19:50:06] (CR) Bstorm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/487482 (owner: Arturo Borrero Gonzalez) [19:50:55] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Clean-up: Stop setting wgParsoidWikiPrefix, unused since the Parsoid extension (duration: 00m 45s) [19:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:35] (PS4) Dzahn: base/cassandra: Fix typo 'neccessary' [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio) [19:55:13] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Paladox) So we just need a http check that checks the website (without checking if the ssl cert is val... [19:57:01] (PS1) Dzahn: add some common typo words to CI checks [puppet] - https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) [19:57:43] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (CDanis) >>! 
In T215033#4925462, @Paladox wrote: > So we just need a http check that checks the website... [19:58:15] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:00:48] (CR) Dzahn: [C: +2] "comments only" [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio) [20:01:30] !log manually ran puppet on mc1023 [20:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:06] (CR) Dzahn: "thanks. also https://gerrit.wikimedia.org/r/c/operations/puppet/+/487917 to prevent these in the future" [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio) [20:03:25] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:04:27] (PS2) Dzahn: add some common typo words to CI checks [puppet] - https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) [20:05:40] (CR) Dzahn: [C: +2] "the jenkins-bot +2 tells us all the existing ones are gone" [puppet] - https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: Dzahn) [20:06:56] Operations, ops-codfw, Patch-For-Review, User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (Papaul) a:Papaul→fgiunchedi @fgiunchedi I replaced the problematic server with the new one Dell shipped to me. The OS is installed and puppet first run done... [20:10:34] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) Indeed, i would say merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/487901 should... 
[20:11:59] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) >>! In T215033#4925518, @Dzahn wrote: > healthcheck plugin you mentioned. Maybe in a separate t... [20:12:15] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Paladox) Already have T214326 for the health check plugin :) [20:12:43] (CR) Paladox: [C: +1] gerrit: add icinga https check for dashboard content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [20:20:33] (PS3) Herron: logstash::collector add input identifier tags [puppet] - https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) [20:21:29] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:25:03] (CR) Volans: [C: -1] "I've some questions and comments, see inline, that I think should be answered before merging." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [20:26:51] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [20:28:55] (CR) CDanis: [C: +1] "Hmm. 
The check_http documentation is not clear -- it implies that the default -t timeout of 10 seconds applies for just establishing the " [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [20:30:46] (CR) CDanis: [C: +1] gerrit: add icinga https check for dashboard content (2 comments) [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [20:41:38] (CR) Sbisson: [C: +2] GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992 (owner: Kosta Harlan) [20:42:42] (Merged) jenkins-bot: GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992 (owner: Kosta Harlan) [20:42:56] (CR) jenkins-bot: GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992 (owner: Kosta Harlan) [20:50:49] Operations, cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437 (Andrew) I'm re-reading those docs for the fourth time and I'm still moderately confused :) I think my main question is: is ca_server a setting that is read by a puppetmaster, or only by a p... [20:53:21] the only way mine would be better is that the stylus is more responsive [20:53:30] oops [20:53:34] wrong channel [20:55:48] (CR) Dzahn: gerrit: add icinga https check for dashboard content (3 comments) [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn) [20:57:15] (PS1) EBernhardson: mwgrep: Query all search clusters [puppet] - https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T2100). [21:00:45] (03PS4) 10Herron: logstash::collector add input identifier tags [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) [21:06:19] (03CR) 10Volans: gerrit: add icinga https check for dashboard content (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:08:32] (03CR) 10Paladox: [C: 03+1] gerrit: add icinga https check for dashboard content (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:12:02] (03CR) 10Volans: [C: 03+1] "LGTM and tested on af-netbox" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487599 (owner: 10CRusnov) [21:12:22] (03PS5) 10Dzahn: gerrit: add icinga https check for dashboard content [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [21:15:50] (03CR) 10CRusnov: [V: 03+2 C: 03+1] "Mergin" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487599 (owner: 10CRusnov) [21:16:02] 10Operations: sw raid1 doesnt install grub on sdb - https://phabricator.wikimedia.org/T215183 (10CDanis) Assumption 1: the `partman-auto-raid` directive exactly correlates with our use of Linux software RAID in production. Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 co... [21:17:19] (03CR) 10Dzahn: gerrit: add icinga https check for dashboard content (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:18:35] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "h" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487599 (owner: 10CRusnov) [21:21:42] (03CR) 10CRusnov: [C: 04-1] "We need to do a merge in puppet to adjust the path before this can be merged." 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov) [21:22:55] I'm looking for a sysadmin to supervise a rename of an account with more than 100,000 edits [21:22:56] anyone here who can help me? / I've been told everyone's back in action :) [21:24:36] (03CR) 10Volans: [C: 04-1] "Nice, few minor things to adjust, see inline comments." (034 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov) [21:25:40] Trijnstel: more than a generic one I think it would be better if you notify our DBAs for this as the most impact is usually on the databases [21:26:01] volans: and how and where can I do that? [21:26:34] Trijnstel: #wikimedia-databases is probably the best option here, keep in mind they are in EU timezones though ;) [21:26:51] I'm so too, so that shouldn't matter :P [21:27:15] (03PS6) 10Dzahn: gerrit: add icinga https check for dashboard content [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [21:27:56] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add icinga https check for dashboard content [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:28:27] Trijnstel, sure but do you normally do wikimedia volunteering during normal working hours? [21:28:53] Krenair: no, usually I don't [21:28:59] I work during normal working hours ;-) [21:29:10] right which is why it might be tricky [21:29:29] but how else could I do this? [21:29:49] (03CR) 10Dzahn: gerrit: add icinga https check for dashboard content (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:29:56] I don't want to do wikimedia work when I need to do my 'real' work [21:30:12] and when I'm off, no one seems to be available... 
[21:30:22] (03CR) 10Dzahn: gerrit: add icinga https check for dashboard content (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:32:55] (03PS7) 10Dzahn: gerrit: add icinga https check for dashboard content [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) [21:33:35] (03PS5) 10Herron: logstash::collector add input identifier tags [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) [21:33:41] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add icinga https check for dashboard content [puppet] - 10https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: 10Dzahn) [21:33:43] Trijnstel, to be brutally honest it might be easier for someone else to run the rename [21:33:57] someone who can deal with the DBA's timezones [21:34:00] ehhh, what? :S [21:34:03] that's not fair [21:34:16] I'd say you should make people available at other timezones too [21:34:23] well [21:34:27] I do so much at other timezones too [21:34:28] I think the foundation should as well [21:34:34] both in my volunteer work and in my real job [21:34:43] but I am not them [21:34:47] pffff [21:34:54] I'm sorry I don't have a good solution [21:35:00] (03CR) 10Herron: [C: 03+2] logstash::collector add input identifier tags [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [21:35:15] in all the years I've been active, I never heard such an answer [21:35:26] that I need to find someone who can do something during working hours [21:35:28] ridiculous [21:35:29] Trijnstel: what I meant is get in touch with them to organize the thing, I'm sure you'll be able to find a way to run this [21:35:40] I'm a volunteer too... 
[21:35:43] you might be able to find someone who knows enough to keep an eye out and page if something goes wrong, idk [21:35:51] the only thing I ask is to look at the DB [21:35:59] alright then, I'll ask someone else to do [21:36:08] I've asked it twice and then I get the answer to get someone else [21:36:15] I'm done with it [21:36:26] I can't give an official answer [21:36:34] no one can help me either [21:37:09] Trijnstel: do you have a phabricator task for this already open by any chance? that's the best way to syncup asynchronously [21:37:23] I would've liked to give you a real solution [21:37:27] no I haven't, and I won't create that either [21:37:31] I'll ask others to do it [21:41:37] :( [21:58:43] 10Operations: sw raid1 doesnt install grub on sdb - https://phabricator.wikimedia.org/T215183 (10RobH) Please note this is related to T156955. [21:59:33] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:00:05] bawolff and Reedy: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T2200). [22:00:17] (03PS3) 10Dzahn: varnish/trafficserver: switch parsoid-tests backend, rename director [puppet] - 10https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366) [22:02:23] looking at scandium.. 
and that's just a testing server [22:04:47] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [22:05:42] !log scandium - systemctl start parsoid-vd (T201366) [22:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:45] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 [22:10:33] (03PS1) 10Dzahn: icinga/parsoid: no monitoring notifications on test servers [puppet] - 10https://gerrit.wikimedia.org/r/487964 (https://phabricator.wikimedia.org/T201366) [22:44:00] (03CR) 10Dzahn: [C: 03+1] "compiler looks good in production (https://puppet-compiler.wmflabs.org/compiler1002/14520/) it removes a bunch of style warnings and jenki" [puppet] - 10https://gerrit.wikimedia.org/r/487481 (owner: 10Arturo Borrero Gonzalez) [23:07:23] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:09:52] ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T201366 [23:15:28] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [23:16:01] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [23:16:09] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [23:16:11] (03CR) 10Volans: [C: 03+2] icinga: add context manager for downtimed hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [23:19:34] mutante, so, does parsoid-rt-tests now point to scandium? [23:20:14] (03CR) 10Subramanya Sastry: [C: 03+1] icinga/parsoid: no monitoring notifications on test servers [puppet] - 10https://gerrit.wikimedia.org/r/487964 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [23:22:27] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:23:10] 10Operations, 10PHP 7.0 support, 10Patch-For-Review: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) >>! In T215126#4926064, @Krinkle wrote: >[..] a PHP Warning was logged directly into the HTML output stream as part of a regular response.... [23:24:53] subbu: no, i was about to merge that and then i noticed we have flapping parsoid-vd service there, i started it manually once.. then it broke again so i decided to tell you about that first and wait until tomorrow [23:25:42] flapping service on ruthenium or scandium? 
[23:26:12] scandium [23:27:07] i did 'systemctl start parsoid-vd' to start it and it works for a while [23:27:22] then later icinga reports a degraded systemd unit and it's that [23:27:38] then i saw in backlog it did that before and recovered by itself afaict [23:27:40] we run those tests far less frequently .. so, it is less critical .. but, i can take a look into that tomorrow. [23:28:12] ok, that's good to know. this relates to that monitoring change i uploaded and you +1ed as well, thanks [23:28:44] i will merge that tomorrow morning, ok? [23:29:01] havent been to lunch yet [23:29:48] maybe wait till later in the day. [23:30:00] just to make sure there isn't any missing config / software. [23:30:00] ok, sure, will do [23:32:45] subbu: might be because of Error: ENOENT: no such file or directory, open '/srv/visualdiff/testreduce/testrun.ids' [23:34:06] ah .. ok. you can copy over that file from ruthenium .. or have puppet create a dummy file. [23:34:13] the path exists and is a git repo , just testrun.ids is not there.. but on ruthenium it is [23:34:25] it is not in the git repo. [23:34:29] should it be? [23:34:41] i could create a dummy file in the git repo perhaps. [23:34:54] i'll take a look tomorrow .. about to get on a plane in a little bit. [23:34:59] that would feel nicer than manually copying [23:35:02] yes. [23:35:21] same here, need to get an errand done. safe travels and talk to you tomorrow [23:35:26] ttyl [23:39:26] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (10Tgr) Per {T215126}, we are actually loading the PEAR version of classes in production, even though the Composer version is included in `mediawiki/... 
[23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on 
scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:38] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:38] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:40:39] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T215222 [23:42:16] ^ i noticed these in icinga web UI but there were no IRC notifications lately, that is odd. 
then i reported in -services channel and the ticket has been created. i also wanted to see if notifications work and it's ACKed. root cause: "caused by an error in MW API " [23:42:36] bbiaw [23:44:16] had been going on for a while, we should check why icinga-wm did not talk more about it, since it was not disabled or known [23:45:33] it was neither in ACKed state nor notifications disabled... so unless that just checks once every couple hours it should have repeated a lot more visibly [23:48:55] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:56:54] (03CR) 10Volans: "Thanks a lot for the suggestions! I've integrated this change in the main CR with a couple of changes and adding docstrings. See inline fo" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/487094 (owner: 10Gehel) [23:57:09] (03PS3) 10Volans: management: add management module [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) [23:57:11] (03PS3) 10Volans: icinga: add context manager for downtimed hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 [23:58:38] (03CR) 10Volans: "Addressed comments, integrated proposed changes from I9ad88f70b99d04d5b6e1d3c986360e244446babb" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans)