[00:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T0000). [00:17:50] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 74.32, 34.17, 21.48 [00:18:51] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 42.16, 33.46, 22.15 [00:24:16] nearly there... [00:25:24] !log legoktm@deploy1001 Finished scap: Preference for responsive MonoBook, plus set mobile width cutoff to 550px ([[gerrit:437875]], [[gerrit:437814]]) (duration: 64m 01s) [00:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:28] MatmaRex: ^^ wheee [00:26:31] oof [00:27:11] well, that only took forever. thank you for deploying [00:27:29] :) [00:33:58] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4262808 (10Varnent) @greg - July 30 has been set as target launch date. [00:50:28] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@a07af40]: (no justification provided) [00:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:51] ^ dry run, please disregard [00:51:30] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@a07af40]: (no justification provided) (duration: 01m 02s) [00:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:56] i stopped receiving any email from gerrit and phabricator a couple hours ago. is it just me? [01:21:16] e.g. i have no emails for any of the comments on https://phabricator.wikimedia.org/T196588 [01:22:01] the last one i received at was 19:12 UTC [01:26:31] herron: ^ [01:29:17] (i need to head off, sorry.) [01:29:53] robh: If you're about as you're on duty... 
:P [01:30:20] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4262846 (10Reedy) Brandon reckons it's something to do with `confd::srv_dns` not being set correctly on beta I'm guessing the config got broken a little while ago, and because it did... [01:34:44] 10Operations, 10Gerrit, 10Mail, 10Phabricator: Phab and Gerrit emails stopped at around 1900 UTC 6th June - https://phabricator.wikimedia.org/T196598#4262848 (10Reedy) [01:35:19] 10Operations, 10Gerrit, 10Mail, 10Phabricator: Phab and Gerrit emails stopped at around 1900 UTC 6th June - https://phabricator.wikimedia.org/T196598#4262858 (10Reedy) p:05Triage>03High [01:41:35] 10Operations, 10Gerrit, 10Mail, 10Phabricator: Phab and Gerrit emails stopped at around 1900 UTC 6th June - https://phabricator.wikimedia.org/T196598#4262848 (10Legoktm) https://grafana.wikimedia.org/dashboard/db/mail?refresh=5m&orgId=1&from=now-2d&to=now suggests all mail is down, but I'm still getting ma... [01:52:39] 10Operations, 10Gerrit, 10Mail, 10Phabricator: Phab and Gerrit emails stopped at around 1900 UTC 6th June - https://phabricator.wikimedia.org/T196598#4262868 (10Reedy) Mail seems to be generally flowing through codfw though... https://grafana.wikimedia.org/dashboard/db/mail?refresh=5m&orgId=1&from=now-2d&t... [02:00:29] !log starting exim4 and reenabling puppet on mx1001, due to T196598 [02:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:35] T196598: Phab and Gerrit emails stopped at around 1900 UTC 6th June - https://phabricator.wikimedia.org/T196598 [02:12:12] 10Operations, 10Gerrit, 10Mail, 10Phabricator: Phab and Gerrit emails stopped at around 1900 UTC 6th June - https://phabricator.wikimedia.org/T196598#4262872 (10faidon) 05Open>03Resolved a:03faidon The cause was the prep for T175361, in combination with a couple of unexpected misconfigurations/SPOFs,... 
[02:15:34] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4262877 (10faidon) So this backfired, but thankfully the fix was as simple as starting exim :) Good thinking @herron! We've heard of and noticed at least two breakages: 1. Phabricator. It s... [02:25:50] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@a07af40]: log [02:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:28] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.6) (duration: 11m 24s) [02:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:46] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.7) (duration: 07m 02s) [02:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:52] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4262895 (10Papaul) [03:05:05] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Jun 7 03:05:05 UTC 2018 (duration 10m 19s) [03:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:21] (03PS2) 10Andrew Bogott: Toolforge: Update toolforge description in bastion banner [puppet] - 10https://gerrit.wikimedia.org/r/437813 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [03:40:04] (03CR) 10Andrew Bogott: [C: 032] Toolforge: Update toolforge description in bastion banner [puppet] - 10https://gerrit.wikimedia.org/r/437813 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [03:51:07] (03PS4) 10KartikMistry: WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) [03:52:03] (03CR) 10jerkins-bot: [V: 04-1] WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 
10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [03:53:58] (03PS5) 10KartikMistry: WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) [03:55:04] (03CR) 10jerkins-bot: [V: 04-1] WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [03:59:21] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [03:59:30] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:59:31] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [03:59:40] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [03:59:51] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [04:00:01] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [04:00:10] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:00:20] (03PS6) 10KartikMistry: WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) [04:06:37] (03PS7) 10KartikMistry: WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) [04:58:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437898 [04:58:47] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437898 [05:01:54] (03CR) 10Marostegui: 
[C: 032] Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437898 (owner: 10Marostegui) [05:03:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437898 (owner: 10Marostegui) [05:03:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437898 (owner: 10Marostegui) [05:05:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3316 after alter table (duration: 00m 58s) [05:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:42] (03CR) 10Marostegui: [C: 032] events_sanitarium: Update sanitarium hosts [software] - 10https://gerrit.wikimedia.org/r/437802 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [05:07:33] (03Merged) 10jenkins-bot: events_sanitarium: Update sanitarium hosts [software] - 10https://gerrit.wikimedia.org/r/437802 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [05:07:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437899 (https://phabricator.wikimedia.org/T191316) [05:09:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437899 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:10:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437899 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:10:56] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437899 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:11:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 for alter table (duration: 00m 56s) [05:12:00] !log Deploy schema 
change on db1098:3316 - T191316 T192926 T195193 T89737 [05:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:07] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [05:12:07] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [05:12:07] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:12:07] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:15:06] !log Deploy event_sanitarium on codfw sanitariums - T190704 [05:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:11] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:19:21] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [05:22:40] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:31:50] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 404 (expecting: 200) [05:32:51] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [05:46:12] (03PS5) 10Marostegui: mariadb: Failover m3-master to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/437707 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [05:46:52] (03CR) 10Marostegui: [C: 032] mariadb: Failover m3-master to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/437707 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [05:47:58] (03PS4) 10Marostegui: mariadb: Switchover m3-master to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/437715 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [05:49:29] (03CR) 10Marostegui: [C: 04-1] mariadb: Switchover m3-master to db1072 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437715 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [05:49:40] jynus: can you amend? ^ [05:50:30] (03PS5) 10Jcrespo: mariadb: Switchover m3-master to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/437715 (https://phabricator.wikimedia.org/T186320) [05:51:01] check now [05:51:09] thank you! [05:51:13] (03CR) 10Marostegui: [C: 032] mariadb: Switchover m3-master to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/437715 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [05:52:24] twentyafterfour: around? [05:52:33] we are almost ready [05:56:52] we should figure out how to put phabricator in read only ourselves, just in case [05:57:37] Let's see what we have in gerrit.. 
[06:00:08] This is upstream: https://secure.phabricator.com/D15662 so there is an option apparently
[06:00:51] but I don't see it puppetized
[06:02:10] I am grepping for that on phab1001 but there are hundreds of them, so not sure which one is the one
[06:02:14] twentyafterfour: ping
[06:03:39] phab doesn't handle read_only at the DB level well?
[06:03:58] it goes unavailable
[06:04:03] :(
[06:04:56] I say we deploy anyway, we said we were going to do it today at this time
[06:05:03] Agreed
[06:05:29] let's go for it
[06:10:09] !log start database maintenance on phabricator - brief interruptions could happen
[06:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:56] you take care of the proxies, marostegui?
[06:11:00] yep!
[06:11:05] when I say "now", ok?
[06:11:09] roger!
[06:12:59] marostegui: get ready
[06:13:02] ok
[06:13:14] now
[06:13:15] ok
[06:13:24] done
[06:13:25] and checked
[06:14:12] killed connections
[06:14:23] I can write on phab
[06:14:28] cool
[06:15:44] yeah, all writes are good
[06:18:23] (PS18) Elukey: Create profile::analytics::cluster::packages::* classes [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[06:19:51] tendril updated
[06:20:07] looks good
[06:20:23] the problem now is that the port is not puppetized
[06:20:24] (CR) Elukey: [C: 2] Create profile::analytics::cluster::packages::* classes [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[06:20:36] and the replica is on a different port
[06:20:55] ah, I get what you mean
[06:20:56] so replica functionality is broken
[06:21:11] and it is not host=hostname:port
[06:21:30] But we don't use the replica right now, right?
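Editorial sketch, not part of the channel log: the point being made here is that the database host and port have to be supplied as two separate settings (the channel names them mysql.host and mysql.port) rather than as a single hostname:port string. A minimal illustration of that split follows. The function name `split_host_port` and the hostname `replica.example` are hypothetical, the fallback of 3306 is simply MySQL's standard default port, and nothing here is taken from the actual Puppet code.

```python
def split_host_port(endpoint: str, default_port: int = 3306) -> dict:
    """Split 'hostname:port' into separate host and port settings.

    Hypothetical helper for illustration only. Does not handle IPv6
    literals, which contain colons of their own.
    """
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the host, use the default port.
        return {"mysql.host": endpoint, "mysql.port": default_port}
    if not host or not port.isdigit():
        raise ValueError(f"cannot parse host and port from {endpoint!r}")
    return {"mysql.host": host, "mysql.port": int(port)}


if __name__ == "__main__":
    # Hypothetical replica endpoint on a non-default port.
    print(split_host_port("replica.example:3323"))
```

Keeping the port as its own integer value means a replica on a non-default port can be configured without relying on the client to parse a combined string.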
[06:21:31] it needs extra puppetization (mysql.host and mysql.port)
[06:21:35] we do
[06:21:40] but for statistics
[06:21:42] etc
[06:21:46] Ah right
[06:21:58] the phab rick aitor mail, for example
[06:22:10] I will open a ticket
[06:22:24] yeah, let's follow up there
[06:23:47] Operations: test task, ignore - https://phabricator.wikimedia.org/T196603#4262993 (jcrespo)
[06:24:05] Operations: test task, ignore - https://phabricator.wikimedia.org/T196603#4263004 (jcrespo) Open>Invalid
[06:24:41] !log phabricator maintenance finished
[06:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:08] if you see something strange on phabricator, speak up
[06:26:20] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 3691 bytes in 0.078 second response time
[06:26:26] !log Deploy sanitarium events on db1125 - T190704
[06:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:31] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[06:29:32] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:31:51] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/Lets_Encrypt_Authority_X3.crt]
[06:31:51] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-intel-microcode]
[06:41:22] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:43:21] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[python3-sklearn],Package[python3-sklearn-lib],Package[hunspell-bs],Package[jupyter-notebook] [06:44:06] this is me, working on it --^ [06:45:48] 10Operations, 10Deployments, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#4263044 (10demon) a:05demon>03None [06:55:38] !log relaxing write consistency on db2048 due to ongoing maintenance (sync_binlog,flush_log) [06:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:59] (03PS1) 10Elukey: ores::base: deploy hunspell-bs only on stretch [puppet] - 10https://gerrit.wikimedia.org/r/437913 [06:56:41] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:12] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:15] (03CR) 10Elukey: [C: 032] ores::base: deploy hunspell-bs only on stretch [puppet] - 10https://gerrit.wikimedia.org/r/437913 (owner: 10Elukey) [06:59:51] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:03:41] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:09:27] !log installing git security updates on trusty (Debian already fixed) [07:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:13] (03PS1) 10Elukey: Delete 
mediawiki::monitoring::graphite [puppet] - 10https://gerrit.wikimedia.org/r/437915 [07:28:37] (03CR) 10Muehlenhoff: [C: 031] Delete mediawiki::monitoring::graphite [puppet] - 10https://gerrit.wikimedia.org/r/437915 (owner: 10Elukey) [07:32:45] (03PS2) 10Muehlenhoff: admin: Replace phuedx's key [puppet] - 10https://gerrit.wikimedia.org/r/437794 (owner: 10Phuedx) [07:34:06] (03CR) 10Muehlenhoff: [C: 032] admin: Replace phuedx's key [puppet] - 10https://gerrit.wikimedia.org/r/437794 (owner: 10Phuedx) [07:35:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437917 [07:37:14] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437917 (owner: 10Marostegui) [07:39:13] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437917 (owner: 10Marostegui) [07:39:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437917 (owner: 10Marostegui) [07:40:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3316 after alter table (duration: 00m 57s) [07:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:06] 10Operations, 10cloud-services-team, 10netops: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424#4263117 (10ayounsi) It looks like the dhcp bit is something like: ``` option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/jessie-installer/"; filena... 
[07:41:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437918 (https://phabricator.wikimedia.org/T191316) [07:43:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437918 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:44:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437918 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:46:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093 for alter table (duration: 00m 56s) [07:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:13] !log Deploy schema change on db1093 - T191316 T192926 T195193 T89737 [07:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:20] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [07:46:20] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [07:46:20] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [07:46:20] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [07:47:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437918 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:58:43] (03PS1) 10Muehlenhoff: Remove deployment access for khorn [puppet] - 10https://gerrit.wikimedia.org/r/437919 [08:05:21] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263154 (10WMDE-Fisch) a:03WMDE-Fisch [08:07:51] (03CR) 10Giuseppe Lavagetto: [C: 032] 
Delete mediawiki::monitoring::graphite [puppet] - 10https://gerrit.wikimedia.org/r/437915 (owner: 10Elukey) [08:10:02] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263164 (10WMDE-Fisch) Seems the first step has been done already for some time https://gerrit.wikimedia.org/r/#/c/414768/ [08:13:35] (03PS2) 10Elukey: Delete mediawiki::monitoring::graphite [puppet] - 10https://gerrit.wikimedia.org/r/437915 [08:14:17] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) [08:15:24] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:15:28] (03PS5) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) [08:15:49] !log running ANALYZE on db2091 T196526 [08:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:53] T196526: Geodata running long running queries on Commons - https://phabricator.wikimedia.org/T196526 [08:16:55] 10Operations, 10Traffic, 10netops: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365#4263188 (10ayounsi) Created https://gerrit.wikimedia.org/r/#/c/437784/ to force the mtu to 1450 for IPsec links. [08:17:10] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263191 (10WMDE-Fisch) > The extension has been added to the extension-list file in mediawiki-config > The extension has b... 
[08:28:06] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417#4263203 (10ayounsi) [08:28:12] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: switch port configuration for frbast2001 - https://phabricator.wikimedia.org/T196503#4263201 (10ayounsi) 05Open>03Resolved ```lang=diff [edit interfaces interface-range disabled] - member ge-0/0/15; - member ge-1/0/15; [edit interfaces... [08:32:42] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4263212 (10Marostegui) >>! In T195690#4259933, @jcrespo wrote: > @RobH Can you check if we have next-business day support for defects for this hw provider and purchase? Because they seem to not be honori... [08:32:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437925 [08:34:32] !log Stop replication on db1102:3316 and db1125:3316 to update triggers for archive table - T192926 [08:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:38] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [08:35:38] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437925 (owner: 10Marostegui) [08:36:07] (03PS2) 10Ayounsi: [WIP] Add static routes with MTU 1450 for ipsec dests [puppet] - 10https://gerrit.wikimedia.org/r/437784 [08:36:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add static routes with MTU 1450 for ipsec dests [puppet] - 10https://gerrit.wikimedia.org/r/437784 (owner: 10Ayounsi) [08:37:09] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437925 (owner: 10Marostegui) [08:37:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/437925 (owner: 10Marostegui) [08:38:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1093 after alter table (duration: 00m 55s) [08:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437927 (https://phabricator.wikimedia.org/T191316) [08:46:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437927 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [08:48:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437927 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [08:48:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437927 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [08:49:46] !log slow reboot of all ganeti eqiad VMs (except bohrium, puppetdb1001, poolcounter1001, mx1001) for kernel upgrades and picking up spec-ctrl cpu flag [08:49:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 for alter table (duration: 00m 56s) [08:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:54] 10Operations, 10ops-eqiad: mw1280: CPU error - https://phabricator.wikimedia.org/T195734#4263262 (10MoritzMuehlenhoff) @Cmjohnson Not seeing a new CPU error logged in "racadm getsel", but it's also still depooled and thus not receiving traffic (and may show up only under load). Unless you wanna do additional t... 
[08:50:30] !log Deploy schema change on db1085 with replication, this will generate lag on labsdb for s6 section - T191316 T192926 T195193 T89737 [08:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:37] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [08:50:37] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [08:50:37] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [08:50:37] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [08:59:56] (03CR) 10Ema: [C: 031] Enable base::service_auto_restart for varnishreqstats [puppet] - 10https://gerrit.wikimedia.org/r/436538 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:00:05] 10Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimk-l - https://phabricator.wikimedia.org/T196616#4263318 (10Misos) [09:00:16] (03CR) 10Ema: [C: 031] Enable base::service_auto_restart for varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/436520 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:08:25] (03PS4) 10Sau226: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) [09:13:22] (03PS3) 10Ayounsi: [WIP] Add static routes with MTU 1450 for ipsec dests [puppet] - 10https://gerrit.wikimedia.org/r/437784 [09:19:50] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/436520 (https://phabricator.wikimedia.org/T135991) [09:22:13] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/436520 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:45:21] (03PS2) 
10Muehlenhoff: Enable base::service_auto_restart for varnishreqstats [puppet] - 10https://gerrit.wikimedia.org/r/436538 (https://phabricator.wikimedia.org/T135991)
[09:52:22] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for varnishreqstats [puppet] - https://gerrit.wikimedia.org/r/436538 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:57:22] Operations, ops-eqiad: mw1280: CPU error - https://phabricator.wikimedia.org/T195734#4263471 (Cmjohnson) @MoritzMuehlenhoff yes, go ahead and repool the server. Thanks
[10:00:11] PROBLEM - Host debmonitor1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:07] wut?
[10:02:15] akosiaris: anything going on on ganeti?
[10:02:55] reboots
[10:03:13] to pick up IBPB/IBRS for ganeti clients
[10:03:17] ack
[10:04:11] but I didn't see the others alarming, why is my little snowflake not happy? :)
[10:05:19] btw we have 2 hosts
[10:05:28] dbmonitor1001.wikimedia.org, debmonitor1001.eqiad.wmnet
[10:05:35] for a moment I was like.. huh ?
[10:05:40] yay! I know
[10:05:43] bad naming
[10:05:48] blame me
[10:05:53] we were first
[10:05:59] Job 422450 for debmonitor1001.eqiad.wmnet has failed: Failure: command execution error:
[10:06:00] Could not reboot instance: Hypervisor error: Failed to start instance debmonitor1001.eqiad.wmnet: exited with exit code 1 (qemu-system-x86_64: Property '.spec-ctrl' not found
[10:06:04] interesting
[10:06:09] never seen that before
[10:06:23] I knew qemu doesn't like me
[10:06:37] hmmh, outdated qemu on one of the ganeti servers? having a look
[10:06:48] because of my xen past
[10:06:54] yeah I am looking too
[10:07:03] if it is stateless, is rebuilding a possibility?
[10:07:27] yeah, wouldn't be a problem, the state is safely stored in m2 ;)
[10:07:31] moritzm: yup, exactly that
[10:07:31] akosiaris: ganeti1001 and 1006 are outdated qemu-wise
[10:07:32] * akosiaris fixing
[10:07:35] volans: cool
[10:07:49] hopefully it can be debugged
[10:08:29] ok this time around it started fine
[10:08:51] RECOVERY - Host debmonitor1001 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[10:09:31] and there we go
[10:09:39] moritzm: thanks for the hint
[10:09:40] host seems ok, service too
[10:09:58] (PS4) Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991)
[10:10:32] thanks for updating :-)
[10:10:44] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for atftpd [puppet] - https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:16:29] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/437944
[10:17:34] (PS3) Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - https://gerrit.wikimedia.org/r/437254
[10:17:36] (PS1) Alexandros Kosiaris: Add proton cluster to hieradata [puppet] - https://gerrit.wikimedia.org/r/437945 (https://phabricator.wikimedia.org/T186748)
[10:17:41] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:18:22] (PS2) Alexandros Kosiaris: Add proton cluster to hieradata [puppet] - https://gerrit.wikimedia.org/r/437945 (https://phabricator.wikimedia.org/T186748)
[10:18:24] (PS4) Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - https://gerrit.wikimedia.org/r/437254
[10:18:42] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:20:53] (03CR) 10Alexandros Kosiaris: [C: 032] Add proton cluster to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/437945 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [10:21:12] (03PS4) 10Alexandros Kosiaris: Proton: Apply the role to proton hosts [puppet] - 10https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [10:21:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Proton: Apply the role to proton hosts [puppet] - 10https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [10:29:59] !log installing openssl security updates [10:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:01] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:33] (03PS11) 10Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) [10:32:11] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:32:21] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [10:32:22] (03CR) 10Vgutierrez: [C: 032] Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [10:32:58] (03Merged) 10jenkins-bot: Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [10:33:47] the icinga thing is expected [10:45:38] 10Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimk-l - https://phabricator.wikimedia.org/T196616#4263553 (10MarcoAurelio) Wasn't it reset yesterday? (T196438#4261534) [10:47:09] (03PS6) 10Ema: VCL: Normalise the Accept-Language header for the REST API [puppet] - 10https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) (owner: 10Mobrovac) [10:47:12] 10Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimk-l - https://phabricator.wikimedia.org/T196616#4263559 (10Misos) No, the reset yesterday was for wikimedia-mk. Both lists are maintained by the same admin. 
[10:47:53] (03CR) 10Ema: [C: 032] VCL: Normalise the Accept-Language header for the REST API [puppet] - 10https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) (owner: 10Mobrovac) [10:51:02] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for PDNS recursor Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/437949 (https://phabricator.wikimedia.org/T135991) [10:53:09] (03PS5) 10Elukey: Move the varnishkafka submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377) [10:53:11] (03PS1) 10Elukey: Move the kafkatee submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437950 (https://phabricator.wikimedia.org/T188377) [10:53:13] (03PS1) 10Elukey: Move the jmxtrans submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437951 (https://phabricator.wikimedia.org/T188377) [10:55:07] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10Addshore) >>! In T195370#4263191, @WMDE-Fisch wrote: >> The extension has been added to the extension-list file... 
[10:55:29] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 3 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263585 (10Addshore) [10:59:15] (03PS1) 10WMDE-Fisch: Add FileExporter to BetaFeaturesWhiteList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) [11:01:42] I need help for Change 437777 [11:06:03] (03CR) 10Ema: [C: 04-1] varnish: Remove setting of CP cookies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437774 (https://phabricator.wikimedia.org/T110353) (owner: 10Krinkle) [11:11:24] (03PS1) 10Volans: Fine tune security settings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437954 (https://phabricator.wikimedia.org/T191299) [11:11:27] (03PS1) 10Volans: Run CLI tests also with Python 2.7 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437955 (https://phabricator.wikimedia.org/T167504) [11:11:29] (03PS1) 10Volans: Bumped django-auth-ldap to v1.6.1 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437956 (https://phabricator.wikimedia.org/T167504) [11:11:31] (03PS1) 10Volans: Client CLI: read configuration file. [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437957 (https://phabricator.wikimedia.org/T191300) [11:11:33] (03PS1) 10Volans: Allow to specify a CA bundle for server validation [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437958 (https://phabricator.wikimedia.org/T167504) [11:12:35] (03CR) 10jerkins-bot: [V: 04-1] Bumped django-auth-ldap to v1.6.1 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437956 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:12:38] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: read configuration file. 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/437957 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [11:12:39] (03CR) 10jerkins-bot: [V: 04-1] Allow to specify a CA bundle for server validation [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437958 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:12:42] :( [11:12:44] (03CR) 10jerkins-bot: [V: 04-1] Run CLI tests also with Python 2.7 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437955 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:12:51] ouch [11:13:48] rebellion against Riccardo! :P [11:14:33] why is it trying to install the package :( checking [11:15:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team (Radar): Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#4263635 (10Krinkle) [11:15:31] PROBLEM - Disk space on install1002 is CRITICAL: Return code of 255 is out of bounds [11:15:32] PROBLEM - dhclient process on install1002 is CRITICAL: Return code of 255 is out of bounds [11:15:46] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263636 (10WMDE-Fisch) [11:16:21] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. [11:16:32] RECOVERY - Disk space on install1002 is OK: DISK OK [11:16:41] RECOVERY - dhclient process on install1002 is OK: PROCS OK: 0 processes with command name dhclient [11:17:26] hashar: you around by any chance? 
[11:17:31] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [11:23:42] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:26:41] (03CR) 10Volans: "Apparently we're running an outdated tox (2.6.0) and there is a bug in tox that doesn't allow use properly usedevelop and skip_install tog" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437955 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:27:30] 10Operations, 10JADE, 10TechCom, 10Scoring-platform-team (Current), 10Services (watching): Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4263654 (10mobrovac) [11:28:19] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437944 (owner: 10Marostegui) [11:28:48] 10Operations, 10RESTBase-API, 10Traffic, 10Services (done): Normalise the Accept-Language header for REST API requests - https://phabricator.wikimedia.org/T195327#4263655 (10mobrovac) 05Open>03Resolved Thank you @ema ! [11:30:00] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437944 (owner: 10Marostegui) [11:30:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437944 (owner: 10Marostegui) [11:30:37] (03PS3) 10Volans: Add nginx::snippet define [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 [11:30:57] (03CR) 10jerkins-bot: [V: 04-1] Add nginx::snippet define [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 (owner: 10Volans) [11:31:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 after alter table (duration: 00m 57s) [11:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:32] srsly, jenkins, why today you don't like me? 
[11:31:43] only today? :) [11:31:59] more than usual today :D [11:32:37] volans: yeah I am more or less around [11:33:01] hashar: see my comment in https://gerrit.wikimedia.org/r/#/c/437955/ [11:33:05] 00:00:12.003 Could not parse for environment *root*: Syntax error at 'Wmflib::Ensure'; expected ')' at /src/manifests/snippet.pp:32 [11:33:15] on https://gerrit.wikimedia.org/r/#/c/437761/ [11:34:00] nope, the tox one ;) 437955 [11:34:05] see ^^^ [11:34:09] volans: yeah can you please fill it as a bug ? [11:34:15] upgrade tox? [11:34:15] gotta tweak the containers and bump the version [11:34:16] yeah [11:34:19] ack [11:34:28] do you think it might have side effects? [11:34:51] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [11:35:50] (03PS1) 10WMDE-Fisch: When using the FileExporter set it as BeatFeature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437965 (https://phabricator.wikimedia.org/T195370) [11:36:31] hashar: done: https://phabricator.wikimedia.org/T196628 [11:36:39] (03PS1) 10Jcrespo: mariadb: Prepare reimage of db1111 and db1112 [puppet] - 10https://gerrit.wikimedia.org/r/437966 (https://phabricator.wikimedia.org/T196172) [11:38:06] (03CR) 10Jcrespo: [C: 032] mariadb: Prepare reimage of db1111 and db1112 [puppet] - 10https://gerrit.wikimedia.org/r/437966 (https://phabricator.wikimedia.org/T196172) (owner: 10Jcrespo) [11:42:27] hashar: for the other one it seems that puppet-nginx-rake-docker is installing puppet 3.7.5 [11:42:42] that would be defined in the Gemfile [11:42:48] probably the version has to be bumped there [11:43:18] yep, I can do that, I thought was in the job at first ;) [11:46:00] (03PS4) 10Volans: Add nginx::snippet define [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 [11:46:02] (03PS1) 10Volans: Bump Gemfile dependencies [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437968 [11:47:30] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Add FileExporter to 
BetaFeaturesWhiteList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:48:29] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] When using the FileExporter set it as BeatFeature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437965 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:49:01] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [11:51:56] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263699 (10WMDE-Fisch) Will SWAT the config changes today in the `European Mid-day SWAT` [11:52:21] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:53:51] (03CR) 10Addshore: Add FileExporter to BetaFeaturesWhiteList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:54:11] (03CR) 10Addshore: [C: 031] When using the FileExporter set it as BeatFeature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437965 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:54:39] (03CR) 10Addshore: [C: 031] Add FileExporter to BetaFeaturesWhiteList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:55:21] (03CR) 10WMDE-Fisch: Add FileExporter to BetaFeaturesWhiteList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:59:12] (03PS1) 10Alexandros Kosiaris: Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/437972 [12:01:27] 
(03CR) 10jerkins-bot: [V: 04-1] Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/437972 (owner: 10Alexandros Kosiaris) [12:01:50] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Add FileExporter to BetaFeaturesWhiteList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [12:02:10] (03PS2) 10Alexandros Kosiaris: Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/437972 [12:02:16] (03PS3) 10Alexandros Kosiaris: Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/437972 [12:02:49] (03PS4) 10Alexandros Kosiaris: Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/437972 [12:04:14] (03CR) 10Muehlenhoff: [C: 031] "Looks good, one nit." (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437957 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [12:04:31] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "scap3 - deployment of package requires configuration to already exist" [puppet] - 10https://gerrit.wikimedia.org/r/437972 (owner: 10Alexandros Kosiaris) [12:10:12] PROBLEM - DPKG on proton1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:11:21] RECOVERY - DPKG on proton1001 is OK: All packages OK [12:13:11] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[proton/deploy] [12:14:21] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:14:24] jouncebot: now [12:14:24] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [12:14:26] jouncebot: next [12:14:26] In 0 hour(s) and 45 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1300) [12:15:12] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 74384 bytes in 0.201 second response time [12:15:23] (03PS1) 10Urbanecm: Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 [12:15:41] !log repooled mw1280 after hardware maintenance (T195734) [12:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:46] T195734: mw1280: CPU error - https://phabricator.wikimedia.org/T195734 [12:15:52] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 57.17 seconds [12:15:52] RECOVERY - MariaDB Slave Lag: s1 on db2088 is OK: OK slave_sql_lag Replication lag: 57.40 seconds [12:16:11] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 51.09 seconds [12:16:11] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 51.67 seconds [12:16:11] RECOVERY - MariaDB Slave Lag: s1 on db2092 is OK: OK slave_sql_lag Replication lag: 50.86 seconds [12:16:41] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 36.41 seconds [12:16:42] (03PS1) 10Dzahn: install_server: switch planet1001 from jessie to stretch [puppet] - 10https://gerrit.wikimedia.org/r/437975 (https://phabricator.wikimedia.org/T168490) [12:16:51] RECOVERY - MariaDB Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 32.63 seconds [12:16:57] 10Operations, 10ops-eqiad: mw1280: CPU error - https://phabricator.wikimedia.org/T195734#4263728 
(10MoritzMuehlenhoff) Repooled, seems fine so far. [12:17:48] (03CR) 10Dzahn: [C: 032] install_server: switch planet1001 from jessie to stretch [puppet] - 10https://gerrit.wikimedia.org/r/437975 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [12:18:01] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 46.86 seconds [12:18:41] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[proton/deploy] [12:19:44] (03PS1) 10Urbanecm: English aliases for extra namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437976 (https://phabricator.wikimedia.org/T196614) [12:20:54] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) (owner: 10Sau226) [12:21:41] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263746 (10WMDE-Fisch) [12:23:52] PROBLEM - DPKG on proton2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:25:02] (03PS5) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [12:25:04] (03PS1) 10Alexandros Kosiaris: proton: Add standard, base::firewall, lvs classes to role [puppet] - 10https://gerrit.wikimedia.org/r/437979 [12:25:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] proton: Add standard, base::firewall, lvs classes to role [puppet] - 10https://gerrit.wikimedia.org/r/437979 (owner: 10Alexandros Kosiaris) [12:26:11] RECOVERY - DPKG on proton2001 is OK: All packages OK [12:26:55] !log planet1001 - schedule downtime, boot to PXE, reinstall with stretch (ganeti) (T168490) [12:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:00] T168490: 
upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490 [12:31:22] PROBLEM - Check systemd state on proton1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:31:22] PROBLEM - puppet last run on proton2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[proton/deploy] [12:33:31] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:34:14] (03PS1) 10Gilles: Remove now-optional performance survey description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437981 (https://phabricator.wikimedia.org/T196630) [12:35:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [12:35:30] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4263792 (10Dzahn) @Cmjohnson i'm not sure, Rob created the task with the check boxes. I think from a template. [12:37:48] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4263793 (10Vgutierrez) As @BBlack said, I think considering other projects for our scope (0,08% of the requests) it's a little bit of an overkill, so... [12:38:52] PROBLEM - puppet last run on proton2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[proton/deploy] [12:39:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4263796 (10Dzahn) I used wmf-auto-reimage so that was able to use the mgmt interface to install. I can also confirm i get a console. But i'm' not sure if anything else in BIOS needs to b... 
[12:43:02] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4263799 (10Dzahn) I don't have a specific problem with this instance but if this is a checkbox item that is supposed to happen on each re-assignment (like to set it back to the right BIO... [12:43:30] (03CR) 10Volans: "Reply inline" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437957 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [12:44:02] (03PS1) 10Muehlenhoff: Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 [12:44:05] jouncebot: next [12:44:05] In 0 hour(s) and 15 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1300) [12:45:59] (03Draft2) 10Reedy: Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 (https://phabricator.wikimedia.org/T196583) [12:46:08] !log planet1001/puppetmaster: revoke old cert, sign new cert request, initial puppet run, reinstalled, will turn service active-active again once done (T168490) [12:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:13] T168490: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490 [12:47:09] (03PS1) 10Jcrespo: mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437985 [12:47:34] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437985 (owner: 10Jcrespo) [12:47:54] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437985 (owner: 10Jcrespo) [12:48:31] (03CR) 10Vgutierrez: [C: 031] "inline comment, LGTM otherwise" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/437957 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) 
[12:49:02] RECOVERY - puppet last run on proton2002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:49:32] (03PS1) 10Dzahn: planet: make active-active again, reactivate eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/437987 (https://phabricator.wikimedia.org/T168490) [12:49:41] (03CR) 10jenkins-bot: mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437985 (owner: 10Jcrespo) [12:51:51] RECOVERY - puppet last run on proton2001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:52:23] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 57s) [12:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437990 (https://phabricator.wikimedia.org/T191316) [12:54:01] (03PS1) 10Alexandros Kosiaris: Add LVS IPs for the new proton service [dns] - 10https://gerrit.wikimedia.org/r/437991 (https://phabricator.wikimedia.org/T186748) [12:54:59] !log stop, clone and reimage db1091 [12:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437990 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:57:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437990 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:58:27] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437990 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:58:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 for alter table (duration: 00m 57s) [12:59:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:14] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (owner: 10Rush) [12:59:37] !log Deploy schema change on db1088 - T191316 T192926 T195193 T89737 [12:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:43] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [12:59:44] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [12:59:44] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [12:59:44] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [12:59:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (owner: 10Rush) [13:00:01] (03PS4) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1300). [13:00:05] CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] \o/ [13:00:23] I can SWAT today [13:00:41] CFisch_WMDE: please remind me, you are not a SWAT deployer, right? :D [13:00:59] no I am not :-) [13:01:17] but we had a talk in the team and that might change in the future [13:01:27] cool! 
[13:01:45] I'll ping you in a few minutes when the first patch is at mwdebug1002 for testing [13:02:16] anything interesting about the patches, can not be tested, needs a long time for testing, needs a script to run...? [13:02:17] zeljkof: yeah not much to test there, more preparation patches [13:02:27] so nothing to see [13:02:45] CFisch_WMDE: should I just deploy both of them without mwdebug? [13:02:54] no let me have a sanity check [13:03:29] PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [13:03:34] (03PS1) 10Jcrespo: mariadb: reimage db1091 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/437993 [13:03:38] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [13:03:42] CFisch_WMDE for swater, woo! [13:04:00] * addshore supports that idea [13:04:10] (03PS2) 10Jcrespo: mariadb: reimage db1091 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/437993 [13:04:34] so there's a cronspam (in my Junk) from [13:04:36] Date: Thu, 07 Jun 2018 02:02:11 +0000 [13:04:44] that has a backtrace, that says: [13:04:44] _mysql_exceptions.OperationalError: (2003, "Can't connect to MySQL server on 'm3-slave.eqiad.wmnet' (110)") [13:04:48] that's from phab1001 [13:05:09] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [13:05:09] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} 
(Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) [13:05:17] paravoid: https://phabricator.wikimedia.org/T196604 [13:05:43] ah alright [13:05:54] thank you [13:06:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [13:06:11] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4263849 (10jcrespo) [13:06:39] RECOVERY - mediawiki-installation DSH group on mw1280 is OK: OK [13:06:43] (03PS4) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:06:48] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [13:06:49] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) [13:07:29] (03Merged) 10jenkins-bot: Add FileExporter to BetaFeaturesWhiteList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [13:07:51] (03CR) 10jenkins-bot: Add FileExporter to BetaFeaturesWhiteList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437952 
(https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [13:08:01] CFisch_WMDE: 437952 is at mwdebug1002 [13:08:23] CFisch_WMDE: sorry, will be in a minute [13:08:46] CFisch_WMDE: 437952 is at mwdebug1002, [13:08:52] for real this time :D [13:09:25] cool all fine, thanks can go in [13:09:37] CFisch_WMDE: ok, deploying [13:09:37] (03CR) 10Muehlenhoff: [C: 04-1] decom and remove remnants of tin.eqiad.wmnet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:10:24] (03PS5) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:10:37] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:437952|Add FileExporter to BetaFeaturesWhiteList (T195370)]] (duration: 00m 57s) [13:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:42] T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370 [13:10:51] CFisch_WMDE: 437952 is deployed [13:11:12] (03PS1) 10Alexandros Kosiaris: Add the proton service to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/437994 (https://phabricator.wikimedia.org/T186748) [13:11:14] (03PS1) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) [13:11:16] (03PS1) 10Alexandros Kosiaris: Add ip blocks for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437996 (https://phabricator.wikimedia.org/T186748) [13:11:18] (03PS1) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [13:11:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437965 (https://phabricator.wikimedia.org/T195370) (owner: 
10WMDE-Fisch) [13:12:03] zeljkof: that can go directly to production [13:12:16] CFisch_WMDE: ok, will let you know when it's deployed [13:12:58] (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (owner: 10Rush) [13:13:50] (03Merged) 10jenkins-bot: When using the FileExporter set it as BeatFeature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437965 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [13:15:42] (03CR) 10Rush: openstack: allow designate in labtest to contact labtestn keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437812 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [13:15:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437998 [13:16:06] (03CR) 10Jcrespo: [C: 032] mariadb: reimage db1091 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/437993 (owner: 10Jcrespo) [13:16:26] !log zfilipin@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:437965|When using the FileExporter set it as BeatFeature by default (T195370)]] (duration: 00m 56s) [13:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:31] T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370 [13:16:35] (03PS6) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:16:50] CFisch_WMDE: 437965 deployed, please check and thanks for deploying with #releng! 
:D [13:17:10] looks like there is nothing else for SWAT [13:17:15] !log EU SWAT finished [13:17:15] yay [13:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:23] thanks zeljkof [13:17:47] (03CR) 10jenkins-bot: When using the FileExporter set it as BeatFeature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437965 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [13:18:28] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [13:18:28] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) [13:18:32] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4263884 (10WMDE-Fisch) [13:18:33] I'm around if anything explodes after SWAT, please ping me :D [13:18:58] hehe [13:19:09] all fine on my side [13:19:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437998 (owner: 10Marostegui) [13:19:46] zeljkof: I've some other patches you can deploy for me if you're bored :P [13:19:56] (none are testable on mwdebug ;)) [13:20:34] (03PS3) 10Reedy: Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 
(https://phabricator.wikimedia.org/T196583) [13:20:39] (03CR) 10Reedy: [C: 032] Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:21:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437998 (owner: 10Marostegui) [13:21:35] (03PS4) 10Reedy: Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 (https://phabricator.wikimedia.org/T196583) [13:21:40] (03CR) 10Reedy: [C: 032] Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:21:46] (03PS5) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [13:22:08] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437998 (owner: 10Marostegui) [13:22:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1066 (duration: 00m 56s) [13:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:39] (03PS1) 10Jcrespo: mariadb: Repool db1091 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437999 [13:23:13] (03Merged) 10jenkins-bot: Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:24:26] !log reedy@deploy1001 Synchronized wikiversions-labs.json: labs (duration: 00m 57s) [13:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:36] (03PS1) 10Dzahn: deployment-prep-logstash2: replace deployment-tin server [puppet] - 10https://gerrit.wikimedia.org/r/438001 [13:25:53] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for mcelog [puppet] - 
10https://gerrit.wikimedia.org/r/438002 (https://phabricator.wikimedia.org/T135991) [13:25:54] Reedy: is it urgent? I went to lunch :) [13:26:02] Haha, nope [13:26:06] Doing them myself ;) [13:26:10] Didn't know there is more [13:26:17] Reedy: cool! [13:26:17] (03CR) 10jenkins-bot: Add some new beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437983 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:26:48] !log reedy@deploy1001 Synchronized dblists/all-labs.dblist: labs (duration: 00m 54s) [13:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:50] glad to see you keep deploying from deploy1001 and i hear no issues. that means: die tin, die [13:29:13] I keep logging into tin [13:29:16] seeing the banner [13:29:18] going shit [13:29:21] logging out [13:29:27] and logging into the right host [13:29:33] i will change it to the spare/decom role [13:29:39] though.. then the warning banner will go away [13:29:44] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.6/extensions/WikimediaMaintenance/addWiki.php: add all the wikis (duration: 00m 58s) [13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:49] i can still edit it manually [13:30:13] mutante: people won't be able to log into a spare host due to different access groups [13:30:24] non-SRE at least [13:30:26] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438003 [13:31:17] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.7/extensions/WikimediaMaintenance/addWiki.php: add all the wikis (duration: 00m 56s) [13:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:27] moritzm: oh, right. yea. 
that's a good thing then :) [13:31:51] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438003 (owner: 10Marostegui) [13:31:53] Reedy: if you use my script you can use deployment.eqiad.wmnet as host ;) (that reminds me I need to publish it better) [13:32:03] heh [13:32:14] volans: welcome back to IRC :) [13:32:26] and i replace deployment-tin here: [13:32:32] https://gerrit.wikimedia.org/r/#/c/438001/ ? [13:32:59] (03PS2) 10Dzahn: deployment-prep-logstash2: replace deployment-tin server [puppet] - 10https://gerrit.wikimedia.org/r/438001 [13:33:13] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438003 (owner: 10Marostegui) [13:33:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438003 (owner: 10Marostegui) [13:34:23] (03CR) 10Dzahn: "role::kibana::auth_realm: "Logstash (ssh deployment-tin.eqiad.wmflabs sudo cat /root/secrets.txt)"" [puppet] - 10https://gerrit.wikimedia.org/r/438001 (owner: 10Dzahn) [13:34:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1066 (duration: 00m 57s) [13:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:22] (03CR) 10Dzahn: deployment-prep-logstash2: replace deployment-tin server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438001 (owner: 10Dzahn) [13:37:11] 10Operations, 10ops-eqiad, 10decommission: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4102460 (10Dzahn) Ticket needs the templated check boxes from https://wikitech.wikimedia.org/wiki/Server_Lifecycle/reclaim_checklist copying that into ticket description [13:37:39] 10Operations, 10ops-eqiad, 10decommission: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4263927 (10Dzahn) [13:38:34] (03PS1) 10Jcrespo: dblists: Remove db1059 and 
db1053 for decommission [software] - 10https://gerrit.wikimedia.org/r/438004 (https://phabricator.wikimedia.org/T196606) [13:38:51] (03CR) 10Muehlenhoff: [C: 031] "One nit, but looks good." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:43:07] (03PS1) 10Addshore: Revert "Revert "Load WikibaseLexeme on testwiki (again)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 [13:43:17] (03PS7) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:43:19] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Load WikibaseLexeme on testwiki (again)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 (owner: 10Addshore) [13:43:23] (03PS2) 10Addshore: Load WikibaseLexeme on testwiki (again again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 [13:43:27] (03PS3) 10Addshore: Load WikibaseLexeme on testwiki (again again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 [13:43:29] (03CR) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:43:40] (03CR) 10Addshore: [C: 04-2] "wgLexemeEnableRepo should be set to false for clients first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 (owner: 10Addshore) [13:43:49] (03PS1) 10Addshore: Revert "Revert "Load WikibaseLexeme on all of group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 [13:43:51] (03CR) 10Dzahn: [C: 032] "ok thanks, nitpick fixed" [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:44:00] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Load WikibaseLexeme on all of group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (owner: 10Addshore) [13:44:03] (03PS2) 10Addshore: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 [13:44:10] (03PS8) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:44:12] (03CR) 10jerkins-bot: [V: 04-1] decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:44:14] (03PS3) 10Addshore: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 [13:44:16] (03CR) 10jerkins-bot: [V: 04-1] Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (owner: 10Addshore) [13:44:25] (03PS3) 10Addshore: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) [13:44:33] (03PS3) 10Addshore: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) [13:44:42] (03CR) 10Addshore: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (owner: 10Addshore) [13:44:57] (03CR) 10Dzahn: [C: 032] "haha, right..., now i have to add a lint-ignore around it or ignore jenkins once i touched 
the "mapped IPv6" line :p" [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:45:03] (03CR) 10jerkins-bot: [V: 04-1] decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:47:16] (03PS9) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:47:33] (03CR) 10Addshore: [C: 04-2] "Also requires https://gerrit.wikimedia.org/r/#/c/437500/ to be merged and deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 (owner: 10Addshore) [13:48:56] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4264001 (10Joe) This is not really an issue and redis is correctly working on these servers: ``` deployment-redis05:~$ systemctl status redis-instance-tcp_6379.service ● redis-insta... [13:49:04] (03CR) 10Muehlenhoff: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:49:07] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4264002 (10Joe) 05Open>03Invalid [13:49:09] (03CR) 10Dzahn: [C: 032] decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [13:50:42] (03CR) 10Andrew Bogott: [C: 031] "ok -- I thought you were fixing to break out keystone into an explicitly named service everywhere. 
The effect of this patch certainly see" [puppet] - 10https://gerrit.wikimedia.org/r/437812 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [13:51:59] (03PS10) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [13:52:12] (03Draft2) 10Reedy: Add srwiki and crhwiki to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/438009 (https://phabricator.wikimedia.org/T196583) [13:54:28] (03PS3) 10Dzahn: Add srwiki and crhwiki to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/438009 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:56:51] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4264029 (10elukey) Ready to merge the code reviews for varnishkafka, kafkatee and jmxtrans, but it seems from the... [13:56:53] (03PS1) 10Giuseppe Lavagetto: Labs: use the newer redis servers for jobqueue, locking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438010 [13:57:01] <_joe_> Reedy: ^^ [13:57:20] (03PS4) 10Dzahn: deployment-prep: Add srwiki and crhwiki to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/438009 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:57:27] (03CR) 10Dzahn: [C: 032] deployment-prep: Add srwiki and crhwiki to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/438009 (https://phabricator.wikimedia.org/T196583) (owner: 10Reedy) [13:58:20] (03PS1) 10Reedy: Add some more symlinked files... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438011 [13:58:42] RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set [13:59:39] (03CR) 10Reedy: [C: 032] Add some more symlinked files... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/438011 (owner: 10Reedy) [14:00:04] reedy: That opportune time is upon us again. Time for a New Wiki Creation! deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1400). [14:00:57] (03Merged) 10jenkins-bot: Add some more symlinked files... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438011 (owner: 10Reedy) [14:00:59] (03PS1) 10Giuseppe Lavagetto: beta: update references to redis servers [puppet] - 10https://gerrit.wikimedia.org/r/438012 [14:01:13] (03CR) 10jenkins-bot: Add some more symlinked files... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438011 (owner: 10Reedy) [14:01:42] (03PS2) 10Giuseppe Lavagetto: beta: update references to redis servers [puppet] - 10https://gerrit.wikimedia.org/r/438012 [14:01:49] (03CR) 10Ottomata: [C: 031] Move the kafkatee submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437950 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [14:01:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] beta: update references to redis servers [puppet] - 10https://gerrit.wikimedia.org/r/438012 (owner: 10Giuseppe Lavagetto) [14:02:15] (03PS2) 10Jcrespo: dblists: Remove db1059 and db1053 for decommission [software] - 10https://gerrit.wikimedia.org/r/438004 (https://phabricator.wikimedia.org/T196606) [14:02:19] !log reedy@deploy1001 Synchronized docroot/noc/: Add some more symlinked configs (duration: 00m 57s) [14:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:10] (03CR) 10Jcrespo: [C: 032] dblists: Remove db1059 and db1053 for decommission [software] - 10https://gerrit.wikimedia.org/r/438004 (https://phabricator.wikimedia.org/T196606) (owner: 10Jcrespo) [14:03:30] <_joe_> Reedy: if you merge my change https://gerrit.wikimedia.org/r/438010, things should be better in deployment-prep [14:04:44] thanks [14:04:51] I'll merge after the db patch [14:05:28] (03PS2) 
10Muehlenhoff: Cleanup after migration of deployment servers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436284 [14:06:02] db patch? [14:06:37] which db patch? [14:06:39] Mindfield of text [14:06:45] "dblists: Remove db1059 and db1053 for decommission" [14:06:49] I presumed that was mw-config [14:06:54] ah, ok [14:07:09] (03CR) 10Reedy: [C: 032] Labs: use the newer redis servers for jobqueue, locking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438010 (owner: 10Giuseppe Lavagetto) [14:07:20] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4264064 (10Dzahn) [14:08:13] PROBLEM - Check systemd state on proton1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:08:28] (03Merged) 10jenkins-bot: Labs: use the newer redis servers for jobqueue, locking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438010 (owner: 10Giuseppe Lavagetto) [14:08:42] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test [14:08:42] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) [14:08:56] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4264067 (10Dzahn) tin has been replaced. 
deploy1001 is active since a couple days and tin is now using role spare::system since today and has been removed from network constants and other places: https://gerrit.wikimedia.org/r/... [14:10:01] (03CR) 10jenkins-bot: Labs: use the newer redis servers for jobqueue, locking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438010 (owner: 10Giuseppe Lavagetto) [14:10:02] !log reedy@deploy1001 Synchronized wmf-config/LabsServices.php: labs (duration: 00m 57s) [14:10:03] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:29] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4264070 (10Dzahn) 05Open>03Resolved [14:10:32] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4264071 (10Dzahn) [14:10:36] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4264072 (10Dzahn) 05stalled>03Open [14:11:02] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (10Dzahn) [14:11:05] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4073640 (10Dzahn) [14:11:09] !log reedy@deploy1001 Synchronized wmf-config/jobqueue-labs.php: labs (duration: 00m 56s) [14:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:26] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4264081 (10Dzahn) [14:15:39] (03PS2) 10Reedy: Initial configuration for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437245 (https://phabricator.wikimedia.org/T183706) (owner: 10Urbanecm) [14:15:41] 10Operations: decom tin - 
https://phabricator.wikimedia.org/T196175#4249218 (10Dzahn) [14:15:45] (03CR) 10Reedy: [C: 032] Initial configuration for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437245 (https://phabricator.wikimedia.org/T183706) (owner: 10Urbanecm) [14:16:00] 10Operations: decom/reclaim tin - https://phabricator.wikimedia.org/T196175#4249218 (10Dzahn) [14:16:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom/reclaim tin - https://phabricator.wikimedia.org/T196175#4264088 (10Dzahn) [14:17:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom/reclaim tin - https://phabricator.wikimedia.org/T196175#4249218 (10Dzahn) a:05Dzahn>03None [14:17:28] (03Merged) 10jenkins-bot: Initial configuration for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437245 (https://phabricator.wikimedia.org/T183706) (owner: 10Urbanecm) [14:18:17] (03PS11) 10Reedy: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [14:18:19] (03CR) 10Reedy: [C: 032] idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [14:18:23] (03CR) 10jenkins-bot: Initial configuration for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437245 (https://phabricator.wikimedia.org/T183706) (owner: 10Urbanecm) [14:19:35] (03Merged) 10jenkins-bot: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [14:20:17] (03PS6) 10Reedy: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [14:20:23] (03CR) 10Reedy: [C: 032] Initial configuration for pmswikisource [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [14:21:39] (03Merged) 10jenkins-bot: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [14:21:52] (03PS2) 10Reedy: Initial configuration for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437393 (https://phabricator.wikimedia.org/T196357) (owner: 10Urbanecm) [14:21:55] (03CR) 10Reedy: [C: 032] Initial configuration for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437393 (https://phabricator.wikimedia.org/T196357) (owner: 10Urbanecm) [14:23:00] (03CR) 10jenkins-bot: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [14:23:11] (03Merged) 10jenkins-bot: Initial configuration for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437393 (https://phabricator.wikimedia.org/T196357) (owner: 10Urbanecm) [14:23:26] (03PS2) 10Reedy: Initial configuration for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437238 (https://phabricator.wikimedia.org/T196360) (owner: 10Urbanecm) [14:23:35] (03CR) 10Reedy: [C: 032] Initial configuration for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437238 (https://phabricator.wikimedia.org/T196360) (owner: 10Urbanecm) [14:24:52] (03Merged) 10jenkins-bot: Initial configuration for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437238 (https://phabricator.wikimedia.org/T196360) (owner: 10Urbanecm) [14:24:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438015 [14:26:43] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438015 (owner: 10Marostegui) [14:28:20] 
(03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438015 (owner: 10Marostegui) [14:29:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088 (duration: 00m 57s) [14:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:59] (03PS3) 10Dzahn: deployment-prep-logstash2: replace deployment-tin server [puppet] - 10https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) [14:30:01] (03PS1) 10Reedy: Add 5 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438017 (https://phabricator.wikimedia.org/T183706) [14:30:30] (03CR) 10Reedy: [C: 032] Add 5 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438017 (https://phabricator.wikimedia.org/T183706) (owner: 10Reedy) [14:31:02] (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client (WIP) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 [14:32:00] (03Merged) 10jenkins-bot: Add 5 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438017 (https://phabricator.wikimedia.org/T183706) (owner: 10Reedy) [14:32:08] (03PS2) 10Ottomata: Kafka - Change default message.timestamp.type to CreateTime [puppet] - 10https://gerrit.wikimedia.org/r/437614 (https://phabricator.wikimedia.org/T196407) [14:32:13] (03CR) 10Ottomata: [V: 032 C: 032] Kafka - Change default message.timestamp.type to CreateTime [puppet] - 10https://gerrit.wikimedia.org/r/437614 (https://phabricator.wikimedia.org/T196407) (owner: 10Ottomata) [14:32:51] (03PS2) 10Dzahn: planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) [14:33:02] (03PS2) 10Alexandros Kosiaris: Add ip blocks for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437996 (https://phabricator.wikimedia.org/T186748) [14:33:04] 
(03PS2) 10Alexandros Kosiaris: Add the proton service to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/437994 (https://phabricator.wikimedia.org/T186748) [14:33:06] (03PS2) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) [14:33:08] (03PS2) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [14:33:10] (03PS6) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [14:33:14] (03CR) 10jerkins-bot: [V: 04-1] planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [14:33:55] 5 new wikis at once :) heh, nice [14:34:32] (03CR) 10Alexandros Kosiaris: [C: 032] Add ip blocks for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437996 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [14:34:40] (03PS3) 10Muehlenhoff: Cleanup after migration of deployment servers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436284 [14:35:10] (03CR) 10Alexandros Kosiaris: [C: 032] Add LVS IPs for the new proton service [dns] - 10https://gerrit.wikimedia.org/r/437991 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [14:36:54] mutante: Reedy doesn't mess around, gets it done [14:37:31] (03PS1) 10Alexandros Kosiaris: Fix proton.svc.eqiad.wmnet. DNS RR [dns] - 10https://gerrit.wikimedia.org/r/438019 [14:38:20] (03CR) 10Alexandros Kosiaris: [C: 032] Fix proton.svc.eqiad.wmnet. DNS RR [dns] - 10https://gerrit.wikimedia.org/r/438019 (owner: 10Alexandros Kosiaris) [14:41:48] p858snake|L: :)) [14:42:25] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:42:44] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:50] (03PS1) 10Reedy: Add idwikimedia to MWMutliVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438020 (https://phabricator.wikimedia.org/T192726) [14:43:01] (03CR) 10Reedy: [C: 032] Add idwikimedia to MWMutliVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438020 (https://phabricator.wikimedia.org/T192726) (owner: 10Reedy) [14:44:16] (03Merged) 10jenkins-bot: Add idwikimedia to MWMutliVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438020 (https://phabricator.wikimedia.org/T192726) (owner: 10Reedy) [14:44:59] (03CR) 10Alexandros Kosiaris: [C: 032] Add the proton service to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/437994 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [14:45:28] !log created new wiki databases T183706 T192726 T194879 T196357 T196360 [14:45:29] (03PS4) 10Dzahn: planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 [14:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:41] T196357: Create Wikivoyage Bengali - https://phabricator.wikimedia.org/T196357 [14:45:41] T192726: Create a Wikimedia hosted wiki site for Wikimedia Indonesia - https://phabricator.wikimedia.org/T192726 [14:45:41] T183706: Create Wikivoyage Pashto - https://phabricator.wikimedia.org/T183706 [14:45:41] T194879: Create Wikisource Piedmontese - https://phabricator.wikimedia.org/T194879 [14:45:41] T196360: Create Wikiquote Sakha - https://phabricator.wikimedia.org/T196360 [14:45:46] marostegui: ding [14:46:01] haha [14:46:27] is it all done? 
[14:46:44] just config to deploy [14:46:51] ok, i will sanitize them then [14:47:00] (03PS3) 10Alexandros Kosiaris: Add the proton service to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/437994 (https://phabricator.wikimedia.org/T186748) [14:47:02] (03PS3) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) [14:47:04] (03PS3) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [14:47:06] (03PS7) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [14:47:31] !log reedy@deploy1001 Synchronized dblists/: 5 new wikis (duration: 00m 55s) [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:43] (03PS5) 10Dzahn: planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 [14:48:47] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: idwikimedia (duration: 00m 57s) [14:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:05] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 898.82 seconds [14:49:35] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 933.38 seconds [14:49:49] I think I know what that is..... 
[14:49:53] The creation of the wikis [14:50:02] !log reedy@deploy1001 Synchronized static/images/: new wikis (duration: 00m 57s) [14:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:17] yeah, it is that [14:50:24] (as always) [14:50:33] yay new wikis [14:50:34] (03CR) 10Dzahn: [C: 032] planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 (owner: 10Dzahn) [14:50:37] * revi goes to create userpages [14:51:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1000.48 seconds Marostegui New wikis being created [14:51:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 965.52 seconds Marostegui New wikis being created [14:52:29] revi: we have global userpages you don't need to [14:52:43] I have disabled it [14:52:48] __NOGLOBAL__ ftw [14:53:08] (03PS4) 10Alexandros Kosiaris: Add the proton service to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/437994 (https://phabricator.wikimedia.org/T186748) [14:53:10] (03PS4) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) [14:53:12] (03PS4) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [14:53:14] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: new wikis (duration: 00m 57s) [14:53:14] (03PS8) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [14:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:25] !log Sanitize wikis on db1095 (old sanitarium) - T196362 T196358 T196359 T195008 T193187 [14:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:35] 
ACKNOWLEDGEMENT - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) alexandros kosiaris known, no phab task yet [14:53:35] T195008: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008 [14:53:35] ACKNOWLEDGEMENT - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) alexandros kosiaris known, no phab task yet [14:53:35] ACKNOWLEDGEMENT - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) alexandros kosiaris known, no phab task yet
[14:53:36] T196362: Prepare and check storage layer for sahwikiquote - https://phabricator.wikimedia.org/T196362 [14:53:36] ACKNOWLEDGEMENT - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) alexandros kosiaris known, no phab task yet [14:53:36] T193187: Prepare and check storage layer for idwikimedia - https://phabricator.wikimedia.org/T193187 [14:53:36] T196359: Prepare and check storage layer for pswikivoyage - https://phabricator.wikimedia.org/T196359 [14:53:36] T196358: Prepare and check storage layer for bn.wikivoyage - https://phabricator.wikimedia.org/T196358 [14:53:52] (03CR) 10Alexandros Kosiaris: [C: 032] Add the proton service to
conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/437994 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [14:53:55] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: new wikis [14:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:12] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 [14:55:14] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 (owner: 10Reedy) [14:55:24] PROBLEM - puppet last run on proton2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:59] !log beginning rolling restarts of all cluster kafka brokers to apply log.message.timestamp.type=CreateTime - T196407 [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] T196407: EventBus should produce messages to Kafka with event time set to meta.dt - https://phabricator.wikimedia.org/T196407 [14:56:27] (03CR) 10jerkins-bot: [V: 04-1] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 (owner: 10Reedy) [14:56:31] (03CR) 10jerkins-bot: [V: 04-1] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 (owner: 10Reedy) [14:57:22] (03PS3) 10Dzahn: planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) [14:58:16] (03PS4) 10Dzahn: planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) [14:58:52] (03PS1) 10Reedy: Add pswikivoyage to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438023 (https://phabricator.wikimedia.org/T183706) [14:59:08] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers 
sent from web responses - https://phabricator.wikimedia.org/T194814#4209672 (10Nuria) Or (maybe crazy thought) remove the headers entirely at the nginx layer while varnish work of reorganizi... [14:59:23] (03CR) 10Reedy: [C: 032] Add pswikivoyage to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438023 (https://phabricator.wikimedia.org/T183706) (owner: 10Reedy) [15:00:38] (03Merged) 10jenkins-bot: Add pswikivoyage to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438023 (https://phabricator.wikimedia.org/T183706) (owner: 10Reedy) [15:00:50] (03PS2) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 [15:00:57] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 (owner: 10Reedy) [15:02:09] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 (owner: 10Reedy) [15:02:55] PROBLEM - puppet last run on proton2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:26] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: (no justification provided) (duration: 00m 56s) [15:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:34] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4264282 (10RobH) We had next day support until it expired on May 25th. However, if this case was open before hten, they should honor the warranty. [15:06:34] 10Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimk-l - https://phabricator.wikimedia.org/T196616#4264296 (10RobH) 05Open>03Resolved a:03RobH I've gone ahead and reset the password, which emails it out automatically to all the list admins for wikimk-l. 
[15:06:49] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4264299 (10Marostegui) >>! In T195690#4264282, @RobH wrote: > We had next day support until it expired on May 25th. However, if this case was open before hten, they should honor the warranty. No, it wa... [15:08:20] (03PS5) 10Dzahn: planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) [15:08:29] !log Sanitize wikis on db1124 (current sanitarium for s3) - T196362 T196358 T196359 T195008 T193187 [15:08:35] ACKNOWLEDGEMENT - Check systemd state on proton1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris service still being initialized. Puppet not running because of bad integration with LVS configuration data. Will look into it [15:08:35] ACKNOWLEDGEMENT - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues alexandros kosiaris service still being initialized. Puppet not running because of bad integration with LVS configuration data. Will look into it [15:08:36] ACKNOWLEDGEMENT - puppet last run on proton1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues alexandros kosiaris service still being initialized. Puppet not running because of bad integration with LVS configuration data. Will look into it [15:08:36] ACKNOWLEDGEMENT - puppet last run on proton2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues alexandros kosiaris service still being initialized. Puppet not running because of bad integration with LVS configuration data. 
Will look into it [15:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:37] T195008: Prepare and check storage layer for pmswikisource - https://phabricator.wikimedia.org/T195008 [15:08:37] T196362: Prepare and check storage layer for sahwikiquote - https://phabricator.wikimedia.org/T196362 [15:08:39] T193187: Prepare and check storage layer for idwikimedia - https://phabricator.wikimedia.org/T193187 [15:08:40] T196359: Prepare and check storage layer for pswikivoyage - https://phabricator.wikimedia.org/T196359 [15:08:40] T196358: Prepare and check storage layer for bn.wikivoyage - https://phabricator.wikimedia.org/T196358 [15:15:21] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11405/planet2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [15:15:50] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4264338 (10RobH) >>! In T195690#4264282, @RobH wrote: > We had next day support until it expired on May 25th. However, if this case was open before hten, they should honor the warranty. I misread the r... 
[15:15:55] (03PS1) 10Hoo man: Enable WikidataClient on sahwikiquote and pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438025 (https://phabricator.wikimedia.org/T183706) [15:16:58] (03CR) 10Hoo man: [C: 032] Enable WikidataClient on sahwikiquote and pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438025 (https://phabricator.wikimedia.org/T183706) (owner: 10Hoo man) [15:18:16] (03Merged) 10jenkins-bot: Enable WikidataClient on sahwikiquote and pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438025 (https://phabricator.wikimedia.org/T183706) (owner: 10Hoo man) [15:20:36] !log hoo@deploy1001 Synchronized dblists/wikidataclient.dblist: Enable WikidataClient on sahwikiquote and pswikivoyage - T183706, T196360 (duration: 00m 57s) [15:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:41] T183706: Create Wikivoyage Pashto - https://phabricator.wikimedia.org/T183706 [15:20:42] T196360: Create Wikiquote Sakha - https://phabricator.wikimedia.org/T196360 [15:23:10] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team (Radar): Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#4244858 (10Imarlier) Can I propose an alternative? 1. Get rid of that hiera thing 2. Write a small script that r... 
[15:23:59] !log Emptied out the sites and site_identifiers tables on pswikivoyage, pmswikisource, bnwikivoyage and sahwikiquote for T122520, [15:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:03] T122520: Error running `extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https` on jbowiki - https://phabricator.wikimedia.org/T122520 [15:25:52] (03PS1) 10Dzahn: deployment::server: require libwww-perl [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) [15:27:19] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: add proper active parameter for service [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) [15:27:39] (03PS2) 10Dzahn: deployment::server: require libwww-perl [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) [15:28:14] (03CR) 10jerkins-bot: [V: 04-1] deployment::server: require libwww-perl [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) (owner: 10Dzahn) [15:29:55] !log Running "foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https" T196360 T183706 T195014 T196357 [15:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:20] T196357: Create Wikivoyage Bengali - https://phabricator.wikimedia.org/T196357 [15:30:20] T195014: Add Wikidata support to pmswikisource - https://phabricator.wikimedia.org/T195014 [15:30:20] T183706: Create Wikivoyage Pashto - https://phabricator.wikimedia.org/T183706 [15:30:21] T196360: Create Wikiquote Sakha - https://phabricator.wikimedia.org/T196360 [15:33:26] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Not what I would expect:" [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:33:44] (03PS2) 10Arturo Borrero 
Gonzalez: openstack: keystone: add proper active parameter for service [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) [15:38:53] !log upgrade Cassandra to 3.11.2, restbase1010-{a,b,c} - T178905 [15:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:58] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [15:39:50] (03CR) 10Muehlenhoff: deployment::server: require libwww-perl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) (owner: 10Dzahn) [15:45:17] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Still requires some more work:" [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:45:24] PROBLEM - Host pc2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:36] ^ another pc?? [15:47:49] Ah no [15:47:51] downtime expired [15:47:56] I will silence it again [15:48:08] That is: T196339 [15:48:09] T196339: pc2005 down - https://phabricator.wikimedia.org/T196339 [15:49:09] silenced it for another week [15:50:28] (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: add proper active parameter for service [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) [15:50:44] RECOVERY - Host pc2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [15:51:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: add proper active parameter for service [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:52:04] yay [15:55:21] (03PS4) 10Arturo Borrero Gonzalez: openstack: keystone: add proper active parameter for service [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) [15:55:24] (03PS1) 10Eevans: cassandra: upgrade 3.x version to 
3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/438035 (https://phabricator.wikimedia.org/T178905) [15:56:03] (03CR) 10EBernhardson: "checked and while the patch we were waiting for has been deployed, there is a new regression preventing this from dropping hourly partitio" [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) (owner: 10EBernhardson) [15:56:09] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Far better this time:" [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:57:06] (03CR) 10Andrew Bogott: [C: 031] "If the compiler is happy then I'm happy!" [puppet] - 10https://gerrit.wikimedia.org/r/438029 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:59:08] !log Finished running "foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https" T196360 T183706 T195014 T196357 [15:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:15] T196357: Create Wikivoyage Bengali - https://phabricator.wikimedia.org/T196357 [15:59:15] T195014: Add Wikidata support to pmswikisource - https://phabricator.wikimedia.org/T195014 [15:59:15] T183706: Create Wikivoyage Pashto - https://phabricator.wikimedia.org/T183706 [15:59:15] T196360: Create Wikiquote Sakha - https://phabricator.wikimedia.org/T196360 [15:59:22] Reedy: Just in time… Wikidata support for the wikis should be done :) [15:59:31] Sweet [15:59:33] Thanks [16:00:04] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team (Radar): Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#4264607 (10Imarlier) Gets all scap-enabled projects -- looking for a checkout and cloning if not present wouldn't...
[16:00:04] godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1600). Please do the needful. [16:00:05] no_justification: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:02:18] (03PS6) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [16:08:02] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264628 (10Papaul) a:05Papaul>03jcrespo @jcrespo Dell Shipped a new main board and a new network card. I replaced first the network card to see if the network card was the problem and yes we h... [16:10:48] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4264633 (10Marostegui) Disk arrived and got replaced. Note, it is bigger than the other ones. It is rebuilding: ``` logicaldrive 1 (11.6 TB, RAID 1+0, Recovering, 2% complete) physicaldrive... 
[16:10:52] (03PS1) 10Cmjohnson: adding dns for snapshot1009 [dns] - 10https://gerrit.wikimedia.org/r/438039 (https://phabricator.wikimedia.org/T196189) [16:11:38] (03PS7) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [16:12:19] (03CR) 10Cmjohnson: [C: 032] adding dns for snapshot1009 [dns] - 10https://gerrit.wikimedia.org/r/438039 (https://phabricator.wikimedia.org/T196189) (owner: 10Cmjohnson) [16:13:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#4264640 (10Ottomata) [16:14:24] RECOVERY - Host pc2005 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [16:15:37] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264648 (10Marostegui) I got the server back with network up. We will take it from here. 
Thanks a lot @Papaul [16:16:39] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "compiler is not happy yet:" [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [16:18:11] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264650 (10Marostegui) So, MySQL is up, but unfortunately the master's binlog where pc2005 was replicating from is gone [16:21:54] !log rolling Cassandra restart, restbase1007 - T178905 [16:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [16:22:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264660 (10Marostegui) I guess as this data will just be erased when the number of days expires, we can probably start replication from the current position? [16:23:16] 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4264675 (10Cmjohnson) [16:23:39] (03PS8) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [16:24:05] 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4249646 (10Cmjohnson) a:05Cmjohnson>03RobH @robh can you finish the install please. The network switch is done but I have it disabled. Please enable... [16:25:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264697 (10jcrespo) > I guess as this data will just be erased when the number of days expires, we can probably start replication from the current position?
Yes, or replicating from the first pc1005... [16:32:22] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264711 (10Marostegui) Sounds like a plan, I will start replication from the first available position [16:33:15] (03PS9) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [16:33:48] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4264714 (10Cmjohnson) @chasemp I do not have 2 adjacent 10G racks and do not have space in 2 10G racks in the same row (maybe row D) and 2nd I am not 100... [16:38:24] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4264723 (10Marostegui) Replication has started. It is not ROW: ``` +------------+ | @@hostname | +------------+ | pc1005 | +------------+ +---------------+-----------+ | Variable_name | Value... [16:44:57] (03PS4) 10EBernhardson: Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) [16:45:45] !log rolling Cassandra restart, restbase2001, restbase2002, restbase2007 - T178905 [16:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:50] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [16:47:01] No one puppetswatting? [16:47:37] Is that like punch and judy? 
[16:47:42] Hah [16:48:46] 10Operations, 10Release-Engineering-Team (Kanban): Add CI namespace in staging k8s cluster - https://phabricator.wikimedia.org/T196654#4264759 (10thcipriani) [16:49:04] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Add CI namespace in staging k8s cluster - https://phabricator.wikimedia.org/T196654#4264773 (10thcipriani) [16:50:59] (03PS5) 10EBernhardson: Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) [16:51:24] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [16:52:31] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4264780 (10Jgreen) [16:52:40] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3973195 (10Jgreen) a:03Jgreen [16:57:06] 10Operations, 10fundraising-tech-ops, 10netops: adjust NAT mapping for frdata.wikimedia.org - https://phabricator.wikimedia.org/T196656#4264802 (10Jgreen) [16:57:48] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC), 10Services (watching): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4264815 (10mobrovac) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1700). 
[17:01:51] (03CR) 10Muehlenhoff: [WIP] Allow admin module to ensure system user membership in managed groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [17:02:08] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add fake hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/438052 [17:02:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: eqiad1: add fake hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/438052 (owner: 10Arturo Borrero Gonzalez) [17:02:43] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] openstack: eqiad1: add fake hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/438052 (owner: 10Arturo Borrero Gonzalez) [17:03:33] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4264832 (10Vgutierrez) >>! In T196371#4258991, @Johan wrote: > OK. So I'll ask for > > //Wikipedia is making the site more secure. You are using an... [17:07:25] RECOVERY - Device not healthy -SMART- on labsdb1009 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1009&var-datasource=eqiad%2520prometheus%252Fops [17:08:25] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Refactor pipeline build step to be more isolated/secure/scalable - https://phabricator.wikimedia.org/T195050#4264844 (10thcipriani) actually adding operations [17:11:19] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add more hiera keys placeholders for passwords [labs/private] - 10https://gerrit.wikimedia.org/r/438053 (https://phabricator.wikimedia.org/T196633) [17:12:46] (03PS1) 10Reedy: Update beta interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438054 [17:12:49] halfak: do you want to deploy? 
[17:13:00] !log awight@deploy1001 Started deploy [ores/deploy@65ce165]: New home page for ORES; T196580 [17:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:05] T196580: Give a new look to the home page - https://phabricator.wikimedia.org/T196580 [17:13:10] o/ [17:13:26] (03CR) 10Reedy: [C: 032] Update beta interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438054 (owner: 10Reedy) [17:14:33] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] openstack: eqiad1: add more hiera keys placeholders for passwords [labs/private] - 10https://gerrit.wikimedia.org/r/438053 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [17:14:40] 10Operations, 10cloud-services-team: Disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4264894 (10MoritzMuehlenhoff) [17:14:59] (03Merged) 10jenkins-bot: Update beta interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438054 (owner: 10Reedy) [17:15:50] ohboy... scap/sync/2018-06-07/0001\nFetching submodule submodules/ores\nfatal: unable to access 'http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/ores/': Failed to connect to tin.eqiad.wmnet port 80: Connection timed out\n [17:16:11] !log reedy@deploy1001 Synchronized wmf-config/interwiki-labs.php: labs! 
(duration: 00m 57s) [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:18] !log awight@deploy1001 Finished deploy [ores/deploy@65ce165]: New home page for ORES; T196580 (duration: 03m 19s) [17:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:32] (03PS2) 10Muehlenhoff: Remove deployment access for khorn [puppet] - 10https://gerrit.wikimedia.org/r/437919 [17:16:39] !log awight@deploy1001 Started deploy [ores/deploy@65ce165]: New home page for ORES; T196580 (take 2) [17:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:44] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Clean up videoconvert if there are temp files to do so with - https://phabricator.wikimedia.org/T196660#4264915 (10Bstorm) p:05Triage>03High [17:16:52] Intermittent failure? [17:17:18] awight: tin is decommissioned AFAIK [17:17:44] it's not yet decommissioned, but was stripped of its role as a deployment host earlier today [17:17:56] (03CR) 10Muehlenhoff: [C: 032] Remove deployment access for khorn [puppet] - 10https://gerrit.wikimedia.org/r/437919 (owner: 10Muehlenhoff) [17:18:07] Amir1: Our .gitmodules specifies the repo URLs as gerrit.wmo [17:18:13] who knows what rewriting is doing, though... [17:18:25] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [17:18:42] moritzm: thanks for the clarification [17:19:47] Looks like our URLs aren't overwritten, so I have no idea where "tin" is even coming from.
[17:19:56] (03CR) 10jenkins-bot: Add pswikivoyage to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438023 (https://phabricator.wikimedia.org/T183706) (owner: 10Reedy) [17:20:08] (03PS10) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [17:21:13] !log awight@deploy1001 Finished deploy [ores/deploy@65ce165]: New home page for ORES; T196580 (take 2) (duration: 04m 34s) [17:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:18] T196580: Give a new look to the home page - https://phabricator.wikimedia.org/T196580 [17:21:28] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438015 (owner: 10Marostegui) [17:21:29] thcipriani: twentyafterfour: Any idea why "tin.eqiad.wmnet" cruft would be creeping into a repo where submodules are all coming from gerrit? [17:21:45] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:23:51] (03CR) 10jenkins-bot: Add 5 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438017 (https://phabricator.wikimedia.org/T183706) (owner: 10Reedy) [17:23:53] (03CR) 10jenkins-bot: Add idwikimedia to MWMutliVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438020 (https://phabricator.wikimedia.org/T192726) (owner: 10Reedy) [17:23:55] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438022 (owner: 10Reedy) [17:24:01] (03CR) 10jenkins-bot: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [17:24:03] (03CR) 10jenkins-bot: Initial configuration for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437393 (https://phabricator.wikimedia.org/T196357) (owner: 10Urbanecm) [17:24:05] (03CR) 10jenkins-bot: Initial configuration for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437238 (https://phabricator.wikimedia.org/T196360) (owner: 10Urbanecm) [17:24:07] (03CR) 10jenkins-bot: Enable WikidataClient on sahwikiquote and pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438025 (https://phabricator.wikimedia.org/T183706) (owner: 10Hoo man) [17:24:09] (03CR) 10jenkins-bot: Update beta interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438054 (owner: 10Reedy) [17:25:21] (03PS11) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [17:26:38] awight: no. Also are you still seeing "tin"? if any cruft leaked in you'd think it would be deploy1001? [17:28:19] awight: core still fetches from deploy1001 even if submodules don't which is what it looks like is failing [17:28:40] thcipriani: Very strange, ok I documented as T196663 but have to run for now. 
[17:28:40] T196663: ORES deployments blocked by mysterious tin.eqiad.wmnet error - https://phabricator.wikimedia.org/T196663 [17:28:52] awight: ok, I'll follow up there if I spot anything [17:28:56] Amir1: halfak: ^ [17:28:58] thanks! [17:29:40] is this a re-deploy of a revision? Or a fresh deploy of this revision? [17:30:29] fresh revision, thcipriani [17:30:35] k, thanks [17:33:03] 10Operations, 10ops-codfw: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664#4265022 (10RobH) p:05Triage>03Normal [17:34:58] 10Operations, 10ops-codfw: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665#4265037 (10RobH) p:05Triage>03Normal [17:38:37] 10Operations, 10ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4265061 (10RobH) p:05Triage>03Normal [17:39:04] (03Abandoned) 10Anomie: Add --replica parameter to sql script [puppet] - 10https://gerrit.wikimedia.org/r/397913 (owner: 10Anomie) [17:39:38] 10Operations, 10Cloud-VPS: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4215239 (10Andrew) I'm going to create a new role, 'designatemanager' and attach a patch here granting some DNS privs to that role. Then I think we should cre... [17:41:21] 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#4265112 (10Reedy) [17:42:10] (03PS1) 10Andrew Bogott: designate: support a new 'designatemanager' role [puppet] - 10https://gerrit.wikimedia.org/r/438057 (https://phabricator.wikimedia.org/T194998) [17:53:46] going to do a parsoid deploy in a few mins. [17:55:02] can someone verify that SHA256:fC3OkgwnAX3FbkyyVQCfdpG0W/41rwhZx2sppYsLbN0 is the fingerprint of the key of the deploy host .. 
i haven't deployed in a while and that has changed. [17:55:19] (03PS1) 10Ladsgroup: labs: set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438060 (https://phabricator.wikimedia.org/T196671) [17:55:46] mutante, ^^ [17:55:59] just being overcautious before i accept the new key. [17:56:23] 10Operations, 10ops-eqiad, 10netops: upgrade row d to have 3 10G switches - https://phabricator.wikimedia.org/T196487#4265184 (10Cmjohnson) [17:56:27] 10Operations, 10ops-eqiad, 10netops: upgrade row d to have 3 10G switches - https://phabricator.wikimedia.org/T196487#4258394 (10Cmjohnson) I racked the switch in D4, updated racktables [17:56:37] or thcipriani or halfak :) [17:57:22] never mind .. found it at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1001.eqiad.wmnet [17:57:25] (03CR) 10Ladsgroup: [C: 032] labs: set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438060 (https://phabricator.wikimedia.org/T196671) (owner: 10Ladsgroup) [17:57:58] (03PS1) 10Cmjohnson: Removing dns entrie niobium [dns] - 10https://gerrit.wikimedia.org/r/438061 (https://phabricator.wikimedia.org/T191355) [17:59:01] (03Merged) 10jenkins-bot: labs: set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438060 (https://phabricator.wikimedia.org/T196671) (owner: 10Ladsgroup) [17:59:04] (03CR) 10Cmjohnson: [C: 032] Removing dns entrie niobium [dns] - 10https://gerrit.wikimedia.org/r/438061 (https://phabricator.wikimedia.org/T191355) (owner: 10Cmjohnson) [17:59:17] subbu: do you have access to ops-l? [17:59:32] yes .. 
that is where i found it :) [17:59:36] subbu: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1001.eqiad.wmnet [17:59:37] !log ssastry@deploy1001 Started deploy [parsoid/deploy@2f80639]: Updating Parsoid to 7819c9e7 [17:59:38] 10Operations, 10Puppet, 10Cloud-VPS: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188#4265212 (10Krenair) >>! In T171188#3466573, @Andrew wrote: > - Performance: We haven't ever had a VM puppetmaster support more than a few dozen clients. I can't think of any... [17:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:44] yup, I was looking for this :D [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:09] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4265214 (10Bstorm) For the record, yesterday, I ran: `ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%p %k KB\n" > /root/tools_large_fi... [18:02:00] (03CR) 10jenkins-bot: labs: set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438060 (https://phabricator.wikimedia.org/T196671) (owner: 10Ladsgroup) [18:06:36] (03PS1) 10Chad: Add debug mode to script [software/gerrit] (stable-2.15) - 10https://gerrit.wikimedia.org/r/438064 [18:08:12] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4265241 (10Framawiki) >>! 
In T183920#4265214, @Bstorm wrote: > Also, there is a list of *.err and *.out files that might be worth truncating. I suppose t... [18:10:36] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@2f80639]: Updating Parsoid to 7819c9e7 (duration: 10m 59s) [18:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:40] !log Updated Parsoid (T183706, T192726, T194879, T196357, T196360, T43716) [18:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:50] T196357: Create Wikivoyage Bengali - https://phabricator.wikimedia.org/T196357 [18:13:50] T183706: Create Wikivoyage Pashto - https://phabricator.wikimedia.org/T183706 [18:13:51] T43716: Support language variant conversion in Parsoid - https://phabricator.wikimedia.org/T43716 [18:13:51] T192726: Create a Wikimedia hosted wiki site for Wikimedia Indonesia - https://phabricator.wikimedia.org/T192726 [18:13:51] T196360: Create Wikiquote Sakha - https://phabricator.wikimedia.org/T196360 [18:13:51] T194879: Create Wikisource Piedmontese - https://phabricator.wikimedia.org/T194879 [18:15:08] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4265289 (10bd808) >>! In T183920#4265241, @Framawiki wrote: > I suppose that they are outputs of grid jobs commands. And as I know nothing delete/delete t... [18:18:15] RECOVERY - HP RAID on labsdb1009 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 1I:1:14, 1I:1:15, 1I:1:16 - Controller: OK - Battery/Capacitor: OK [18:20:12] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4265312 (10Cmjohnson) @ayounsi the new mr1 is in the rack, powered on and connected to the current mr1-eqiad console cable. 
[18:20:39] !log restarting Cassandra, restbase2003-c -- T178905 [18:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:45] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:22:29] (03Abandoned) 10Chad: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/414763 (owner: 10Paladox) [18:23:13] (03Abandoned) 10Chad: Add debug mode to script [software/gerrit] (stable-2.15) - 10https://gerrit.wikimedia.org/r/438064 (owner: 10Chad) [18:23:16] (03Abandoned) 10Chad: Gerrit 2.15.2 wmf build [software/gerrit] (stable-2.15) - 10https://gerrit.wikimedia.org/r/437865 (owner: 10Chad) [18:28:20] !log rolling Cassandra restart, restbase2004, restbase2008, restbase2011 -- T178905 [18:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:25] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:30:58] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4265369 (10Marostegui) 05Open>03Resolved All good! Thank you!! ``` logicaldrive 1 (11.6 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)... [18:51:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 404 (expecting: 200) [18:52:25] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [19:00:04] thcipriani: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T1900). 
[19:01:01] * thcipriani does [19:01:52] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4265421 (10Krenair) >>! In T194998#4265079, @Andrew wrote: > I'm going to create a new role, 'designatemanager' and attach a patch here g... [19:02:03] (03PS3) 10Ottomata: Allow admin module to ensure system user membership in managed groups [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) [19:02:09] (03PS3) 10Dzahn: deployment::server: require libwww-perl [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) [19:03:17] (03CR) 10Ottomata: [C: 032] Allow admin module to ensure system user membership in managed groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [19:03:20] (03PS4) 10Ottomata: Allow admin module to ensure system user membership in managed groups [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) [19:03:22] (03CR) 10Ottomata: [V: 032 C: 032] Allow admin module to ensure system user membership in managed groups [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [19:06:24] (03PS1) 10Ottomata: Fix join call in groupmembers.pp [puppet] - 10https://gerrit.wikimedia.org/r/438075 (https://phabricator.wikimedia.org/T174465) [19:06:35] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:06:35] might have some puppet broken alerts incoming ^... [19:06:44] PROBLEM - puppet last run on graphite2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:07:14] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:07:15] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:07:33] (03PS2) 10Ottomata: Fix default_member in groupmembers.pp [puppet] - 10https://gerrit.wikimedia.org/r/438075 (https://phabricator.wikimedia.org/T174465) [19:07:34] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:07:44] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:07:49] (03CR) 10Ottomata: [V: 032 C: 032] Fix default_member in groupmembers.pp [puppet] - 10https://gerrit.wikimedia.org/r/438075 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [19:08:04] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:04] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:04] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:05] PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:05] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:05] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:15] PROBLEM - puppet last run on ores2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:15] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:24] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:24] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:25] PROBLEM - puppet last run on scb2005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members],Exec[citoid-users_ensure_members] [19:08:34] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:44] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:45] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:45] PROBLEM - puppet last run on conf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:54] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:08:55] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4265447 (10Cmjohnson) [19:09:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom niobium/WMF3428 - https://phabricator.wikimedia.org/T191355#4265446 (10Cmjohnson) 05Open>03Resolved [19:09:14] PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:14] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:15] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:15] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:15] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:17] (03CR) 10Dzahn: deployment::server: require libwww-perl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) (owner: 10Dzahn) [19:09:34] PROBLEM - puppet last run on ganeti2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:34] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:34] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:35] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:44] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:44] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:45] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:45] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:54] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:54] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:55] PROBLEM - puppet last run on wdqs2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:55] PROBLEM - puppet last run on cp4031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:55] PROBLEM - puppet last run on db2082 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:55] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:56] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:09:56] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:02] (03PS4) 10Dzahn: deployment::server: require libwww-perl [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) [19:10:04] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 33 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:04] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:05] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:15] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:24] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:24] PROBLEM - puppet last run on kafkamon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:25] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:25] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:25] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:34] PROBLEM - puppet last run on poolcounter2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:35] PROBLEM - puppet last run on dns4002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:42] (03CR) 10Dzahn: [C: 032] deployment::server: require libwww-perl [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) (owner: 10Dzahn) [19:10:45] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:45] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:45] PROBLEM - puppet last run on mc2022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:45] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:45] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:46] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:46] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:47] PROBLEM - puppet last run on mc2034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:10:54] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:04] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:05] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:05] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:05] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:05] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:14] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:14] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:14] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:25] PROBLEM - puppet last run on db1122 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:25] PROBLEM - puppet last run on mc2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:25] PROBLEM - puppet last run on kafkamon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:35] PROBLEM - puppet last run on kafka-jumbo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:35] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:44] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:45] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:45] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:54] PROBLEM - puppet last run on mc2036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:54] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:54] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:54] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:55] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:55] PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:11:55] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:04] PROBLEM - puppet last run on mw2283 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:04] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:04] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:04] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:14] PROBLEM - puppet last run on db1101 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:14] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:14] PROBLEM - puppet last run on labtestneutron2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:14] PROBLEM - puppet last run on elastic2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:14] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:15] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:15] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:16] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:16] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:16] working on it^ [19:12:17] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:17] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:18] sigh ok reverting [19:12:24] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:24] PROBLEM - puppet last run on db1123 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:24] PROBLEM - puppet last run on db1088 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:24] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:24] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:25] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:25] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:26] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:26] PROBLEM - puppet last run on db2076 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:27] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:34] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:44] PROBLEM - puppet last run on druid1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:44] PROBLEM - puppet last run on analytics1074 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:44] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:44] PROBLEM - puppet last run on mwlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:44] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:45] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:45] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:46] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:46] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:54] PROBLEM - puppet last run on db1109 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:54] PROBLEM - puppet last run on db1119 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:55] PROBLEM - puppet last run on conf1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:55] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:55] PROBLEM - puppet last run on ores1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:55] PROBLEM - puppet last run on kafka-jumbo1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:55] PROBLEM - puppet last run on etcd1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:56] PROBLEM - puppet last run on wtp1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:56] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:57] PROBLEM - puppet last run on wtp1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:58] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:58] PROBLEM - puppet last run on aqs1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:58] (03PS2) 10Andrew Bogott: designate: support a new 'designatemanager' role [puppet] - 10https://gerrit.wikimedia.org/r/438057 (https://phabricator.wikimedia.org/T194998) [19:12:59] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:12:59] PROBLEM - puppet last run on auth2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:12] (03CR) 10Dzahn: [C: 032] "Notice: /Stage[main]/Packages::Libwww_perl/Package[libwww-perl]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/438028 (https://phabricator.wikimedia.org/T185275) (owner: 10Dzahn) [19:13:14] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:14] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:15] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members],Exec[oit_ensure_members] [19:13:15] PROBLEM - puppet last run on aqs1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:15] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:15] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:15] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:16] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:16] PROBLEM - puppet last run on dbproxy1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:17] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:24] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:24] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:24] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:25] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:25] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:25] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:25] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:26] PROBLEM - puppet last run on lvs4005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:26] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:27] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:27] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:34] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:34] PROBLEM - puppet last run on wdqs1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:35] PROBLEM - puppet last run on db2087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:35] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:35] PROBLEM - puppet last run on analytics1076 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:35] PROBLEM - puppet last run on ms-be2042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:36] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:36] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:37] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:37] PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:38] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:44] (03CR) 10Andrew Bogott: [C: 032] designate: support a new 'designatemanager' role [puppet] - 10https://gerrit.wikimedia.org/r/438057 (https://phabricator.wikimedia.org/T194998) (owner: 10Andrew Bogott) [19:13:47] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4265453 (10Andrew) I've issued deployment-prep-dns-manager the designateadmin role on deployment-prep. 
I've changed deployment-prep-dns-... [19:13:54] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:13:54] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:04] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:04] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:05] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:05] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:05] PROBLEM - puppet last run on flerovium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:05] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:05] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:06] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:06] PROBLEM - puppet last run on mw1331 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:07] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:07] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:14] PROBLEM - puppet last run on mc2021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:14] PROBLEM - puppet last run on mw2278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:14] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:14] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:14] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:15] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:15] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:24] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:24] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:25] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:25] PROBLEM - puppet last run on db1103 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:25] PROBLEM - puppet last run on db1110 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:25] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:25] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:26] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:26] ottomata: need any help? [19:14:34] PROBLEM - puppet last run on cp4027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:34] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:34] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:34] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:34] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:35] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:35] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:36] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:44] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:44] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:44] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:45] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:45] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:45] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:45] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:46] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:46] PROBLEM - puppet last run on cp5007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:54] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:54] PROBLEM - puppet last run on ganeti2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:54] PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:55] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:14:55] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:04] PROBLEM - puppet last run on poolcounter1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:04] PROBLEM - puppet last run on labtestnet2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:04] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:04] PROBLEM - puppet last run on ores2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:05] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:05] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:05] PROBLEM - puppet last run on druid1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:06] PROBLEM - puppet last run on mc2030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:06] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:07] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:14] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:14] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:14] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:14] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:14] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:15] PROBLEM - puppet last run on mc2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:15] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:16] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:16] PROBLEM - puppet last run on ms-be1042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:17] PROBLEM - puppet last run on ms-be1041 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:17] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:18] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:34] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:34] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:34] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:34] PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:35] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:44] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:44] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:44] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:45] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:45] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:45] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:45] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:46] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:47] I’ll stop ircecho [19:15:50] reverting now [19:15:53] (03PS1) 10Ottomata: Reverting https://gerrit.wikimedia.org/r/#/c/379004/ [puppet] - 10https://gerrit.wikimedia.org/r/438076 [19:15:54] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:54] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:54] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:54] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:54] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:55] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wikidev_ensure_members] [19:15:59] it was two commits so wasn't easy to just click in gerrit [19:16:17] (03CR) 10Ottomata: [V: 032 C: 032] Reverting https://gerrit.wikimedia.org/r/#/c/379004/ [puppet] - 10https://gerrit.wikimedia.org/r/438076 (owner: 10Ottomata) [19:16:18] !log stopped ircecho [19:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:23] (03PS2) 10Ottomata: Reverting https://gerrit.wikimedia.org/r/#/c/379004/ [puppet] - 10https://gerrit.wikimedia.org/r/438076 [19:16:25] (03CR) 10Ottomata: [V: 032 C: 032] Reverting https://gerrit.wikimedia.org/r/#/c/379004/ [puppet] - 10https://gerrit.wikimedia.org/r/438076 (owner: 10Ottomata) [19:16:50] hm, ok, so that failed because the wikidev group in data.yaml is empty [19:17:05] not totally sure why though, puppet/ruby's array functions are a little bit weird [19:17:21] hard to debug that, will figure it out in labs [19:17:42] cumin makes puppet failures really easy to come back from post revert... it's kinda nice. 
[19:17:52] * robh hasn't run anything just loving on cumin [19:21:40] I seem to recall puppet would behave oddly at times when run through salt [19:21:59] (though often, puppet would just behave oddly regardless) [19:22:46] via cumin you run "run-puppet-agent -q" instead of directly 'puppet agent -tv' [19:22:50] maybe for that reason [19:25:31] (03PS2) 10Dzahn: planet: make active-active again, reactivate eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/437987 (https://phabricator.wikimedia.org/T168490) [19:26:52] (03PS1) 10Thcipriani: all wikis to 1.32.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438077 [19:26:54] fwiw the wikitech cumin article has some nice examples like https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [19:26:54] (03CR) 10Thcipriani: [C: 032] all wikis to 1.32.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438077 (owner: 10Thcipriani) [19:28:10] oh wow cool [19:28:37] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438077 (owner: 10Thcipriani) [19:28:53] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438077 (owner: 10Thcipriani) [19:29:25] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.7 [19:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:07] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4265484 (10Andrew) ``` andrew@labcontrol1001:~$ openstack role list +----------------------------------+----------------+ | ID... 
[19:33:49] (03PS3) 10Dzahn: planet: make active-active again, reactivate eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/437987 (https://phabricator.wikimedia.org/T168490) [19:35:57] (03CR) 10Dzahn: [C: 032] planet: make active-active again, reactivate eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/437987 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [19:37:01] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685#4265514 (10RobH) p:05Triage>03Normal [19:41:34] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685#4265531 (10RobH) I've emailed both @joe and @elukey regarding the racking locations of these, email below: > Giuseppe/Luca, > > You are both following this order... [19:41:52] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685#4265532 (10RobH) [19:46:59] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4265539 (10Prolineserver) [19:47:02] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Clean up videoconvert if there are temp files to do so with - https://phabricator.wikimedia.org/T196660#4265537 (10Prolineserver) 05Open>03Resolved I was running the tools' cleanup script, now all old video files should be deleted. 
[19:47:23] !log rolling Cassandra restart, restbase2005, restbase2006, restbase2012 -- T178905 [19:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:28] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [19:56:01] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#4265565 (10Dzahn) [19:56:04] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#4265564 (10Dzahn) 05Open>03Resolved [19:56:39] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366199 (10Dzahn) done. planet is served again from both data centers at once. planet1001 and planet2001 are both on stretch and use rawdog. planet-venus has been decom'ed for... [19:57:23] (03PS1) 10C. Scott Ananian: Enable testing LanguageConverter in sandboxes on deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) [19:58:10] ottomata: looks like puppet run alerts have cleared in icinga, shall I fire ircecho back up? [19:58:34] yeah sure [19:58:37] man this is really hard to test anywhere [19:58:38] hmmm [19:58:49] labs user management is way different...ldap? [19:59:03] i want to try again [19:59:24] herron: hold off one moment, i'm going to try again if it doesn't work i'll revert and give up [19:59:25] ok, if you’re in the middle of it no sweat. 
can keep it turned down [19:59:27] kk [19:59:31] k [20:00:27] 10Operations, 10ops-codfw: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4265595 (10Papaul) [20:01:41] 10Operations, 10ops-codfw, 10DNS, 10Traffic: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493#4265599 (10Papaul) [20:04:47] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review, 10User-notice: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4265611 (10Dzahn) @Johan Let me add this comment from T168490 done. planet is served again from both data centers at once. planet... [20:06:11] 10Operations, 10ops-codfw, 10DNS, 10Traffic: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493#4265615 (10Papaul) [20:11:59] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690#4265618 (10RobH) p:05Triage>03Normal [20:15:51] (03PS1) 10Ottomata: Ensure system_members are in user groups [puppet] - 10https://gerrit.wikimedia.org/r/438081 (https://phabricator.wikimedia.org/T174465) [20:16:03] 10Operations, 10ops-codfw, 10netops: upgrade all codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489#4265639 (10RobH) [20:18:29] (03CR) 10Ottomata: [C: 032] Ensure system_members are in user groups [puppet] - 10https://gerrit.wikimedia.org/r/438081 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [20:23:45] phew ok looks like it is working this time [20:23:53] \o/ [20:24:48] herron: should be safe to turn ircecho back on [20:24:52] WAIT [20:24:56] maybe spoke too soon... [20:25:01] notebook1004 failed [20:25:03] other stuff succeeded [20:25:05] investigating... [20:25:16] right. 
[20:27:19] (03PS1) 10Ottomata: Ensure analytics cluster users are on notebook machines [puppet] - 10https://gerrit.wikimedia.org/r/438083 (https://phabricator.wikimedia.org/T174465) [20:28:28] (03CR) 10Ottomata: [C: 032] Ensure analytics cluster users are on notebook machines [puppet] - 10https://gerrit.wikimedia.org/r/438083 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [20:28:49] (03PS1) 10Urbanecm: Upload wordmark for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438085 (https://phabricator.wikimedia.org/T196680) [20:28:51] (03PS1) 10Urbanecm: Use new wordmark for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438086 (https://phabricator.wikimedia.org/T196680) [20:30:13] k yeah that makes sense [20:30:18] just those nodes needed a little sumpin [20:30:19] makes sense [20:31:44] (03PS1) 10Ottomata: Include analytics cluster users on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/438087 (https://phabricator.wikimedia.org/T174465) [20:31:59] (03CR) 10Ottomata: [V: 032 C: 032] Include analytics cluster users on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/438087 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [20:32:57] ottomata: haha there is always something! [20:33:05] 10Operations, 10ops-eqiad, 10DNS, 10Traffic: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691#4265662 (10RobH) p:05Triage>03Normal [20:33:09] (03PS1) 10Urbanecm: Update static logo resources for bewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438088 (https://phabricator.wikimedia.org/T196599) [20:33:11] (03PS1) 10Urbanecm: Use HD logos in bewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438089 (https://phabricator.wikimedia.org/T196599) [20:33:18] oook [20:33:21] stat1004 too [20:33:27] should be good now herron! [20:34:42] sweet! 
flipping the ircecho switch to on position [20:35:23] 10Operations, 10ops-eqiad, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693#4265704 (10RobH) p:05Triage>03Normal [20:37:17] Anyone with mediawiki/* access could merge https://gerrit.wikimedia.org/r/#/c/438032/ for me? Tnx. [20:37:34] !log ircecho restarted [20:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:55] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [20:52:15] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:55:33] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697#4265819 (10RobH) p:05Triage>03High [20:55:43] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697#4265834 (10RobH) p:05High>03Normal [20:57:16] 10Operations, 10ops-eqiad: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698#4265841 (10RobH) p:05Triage>03Normal [21:08:43] (03PS1) 10Chad: Initial stable-2.15 fork for wikimedia [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438097 [21:08:54] PROBLEM - Long running screen/tmux on furud is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 18096, 1737861s 1728000s). 
[21:09:08] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#4265904 (10Ottomata) [21:10:09] (03CR) 10Chad: [V: 032 C: 032] Initial stable-2.15 fork for wikimedia [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438097 (owner: 10Chad) [21:10:29] (03CR) 10Paladox: [C: 031] Initial stable-2.15 fork for wikimedia [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438097 (owner: 10Chad) [21:13:05] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 60031 MB (12% inode=99%) [21:13:55] 10Operations, 10ops-eqiad: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701#4265921 (10RobH) p:05Triage>03Normal [21:17:34] RECOVERY - Disk space on elastic1027 is OK: DISK OK [21:46:45] (03PS1) 10Chad: Bumping motd to updated upstream sha1 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438113 [21:46:55] (03CR) 10Chad: [V: 032 C: 032] Bumping motd to updated upstream sha1 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438113 (owner: 10Chad) [21:47:31] !log mobrovac@deploy1001 Started deploy [proton/deploy@97ec4bf]: Initial deploy to production - T186748 [21:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:37] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [21:49:50] !log mobrovac@deploy1001 Finished deploy [proton/deploy@97ec4bf]: Initial deploy to production - T186748 (duration: 02m 19s) [21:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:20] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4266001 (10mobrovac) This ^ was actually unsuccessful because the 
targets tried to fetch the submodule from `tin.eqiad.wmnet`, which obviousl... [21:57:40] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani and hashar to gerrit-root - https://phabricator.wikimedia.org/T196702#4266006 (10greg) [21:58:55] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani and hashar to gerrit-root - https://phabricator.wikimedia.org/T196702#4266006 (10greg) Rationale: reducing SPOF for future maintenance and upgrades of Gerrit. [22:02:30] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani to phabricator-roots - https://phabricator.wikimedia.org/T196703#4266022 (10greg) [22:14:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003 SMART failure (again) - https://phabricator.wikimedia.org/T196704#4266037 (10Andrew) [22:14:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003 SMART failure (again) - https://phabricator.wikimedia.org/T196704#4266048 (10Andrew) (previously, https://phabricator.wikimedia.org/T193651 ) [22:14:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003 SMART failure (again) - https://phabricator.wikimedia.org/T196704#4266049 (10Andrew) p:05Triage>03High [22:15:25] ACKNOWLEDGEMENT - Device not healthy -SMART- on labstore1003 is CRITICAL: cluster=labsnfs device=megaraid,13 instance=labstore1003:9100 job=node site=eqiad andrew bogott T196704 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops [22:18:03] 08Warning Alert for device cr1-eqiad.wikimedia.org - Sensor over limit [22:19:20] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4266065 (10thcipriani) >>! 
In T186748#4266001, @mobrovac wrote: > This ^ was actually unsuccessful because the targets tried to fetch the sub... [22:21:30] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani and hashar to gerrit-root - https://phabricator.wikimedia.org/T196702#4266071 (10RobH) This requires SRE team meeting review. Moving to the appropriate column for that. Next meeting is Monday, June 11th, 2018. [22:23:56] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani to phabricator-roots - https://phabricator.wikimedia.org/T196703#4266089 (10RobH) This requires SRE team meeting review. Moving to the appropriate column. The next meeting is Monday, June 11th, 2018. [22:25:07] (03PS1) 10Chad: Gerrit 2.15.2 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438115 [22:25:43] (03CR) 10Chad: [V: 032 C: 032] Gerrit 2.15.2 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438115 (owner: 10Chad) [22:29:33] (03PS1) 10Legoktm: admin: Port matrix.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/438116 [22:36:41] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4266117 (10Krenair) Unfortunately this new -deploy-01 instance went into emergency mode after the security reboots y... [22:38:16] legoktm: hah, did I inadvertently nerd snipe you with matrix.py? [22:38:23] greg-g: yes :p [22:38:42] :) [22:39:39] Is there an EOL for Python 2.x? (Or another reason to drop it from production?) 
[22:40:01] 2020 i think [22:40:16] James_F https://pythonclock.org [22:42:42] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani to phabricator-roots - https://phabricator.wikimedia.org/T196703#4266122 (10RobH) p:05Triage>03Normal [22:42:47] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add thcipriani and hashar to gerrit-root - https://phabricator.wikimedia.org/T196702#4266123 (10RobH) p:05Triage>03Normal [22:46:21] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4262770 (10Krenair) >>! In T196596#4262846, @Reedy wrote: > Brandon reckons it's something to do with `confd::srv_dns` not being set correctly on beta That's weird, I thought that us... [22:47:14] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4266133 (10Krenair) https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=next&oldid=1754779 [22:52:35] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4266141 (10Krenair) More session issues: T172560 T173646 [22:54:15] 10Operations, 10ops-codfw, 10DBA: replace bad disk in db2059 - https://phabricator.wikimedia.org/T196709#4266144 (10RobH) p:05Triage>03High [22:56:16] 10Operations, 10ops-codfw, 10DBA: replace bad disk in db2059 - https://phabricator.wikimedia.org/T196709#4266159 (10RobH) I've set this to high priority due to the looming end of fiscal. If this new SAS disk works fine and the raid rebuilds without incident, we'll be ordering a dozen (or more) additional di... [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180607T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:17] What is the modern redis-cli for stretch machines? [23:01:45] 10Operations, 10Scap: Update Debian Package for Scap3 to 3.8.2-1 - https://phabricator.wikimedia.org/T196710#4266164 (10thcipriani) [23:02:17] ah it comes from the redis-tools package [23:02:59] (03PS1) 10Thcipriani: Scap: Bump version to 3.8.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/438121 (https://phabricator.wikimedia.org/T196710) [23:03:54] 10Operations, 10Scap, 10Patch-For-Review: Update Debian Package for Scap3 to 3.8.2-1 - https://phabricator.wikimedia.org/T196710#4266180 (10thcipriani) [23:13:03] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-eqiad.wikimedia.org recovered from Sensor over limit [23:22:06] (03PS1) 10Bearloga: statistics::discovery: re-enable cron job [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) [23:26:31] (03CR) 10Bearloga: "Not sure how/when/why the cdh submodule got updated and became part of this commit :\" [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [23:32:39] (03PS2) 10Bearloga: statistics::discovery: re-enable cron job [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) [23:35:48] anyone seen this from nutcracker before? 
[2018-06-07 23:28:00.725] nc_redis.c:1092 parsed unsupported command 'COMMAND' [23:43:57] (03PS1) 10Chad: Dropping quota from deployment for now, will readd later [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438126 [23:43:59] (03CR) 10Chad: [V: 032 C: 032] Dropping quota from deployment for now, will readd later [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438126 (owner: 10Chad) [23:46:32] (03PS1) 10Chad: Dropping wikimedia from deployment for now, will readd later [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438127 [23:46:34] (03CR) 10Chad: [V: 032 C: 032] Dropping wikimedia from deployment for now, will readd later [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/438127 (owner: 10Chad) [23:57:10] !log demon@deploy1001 Started deploy [gerrit/gerrit@a07d943]: No-op of current deployed version, want to sync repo state [23:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:16] !log demon@deploy1001 Finished deploy [gerrit/gerrit@a07d943]: No-op of current deployed version, want to sync repo state (duration: 00m 06s) [23:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:51] paladox: Ok, the new repo structure is live on deploy1001, sync'd it to the others (so we're using deploy/wmf/stable-2.14 now instead of stable-2.14) [23:58:03] :) [23:58:04] woo [23:58:36] !log demon@deploy1001 Started deploy [gerrit/gerrit@a07d943]: No-op of current deployed version, want to sync repo state (x2) [23:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:50] !log demon@deploy1001 Finished deploy [gerrit/gerrit@a07d943]: No-op of current deployed version, want to sync repo state (x2) (duration: 00m 14s) [23:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:45] 10Operations, 10Beta-Cluster-Infrastructure: 
confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4266291 (10Krenair) I found that the nutcracker sockets on some of the mediawiki hosts were refusing connections when I tried to run `redis-cli -a $password_here -s /var/run/nutcracker/re...