[00:00:39] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:00:39] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:01:00] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:01:09] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:01:19] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:01:49] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:02:39] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:02:39] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:02:40] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:02:49] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:09:34] (CR) Dzahn: "this should wait until after wiki is created?" [puppet] - https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: Urbanecm)
[00:17:48] Operations, ops-eqiad, Discovery, Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045220 (RobH) >>! In T188045#4045150, @Platonides wrote: > Well, if the server itself is needed, it will be doing its work with a different IP address than the one of wdqs1004, s...
[00:18:09] (CR) Dzahn: "yea, downvoted for adding a Hiera call in the module.. sigh...would need a new parameter for base class" [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[00:30:35] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/10416/dbmonitor1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/415503 (owner: Dzahn)
[00:30:50] (PS4) Dzahn: tendril: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415503
[00:43:50] (CR) Dzahn: [C: 2] "no-op on dbmonitor1001/2001" [puppet] - https://gerrit.wikimedia.org/r/415503 (owner: Dzahn)
[00:47:32] (PS2) Dzahn: tendril: add support for stretch/php7 [puppet] - https://gerrit.wikimedia.org/r/415511
[00:48:07] (CR) Dzahn: [C: 2] "not affecting anything since servers are jessie now, just preparing for the future" [puppet] - https://gerrit.wikimedia.org/r/415511 (owner: Dzahn)
[00:48:18] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4045293 (Smalyshev)
[00:54:07] (CR) Dzahn: "i thought we are switching to php7 and away from hhvm..." [puppet] - https://gerrit.wikimedia.org/r/415768 (owner: Dzahn)
[00:54:58] (PS1) BBlack: cron_splay: add a semiweekly mode of operation [puppet] - https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[00:55:01] (PS1) BBlack: varnish: restart backends every 3.5 days [puppet] - https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[00:55:06] (PS1) BBlack: varnish: remove weekly restart cron entries [puppet] - https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315)
[00:55:52] (Abandoned) Dzahn: openstack/wikitech: add some php7 support [puppet] - https://gerrit.wikimedia.org/r/415768 (owner: Dzahn)
[00:56:54] (Abandoned) Dzahn: openstack:labtest:web: add some php7/stretch support [puppet] - https://gerrit.wikimedia.org/r/415765 (owner: Dzahn)
[00:57:16] (CR) Dzahn: "i'll stop uploading changes to wmcs manifests" [puppet] - https://gerrit.wikimedia.org/r/415765 (owner: Dzahn)
[00:58:44] (CR) Dzahn: "should this wait until after the wiki is created?" [puppet] - https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) (owner: Urbanecm)
[00:59:15] Operations, ops-eqiad, Discovery, Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045325 (Smalyshev) Thanks @RobH! Created {T189548} for loading the data back. @Gehel if you don't see anything else wrong then this one can be resolved.
[01:00:39] (CR) Dzahn: [C: -1] "blocked" [puppet] - https://gerrit.wikimedia.org/r/405230 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:00:55] (CR) Dzahn: [C: -1] "blocked" [dns] - https://gerrit.wikimedia.org/r/405231 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:01:04] (CR) Dzahn: [C: -1] "blocked" [dns] - https://gerrit.wikimedia.org/r/405232 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:01:58] (CR) Dzahn: [C: -1] "blocked" [puppet] - https://gerrit.wikimedia.org/r/405229 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:02:07] (CR) Dzahn: [C: -1] "blocked" [puppet] - https://gerrit.wikimedia.org/r/405226 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:02:56] (CR) Dzahn: [C: -1] prometheus: ganglia-gen outdated resource names (WIP) [puppet] - https://gerrit.wikimedia.org/r/409390 (https://phabricator.wikimedia.org/T186918) (owner: Dzahn)
[01:03:03] (CR) Dzahn: [C: -1] add IPv6 for bast3003 [dns] - https://gerrit.wikimedia.org/r/405225 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:22:00] (CR) Dzahn: [C: -1] site: turn bast1002 into a bastion host [puppet] - https://gerrit.wikimedia.org/r/414848 (https://phabricator.wikimedia.org/T186623) (owner: Dzahn)
[01:26:09] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4045341 (mmodell)
[01:26:13] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4045340 (mmodell)
[01:29:33] PROBLEM - Host labtestneutron2002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:40:13] PROBLEM - configured eth on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:41:54] PROBLEM - dhclient process on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:43:43] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:47:04] PROBLEM - DPKG on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:48:53] PROBLEM - Disk space on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:49:43] PROBLEM - IPMI Sensor Status on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:53:03] PROBLEM - MPT RAID on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[02:04:53] PROBLEM - NTP on labtestneutron2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[02:26:20] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4045378 (ayounsi) @Cmjohnson Can you cable lvs1016 as listed below? | Hostname | Hostport | Switchport | note | |---|---|---|---| | lvs1016 | eth0 | asw2-d:xe-7/0/17 | |...
[02:34:45] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.24) (duration: 05m 30s)
[02:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:54] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.113 second response time
[02:55:54] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1978 bytes in 0.118 second response time
[03:01:40] Operations, ops-ulsfo, Traffic, netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4045408 (ayounsi) p:Triage>Normal
[03:26:54] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 864.93 seconds
[03:29:34] PROBLEM - Host mw2099 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] PROBLEM - Host mw2101 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] PROBLEM - Host mw2102 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] PROBLEM - Host mw2103 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] PROBLEM - Host mw2104 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] PROBLEM - Host mw2105 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:35] PROBLEM - Host mw2106 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:35] PROBLEM - Host mw2107 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:36] PROBLEM - Host mw2108 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:36] PROBLEM - Host mw2109 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:37] PROBLEM - Host mw2110 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:37] PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:38] PROBLEM - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2115 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2116 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] PROBLEM - Host mw2120 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:26] PROBLEM - Host mw2121 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:26] PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:27] PROBLEM - Host mw2123 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:27] PROBLEM - Host mw2124 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:28] PROBLEM - Host mw2125 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:28] PROBLEM - Host mw2126 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:29] PROBLEM - Host mw2127 is DOWN: PING CRITICAL - Packet loss = 100%
[04:07:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.16 seconds
[04:58:29] Operations, Domains, Traffic, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4045522 (Prtksxna)
[05:31:24] PROBLEM - Long running screen/tmux on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[06:31:20] <_joe_> ok now, how do those systems still resurface in puppet, then in icinga
[06:31:23] <_joe_> grrr
[06:35:43] <_joe_> oh I see
[06:35:49] <_joe_> it's herron's fault :P
[06:36:56] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419108
[06:41:25] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419108
[06:42:58] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419108 (owner: Marostegui)
[06:44:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419108 (owner: Marostegui)
[06:45:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 after alter table (duration: 01m 19s)
[06:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:21] (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089)
[06:48:57] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419108 (owner: Marostegui)
[06:49:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:50:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:54:16] (CR) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:54:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 for alter table (duration: 00m 56s)
[06:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:27] !log Deploy schema change on db1081 - T187089 T185128 T153182
[06:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:34] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:56:34] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:56:34] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:58:17] !log Deploy schema change on dbstore1002 - T187089 T185128 T153182
[06:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:26] (PS1) Elukey: profile::geowiki: disable periodic data quality check (cronspam) [puppet] - https://gerrit.wikimedia.org/r/419110 (https://phabricator.wikimedia.org/T173486)
[07:02:32] (CR) Elukey: [C: 2] profile::geowiki: disable periodic data quality check (cronspam) [puppet] - https://gerrit.wikimedia.org/r/419110 (https://phabricator.wikimedia.org/T173486) (owner: Elukey)
[07:11:48] (CR) Gergő Tisza: beta: Enable password authn for Beta Cluster logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: BryanDavis)
[07:13:46] (PS1) Elukey: statistics::user: avoid adding 'stats' to 'wikidev' [puppet] - https://gerrit.wikimedia.org/r/419111
[07:14:48] (CR) Elukey: [C: 2] "@Ottomata: maybe I am missing something but I'd merge this straight away, I'll rollback if needed :)" [puppet] - https://gerrit.wikimedia.org/r/419111 (owner: Elukey)
[07:23:40] PROBLEM - Check systemd state on es2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:36:49] RECOVERY - Check systemd state on es2014 is OK: OK - running: The system is fully operational
[07:36:50] (CR) WMDE-leszek: [C: 1] Enable Wikidata description override on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: Gergő Tisza)
[07:37:38] Operations, ops-codfw, Patch-For-Review, User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4045657 (Joe) @Papaul I would move the servers you put in row A to row B after you decommission the old servers in B 3, if that works for you. Else, I'll try to resh...
[07:45:21] (CR) Jcrespo: "Comment" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:45:48] (CR) Jcrespo: "Sorry, I meant multi-instance" [mediawiki-config] - https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:46:45] (CR) Marostegui: [C: -2] db-eqiad,db-codfw.php: Proposal for moving hosts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:46:50] (PS2) Marostegui: db-eqiad,db-codfw.php: Proposal for moving hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469)
[08:08:49] (CR) Alexandros Kosiaris: [C: -1] base/icinga: add Hiera override to skip systemd monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[08:09:48] Operations, hardware-requests: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (elukey)
[08:10:14] Operations, hardware-requests: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (elukey)
[08:10:16] \o/
[08:11:10] (PS1) Muehlenhoff: statistics::packages: Use non-virtual package name on stretch [puppet] - https://gerrit.wikimedia.org/r/419112
[08:12:26] (CR) Hashar: Cumin masters: upgrade to python3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: Volans)
[08:15:56] (PS2) Elukey: statistics::packages: Use non-virtual package name on stretch [puppet] - https://gerrit.wikimedia.org/r/419112 (owner: Muehlenhoff)
[08:17:04] (PS2) Muehlenhoff: Enable base::service_auto_restart for diamond [puppet] - https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991)
[08:17:06] (CR) Elukey: [C: 2] statistics::packages: Use non-virtual package name on stretch [puppet] - https://gerrit.wikimedia.org/r/419112 (owner: Muehlenhoff)
[08:21:12] (CR) Gilles: [C: 1] varnishslowlog: add Backend-Timing D=, in seconds [puppet] - https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: Ema)
[08:22:18] (PS1) Jcrespo: mariadb: Depool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469)
[08:24:13] (CR) Muehlenhoff: Cumin masters: upgrade to python3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: Volans)
[08:27:21] (PS1) Elukey: statistics::wmde::graphite: Use non-virtual package name on stretch [puppet] - https://gerrit.wikimedia.org/r/419115
[08:27:47] (CR) Marostegui: [C: 1] mariadb: Depool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:28:38] (CR) Jcrespo: "Looks ok then to me, although probably will need more (unknown) tuning later" [mediawiki-config] - https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[08:28:40] (PS3) Muehlenhoff: Enable base::service_auto_restart for diamond [puppet] - https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991)
[08:29:50] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for diamond [puppet] - https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:29:53] (CR) Elukey: [C: 2] statistics::wmde::graphite: Use non-virtual package name on stretch [puppet] - https://gerrit.wikimedia.org/r/419115 (owner: Elukey)
[08:30:01] (PS2) Elukey: statistics::wmde::graphite: Use non-virtual package name on stretch [puppet] - https://gerrit.wikimedia.org/r/419115
[08:30:08] (CR) Gilles: [C: 1] varnishslowlog: filter on all timestamps [puppet] - https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: Ema)
[08:30:16] (CR) Jcrespo: [C: 2] mariadb: Depool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:31:25] moritzm: ready to merge?
[08:31:28] (Merged) jenkins-bot: mariadb: Depool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:31:42] (CR) jenkins-bot: mariadb: Depool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:31:44] I was about to ask you that :-) please go ahead
[08:32:36] ack!
[08:41:08] (PS1) Elukey: statistics::wmde::graphite: depend on generic php-xml [puppet] - https://gerrit.wikimedia.org/r/419118
[08:43:44] (CR) Muehlenhoff: [C: 1] statistics::wmde::graphite: depend on generic php-xml [puppet] - https://gerrit.wikimedia.org/r/419118 (owner: Elukey)
[08:44:09] (CR) Elukey: [C: 2] statistics::wmde::graphite: depend on generic php-xml [puppet] - https://gerrit.wikimedia.org/r/419118 (owner: Elukey)
[08:46:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1063 (duration: 00m 57s)
[08:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:33] Operations, Community-Liaisons, Security-Reviews, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1554077 (Bawolff) So to clarify - There is still interest in using lime survey, right (The third party site, not the software package)? And the question that you want answ...
[08:54:16] (PS1) Jcrespo: mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469)
[08:55:22] (CR) Marostegui: [C: 1] mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:55:52] (PS4) Muehlenhoff: Enable base::service_auto_restart for exim4/sender config [puppet] - https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991)
[08:56:54] (CR) Jcrespo: [C: 2] mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:58:05] (Merged) jenkins-bot: mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[08:58:45] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for exim4/sender config [puppet] - https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:58:51] (CR) jenkins-bot: mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[09:00:48] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4045777 (Marostegui)
[09:02:26] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db1051 and db1063 (duration: 00m 56s)
[09:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:08] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1051 and db1063 (duration: 00m 56s)
[09:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:46] (PS2) Muehlenhoff: Depool poolcounter1001 for kernel update [mediawiki-config] - https://gerrit.wikimedia.org/r/418892
[09:13:01] (CR) Muehlenhoff: [C: 2] Depool poolcounter1001 for kernel update [mediawiki-config] - https://gerrit.wikimedia.org/r/418892 (owner: Muehlenhoff)
[09:13:17] (CR) jenkins-bot: Depool poolcounter1001 for kernel update [mediawiki-config] - https://gerrit.wikimedia.org/r/418892 (owner: Muehlenhoff)
[09:14:58] Operations, ops-eqiad, Discovery, Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045800 (Gehel) Open>Resolved Yay! Thanks @faidon for finding the issue! @RobH / @Papaul : the symptoms of wdqs2006 mgmt interface look vaguely similar (T189318). Any cha...
[09:15:03] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Depooling poolcounter1001 for kernel security update (duration: 00m 56s)
[09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:13] (PS2) BBlack: cron_splay: add a semiweekly mode of operation [puppet] - https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[09:24:15] (PS2) BBlack: varnish: restart backends every 3.5 days [puppet] - https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[09:24:17] (PS2) BBlack: varnish: remove weekly restart cron entries [puppet] - https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315)
[09:37:40] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4045827 (elukey) I acked on icinga notebook100[34] systemd unit failures to avoid confusion for other people (expected I guess since the task is WIP...
[09:37:55] !log rebooting poolcounter1001 for kernel security update
[09:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:45] (PS3) Volans: Cumin masters in prod: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773)
[09:38:46] (PS1) Volans: Cumin masters in WMCS: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T187773)
[09:41:29] (CR) Volans: [C: -2] "This is only for prod now and still -2, the WMCS part was moved into I364f7a3a23328deeaddb69d632d6e9c7ded47258" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: Volans)
[09:42:40] volans: good morning!! Is the production cumin master using Stretch? :]
[09:43:23] volans: on labs I have a Jessie one and custom modules are in /usr/local/lib/python3.4 while Stretch apparently uses "python3" :]
[09:43:27] hashar: nope, jessie for now, pending an upgrade probably this year
[09:43:36] see above
[09:43:37] ah
[09:44:42] volans: you are magic :]
[09:45:36] hashar: that *should* work, but I have no way to test it in the compiler and not much time to test it on my puppetmaster atm
[09:47:17] volans: dont worry. I am probably the only one using cumin on the CI master :] At least puppet pass now!
[09:47:50] (PS1) Elukey: Update the README file with some notes [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/419132
[09:47:58] nice, so if that patch works we could also merge it, I'm not against it
[09:48:06] the prod one has to wait though
[09:48:16] !log upgrading ulsfo LVSs to pybal 1.15.2
[09:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:37] (CR) Elukey: [C: 2] Update the README file with some notes [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/419132 (owner: Elukey)
[09:49:27] (CR) Hashar: "I cherry picked this and integration-cumin now has /usr/local/lib/python3.4/dist-packages/cumin_file_backend.py :] Thanks!" [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: Volans)
[09:54:07] (PS2) Hashar: Cumin masters in WMCS: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: Volans)
[09:55:23] (CR) Hashar: [C: 1] "I have edited the commit message to link to T188112:" [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: Volans)
[09:56:20] hashar: ack, thanks
[09:58:13] hashar: if your test passed, do you think is good to merge?
[10:00:04] !log shutting down blazegraph on wdqs2001 for data transfer to wdqs1004 - T189548
[10:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:10] T189548: reload data on wdqs1004 - https://phabricator.wikimedia.org/T189548
[10:00:56] hashar: just read your full comment, sorry
[10:01:18] volans: almost :D
[10:01:20] (CR) Hashar: "Then I get:" [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: Volans)
[10:01:23] Default backend 'openstack' is not registered
[10:01:33] I have updated my task and copy pasted to the gerrit change
[10:01:36] yeah, that's because is 'optional'
[10:01:40] probably the cumin config file need to explicitly list it ?
[10:01:55] there is no hurry
[10:01:57] no, it needs to install the cumin package with the suggested ones
[10:02:06] (PS1) Muehlenhoff: Revert "Depool poolcounter1001 for kernel update" [mediawiki-config] - https://gerrit.wikimedia.org/r/419135
[10:02:38] let me see if puppet can do it
[10:03:02] AHHH
[10:03:06] Suggests: python3-keystoneauth1, python3-keystoneclient, python3-novaclient
[10:03:27] and indeed they are not installed
[10:03:28] :D
[10:03:29] yep, it's an optional backend
[10:03:33] <_joe_> volans: no-can-do
[10:03:37] <_joe_> and for good reasons
[10:03:38] (PS1) Jcrespo: mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469)
[10:03:39] you got the py2 version of them, not the py3 one
[10:03:58] _joe_: so I cannot tell puppet to require a specific package with suggested ones?
[10:04:05] <_joe_> nope
[10:04:14] :(
[10:04:15] (CR) jerkins-bot: [V: -1] mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[10:04:21] <_joe_> Suggested/Recommended packages are a debian/ubuntu specific thing
[10:04:38] our puppet is ALL debian (and slightly ubuntu) specific
[10:04:52] <_joe_> well "package" is part of puppet itself
[10:04:59] <_joe_> you're welcome to improve it
[10:05:02] eheheh I knew it
[10:05:26] <_joe_> oh
[10:05:31] <_joe_> actually lemme check
[10:05:34] (CR) Muehlenhoff: [C: 2] Revert "Depool poolcounter1001 for kernel update" [mediawiki-config] - https://gerrit.wikimedia.org/r/419135 (owner: Muehlenhoff)
[10:05:43] I could add them in puppet as require_package, but meh
[10:05:56] (CR) Jcrespo: "Ignore the violations, it is the addition of hosts to the new style, which will be compensated when we decom the old ones." [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[10:06:00] (CR) Marostegui: [C: 1] "Looks good to me and I would override the -1 from jenkins as once we have them on stretch and all that we can work on a proper refactor fo" [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[10:06:07] <_joe_> volans: https://puppet.com/docs/puppet/4.8/type.html#package-attribute-install_options
[10:06:14] (CR) Jcrespo: "s/new/old misc/" [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[10:06:16] <_joe_> but then you cannot use require_package ofc
[10:07:20] Operations, Packaging, Scap, Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4045892 (akosiaris) >>! In T180628#4044173, @mmodell wrote: > @akosiaris: I think it's needed on masters, at least to enable deployers to issue gi...
[10:07:22] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling poolcounter1001 after kernel security update (duration: 00m 57s)
[10:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:31] (PS1) Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1072 [puppet] - https://gerrit.wikimedia.org/r/419138 (https://phabricator.wikimedia.org/T188294)
[10:07:38] _joe_: would be so bad to not use require_package just for cumin in the WMCS profile?
[10:07:54] <_joe_> volans: I don't give a damn, tbh
[10:07:58] shouldn't conflict with other requirements
[10:08:07] lol
[10:08:29] (CR) Elukey: [C: 2] Assign role::analytics_cluster::hadoop::worker to analytics1072 [puppet] - https://gerrit.wikimedia.org/r/419138 (https://phabricator.wikimedia.org/T188294) (owner: Elukey)
[10:08:41] !log stopping mysql on db1063 and db1051 to validate the depool before full reimage
[10:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:02] (CR) jenkins-bot: Revert "Depool poolcounter1001 for kernel update" [mediawiki-config] - https://gerrit.wikimedia.org/r/419135 (owner: Muehlenhoff)
[10:11:11] (PS3) Volans: Cumin masters in WMCS: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112)
[10:11:13] (PS4) Volans: Cumin masters in prod: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773)
[10:11:28] hashar: ^^^
[10:12:22] now I don't know if that fixes the situation if you already have cumin tbh, given that you already had it and was upgraded by unattended upgrades (that I think is wrong)
[10:17:36] !log upgrading esams LVSs to pybal 1.15.2
[10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:35] Operations, ops-codfw, ops-eqiad, netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4045911 (ayounsi) a:ayounsi Only looking at the asw ports with link up for now, using LibreNMS: @Papaul If the ports with LLDP neighbors are correct, I can mass add th...
[10:20:54] (PS2) Jcrespo: mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469)
[10:21:29] (CR) jerkins-bot: [V: -1] mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[10:21:37] (PS1) Alexandros Kosiaris: Add initialize_namespace.sh [deployment-charts] - https://gerrit.wikimedia.org/r/419139
[10:22:12] (PS3) BBlack: cron_splay: add a semiweekly mode of operation [puppet] - https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[10:22:14] (PS3) BBlack: varnish: restart backends every 3.5 days [puppet] - https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[10:22:16] (PS3) BBlack: varnish: remove weekly restart cron entries [puppet] - https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315)
[10:22:31] !log rebooting DNS recursors in ulsfo and eqsin for kernel security update
[10:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:00] (PS3) Jcrespo: mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469)
[10:23:06] (CR) jerkins-bot: [V: -1] cron_splay: add a semiweekly mode of operation [puppet] - https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) (owner: BBlack)
[10:23:19] (CR) Jcrespo: [V: 2 C: 2] mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[10:25:58] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4045920 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1063.eqiad.wmnet'] ``` The log...
[10:26:39] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4045922 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1051.eqiad.wmnet'] ``` The log...
[10:28:37] (Abandoned) BBlack: varnish: remove weekly restart cron entries [puppet] - https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315) (owner: BBlack)
[10:29:26] Operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4045940 (elukey) This bit prevents the first couple of puppet runs to complete (and also yarn to start etc..): ``` Error: Could not...
[10:30:27] (03PS4) 10BBlack: cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) [10:30:29] (03PS4) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315) [10:33:41] <_joe_> bblack: I like that approach [10:35:47] yeah my other approaches all had the transition issue that it was messy to switch from one to the other and deal with gaps the first week [10:38:08] you could argue the code is a bit redundantly-verbose now vs a more mathematically-compact form, but it's probably easier to follow when each of the period-cases is explicit and separate [10:39:04] <_joe_> I prefer explicit, easy-to-understand code than mathematical compactness [10:40:33] (03CR) 10BBlack: [C: 031] "Compiler confirms (through another commit making use of this code) this leaves the existing cron entry at the original time, and adds a ne" [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [10:41:03] I never said it was easy-to-understand of course :) [10:41:10] (but it could be far worse!) [10:41:58] <_joe_> bblack: it's pretty easy to read cron_splay.rb and understand what's going on [10:42:30] <_joe_> and instead of strange math constructs in the cron output, we have two cron lines, both weekly, correctly staggered [10:42:48] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4045978 (10mmodell) @akosiaris: **tl;dr** I can't think of any reason that we //must// have `git-lfs` on masters, I've only got vague hand-wavy not... 
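A rough illustration of the approach discussed above (two explicit weekly cron lines staggered by 3.5 days, rather than one compact arithmetic expression). This is only a sketch and is not taken from cron_splay.rb: the function name, the hash-based splay, and the `seed` parameter are all assumptions made for the example.

```python
import hashlib

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY   # 10080
HALF_WEEK = MINUTES_PER_WEEK // 2        # 5040 minutes = 3.5 days


def semiweekly_entries(host, seed="example"):
    """Return two weekly cron time specs for `host`, 3.5 days apart.

    Hypothetical helper: the per-host offset comes from a hash so that
    different hosts are spread across the half-week window.
    """
    digest = hashlib.sha256(f"{seed}:{host}".encode()).hexdigest()
    first = int(digest, 16) % HALF_WEEK  # offset inside the first half-week
    entries = []
    # Each period case is explicit: one entry per half of the week.
    for start in (first, first + HALF_WEEK):
        weekday, rest = divmod(start, MINUTES_PER_DAY)
        hour, minute = divmod(rest, 60)
        # cron field order: minute hour day-of-month month day-of-week
        entries.append(f"{minute} {hour} * * {weekday}")
    return entries
```

Because each entry is an ordinary weekly line, a host that already has one weekly entry could in principle keep it at its original time and simply gain a second one, which matches the transition property mentioned in the review comment.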
[10:44:25] (03CR) 10Hashar: "The poor --install-suggests does not seem to install the suggests :(" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans) [10:44:37] (03PS1) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1073 [puppet] - 10https://gerrit.wikimedia.org/r/419142 (https://phabricator.wikimedia.org/T188294) [10:45:08] (03CR) 10BBlack: [C: 032] cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [10:46:28] (03PS2) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1073 [puppet] - 10https://gerrit.wikimedia.org/r/419142 (https://phabricator.wikimedia.org/T188294) [10:46:39] hashar: have you tried to apt-get remove cumin first? [10:46:55] why would anyone remove cumin? :P [10:47:10] (03CR) 10Elukey: [C: 032] Assign role::analytics_cluster::hadoop::worker to analytics1073 [puppet] - 10https://gerrit.wikimedia.org/r/419142 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [10:47:29] volans: yeah with apt purge [10:47:41] apt -q -y purge cumin; /usr/bin/apt-get -y -o DPkg::Options::=--force-confold -o APT::Install-Suggests=1 install cumin && apt-cache policy python3-keystoneauth1 python3-keystoneclient python3-novaclient [10:47:42] ;D [10:48:29] bblack: to make puppet install the suggested ones in labs :D [10:49:00] hashar: no my suggestion was to apt-get remove cumin and let puppet install it with the suggests [10:49:32] with the latest patch [10:49:40] * hashar tries [10:50:20] volans: yes that the same deal :( [10:50:59] it doesn't install them? 
[10:52:45] (03PS1) 10Odder: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) [10:52:48] 10Operations, 10Puppet, 10User-fgiunchedi: Update jmx_exporter mbeans whitelist for puppetdb 4 - https://phabricator.wikimedia.org/T189516#4046017 (10fgiunchedi) [10:53:58] volans: yeah apt-get with --install-suggests does not install any of the Suggests packages :^/ [10:56:43] hashar: that's not what I see on my test host [10:56:54] (03CR) 10BBlack: [C: 032] varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [10:56:59] (03PS5) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315) [10:57:05] volans: I guess the labs project has a broken apt config so :D [10:57:25] dunno, but cannot debug right now, goal to finish [10:57:32] no worries [10:59:57] install... is... slow... packages... download... slowly [11:00:49] WARNING **: no packages matching running kernel 4.9.0-4-amd64 in archive [11:01:45] No kernel modules were found. This probably is due to a mismatch between the kernel used by this version of the installer :-( [11:02:19] let me guess, the point release happened [11:02:22] <_joe_> jynus: yes [11:02:36] <_joe_> yesterday 9.4 was released [11:02:37] I will see if I can fix it [11:05:37] (03PS6) 10BBlack: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [11:10:10] jynus: there's a script, let me find it [11:12:06] moritzm: https://gerrit.wikimedia.org/r/#/c/292906/ ? 
[11:12:07] (03PS7) 10BBlack: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [11:12:09] jynus: /home/faidon/update-netboot-stretch.sh on puppetmaster1001 [11:12:09] (03PS7) 10BBlack: varnishslowlog: add Backend-Timing D=, in seconds [puppet] - 10https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema) [11:12:43] that commit probably needs an update [11:13:12] yeah, should be for stretch. the cleanest way to fix this would be https://phabricator.wikimedia.org/T182699 [11:13:33] but needs some testing (T182699) [11:13:34] T182699: Use firmware-enriched Debian installation images - https://phabricator.wikimedia.org/T182699 [11:13:36] well, that is mostly secondary [11:13:53] the important stuff would be to automatize it as much as reasonable [11:13:53] (03CR) 10BBlack: [C: 032] varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [11:13:59] (03CR) 10BBlack: [C: 032] varnishslowlog: add Backend-Timing D=, in seconds [puppet] - 10https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema) [11:16:00] s/ize/e/ :P [11:16:48] sorry [11:16:52] I prefer automagicifycation [11:21:20] !log rebooting DNS recursors in esams for kernel security update [11:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:07] !log ran update-netboot-stretch.sh [11:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:49] I have made a backup, but will delete it if it works [11:26:51] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` 
['db1063.eqiad.wmnet'] ``` The log... [11:27:10] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046130 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1051.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1051.eqiad.wmnet'] ``` [11:29:12] (03PS1) 10Giuseppe Lavagetto: Add golang as a build-dependency [debs/etcd] - 10https://gerrit.wikimedia.org/r/419148 [11:29:44] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add golang as a build-dependency [debs/etcd] - 10https://gerrit.wikimedia.org/r/419148 (owner: 10Giuseppe Lavagetto) [11:30:21] !log kartik@tin Started deploy [cxserver/deploy@30ff3b1]: Update cxserver to bd2ccfc [11:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:21] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046150 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1051.eqiad.wmnet'] ``` The log... [11:33:40] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046158 (10jcrespo) Installing...{F15259084} [11:33:44] 10Operations, 10Community-Liaisons, 10Security-Reviews, 10Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#4046159 (10Aklapper) I'd start with asking what would be our use case of using Linesurvey in 2018. This task lacks a desc of a problem that might get solved by Limesurvey (... 
[11:33:51] !log kartik@tin Finished deploy [cxserver/deploy@30ff3b1]: Update cxserver to bd2ccfc (duration: 03m 30s) [11:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:24] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=argon.eqiad.wmnet,service=kubemaster [11:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:22] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4046204 (10jhsoby) >>! In T188776#4021634, @Varnent wrote: >>>! In T188776#4021611, @Bawolff wrote: >> That sa... [11:43:56] <_joe_> !log include our own etcd package (3.2.16) on stretch [11:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:47] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046220 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` and were **ALL** successful. [11:45:53] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:46:37] me ^ [11:46:52] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [11:52:07] * akosiaris expects WATCHLIST to recover... 
let's see [11:56:34] (03CR) 10Filippo Giunchedi: initial commit of 4.4.0-1 (031 comment) [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 (owner: 10Herron) [11:56:52] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 2031 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:56:54] puppetdb not starting when using java 9 ^ [11:56:58] akosiaris: neat [11:57:09] (03PS1) 10Reedy: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) [11:57:11] we don't have Java 9 yet? [11:57:26] jouncebot: next [11:57:26] In 1 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1300) [11:58:09] we need java 10 anyway :P [11:58:15] moritzm: no I deluded myself into hoping it'd work with java 9 from stretch-backports [11:58:36] godog: yeah ok so theory confirmed. Now to just ignore that , now that I know what is going on [11:58:43] Java is always happy to crush your hopes! 
[11:59:22] !log rebooting DNS recursors in codfw for kernel security update [11:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:55] akosiaris: indeed [11:59:55] (03CR) 10Reedy: [C: 032] Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) (owner: 10Reedy) [12:01:10] (03Merged) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) (owner: 10Reedy) [12:01:31] (03CR) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) (owner: 10Reedy) [12:01:37] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4046290 (10faidon) a:05faidon>03ayounsi > We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network. > > You can see... 
[12:04:07] !log reedy@tin Synchronized wmf-config/interwiki.php: T188537 (duration: 00m 57s) [12:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:12] T188537: Please update wmf-config/interwiki.php following on-wiki updates - https://phabricator.wikimedia.org/T188537 [12:07:52] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 751629 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:07:52] (03PS1) 10BBlack: eqsin: add ripe-atlas ping measurement monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419156 (https://phabricator.wikimedia.org/T179042) [12:09:28] (03CR) 10BBlack: [C: 032] eqsin: add ripe-atlas ping measurement monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419156 (https://phabricator.wikimedia.org/T179042) (owner: 10BBlack) [12:15:15] 10Operations, 10ops-eqsin, 10Traffic, 10netops, 10Patch-For-Review: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4046328 (10BBlack) [12:15:54] 10Operations, 10ops-eqsin, 10Traffic, 10netops, 10Patch-For-Review: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3711364 (10BBlack) 05Open>03Resolved >>! In T179042#4046290, @faidon wrote: > Only thing left is monitoring, right? I think so AFAIK, and done above, showing... 
[12:21:31] (03PS1) 10Filippo Giunchedi: puppetmaster: export all puppetdb mbeans [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) [12:22:05] (03PS1) 10Muehlenhoff: Temporarily remove chromium from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419159 [12:26:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=argon.eqiad.wmnet,service=kubemaster [12:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:57] (03CR) 10Addshore: [C: 031] Enable Wikidata description override on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [12:27:50] (03CR) 10Addshore: "Awesome, removing the -1 because this patch is now based on the best patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [12:29:27] (03CR) 10Filippo Giunchedi: "PCC for some reason doesn't show a diff https://puppet-compiler.wmflabs.org/compiler02/10422/" [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi) [12:32:43] !log reboot ganeti VMs on row_A in eqiad for cache=none setting. T181121 [12:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:49] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [12:33:03] (03PS1) 10ArielGlenn: remove snapshot01 from mediawiki scap list on beta for testing [puppet] - 10https://gerrit.wikimedia.org/r/419160 [12:33:15] (03CR) 10Filippo Giunchedi: "FWIW when the time comes we can merge this as-is since the only puppet masters using puppetdb too are in production (i.e. 
not labspuppetma" [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [12:33:55] (03CR) 10ArielGlenn: [C: 032] remove snapshot01 from mediawiki scap list on beta for testing [puppet] - 10https://gerrit.wikimedia.org/r/419160 (owner: 10ArielGlenn) [12:35:26] elukey: bohrium will get rebooted soon ^ [12:35:33] fyi [12:35:53] ack! [12:38:24] (03PS1) 10Odder: Add high-density logos for seven Wikipedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) [12:41:55] RECOVERY - Host labtestneutron2002 is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [12:47:10] (03PS1) 10Elukey: profile::hadoop::worker: use require instead of include [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) [12:48:15] (03PS1) 10Odder: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) [12:51:59] (03CR) 10Elukey: "my 2c: since this jvm is really important and iirc it may populate a ton of mbeans that can potentially be expensive to calculate, so I'd " [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi) [12:52:51] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10423/analytics1030.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [12:54:18] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4046439 (10fgiunchedi) [12:54:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade hiera to stretch (version 3) - https://phabricator.wikimedia.org/T188623#4046437 (10fgiunchedi) 05Open>03Resolved This should be resolved as all patches are merged and 
rhodium is running hiera 3 and compiling fine. [12:59:11] PROBLEM - SSH on install1002 is CRITICAL: connect to address 208.80.154.22 and port 22: Connection refused [12:59:11] PROBLEM - dhclient process on install1002 is CRITICAL: Return code of 255 is out of bounds [12:59:12] PROBLEM - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused [12:59:12] PROBLEM - Check size of conntrack table on install1002 is CRITICAL: Return code of 255 is out of bounds [13:00:03] <_joe_> uh [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1300). [13:00:04] marlier: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] RECOVERY - SSH on install1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [13:00:12] RECOVERY - dhclient process on install1002 is OK: PROCS OK: 0 processes with command name dhclient [13:00:12] RECOVERY - Squid on install1002 is OK: TCP OK - 0.003 second response time on 208.80.154.22 port 8080 [13:00:12] RECOVERY - Check size of conntrack table on install1002 is OK: OK: nf_conntrack is 0 % full [13:00:14] I can SWAT today [13:00:26] marlier: around for SWAT? [13:00:37] <_joe_> akosiaris: maybe we should wait swat to be done [13:00:45] <_joe_> mwdebug1001/1002 are on ganeti [13:00:57] (03PS1) 10Urbanecm: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 [13:01:06] _joe_: something is happening? 
[13:01:14] <_joe_> zeljkof: nothing [13:01:22] cool :) [13:01:34] <_joe_> zeljkof: I was advising akosiaris to avoid rebooting mwdebug1* servers during SWAT :P [13:01:47] <_joe_> we're not used to swat being this early :) [13:01:50] _joe_: please dont! :) [13:02:00] (03PS2) 10Urbanecm: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 [13:02:02] it surprises me too [13:02:19] zeljkof: would it be possible to deploy in about 10 minutes? [13:02:36] marlier: sure, I'm around, let me know when you are ready [13:02:43] Had to step away from the computer but only for a moment [13:02:50] Thanks! [13:02:56] _joe_: looks like swat will be 10 minutes late, if you can reboot in that time, go ahead [13:03:14] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm) [13:07:18] (03PS1) 10Gilles: Upgrade to 1.16 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/419172 (https://phabricator.wikimedia.org/T186528) [13:08:33] (03PS1) 10Filippo Giunchedi: puppet: depool and reinstall puppetmaster2002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/419173 (https://phabricator.wikimedia.org/T184562) [13:09:12] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo [13:09:12] g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled [13:09:31] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: 
Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo [13:09:31] g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled [13:10:21] wah wah logstash [13:10:26] I'll take a look [13:10:33] did 1008 crash? [13:11:05] rebooted [13:11:12] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [13:11:31] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:11:33] ah, I didn't know that was on ganety too [13:11:45] unfortunately closer time wise to logstash1007 than it should [13:11:49] ah that explains it [13:11:55] I did leave 2 minutes of time between reboots [13:12:03] it looks like it wasn't enough for logstash [13:12:33] what is availability mode for logstash? as long as one is up is ok? [13:12:48] (03PS1) 10Elukey: profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - 10https://gerrit.wikimedia.org/r/419175 [13:13:13] the logstash hosts which are on ganeti (1007-1009) don't hold elastic data [13:13:14] I don't know offhand the numbers but I'd guess one ingestion host for logstash is enough to cope with the load yeah [13:13:24] the others are on baremetal (1004-1006) [13:13:40] moritzm: I literally mean logstash and not elastic [13:13:41] and need to be rebooted one by one until the cluster has recovered, usually takes 10 minutes [13:13:59] (03PS1) 10Odder: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) [13:14:04] zeljkof: I'm back, ready whenever works for you. 
[13:14:17] those can be rebooted without interruption as long as they are depooled one by one [13:14:30] no need for waiting time otherwise [13:14:40] marlier: I'm ready! [13:14:45] SWAT starts! [13:14:52] moritzm: you became an expert on reboots :-) [13:14:59] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4039520 (10Ottomata) I had emailed Dario about this before, and told him it might be hard, but on second thought, I think it isn'... [13:15:11] you are awarded with more reboots! [13:15:24] :-) [13:15:29] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [13:15:31] RECOVERY - Disk space on labtestneutron2001 is OK: DISK OK [13:15:51] yay! [13:16:01] RECOVERY - DPKG on labtestneutron2001 is OK: All packages OK [13:16:01] RECOVERY - dhclient process on labtestneutron2001 is OK: PROCS OK: 0 processes with command name dhclient [13:16:01] RECOVERY - configured eth on labtestneutron2001 is OK: OK - interfaces up [13:16:47] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi) [13:16:59] (03Merged) 10jenkins-bot: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [13:18:25] marlier: the patch is at mwdebug1002, please test and let me know if I can deploy [13:18:30] Verified [13:18:32] GTG [13:18:39] ok, deploying [13:18:55] (03CR) 10jenkins-bot: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [13:19:01] RECOVERY - puppet last run on 
labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:19:23] 10Operations, 10netops: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4046514 (10ayounsi) p:05Triage>03Low [13:19:42] RECOVERY - IPMI Sensor Status on labtestneutron2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [13:20:07] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:417331|wmf-config: enable Singapore oversample as default on all wikis (T188652)]] (duration: 00m 57s) [13:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:13] T188652: Enable oversampling for Singapore - https://phabricator.wikimedia.org/T188652 [13:20:26] marlier: deployed! please test and thanks for deploying with #releng ;) [13:20:44] no other patches for swat? [13:20:46] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4046529 (10Ottomata) Ah thanks, yeah, I meant to get back to this the next day but we got other thinnngnggs [13:20:58] !log EU SWAT finished [13:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:42] (03CR) 10Ottomata: "Hm, ok! I hope this doesn't break anything! Not that it will, but it seems like there was probably a reason it was in this group." [puppet] - 10https://gerrit.wikimedia.org/r/419111 (owner: 10Elukey) [13:23:51] (03CR) 10Ottomata: [C: 031] profile::hadoop::worker: use require instead of include [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:24:10] zeljkof: confirmed that change is live everywhere. Appreciate it! 
[13:27:08] marlier: /me thumbs up ;) [13:28:46] (03CR) 10Filippo Giunchedi: [C: 031] profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - 10https://gerrit.wikimedia.org/r/419175 (owner: 10Elukey) [13:29:14] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) [13:29:25] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#4046549 (10chasemp) 05Open>03Resolved closing for now [13:29:27] !log stop db1001 for maintenance (proxies will temporarely complain about lack of redundancy) [13:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:40] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:30:30] 10Operations, 10cloud-services-team, 10Epic: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#4046555 (10chasemp) [13:30:42] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) [13:31:10] * moritzm shakes fist at the commit message CI check [13:31:22] RECOVERY - Long running screen/tmux on labtestneutron2001 is OK: OK: No SCREEN or tmux processes detected. 
[13:31:27] 10Operations, 10cloud-services-team, 10Epic: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (10chasemp) [13:31:58] 10Operations, 10cloud-services-team, 10Epic: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (10chasemp) [13:32:02] PROBLEM - Disk space on labtestneutron2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:32:43] PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [13:33:02] PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [13:33:05] (03PS1) 10Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - 10https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470) [13:33:27] jynus: ^ is that you? [13:33:30] ah yes [13:33:33] missed the !log :) [13:33:34] sorry [13:33:54] yes [13:34:09] everything is ok [13:34:15] :) [13:34:43] the other option, downtiming the proxies, would prevent us from seeing a real outage [13:36:02] RECOVERY - Disk space on labtestneutron2002 is OK: DISK OK [13:37:46] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4046577 (10akosiaris) All row_A eqiad VMs have been rebooted with cache=none. We are now again in a waiting period. [13:39:35] (03PS1) 10Gehel: wdqs: disable kafka poller on new wdqs-internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/419181 [13:39:54] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4046578 (10chasemp) @madhuvishy could you review and potentially close this round of cleanup?
:D [13:40:20] (03CR) 10Gehel: [C: 032] wdqs: disable kafka poller on new wdqs-internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/419181 (owner: 10Gehel) [13:42:14] (03PS2) 10Ottomata: Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) [13:43:20] (03CR) 10MarcoAurelio: Initial configuration for euwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm) [13:43:22] (03PS1) 10Alexandros Kosiaris: kubernetes: Ignore WATCHLIST latencies as well [puppet] - 10https://gerrit.wikimedia.org/r/419182 [13:44:04] (03CR) 10Elukey: [C: 031] Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [13:44:34] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Ignore WATCHLIST latencies as well [puppet] - 10https://gerrit.wikimedia.org/r/419182 (owner: 10Alexandros Kosiaris) [13:49:41] (03Abandoned) 10Elukey: profile::analytics::refinery::job::sqoop_mediawiki: add stdout redirect to crons [puppet] - 10https://gerrit.wikimedia.org/r/415849 (owner: 10Elukey) [13:51:58] 10Operations, 10cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#4046607 (10chasemp) [13:52:06] (03PS1) 10Muehlenhoff: Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - 10https://gerrit.wikimedia.org/r/419183 [13:52:34] (03PS2) 10Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - 10https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470) [13:52:36] (03CR) 10Vgutierrez: [C: 031] Temporarily remove chromium from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419159 (owner: 10Muehlenhoff) [13:52:38] (03CR) 10jerkins-bot: [V: 04-1] Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - 
10https://gerrit.wikimedia.org/r/419183 (owner: 10Muehlenhoff) [13:53:06] (03PS2) 10Elukey: profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - 10https://gerrit.wikimedia.org/r/419175 [13:54:10] (03CR) 10Elukey: [C: 032] profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - 10https://gerrit.wikimedia.org/r/419175 (owner: 10Elukey) [13:56:06] hashar: CI tests are failing due to disk space depletion: "mv: cannot create regular file ‘/srv/workspace/log/admin-0.log’: No space left on device", see e.g. https://gerrit.wikimedia.org/r/419183 [13:57:02] agrer [13:57:05] zeljkof: ^^ [13:57:26] hashar: yes, that problem :) [13:57:33] stupid jobs [13:58:12] (03PS3) 10Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - 10https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470) [13:59:21] zeljkof: I have just deleted a bunch of build workspaces [13:59:25] solved :) [13:59:43] hashar: thanks! [14:01:10] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/419183 (owner: 10Muehlenhoff) [14:01:25] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Set up external DNS record for wikitech-static - https://phabricator.wikimedia.org/T164290#3228802 (10chasemp) > Simple solution: all opsen and devs who would need wikitech-static should put in a commented line in the... [14:02:21] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Investigate alternative RAID strategies for labstore1001/2 - https://phabricator.wikimedia.org/T162090#4046631 (10chasemp) 05Open>03Invalid These are going to be decommissioned just as soon as we get labstore1008/1009 online. 
[14:02:24] 10Operations, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#4046633 (10chasemp) [14:02:26] (03PS2) 10Muehlenhoff: Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - 10https://gerrit.wikimedia.org/r/419183 [14:04:14] RECOVERY - haproxy failover on dbproxy1001 is OK: OK check_failover servers up 2 down 0 [14:04:24] (03CR) 10Muehlenhoff: [C: 032] Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - 10https://gerrit.wikimedia.org/r/419183 (owner: 10Muehlenhoff) [14:04:45] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#4046635 (10chasemp) [14:04:54] RECOVERY - haproxy failover on dbproxy1006 is OK: OK check_failover servers up 2 down 0 [14:11:41] (03PS2) 10Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) [14:11:51] !log add chico to wmf-nda (verified nda things with moritz and all the goodness) [14:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:18] (03CR) 10jerkins-bot: [V: 04-1] Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) (owner: 10Alexandros Kosiaris) [14:12:32] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4046639 (10Pchelolo) > Add a second LVS IP, to be served from the same cluster, to use for videoscaling. This will guarantee we evenly distrib... 
[14:14:18] (03PS2) 10Ottomata: Use roundrobin partition.assignment.strategy for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/418934 (https://phabricator.wikimedia.org/T189464) [14:18:14] (03PS2) 10Muehlenhoff: Temporarily remove chromium from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419159 [14:20:00] (03CR) 10Ottomata: [C: 032] Use roundrobin partition.assignment.strategy for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/418934 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:23:17] (03CR) 10Muehlenhoff: [C: 032] Temporarily remove chromium from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419159 (owner: 10Muehlenhoff) [14:23:23] (03PS3) 10Muehlenhoff: Temporarily remove chromium from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419159 [14:24:46] (03PS3) 10Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) [14:25:07] (03PS2) 10Elukey: profile::hadoop::worker: use require instead of include [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) [14:27:15] (03PS4) 10Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) [14:28:23] (03CR) 10Elukey: [C: 032] profile::hadoop::worker: use require instead of include [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:28:25] (03CR) 10MarcoAurelio: "logstash-beta is no longer public; let's just stop collecting IPs for abusefilter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [14:28:46] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 1829 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:28:56] RECOVERY - Request latencies on acrux is OK: OK - apiserver_request_latencies is 1698 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:32:36] zeljkof: do you remember why https://gerrit.wikimedia.org/r/#/c/417189/ didn't go through? [14:37:06] (03PS1) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1074 [puppet] - 10https://gerrit.wikimedia.org/r/419188 (https://phabricator.wikimedia.org/T188294) [14:37:16] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.088 second response time [14:37:53] Hauskatze: we had some problems with deployment yesterday, there was time for only one patch [14:37:56] (03CR) 10Elukey: [C: 032] Assign role::analytics_cluster::hadoop::worker to analytics1074 [puppet] - 10https://gerrit.wikimedia.org/r/419188 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [14:39:11] <_joe_> uhm what's up with wikidata? [14:39:30] <_joe_> Amir1: any idea? should I look?
[14:39:49] _joe_: what's up [14:40:02] <_joe_> wikidata.org dispatch lag is higher than 300s [14:40:07] let me check [14:40:58] <_joe_> lag according to scripts running on terbium is ~ 400 seconds on some wikis [14:41:10] _joe_: dispatching seems fine: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1 [14:42:02] !log rebooting chromium for kernel security update [14:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:01] <_joe_> yeah I'm looking at numbers on terbium [14:43:30] _joe_: I keep monitoring it and if it started to get really bad I do something [14:43:46] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 [14:43:50] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 [14:45:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 (owner: 10Marostegui) [14:47:18] looks like size change on wikidata-edits went up at the same time as dispatch lag started growing https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1&panelId=3&fullscreen&from=1520909213004&to=1520952413005 [14:47:25] (03PS1) 10Muehlenhoff: Revert "Temporarily remove chromium from LVS name servers in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/419191 [14:47:29] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 (owner: 10Marostegui) [14:47:34] (03PS4) 10Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - 10https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470) [14:48:49] zeljkof hasharAway : https://integration.wikimedia.org/ci/job/operations-mw-config-php55lint/19880/console -> fatal: write error: No space left on device [14:49:08] marostegui: I 
think hasharAway fixed it [14:49:16] It just happened [14:49:20] Like a minute ago XD [14:49:24] should I recheck? [14:49:48] marostegui: hm, probably :) [14:49:53] it might be broken again [14:49:56] (03CR) 10Marostegui: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 (owner: 10Marostegui) [14:49:59] let's see! [14:50:03] > 14:59 zeljkof: I have just deleted a bunch of build workspaces [14:50:19] it was 50 minutes ago, if it happened again, it might be broken again :( [14:51:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 (owner: 10Marostegui) [14:51:51] !log stopping db2044 (this will make proxies complain about redundancy) [14:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:58] zeljkof: looks like it worked this time [14:52:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419190 (owner: 10Marostegui) [14:52:09] marostegui: great! 
[14:52:16] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.086 second response time [14:52:21] (03CR) 10Rush: openstack: rabbit codify nova user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [14:53:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 after alter table (duration: 00m 57s) [14:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:11] (03PS1) 10Volans: Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) [14:54:23] (03CR) 10jerkins-bot: [V: 04-1] Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [14:54:28] zeljkof marostegui there's a task for this [14:54:37] https://phabricator.wikimedia.org/T189587 [14:54:42] (03PS7) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [14:54:43] paladox: thanks! 
[14:54:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) [14:54:51] you're welcome :) [14:54:51] (03PS5) 10Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) [14:54:53] hasharAway: integration-slave-jessie-1001 error: unable to create temporary file: No space left on device [14:55:05] volans https://phabricator.wikimedia.org/T189587 [14:55:06] (03CR) 10jerkins-bot: [V: 04-1] cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) (owner: 10ArielGlenn) [14:55:14] !log upgrading codfw LVSs to pybal 1.15.2 [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:22] paladox: thx [14:55:22] volans: I just suffered that too, so…zeljkof it is not completely fixed indeed [14:55:29] paladox, volans: uh oh, looks like we will have to wait for hasharAway to get back [14:55:35] yep [14:55:36] I can have a look [14:56:22] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:56:26] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:56:49] 12G jenkins-workspace; 7.3G pbuilder [14:56:56] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:59:42] (03PS6) 10Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) [15:00:58] (03CR) 10Volans: "The VMs will be created in row C in eqiad and row B in codfw."
[dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [15:02:00] (03PS1) 10Rush: neutron dummies for rabbit and db [labs/private] - 10https://gerrit.wikimedia.org/r/419196 [15:02:53] (03PS8) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [15:03:01] (03CR) 10Muehlenhoff: [C: 032] Revert "Temporarily remove chromium from LVS name servers in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/419191 (owner: 10Muehlenhoff) [15:03:23] (03PS2) 10Volans: Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) [15:03:30] (03CR) 10jerkins-bot: [V: 04-1] Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [15:03:53] it's recovering [15:15:04] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:15:31] (03PS1) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [15:16:01] (03PS7) 10Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) [15:16:11] (03CR) 10jerkins-bot: [V: 04-1] openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [15:20:48] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10432/ says PCC is happy and a quick overview of the change catalog looks fine, merging" [puppet] - 
10https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) (owner: 10Alexandros Kosiaris) [15:21:28] volans: any joy resurrecting that jenkins worker? [15:21:49] godog: I just took a quick look, we have a lot of used space in the jenkins workspace for the android app stuff [15:21:57] and in pbuilder [15:22:03] but I'm unsure what is safe to delete [15:22:54] volans i think normally we do rm -rf * in the workspace [15:23:15] volans: ah :( [15:23:46] (03PS1) 10Alexandros Kosiaris: Force string for file mode in yaml [puppet] - 10https://gerrit.wikimedia.org/r/419201 [15:24:20] (03CR) 10Alexandros Kosiaris: [C: 032] Force string for file mode in yaml [puppet] - 10https://gerrit.wikimedia.org/r/419201 (owner: 10Alexandros Kosiaris) [15:24:24] (03CR) 10Filippo Giunchedi: [C: 031] Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [15:24:46] is there any wikitech doc about it? [15:25:46] not that I could find, e.g.
https://wikitech.wikimedia.org/wiki/Jenkins [15:26:35] though https://phabricator.wikimedia.org/T126176 is similar [15:27:56] the full partition is /srv, but yeah, same content apparently [15:30:08] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add initialize_namespace.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/419139 (owner: 10Alexandros Kosiaris) [15:32:44] !log upgrade and restart dbproxy1001 [15:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:46] !log upgrading eqiad LVSs to pybal 1.15.2 [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:03] (03PS1) 10Alexandros Kosiaris: Add tiller/deploy RBAC clusterroles [deployment-charts] - 10https://gerrit.wikimedia.org/r/419203 [15:37:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add tiller/deploy RBAC clusterroles [deployment-charts] - 10https://gerrit.wikimedia.org/r/419203 (owner: 10Alexandros Kosiaris) [15:39:19] !log upgrade and restart dbproxy1007 [15:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:26] (03PS1) 10Alexandros Kosiaris: Add apiVersion attribute to deploy ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/419206 [15:40:27] (03CR) 10Marostegui: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [15:43:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [15:44:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [15:46:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 for alter table (duration: 00m 56s) [15:46:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:09] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [15:50:51] !log Deploy schema change on db1097:3314 - T187089 T185128 T153182 [15:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [15:50:59] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [15:50:59] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [15:55:51] (03PS3) 10Chico Venancio: Add Chicocvenancio's key for Cloud Services [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) [15:56:00] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add apiVersion attribute to deploy ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/419206 (owner: 10Alexandros Kosiaris) [15:58:09] (03PS1) 10Odder: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) [16:00:04] godog, moritzm, and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:15] (03PS6) 10Vgutierrez: pybal: Prometheus based icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [16:02:41] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151#3682454 (10Vgutierrez) Pybal 1.15.2 has been successfully deployed in our LVSs, the UDP monitor is now available. [16:05:29] 10Operations, 10Puppet: puppetdb4: systemd config review - https://phabricator.wikimedia.org/T187257#3969415 (10fgiunchedi) We'd still need the oom settings to help debugging oom cases we've seen on nitrogen for example. Passing a directory instead of a file to `-XX:HeapDumpPath` will create dump files with pi... [16:11:07] (03PS1) 10Jcrespo: dbproxy: switchover m1 and m2 master reference [dns] - 10https://gerrit.wikimedia.org/r/419216 (https://phabricator.wikimedia.org/T183469) [16:13:51] (03CR) 10Marostegui: [C: 031] dbproxy: switchover m1 and m2 master reference [dns] - 10https://gerrit.wikimedia.org/r/419216 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [16:14:21] (03CR) 10Rush: [V: 032 C: 032] neutron dummies for rabbit and db [labs/private] - 10https://gerrit.wikimedia.org/r/419196 (owner: 10Rush) [16:15:48] (03CR) 10Jcrespo: [C: 032] dbproxy: switchover m1 and m2 master reference [dns] - 10https://gerrit.wikimedia.org/r/419216 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [16:16:15] How do I log into https:://logstash-beta.wmflabs.org ? My wikitech creds don’t work, maybe I’ve forgotten how this works? 
[16:18:23] !log update CNAME for m1-master and m2-master [16:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:44] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4047089 (10Papaul) a:05Papaul>03faidon Done [16:21:09] hmm https://gerrit.wikimedia.org/r/ is not loading for me [16:21:32] no_justification ^^ [16:21:48] I may have broken it [16:21:52] let me revert [16:22:19] it works for me [16:22:21] works now :) [16:22:25] mmm [16:22:28] should I wait? [16:22:31] yeah [16:22:33] don't revert [16:22:43] let me prepare the revert at least [16:22:46] it works for me so far [16:23:08] it seems stalled now [16:23:15] if we don't have gerrit, we will not be able to revert [16:23:23] right [16:23:27] yeh seems stalled now [16:23:36] stuck on "Working ..." [16:24:01] let me kick the process [16:24:19] yeah that might be it [16:24:38] hmmm [16:24:45] It's working for me, just hella slow [16:25:07] no_justification: which host is gerrit running? [16:25:15] cobalt [16:25:15] cobalt, signing in now [16:25:22] <_joe_> gerrit1001, no? [16:25:35] <_joe_> oh no sorry, cobalt in eqiad [16:26:10] Bunch of mysql packet failures [16:26:11] !log restarting gerring on cobalt, stalled [16:26:15] Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure [16:26:15] The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server. [16:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:23] no_justification: yeah, we were changing some proxy related stuff [16:26:37] <_joe_> clearly gerrit doesn't like that [16:26:52] it was a cname change only [16:26:56] it is restarting [16:27:50] gerrit wont start if it cannot connect to the db [16:28:02] it did [16:28:27] it restart, but it didn't come back [16:28:38] jynus: FW rules on the proxies maybe? 
[16:28:40] let me check [16:28:48] jynus that's the systemd service. Gerrit will try to start but if it cannot it will eventually fail [16:28:53] like gerrit2001 [16:29:25] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [16:29:35] yeah, firewall then [16:29:35] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused [16:29:48] let's clear all firewall rules, then [16:30:00] <_joe_> we can add a rule for the specific address if you want to [16:30:14] <_joe_> but yeah clearing them is faster [16:30:22] <_joe_> remember to disable puppet [16:30:27] gerrit is not on 10.x [16:31:02] <_joe_> we can also go back to the old dns record if needed [16:31:05] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [16:31:38] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4047131 (10madhuvishy) [16:31:39] I have cleared rules in dbproxy1007 [16:31:46] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [16:32:07] jynus: can you restart gerrit again? 
[16:32:14] !log restarting gerring on cobalt, stalled [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:19] <_joe_> I can connect from cobalt to dbproxy1007 now [16:32:25] \o/ [16:32:35] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer] [16:32:36] but gerrit doesn't work [16:32:46] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Service[eventlogging/init] [16:32:58] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4047140 (10Papaul) @robh thanks [16:33:01] !log Retroactive: cleared iptables rules on dbproxy1007 [16:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:12] it does now? [16:33:24] works yes [16:33:26] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [16:33:32] quite slow, but loads for me [16:33:36] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.14.6-7-g55dde9d68b (SSHD-CORE-1.4.0) (protocol 2.0) [16:33:41] <_joe_> it's coming back [16:34:10] fully works for me now [16:34:17] <_joe_> let the jvm heat the glow plugs people [16:34:25] but if gerrit wasn't on a 10.x and that was down [16:34:30] maybe otrs was down, too [16:34:35] or other services [16:34:46] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer] [16:34:55] <_joe_> jynus: while you add the firewall rule, I can check [16:35:07] can you check otrs? [16:35:28] otrs login mainpage works for me [16:35:53] yeah, but probably it needs a more complexy thinkg to access the db [16:35:55] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4047147 (10madhuvishy) [16:35:58] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4047145 (10madhuvishy) 05Resolved>03Open @Kolossos I see utilization has climbed up again to over 600G. How can we ensure we don't have to keep makin... [16:36:13] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4047149 (10madhuvishy) p:05High>03Normal [16:36:15] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4047148 (10Papaul) @Joe ok . For now I have 5 new servers in A4 and 7 new servers in B3. so moving all the new server in A3 to B3, B3 will have a total of 12 new server... [16:36:35] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [16:36:37] I think the wise think is to revert the change [16:36:48] and reanalize later [16:37:06] <_joe_> why? 
we're not in a failure state right now [16:37:15] I wouldn't revert either no [16:37:18] but services with a pool of connections [16:37:22] Oh the Thinks You Can Think [16:37:23] <_joe_> and you can actually see the rejected packets on dbproxy1007 [16:37:37] may fail with some delay [16:37:46] or [16:37:57] ( https://en.wikipedia.org/wiki/Oh,_the_Thinks_You_Can_Think! ) [16:38:14] we don't revert, but clear the iptables of dbproxy1001 [16:38:25] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Service[eventlogging/init] [16:38:33] <_joe_> no errors from otrs [16:38:52] <_joe_> AFAICT [16:39:15] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#4047174 (10Bstorm) [16:40:14] jynus: do you want me to take care of dbproxy1001? [16:40:18] _joe_: the problem is that it happened not just for otrs [16:40:52] it was 2 proxies we changed the alias to, serving all misc dbs except phabricator, eventlogging and cloud [16:41:11] and I assumed all those services were on 10.x networks [16:41:38] etherpad is on dbproxy1001 no? [16:41:44] well, you know what I mean XD [16:42:08] yeah, let's clear dbproxy1001 firewall [16:42:21] I will do that [16:42:23] and add it slowly instead of the other way round [16:42:48] for the record: etherpad and librenms work now [16:42:49] <_joe_> AFAICT, we should just allow DOMAIN_NETWORKS as a srange [16:42:58] going to clear the rules anyways [16:42:58] <_joe_> instead of INTERNAL_NETWORKS [16:43:02] <_joe_> wait [16:43:07] waiting [16:43:34] _joe_: the proxy was a TODO [16:43:44] the firewall for the proxy [16:43:48] but I forgot about it [16:44:15] <_joe_> so what's the puppet code that creates the proxy rules on dbproxy1001/7?
[16:44:32] dbproxy1006 (the old one) has no rules, that is why I wanted to leave dbproxy1001 the same as 1006, just to make sure everything works as it used to before the cname change [16:45:24] <_joe_> marostegui: ok go on [16:45:28] k [16:45:34] <_joe_> profile::mariadb::proxy::firewall in hiera is what we have to change [16:45:59] <_joe_> well, no, the def within there [16:45:59] !log Clean iptables rules on dbproxy1001 to leave it as dbproxy1006 [16:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:23] so there is a hiera rule to apply a ruleset or not [16:46:28] dbproxy1001 is clean now [16:46:29] <_joe_> it's more complex than I thought it would be. [16:46:37] <_joe_> profile::mariadb::ferm defaults to srange => '$INTERNAL', [16:46:42] <_joe_> without overrides [16:46:53] yeah, we just have to create a new ruleset [16:46:56] <_joe_> so we need to change that, and make the range configurable [16:47:00] that is the one we use for core dbs [16:47:09] <_joe_> or to add a simple ferm::service [16:48:26] guys you mind if I logoff? I need to take care of some stuff and I think stuff is under control now [16:48:35] yes, sorry [16:48:57] <_joe_> marostegui: did you disable puppet on dbproxy1001? 
[16:49:00] I will just add a new variable other than cloud and internal [16:49:04] _joe_: nope [16:49:09] ferm should only create the rules on start [16:49:13] i can do it now [16:49:22] it alerts because of that [16:49:23] <_joe_> marostegui: I'm doing it [16:49:25] but I don't think puppet will add the rules back [16:49:42] <_joe_> no, it should not, but better to control when it gets executed [16:49:47] don't worry, _joe_ marostegui I will take care of this [16:49:49] i ran puppet to check, and they were not added [16:50:07] I just need to check the right rules for m1 and m2 services [16:50:24] <_joe_> ok, I'm happy to review the changes [16:51:31] the quick change is to set the proxy firewall as = disabled until I fine-tune [16:51:48] <_joe_> I would go that way tbh [16:51:53] +1 [16:51:54] <_joe_> and have time to do things properly [16:52:11] oh, I will, just to make sure puppet and current state match [16:52:44] or basically, previous state [16:52:50] yeah, exactly [16:53:03] Have to go guys, ring me if needed!
[16:53:14] then we can test the changes on the passive host to not create more outages [16:54:49] (03CR) 10BryanDavis: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis) [16:57:46] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:58:26] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:58:35] (03PS1) 10Jcrespo: dbproxy: Disable temporarily firewall on the active proxy for m1 & m2 [puppet] - 10https://gerrit.wikimedia.org/r/419221 [16:58:58] ^this is the bad fix, I will now work on the proper one [16:59:20] (03CR) 10Marostegui: [C: 031] dbproxy: Disable temporarily firewall on the active proxy for m1 & m2 [puppet] - 10https://gerrit.wikimedia.org/r/419221 (owner: 10Jcrespo) [16:59:24] (03CR) 10Jcrespo: [C: 032] dbproxy: Disable temporarily firewall on the active proxy for m1 & m2 [puppet] - 10https://gerrit.wikimedia.org/r/419221 (owner: 10Jcrespo) [16:59:25] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:00:18] (03PS1) 10Andrew Bogott: shorten ttl for horizon.wm.o and toolsadmin.wm.o [dns] - 10https://gerrit.wikimedia.org/r/419222 (https://phabricator.wikimedia.org/T168470) [17:01:05] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:01:17] (03Abandoned) 10Andrew Bogott: shorten ttl for horizon.wm.o and toolsadmin.wm.o [dns] - 10https://gerrit.wikimedia.org/r/419222 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:01:36] (03PS1) 10Awight: Fix new ORES threshold syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) [17:01:48] no parsoid deploy today [17:01:55] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:22] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4047252 (10faidon) So post-mortem, I think there are 4 different things here: - T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis) - T189522: Detect I... 
[17:02:35] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:03:25] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:46] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:05:59] (03PS1) 10Andrew Bogott: Remove misc-web config for 'newwikitech' [puppet] - 10https://gerrit.wikimedia.org/r/419224 (https://phabricator.wikimedia.org/T168470) [17:06:06] (03PS1) 10Andrew Bogott: Rename newhorizon and newtoolsadmin to horizon and toolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/419225 (https://phabricator.wikimedia.org/T168470) [17:06:07] (03PS1) 10Andrew Bogott: Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) [17:06:35] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:06:36] (03PS2) 10Andrew Bogott: Remove misc-web config for 'newwikitech' [puppet] - 10https://gerrit.wikimedia.org/r/419224 (https://phabricator.wikimedia.org/T168470) [17:06:37] !log cleanup integration-slave-jessie-1001:/srv/pbuilder/build - T189587 [17:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:43] T189587: integration-slave-jessie-1001 out of disk space - https://phabricator.wikimedia.org/T189587 [17:07:06] (03CR) 10Andrew Bogott: [C: 032] Remove misc-web config for 'newwikitech' [puppet] - 10https://gerrit.wikimedia.org/r/419224 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:07:30] (03CR) 10Filippo Giunchedi: [C: 031] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:08:12] (03PS1) 10BBlack: varnish: do not gzip empty/small responses [puppet] - 
10https://gerrit.wikimedia.org/r/419228 [17:08:57] (03PS1) 10Awight: Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) [17:10:45] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [17:11:05] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [17:11:33] (03CR) 10Andrew Bogott: [C: 04-1] "Scheduled for tomorrow (Wednesday) morning." [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:12:48] (03CR) 10Awight: [C: 032] Fix new ORES threshold syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [17:12:53] (03CR) 10Awight: [C: 032] Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [17:14:01] (03Merged) 10jenkins-bot: Fix new ORES threshold syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [17:14:24] (03Merged) 10jenkins-bot: Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [17:14:35] ok, we are now in a far from idea, but stable state [17:14:46] 10Operations, 10DC-Ops, 10monitoring, 10User-fgiunchedi: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4047303 (10fgiunchedi) [17:16:57] (03PS2) 10BBlack: varnish: do not gzip empty/small responses [puppet] - 10https://gerrit.wikimedia.org/r/419228 [17:17:22] the funny thing is by pure chance, I think we only need a rule for gerrit, it is the only active service with a different range [17:18:09] (03PS3) 10BBlack: varnish: do not gzip 
empty/small responses [puppet] - 10https://gerrit.wikimedia.org/r/419228 [17:18:49] jynus will that also fix gerrit2001 too? [17:18:57] (03CR) 10jenkins-bot: Fix new ORES threshold syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [17:19:11] (03PS4) 10BBlack: varnish: do not gzip empty/small responses [puppet] - 10https://gerrit.wikimedia.org/r/419228 [17:19:40] !log awight@tin Started scap: Beta: Fix ORES thresholds and enable JADE, T181159, T176333 [17:19:45] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) [17:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:47] T181159: Migrate ORES extension threshold config from old to new syntax - https://phabricator.wikimedia.org/T181159 [17:19:47] T176333: [Blocked] Deploy JADE prototype in Beta Cluster - https://phabricator.wikimedia.org/T176333 [17:19:56] paladox: gerrit2001 is waiting hardware provisioning [17:20:12] jynus hardware provisioning? 
[17:20:16] so it will not fix it, but it should provent from the same thing happening [17:20:28] paladox: we are missing misc db proxies there [17:20:34] oh [17:20:45] and to be fair, proper database setup [17:20:53] (03CR) 10BBlack: [C: 032] varnish: do not gzip empty/small responses [puppet] - 10https://gerrit.wikimedia.org/r/419228 (owner: 10BBlack) [17:20:53] so we need to setup more servers for that [17:21:02] (03PS5) 10BBlack: varnish: do not gzip empty/small responses [puppet] - 10https://gerrit.wikimedia.org/r/419228 [17:21:13] we could have rush it, but it would have been setup poorly [17:21:30] ok [17:21:30] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:21:32] thanks [17:25:52] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) [17:26:00] (03PS5) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) [17:27:50] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4047356 (10Gehel) a:03Gehel [17:28:12] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988#4047358 (10Gehel) a:03Gehel [17:29:02] (03PS2) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [17:29:44] (03CR) 10jerkins-bot: [V: 04-1] openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 
(https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:31:22] (03PS3) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [17:34:21] (03CR) 10Rush: "labtestneutron2001.codfw.wmnet,labtestneutron2002.codfw.wmnet,labtestcontrol2003.codfw.wmnet,labtestvirt2003.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:37:05] thanks godog for the review and recheck [17:37:53] volans: np! was easy enough to fix, sadly not automatic yet [17:38:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4047379 (10Gehel) The decision is to not replace this out of warranty RAM. We'll run with 3% less capacity until this batch of servers is renewed (in ~ 1year). [17:38:32] indeed [17:42:41] (03PS1) 10Jcrespo: dblist: Update db1051 and db1063 location [software] - 10https://gerrit.wikimedia.org/r/419235 [17:43:08] (03PS1) 10Arturo Borrero Gonzalez: icinga: update wikitech-static check contacts [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) [17:43:53] (03CR) 10Jcrespo: [V: 032 C: 032] "I am merging now to not delay it unnecesarily, but please review with further commits if you see issues." 
[software] - 10https://gerrit.wikimedia.org/r/419235 (owner: 10Jcrespo) [17:49:25] (03PS1) 10Arturo Borrero Gonzalez: shinken: labs: delete wikitech-static check [puppet] - 10https://gerrit.wikimedia.org/r/419237 (https://phabricator.wikimedia.org/T189584) [17:51:00] (03PS4) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [17:51:41] (03CR) 10jerkins-bot: [V: 04-1] openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:52:36] (03CR) 10Rush: icinga: update wikitech-static check contacts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [17:53:30] (03CR) 10Rush: shinken: labs: delete wikitech-static check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419237 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [17:54:34] (03PS5) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [17:55:07] !log installing ncurses updates from stretch point release [17:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:09] (03PS3) 10Volans: Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) [17:57:49] (03PS1) 10Muehlenhoff: Add library hint for ncurses [puppet] - 10https://gerrit.wikimedia.org/r/419241 [17:58:08] (03PS2) 10Muehlenhoff: Add library hint for ncurses [puppet] - 10https://gerrit.wikimedia.org/r/419241 [17:58:42] (03CR) 10Volans: [C: 032] Add entries for ganeti instances for Puppetboard [dns] - 10https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:59:45] (03CR) 
10Muehlenhoff: [C: 032] Add library hint for ncurses [puppet] - 10https://gerrit.wikimedia.org/r/419241 (owner: 10Muehlenhoff) [17:59:47] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/10434/" [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:04:40] (03PS1) 10Awight: Add JADE to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419243 [18:06:00] (03CR) 10Awight: [C: 032] Add JADE to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419243 (owner: 10Awight) [18:07:18] (03Merged) 10jenkins-bot: Add JADE to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419243 (owner: 10Awight) [18:09:05] (03PS1) 10DCausse: Add extra-analysis [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/419244 (https://phabricator.wikimedia.org/T189239) [18:14:11] (03CR) 10Gehel: [V: 032 C: 032] "LGTM, checked locally according to procedure in README" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/419244 (https://phabricator.wikimedia.org/T189239) (owner: 10DCausse) [18:14:30] (03PS8) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [18:17:25] (03CR) 10Rush: [C: 032] openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [18:18:23] (03CR) 10Rush: "Arturo is on clinic so hopefully can roll this out" [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: 10Chico Venancio) [18:21:50] (03PS2) 10Arturo Borrero Gonzalez: icinga: refresh wikitech-static monitoring and alerting [puppet] - 
10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) [18:21:52] (03Abandoned) 10Arturo Borrero Gonzalez: shinken: labs: delete wikitech-static check [puppet] - 10https://gerrit.wikimedia.org/r/419237 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [18:22:00] (03PS1) 10Rush: rabbitmq: handle dynamic resource names for deduping [puppet] - 10https://gerrit.wikimedia.org/r/419245 [18:22:23] (03CR) 10jerkins-bot: [V: 04-1] icinga: refresh wikitech-static monitoring and alerting [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [18:22:36] (03CR) 10Rush: [C: 032] rabbitmq: handle dynamic resource names for deduping [puppet] - 10https://gerrit.wikimedia.org/r/419245 (owner: 10Rush) [18:26:09] (03PS1) 10Odder: Update logo for the Maithili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) [18:26:20] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4047526 (10Papaul) Server would not power on - Draining power It looks like another dead main board. I will contact HP and see what they say. 
[18:29:14] (03PS6) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [18:29:19] (03PS3) 10Rush: icinga: refresh wikitech-static monitoring and alerting [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [18:30:00] (03CR) 10jerkins-bot: [V: 04-1] icinga: refresh wikitech-static monitoring and alerting [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [18:30:08] (03PS4) 10Arturo Borrero Gonzalez: icinga: refresh wikitech-static monitoring and alerting [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) [18:30:35] (03CR) 10Rush: "labtestneutron2001.codfw.wmnet,labtestneutron2002.codfw.wmnet,labtestcontrol2003.wikimedia.org,labtestvirt2003.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [18:32:10] (03CR) 10Rush: [C: 031] "cool, fyi the server this lands on for icinga is einsteinium and I usually reach out to make sure the eventual icinga config is valid :)" [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [18:32:25] !log installing w3m updates from stretch point release [18:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:28] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/10436/" [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [18:34:57] (03PS1) 10Volans: DHCP: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419247 (https://phabricator.wikimedia.org/T184563) [18:34:59] (03PS1) 10Volans: netboot: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419248 
(https://phabricator.wikimedia.org/T184563) [18:35:24] (03CR) 10Arturo Borrero Gonzalez: [C: 032] icinga: refresh wikitech-static monitoring and alerting [puppet] - 10https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: 10Arturo Borrero Gonzalez) [18:37:20] !log installing reportbug updates from stretch point release [18:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:45] (03CR) 10Volans: [C: 032] DHCP: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419247 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [18:40:00] (03PS2) 10Volans: DHCP: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419247 (https://phabricator.wikimedia.org/T184563) [18:41:00] (03CR) 10Rush: [C: 032] openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [18:41:11] (03CR) 10Dzahn: base/icinga: add Hiera override to skip systemd monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [18:41:13] (03PS7) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [18:42:12] (03PS8) 10Rush: openstack: labtestn initial neutron framework [puppet] - 10https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) [18:42:17] !log repool wdqs1004 & wdqs2001 now that data reload is completed T189548 [18:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:23] T189548: reload data on wdqs1004 - https://phabricator.wikimedia.org/T189548 [18:42:33] (03PS1) 10Ottomata: Update wheels for Debian Stretch [wheels/paws-internal] - 10https://gerrit.wikimedia.org/r/419251 (https://phabricator.wikimedia.org/T183145) [18:42:37] (03PS1) 
10Muehlenhoff: Temporarily remove hydrogen from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419252 [18:42:47] chasemp: sorry, "rush" hour for merging :-P [18:42:47] (03CR) 10Ottomata: [V: 032 C: 032] Update wheels for Debian Stretch [wheels/paws-internal] - 10https://gerrit.wikimedia.org/r/419251 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [18:43:01] volans: rebase wars! we deserver our own history channel show [18:43:20] :) [18:43:53] I have another one but can wait few minutes :D [18:45:22] volans: all you now dude, I'm off and running [18:45:41] (03PS1) 10Ottomata: wheel frozen-requirements should refer to version [wheels/paws-internal] - 10https://gerrit.wikimedia.org/r/419253 [18:45:51] (03CR) 10Ottomata: [V: 032 C: 032] wheel frozen-requirements should refer to version [wheels/paws-internal] - 10https://gerrit.wikimedia.org/r/419253 (owner: 10Ottomata) [18:46:03] lol, ack thanks [18:46:12] (03PS1) 10Chad: group0 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419254 [18:46:46] (03PS1) 10Dzahn: restbase: allow to skip monitoring, disable on dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) [18:46:47] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [18:46:55] !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "awight"; reason is "Beta: Fix ORES thresholds and enable JADE, T181159, T176333" (duration: 00m 00s) [18:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:01] T181159: Migrate ORES extension threshold config from old to new syntax - https://phabricator.wikimedia.org/T181159 [18:47:01] T176333: Deploy JADE prototype in Beta Cluster - https://phabricator.wikimedia.org/T176333 [18:47:08] joins the merge wars [18:47:23] (03CR) 10jerkins-bot: [V: 04-1] restbase: allow to skip 
monitoring, disable on dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn) [18:47:38] !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "awight"; reason is "Beta: Fix ORES thresholds and enable JADE, T181159, T176333" (duration: 00m 00s) [18:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:06] awight|lunch: You have the scap lock...? [18:50:50] (03PS1) 10RobH: decom db2030 [puppet] - 10https://gerrit.wikimedia.org/r/419256 (https://phabricator.wikimedia.org/T187768) [18:50:54] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047594 (10RobH) [18:51:25] (03PS2) 10Volans: netboot: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563) [18:51:45] (03CR) 10RobH: [C: 032] decom db2030 [puppet] - 10https://gerrit.wikimedia.org/r/419256 (https://phabricator.wikimedia.org/T187768) (owner: 10RobH) [18:51:47] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:52:23] ottomata: is it you? 
^^^ [18:53:06] (03PS3) 10Volans: netboot: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563) [18:54:04] (03PS2) 10Dzahn: restbase: allow to skip monitoring, disable on dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) [18:54:06] (03CR) 10Volans: [C: 032] netboot: add entries for puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [18:54:27] (03PS1) 10RobH: decom db2030 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/419257 (https://phabricator.wikimedia.org/T187768) [18:54:44] (03CR) 10jerkins-bot: [V: 04-1] restbase: allow to skip monitoring, disable on dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn) [18:54:56] (03CR) 10RobH: [C: 032] decom db2030 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/419257 (https://phabricator.wikimedia.org/T187768) (owner: 10RobH) [18:55:15] awight|lunch: Can you please rm /var/lock/scap.operations_mediawiki-config.lock from tin? [18:55:21] (you seem to have a stuck lock...) [18:57:58] (03PS1) 10Rush: openstack: labtestmetal partmon raid1 recipe [puppet] - 10https://gerrit.wikimedia.org/r/419258 (https://phabricator.wikimedia.org/T188266) [18:58:59] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047627 (10RobH) [18:59:15] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3984964 (10RobH) a:05RobH>03Papaul @papaul: ready for onsite disk wipe [19:00:04] no_justification: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:01:10] Bleh. Or can I get a root to nuke this file from tin? [19:01:17] /var/lock/scap.operations_mediawiki-config.lock [19:01:54] omg sorry [19:01:55] nuking [19:01:55] There's an awight! [19:01:56] :) [19:02:21] done [19:02:39] !log demon@tin Started scap: bootstrap wmf.25 [19:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:23] 10Operations, 10Traffic, 10Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4047636 (10BBlack) So, recapping this ticket that's been stale for quite a while: * We've had past applayer bugs with gzipped outputs in e... [19:05:33] (03PS8) 10Herron: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [19:06:01] (03PS1) 10Ottomata: Update jupyterhub to 0.8.1 to work with newer singleuserauthenticator [wheels/paws-internal] - 10https://gerrit.wikimedia.org/r/419260 (https://phabricator.wikimedia.org/T183145) [19:06:13] (03CR) 10jerkins-bot: [V: 04-1] puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [19:06:24] (03CR) 10Ottomata: [V: 032 C: 032] Update jupyterhub to 0.8.1 to work with newer singleuserauthenticator [wheels/paws-internal] - 10https://gerrit.wikimedia.org/r/419260 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [19:10:59] no_justification: ok [19:11:07] No [19:11:08] It's already done [19:11:13] awight handled it :) [19:11:16] just saw,'k [19:11:39] Dig a hole, fill it in again! [19:15:22] or cover it with a peice of paper with a drawing of grass on it and hope nobody notices! :) [19:23:30] 10Operations, 10Traffic, 10Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#4047699 (10BBlack) 05Open>03Resolved [19:26:04] bblack: yes! 
Also good for catching neighbors to have for dinner [19:26:48] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047703 (10Papaul) @RobH thanks [19:52:34] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4047768 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request... [19:58:50] PROBLEM - HHVM rendering on mw2187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:49] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 74345 bytes in 1.289 second response time [20:01:06] (03PS1) 10Gehel: wdqs: collect prometheus metrics for both wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/419264 (https://phabricator.wikimedia.org/T187766) [20:03:30] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 66.70, 36.61, 27.67 [20:09:57] !log demon@tin Finished scap: bootstrap wmf.25 (duration: 67m 17s) [20:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:44] (03CR) 10Chad: [C: 032] group0 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419254 (owner: 10Chad) [20:12:32] (03Merged) 10jenkins-bot: group0 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419254 (owner: 10Chad) [20:30:52] (03CR) 10Framawiki: [C: 031] "Note that a -2 is not just an opinion. It's a veto. See MarcoAurelio's comment on the phab ticket. The debate is still present on phab. 
I " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [20:34:40] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.56, 21.89, 23.63 [20:35:15] (03PS2) 10Framawiki: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) [20:37:57] (03PS1) 10Ottomata: Blacklist VirtualPageView schema from EL MySQL [puppet] - 10https://gerrit.wikimedia.org/r/419268 (https://phabricator.wikimedia.org/T186728) [20:40:21] !log demon@tin rebuilt and synchronized wikiversions files: group0 to wmf.25 [20:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:15] (03CR) 10BryanDavis: [C: 031] Add Chicocvenancio's key for Cloud Services [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: 10Chico Venancio) [20:45:50] (03CR) 10Ottomata: [C: 032] Blacklist VirtualPageView schema from EL MySQL [puppet] - 10https://gerrit.wikimedia.org/r/419268 (https://phabricator.wikimedia.org/T186728) (owner: 10Ottomata) [20:48:27] no_justification: For when you're out of the train... I realized that my extension has a composer dependency on justinrainbow/json-schema, which is in require-dev for mediawiki-core. Grasping at straws here. [20:48:54] awight: isn't it already in mediawiki/vendor? [20:48:58] awight: Isn't that in the wmf deployment vendor repo anyway? [20:49:01] legoktm: +1 just confirmed that [20:49:14] so it should be fine? [20:49:25] Unless you happen to need a different version [20:49:49] Sorry for the spam. What I'm looking at is a new extension that should have been deployed to the beta cluster a few hours ago, but so far nothing is loaded. [20:50:07] link to the mediawiki-config patch? 
[20:50:34] https://github.com/wikimedia/operations-mediawiki-config/commit/30ba98a43665e4e025611dd9283948cce2b97d58 [20:50:37] There are follow-ups, but here's the business, https://gerrit.wikimedia.org/r/#/c/419229/1/wmf-config/CommonSettings.php [20:50:41] oh [20:50:43] I see why [20:50:47] jenkins is stuck [20:50:49] I used eval.php to show that $wmgUseJADE is true [20:50:51] wat [20:50:54] hehe okay thanks [20:51:17] Good catch, "postmerge" jobs [20:51:47] (03CR) 10jenkins-bot: Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:51:51] (03CR) 10jenkins-bot: Add JADE to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419243 (owner: 10Awight) [20:51:59] (03CR) 10jenkins-bot: group0 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419254 (owner: 10Chad) [20:52:51] now we wait ~5min? [20:53:08] (03PS1) 10Jdlrobson: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) [20:54:20] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1974 bytes in 0.092 second response time [20:57:32] legoktm: That was amazing. Thanks! 
[21:00:20] PROBLEM - HHVM rendering on mw2141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:01:10] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 74277 bytes in 0.302 second response time [21:20:33] (03PS9) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [21:20:47] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4048057 (10RobH) a:03RobH [21:23:16] (03PS10) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [21:28:45] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4048062 (10brion) *nod* If there's general agreement not to add more specific hardware yet, we can just work with the reassigned image servers for now and add later if ne... 
[21:30:00] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4048063 (10RobH) [21:33:20] (03PS1) 10RobH: decom restbase-test200[123] [puppet] - 10https://gerrit.wikimedia.org/r/419312 (https://phabricator.wikimedia.org/T187447) [21:33:26] (03CR) 10Krinkle: [C: 04-1] "From IRC:"looks like it's a problem with either the confluent-kafka module, or (more likely) with the librdkafka library that it uses unde" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [21:33:59] (03CR) 10RobH: [C: 032] decom restbase-test200[123] [puppet] - 10https://gerrit.wikimedia.org/r/419312 (https://phabricator.wikimedia.org/T187447) (owner: 10RobH) [21:35:36] (03PS11) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [21:36:10] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [21:39:18] (03PS12) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [21:39:22] (03PS1) 10Dzahn: rm restbase::monitoring, remnants of module [puppet] - 10https://gerrit.wikimedia.org/r/419314 [21:39:49] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=55%) [21:40:12] (03PS1) 10RobH: restbase-test200* prod dns removal [dns] - 10https://gerrit.wikimedia.org/r/419315 (https://phabricator.wikimedia.org/T187447) [21:40:52] (03CR) 10RobH: [C: 032] restbase-test200* prod dns removal [dns] - 10https://gerrit.wikimedia.org/r/419315 (https://phabricator.wikimedia.org/T187447) (owner: 10RobH) [21:43:20] 10Operations, 10ops-codfw, 10DC-Ops, 
10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4048083 (10RobH) [21:44:19] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3975465 (10RobH) a:05RobH>03Papaul These are ready to be wiped by onsite. Please note as SSDs, these need the specific smartctl utility run to erase them securely.... [21:49:31] (03PS1) 10Alexandros Kosiaris: Create the basic structure of a helm chart repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/419316 [21:49:41] (03CR) 10Dzahn: [C: 04-1] "turns out this class is not used, monitoring check is already moved -> https://gerrit.wikimedia.org/r/#/c/419314/" [puppet] - 10https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn) [21:51:09] mutante: Q: do you think you'd have any time in the coming days to re-image restbase-dev1006 (ala T185494), and optionally, restbase-dev100{4,5}? 
[21:51:10] T185494: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494 [21:51:50] mutante: these are not production hosts, so nothing Bad can happen, no special considerations are needed, and if you get them back to the point where I have login, I can take it from there [21:52:24] mutante: TL;DR it's kinda blocking a bunch of other things, and everyone in SRE seems pretty slammed :) [21:52:37] (hoping you're not-as) [21:52:55] (03CR) 10Alexandros Kosiaris: [C: 032] Create the basic structure of a helm chart repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/419316 (owner: 10Alexandros Kosiaris) [21:52:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Create the basic structure of a helm chart repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/419316 (owner: 10Alexandros Kosiaris) [21:53:20] it depends whether we get hardware working for bast/deployment server replacements which would be due by end of quarter in 2 weeks [21:53:39] is this related to restbase-test hosts being removed right now above? [21:53:45] looking at ticket [21:55:12] mutante: not really, no [21:55:46] mutante: those machines are really long in the tooth, we're going to stick to the -dev* ones, but that environment needs to be rebuilt [21:55:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4048126 (10RobH) a:03RobH [21:56:26] mutante: i.e. we're consolidating, but atm we have nothing [22:12:48] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4048150 (10RobH) I just disabled the following ports for decommission of the systems: {master:1}[edit] robh@asw-b-eqiad# show | compare [edit interface... 
[22:13:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4048151 (10RobH) [22:17:18] 10Operations, 10Services: rename role::xenon - https://phabricator.wikimedia.org/T189629#4048155 (10RobH) p:05Triage>03Normal [22:20:12] (03PS1) 10RobH: decom of xenon, cerium, & praseodymium [puppet] - 10https://gerrit.wikimedia.org/r/419322 (https://phabricator.wikimedia.org/T187446) [22:21:55] (03CR) 10RobH: [C: 032] decom of xenon, cerium, & praseodymium [puppet] - 10https://gerrit.wikimedia.org/r/419322 (https://phabricator.wikimedia.org/T187446) (owner: 10RobH) [22:26:44] (03PS1) 10RobH: Decommission xenon, cerium, praseodymium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/419324 (https://phabricator.wikimedia.org/T187446) [22:27:24] (03CR) 10RobH: [C: 032] Decommission xenon, cerium, praseodymium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/419324 (https://phabricator.wikimedia.org/T187446) (owner: 10RobH) [22:28:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4048208 (10RobH) [22:28:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#3975453 (10RobH) a:05RobH>03Cmjohnson this is now ready for onsite wipe of ssds. Please note these are ssds, so will need the smartctl utility to clear them ou... [22:29:22] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:29:32] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:29:41] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:29:59] uh oh [22:30:01] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:30:07] i just merged a change but it was decom, shouldn't cause that [22:30:22] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:11] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:51] manually running on lvs1002 to see what exactly is happening [22:31:51] PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:01] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:05] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4048217 (10mobrovac) >>! In T188947#4046639, @Pchelolo wrote: >> Add a second LVS IP, to be served from the same cluster, to use for videoscal... [22:32:11] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:29] ok lvs1002 runs puppet fine when i run it [22:32:31] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:41] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:21] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:22] PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:33:42] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:34:22] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:34:47] so the one i manually ran of course clears... [22:36:42] PROBLEM - puppet last run on mw1311 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:38:05] confirming one.. mw1311 [22:39:06] and ACK, it's just the icinga noise now [22:39:07] I think it was puppetdb [22:39:11] yes, indeed [22:41:40] 10Operations, 10Services (watching): rename role::xenon - https://phabricator.wikimedia.org/T189629#4048246 (10mobrovac) [`role::xenon`](https://github.com/wikimedia/puppet/blob/c6a8895e6eb1ea858795aca2325d60a877a0276e/modules/role/manifests/xenon.pp) actually refers to the [HHVM extension named Xenon](https:/... [22:41:42] RECOVERY - puppet last run on mw1311 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:44:22] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1977 bytes in 0.087 second response time [22:47:09] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4048260 (10ayounsi) asw2-b-eqiad updated. 
[22:52:51] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/e00bf838-2710-11e8-9cb3-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-xf04b is not accessible: Permission denied [22:53:07] (03PS13) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [22:53:32] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [22:54:51] PROBLEM - IPMI Sensor Status on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:55:32] PROBLEM - Restbase root url on restbase-dev1006 is CRITICAL: connect to address 10.64.48.10 and port 7231: Connection refused [22:55:41] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:55:51] PROBLEM - configured eth on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:55:52] PROBLEM - Disk space on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:01] PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:02] PROBLEM - DPKG on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:02] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:21] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:31] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:31] PROBLEM - Check size of conntrack table on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:56:51] RECOVERY - puppet last run on ganeti1005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:57:41] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 12 seconds 
ago with 0 failures [22:58:21] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:58:31] RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [22:58:41] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [22:58:51] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:59:32] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:59:41] RECOVERY - puppet last run on db1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:00:01] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T2300). [23:00:04] subbu, twkozlowski, and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:22] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:00:26] o/ [23:00:34] o/ [23:01:05] We're at 9 patches this window, I see [23:01:11] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:02:01] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:02:11] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:02:31] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:06:16] any swatters around? [23:07:28] there's too many patches! [23:07:58] just 9, one more :) [23:08:14] (03PS2) 10Reedy: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: 10Subramanya Sastry) [23:08:20] (03CR) 10Reedy: [C: 032] Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: 10Subramanya Sastry) [23:10:01] (03Merged) 10jenkins-bot: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: 10Subramanya Sastry) [23:10:17] (03CR) 10jenkins-bot: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: 10Subramanya Sastry) [23:10:25] !log restbase-dev1006 - reinstalling, manually skipping " Volume group name already in use" (T185494) [23:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:31] T185494: 
Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494 [23:11:42] (03PS2) 10Reedy: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: 10Odder) [23:11:49] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable RemexHTML on 96 wikis T188010 (duration: 01m 16s) [23:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:55] T188010: Enable RemexHTML on additional wikis with < 25 errors in all high priority categories - https://phabricator.wikimedia.org/T188010 [23:11:59] Reedy: I'm here (but at the end of the queue apparently ;-)) [23:12:05] lemme know when we are ready to rumble [23:12:14] (03CR) 10Reedy: [C: 032] Add high-density logos for the Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: 10Odder) [23:12:30] * odder has got a few patches but they should be nice, quick n' easy [23:13:15] jerkins is slow [23:13:26] (03Merged) 10jenkins-bot: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: 10Odder) [23:13:42] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:13:48] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4048335 (10ayounsi) 05Open>03stalled [23:14:01] PROBLEM - configured eth on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:01] PROBLEM - Disk space on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:08] (03PS2) 10Reedy: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: 10Odder) 
[23:14:11] PROBLEM - DPKG on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:11] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:11] PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:15] (03CR) 10Reedy: [C: 032] Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: 10Odder) [23:14:31] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:31] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:32] PROBLEM - Check size of conntrack table on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:37] that wasn't downtime? why [23:14:42] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:14:52] i mean why did it not alert before when it was down [23:15:33] (03CR) 10jenkins-bot: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: 10Odder) [23:15:41] (03Merged) 10jenkins-bot: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: 10Odder) [23:15:51] (03PS2) 10Reedy: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: 10Odder) [23:15:53] (03CR) 10jenkins-bot: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: 10Odder) [23:15:53] apparently because if the return code is 255 and not 
0,1,2 or 3 ... [23:15:55] (03CR) 10Reedy: [C: 032] Add a localised logo for the Kongo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: 10Odder) [23:16:02] then scheduled downtimes are ignored [23:16:09] Reedy, looks like the remex thing is done already .. based on my testing. [23:16:17] and it's 255 because currently that's installing OS [23:16:20] subbu: Yeah, I just pushed it live :P [23:16:22] oh yes .. you did sync it up there. [23:16:24] ok. [23:17:05] (03Merged) 10jenkins-bot: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: 10Odder) [23:17:26] (03PS2) 10Reedy: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: 10Odder) [23:17:29] (03CR) 10Reedy: [C: 032] Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: 10Odder) [23:18:12] and no logmsgbot messages? 
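Note on the "Return code of 255 is out of bounds" alerts discussed above: monitoring plugins in the Nagios/Icinga family conventionally exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), and any other value is reported as out of bounds. A minimal Python sketch of that mapping, for illustration only: the helper name `classify_exit_code` is hypothetical and this is not Icinga's own code. The remark that scheduled downtimes are ignored for such results is only what was suggested in the channel and is not modelled here.

```python
# Conventional monitoring-plugin exit codes (Nagios/Icinga plugin convention).
PLUGIN_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify_exit_code(code: int) -> str:
    """Map a plugin exit code to a state name (hypothetical helper)."""
    if code in PLUGIN_STATES:
        return PLUGIN_STATES[code]
    # Anything outside 0-3 is what the alerts above call "out of bounds".
    return f"out of bounds (return code {code})"


if __name__ == "__main__":
    for code in (0, 2, 255):
        print(code, "->", classify_exit_code(code))
```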
[23:18:44] morebots seems to be dead [23:18:52] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [23:19:00] (03Merged) 10jenkins-bot: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: 10Odder) [23:19:16] (03PS2) 10Reedy: Update logo for the Maithili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: 10Odder) [23:19:19] (03CR) 10Reedy: [C: 032] Update logo for the Maithili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: 10Odder) [23:20:31] (03Merged) 10jenkins-bot: Update logo for the Maithili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: 10Odder) [23:21:07] (03PS2) 10Reedy: Add high-density logos for seven Wikipedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [23:21:11] (03CR) 10Reedy: [C: 032] Add high-density logos for seven Wikipedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [23:21:23] (03CR) 10jenkins-bot: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: 10Odder) [23:22:21] (03Merged) 10jenkins-bot: Add high-density logos for seven Wikipedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [23:22:22] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - kubelet_operational_latencies is 33696 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:23:22] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - 
kubelet_operational_latencies is 1329 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:24:32] !log reedy@tin Synchronized static/images/project-logos/: YOU GET A LOGO, YOU GET A LOGO. YOU ALL GET LOGOS (duration: 01m 16s) [23:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:54] ;-) [23:25:23] lol [23:25:42] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:26:01] PROBLEM - configured eth on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:26:01] PROBLEM - Disk space on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:26:11] PROBLEM - DPKG on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:26:12] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:26:12] PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds [23:26:16] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: moar logos (duration: 01m 15s) [23:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:26] (03PS2) 10Reedy: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: 10Jdlrobson) [23:30:37] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:30:38] (03CR) 10Reedy: [C: 032] Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: 10Jdlrobson) [23:30:47] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[23:30:47] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:30:57] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:31:08] PROBLEM - configured eth on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:31:08] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:31:08] PROBLEM - Disk space on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:31:18] PROBLEM - DPKG on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:31:18] PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:31:18] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[23:31:57] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1006 is OK: OK ferm input default policy is set [23:31:58] RECOVERY - MD RAID on restbase-dev1006 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 [23:32:01] (03Merged) 10jenkins-bot: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: 10Jdlrobson) [23:32:07] RECOVERY - configured eth on restbase-dev1006 is OK: OK - interfaces up [23:32:08] RECOVERY - Disk space on restbase-dev1006 is OK: DISK OK [23:32:18] 10Operations, 10ops-codfw: rack/setup/install ms-be204[1-4] - https://phabricator.wikimedia.org/T189633#4048384 (10Papaul) p:05Triage>03Normal [23:32:18] RECOVERY - dhclient process on restbase-dev1006 is OK: PROCS OK: 0 processes with command name dhclient [23:32:18] RECOVERY - DPKG on restbase-dev1006 is OK: All packages OK [23:32:27] ^ reinstalled [23:32:34] 10Operations, 10ops-codfw: rack/setup/install ms-be204[1-3] - https://phabricator.wikimedia.org/T189633#4048400 (10Papaul) [23:33:18] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active [23:33:47] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:34:06] urandom: ^ i think you should be able to SSH again (minus mismatching fingerprint, new one is here https://phabricator.wikimedia.org/P6845) [23:35:31] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 01m 15s) [23:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:37] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4048405 (10Dzahn) reinstalled, re-added to puppet, initial puppet run, recovered in Icinga, including: 19:33 < icinga-wm> RECOVERY - cassandra-a service on 
restbase-dev1006 is OK: OK - cas... [23:35:38] !log that was Enable VirtualPageViews on Hungarian Wikipedia T184793 [23:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:44] T184793: Instrument page interactions - https://phabricator.wikimedia.org/T184793 [23:36:18] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:37:51] mutante: \o/ [23:37:56] mutante: thank you so much! [23:38:36] i know that ended up being more than you bargained for and that you have other stuff to do; it is appreciated! [23:39:33] error: insufficient permission for adding an object to repository database .git/objects [23:39:33] fatal: git write-tree failed to write a tree [23:39:57] tgr: You seem to have a mad umask for groups [23:40:10] drwxr-sr-x 2 tgr wikidev 4096 Mar 9 23:21 e9 [23:40:10] drwxrwsr-x 2 reedy wikidev 4096 Mar 8 22:15 eb [23:40:34] urandom: :) so, one question. did you say 1004/1005 because they should all be stretch? that makes sense [23:40:49] and did you expect to get stretch on 1006 [23:40:52] mutante: Can I get you to chmod -R g+w /srv/mediawiki-staging/.git/objects/* [23:41:24] mutante: hrmm, no actually, our standard setup is still jessie [23:41:45] mutante: how does that work, what is the timetable for upgrading? [23:42:07] Reedy: done on tin [23:42:07] i assume at some point, there will be an effort to get everything migrated to stretch [23:42:08] Reedy: pretty sure I never changed my umask settings on tin [23:42:32] mutante: ugh, wrong folder, damn it [23:42:52] chmod -R g+w /srv/mediawiki-staging/php-1.31.0-wmf.24/.git/objects/* [23:43:09] Reedy: oh we're live? cool. was expecting a mwdebug test round [23:43:11] thanks! [23:43:24] effort :P [23:43:30] Reedy: the UBN is that live? 
[23:43:31] jdlrobson: other is blocked on bad git file permissions [23:43:31] !log tin: chmod -R g+w /srv/mediawiki-staging/.git/objects/* ; chmod -R g+w /srv/mediawiki-staging/php-1.31.0-wmf.24/.git/objects/* [23:43:34] Reedy: done [23:43:34] oh. [23:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:50] urandom: so .. a little while ago i changed the default to be stretch .. in DHCP [23:44:02] mutante: <3 [23:44:03] urandom: we did that to encourage more stretch [23:44:14] wait, this is jessie [23:44:35] so basically we wanted to change it to "you need a reason now to still want jessie" [23:44:44] mutante: yeah, this is jesssie [23:44:46] but also.. i did not touch the individual DHCP config for it today [23:44:51] so.. you still got jessie [23:44:57] err, with the apropos number of s's [23:44:58] RECOVERY - Check size of conntrack table on restbase-dev1006 is OK: OK: nf_conntrack is 0 % full [23:45:06] oh, ok [23:45:36] urandom: so while the 3 dev hosts are consistent.. which is good... maybe you also want one that is stretch [23:45:41] so that you can start testing the difference [23:45:51] ¯\_(ツ)_/¯ [23:46:33] that would probably be OK, but you've already got this up, are you trying to find more to do? :) [23:46:54] !log reedy@tin Synchronized php-1.31.0-wmf.24/extensions/MobileFrontend/: T188825 (duration: 01m 18s) [23:46:59] lol, no. but why did you say to reinstall 1004/1005 optionally [23:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:00] T188825: Infobox relocation ends up messing up paragraphs placing - https://phabricator.wikimedia.org/T188825 [23:47:42] well, i said optionally because we've typically reimaged when resetting one of these clusters, but it isn't strictly needed [23:47:49] i can clean them up [23:48:05] and it is in my power to do so, i can't reimage :) [23:48:59] Good night people. Try not to work too hard. [23:49:51] urandom: ah! ok.. well then.. 
if the cleanup is easy enough, ok [23:50:02] it's doable :) [23:50:35] and this is the dev environment, so if there are missteps, then it's OK [23:51:35] btw, i just made a change to disable the monitoring for "restbase root url" if the host is a "dev" host [23:51:56] ok [23:52:04] and then i found that we had an old restbase::monitoring class that isn't used [23:52:14] replaced by profile::restbase [23:52:50] (03PS1) 10Odder: Add a localised logo for the Cree Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419332 [23:53:18] Reedy: Would you mind doing ^^ or do you prefer to wait until another window tomorrow? [23:53:45] I created the two HiDPI logos today, but it looks like we missed the normal logo somehow :/ [23:54:19] (03CR) 10Dzahn: [C: 032] rm restbase::monitoring, remnants of module [puppet] - 10https://gerrit.wikimedia.org/r/419314 (owner: 10Dzahn) [23:54:25] (03PS2) 10Dzahn: rm restbase::monitoring, remnants of module [puppet] - 10https://gerrit.wikimedia.org/r/419314 [23:54:47] (03CR) 10Dzahn: [C: 032] "profile::restbase already does this" [puppet] - 10https://gerrit.wikimedia.org/r/419314 (owner: 10Dzahn) [23:54:57] RECOVERY - IPMI Sensor Status on restbase-dev1006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [23:55:24] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4048451 (10Dzahn) 05stalled>03Resolved a:05Cmjohnson>03Dzahn [23:55:35] (03CR) 10Reedy: [C: 032] Add a localised logo for the Cree Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419332 (owner: 10Odder) [23:56:21] (03Merged) 10jenkins-bot: Add a localised logo for the Cree Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419332 (owner: 10Odder) [23:57:42] Reedy: any luck with the git issue? [23:57:49] It's deployed [23:57:57] 11 minutes ago [23:58:14] Reedy: fwiw my umask is u=rwx,g=rwx,o=rx [23:58:24] hmm [23:58:30] you were definitely listed as the user... 
[23:58:37] robh: i think you removed all the restbase-test hosts but one. test2002 survived in icinga [23:59:50] ok, resent [23:59:53] it should go away