[00:00:39] <icinga-wm>	 RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:00:39] <icinga-wm>	 RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:01:00] <icinga-wm>	 RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:01:09] <icinga-wm>	 RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:01:19] <icinga-wm>	 RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:01:49] <icinga-wm>	 RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:02:39] <icinga-wm>	 RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:02:39] <icinga-wm>	 RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:02:40] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:02:49] <icinga-wm>	 RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:09:34] <wikibugs>	 (03CR) 10Dzahn: "this should wait until after wiki is created?" [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm)
[00:17:48] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045220 (10RobH) >>! In T188045#4045150, @Platonides wrote: > Well, if the server itself is needed, it will be doing its work with a different IP address than the one of wdqs1004, s...
[00:18:09] <wikibugs>	 (03CR) 10Dzahn: "yea, downvoted for adding a Hiera call in the module.. sigh...would need a new parameter for base class" [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn)
[00:30:35] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/10416/dbmonitor1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/415503 (owner: 10Dzahn)
[00:30:50] <wikibugs>	 (03PS4) 10Dzahn: tendril: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/415503
[00:43:50] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "no-op on dbmonitor1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/415503 (owner: 10Dzahn)
[00:47:32] <wikibugs>	 (03PS2) 10Dzahn: tendril: add support for stretch/php7 [puppet] - 10https://gerrit.wikimedia.org/r/415511
[00:48:07] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "not affecting anything since servers are jessie now, just preparing for the future" [puppet] - 10https://gerrit.wikimedia.org/r/415511 (owner: 10Dzahn)
[00:48:18] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4045293 (10Smalyshev)
[00:54:07] <wikibugs>	 (03CR) 10Dzahn: "i thought we are switching to php7 and away from hhvm..." [puppet] - 10https://gerrit.wikimedia.org/r/415768 (owner: 10Dzahn)
[00:54:58] <wikibugs>	 (03PS1) 10BBlack: cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[00:55:01] <wikibugs>	 (03PS1) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[00:55:06] <wikibugs>	 (03PS1) 10BBlack: varnish: remove weekly restart cron entries [puppet] - 10https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315)
[00:55:52] <wikibugs>	 (03Abandoned) 10Dzahn: openstack/wikitech: add some php7 support [puppet] - 10https://gerrit.wikimedia.org/r/415768 (owner: 10Dzahn)
[00:56:54] <wikibugs>	 (03Abandoned) 10Dzahn: openstack:labtest:web: add some php7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/415765 (owner: 10Dzahn)
[00:57:16] <wikibugs>	 (03CR) 10Dzahn: "i'll stop uploading changes to wmcs manifests" [puppet] - 10https://gerrit.wikimedia.org/r/415765 (owner: 10Dzahn)
[00:58:44] <wikibugs>	 (03CR) 10Dzahn: "should this wait until after the wiki is created?" [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm)
[00:59:15] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045325 (10Smalyshev) Thanks @RobH! Created {T189548} for loading the data back. @Gehel if you don't see anything else wrong then this one can be resolved.
[01:00:39] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "blocked" [puppet] - 10https://gerrit.wikimedia.org/r/405230 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn)
[01:00:55] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "blocked" [dns] - 10https://gerrit.wikimedia.org/r/405231 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn)
[01:01:04] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "blocked" [dns] - 10https://gerrit.wikimedia.org/r/405232 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn)
[01:01:58] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "blocked" [puppet] - 10https://gerrit.wikimedia.org/r/405229 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn)
[01:02:07] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "blocked" [puppet] - 10https://gerrit.wikimedia.org/r/405226 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn)
[01:02:56] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] prometheus: ganglia-gen outdated resource names (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/409390 (https://phabricator.wikimedia.org/T186918) (owner: 10Dzahn)
[01:03:03] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] add IPv6 for bast3003 [dns] - 10https://gerrit.wikimedia.org/r/405225 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn)
[01:22:00] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] site: turn bast1002 into a bastion host [puppet] - 10https://gerrit.wikimedia.org/r/414848 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn)
[01:26:09] <wikibugs>	 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4045341 (10mmodell)
[01:26:13] <wikibugs>	 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4045340 (10mmodell)
[01:29:33] <icinga-wm>	 PROBLEM - Host labtestneutron2002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:40:13] <icinga-wm>	 PROBLEM - configured eth on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:41:54] <icinga-wm>	 PROBLEM - dhclient process on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:43:43] <icinga-wm>	 PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:47:04] <icinga-wm>	 PROBLEM - DPKG on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:48:53] <icinga-wm>	 PROBLEM - Disk space on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:49:43] <icinga-wm>	 PROBLEM - IPMI Sensor Status on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[01:53:03] <icinga-wm>	 PROBLEM - MPT RAID on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[02:04:53] <icinga-wm>	 PROBLEM - NTP on labtestneutron2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[02:26:20] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4045378 (10ayounsi) @Cmjohnson Can you cable lvs1016 as listed bellow? | Hostname | Hostport | Switchport | note | |---|---|---|---| | lvs1016 | eth0 | asw2-d:xe-7/0/17  | |...
[02:34:45] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.24) (duration: 05m 30s)
[02:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:54] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.113 second response time
[02:55:54] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1978 bytes in 0.118 second response time
[03:01:40] <wikibugs>	 10Operations, 10ops-ulsfo, 10Traffic, 10netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4045408 (10ayounsi) p:05Triage>03Normal
[03:26:54] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 864.93 seconds
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2099 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2101 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2102 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2103 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2104 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:34] <icinga-wm>	 PROBLEM - Host mw2105 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:35] <icinga-wm>	 PROBLEM - Host mw2106 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:35] <icinga-wm>	 PROBLEM - Host mw2107 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:36] <icinga-wm>	 PROBLEM - Host mw2108 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:36] <icinga-wm>	 PROBLEM - Host mw2109 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:37] <icinga-wm>	 PROBLEM - Host mw2110 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:37] <icinga-wm>	 PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:38] <icinga-wm>	 PROBLEM - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2115 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2116 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:25] <icinga-wm>	 PROBLEM - Host mw2120 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:26] <icinga-wm>	 PROBLEM - Host mw2121 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:26] <icinga-wm>	 PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:27] <icinga-wm>	 PROBLEM - Host mw2123 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:27] <icinga-wm>	 PROBLEM - Host mw2124 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:28] <icinga-wm>	 PROBLEM - Host mw2125 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:28] <icinga-wm>	 PROBLEM - Host mw2126 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:29] <icinga-wm>	 PROBLEM - Host mw2127 is DOWN: PING CRITICAL - Packet loss = 100%
[04:07:14] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.16 seconds
[04:58:29] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4045522 (10Prtksxna)
[05:31:24] <icinga-wm>	 PROBLEM - Long running screen/tmux on labtestneutron2001 is CRITICAL: Return code of 255 is out of bounds
[06:31:20] <_joe_>	 ok now, how do those systems still resurface in puppet, then in icinga
[06:31:23] <_joe_>	 grrr
[06:35:43] <_joe_>	 oh I see
[06:35:49] <_joe_>	 it's herron's fault :P
[06:36:56] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419108
[06:41:25] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419108
[06:42:58] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419108 (owner: 10Marostegui)
[06:44:11] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419108 (owner: 10Marostegui)
[06:45:58] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 after alter table (duration: 01m 19s)
[06:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:21] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089)
[06:48:57] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419108 (owner: 10Marostegui)
[06:49:18] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui)
[06:50:48] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui)
[06:54:16] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419109 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui)
[06:54:18] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 for alter table (duration: 00m 56s)
[06:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:27] <marostegui>	 !log Deploy schema change on db1081 - T187089 T185128 T153182
[06:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:34] <stashbot>	 T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:56:34] <stashbot>	 T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:56:34] <stashbot>	 T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:58:17] <marostegui>	 !log Deploy schema change on dbstore1002 - T187089 T185128 T153182
[06:58:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:26] <wikibugs>	 (03PS1) 10Elukey: profile::geowiki: disable periodic data quality check (cronspam) [puppet] - 10https://gerrit.wikimedia.org/r/419110 (https://phabricator.wikimedia.org/T173486)
[07:02:32] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::geowiki: disable periodic data quality check (cronspam) [puppet] - 10https://gerrit.wikimedia.org/r/419110 (https://phabricator.wikimedia.org/T173486) (owner: 10Elukey)
[07:11:48] <wikibugs>	 (03CR) 10Gergő Tisza: beta: Enable password authn for Beta Cluster logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis)
[07:13:46] <wikibugs>	 (03PS1) 10Elukey: statistics::user: avoid adding 'stats' to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/419111
[07:14:48] <wikibugs>	 (03CR) 10Elukey: [C: 032] "@Ottomata: maybe I am missing something but I'd merge this straight away, I'll rollback if needed :)" [puppet] - 10https://gerrit.wikimedia.org/r/419111 (owner: 10Elukey)
[07:23:40] <icinga-wm>	 PROBLEM - Check systemd state on es2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:36:49] <icinga-wm>	 RECOVERY - Check systemd state on es2014 is OK: OK - running: The system is fully operational
[07:36:50] <wikibugs>	 (03CR) 10WMDE-leszek: [C: 031] Enable Wikidata description override on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza)
[07:37:38] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4045657 (10Joe) @Papaul I would move the servers you put in row A to row B after you decommission the old servers in B 3, if that works for you.  Else, I'll try to resh...
[07:45:21] <wikibugs>	 (03CR) 10Jcrespo: "Comment" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui)
[07:45:48] <wikibugs>	 (03CR) 10Jcrespo: "Sorry, I meant multi-instance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui)
[07:46:45] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] db-eqiad,db-codfw.php: Proposal for moving hosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui)
[07:46:50] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Proposal for moving hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469)
[08:08:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] base/icinga: add Hiera override to skip systemd monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn)
[08:09:48] <wikibugs>	 10Operations, 10hardware-requests: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (10elukey)
[08:10:14] <wikibugs>	 10Operations, 10hardware-requests: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (10elukey)
[08:10:16] <elukey>	 \o/
[08:11:10] <wikibugs>	 (03PS1) 10Muehlenhoff: statistics::packages: Use non-virtual package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/419112
[08:12:26] <wikibugs>	 (03CR) 10Hashar: Cumin masters: upgrade to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: 10Volans)
[08:15:56] <wikibugs>	 (03PS2) 10Elukey: statistics::packages: Use non-virtual package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/419112 (owner: 10Muehlenhoff)
[08:17:04] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for diamond [puppet] - 10https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991)
[08:17:06] <wikibugs>	 (03CR) 10Elukey: [C: 032] statistics::packages: Use non-virtual package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/419112 (owner: 10Muehlenhoff)
[08:21:12] <wikibugs>	 (03CR) 10Gilles: [C: 031] varnishslowlog: add Backend-Timing D=, in seconds [puppet] - 10https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema)
[08:22:18] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469)
[08:24:13] <wikibugs>	 (03CR) 10Muehlenhoff: Cumin masters: upgrade to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: 10Volans)
[08:27:21] <wikibugs>	 (03PS1) 10Elukey: statistics::wmde::graphite: Use non-virtual package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/419115
[08:27:47] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:28:38] <wikibugs>	 (03CR) 10Jcrespo: "Looks ok then to me, although probably will need more (unknown) tuning later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui)
[08:28:40] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for diamond [puppet] - 10https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991)
[08:29:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for diamond [puppet] - 10https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:29:53] <wikibugs>	 (03CR) 10Elukey: [C: 032] statistics::wmde::graphite: Use non-virtual package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/419115 (owner: 10Elukey)
[08:30:01] <wikibugs>	 (03PS2) 10Elukey: statistics::wmde::graphite: Use non-virtual package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/419115
[08:30:08] <wikibugs>	 (03CR) 10Gilles: [C: 031] varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema)
[08:30:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:31:25] <elukey>	 moritzm: ready to merge?
[08:31:28] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:31:42] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419114 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:31:44] <moritzm>	 I was about to ask you that :-) please go ahead
[08:32:36] <elukey>	 ack!
[08:41:08] <wikibugs>	 (03PS1) 10Elukey: statistics::wmde::graphite: depend on generic php-xml [puppet] - 10https://gerrit.wikimedia.org/r/419118
[08:43:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] statistics::wmde::graphite: depend on generic php-xml [puppet] - 10https://gerrit.wikimedia.org/r/419118 (owner: 10Elukey)
[08:44:09] <wikibugs>	 (03CR) 10Elukey: [C: 032] statistics::wmde::graphite: depend on generic php-xml [puppet] - 10https://gerrit.wikimedia.org/r/419118 (owner: 10Elukey)
[08:46:35] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1063 (duration: 00m 57s)
[08:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:33] <wikibugs>	 10Operations, 10Community-Liaisons, 10Security-Reviews, 10Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1554077 (10Bawolff) So to clarify - There is still interest in using lime survey, right (The third party site, not the software package)? And the question that you want answ...
[08:54:16] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469)
[08:55:22] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:55:52] <wikibugs>	 (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991)
[08:56:54] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:58:05] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:58:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:58:51] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Remove db1051 and db1063 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419119 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[09:00:48] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4045777 (10Marostegui)
[09:02:26] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db1051 and db1063 (duration: 00m 56s)
[09:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:08] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1051 and db1063 (duration: 00m 56s)
[09:05:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:46] <wikibugs>	 (03PS2) 10Muehlenhoff: Depool poolcounter1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418892
[09:13:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Depool poolcounter1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418892 (owner: 10Muehlenhoff)
[09:13:17] <wikibugs>	 (03CR) 10jenkins-bot: Depool poolcounter1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418892 (owner: 10Muehlenhoff)
[09:14:58] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045800 (10Gehel) 05Open>03Resolved Yay! Thanks @faidon for finding the issue!  @RobH / @Papaul : the symptoms of wdqs2006 mgmt interface look vaguely similar (T189318). Any cha...
[09:15:03] <logmsgbot>	 !log jmm@tin Synchronized wmf-config/ProductionServices.php: Depooling poolcounter1001 for kernel security update (duration: 00m 56s)
[09:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:13] <wikibugs>	 (03PS2) 10BBlack: cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[09:24:15] <wikibugs>	 (03PS2) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[09:24:17] <wikibugs>	 (03PS2) 10BBlack: varnish: remove weekly restart cron entries [puppet] - 10https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315)
[09:37:40] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4045827 (10elukey) I acked on icinga notebook100[34] systemd unit failures to avoid confusion for other people (expected I guess since the task is WIP...
[09:37:55] <moritzm>	 !log rebooting poolcounter1001 for kernel security update
[09:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:45] <wikibugs>	 (03PS3) 10Volans: Cumin masters in prod: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773)
[09:38:46] <wikibugs>	 (03PS1) 10Volans: Cumin masters in WMCS: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T187773)
[09:41:29] <wikibugs>	 (03CR) 10Volans: [C: 04-2] "This is only for prod now and still -2, the WMCS part was moved into I364f7a3a23328deeaddb69d632d6e9c7ded47258" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: 10Volans)
[09:42:40] <hashar>	 volans: good morning!!  Is the production cumin master using Stretch?  :]
[09:43:23] <hashar>	 volans: on labs I have a Jessie one and custom modules are in /usr/local/lib/python3.4  while Stretch apparently uses "python3" :]
[09:43:27] <volans>	 hashar: nope, jessie for now, pending an upgrade probably this year
[09:43:36] <volans>	 see above
[09:43:37] <hashar>	 ah
[09:44:42] <hashar>	 volans: you are magic :]
[09:45:36] <volans>	 hashar: that *should* work, but I have no way to test it in the compiler and not much time to test it on my puppetmaster atm
[09:47:17] <hashar>	 volans: dont worry. I am probably the only one using cumin on the CI master :]  At least puppet pass now!
[09:47:50] <wikibugs>	 (03PS1) 10Elukey: Update the README file with some notes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/419132
[09:47:58] <volans>	 nice, so if that patch works we could also merge it, I'm not against it
[09:48:06] <volans>	 the prod one has to wait though
[09:48:16] <vgutierrez>	 !log upgrading ulsfo LVSs to pybal 1.15.2
[09:48:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:37] <wikibugs>	 (03CR) 10Elukey: [C: 032] Update the README file with some notes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/419132 (owner: 10Elukey)
[09:49:27] <wikibugs>	 (03CR) 10Hashar: "I cherry picked this and integration-cumin now has /usr/local/lib/python3.4/dist-packages/cumin_file_backend.py  :]  Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: 10Volans)
[09:54:07] <wikibugs>	 (03PS2) 10Hashar: Cumin masters in WMCS: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans)
[09:55:23] <wikibugs>	 (03CR) 10Hashar: [C: 031] "I have edited the commit message to link to T188112:" [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans)
[09:56:20] <volans>	 hashar: ack, thanks
[09:58:13] <volans>	 hashar: if your test passed, do you think is good to merge?
[10:00:04] <gehel>	 !log shutting down blazegraph on wdqs2001 for data transfer to wdqs1004 - T189548
[10:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:10] <stashbot>	 T189548: reload data on wdqs1004 - https://phabricator.wikimedia.org/T189548
[10:00:56] <volans>	 hashar: just read your full comment, sorry
[10:01:18] <hashar>	 volans: almost :D
[10:01:20] <wikibugs>	 (03CR) 10Hashar: "Then I get:" [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans)
[10:01:23] <hashar>	 Default backend 'openstack' is not registered
[10:01:33] <hashar>	 I have updated my task and copy pasted to the gerrit change
[10:01:36] <volans>	 yeah, that's because is 'optional'
[10:01:40] <hashar>	 probably the cumin config file need to explicitly list it ?
[10:01:55] <hashar>	 there is no hurry
[10:01:57] <volans>	 no, it needs to install the cumin package with the suggested ones
[10:02:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Depool poolcounter1001 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419135
[10:02:38] <volans>	 let me see if puppet can do it
[10:03:02] <hashar>	 AHHH
[10:03:06] <hashar>	 Suggests: python3-keystoneauth1, python3-keystoneclient, python3-novaclient
[10:03:27] <hashar>	 and indeed they are not installed
[10:03:28] <hashar>	 :D
[10:03:29] <volans>	 yep, it's an optional backend
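The "Default backend 'openstack' is not registered" error above comes from a backend whose Python dependencies (the Debian "Suggests" python3-keystoneauth1, python3-keystoneclient, python3-novaclient) are not installed. The following is a minimal sketch of that general optional-backend pattern. It is NOT cumin's actual loader. The registry layout and the function names `available_backends` and `select_default` are hypothetical, and the only grounded parts are the three module names and the error text.

```python
import importlib.util

# Hypothetical registry: backend name -> Python modules it needs.
# Module names mirror the Suggests of the cumin package quoted above;
# this is an illustration, not cumin's real code.
OPTIONAL_BACKENDS = {
    "openstack": ["keystoneauth1", "keystoneclient", "novaclient"],
}


def available_backends(candidates):
    """Return the backends whose required modules are all importable
    by the interpreter running this code."""
    registered = []
    for name, modules in candidates.items():
        if all(importlib.util.find_spec(mod) is not None for mod in modules):
            registered.append(name)
    return registered


def select_default(default, candidates):
    """Refuse a default backend that never got registered."""
    if default not in available_backends(candidates):
        raise ValueError(f"Default backend '{default}' is not registered")
    return default
```

Because the check runs against the current interpreter, having only the Python 2 builds of these libraries installed (as volans notes just above) still leaves a python3 process without the backend.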
[10:03:33] <_joe_>	 volans: no-can-do
[10:03:37] <_joe_>	 and for good reasons
[10:03:38] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469)
[10:03:39] <volans>	 you got the py2 version of them, not the py3 one
[10:03:58] <volans>	 _joe_: so I cannot tell puppet to require a specific package with suggested ones?
[10:04:05] <_joe_>	 nope
[10:04:14] <volans>	 :(
[10:04:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:04:21] <_joe_>	 Suggested/Recommended packages are a debian/ubuntu specific thng
[10:04:38] <volans>	 our puppet is ALL debian (and slightly ubuntu) specific
[10:04:52] <_joe_>	 well "package" is part of puppet itself
[10:04:59] <_joe_>	 you're welcome to improve it
[10:05:02] <volans>	 eheheh I knew it
[10:05:26] <_joe_>	 oh
[10:05:31] <_joe_>	 actually lemme check
[10:05:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Revert "Depool poolcounter1001 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419135 (owner: 10Muehlenhoff)
[10:05:43] <volans>	 I could add them in puppet as require_package, but meh
[10:05:56] <wikibugs>	 (03CR) 10Jcrespo: "Ignore the violations, it is the addition of hosts to the new style, which will be compensated when we decom the old ones." [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:06:00] <wikibugs>	 (03CR) 10Marostegui: [C: 031] "Looks good to me and I would override the -1 from jenkins as once we have them on stretch and all that we can work on a proper refactor fo" [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:06:07] <_joe_>	 volans: https://puppet.com/docs/puppet/4.8/type.html#package-attribute-install_options
[10:06:14] <wikibugs>	 (03CR) 10Jcrespo: "s/new/old misc/" [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:06:16] <_joe_>	 but then you cannot use require_package ofc
[10:07:20] <wikibugs>	 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4045892 (10akosiaris) >>! In T180628#4044173, @mmodell wrote: > @akosiaris: I think it's needed on masters, at least to enable deployers to issue gi...
[10:07:22] <logmsgbot>	 !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling poolcounter1001 after kernel security update (duration: 00m 57s)
[10:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:31] <wikibugs>	 (03PS1) 10Elukey: Assing role::analytics_cluster::hadoop::worker to analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/419138 (https://phabricator.wikimedia.org/T188294)
[10:07:38] <volans>	 _joe_: would be so bad to not use require_package just for cumin in the WMCS profile?
[10:07:54] <_joe_>	 volans: I don't give a damn, tbh
[10:07:58] <volans>	 shouldn't conflict with other requirements
[10:08:07] <volans>	 lol
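_joe_'s pointer is Puppet's `install_options` package attribute. hashar's later manual test (10:47:41) drives apt directly with the `APT::Install-Suggests=1` configuration option; `--install-suggests` is the apt-get command-line equivalent. The following rough Python sketch assembles that same apt-get command line, which is handy for reproducing the manual test. The function names are hypothetical, and this is not a claim about exactly what Puppet's apt provider executes.

```python
import shlex


def apt_install_argv(package, install_suggests=False, force_confold=True):
    """Build an apt-get argv like the one run by hand later in this log."""
    argv = ["/usr/bin/apt-get", "-y"]
    if force_confold:
        # Keep locally modified config files instead of prompting.
        argv += ["-o", "DPkg::Options::=--force-confold"]
    if install_suggests:
        # Treat the package's "Suggests:" like dependencies for this run.
        argv += ["-o", "APT::Install-Suggests=1"]
    argv += ["install", package]
    return argv


def apt_install_cmdline(package, **kwargs):
    """Render the argv as a copy-pasteable shell command (Python 3.8+)."""
    return shlex.join(apt_install_argv(package, **kwargs))


print(apt_install_cmdline("cumin", install_suggests=True))
```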
[10:08:29] <wikibugs>	 (03CR) 10Elukey: [C: 032] Assing role::analytics_cluster::hadoop::worker to analytics1072 [puppet] - 10https://gerrit.wikimedia.org/r/419138 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey)
[10:08:41] <jynus>	 !log stopping mysql on db1063 and db1051 to validate the depool before full reimage
[10:08:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:02] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Depool poolcounter1001 for kernel update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419135 (owner: 10Muehlenhoff)
[10:11:11] <wikibugs>	 (03PS3) 10Volans: Cumin masters in WMCS: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112)
[10:11:13] <wikibugs>	 (03PS4) 10Volans: Cumin masters in prod: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773)
[10:11:28] <volans>	 hashar: ^^^
[10:12:22] <volans>	 now I don't know if that fixes the situation if you already have cumin tbh, given that you already had it and was upgraded by unattended upgrades (that I think is wrong)
[10:17:36] <vgutierrez>	 !log upgrading esams LVSs to pybal 1.15.2
[10:17:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:35] <wikibugs>	 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4045911 (10ayounsi) a:03ayounsi Only looking at the asw ports with link up for now, using LibreNMS:  @Papaul If the ports with LLDP neighbors are correct, I can mass add th...
[10:20:54] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469)
[10:21:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:21:37] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add initialize_namespace.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/419139
[10:22:12] <wikibugs>	 (03PS3) 10BBlack: cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[10:22:14] <wikibugs>	 (03PS3) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[10:22:16] <wikibugs>	 (03PS3) 10BBlack: varnish: remove weekly restart cron entries [puppet] - 10https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315)
[10:22:31] <moritzm>	 !log rebooting DNS recursors in ulsfo and eqsin for kernel security update
[10:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:00] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469)
[10:23:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack)
[10:23:19] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move db1063 and db1051 to m1 and m2 respectively [puppet] - 10https://gerrit.wikimedia.org/r/419136 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:25:58] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4045920 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1063.eqiad.wmnet'] ``` The log...
[10:26:39] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4045922 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1051.eqiad.wmnet'] ``` The log...
[10:28:37] <wikibugs>	 (03Abandoned) 10BBlack: varnish: remove weekly restart cron entries [puppet] - 10https://gerrit.wikimedia.org/r/419091 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack)
[10:29:26] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4045940 (10elukey) This bit prevents the first couple of puppet runs to complete (and also yarn to start etc..):  ``` Error: Could not...
[10:30:27] <wikibugs>	 (03PS4) 10BBlack: cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315)
[10:30:29] <wikibugs>	 (03PS4) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[10:33:41] <_joe_>	 bblack: I like that approach
[10:35:47] <bblack>	 yeah my other approaches all had the transition issue that it was messy to switch from one to the other and deal with gaps the first week
[10:38:08] <bblack>	 you could argue the code is a bit redundantly-verbose now vs a more mathematically-compact form, but it's probably easier to follow when each of the period-cases is explicit and separate
[10:39:04] <_joe_>	 I prefer explicit, easy-to-understand code than mathematical compactness
[10:40:33] <wikibugs>	 (03CR) 10BBlack: [C: 031] "Compiler confirms (through another commit making use of this code) this leaves the existing cron entry at the original time, and adds a ne" [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack)
[10:41:03] <bblack>	 I never said it was easy-to-understand of course :)
[10:41:10] <bblack>	 (but it could be far worse!)
[10:41:58] <_joe_>	 bblack: it's pretty easy to read cron_splay.rb and understand what's going on
[10:42:30] <_joe_>	 and instead of strange math constructs in the cron output, we have two cron lines, both weekly, correctly staggered
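The semiweekly approach described above emits two plain weekly cron lines staggered by half a week (3.5 days = 5040 minutes) instead of encoding the period with cron arithmetic. The following Python sketch shows the idea under stated assumptions. It is not the Ruby code in cron_splay.rb. Seeding from a SHA-256 hash of a string such as a hostname is an assumption, as are all the names. Weekday 0 is Sunday, following the usual cron convention.

```python
import hashlib

MINUTES_PER_WEEK = 7 * 24 * 60      # 10080
HALF_WEEK = MINUTES_PER_WEEK // 2   # 5040 minutes == 3.5 days


def splay_offset(seed):
    """Deterministically map a seed (e.g. a hostname) to a minute of the week."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % MINUTES_PER_WEEK


def to_cron_fields(minute_of_week):
    """Convert minute-of-week (0 = Sunday 00:00) to cron (minute, hour, weekday)."""
    weekday, rest = divmod(minute_of_week, 24 * 60)
    hour, minute = divmod(rest, 60)
    return minute, hour, weekday


def semiweekly_entries(seed):
    """Two weekly cron entries for one host, exactly half a week apart."""
    first = splay_offset(seed)
    second = (first + HALF_WEEK) % MINUTES_PER_WEEK
    return [to_cron_fields(first), to_cron_fields(second)]
```

In this sketch the first entry is the same slot a plain weekly splay would pick. That matches bblack's compiler check above, which found the existing entry left at its original time with a second entry added, and it avoids a gap during the transition.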
[10:42:48] <wikibugs>	 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4045978 (10mmodell) @akosiaris: **tl;dr**  I can't think of any reason that we //must// have `git-lfs` on masters, I've only got vague hand-wavy not...
[10:44:25] <wikibugs>	 (03CR) 10Hashar: "The poor --install-suggests does not seem to install the suggests :(" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans)
[10:44:37] <wikibugs>	 (03PS1) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1073 [puppet] - 10https://gerrit.wikimedia.org/r/419142 (https://phabricator.wikimedia.org/T188294)
[10:45:08] <wikibugs>	 (03CR) 10BBlack: [C: 032] cron_splay: add a semiweekly mode of operation [puppet] - 10https://gerrit.wikimedia.org/r/419089 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack)
[10:46:28] <wikibugs>	 (03PS2) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1073 [puppet] - 10https://gerrit.wikimedia.org/r/419142 (https://phabricator.wikimedia.org/T188294)
[10:46:39] <volans>	 hashar: have you tried to apt-get remove cumin first?
[10:46:55] <bblack>	 why would anyone remove cumin? :P
[10:47:10] <wikibugs>	 (03CR) 10Elukey: [C: 032] Assign role::analytics_cluster::hadoop::worker to analytics1073 [puppet] - 10https://gerrit.wikimedia.org/r/419142 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey)
[10:47:29] <hashar>	 volans: yeah with apt purge
[10:47:41] <hashar>	 apt -q -y purge cumin; /usr/bin/apt-get -y -o DPkg::Options::=--force-confold -o APT::Install-Suggests=1 install cumin && apt-cache policy python3-keystoneauth1 python3-keystoneclient python3-novaclient
[10:47:42] <hashar>	 ;D
[10:48:29] <volans>	 bblack: to make puppet install the suggested ones in labs :D
[10:49:00] <volans>	 hashar: no my suggestion was to apt-get remove cumin and let puppet install it with the suggests
[10:49:32] <volans>	 with the latest patch
[10:49:40] * hashar tries
[10:50:20] <hashar>	 volans: yes that the same deal :(
[10:50:59] <volans>	 it doesn't install them?
[10:52:45] <wikibugs>	 (03PS1) 10Odder: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448)
[10:52:48] <wikibugs>	 10Operations, 10Puppet, 10User-fgiunchedi: Update jmx_exporter mbeans whitelist for puppetdb 4 - https://phabricator.wikimedia.org/T189516#4046017 (10fgiunchedi)
[10:53:58] <hashar>	 volans: yeah apt-get with --install-suggests does not install any of the Suggests  packages :^/
[10:56:43] <volans>	 hashar: that's not what I see on my test host
[10:56:54] <wikibugs>	 (03CR) 10BBlack: [C: 032] varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack)
[10:56:59] <wikibugs>	 (03PS5) 10BBlack: varnish: restart backends every 3.5 days [puppet] - 10https://gerrit.wikimedia.org/r/419090 (https://phabricator.wikimedia.org/T181315)
[10:57:05] <hashar>	 volans: I guess the labs project has a broken apt config so :D
[10:57:25] <volans>	 dunno, but cannot debug right now, goal to finish
[10:57:32] <hashar>	 no worries
[10:59:57] <jynus>	 install... is... slow... packages... download... slowly
[11:00:49] <jynus>	 WARNING **: no packages matching running kernel 4.9.0-4-amd64 in archive
[11:01:45] <jynus>	 No kernel modules were found. This probably is due to a mismatch between the kernel used by this version of the installer :-(
[11:02:19] <jynus>	 let me guess, the point release happened
[11:02:22] <_joe_>	 jynus: yes
[11:02:36] <_joe_>	 yesterday 9.4 was released
[11:02:37] <jynus>	 I will see if I can fix it
[11:05:37] <wikibugs>	 (03PS6) 10BBlack: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema)
[11:10:10] <moritzm>	 jynus: there's a script, let me find it
[11:12:06] <jynus>	 moritzm: https://gerrit.wikimedia.org/r/#/c/292906/ ?
[11:12:07] <wikibugs>	 (03PS7) 10BBlack: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema)
[11:12:09] <moritzm>	 jynus: /home/faidon/update-netboot-stretch.sh on puppetmaster1001
[11:12:09] <wikibugs>	 (03PS7) 10BBlack: varnishslowlog: add Backend-Timing D=, in seconds [puppet] - 10https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema)
[11:12:43] <jynus>	 that commit probably needs an update
[11:13:12] <moritzm>	 yeah, should be for stretch. the cleanest way to fix this would be https://phabricator.wikimedia.org/T182699
[11:13:33] <moritzm>	 but needs some testing (T182699)
[11:13:34] <stashbot>	 T182699: Use firmware-enriched Debian installation images - https://phabricator.wikimedia.org/T182699
[11:13:36] <jynus>	 well, that is mostly secondary
[11:13:53] <jynus>	 the important stuff would be to automatize it as much as reasonable
[11:13:53] <wikibugs>	 (03CR) 10BBlack: [C: 032] varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema)
[11:13:59] <wikibugs>	 (03CR) 10BBlack: [C: 032] varnishslowlog: add Backend-Timing D=, in seconds [puppet] - 10https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema)
[11:16:00] <paravoid>	 s/ize/e/ :P
[11:16:48] <jynus>	 sorry
[11:16:52] <bblack>	 I prefer automagicifycation
[11:21:20] <moritzm>	 !log rebooting DNS recursors in esams for kernel security update
[11:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:07] <jynus>	 !log ran update-netboot-stretch.sh
[11:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:49] <jynus>	 I have made a backup, but will delete it if it works
[11:26:51] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1063.eqiad.wmnet'] ``` The log...
[11:27:10] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046130 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1051.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['db1051.eqiad.wmnet'] ```
[11:29:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add golang as a build-dependency [debs/etcd] - 10https://gerrit.wikimedia.org/r/419148
[11:29:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add golang as a build-dependency [debs/etcd] - 10https://gerrit.wikimedia.org/r/419148 (owner: 10Giuseppe Lavagetto)
[11:30:21] <logmsgbot>	 !log kartik@tin Started deploy [cxserver/deploy@30ff3b1]: Update cxserver to bd2ccfc
[11:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:21] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046150 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1051.eqiad.wmnet'] ``` The log...
[11:33:40] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046158 (10jcrespo) Installing...{F15259084}
[11:33:44] <wikibugs>	 10Operations, 10Community-Liaisons, 10Security-Reviews, 10Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#4046159 (10Aklapper) I'd start with asking what would be our use case of using Linesurvey in 2018.  This task lacks a desc of a problem that might get solved by Limesurvey (...
[11:33:51] <logmsgbot>	 !log kartik@tin Finished deploy [cxserver/deploy@30ff3b1]: Update cxserver to bd2ccfc (duration: 03m 30s)
[11:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:24] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=argon.eqiad.wmnet,service=kubemaster
[11:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:22] <wikibugs>	 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4046204 (10jhsoby) >>! In T188776#4021634, @Varnent wrote: >>>! In T188776#4021611, @Bawolff wrote: >> That sa...
[11:43:56] <_joe_>	 !log include our own etcd package (3.2.16) on stretch
[11:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:47] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046220 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ```  and were **ALL** successful.
[11:45:53] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:46:37] <akosiaris>	 me ^
[11:46:52] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational
[11:52:07] * akosiaris expects WATCHLIST to recover... let's see
[11:56:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: initial commit of 4.4.0-1 (031 comment) [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 (owner: 10Herron)
[11:56:52] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 2031 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:56:54] <godog>	 puppetdb not starting when using java 9 ^
[11:56:58] <godog>	 akosiaris: neat
[11:57:09] <wikibugs>	 (03PS1) 10Reedy: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537)
[11:57:11] <moritzm>	 we don't have Java 9 yet?
[11:57:26] <Reedy>	 jouncebot: next
[11:57:26] <jouncebot>	 In 1 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1300)
[11:58:09] <akosiaris>	 we need java 10 anyway :P
[11:58:15] <godog>	 moritzm: no I deluded myself into hoping it'd work with java 9 from stretch-backports
[11:58:36] <akosiaris>	 godog: yeah ok so theory confirmed. Now to just ignore that , now that I know what is going on
[11:58:43] <moritzm>	 Java is always happy to crush your hopes!
[11:59:22] <moritzm>	 !log rebooting DNS recursors in codfw for kernel security update
[11:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:55] <godog>	 akosiaris: indeed
[11:59:55] <wikibugs>	 (03CR) 10Reedy: [C: 032] Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) (owner: 10Reedy)
[12:01:10] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) (owner: 10Reedy)
[12:01:31] <wikibugs>	 (03CR) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419153 (https://phabricator.wikimedia.org/T188537) (owner: 10Reedy)
[12:01:37] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4046290 (10faidon) a:05faidon>03ayounsi > We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network. > > You can see...
[12:04:07] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/interwiki.php: T188537 (duration: 00m 57s)
[12:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:12] <stashbot>	 T188537: Please update wmf-config/interwiki.php following on-wiki updates - https://phabricator.wikimedia.org/T188537
[12:07:52] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 751629 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:07:52] <wikibugs>	 (03PS1) 10BBlack: eqsin: add ripe-atlas ping measurement monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419156 (https://phabricator.wikimedia.org/T179042)
[12:09:28] <wikibugs>	 (03CR) 10BBlack: [C: 032] eqsin: add ripe-atlas ping measurement monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419156 (https://phabricator.wikimedia.org/T179042) (owner: 10BBlack)
[12:15:15] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic, 10netops, 10Patch-For-Review: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4046328 (10BBlack)
[12:15:54] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic, 10netops, 10Patch-For-Review: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3711364 (10BBlack) 05Open>03Resolved >>! In T179042#4046290, @faidon wrote: > Only thing left is monitoring, right?  I think so AFAIK, and done above, showing...
[12:21:31] <wikibugs>	 (03PS1) 10Filippo Giunchedi: puppetmaster: export all puppetdb mbeans [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516)
[12:22:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Temporarily remove chromium from LVS name servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/419159
[12:26:19] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=argon.eqiad.wmnet,service=kubemaster
[12:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:57] <wikibugs>	 (03CR) 10Addshore: [C: 031] Enable Wikidata description override on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza)
[12:27:50] <wikibugs>	 (03CR) 10Addshore: "Awesome, removing the -1 because this patch is now based on the best patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza)
[12:29:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC for some reason doesn't show a diff https://puppet-compiler.wmflabs.org/compiler02/10422/" [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi)
[12:32:43] <akosiaris>	 !log reboot ganeti VMs on row_A in eqiad for cache=none setting. T181121
[12:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:49] <stashbot>	 T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121
[12:33:03] <wikibugs>	 (03PS1) 10ArielGlenn: remove snapshot01 from mediawiki scap list on beta for testing [puppet] - 10https://gerrit.wikimedia.org/r/419160
[12:33:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: "FWIW when the time comes we can merge this as-is since the only puppet masters using puppetdb too are in production (i.e. not labspuppetma" [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron)
[12:33:55] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] remove snapshot01 from mediawiki scap list on beta for testing [puppet] - 10https://gerrit.wikimedia.org/r/419160 (owner: 10ArielGlenn)
[12:35:26] <akosiaris>	 elukey: bohrium will get rebooted soon ^
[12:35:33] <akosiaris>	 fyi
[12:35:53] <elukey>	 ack!
[12:38:24] <wikibugs>	 (03PS1) 10Odder: Add high-density logos for seven Wikipedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618)
[12:41:55] <icinga-wm>	 RECOVERY - Host labtestneutron2002 is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms
[12:47:10] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::worker: use require instead of include [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294)
[12:48:15] <wikibugs>	 (03PS1) 10Odder: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578)
[12:51:59] <wikibugs>	 (03CR) 10Elukey: "my 2c: since this jvm is really important and iirc it may populate a ton of mbeans that can potentially be expensive to calculate, so I'd " [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi)
[12:52:51] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10423/analytics1030.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey)
[12:54:18] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4046439 (10fgiunchedi)
[12:54:22] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade hiera to stretch (version 3) - https://phabricator.wikimedia.org/T188623#4046437 (10fgiunchedi) 05Open>03Resolved This should be resolved as all patches are merged and rhodium is running hiera 3 and compiling fine.
[12:59:11] <icinga-wm>	 PROBLEM - SSH on install1002 is CRITICAL: connect to address 208.80.154.22 and port 22: Connection refused
[12:59:11] <icinga-wm>	 PROBLEM - dhclient process on install1002 is CRITICAL: Return code of 255 is out of bounds
[12:59:12] <icinga-wm>	 PROBLEM - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused
[12:59:12] <icinga-wm>	 PROBLEM - Check size of conntrack table on install1002 is CRITICAL: Return code of 255 is out of bounds
[13:00:03] <_joe_>	 uh
[13:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1300).
[13:00:04] <jouncebot>	 marlier: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <icinga-wm>	 RECOVERY - SSH on install1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[13:00:12] <icinga-wm>	 RECOVERY - dhclient process on install1002 is OK: PROCS OK: 0 processes with command name dhclient
[13:00:12] <icinga-wm>	 RECOVERY - Squid on install1002 is OK: TCP OK - 0.003 second response time on 208.80.154.22 port 8080
[13:00:12] <icinga-wm>	 RECOVERY - Check size of conntrack table on install1002 is OK: OK: nf_conntrack is 0 % full
[13:00:14] <zeljkof>	 I can SWAT today
[13:00:26] <zeljkof>	 marlier: around for SWAT?
[13:00:37] <_joe_>	 akosiaris: maybe we should wait swat to be done
[13:00:45] <_joe_>	 mwdebug1001/1002 are on ganeti
[13:00:57] <wikibugs>	 (03PS1) 10Urbanecm: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171
[13:01:06] <zeljkof>	 _joe_: something is happening?
[13:01:14] <_joe_>	 zeljkof: nothing
[13:01:22] <zeljkof>	 cool :)
[13:01:34] <_joe_>	 zeljkof: I was advising akosiaris to avoid rebooting mwdebug1* servers during SWAT :P
[13:01:47] <_joe_>	 we're not used to swat being this early :)
[13:01:50] <zeljkof>	 _joe_: please dont! :)
[13:02:00] <wikibugs>	 (03PS2) 10Urbanecm: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171
[13:02:02] <zeljkof>	 it surprises me too
[13:02:19] <marlier>	 zeljkof: would it be possible to deploy in about 10 minutes? 
[13:02:36] <zeljkof>	 marlier: sure, I'm around, let me know when you are ready
[13:02:43] <marlier>	 Had to step away from the computer but only for a moment
[13:02:50] <marlier>	 Thanks! 
[13:02:56] <zeljkof>	 _joe_: looks like swat will be 10 minutes late, if you can reboot in that time, go ahead
[13:03:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm)
[13:07:18] <wikibugs>	 (03PS1) 10Gilles: Upgrade to 1.16 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/419172 (https://phabricator.wikimedia.org/T186528)
[13:08:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: puppet: depool and reinstall puppetmaster2002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/419173 (https://phabricator.wikimedia.org/T184562)
[13:09:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo
[13:09:12] <icinga-wm>	 g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled
[13:09:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo
[13:09:31] <icinga-wm>	 g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled
[13:10:21] <godog>	 wah wah logstash
[13:10:26] <godog>	 I'll take a look
[13:10:33] <jynus>	 did 1008 crash?
[13:11:05] <akosiaris>	 rebooted 
[13:11:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[13:11:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[13:11:33] <jynus>	 ah, I didn't know that was on ganeti too
[13:11:45] <akosiaris>	 unfortunately it was closer, time-wise, to logstash1007 than it should have been
[13:11:49] <godog>	 ah that explains it 
[13:11:55] <akosiaris>	 I did leave 2 minutes of time between reboots
[13:12:03] <akosiaris>	 it looks like it wasn't enough for logstash
[13:12:33] <jynus>	 what is the availability mode for logstash? is it ok as long as one is up?
[13:12:48] <wikibugs>	 (PS1) Elukey: profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - https://gerrit.wikimedia.org/r/419175
[13:13:13] <moritzm>	 the logstash hosts which are on ganeti (1007-1009) don't hold elastic data
[13:13:14] <godog>	 I don't know offhand the numbers but I'd guess one ingestion host for logstash is enough to cope with the load yeah
[13:13:24] <moritzm>	 the others are on baremetal (1004-1006)
[13:13:40] <jynus>	 moritzm: I literally mean logstash and not elastic
[13:13:41] <moritzm>	 and need to be rebooted one by one until the cluster has recovered, usually takes 10 minutes
[13:13:59] <wikibugs>	 (PS1) Odder: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586)
[13:14:04] <marlier>	 zeljkof: I'm back, ready whenever works for you.
[13:14:17] <moritzm>	 those can be rebooted without interruption as long as they are depooled one by one
[13:14:30] <moritzm>	 no need for waiting time otherwise
[13:14:40] <zeljkof>	 marlier: I'm ready!
[13:14:45] <zeljkof>	 SWAT starts!
[13:14:52] <jynus>	 moritzm: you became an expert on reboots :-)
[13:14:59] <wikibugs>	 Operations, Ops-Access-Requests, Analytics, Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4039520 (Ottomata) I had emailed Dario about this before, and told him it might be hard, but on second thought, I think it isn'...
[13:15:11] <jynus>	 you are awarded with more reboots!
[13:15:24] <jynus>	 :-)
[13:15:29] <wikibugs>	 (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: Imarlier)
[13:15:31] <icinga-wm>	 RECOVERY - Disk space on labtestneutron2001 is OK: DISK OK
[13:15:51] <moritzm>	 yay!
[13:16:01] <icinga-wm>	 RECOVERY - DPKG on labtestneutron2001 is OK: All packages OK
[13:16:01] <icinga-wm>	 RECOVERY - dhclient process on labtestneutron2001 is OK: PROCS OK: 0 processes with command name dhclient
[13:16:01] <icinga-wm>	 RECOVERY - configured eth on labtestneutron2001 is OK: OK - interfaces up
[13:16:47] <wikibugs>	 (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: Filippo Giunchedi)
[13:16:59] <wikibugs>	 (Merged) jenkins-bot: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: Imarlier)
[13:18:25] <zeljkof>	 marlier: the patch is at mwdebug1002, please test and let me know if I can deploy
[13:18:30] <marlier>	 Verified
[13:18:32] <marlier>	 GTG
[13:18:39] <zeljkof>	 ok, deploying
[13:18:55] <wikibugs>	 (CR) jenkins-bot: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: Imarlier)
[13:19:01] <icinga-wm>	 RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:19:23] <wikibugs>	 Operations, netops: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4046514 (ayounsi) p:Triage>Low
[13:19:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on labtestneutron2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[13:20:07] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:417331|wmf-config: enable Singapore oversample as default on all wikis (T188652)]] (duration: 00m 57s)
[13:20:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:13] <stashbot>	 T188652: Enable oversampling for Singapore - https://phabricator.wikimedia.org/T188652
[13:20:26] <zeljkof>	 marlier: deployed! please test and thanks for deploying with #releng ;)
[13:20:44] <zeljkof>	 no other patches for swat?
[13:20:46] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4046529 (Ottomata) Ah thanks, yeah, I meant to get back to this the next day but we got other thinnngnggs
[13:20:58] <zeljkof>	 !log EU SWAT finished
[13:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:42] <wikibugs>	 (CR) Ottomata: "Hm, ok!  I hope this doesn't break anything!  Not that it will, but it seems like there was probably a reason it was in this group." [puppet] - https://gerrit.wikimedia.org/r/419111 (owner: Elukey)
[13:23:51] <wikibugs>	 (CR) Ottomata: [C: 1] profile::hadoop::worker: use require instead of include [puppet] - https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) (owner: Elukey)
[13:24:10] <marlier>	 zeljkof: confirmed that change is live everywhere.  Appreciate it!
[13:27:08] <zeljkof>	 marlier: /me thumbs up ;)
[13:28:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - https://gerrit.wikimedia.org/r/419175 (owner: Elukey)
[13:29:14] <wikibugs>	 (PS1) Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991)
[13:29:25] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, cloud-services-team, Patch-For-Review: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#4046549 (chasemp) Open>Resolved closing for now
[13:29:27] <jynus>	 !log stop db1001 for maintenance (proxies will temporarily complain about lack of redundancy)
[13:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[13:30:30] <wikibugs>	 Operations, cloud-services-team, Epic: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#4046555 (chasemp)
[13:30:42] <wikibugs>	 (PS2) Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991)
[13:31:10] * moritzm shakes fist at the commit message CI check
[13:31:22] <icinga-wm>	 RECOVERY - Long running screen/tmux on labtestneutron2001 is OK: OK: No SCREEN or tmux processes detected.
[13:31:27] <wikibugs>	 Operations, cloud-services-team, Epic: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (chasemp)
[13:31:58] <wikibugs>	 Operations, cloud-services-team, Epic: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (chasemp)
[13:32:02] <icinga-wm>	 PROBLEM - Disk space on labtestneutron2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:32:43] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[13:33:02] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[13:33:05] <wikibugs>	 (PS1) Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470)
[13:33:27] <marostegui>	 jynus: ^ is that you?
[13:33:30] <marostegui>	 ah yes
[13:33:33] <marostegui>	 missed the !log :)
[13:33:34] <marostegui>	 sorry
[13:33:54] <jynus>	 yes
[13:34:09] <jynus>	 everything is ok
[13:34:15] <marostegui>	 :)
[13:34:43] <jynus>	 the other option, downtiming the proxies, would prevent us from seeing a real outage
[13:36:02] <icinga-wm>	 RECOVERY - Disk space on labtestneutron2002 is OK: DISK OK
[13:37:46] <wikibugs>	 Operations, ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4046577 (akosiaris) All row_A eqiad VMs have been rebooted with cache=none. We are now again in a waiting period.
[13:39:35] <wikibugs>	 (PS1) Gehel: wdqs: disable kafka poller on new wdqs-internal cluster [puppet] - https://gerrit.wikimedia.org/r/419181
[13:39:54] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4046578 (chasemp) @madhuvishy could you review and potentially close this round of cleanup? :D
[13:40:20] <wikibugs>	 (CR) Gehel: [C: 2] wdqs: disable kafka poller on new wdqs-internal cluster [puppet] - https://gerrit.wikimedia.org/r/419181 (owner: Gehel)
[13:42:14] <wikibugs>	 (PS2) Ottomata: Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297)
[13:43:20] <wikibugs>	 (CR) MarcoAurelio: Initial configuration for euwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/419171 (owner: Urbanecm)
[13:43:22] <wikibugs>	 (PS1) Alexandros Kosiaris: kubernetes: Ignore WATCHLIST latencies as well [puppet] - https://gerrit.wikimedia.org/r/419182
[13:44:04] <wikibugs>	 (CR) Elukey: [C: 1] Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) (owner: Ottomata)
[13:44:34] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] kubernetes: Ignore WATCHLIST latencies as well [puppet] - https://gerrit.wikimedia.org/r/419182 (owner: Alexandros Kosiaris)
[13:49:41] <wikibugs>	 (Abandoned) Elukey: profile::analytics::refinery::job::sqoop_mediawiki: add stdout redirect to crons [puppet] - https://gerrit.wikimedia.org/r/415849 (owner: Elukey)
[13:51:58] <wikibugs>	 Operations, cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#4046607 (chasemp)
[13:52:06] <wikibugs>	 (PS1) Muehlenhoff: Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - https://gerrit.wikimedia.org/r/419183
[13:52:34] <wikibugs>	 (PS2) Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470)
[13:52:36] <wikibugs>	 (CR) Vgutierrez: [C: 1] Temporarily remove chromium from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419159 (owner: Muehlenhoff)
[13:52:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - https://gerrit.wikimedia.org/r/419183 (owner: Muehlenhoff)
[13:53:06] <wikibugs>	 (PS2) Elukey: profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - https://gerrit.wikimedia.org/r/419175
[13:54:10] <wikibugs>	 (CR) Elukey: [C: 2] profile::hadoop::prometheus_jmx_exporter: include rpc metrics for each port [puppet] - https://gerrit.wikimedia.org/r/419175 (owner: Elukey)
[13:56:06] <moritzm>	 hashar: CI tests are failing due to disk space depletion: "mv: cannot create regular file ‘/srv/workspace/log/admin-0.log’: No space left on device", see e.g. https://gerrit.wikimedia.org/r/419183 
[13:57:02] <hashar>	 argh
[13:57:05] <hashar>	 zeljkof: ^^
[13:57:26] <zeljkof>	 hashar: yes, that problem :)
[13:57:33] <hashar>	 stupid jobs
[13:58:12] <wikibugs>	 (PS3) Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470)
[13:59:21] <hashar>	 zeljkof: I have just deleted a bunch of build workspaces
[13:59:25] <hashar>	 solved :)
[13:59:43] <zeljkof>	 hashar: thanks!
[14:01:10] <wikibugs>	 (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/419183 (owner: Muehlenhoff)
[14:01:25] <wikibugs>	 Operations, Cloud-Services, wikitech.wikimedia.org, cloud-services-team (Kanban): Set up external DNS record for wikitech-static - https://phabricator.wikimedia.org/T164290#3228802 (chasemp) > Simple solution: all opsen and devs who would need wikitech-static should put in a commented line in the...
[14:02:21] <wikibugs>	 Operations, Cloud-Services, cloud-services-team (Kanban): Investigate alternative RAID strategies for labstore1001/2 - https://phabricator.wikimedia.org/T162090#4046631 (chasemp) Open>Invalid These are going to be decommissioned just as soon as we get labstore1008/1009 online.
[14:02:24] <wikibugs>	 Operations, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#4046633 (chasemp)
[14:02:26] <wikibugs>	 (PS2) Muehlenhoff: Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - https://gerrit.wikimedia.org/r/419183
[14:04:14] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1001 is OK: OK check_failover servers up 2 down 0
[14:04:24] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add chicocvenancio to LDAP users (for cn=wmf) [puppet] - https://gerrit.wikimedia.org/r/419183 (owner: Muehlenhoff)
[14:04:45] <wikibugs>	 Operations, Patch-For-Review, cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#4046635 (chasemp)
[14:04:54] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1006 is OK: OK check_failover servers up 2 down 0
[14:11:41] <wikibugs>	 (PS2) Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919)
[14:11:51] <chasemp>	 !log add chico to wmf-nda (verified nda things with moritz and all the goodness)
[14:11:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) (owner: Alexandros Kosiaris)
[14:12:32] <wikibugs>	 Operations, Analytics, ChangeProp, EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4046639 (Pchelolo) > Add a second LVS IP, to be served from the same cluster, to use for videoscaling. This will guarantee we evenly distrib...
[14:14:18] <wikibugs>	 (PS2) Ottomata: Use roundrobin partition.assignment.strategy for Kafka MirrorMaker [puppet] - https://gerrit.wikimedia.org/r/418934 (https://phabricator.wikimedia.org/T189464)
[14:18:14] <wikibugs>	 (PS2) Muehlenhoff: Temporarily remove chromium from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419159
[14:20:00] <wikibugs>	 (CR) Ottomata: [C: 2] Use roundrobin partition.assignment.strategy for Kafka MirrorMaker [puppet] - https://gerrit.wikimedia.org/r/418934 (https://phabricator.wikimedia.org/T189464) (owner: Ottomata)
[14:23:17] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Temporarily remove chromium from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419159 (owner: Muehlenhoff)
[14:23:23] <wikibugs>	 (PS3) Muehlenhoff: Temporarily remove chromium from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419159
[14:24:46] <wikibugs>	 (PS3) Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919)
[14:25:07] <wikibugs>	 (PS2) Elukey: profile::hadoop::worker: use require instead of include [puppet] - https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294)
[14:27:15] <wikibugs>	 (PS4) Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919)
[14:28:23] <wikibugs>	 (CR) Elukey: [C: 2] profile::hadoop::worker: use require instead of include [puppet] - https://gerrit.wikimedia.org/r/419165 (https://phabricator.wikimedia.org/T188294) (owner: Elukey)
[14:28:25] <wikibugs>	 (CR) MarcoAurelio: "logstash-beta is no longer public; let's just stop collecting IPs for abusefilter" [mediawiki-config] - https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: MarcoAurelio)
[14:28:46] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 1829 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[14:28:56] <icinga-wm>	 RECOVERY - Request latencies on acrux is OK: OK - apiserver_request_latencies is 1698 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[14:32:36] <Hauskatze>	 zeljkof: do you remember why https://gerrit.wikimedia.org/r/#/c/417189/ didn't go through?
[14:37:06] <wikibugs>	 (PS1) Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1074 [puppet] - https://gerrit.wikimedia.org/r/419188 (https://phabricator.wikimedia.org/T188294)
[14:37:16] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.088 second response time
[14:37:53] <zeljkof>	 Hauskatze: we had some problems with deployment yesterday, there was time for only one patch
[14:37:56] <wikibugs>	 (CR) Elukey: [C: 2] Assign role::analytics_cluster::hadoop::worker to analytics1074 [puppet] - https://gerrit.wikimedia.org/r/419188 (https://phabricator.wikimedia.org/T188294) (owner: Elukey)
[14:39:11] <_joe_>	 uhm what's up with wikidata?
[14:39:30] <_joe_>	 Amir1: any idea? should I look?
[14:39:49] <Amir1>	 _joe_: what's up
[14:40:02] <_joe_>	  wikidata.org dispatch lag is higher than 300s
[14:40:07] <Amir1>	 let me check
[14:40:58] <_joe_>	 lag according to scripts running on terbium is ~ 400 seconds on some wikis
[14:41:10] <Amir1>	 _joe_: dispatching seems fine: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1
[14:42:02] <moritzm>	 !log rebooting chromium for kernel security update
[14:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:01] <_joe_>	 yeah I'm looking at numbers on terbium
[14:43:30] <Amir1>	 _joe_: I'll keep monitoring it, and if it starts to get really bad I'll do something
[14:43:46] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190
[14:43:50] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190
[14:45:36] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190 (owner: Marostegui)
[14:47:18] <Lucas_WMDE>	 looks like size change on wikidata-edits went up at the same time as dispatch lag started growing https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1&panelId=3&fullscreen&from=1520909213004&to=1520952413005
[14:47:25] <wikibugs>	 (PS1) Muehlenhoff: Revert "Temporarily remove chromium from LVS name servers in eqiad" [puppet] - https://gerrit.wikimedia.org/r/419191
[14:47:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190 (owner: Marostegui)
[14:47:34] <wikibugs>	 (PS4) Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470)
[14:48:49] <marostegui>	 zeljkof hasharAway : https://integration.wikimedia.org/ci/job/operations-mw-config-php55lint/19880/console -> fatal: write error: No space left on device
[14:49:08] <zeljkof>	 marostegui: I think hasharAway fixed it
[14:49:16] <marostegui>	 It just happened
[14:49:20] <marostegui>	 Like a minute ago XD
[14:49:24] <marostegui>	 should I recheck?
[14:49:48] <zeljkof>	 marostegui: hm, probably :)
[14:49:53] <zeljkof>	 it might be broken again
[14:49:56] <wikibugs>	 (CR) Marostegui: [C: 2] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190 (owner: Marostegui)
[14:49:59] <marostegui>	 let's see!
[14:50:03] <zeljkof>	 > 14:59 <hashar> zeljkof: I have just deleted a bunch of build workspaces
[14:50:19] <zeljkof>	 it was 50 minutes ago, if it happened again, it might be broken again :(
[14:51:50] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190 (owner: Marostegui)
[14:51:51] <jynus>	 !log stopping db2044 (this will make proxies complain about redundancy)
[14:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:58] <marostegui>	 zeljkof: looks like it worked this time
[14:52:05] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/419190 (owner: Marostegui)
[14:52:09] <zeljkof>	 marostegui: great!
[14:52:16] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.086 second response time
[14:52:21] <wikibugs>	 (CR) Rush: openstack: rabbit codify nova user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[14:53:10] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 after alter table (duration: 00m 57s)
[14:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:11] <wikibugs>	 (PS1) Volans: Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563)
[14:54:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[14:54:28] <paladox>	 zeljkof marostegui there's a task for this
[14:54:37] <paladox>	 https://phabricator.wikimedia.org/T189587
[14:54:42] <wikibugs>	 (PS7) ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915)
[14:54:43] <marostegui>	 paladox: thanks!
[14:54:49] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089)
[14:54:51] <paladox>	 you're welcome :)
[14:54:51] <wikibugs>	 (PS5) Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919)
[14:54:53] <volans>	 hasharAway:  integration-slave-jessie-1001 error: unable to create temporary file: No space left on device
[14:55:05] <paladox>	 volans https://phabricator.wikimedia.org/T189587
[14:55:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] cheap image dump script that might be ok for wikitech [dumps] - https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) (owner: ArielGlenn)
[14:55:14] <vgutierrez>	 !log upgrading codfw LVSs to pybal 1.15.2
[14:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:22] <volans>	 paladox: thx
[14:55:22] <marostegui>	 volans: I just suffered that too, so… zeljkof, it is not completely fixed indeed
[14:55:29] <zeljkof>	 paladox, volans: uh oh, looks like we will have to wait for hasharAway to get back
[14:55:35] <paladox>	 yep
[14:55:36] <volans>	 I can have a look
[14:56:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[14:56:26] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[14:56:49] <volans>	 12G jenkins-workspace; 7.3G pbuilder
[14:56:56] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[14:59:42] <wikibugs>	 (PS6) Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919)
[15:00:58] <wikibugs>	 (CR) Volans: "The VMs will be created in row C in eqiad and row B in codfw." [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[15:02:00] <wikibugs>	 (PS1) Rush: neutron dummies for rabbit and db [labs/private] - https://gerrit.wikimedia.org/r/419196
[15:02:53] <wikibugs>	 (PS8) ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915)
[15:03:01] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Revert "Temporarily remove chromium from LVS name servers in eqiad" [puppet] - https://gerrit.wikimedia.org/r/419191 (owner: Muehlenhoff)
[15:03:23] <wikibugs>	 (PS2) Volans: Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563)
[15:03:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[15:03:53] <Amir1>	 it's recovering 
[15:15:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:15:31] <wikibugs>	 (PS1) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[15:16:01] <wikibugs>	 (PS7) Alexandros Kosiaris: Populate kubeconfigs on deployment server [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919)
[15:16:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[15:20:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10432/ says PCC is happy and a quick overview of the change catalog looks fine, merging" [puppet] - https://gerrit.wikimedia.org/r/416950 (https://phabricator.wikimedia.org/T184919) (owner: Alexandros Kosiaris)
[15:21:28] <godog>	 volans: any joy resurrecting that jenkins worker?
[15:21:49] <volans>	 godog: I just took a quick look; we have a lot of used space in the jenkins workspace for the android app stuff
[15:21:57] <volans>	 and in pbuilder
[15:22:03] <volans>	 but I'm unsure what is safe to delete
[15:22:54] <paladox>	 volans I think normally we do rm -rf * in the workspace
[15:23:15] <godog>	 volans: ah :(
[15:23:46] <wikibugs>	 (PS1) Alexandros Kosiaris: Force string for file mode in yaml [puppet] - https://gerrit.wikimedia.org/r/419201
[15:24:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] Force string for file mode in yaml [puppet] - https://gerrit.wikimedia.org/r/419201 (owner: Alexandros Kosiaris)
[15:24:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[15:24:46] <volans>	 is there any wikitech doc about it?
[15:25:46] <godog>	 not that I could find, e.g. https://wikitech.wikimedia.org/wiki/Jenkins
[15:26:35] <godog>	 though https://phabricator.wikimedia.org/T126176 is similar
[15:27:56] <volans>	 the full partition is /srv, but yeah, same content apparently
[15:30:08] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Add initialize_namespace.sh [deployment-charts] - https://gerrit.wikimedia.org/r/419139 (owner: Alexandros Kosiaris)
[15:32:44] <jynus>	 !log upgrade and restart dbproxy1001
[15:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:46] <vgutierrez>	 !log upgrading eqiad LVSs to pybal 1.15.2
[15:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:03] <wikibugs>	 (PS1) Alexandros Kosiaris: Add tiller/deploy RBAC clusterroles [deployment-charts] - https://gerrit.wikimedia.org/r/419203
[15:37:25] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Add tiller/deploy RBAC clusterroles [deployment-charts] - https://gerrit.wikimedia.org/r/419203 (owner: Alexandros Kosiaris)
[15:39:19] <jynus>	 !log upgrade and restart dbproxy1007
[15:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:26] <wikibugs>	 (PS1) Alexandros Kosiaris: Add apiVersion attribute to deploy ClusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/419206
[15:40:27] <wikibugs>	 (CR) Marostegui: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[15:43:12] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[15:44:24] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[15:46:17] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 for alter table (duration: 00m 56s)
[15:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:09] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/419194 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[15:50:51] <marostegui>	 !log Deploy schema change on db1097:3314 - T187089 T185128 T153182
[15:50:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:58] <stashbot>	 T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[15:50:59] <stashbot>	 T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[15:50:59] <stashbot>	 T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[15:55:51] <wikibugs>	 (PS3) Chico Venancio: Add Chicocvenancio's key for Cloud Services [labs/private] - https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273)
[15:56:00] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Add apiVersion attribute to deploy ClusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/419206 (owner: Alexandros Kosiaris)
[15:58:09] <wikibugs>	 (PS1) Odder: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745)
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:01:15] <wikibugs>	 (PS6) Vgutierrez: pybal: Prometheus based icinga check for BGP established sessions [puppet] - https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085)
[16:02:41] <wikibugs>	 Operations, Pybal, Traffic, Patch-For-Review: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151#3682454 (Vgutierrez) Pybal 1.15.2 has been successfully deployed in our LVSs, the UDP monitor is now available.
[16:05:29] <wikibugs>	 Operations, Puppet: puppetdb4: systemd config review - https://phabricator.wikimedia.org/T187257#3969415 (fgiunchedi) We'd still need the oom settings to help debugging oom cases we've seen on nitrogen for example. Passing a directory instead of a file to `-XX:HeapDumpPath` will create dump files with pi...
[16:11:07] <wikibugs>	 (PS1) Jcrespo: dbproxy: switchover m1 and m2 master reference [dns] - https://gerrit.wikimedia.org/r/419216 (https://phabricator.wikimedia.org/T183469)
[16:13:51] <wikibugs>	 (CR) Marostegui: [C: 1] dbproxy: switchover m1 and m2 master reference [dns] - https://gerrit.wikimedia.org/r/419216 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[16:14:21] <wikibugs>	 (CR) Rush: [V: 2 C: 2] neutron dummies for rabbit and db [labs/private] - https://gerrit.wikimedia.org/r/419196 (owner: Rush)
[16:15:48] <wikibugs>	 (CR) Jcrespo: [C: 2] dbproxy: switchover m1 and m2 master reference [dns] - https://gerrit.wikimedia.org/r/419216 (https://phabricator.wikimedia.org/T183469) (owner: Jcrespo)
[16:16:15] <awight>	 How do I log into https://logstash-beta.wmflabs.org ? My wikitech creds don’t work, maybe I’ve forgotten how this works?
[16:18:23] <jynus>	 !log update CNAME for m1-master and m2-master
[16:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:44] <wikibugs>	 Operations, ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4047089 (Papaul) a:Papaul>faidon Done
[16:21:09] <paladox>	 hmm https://gerrit.wikimedia.org/r/ is not loading for me
[16:21:32] <paladox>	 no_justification ^^
[16:21:48] <jynus>	 I may have broken it
[16:21:52] <jynus>	 let me revert
[16:22:19] <marostegui>	 it works for me
[16:22:21] <paladox>	 works now :)
[16:22:25] <jynus>	 mmm
[16:22:28] <jynus>	 should I wait?
[16:22:31] <marostegui>	 yeah
[16:22:33] <marostegui>	 don't revert
[16:22:43] <jynus>	 let me prepare the revert at least
[16:22:46] <marostegui>	 it works for me so far
[16:23:08] <marostegui>	 it seems stalled now
[16:23:15] <jynus>	 if we don't have gerrit, we will not be able to revert
[16:23:23] <marostegui>	 right
[16:23:27] <paladox>	 yeh seems stalled now
[16:23:36] <paladox>	 stuck on "Working ..."
[16:24:01] <jynus>	 let me kick the process
[16:24:19] <marostegui>	 yeah that might be it
[16:24:38] <no_justification>	 hmmm
[16:24:45] <no_justification>	 It's working for me, just hella slow
[16:25:07] <jynus>	 no_justification: which host is gerrit running on?
[16:25:15] <paladox>	 cobalt
[16:25:15] <no_justification>	 cobalt, signing in now
[16:25:22] <_joe_>	 gerrit1001, no?
[16:25:35] <_joe_>	 oh no sorry, cobalt in eqiad
[16:26:10] <no_justification>	 Bunch of mysql packet failures
[16:26:11] <jynus>	 !log restarting gerrit on cobalt, stalled
[16:26:15] <no_justification>	 Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
[16:26:15] <no_justification>	 The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
[16:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:23] <marostegui>	 no_justification: yeah, we were changing some proxy related stuff
[16:26:37] <_joe_>	 clearly gerrit doesn't like that
[16:26:52] <marostegui>	 it was a cname change only
[16:26:56] <jynus>	 it is restarting
[16:27:50] <paladox>	 gerrit won't start if it cannot connect to the db
[16:28:02] <jynus>	 it did
[16:28:27] <jynus>	 it restarted, but it didn't come back
[16:28:38] <marostegui>	 jynus: FW rules on the proxies maybe?
[16:28:40] <marostegui>	 let me check
[16:28:48] <paladox>	 jynus that's the systemd service. Gerrit will try to start but if it cannot it will eventually fail
[16:28:53] <paladox>	 like gerrit2001
[16:29:25] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki]
[16:29:35] <jynus>	 yeah, firewall then
[16:29:35] <icinga-wm>	 PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused
[16:29:48] <jynus>	 let's clear all firewall rules, then
[16:30:00] <_joe_>	 we can add a rule for the specific address if you want to
[16:30:14] <_joe_>	 but yeah clearing them is faster
[16:30:22] <_joe_>	 remember to disable puppet
[16:30:27] <jynus>	 gerrit is not on 10.x
[16:31:02] <_joe_>	 we can also go back to the old dns record if needed
[16:31:05] <icinga-wm>	 PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[16:31:38] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4047131 (madhuvishy)
[16:31:39] <marostegui>	 I have cleared rules in dbproxy1007
[16:31:46] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui]
[16:32:07] <marostegui>	 jynus: can you restart gerrit again?
[16:32:14] <jynus>	 !log restarting gerrit on cobalt, stalled
[16:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:19] <_joe_>	 I can connect from cobalt to dbproxy1007 now
[16:32:25] <marostegui>	 \o/
[16:32:35] <icinga-wm>	 PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer]
[16:32:36] <jynus>	 but gerrit doesn't work 
[16:32:46] <icinga-wm>	 PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Service[eventlogging/init]
[16:32:58] <wikibugs>	 Operations, ops-codfw, hardware-requests, Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4047140 (Papaul) @robh thanks
[16:33:01] <marostegui>	 !log Retroactive: cleared iptables rules on dbproxy1007
[16:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:12] <jynus>	 it does now?
[16:33:24] <marostegui>	 works yes
[16:33:26] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[16:33:32] <marostegui>	 quite slow, but loads for me
[16:33:36] <icinga-wm>	 RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.14.6-7-g55dde9d68b (SSHD-CORE-1.4.0) (protocol 2.0)
[16:33:41] <_joe_>	 it's coming back
[16:34:10] <marostegui>	 fully works for me now
[16:34:17] <_joe_>	 let the jvm heat the glow plugs people
[16:34:25] <jynus>	 but if gerrit wasn't on a 10.x and that was down
[16:34:30] <jynus>	 maybe otrs was down, too
[16:34:35] <jynus>	 or other services
[16:34:46] <icinga-wm>	 PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer]
[16:34:55] <_joe_>	 jynus: while you add the firewall rule, I can check
[16:35:07] <jynus>	 can you check otrs?
[16:35:28] <marostegui>	 otrs login mainpage works for me
[16:35:53] <jynus>	 yeah, but it probably needs something more complex to access the db
[16:35:55] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4047147 (madhuvishy)
[16:35:58] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4047145 (madhuvishy) Resolved>Open @Kolossos I see utilization has climbed up again to over 600G. How can we ensure we don't have to keep makin...
[16:36:13] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4047149 (madhuvishy) p:High>Normal
[16:36:15] <wikibugs>	 Operations, ops-codfw, Patch-For-Review, User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4047148 (Papaul) @Joe ok . For now I have 5 new servers in A4 and 7 new servers in B3. so moving all the new server in A3 to B3, B3 will have a total of 12 new server...
[16:36:35] <icinga-wm>	 PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[16:36:37] <jynus>	 I think the wise thing is to revert the change
[16:36:48] <jynus>	 and reanalyze later
[16:37:06] <_joe_>	 why? we're not in a failure state right now
[16:37:15] <marostegui>	 I wouldn't revert either no
[16:37:18] <jynus>	 but services with a pool of connections
[16:37:22] <bblack>	 Oh the Thinks You Can Think
[16:37:23] <_joe_>	 and you can actually see the rejected packets on dbproxy1007
[16:37:37] <jynus>	 may fail with some delay
[16:37:46] <jynus>	 or
[16:37:57] <bblack>	 ( https://en.wikipedia.org/wiki/Oh,_the_Thinks_You_Can_Think! )
[16:38:14] <jynus>	 we dont revert, but clear the iptables of dbproxy1001
[16:38:25] <icinga-wm>	 PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Service[eventlogging/init]
[16:38:33] <_joe_>	 no errors from otrs
[16:38:52] <_joe_>	 AFAICT
[16:39:15] <wikibugs>	 Operations, Patch-For-Review, cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#4047174 (Bstorm)
[16:40:14] <marostegui>	 jynus: do you want me to take care of dbproxy1001?
[16:40:18] <jynus>	 _joe_: the problem is that it happened not just for otrs
[16:40:52] <jynus>	 it was 2 proxies we changed the alias to, serving all misc dbs except phabricator, eventlogging and cloud
[16:41:11] <jynus>	 and I assumed all those services were on 10.x networks
[16:41:38] <marostegui>	 etherpad is on dbproxy1001 no?
[16:41:44] <marostegui>	 well, you know what I mean XD
[16:42:08] <jynus>	 yeah, let's clear dbproxy1001 firewall
[16:42:21] <marostegui>	 I will do that
[16:42:23] <jynus>	 and add it slowly instead of the other way round
[16:42:48] <marostegui>	 for the record: etherpad and librenms work now
[16:42:49] <_joe_>	 AFAICT, we should just allow DOMAIN_NETWORKS as a srange
[16:42:58] <marostegui>	 going to clear the rules anyways
[16:42:58] <_joe_>	 instead of INTERNAL_NETWORKS
[16:43:02] <_joe_>	 wait
[16:43:07] <marostegui>	 waiting
[16:43:34] <jynus>	 _joe_: the proxy was a TODO
[16:43:44] <jynus>	 the firewall for the proxy
[16:43:48] <jynus>	 but I forgot about it
[16:44:15] <_joe_>	 so what's the puppet code that creates the proxy rules on dbproxy1001/7?
[16:44:32] <marostegui>	 dbproxy1006 (the old one) has no rules, that is why I wanted to leave dbproxy1001 the same as 1006, just to make sure everything works as it used to before the cname change
[16:45:24] <_joe_>	 marostegui: ok go on
[16:45:28] <marostegui>	 k
[16:45:34] <_joe_>	 profile::mariadb::proxy::firewall in hiera is what we have to change
[16:45:59] <_joe_>	 well, no, the def within there
[16:45:59] <marostegui>	 !log Clean iptables rules on dbproxy1001 to leave it as dbproxy1006
[16:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:23] <jynus>	 so there is a hiera rule to apply a ruleset or not
[16:46:28] <marostegui>	 dbproxy1001 is clean now
[16:46:29] <_joe_>	 it's more complex than I thought it would be.
[16:46:37] <_joe_>	 profile::mariadb::ferm defaults to srange  => '$INTERNAL',
[16:46:42] <_joe_>	 without overrides
[16:46:53] <jynus>	 yeah, we just have to create a new ruleset
[16:46:56] <_joe_>	 so we need to change that, and make the range configurable
[16:47:00] <jynus>	 that is the one we use for core dbs
[16:47:09] <_joe_>	 or to add a simple ferm::service
[16:48:26] <marostegui>	 guys you mind if I logoff? I need to take care of some stuff and I think stuff is under control now
[16:48:35] <jynus>	 yes, sorry
[16:48:57] <_joe_>	 marostegui: did you disable puppet on dbproxy1001?
[16:49:00] <jynus>	 I will just add a new variable other than cloud and internal
[16:49:04] <marostegui>	 _joe_: nope
[16:49:09] <jynus>	 ferm should only create the rules on start
[16:49:13] <marostegui>	 i can do it now
[16:49:22] <jynus>	 it alerts because of that
[16:49:23] <_joe_>	 marostegui: I'm doing it
[16:49:25] <marostegui>	 but I don't think puppet will add the rules back
[16:49:42] <_joe_>	 no, it should not, but better to control when it gets executed
[16:49:47] <jynus>	 don't worry, _joe_ marostegui I will take care of this
[16:49:49] <marostegui>	 i ran puppet to check, and they were not added
[16:50:07] <jynus>	 I just need to check the right rules for m1 and m2 services
[16:50:24] <_joe_>	 ok, I'm happy to review the changes
[16:51:31] <jynus>	 the quick change is to set the proxy firewall as = disabled until I fine-tune
[16:51:48] <_joe_>	 I would go that way tbh
[16:51:53] <marostegui>	 +1
[16:51:54] <_joe_>	 and have time to do things properly
[16:52:11] <jynus>	 oh, I will, just to make sure puppet and the current state match
[16:52:44] <jynus>	 or basically, previous state
[16:52:50] <marostegui>	 yeah, exactly
[16:53:03] <marostegui>	 Have to go guys, ring me if needed!
[16:53:14] <jynus>	 then we can test the changes on the passive host to not create more outages
[16:54:49] <wikibugs>	 (CR) BryanDavis: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: BryanDavis)
[16:57:46] <icinga-wm>	 RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:58:26] <icinga-wm>	 RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[16:58:35] <wikibugs>	 (PS1) Jcrespo: dbproxy: Disable temporarily firewall on the active proxy for m1 & m2 [puppet] - https://gerrit.wikimedia.org/r/419221
[16:58:58] <jynus>	 ^this is the bad fix, I will now work on the proper one
[16:59:20] <wikibugs>	 (CR) Marostegui: [C: 1] dbproxy: Disable temporarily firewall on the active proxy for m1 & m2 [puppet] - https://gerrit.wikimedia.org/r/419221 (owner: Jcrespo)
[16:59:24] <wikibugs>	 (CR) Jcrespo: [C: 2] dbproxy: Disable temporarily firewall on the active proxy for m1 & m2 [puppet] - https://gerrit.wikimedia.org/r/419221 (owner: Jcrespo)
[16:59:25] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: #bothumor I ♥ Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1700).
[17:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[17:00:18] <wikibugs>	 (PS1) Andrew Bogott: shorten ttl for horizon.wm.o and toolsadmin.wm.o [dns] - https://gerrit.wikimedia.org/r/419222 (https://phabricator.wikimedia.org/T168470)
[17:01:05] <icinga-wm>	 RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:01:17] <wikibugs>	 (Abandoned) Andrew Bogott: shorten ttl for horizon.wm.o and toolsadmin.wm.o [dns] - https://gerrit.wikimedia.org/r/419222 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[17:01:36] <wikibugs>	 (PS1) Awight: Fix new ORES threshold syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159)
[17:01:48] <subbu>	 no parsoid deploy today
[17:01:55] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:02:22] <wikibugs>	 Operations, ops-eqiad, Discovery, Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4047252 (faidon) So post-mortem, I think there are 4 different things here: - T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis) - T189522: Detect I...
[17:02:35] <icinga-wm>	 RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:03:25] <icinga-wm>	 RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:04:46] <icinga-wm>	 RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:05:59] <wikibugs>	 (PS1) Andrew Bogott: Remove misc-web config for 'newwikitech' [puppet] - https://gerrit.wikimedia.org/r/419224 (https://phabricator.wikimedia.org/T168470)
[17:06:06] <wikibugs>	 (PS1) Andrew Bogott: Rename newhorizon and newtoolsadmin to horizon and toolsadmin [puppet] - https://gerrit.wikimedia.org/r/419225 (https://phabricator.wikimedia.org/T168470)
[17:06:07] <wikibugs>	 (PS1) Andrew Bogott: Move horizon and toolsadmin to labweb backends [puppet] - https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470)
[17:06:35] <icinga-wm>	 RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:06:36] <wikibugs>	 (PS2) Andrew Bogott: Remove misc-web config for 'newwikitech' [puppet] - https://gerrit.wikimedia.org/r/419224 (https://phabricator.wikimedia.org/T168470)
[17:06:37] <godog>	 !log cleanup integration-slave-jessie-1001:/srv/pbuilder/build - T189587
[17:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:43] <stashbot>	 T189587: integration-slave-jessie-1001 out of disk space - https://phabricator.wikimedia.org/T189587
[17:07:06] <wikibugs>	 (CR) Andrew Bogott: [C: 2] Remove misc-web config for 'newwikitech' [puppet] - https://gerrit.wikimedia.org/r/419224 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[17:07:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] "recheck" [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[17:08:12] <wikibugs>	 (PS1) BBlack: varnish: do not gzip empty/small responses [puppet] - https://gerrit.wikimedia.org/r/419228
[17:08:57] <wikibugs>	 (PS1) Awight: Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333)
[17:10:45] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0
[17:11:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0
[17:11:33] <wikibugs>	 (CR) Andrew Bogott: [C: -1] "Scheduled for tomorrow (Wednesday) morning." [puppet] - https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[17:12:48] <wikibugs>	 (CR) Awight: [C: 2] Fix new ORES threshold syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) (owner: Awight)
[17:12:53] <wikibugs>	 (CR) Awight: [C: 2] Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) (owner: Awight)
[17:14:01] <wikibugs>	 (Merged) jenkins-bot: Fix new ORES threshold syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) (owner: Awight)
[17:14:24] <wikibugs>	 (Merged) jenkins-bot: Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) (owner: Awight)
[17:14:35] <jynus>	 ok, we are now in a far from ideal, but stable state
[17:14:46] <wikibugs>	 Operations, DC-Ops, monitoring, User-fgiunchedi: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4047303 (fgiunchedi)
[17:16:57] <wikibugs>	 (PS2) BBlack: varnish: do not gzip empty/small responses [puppet] - https://gerrit.wikimedia.org/r/419228
[17:17:22] <jynus>	 the funny thing is by pure chance, I think we only need a rule for gerrit, it is the only active service with a different range
[17:18:09] <wikibugs>	 (PS3) BBlack: varnish: do not gzip empty/small responses [puppet] - https://gerrit.wikimedia.org/r/419228
[17:18:49] <paladox>	 jynus will that also fix gerrit2001 too?
[17:18:57] <wikibugs>	 (CR) jenkins-bot: Fix new ORES threshold syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/419223 (https://phabricator.wikimedia.org/T181159) (owner: Awight)
[17:19:11] <wikibugs>	 (PS4) BBlack: varnish: do not gzip empty/small responses [puppet] - https://gerrit.wikimedia.org/r/419228
[17:19:40] <logmsgbot>	 !log awight@tin Started scap: Beta: Fix ORES thresholds and enable JADE, T181159, T176333
[17:19:45] <wikibugs>	 (PS3) Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991)
[17:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:47] <stashbot>	 T181159: Migrate ORES extension threshold config from old to new syntax - https://phabricator.wikimedia.org/T181159
[17:19:47] <stashbot>	 T176333: [Blocked] Deploy JADE prototype in Beta Cluster - https://phabricator.wikimedia.org/T176333
[17:19:56] <jynus>	 paladox: gerrit2001 is waiting for hardware provisioning
[17:20:12] <paladox>	 jynus hardware provisioning?
[17:20:16] <jynus>	 so it will not fix it, but it should prevent the same thing from happening
[17:20:28] <jynus>	 paladox: we are missing misc db proxies there
[17:20:34] <paladox>	 oh
[17:20:45] <jynus>	 and to be fair, proper database setup
[17:20:53] <wikibugs>	 (CR) BBlack: [C: 2] varnish: do not gzip empty/small responses [puppet] - https://gerrit.wikimedia.org/r/419228 (owner: BBlack)
[17:20:53] <jynus>	 so we need to set up more servers for that
[17:21:02] <wikibugs>	 (PS5) BBlack: varnish: do not gzip empty/small responses [puppet] - https://gerrit.wikimedia.org/r/419228
[17:21:13] <jynus>	 we could have rushed it, but it would have been set up poorly
[17:21:30] <paladox>	 ok
[17:21:30] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[17:21:32] <paladox>	 thanks
[17:25:52] <wikibugs>	 (PS4) Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991)
[17:26:00] <wikibugs>	 (PS5) Muehlenhoff: Enable base::service_auto_restart for prometheus-node-exporter [puppet] - https://gerrit.wikimedia.org/r/419178 (https://phabricator.wikimedia.org/T135991)
[17:27:50] <wikibugs>	 Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4047356 (Gehel) a:Gehel
[17:28:12] <wikibugs>	 Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988#4047358 (Gehel) a:Gehel
[17:29:02] <wikibugs>	 (PS2) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[17:29:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[17:31:22] <wikibugs>	 (PS3) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[17:34:21] <wikibugs>	 (CR) Rush: "labtestneutron2001.codfw.wmnet,labtestneutron2002.codfw.wmnet,labtestcontrol2003.codfw.wmnet,labtestvirt2003.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[17:37:05] <volans>	 thanks godog for the review and recheck
[17:37:53] <godog>	 volans: np! was easy enough to fix, sadly not automatic yet
[17:38:02] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4047379 (Gehel) The decision is to not replace this out of warranty RAM. We'll run with 3% less capacity until this batch of servers is renewed (in ~ 1year).
[17:38:32] <volans>	 indeed
[17:42:41] <wikibugs>	 (PS1) Jcrespo: dblist: Update db1051 and db1063 location [software] - https://gerrit.wikimedia.org/r/419235
[17:43:08] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: icinga: update wikitech-static check contacts [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584)
[17:43:53] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] "I am merging now to not delay it unnecessarily, but please review with further commits if you see issues." [software] - https://gerrit.wikimedia.org/r/419235 (owner: Jcrespo)
[17:49:25] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: shinken: labs: delete wikitech-static check [puppet] - https://gerrit.wikimedia.org/r/419237 (https://phabricator.wikimedia.org/T189584)
[17:51:00] <wikibugs>	 (PS4) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[17:51:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[17:52:36] <wikibugs>	 (CR) Rush: icinga: update wikitech-static check contacts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[17:53:30] <wikibugs>	 (CR) Rush: shinken: labs: delete wikitech-static check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/419237 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[17:54:34] <wikibugs>	 (PS5) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[17:55:07] <moritzm>	 !log installing ncurses updates from stretch point release
[17:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:09] <wikibugs>	 (PS3) Volans: Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563)
[17:57:49] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for ncurses [puppet] - https://gerrit.wikimedia.org/r/419241
[17:58:08] <wikibugs>	 (PS2) Muehlenhoff: Add library hint for ncurses [puppet] - https://gerrit.wikimedia.org/r/419241
[17:58:42] <wikibugs>	 (CR) Volans: [C: 2] Add entries for ganeti instances for Puppetboard [dns] - https://gerrit.wikimedia.org/r/419193 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[17:59:45] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add library hint for ncurses [puppet] - https://gerrit.wikimedia.org/r/419241 (owner: Muehlenhoff)
[17:59:47] <wikibugs>	 (CR) Rush: "http://puppet-compiler.wmflabs.org/10434/" [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1800)
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:04:40] <wikibugs>	 (PS1) Awight: Add JADE to the extension list [mediawiki-config] - https://gerrit.wikimedia.org/r/419243
[18:06:00] <wikibugs>	 (CR) Awight: [C: 2] Add JADE to the extension list [mediawiki-config] - https://gerrit.wikimedia.org/r/419243 (owner: Awight)
[18:07:18] <wikibugs>	 (Merged) jenkins-bot: Add JADE to the extension list [mediawiki-config] - https://gerrit.wikimedia.org/r/419243 (owner: Awight)
[18:09:05] <wikibugs>	 (PS1) DCausse: Add extra-analysis [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/419244 (https://phabricator.wikimedia.org/T189239)
[18:14:11] <wikibugs>	 (CR) Gehel: [V: 2 C: 2] "LGTM, checked locally according to procedure in README" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/419244 (https://phabricator.wikimedia.org/T189239) (owner: DCausse)
[18:14:30] <wikibugs>	 (PS8) Rush: openstack: rabbit codify nova user [puppet] - https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266)
[18:17:25] <wikibugs>	 (CR) Rush: [C: 2] openstack: rabbit codify nova user [puppet] - https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[18:18:23] <wikibugs>	 (CR) Rush: "Arturo is on clinic so hopefully can roll this out" [labs/private] - https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: Chico Venancio)
[18:21:50] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: icinga: refresh wikitech-static monitoring and alerting [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584)
[18:21:52] <wikibugs>	 (Abandoned) Arturo Borrero Gonzalez: shinken: labs: delete wikitech-static check [puppet] - https://gerrit.wikimedia.org/r/419237 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[18:22:00] <wikibugs>	 (PS1) Rush: rabbitmq: handle dynamic resource names for deduping [puppet] - https://gerrit.wikimedia.org/r/419245
[18:22:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: refresh wikitech-static monitoring and alerting [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[18:22:36] <wikibugs>	 (CR) Rush: [C: 2] rabbitmq: handle dynamic resource names for deduping [puppet] - https://gerrit.wikimedia.org/r/419245 (owner: Rush)
[18:26:09] <wikibugs>	 (PS1) Odder: Update logo for the Maithili Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790)
[18:26:20] <wikibugs>	 Operations, ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4047526 (Papaul) Server would not power on  - Draining power   It looks like another dead main board. I will contact HP and see what they say.
[18:29:14] <wikibugs>	 (PS6) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[18:29:19] <wikibugs>	 (PS3) Rush: icinga: refresh wikitech-static monitoring and alerting [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[18:30:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: refresh wikitech-static monitoring and alerting [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[18:30:08] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: icinga: refresh wikitech-static monitoring and alerting [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584)
[18:30:35] <wikibugs>	 (CR) Rush: "labtestneutron2001.codfw.wmnet,labtestneutron2002.codfw.wmnet,labtestcontrol2003.wikimedia.org,labtestvirt2003.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[18:32:10] <wikibugs>	 (CR) Rush: [C: 1] "cool, fyi the server this lands on for icinga is einsteinium and I usually reach out to make sure the eventual icinga config is valid :)" [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[18:32:25] <moritzm>	 !log installing w3m updates from stretch point release
[18:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:28] <wikibugs>	 (CR) Rush: "http://puppet-compiler.wmflabs.org/10436/" [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[18:34:57] <wikibugs>	 (PS1) Volans: DHCP: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419247 (https://phabricator.wikimedia.org/T184563)
[18:34:59] <wikibugs>	 (PS1) Volans: netboot: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563)
[18:35:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: 2] icinga: refresh wikitech-static monitoring and alerting [puppet] - https://gerrit.wikimedia.org/r/419236 (https://phabricator.wikimedia.org/T189584) (owner: Arturo Borrero Gonzalez)
[18:37:20] <moritzm>	 !log installing reportbug updates from stretch point release
[18:37:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:45] <wikibugs>	 (CR) Volans: [C: 2] DHCP: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419247 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[18:40:00] <wikibugs>	 (PS2) Volans: DHCP: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419247 (https://phabricator.wikimedia.org/T184563)
[18:41:00] <wikibugs>	 (CR) Rush: [C: 2] openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[18:41:11] <wikibugs>	 (CR) Dzahn: base/icinga: add Hiera override to skip systemd monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[18:41:13] <wikibugs>	 (PS7) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[18:42:12] <wikibugs>	 (PS8) Rush: openstack: labtestn initial neutron framework [puppet] - https://gerrit.wikimedia.org/r/419198 (https://phabricator.wikimedia.org/T188266)
[18:42:17] <gehel>	 !log repool wdqs1004 & wdqs2001 now that data reload is completed T189548
[18:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:23] <stashbot>	 T189548: reload data on wdqs1004 - https://phabricator.wikimedia.org/T189548
[18:42:33] <wikibugs>	 (PS1) Ottomata: Update wheels for Debian Stretch [wheels/paws-internal] - https://gerrit.wikimedia.org/r/419251 (https://phabricator.wikimedia.org/T183145)
[18:42:37] <wikibugs>	 (PS1) Muehlenhoff: Temporarily remove hydrogen from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419252
[18:42:47] <volans>	 chasemp: sorry, "rush" hour for merging :-P
[18:42:47] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] Update wheels for Debian Stretch [wheels/paws-internal] - https://gerrit.wikimedia.org/r/419251 (https://phabricator.wikimedia.org/T183145) (owner: Ottomata)
[18:43:01] <chasemp>	 volans: rebase wars! we deserve our own history channel show
[18:43:20] <volans>	 :)
[18:43:53] <volans>	 I have another one but can wait few minutes :D
[18:45:22] <chasemp>	 volans: all yours now dude, I'm off and running
[18:45:41] <wikibugs>	 (PS1) Ottomata: wheel frozen-requirements should refer to version [wheels/paws-internal] - https://gerrit.wikimedia.org/r/419253
[18:45:51] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] wheel frozen-requirements should refer to version [wheels/paws-internal] - https://gerrit.wikimedia.org/r/419253 (owner: Ottomata)
[18:46:03] <volans>	 lol, ack thanks
[18:46:12] <wikibugs>	 (PS1) Chad: group0 to wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/419254
[18:46:46] <wikibugs>	 (PS1) Dzahn: restbase: allow to skip monitoring, disable on dev hosts [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050)
[18:46:47] <icinga-wm>	 RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[18:46:55] <logmsgbot>	 !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "awight"; reason is "Beta: Fix ORES thresholds and enable JADE, T181159, T176333" (duration: 00m 00s)
[18:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:01] <stashbot>	 T181159: Migrate ORES extension threshold config from old to new syntax - https://phabricator.wikimedia.org/T181159
[18:47:01] <stashbot>	 T176333: Deploy JADE prototype in Beta Cluster - https://phabricator.wikimedia.org/T176333
[18:47:08] <mutante>	 joins the merge wars
[18:47:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] restbase: allow to skip monitoring, disable on dev hosts [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[18:47:38] <logmsgbot>	 !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "awight"; reason is "Beta: Fix ORES thresholds and enable JADE, T181159, T176333" (duration: 00m 00s)
[18:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:06] <no_justification>	 awight|lunch: You have the scap lock...?
[18:50:50] <wikibugs>	 (PS1) RobH: decom db2030 [puppet] - https://gerrit.wikimedia.org/r/419256 (https://phabricator.wikimedia.org/T187768)
[18:50:54] <wikibugs>	 Operations, ops-codfw, DBA, hardware-requests, Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047594 (RobH)
[18:51:25] <wikibugs>	 (PS2) Volans: netboot: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563)
[18:51:45] <wikibugs>	 (CR) RobH: [C: 2] decom db2030 [puppet] - https://gerrit.wikimedia.org/r/419256 (https://phabricator.wikimedia.org/T187768) (owner: RobH)
[18:51:47] <icinga-wm>	 PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:52:23] <volans>	 ottomata: is it you? ^^^
[18:53:06] <wikibugs>	 (PS3) Volans: netboot: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563)
[18:54:04] <wikibugs>	 (PS2) Dzahn: restbase: allow to skip monitoring, disable on dev hosts [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050)
[18:54:06] <wikibugs>	 (CR) Volans: [C: 2] netboot: add entries for puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/419248 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[18:54:27] <wikibugs>	 (PS1) RobH: decom db2030 production dns entries [dns] - https://gerrit.wikimedia.org/r/419257 (https://phabricator.wikimedia.org/T187768)
[18:54:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] restbase: allow to skip monitoring, disable on dev hosts [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[18:54:56] <wikibugs>	 (CR) RobH: [C: 2] decom db2030 production dns entries [dns] - https://gerrit.wikimedia.org/r/419257 (https://phabricator.wikimedia.org/T187768) (owner: RobH)
[18:55:15] <no_justification>	 awight|lunch: Can you please rm /var/lock/scap.operations_mediawiki-config.lock from tin?
[18:55:21] <no_justification>	 (you seem to have a stuck lock...)
[18:57:58] <wikibugs>	 (PS1) Rush: openstack: labtestmetal partmon raid1 recipe [puppet] - https://gerrit.wikimedia.org/r/419258 (https://phabricator.wikimedia.org/T188266)
[18:58:59] <wikibugs>	 Operations, ops-codfw, DBA, hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047627 (RobH)
[18:59:15] <wikibugs>	 Operations, ops-codfw, DBA, hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3984964 (RobH) a:RobH>Papaul @papaul: ready for onsite disk wipe
[19:00:04] <jouncebot>	 no_justification: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T1900).
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:01:10] <no_justification>	 Bleh. Or can I get a root to nuke this file from tin? 
[19:01:17] <no_justification>	  /var/lock/scap.operations_mediawiki-config.lock
[19:01:54] <awight>	 omg sorry
[19:01:55] <awight>	 nuking
[19:01:55] <no_justification>	 There's an awight!
[19:01:56] <no_justification>	 :)
[19:02:21] <awight>	 done
[19:02:39] <logmsgbot>	 !log demon@tin Started scap: bootstrap wmf.25
[19:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:23] <wikibugs>	 Operations, Traffic, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4047636 (BBlack) So, recapping this ticket that's been stale for quite a while:  * We've had past applayer bugs with gzipped outputs in e...
[19:05:33] <wikibugs>	 (PS8) Herron: puppetdbquery: upgrade to 3.0.1 [puppet] - https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259)
[19:06:01] <wikibugs>	 (PS1) Ottomata: Update jupyterhub to 0.8.1 to work with newer singleuserauthenticator [wheels/paws-internal] - https://gerrit.wikimedia.org/r/419260 (https://phabricator.wikimedia.org/T183145)
[19:06:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetdbquery: upgrade to 3.0.1 [puppet] - https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: Herron)
[19:06:24] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] Update jupyterhub to 0.8.1 to work with newer singleuserauthenticator [wheels/paws-internal] - https://gerrit.wikimedia.org/r/419260 (https://phabricator.wikimedia.org/T183145) (owner: Ottomata)
[19:10:59] <mutante>	 no_justification: ok
[19:11:07] <no_justification>	 No
[19:11:08] <no_justification>	 It's already done
[19:11:13] <no_justification>	 awight handled it :)
[19:11:16] <mutante>	 just saw,'k
[19:11:39] <awight>	 Dig a hole, fill it in again!
[19:15:22] <bblack>	 or cover it with a piece of paper with a drawing of grass on it and hope nobody notices! :)
[19:23:30] <wikibugs>	 Operations, Traffic, Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#4047699 (BBlack) Open>Resolved
[19:26:04] <awight>	 bblack: yes!  Also good for catching neighbors to have for dinner
[19:26:48] <wikibugs>	 Operations, ops-codfw, DBA, hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047703 (Papaul) @RobH  thanks
[19:52:34] <wikibugs>	 Operations, ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4047768 (Papaul) Dear Mr Papaul Tshibamba,  Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.  Your request...
[19:58:50] <icinga-wm>	 PROBLEM - HHVM rendering on mw2187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:59:49] <icinga-wm>	 RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 74345 bytes in 1.289 second response time
[20:01:06] <wikibugs>	 (PS1) Gehel: wdqs: collect prometheus metrics for both wdqs clusters [puppet] - https://gerrit.wikimedia.org/r/419264 (https://phabricator.wikimedia.org/T187766)
[20:03:30] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 66.70, 36.61, 27.67
[20:09:57] <logmsgbot>	 !log demon@tin Finished scap: bootstrap wmf.25 (duration: 67m 17s)
[20:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:44] <wikibugs>	 (CR) Chad: [C: 2] group0 to wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/419254 (owner: Chad)
[20:12:32] <wikibugs>	 (Merged) jenkins-bot: group0 to wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/419254 (owner: Chad)
[20:30:52] <wikibugs>	 (CR) Framawiki: [C: 1] "Note that a -2 is not just an opinion. It's a veto. See MarcoAurelio's comment on the phab ticket. The debate is still present on phab. I " [mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: Zoranzoki21)
[20:34:40] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.56, 21.89, 23.63
[20:35:15] <wikibugs>	 (PS2) Framawiki: Change NS aliases on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277)
[20:37:57] <wikibugs>	 (PS1) Ottomata: Blacklist VirtualPageView schema from EL MySQL [puppet] - https://gerrit.wikimedia.org/r/419268 (https://phabricator.wikimedia.org/T186728)
[20:40:21] <logmsgbot>	 !log demon@tin rebuilt and synchronized wikiversions files: group0 to wmf.25
[20:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:15] <wikibugs>	 (CR) BryanDavis: [C: 1] Add Chicocvenancio's key for Cloud Services [labs/private] - https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: Chico Venancio)
[20:45:50] <wikibugs>	 (CR) Ottomata: [C: 2] Blacklist VirtualPageView schema from EL MySQL [puppet] - https://gerrit.wikimedia.org/r/419268 (https://phabricator.wikimedia.org/T186728) (owner: Ottomata)
[20:48:27] <awight>	 no_justification: For when you're out of the train...  I realized that my extension has a composer dependency on justinrainbow/json-schema, which is in require-dev for mediawiki-core.  Grasping at straws here.
[20:48:54] <legoktm>	 awight: isn't it already in mediawiki/vendor?
[20:48:58] <Reedy>	 awight: Isn't that in the wmf deployment vendor repo anyway?
[20:49:01] <awight>	 legoktm: +1 just confirmed that
[20:49:14] <legoktm>	 so it should be fine?
[20:49:25] <Reedy>	 Unless you happen to need a different version
[20:49:49] <awight>	 Sorry for the spam.  What I'm looking at is a new extension that should have been deployed to the beta cluster a few hours ago, but so far nothing is loaded.
[20:50:07] <legoktm>	 link to the mediawiki-config patch?
[20:50:34] <Reedy>	 https://github.com/wikimedia/operations-mediawiki-config/commit/30ba98a43665e4e025611dd9283948cce2b97d58
[20:50:37] <awight>	 There are follow-ups, but here's the business, https://gerrit.wikimedia.org/r/#/c/419229/1/wmf-config/CommonSettings.php
[20:50:41] <legoktm>	 oh
[20:50:43] <legoktm>	 I see why
[20:50:47] <legoktm>	 jenkins is stuck
[20:50:49] <awight>	 I used eval.php to show that $wmgUseJADE is true
[20:50:51] <awight>	 wat
[20:50:54] <awight>	 hehe okay thanks
[20:51:17] <awight>	 Good catch, "postmerge" jobs
[20:51:47] <wikibugs>	 (CR) jenkins-bot: Enable Extension:JADE on all beta cluster wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/419229 (https://phabricator.wikimedia.org/T176333) (owner: Awight)
[20:51:51] <wikibugs>	 (CR) jenkins-bot: Add JADE to the extension list [mediawiki-config] - https://gerrit.wikimedia.org/r/419243 (owner: Awight)
[20:51:59] <wikibugs>	 (CR) jenkins-bot: group0 to wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/419254 (owner: Chad)
[20:52:51] <legoktm>	 now we wait ~5min?
[20:53:08] <wikibugs>	 (PS1) Jdlrobson: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793)
[20:54:20] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1974 bytes in 0.092 second response time
[20:57:32] <awight>	 legoktm: That was amazing.  Thanks!
[21:00:20] <icinga-wm>	 PROBLEM - HHVM rendering on mw2141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:01:10] <icinga-wm>	 RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 74277 bytes in 0.302 second response time
[21:20:33] <wikibugs>	 (PS9) Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650)
[21:20:47] <wikibugs>	 Operations, ops-codfw, DC-Ops, hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4048057 (RobH) a:RobH
[21:23:16] <wikibugs>	 (PS10) Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650)
[21:28:45] <wikibugs>	 Operations, hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4048062 (brion) *nod* If there's general agreement not to add more specific hardware yet, we can just work with the reassigned image servers for now and add later if ne...
[21:30:00] <wikibugs>	 Operations, ops-codfw, DC-Ops, hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4048063 (RobH)
[21:33:20] <wikibugs>	 (PS1) RobH: decom restbase-test200[123] [puppet] - https://gerrit.wikimedia.org/r/419312 (https://phabricator.wikimedia.org/T187447)
[21:33:26] <wikibugs>	 (CR) Krinkle: [C: -1] "From IRC:"looks like it's a problem with either the confluent-kafka module, or (more likely) with the librdkafka library that it uses unde" [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[21:33:59] <wikibugs>	 (CR) RobH: [C: 2] decom restbase-test200[123] [puppet] - https://gerrit.wikimedia.org/r/419312 (https://phabricator.wikimedia.org/T187447) (owner: RobH)
[21:35:36] <wikibugs>	 (PS11) Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650)
[21:36:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] wiki replicas: script index creation for easier maintenance [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[21:39:18] <wikibugs>	 (PS12) Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650)
[21:39:22] <wikibugs>	 (PS1) Dzahn: rm restbase::monitoring, remnants of module [puppet] - https://gerrit.wikimedia.org/r/419314
[21:39:49] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=55%)
[21:40:12] <wikibugs>	 (PS1) RobH: restbase-test200* prod dns removal [dns] - https://gerrit.wikimedia.org/r/419315 (https://phabricator.wikimedia.org/T187447)
[21:40:52] <wikibugs>	 (CR) RobH: [C: 2] restbase-test200* prod dns removal [dns] - https://gerrit.wikimedia.org/r/419315 (https://phabricator.wikimedia.org/T187447) (owner: RobH)
[21:43:20] <wikibugs>	 Operations, ops-codfw, DC-Ops, hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4048083 (RobH)
[21:44:19] <wikibugs>	 Operations, ops-codfw, DC-Ops, hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3975465 (RobH) a:RobH>Papaul These are ready to be wiped by onsite.  Please note as SSDs, these need the specific smartctl utility run to erase them securely....
[21:49:31] <wikibugs>	 (PS1) Alexandros Kosiaris: Create the basic structure of a helm chart repo [deployment-charts] - https://gerrit.wikimedia.org/r/419316
[21:49:41] <wikibugs>	 (CR) Dzahn: [C: -1] "turns out this class is not used, monitoring check is already moved -> https://gerrit.wikimedia.org/r/#/c/419314/" [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[21:51:09] <urandom>	 mutante: Q: do you think you'd have any time in the coming days to re-image restbase-dev1006 (ala T185494), and optionally, restbase-dev100{4,5}?
[21:51:10] <stashbot>	 T185494: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494
[21:51:50] <urandom>	 mutante: these are not production hosts, so nothing Bad can happen, no special considerations are needed, and if you get them back to the point where I have login, I can take it from there
[21:52:24] <urandom>	 mutante: TL;DR it's kinda blocking a bunch of other things, and everyone in SRE seems pretty slammed :)
[21:52:37] <urandom>	 (hoping you're not-as)
[21:52:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] Create the basic structure of a helm chart repo [deployment-charts] - https://gerrit.wikimedia.org/r/419316 (owner: Alexandros Kosiaris)
[21:52:57] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Create the basic structure of a helm chart repo [deployment-charts] - https://gerrit.wikimedia.org/r/419316 (owner: Alexandros Kosiaris)
[21:53:20] <mutante>	 it depends whether we get hardware working for bast/deployment server replacements which would be due by end of quarter in 2 weeks
[21:53:39] <mutante>	 is this related to restbase-test hosts being removed right now above?
[21:53:45] <mutante>	 looking at ticket
[21:55:12] <urandom>	 mutante: not really, no
[21:55:46] <urandom>	 mutante: those machines are really long in the tooth, we're going to stick to the -dev* ones, but that environment needs to be rebuilt
[21:55:51] <wikibugs>	 Operations, ops-eqiad, DC-Ops, hardware-requests, Patch-For-Review: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4048126 (RobH) a:RobH
[21:56:26] <urandom>	 mutante: i.e. we're consolidating, but atm we have nothing
[22:12:48] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4048150 (RobH) I just disabled the following ports for decommission of the systems:  {master:1}[edit] robh@asw-b-eqiad# show | compare  [edit interface...
[22:13:24] <wikibugs>	 Operations, ops-eqiad, DC-Ops, hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4048151 (RobH)
[22:17:18] <wikibugs>	 Operations, Services: rename role::xenon - https://phabricator.wikimedia.org/T189629#4048155 (RobH) p:Triage>Normal
[22:20:12] <wikibugs>	 (PS1) RobH: decom of xenon, cerium, & praseodymium [puppet] - https://gerrit.wikimedia.org/r/419322 (https://phabricator.wikimedia.org/T187446)
[22:21:55] <wikibugs>	 (CR) RobH: [C: 2] decom of xenon, cerium, & praseodymium [puppet] - https://gerrit.wikimedia.org/r/419322 (https://phabricator.wikimedia.org/T187446) (owner: RobH)
[22:26:44] <wikibugs>	 (PS1) RobH: Decommission xenon, cerium, praseodymium production dns entries [dns] - https://gerrit.wikimedia.org/r/419324 (https://phabricator.wikimedia.org/T187446)
[22:27:24] <wikibugs>	 (CR) RobH: [C: 2] Decommission xenon, cerium, praseodymium production dns entries [dns] - https://gerrit.wikimedia.org/r/419324 (https://phabricator.wikimedia.org/T187446) (owner: RobH)
[22:28:20] <wikibugs>	 Operations, ops-eqiad, DC-Ops, hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4048208 (RobH)
[22:28:52] <wikibugs>	 Operations, ops-eqiad, DC-Ops, hardware-requests: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#3975453 (RobH) a:RobH>Cmjohnson this is now ready for onsite wipe of ssds.  Please note these are ssds, so will need the smartctl utility to clear them ou...
[22:29:22] <icinga-wm>	 PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:29:32] <icinga-wm>	 PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:29:41] <icinga-wm>	 PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:29:59] <robh>	 uh oh
[22:30:01] <icinga-wm>	 PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:30:07] <robh>	 i just merged a change but it was decom, shouldn't cause that
[22:30:22] <icinga-wm>	 PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:31:11] <icinga-wm>	 PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:31:51] <robh>	 manually running on lvs1002 to see what exactly is happening
[22:31:51] <icinga-wm>	 PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:32:01] <icinga-wm>	 PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:32:05] <wikibugs>	 Operations, Analytics, ChangeProp, EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4048217 (mobrovac) >>! In T188947#4046639, @Pchelolo wrote: >> Add a second LVS IP, to be served from the same cluster, to use for videoscal...
[22:32:11] <icinga-wm>	 PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:32:29] <robh>	 ok lvs1002 runs puppet fine when i run it
[22:32:31] <icinga-wm>	 PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:32:41] <icinga-wm>	 PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:33:21] <icinga-wm>	 PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:33:22] <icinga-wm>	 PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:33:42] <icinga-wm>	 PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:34:22] <icinga-wm>	 RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:34:47] <robh>	 so the one i manually ran of course clears...
[22:36:42] <icinga-wm>	 PROBLEM - puppet last run on mw1311 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:38:05] <mutante>	 confirming one.. mw1311
[22:39:06] <mutante>	 and ACK, it's just the icinga noise now
[22:39:07] <paladox>	 I think it was puppetdb 
[22:39:11] <mutante>	 yes, indeed
[22:41:40] <wikibugs>	 Operations, Services (watching): rename role::xenon - https://phabricator.wikimedia.org/T189629#4048246 (mobrovac) [`role::xenon`](https://github.com/wikimedia/puppet/blob/c6a8895e6eb1ea858795aca2325d60a877a0276e/modules/role/manifests/xenon.pp) actually refers to the [HHVM extension named Xenon](https:/...
[22:41:42] <icinga-wm>	 RECOVERY - puppet last run on mw1311 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[22:44:22] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1977 bytes in 0.087 second response time
[22:47:09] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4048260 (ayounsi) asw2-b-eqiad updated.
[22:52:51] <icinga-wm>	 PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/e00bf838-2710-11e8-9cb3-aa0000fe6bdf/volumes/kubernetes.io~secret/tiller-token-xf04b is not accessible: Permission denied
[22:53:07] <wikibugs>	 (PS13) Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650)
[22:53:32] <icinga-wm>	 RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[22:54:51] <icinga-wm>	 PROBLEM - IPMI Sensor Status on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:55:32] <icinga-wm>	 PROBLEM - Restbase root url on restbase-dev1006 is CRITICAL: connect to address 10.64.48.10 and port 7231: Connection refused
[22:55:41] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:55:51] <icinga-wm>	 PROBLEM - configured eth on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:55:52] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:01] <icinga-wm>	 PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:02] <icinga-wm>	 PROBLEM - DPKG on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:02] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:21] <icinga-wm>	 PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:31] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:31] <icinga-wm>	 PROBLEM - Check size of conntrack table on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:56:51] <icinga-wm>	 RECOVERY - puppet last run on ganeti1005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[22:57:41] <icinga-wm>	 RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[22:58:21] <icinga-wm>	 RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:58:31] <icinga-wm>	 RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[22:58:41] <icinga-wm>	 PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[22:58:51] <icinga-wm>	 RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:59:32] <icinga-wm>	 RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:59:41] <icinga-wm>	 RECOVERY - puppet last run on db1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:00:01] <icinga-wm>	 RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180313T2300).
[23:00:04] <jouncebot>	 subbu, twkozlowski, and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:22] <icinga-wm>	 RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:00:26] <subbu>	 o/
[23:00:34] <odder>	 o/
[23:01:05] <odder>	 We're at 9 patches this window, I see
[23:01:11] <icinga-wm>	 RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:02:01] <icinga-wm>	 RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:02:11] <icinga-wm>	 RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:02:31] <icinga-wm>	 RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:06:16] <subbu>	 any swatters around?
[23:07:28] <Reedy>	 there's too many patches!
[23:07:58] <Hauskatze>	 just 9, one more :)
[23:08:14] <wikibugs>	 (PS2) Reedy: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: Subramanya Sastry)
[23:08:20] <wikibugs>	 (CR) Reedy: [C: 2] Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: Subramanya Sastry)
[23:10:01] <wikibugs>	 (Merged) jenkins-bot: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: Subramanya Sastry)
[23:10:17] <wikibugs>	 (CR) jenkins-bot: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) (owner: Subramanya Sastry)
[23:10:25] <mutante>	 !log restbase-dev1006 - reinstalling, manually skipping " Volume group name already in use" (T185494)
[23:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:31] <stashbot>	 T185494: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494
[23:11:42] <wikibugs>	 (PS2) Reedy: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: Odder)
[23:11:49] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable RemexHTML on 96 wikis T188010 (duration: 01m 16s)
[23:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:55] <stashbot>	 T188010: Enable RemexHTML on additional wikis with < 25  errors in all high priority categories - https://phabricator.wikimedia.org/T188010
[23:11:59] <jdlrobson>	 Reedy: I'm here (but at the end of the queue apparently ;-))
[23:12:05] <jdlrobson>	 lemme know when we are ready to rumble
[23:12:14] <wikibugs>	 (CR) Reedy: [C: 2] Add high-density logos for the Simple English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: Odder)
[23:12:30] * odder has got a few patches but they should be nice, quick n' easy
[23:13:15] <Reedy>	 jerkins is slow
[23:13:26] <wikibugs>	 (Merged) jenkins-bot: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: Odder)
[23:13:42] <icinga-wm>	 PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:13:48] <wikibugs>	 Operations, ops-ulsfo, Traffic, netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4048335 (ayounsi) Open>stalled
[23:14:01] <icinga-wm>	 PROBLEM - configured eth on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:01] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:08] <wikibugs>	 (PS2) Reedy: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: Odder)
[23:14:11] <icinga-wm>	 PROBLEM - DPKG on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:11] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:11] <icinga-wm>	 PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:15] <wikibugs>	 (CR) Reedy: [C: 2] Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: Odder)
[23:14:31] <icinga-wm>	 PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:31] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:32] <icinga-wm>	 PROBLEM - Check size of conntrack table on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:37] <mutante>	 that wasn't downtime? why
[23:14:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:14:52] <mutante>	 i mean why did it not alert before when it was down
[23:15:33] <wikibugs>	 (CR) jenkins-bot: Add high-density logos for the Simple English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419144 (https://phabricator.wikimedia.org/T181448) (owner: Odder)
[23:15:41] <wikibugs>	 (Merged) jenkins-bot: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: Odder)
[23:15:51] <wikibugs>	 (PS2) Reedy: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: Odder)
[23:15:53] <wikibugs>	 (CR) jenkins-bot: Provide a high-density logo for the Twi Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419168 (https://phabricator.wikimedia.org/T189578) (owner: Odder)
[23:15:53] <mutante>	 apparently because if the return code is 255 and not 0,1,2 or 3 ...
[23:15:55] <wikibugs>	 (CR) Reedy: [C: 2] Add a localised logo for the Kongo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: Odder)
[23:16:02] <mutante>	 then scheduled downtimes are ignored
[23:16:09] <subbu>	 Reedy, looks like the remex thing is done already .. based on my testing.
[23:16:17] <mutante>	 and it's 255 because currently that's installing OS
[23:16:20] <Reedy>	 subbu: Yeah, I just pushed it live :P
[23:16:22] <subbu>	 oh yes .. you did sync it up there.
[23:16:24] <subbu>	 ok.
[23:17:05] <wikibugs>	 (Merged) jenkins-bot: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: Odder)
[23:17:26] <wikibugs>	 (PS2) Reedy: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: Odder)
[23:17:29] <wikibugs>	 (CR) Reedy: [C: 2] Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: Odder)
[23:18:12] <Hauskatze>	 and no logmsgbot messages?
[23:18:44] <Reedy>	 morebots seems to be dead
[23:18:52] <icinga-wm>	 RECOVERY - Disk space on kubernetes1003 is OK: DISK OK
[23:19:00] <wikibugs>	 (Merged) jenkins-bot: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: Odder)
[23:19:16] <wikibugs>	 (PS2) Reedy: Update logo for the Maithili Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: Odder)
[23:19:19] <wikibugs>	 (CR) Reedy: [C: 2] Update logo for the Maithili Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: Odder)
[23:20:31] <wikibugs>	 (Merged) jenkins-bot: Update logo for the Maithili Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: Odder)
[23:21:07] <wikibugs>	 (PS2) Reedy: Add high-density logos for seven Wikipedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: Odder)
[23:21:11] <wikibugs>	 (CR) Reedy: [C: 2] Add high-density logos for seven Wikipedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: Odder)
[23:21:23] <wikibugs>	 (CR) jenkins-bot: Add a localised logo for the Kongo Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419176 (https://phabricator.wikimedia.org/T189586) (owner: Odder)
[23:22:21] <wikibugs>	 (Merged) jenkins-bot: Add high-density logos for seven Wikipedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: Odder)
[23:22:22] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - kubelet_operational_latencies is 33696 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:23:22] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - kubelet_operational_latencies is 1329 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:24:32] <logmsgbot>	 !log reedy@tin Synchronized static/images/project-logos/: YOU GET A LOGO, YOU GET A LOGO. YOU ALL GET LOGOS (duration: 01m 16s)
[23:24:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:54] <odder>	 ;-)
[23:25:23] <Hauskatze>	 lol
[23:25:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:26:01] <icinga-wm>	 PROBLEM - configured eth on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:26:01] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:26:11] <icinga-wm>	 PROBLEM - DPKG on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:26:12] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:26:12] <icinga-wm>	 PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: Return code of 255 is out of bounds
[23:26:16] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: moar logos (duration: 01m 15s)
[23:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:26] <wikibugs>	 (PS2) Reedy: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: Jdlrobson)
[23:30:37] <icinga-wm>	 PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:30:38] <wikibugs>	 (CR) Reedy: [C: 2] Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: Jdlrobson)
[23:30:47] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:30:47] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[23:30:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:31:08] <icinga-wm>	 PROBLEM - configured eth on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:31:08] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[23:31:08] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:31:18] <icinga-wm>	 PROBLEM - DPKG on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:31:18] <icinga-wm>	 PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:31:18] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:31:57] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1006 is OK: OK ferm input default policy is set
[23:31:58] <icinga-wm>	 RECOVERY - MD RAID on restbase-dev1006 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0
[23:32:01] <wikibugs>	 (Merged) jenkins-bot: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: Jdlrobson)
[23:32:07] <icinga-wm>	 RECOVERY - configured eth on restbase-dev1006 is OK: OK - interfaces up
[23:32:08] <icinga-wm>	 RECOVERY - Disk space on restbase-dev1006 is OK: DISK OK
[23:32:18] <wikibugs>	 Operations, ops-codfw: rack/setup/install ms-be204[1-4] - https://phabricator.wikimedia.org/T189633#4048384 (Papaul) p:Triage>Normal
[23:32:18] <icinga-wm>	 RECOVERY - dhclient process on restbase-dev1006 is OK: PROCS OK: 0 processes with command name dhclient
[23:32:18] <icinga-wm>	 RECOVERY - DPKG on restbase-dev1006 is OK: All packages OK
[23:32:27] <mutante>	 ^ reinstalled
[23:32:34] <wikibugs>	 Operations, ops-codfw: rack/setup/install ms-be204[1-3] - https://phabricator.wikimedia.org/T189633#4048400 (Papaul)
[23:33:18] <icinga-wm>	 RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[23:33:47] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[23:34:06] <mutante>	 urandom: ^ i think you should be able to SSH again (minus mismatching fingerprint, new one is here https://phabricator.wikimedia.org/P6845)
[23:35:31] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 01m 15s)
[23:35:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:37] <wikibugs>	 Operations, ops-eqiad, User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4048405 (Dzahn) reinstalled, re-added to puppet, initial puppet run, recovered in Icinga, including:  19:33 < icinga-wm> RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cas...
[23:35:38] <Reedy>	 !log that was Enable VirtualPageViews on Hungarian Wikipedia T184793
[23:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:44] <stashbot>	 T184793: Instrument page interactions - https://phabricator.wikimedia.org/T184793
[23:36:18] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[23:37:51] <urandom>	 mutante: \o/
[23:37:56] <urandom>	 mutante: thank you so much!
[23:38:36] <urandom>	 i know that ended up being more than you bargained for and that you have other stuff to do; it is appreciated!
[23:39:33] <Reedy>	 error: insufficient permission for adding an object to repository database .git/objects
[23:39:33] <Reedy>	 fatal: git write-tree failed to write a tree
[23:39:57] <Reedy>	 tgr: You seem to have a mad umask for groups
[23:40:10] <Reedy>	 drwxr-sr-x   2 tgr        wikidev 4096 Mar  9 23:21 e9
[23:40:10] <Reedy>	 drwxrwsr-x   2 reedy      wikidev 4096 Mar  8 22:15 eb
[23:40:34] <mutante>	 urandom: :) so, one question. did you say 1004/1005 because they should all be stretch? that makes sense
[23:40:49] <mutante>	 and did you expect to get stretch on 1006
[23:40:52] <Reedy>	 mutante: Can I get you to chmod -R g+w /srv/mediawiki-staging/.git/objects/*
[23:41:24] <urandom>	 mutante: hrmm, no actually, our standard setup is still jessie
[23:41:45] <urandom>	 mutante: how does that work, what is the timetable for upgrading?
[23:42:07] <mutante>	 Reedy: done on tin
[23:42:07] <urandom>	 i assume at some point, there will be an effort to get everything migrated to stretch
[23:42:08] <tgr>	 Reedy: pretty sure I never changed my umask settings on tin
[23:42:32] <Reedy>	 mutante: ugh, wrong folder, damn it
[23:42:52] <Reedy>	 chmod -R g+w /srv/mediawiki-staging/php-1.31.0-wmf.24/.git/objects/*
[23:43:09] <jdlrobson>	 Reedy: oh we're live? cool. was expecting a mwdebug test round
[23:43:11] <jdlrobson>	 thanks!
[23:43:24] <Reedy>	 effort :P
[23:43:30] <jdlrobson>	 Reedy: the UBN is that live?
[23:43:31] <Reedy>	 jdlrobson: other is blocked on bad git file permissions
[23:43:31] <mutante>	 !log tin: chmod -R g+w /srv/mediawiki-staging/.git/objects/* ;  chmod -R g+w /srv/mediawiki-staging/php-1.31.0-wmf.24/.git/objects/*
[23:43:34] <mutante>	 Reedy: done
[23:43:34] <jdlrobson>	 oh.
[23:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:50] <mutante>	 urandom: so .. a little while ago i changed the default to be stretch .. in DHCP
[23:44:02] <Reedy>	 mutante: <3
[23:44:03] <mutante>	 urandom: we did that to encourage more stretch
[23:44:14] <urandom>	 wait, this is jessie
[23:44:35] <mutante>	 so basically we wanted to change it to "you need a reason now to still want jessie"
[23:44:44] <urandom>	 mutante: yeah, this is jesssie
[23:44:46] <mutante>	 but also.. i did not touch the individual DHCP config for it today
[23:44:51] <mutante>	 so.. you still got jessie
[23:44:57] <urandom>	 err, with the apropos number of s's
[23:44:58] <icinga-wm>	 RECOVERY - Check size of conntrack table on restbase-dev1006 is OK: OK: nf_conntrack is 0 % full
[23:45:06] <urandom>	 oh, ok
[23:45:36] <mutante>	 urandom: so while the 3 dev hosts are consistent.. which is good... maybe you also want one that is stretch
[23:45:41] <mutante>	 so that you can start testing the difference
[23:45:51] <urandom>	 ¯\_(ツ)_/¯
[23:46:33] <urandom>	 that would probably be OK, but you've already got this up, are you trying to find more to do? :)
[23:46:54] <logmsgbot>	 !log reedy@tin Synchronized php-1.31.0-wmf.24/extensions/MobileFrontend/: T188825 (duration: 01m 18s)
[23:46:59] <mutante>	 lol, no. but why did you say to reinstall 1004/1005 optionally
[23:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:00] <stashbot>	 T188825: Infobox relocation ends up messing up paragraphs placing - https://phabricator.wikimedia.org/T188825
[23:47:42] <urandom>	 well, i said optionally because we've typically reimaged when resetting one of these clusters, but it isn't strictly needed
[23:47:49] <urandom>	 i can clean them up
[23:48:05] <urandom>	 and it is in my power to do so, i can't reimage :)
[23:48:59] <Hauskatze>	 Good night people. Try not to work too hard.
[23:49:51] <mutante>	 urandom: ah! ok.. well then.. if the cleanup is easy enough, ok
[23:50:02] <urandom>	 it's doable :)
[23:50:35] <urandom>	 and this is the dev environment, so if there are missteps, then it's OK
[23:51:35] <mutante>	 btw, i just made a change to disable the monitoring for "restbase root url" if the host is a "dev" host
[23:51:56] <urandom>	 ok
[23:52:04] <mutante>	 and then i found that we had an old restbase::monitoring class that isn't used 
[23:52:14] <mutante>	 replaced by profile::restbase
[23:52:50] <wikibugs>	 (PS1) Odder: Add a localised logo for the Cree Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419332
[23:53:18] <odder>	 Reedy: Would you mind doing ^^ or do you prefer to wait until another window tomorrow?
[23:53:45] <odder>	 I created the two HiDPI logos today, but it looks like we missed the normal logo somehow :/
[23:54:19] <wikibugs>	 (CR) Dzahn: [C: 2] rm restbase::monitoring, remnants of module [puppet] - https://gerrit.wikimedia.org/r/419314 (owner: Dzahn)
[23:54:25] <wikibugs>	 (PS2) Dzahn: rm restbase::monitoring, remnants of module [puppet] - https://gerrit.wikimedia.org/r/419314
[23:54:47] <wikibugs>	 (CR) Dzahn: [C: 2] "profile::restbase already does this" [puppet] - https://gerrit.wikimedia.org/r/419314 (owner: Dzahn)
[23:54:57] <icinga-wm>	 RECOVERY - IPMI Sensor Status on restbase-dev1006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[23:55:24] <wikibugs>	 Operations, ops-eqiad, User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4048451 (Dzahn) stalled>Resolved a:Cmjohnson>Dzahn
[23:55:35] <wikibugs>	 (CR) Reedy: [C: 2] Add a localised logo for the Cree Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419332 (owner: Odder)
[23:56:21] <wikibugs>	 (Merged) jenkins-bot: Add a localised logo for the Cree Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419332 (owner: Odder)
[23:57:42] <jdlrobson>	 Reedy: any luck with the git issue?
[23:57:49] <Reedy>	 It's deployed
[23:57:57] <Reedy>	 11 minutes ago
[23:58:14] <tgr>	 Reedy: fwiw my umask is u=rwx,g=rwx,o=rx
[23:58:24] <Reedy>	 hmm
[23:58:30] <Reedy>	 you were definitely listed as the user...
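Reedy's umask theory can be checked with a little arithmetic. A new directory gets the requested mode (assumed here to be 0777, as with a plain `mkdir`) with every umask bit cleared. The Python sketch below also assumes the parent `.git/objects` directory is setgid, which the `s` in both listings suggests, so that on Linux new subdirectories inherit that bit. Under those assumptions, umask 022 reproduces the `drwxr-sr-x` listing for `e9`. Umask 002, which is what tgr's `u=rwx,g=rwx,o=rx` denotes, reproduces the group-writable `drwxrwsr-x` listing for `eb`. One possibility consistent with both messages is that `e9` was created by a process running with a different umask than tgr's login shell. The helper names are hypothetical, and the sketch ignores git-specific settings such as `core.sharedRepository`.

```python
import stat


def new_dir_mode(umask: int, setgid_parent: bool = True) -> int:
    """Permission bits for a directory created with mkdir(path, 0o777).

    The kernel clears every bit that is set in the umask. On Linux, a
    directory created inside a setgid directory also inherits the setgid bit.
    """
    mode = 0o777 & ~umask
    if setgid_parent:
        mode |= stat.S_ISGID
    return mode


def as_ls_string(mode: int) -> str:
    """Render permission bits the way `ls -l` does for a directory."""
    return stat.filemode(stat.S_IFDIR | mode)


# umask 022 (group may not write) vs. umask 002, i.e. u=rwx,g=rwx,o=rx
for umask in (0o022, 0o002):
    print(f"umask {umask:03o}: {as_ls_string(new_dir_mode(umask))}")
```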
[23:58:37] <mutante>	 robh: i think you removed all the restbase-test hosts but one.  test2002 survived in icinga
[23:59:50] <robh>	 ok, resent
[23:59:53] <robh>	 it should go away