[00:26:32] (CR) Bartosz Dziewoński: ":(" [puppet] - https://gerrit.wikimedia.org/r/258713 (owner: Ori.livneh)
[00:27:32] (CR) Ori.livneh: "Yeah, a proper fix would be to figure out what is still expecting ASCII in 2015 and get it to grow up, but with Puppet broken on rutherfor" [puppet] - https://gerrit.wikimedia.org/r/258713 (owner: Ori.livneh)
[01:03:43] (PS5) Ori.livneh: import debian directory [debs/bloomd] - https://gerrit.wikimedia.org/r/257168
[01:11:28] (CR) Ori.livneh: "Should work now." [debs/bloomd] - https://gerrit.wikimedia.org/r/257168 (owner: Ori.livneh)
[02:22:03] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 09m 08s)
[02:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Dec 13 02:28:52 UTC 2015 (duration 6m 49s)
[02:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:18] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4005 is CRITICAL: Connection refused
[03:18:17] PROBLEM - cassandra service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[03:31:49] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:09] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:38] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:09] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:50:27] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:00:48] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[04:00:48] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[04:01:17] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:06:48] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:06:48] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:06:49] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:06:57] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:07:27] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:07:49] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:07:49] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:18] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:19] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:49] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[04:09:27] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0]
[04:14:48] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:58] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[04:18:08] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient
[04:18:28] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:19:48] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:59] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:28] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:37:27] RECOVERY - Disk space on mw1013 is OK: DISK OK
[04:37:57] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient
[04:43:28] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:43:49] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:53:08] PROBLEM - NTP on mw1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[04:56:38] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:26:38] RECOVERY - NTP on mw1013 is OK: NTP OK: Offset 0.00269651413 secs
[05:40:34] (PS13) Yuvipanda: Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040) (owner: BryanDavis)
[05:41:08] (PS1) Yuvipanda: k8s: Add and trust docker official repository [puppet] - https://gerrit.wikimedia.org/r/258731
[05:41:12] (CR) Yuvipanda: [C: 2 V: 2] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040) (owner: BryanDavis)
[05:47:13] (PS2) Yuvipanda: k8s: Add and trust docker official repository [puppet] - https://gerrit.wikimedia.org/r/258731
[05:47:35] (PS3) Yuvipanda: k8s: Add and trust docker official repository [puppet] - https://gerrit.wikimedia.org/r/258731
[05:51:13] (PS4) Yuvipanda: k8s: Add and trust docker official repository [puppet] - https://gerrit.wikimedia.org/r/258731
[05:51:15] (PS1) Yuvipanda: base: Allow https debian repositories [puppet] - https://gerrit.wikimedia.org/r/258732
[05:51:31] (PS2) Yuvipanda: base: Allow https debian repositories! [puppet] - https://gerrit.wikimedia.org/r/258732
[05:51:47] (PS5) Yuvipanda: k8s: Add and trust docker official repository [puppet] - https://gerrit.wikimedia.org/r/258731
[05:55:02] (PS1) Yuvipanda: k8s: Use official docker packages instead of jessie's [puppet] - https://gerrit.wikimedia.org/r/258733
[05:55:17] (CR) Yuvipanda: [C: 2] base: Allow https debian repositories! [puppet] - https://gerrit.wikimedia.org/r/258732 (owner: Yuvipanda)
[05:55:31] (CR) Yuvipanda: [C: 2] k8s: Add and trust docker official repository [puppet] - https://gerrit.wikimedia.org/r/258731 (owner: Yuvipanda)
[05:59:01] (CR) Ori.livneh: "Already in modules/install_server/manifests/apt-repository.pp ; see Id615732bdb." [puppet] - https://gerrit.wikimedia.org/r/258732 (owner: Yuvipanda)
[06:01:01] ori: ugh, can you fix that? I think my or a nearby building is on fire.
[06:01:08] brb
[06:03:18] * legoktm hugs YuviPanda
[06:06:27] PROBLEM - NTP on mw1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:06:57] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail
[06:08:27] RECOVERY - NTP on mw1013 is OK: NTP OK: Offset 0.004920363426 secs
[06:08:28] carbon is probably the https thing
[06:08:41] i'll take a look in a moment
[06:18:57] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:23:57] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:57] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%)
[06:31:09] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%)
[06:31:28] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:09] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:19] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:19] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:29] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:48] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:08] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:19] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:29] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:29] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:44:38] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:07] PROBLEM - NTP on mw1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:49:48] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:54:28] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:08] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:56:18] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:56:33] yep
[06:56:43] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[apt-transport-https] is already declared in file /etc/puppet/modules/install_server/manifests/apt-repository.pp:24; cannot redeclare at /etc/puppet/modules/base/manifests/standard-packages.pp:45 on node carbon.wikimedia.org
[06:56:52] I can fix that
[06:57:08] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:49] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:58:08] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:09] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:18] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:58:28] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:48] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:48] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:38] (PS1) Ori.livneh: Revert "base: Allow https debian repositories!" [puppet] - https://gerrit.wikimedia.org/r/258735
[06:59:50] (PS2) Ori.livneh: Revert "base: Allow https debian repositories!" [puppet] - https://gerrit.wikimedia.org/r/258735
[06:59:58] (CR) Ori.livneh: [C: 2 V: 2] Revert "base: Allow https debian repositories!" [puppet] - https://gerrit.wikimedia.org/r/258735 (owner: Ori.livneh)
[07:00:24] YuviPanda: ^
[07:02:18] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:02:28] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:02:37] yep.
[07:05:35] !log elasticsearch1016: Moving /var/log/elasticsearch/* to /var/lib/elasticsearch/logs to free up space.
[07:05:38] dcausse: ^
[07:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:06:56] !log elasticsearch1012 ditto.
[07:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:09:37] RECOVERY - Disk space on elastic1016 is OK: DISK OK
[07:12:39] RECOVERY - Disk space on elastic1012 is OK: DISK OK
[07:14:58] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212
[07:14:58] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:17:48] RECOVERY - Disk space on mw1013 is OK: DISK OK
[07:19:38] PROBLEM - NTP on mw1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[07:20:58] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:58] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:39] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:29:56] ori: thanks / sorry. and my building really did catch fire
[07:30:06] heh. are you ok?
[07:30:19] ori: yeah, just a bit shaken and can't find the cats...
[07:30:29] they will let me back in in 30mins, sitting on the pavement now
[07:40:07] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 124916 MB (3% inode=99%)
[07:47:57] RECOVERY - Disk space on fluorine is OK: DISK OK
[07:58:18] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: puppet fail
[08:04:28] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[08:04:28] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212
[08:04:28] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[08:04:49] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up
[08:04:49] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed
[08:05:07] RECOVERY - NTP on mw1013 is OK: NTP OK: Offset 0.006927847862 secs
[08:05:07] RECOVERY - Disk space on mw1013 is OK: DISK OK
[08:05:38] RECOVERY - DPKG on mw1013 is OK: All packages OK
[08:05:48] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient
[08:05:48] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:10:18] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:25:57] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:43] (PS2) Yuvipanda: k8s: Use official docker packages instead of jessie's [puppet] - https://gerrit.wikimedia.org/r/258733
[08:58:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0]
[08:59:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[09:09:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[09:13:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[09:14:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[10:09:54] (CR) Brian Wolff: "Was this tested if files of those size hit OOM errors?" [mediawiki-config] - https://gerrit.wikimedia.org/r/255822 (owner: Reedy)
[10:24:21] !log elastic in eqiad: disabling TRACE indexing slowlog on jawiki_content and zhwiki_content
[10:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:41:18] ori: thanks, I think you moved opened files, will double check and maybe restart the node to actually free space
[10:44:21] dcausse: does elasticsearch not reopen its log files periodically?
[10:45:00] ori: I think it should but I can see them in lsof
[10:45:47] I looked over the logrotate file in /etc/logrotate.d/elasticsearch and concluded it would be safe, but I missed the 'copytruncate' line.
[10:47:10] ok then I will restart the node early this afternoon
[10:54:10] !log extend fluorine's storage lv (94% util) lvresize --size +300G --resizefs /dev/mapper/vg0-lv0
[10:54:13] ori: ^
[10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:10:19] !log elastic in eqiad: restarting elastic1012 to release opened log filedesc
[11:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:37:04] there is some high activity regarding replication now on s7
[15:11:12] (PS7) BBlack: varnish: move file to module [puppet] - https://gerrit.wikimedia.org/r/253457 (owner: Dzahn)
[15:14:56] (CR) BBlack: [C: 1] "I pulled out the linting part, as it was whitespace-only and seemed like it was moving in the wrong direction." [puppet] - https://gerrit.wikimedia.org/r/253457 (owner: Dzahn)
[15:51:39] (PS1) BBlack: Remove expired sni.foo certs [puppet] - https://gerrit.wikimedia.org/r/258752
[15:52:09] (CR) BBlack: [C: 2 V: 2] Remove expired sni.foo certs [puppet] - https://gerrit.wikimedia.org/r/258752 (owner: BBlack)
[17:02:00] (PS1) BBlack: ssl_ciphersuite: prefer ECDHE to DHE within strong list [puppet] - https://gerrit.wikimedia.org/r/258762
[17:02:02] (PS1) BBlack: ssl_ciphersuite: prefer SHA-2 HMACs more-strongly [puppet] - https://gerrit.wikimedia.org/r/258763
[17:41:00] hi, it seems that externallinks table in hewiki shows links which doesn't exist
[17:46:52] operations: externallinks in hewiki DB have wrong links - https://phabricator.wikimedia.org/T121353#1876430 (eranroz) NEW
[17:59:13] operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1876438 (BBlack) We still haven't had time to deal with HPKP properly yet, so this task has been kinda stalled-out for quite a while. While the actual HPKP pinsets (and asso...
[18:03:57] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: puppet fail
[18:31:37] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:38:28] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: puppet fail
[20:05:59] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[21:04:27] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]
[21:05:09] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]
[23:08:57] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0
[23:09:39] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0
[23:11:09] PROBLEM - puppet last run on mw2029 is CRITICAL: CRITICAL: puppet fail
[23:40:38] RECOVERY - puppet last run on mw2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
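
Note on the 10:41 to 11:10 exchange about moved log files: the point dcausse raises ("you moved opened files", still visible in lsof) can be illustrated with a minimal Python sketch. It assumes a POSIX system, and it assumes that the move removed the original path while the writer still held it open (as happens when `mv` crosses filesystems; the log does not say whether /var/log and /var/lib/elasticsearch were on different filesystems). The function name `unlinked_but_open_demo` is hypothetical and not part of any tool mentioned above.

```python
import os
import tempfile


def unlinked_but_open_demo(payload: bytes = b"x" * 4096):
    """Unlink a file while a descriptor is still open on it.

    Returns (link_count, size_in_bytes) as seen through the open
    descriptor after the path has been removed.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # Remove the directory entry, as the source side of a
        # cross-filesystem move would (assumption, see lead-in).
        os.unlink(path)
        # The writer is unaffected and keeps appending to the same inode.
        os.write(fd, payload)
        st = os.fstat(fd)
        return st.st_nlink, st.st_size
    finally:
        # Only once the last descriptor is closed can the kernel
        # release the blocks held by the unlinked inode.
        os.close(fd)


links, size = unlinked_but_open_demo()
print(f"links={links} size={size}")
```

On a typical Linux filesystem the link count comes back as zero while the size keeps growing, which is consistent with the later decision to restart the node so the descriptors are closed.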