[00:03:59] (PS1) Dzahn: installserver: let bast4001 use carbon [puppet] - https://gerrit.wikimedia.org/r/284992
[00:04:34] (CR) Dzahn: "nothing works, install12001 -> timeout, install1001 -> no tftp serving" [puppet] - https://gerrit.wikimedia.org/r/284992 (owner: Dzahn)
[00:04:39] (PS7) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[00:04:48] (PS2) Dzahn: installserver: let bast4001 use carbon [puppet] - https://gerrit.wikimedia.org/r/284992
[00:07:15] !log deactivate phabricator account for ktr101 per global ban
[00:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:10] (CR) Dzahn: [C: 2] installserver: let bast4001 use carbon [puppet] - https://gerrit.wikimedia.org/r/284992 (owner: Dzahn)
[00:29:22] since when have you been able to do that Jamesofur?
[00:30:00] (PS8) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[00:30:03] Krenair: couple months?
[00:30:22] * Jamesofur was given the keys a bit ago for similar purposes by the gods of phabricator
[00:32:41] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.75 ms
[00:32:53] (PS9) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[00:33:14] Operations, Performance-Team, Traffic, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2231806 (BBlack) If you have time and want to do it (next week!), by all means go for it, I have lots else to keep me busy indefinitely :) My basic plan was try it for an hour on a...
[00:37:31] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[00:40:00] (PS2) Dzahn: install_server: move tftp role to autoloader layout [puppet] - https://gerrit.wikimedia.org/r/284793 (https://phabricator.wikimedia.org/T132757)
[00:40:31] (CR) Dzahn: [C: 2] "no-op on carbon, install2001, bast4001 ..." [puppet] - https://gerrit.wikimedia.org/r/284793 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[00:43:41] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:55:32] (PS10) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[00:56:38] (CR) jenkins-bot: [V: -1] Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[00:59:21] (PS11) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[01:08:00] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[01:26:00] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:32:26] (CR) Catrope: [C: 1] Use notify-type-availability due to Echo change [mediawiki-config] - https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: Mattflaschen)
[01:37:33] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[01:43:33] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[02:14:57] (PS3) Catrope: Use notify-type-availability due to Echo change [mediawiki-config] - https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: Mattflaschen)
[02:17:23] (CR) Catrope: "PS3 adds back compat code that we will need because we will have some wikis running the new code and the rest running the old code for a f" [mediawiki-config] - https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: Mattflaschen)
[02:22:49] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 10m 27s)
[02:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:36] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[02:43:37] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:07:57] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[03:14:06] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[04:37:29] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[04:43:29] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:07:45] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[06:13:53] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:30:23] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail
[06:31:04] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:15] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:37] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:49] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:37] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:08] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:48] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:47] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[06:39:38] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:43:47] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:55:47] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:56:07] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:08] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:09] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:17] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:28] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:52] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[07:12:51] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[07:40:47] PROBLEM - puppet last run on mw2059 is CRITICAL: CRITICAL: puppet fail
[07:46:10] Operations: Investigate Ubuntu fork of ttf-indic-fonts and bring it in Jessie - https://phabricator.wikimedia.org/T103328#1387212 (KartikMistry) fonts-indic should be fine (I know I'm late to comment but, still).
[07:52:48] PROBLEM - RAID on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:53:08] PROBLEM - configured eth on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:53:27] PROBLEM - DPKG on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:53:28] PROBLEM - dhclient process on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:53:38] PROBLEM - salt-minion processes on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:53:47] PROBLEM - puppet last run on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:09] PROBLEM - Disk space on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:58] RECOVERY - configured eth on hassaleh is OK: OK - interfaces up
[07:55:09] RECOVERY - DPKG on hassaleh is OK: All packages OK
[07:55:17] RECOVERY - dhclient process on hassaleh is OK: PROCS OK: 0 processes with command name dhclient
[07:55:28] RECOVERY - salt-minion processes on hassaleh is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:55:37] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
[07:55:58] RECOVERY - Disk space on hassaleh is OK: DISK OK
[07:56:38] RECOVERY - RAID on hassaleh is OK: OK: no RAID installed
[08:08:58] RECOVERY - puppet last run on mw2059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:27:08] ACKNOWLEDGEMENT - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Filippo Giunchedi bootstrapping new hardware
[08:37:07] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[08:43:17] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[08:59:08] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:17] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:17] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:31] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:37] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:48] PROBLEM - puppet last run on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:17] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:39] PROBLEM - RAID on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:58] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:40:24] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2232007 (hashar) Seems `10-self.conf` varies the hostname fqdn :( ```...
[09:49:35] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2232021 (Aklapper)
[09:57:24] Operations, Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2232043 (Southparkfan)
[10:05:55] PROBLEM - NTP on meitnerium is CRITICAL: NTP CRITICAL: No response from NTP server
[10:06:44] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:27:14] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:29:14] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:06:33] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[11:08:34] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:26:54] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[11:33:25] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: puppet fail
[11:37:23] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:43:34] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[11:45:34] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:51:34] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[11:57:45] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:59:55] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[12:06:50] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[12:14:50] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:17:10] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:20:42] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[12:22:41] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:41:54] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:42:54] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[12:44:54] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:50:46] (PS1) Urbanecm: Add Subject namespace to hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440)
[13:01:04] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[13:05:14] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:11:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 657 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5183040 keys - replication_delay is 657
[13:11:14] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[13:19:15] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5137373 keys - replication_delay is 0
[13:27:23] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:32:53] (PS1) Urbanecm: Enable DynamicPageList extension on tewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T133032)
[13:57:51] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[14:12:11] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:18:21] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[14:24:22] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:30:31] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[14:34:41] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:40:43] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[14:42:51] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:55:01] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[15:11:23] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[15:11:23] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
[15:13:03] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[15:13:14] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms
[15:14:44] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 82.83 ms
[15:16:54] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:44] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:18:54] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:19:04] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[15:19:23] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:19:24] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:19:44] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:19:45] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:03] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:03] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:04] PROBLEM - puppet last run on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:25] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:21:54] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[15:23:04] PROBLEM - Disk space on cp1008 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=84%)
[15:24:55] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[15:28:05] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:28:15] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:28:33] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:28:35] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:31:14] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[15:33:34] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail
[15:38:54] Operations, MobileFrontend, Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2232357 (BBlack) p:Triage>High This doesn't seem to be a (varnish) cache effect. I get the same behavior on your test URLs when forcing cache misses. If thi...
[15:41:02] anyone who's around, there's something fishy going on with mobile-vs-desktop rendering, noted on foundation wiki only so far: ^
[15:43:33] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm
[15:43:44] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212
[15:43:45] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up
[15:44:04] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full
[15:44:13] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[15:44:24] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[15:44:24] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[15:44:33] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures
[15:44:37] bblack: it's completely random. I'm trying to reproduce this on other wikis now, but haven't found anything yet
[15:44:43] RECOVERY - DPKG on mw1142 is OK: All packages OK
[15:44:53] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:44:54] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[15:44:55] RECOVERY - Disk space on mw1142 is OK: DISK OK
[15:45:33] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[15:46:33] SPF|Cloud: it helps to evade the varnish caches by appending random query URLs, e.g. /wiki/Foo?a3gt89h45h=oeairglj
[15:47:04] Using the Firefox extension for appending X-Wikimedia-Debug should work too
[15:47:12] (I guess?)
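(Editor's aside on the 15:46:33 tip about appending a random query string to get past the Varnish caches: a minimal sketch of that idea, assuming the cache keys on the full URL and the backend ignores unknown parameters. The helper name `cache_bust` and the `example.org` host are hypothetical and not from the log.)

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_bust(url: str, nbytes: int = 8) -> str:
    """Return `url` with one extra random key=value query parameter.

    Hypothetical helper: it only changes the URL so that a cache keyed on
    the full URL is likely to miss. Whether the server ignores the extra
    parameter is an assumption about the deployment.
    """
    parts = urlsplit(url)
    # Keep any existing query parameters, in their original order.
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Append a random key and a random value, both hex strings.
    query.append((secrets.token_hex(nbytes), secrets.token_hex(nbytes)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Prints the URL with a different random suffix on every run.
    print(cache_bust("https://example.org/wiki/Foo"))
```

(Each call yields a different URL, so repeated requests should not share a cache entry under the stated assumption.)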
[15:47:15] yeah :)
[15:50:43] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: puppet fail
[15:51:22] (PS1) Ladsgroup: ores: fix staging configs [puppet] - https://gerrit.wikimedia.org/r/285010
[15:53:43] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[16:00:04] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:01:54] Operations, MobileFrontend, Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2232375 (Southparkfan) Yeah, this doesn't seem to be a Varnish problem: I tried the following to force the mobile version of the site on a non mobile-site (nl.wikipe...
[16:10:42] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[16:22:42] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[16:24:03] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0]
[16:39:03] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[16:46:32] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[16:49:04] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[16:55:04] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[17:02:41] (PS1) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[17:03:44] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[17:07:23] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[17:21:42] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[17:27:43] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[17:29:26] Operations, Ops-Access-Requests: Access Request - https://phabricator.wikimedia.org/T133464#2232536 (Zppix)
[17:39:14] Operations, Ops-Access-Requests: Access Request - https://phabricator.wikimedia.org/T133464#2232536 (JanZerebecki) Which permission is required for that?
[17:40:02] Operations, Ops-Access-Requests: Access Request - https://phabricator.wikimedia.org/T133464#2232554 (JanZerebecki) Context: T131132
[18:15:03] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[18:21:04] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[18:23:12] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[18:27:14] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:29:24] PROBLEM - SSH on meitnerium is CRITICAL: Server answer
[18:30:43] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:32:29] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:33:00] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[18:33:10] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:35:00] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:41:06] Operations, Ops-Access-Requests: Access Request - https://phabricator.wikimedia.org/T133464#2232625 (Krenair) Open>Invalid This is not a server access request, there just needs to be code patches submitted to gerrit, and approved by someone with +2 in the appropriate repository. (+2 rights are re...
[18:58:50] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[19:07:39] PROBLEM - SSH on meitnerium is CRITICAL: Connection timed out
[19:08:18] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:08:28] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full
[19:08:39] RECOVERY - DPKG on meitnerium is OK: All packages OK
[19:08:59] RECOVERY - Disk space on meitnerium is OK: DISK OK
[19:09:19] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient
[19:09:24] <_joe_> !log rebooted meitnerium
[19:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:09:38] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[19:09:48] RECOVERY - RAID on meitnerium is OK: OK: no RAID installed
[19:09:49] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up
[19:11:58] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:26:50] RECOVERY - NTP on meitnerium is OK: NTP OK: Offset -0.001602768898 secs
[19:46:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[19:48:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5133073 keys - replication_delay is 0
[20:13:16] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5135044 keys - replication_delay is 623
[20:21:18] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5134385 keys - replication_delay is 0
[20:33:46] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[20:35:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5134402 keys - replication_delay is 0
[20:45:47] RECOVERY - Disk space on cp1008 is OK: DISK OK
[20:49:36] PROBLEM - Varnish HTCP daemon on cp1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd
[20:51:17] PROBLEM - Varnishkafka log producer on cp1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[20:54:50] Operations, Wikimedia-Mailing-lists: Reset administrator password for nlcheckuser-l mailing list - https://phabricator.wikimedia.org/T133449#2232733 (MarcoAurelio) p:Triage>Normal
[21:02:25] re-downtimed cp1008, its long-term downtime had expired heh
[22:22:29] PROBLEM - MariaDB Slave Lag: s2 on db2064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.88 seconds
[22:22:38] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.17 seconds
[22:23:19] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.13 seconds
[22:24:38] RECOVERY - MariaDB Slave Lag: s2 on db2064 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[22:24:39] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.27 seconds
[22:25:28] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.22 seconds
[22:47:11] (PS1) Nuria: Read values inbound in X-Analytics header (pageview and preview) [puppet] - https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204)
[22:48:20] (PS2) Nuria: Read values inbound in X-Analytics header (pageview and preview) [puppet] - https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204)
[23:11:38] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [100000000.0]
[23:16:51] (PS2) BryanDavis: logstash: Make truncated MediaWiki json easier to find [puppet] - https://gerrit.wikimedia.org/r/278315
[23:18:04] (PS8) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812)
[23:18:06] (PS8) BBlack: create letsencrypt module, install acme-tiny [puppet] - https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: Dzahn)
[23:19:34] (CR) jenkins-bot: [V: -1] letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: BBlack)
[23:26:55] (PS9) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812)
[23:34:55] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[23:35:28] (PS10) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812)
[23:44:24] RECOVERY - Varnish HTCP daemon on cp1008 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[23:44:25] RECOVERY - Varnishkafka log producer on cp1008 is OK: PROCS OK: 1 process with command name varnishkafka
[23:51:03] (PS11) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812)