[00:07:16] (03PS2) 10Tim Starling: Move idiosyncratic gdbinit to /home/ori [puppet] - 10https://gerrit.wikimedia.org/r/176307 [00:08:37] (03CR) 10Tim Starling: "Updated commit message since the previous commit message was elliptical." [puppet] - 10https://gerrit.wikimedia.org/r/176307 (owner: 10Tim Starling) [00:25:23] (03CR) 10Ori.livneh: [C: 031] "You'll need to either amend the change to ensure => absent the global file, or you'll need to clean it up manually. You can clean it up ma" [puppet] - 10https://gerrit.wikimedia.org/r/176307 (owner: 10Tim Starling) [00:49:52] (03PS2) 10Tim Starling: xhprof production profiling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 [00:50:24] (03CR) 10Tim Starling: [C: 032] xhprof production profiling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 (owner: 10Tim Starling) [00:50:35] (03Merged) 10jenkins-bot: xhprof production profiling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 (owner: 10Tim Starling) [00:51:54] !log tstarling Synchronized wmf-config/StartProfiler.php: (no message) (duration: 00m 05s) [00:52:00] Logged the message, Master [00:54:37] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: /srv 18587 MB (3% inode=97%): [00:55:56] PROBLEM - HTTPS on antimony is CRITICAL: SSL_CERT CRITICAL svn.wikimedia.org: certificate will expire on Jan 31 10:53:05 2015 GMT [00:57:03] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 578 seconds [00:58:52] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 677 seconds [01:03:18] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:04:37] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [01:05:37] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:06:59] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [01:07:06] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [01:07:06] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:08:06] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [01:08:56] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:05] RECOVERY - HHVM busy threads on mw1235 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:05] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:16] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:17] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds [01:11:24] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8] [01:11:26] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [01:12:45] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8] [01:13:57] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8] [01:32:18] (03PS1) 10Ori.livneh: Provision HHVM source tree in /usr/src instead of /usr/local/src [puppet] - 10https://gerrit.wikimedia.org/r/176624 [01:33:48] 
(03CR) 10Ori.livneh: "Follow-up change: Id1d40d5cb" [puppet] - 10https://gerrit.wikimedia.org/r/176307 (owner: 10Tim Starling) [01:49:18] (03CR) 10Aude: [C: 031] Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (owner: 10Dereckson) [02:10:30] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s) [02:10:34] !log LocalisationUpdate completed (1.25wmf9) at 2014-12-01 02:10:33+00:00 [02:10:40] Logged the message, Master [02:10:42] Logged the message, Master [02:17:53] !log l10nupdate Synchronized php-1.25wmf10/cache/l10n: (no message) (duration: 00m 03s) [02:17:56] Logged the message, Master [02:17:56] !log LocalisationUpdate completed (1.25wmf10) at 2014-12-01 02:17:56+00:00 [02:17:59] Logged the message, Master [03:34:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 1 03:34:57 UTC 2014 (duration 34m 56s) [03:35:04] Logged the message, Master [03:52:14] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [03:54:54] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8] [04:06:04] (03CR) 10KartikMistry: [C: 031] Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (owner: 10Dereckson) [04:24:57] (03PS6) 10KartikMistry: Add ContentTranslation in wikishared DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175979 [04:26:52] (03Abandoned) 10Ori.livneh: Add site-list.json for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174885 (owner: 10Ori.livneh) [04:36:14] _joe_: can we make sure all hhvm servers are running the same package today? i see that 38 are running 3.3.0+dfsg1-1+wm4 and another 21 are running 3.3.0+dfsg1-1+wm3.1 [04:44:23] !log tstarling Synchronized php-1.25wmf9/includes/parser/MWTidy.php: change previously pulled but scap was apparently not run (duration: 00m 06s) [04:44:25] Logged the message, Master [04:46:26] !log tstarling Synchronized php-1.25wmf10/includes/parser/MWTidy.php: change previously pulled but scap was apparently not run (duration: 00m 05s) [04:46:28] Logged the message, Master [05:28:33] (03Abandoned) 10Tim Landscheidt: Set up redirects for toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [05:31:36] (03Abandoned) 10Tim Landscheidt: Script to test Toolserver redirects [software] - 10https://gerrit.wikimedia.org/r/108467 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [05:43:23] <_joe_> ori: yep [05:43:41] <_joe_> the older ones are running wm3.1 probably, [06:34:10] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [06:34:39] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:40] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:59] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:02] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [06:35:38] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:43:41] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:01] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:34] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [06:47:48] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:02] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:05] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:14] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:49:03] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:53:35] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:57] (03PS1) 10Legoktm: extdist: Add composer location to config [puppet] - 10https://gerrit.wikimedia.org/r/176631 [07:17:54] (03CR) 10Legoktm: "I26b1a2710d25a90a71d9bebd86e3c447793d8567 is the labs/tools/extdist change to use this." [puppet] - 10https://gerrit.wikimedia.org/r/176631 (owner: 10Legoktm) [09:14:57] (03PS3) 10Giuseppe Lavagetto: hiera: role-based backend, role keyword [puppet] - 10https://gerrit.wikimedia.org/r/176334 [09:22:44] (03CR) 10Hashar: [C: 031] Remove -hhvm suffix from beta multiversion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173486 (owner: 10Reedy) [09:30:56] <_joe_> !log upgrading hhvm to the latest version across the cluster [09:31:05] Logged the message, Master [09:55:47] <_joe_> !log reimaging mw1033-mw1040 to HHVM, depooling from the main pool now [09:55:51] Logged the message, Master [10:07:58] PROBLEM - Host mw1233 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:29] PROBLEM - Host mw1234 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:32] RECOVERY - Host mw1233 is UP: PING WARNING - Packet loss = 50%, RTA = 1.04 ms [10:10:13] RECOVERY - Host mw1234 is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [10:10:19] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:11:59] PROBLEM - nutcracker process on mw1233 is CRITICAL: Connection refused by host [10:12:09] <_joe_> and I scheduled downtime... 
[10:12:24] PROBLEM - check configured eth on mw1233 is CRITICAL: Connection refused by host [10:12:39] PROBLEM - HHVM processes on mw1233 is CRITICAL: Connection refused by host [10:12:39] PROBLEM - nutcracker port on mw1233 is CRITICAL: Connection refused by host [10:14:29] PROBLEM - check if salt-minion is running on mw1233 is CRITICAL: Connection refused by host [10:14:32] PROBLEM - DPKG on mw1233 is CRITICAL: Connection refused by host [10:15:22] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:16:10] PROBLEM - Disk space on mw1233 is CRITICAL: Connection refused by host [10:16:19] PROBLEM - RAID on mw1233 is CRITICAL: Connection refused by host [10:16:20] PROBLEM - check if dhclient is running on mw1233 is CRITICAL: Connection refused by host [10:16:41] PROBLEM - check if salt-minion is running on mw1234 is CRITICAL: Connection refused by host [10:16:42] PROBLEM - Apache HTTP on mw1233 is CRITICAL: Connection refused [10:16:45] PROBLEM - Disk space on mw1234 is CRITICAL: Connection refused by host [10:16:46] PROBLEM - HHVM processes on mw1235 is CRITICAL: Connection refused by host [10:16:52] PROBLEM - Host mw1236 is DOWN: PING CRITICAL - Packet loss = 100% [10:16:53] PROBLEM - puppet last run on mw1233 is CRITICAL: Connection refused by host [10:17:00] PROBLEM - HHVM rendering on mw1233 is CRITICAL: Connection refused [10:17:12] PROBLEM - HHVM rendering on mw1234 is CRITICAL: Connection refused [10:17:12] RECOVERY - Host mw1236 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [10:17:12] PROBLEM - SSH on mw1233 is CRITICAL: Connection refused [10:17:20] PROBLEM - Apache HTTP on mw1234 is CRITICAL: Connection refused [10:17:52] PROBLEM - check configured eth on mw1234 is CRITICAL: Connection refused by host [10:17:53] PROBLEM - nutcracker process on mw1234 is CRITICAL: Connection refused by host [10:17:53] PROBLEM - DPKG on mw1234 is CRITICAL: Connection refused by host [10:18:18] PROBLEM - SSH on mw1234 is CRITICAL: Connection refused [10:18:25] PROBLEM - RAID on mw1234 is CRITICAL: Connection refused by host [10:18:25] PROBLEM - nutcracker port on mw1234 is CRITICAL: Connection refused by host [10:18:25] PROBLEM - HHVM processes on mw1234 is CRITICAL: Connection refused by host [10:18:46] PROBLEM - puppet last run on mw1234 is CRITICAL: Connection refused by host [10:19:17] PROBLEM - check if dhclient is running on mw1234 is CRITICAL: Connection refused by host [10:19:55] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [10:20:08] PROBLEM - HHVM rendering on mw1235 is CRITICAL: Connection refused [10:20:08] RECOVERY - SSH on mw1233 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:20:18] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:20:35] PROBLEM - HHVM processes on mw1236 is CRITICAL: Connection refused by host [10:20:36] PROBLEM - RAID on mw1235 is CRITICAL: Connection refused by host [10:20:49] PROBLEM - nutcracker port on mw1235 is CRITICAL: Connection refused by host [10:20:49] PROBLEM - check configured eth on mw1235 is CRITICAL: Connection refused by host [10:21:06] PROBLEM - check if salt-minion is running on mw1236 is CRITICAL: Connection refused by host [10:21:11] PROBLEM - SSH on mw1235 is CRITICAL: Connection refused [10:21:11] PROBLEM - puppet last run on mw1235 is CRITICAL: Connection refused by host [10:21:19] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold 
[115.2] [10:22:22] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [10:23:00] PROBLEM - HHVM queue size on mw1224 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0] [10:23:00] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [10:23:12] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [10:23:53] PROBLEM - HHVM queue size on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0] [10:24:04] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8] [10:25:19] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:25:44] RECOVERY - HHVM queue size on mw1224 is OK: OK: Less than 1.00% above the threshold [10.0] [10:25:45] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8] [10:26:25] RECOVERY - HHVM queue size on mw1222 is OK: OK: Less than 1.00% above the threshold [10.0] [10:28:35] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8] [10:28:54] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8] [10:30:21] RECOVERY - HHVM busy threads on mw1224 is OK: OK: Less than 1.00% above the threshold [76.8] [10:30:25] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 121 seconds ago with 0 failures [10:46:07] PROBLEM - DPKG on mw1233 is CRITICAL: Connection refused by host [10:46:07] PROBLEM - puppet last run on mw1233 is CRITICAL: Connection refused by host [10:46:46] PROBLEM - Disk space on mw1233 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:49:33] RECOVERY - Disk space on mw1233 is OK: DISK OK [10:54:25] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 112 failures [10:56:36] PROBLEM - Host mw1233 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:49] _joe_: next time schedule a downtime :D [10:57:20] RECOVERY - DPKG on mw1233 is OK: All packages OK [10:57:23] <_joe_> matanya: I did... 
[10:57:26] RECOVERY - Host mw1233 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [10:57:46] mw1033 != mw1233 [10:57:54] <_joe_> for both [10:57:57] * matanya is stupid [11:04:31] PROBLEM - Apache HTTP on mw1233 is CRITICAL: Connection refused [11:05:38] PROBLEM - HHVM rendering on mw1233 is CRITICAL: Connection refused [11:06:19] PROBLEM - Apache HTTP on mw1236 is CRITICAL: Connection refused [11:07:19] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 112 failures [11:07:49] PROBLEM - HHVM rendering on mw1234 is CRITICAL: Connection refused [11:08:20] PROBLEM - HHVM rendering on mw1235 is CRITICAL: Connection refused [11:08:54] PROBLEM - HHVM rendering on mw1236 is CRITICAL: Connection refused [11:10:08] PROBLEM - Apache HTTP on mw1234 is CRITICAL: Connection refused [11:10:33] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 8 failures [11:10:38] PROBLEM - Apache HTTP on mw1235 is CRITICAL: Connection refused [11:10:52] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 112 failures [11:11:39] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 4.503 second response time [11:12:09] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [11:12:59] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.701 second response time [11:13:10] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv 18603 MB (3% inode=97%): [11:13:10] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:10] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.634 second response time [11:13:22] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:42] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.637 second response time [11:13:43] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 1.873 second response time [11:13:48] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:13:58] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 1.838 second response time [11:14:18] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 1.870 second response time [11:14:29] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:26:20] PROBLEM - check configured eth on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:01] PROBLEM - check if dhclient is running on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:20] PROBLEM - check if salt-minion is running on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:59] PROBLEM - nutcracker port on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:10] PROBLEM - nutcracker process on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:33] PROBLEM - DPKG on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:33] PROBLEM - puppet last run on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:28:37] PROBLEM - Disk space on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:29:20] PROBLEM - HHVM processes on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:30:10] PROBLEM - RAID on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:31:34] RECOVERY - Disk space on mw1034 is OK: DISK OK [11:31:55] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: Puppet has 112 failures [11:31:56] RECOVERY - check configured eth on mw1034 is OK: NRPE: Unable to read output [11:32:05] RECOVERY - HHVM processes on mw1034 is OK: PROCS OK: 1 process with command name hhvm [11:32:35] RECOVERY - check if dhclient is running on mw1034 is OK: PROCS OK: 0 processes with command name dhclient [11:32:55] RECOVERY - check if salt-minion is running on mw1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:33:08] RECOVERY - RAID on mw1034 is OK: OK: no RAID installed [11:33:37] RECOVERY - nutcracker port on mw1034 is OK: TCP OK - 0.000 second response time on port 11212 [11:33:38] RECOVERY - nutcracker process on mw1034 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:33:57] RECOVERY - DPKG on mw1034 is OK: All packages OK [11:39:37] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 111 failures [11:45:28] PROBLEM - HHVM rendering on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:16] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:06] RECOVERY - HHVM rendering on mw1034 is OK: HTTP OK: HTTP/1.1 200 OK - 72514 bytes in 0.312 second response time [11:50:46] PROBLEM - puppet last run on mw1033 is CRITICAL: CRITICAL: Puppet has 112 failures [11:51:55] PROBLEM - HHVM rendering on mw1035 is CRITICAL: Connection refused [11:57:36] RECOVERY - HHVM rendering on mw1035 is OK: HTTP OK: HTTP/1.1 200 OK - 72515 bytes in 2.988 second response time [11:59:19] RECOVERY - puppet last run on mw1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:27] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:17:25] (03CR) 10Yuvipanda: [C: 032] extdist: Add composer location to config [puppet] - 10https://gerrit.wikimedia.org/r/176631 (owner: 10Legoktm) [12:18:05] legoktm: ^ merged [12:18:43] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [12:19:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [12:30:04] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:39:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:56:00] PROBLEM - nutcracker process on mw1039 is CRITICAL: Connection refused by host [12:57:11] (03PS1) 10Giuseppe Lavagetto: monitoring: refine alarms on HHVM [puppet] - 10https://gerrit.wikimedia.org/r/176653 [12:57:29] <_joe_> what part of "scheduled downtime" you don't understand, icinga? 
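A quick aside on the recurring "scheduled downtime" pain above: downtime is normally pushed into Icinga through its external command file, so a by-hand version looks roughly like the sketch below. The command-file path, host name and duration are assumptions, not taken from this log, and as _joe_ explains further down, a reimage that regenerates the host's checks (via naggen) can silently drop a downtime that was already scheduled.

    # Rough sketch: schedule two hours of downtime for a host and all of its
    # services via Icinga's external command pipe (path is an assumption).
    now=$(date +%s)
    end=$((now + 2*3600))
    cmdfile=/var/lib/icinga/rw/icinga.cmd

    printf '[%s] SCHEDULE_HOST_DOWNTIME;mw1233;%s;%s;1;0;7200;joe;reimage\n' \
        "$now" "$now" "$end" > "$cmdfile"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;mw1233;%s;%s;1;0;7200;joe;reimage\n' \
        "$now" "$now" "$end" > "$cmdfile"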
[12:58:03] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: refine alarms on HHVM [puppet] - 10https://gerrit.wikimedia.org/r/176653 (owner: 10Giuseppe Lavagetto) [12:59:13] PROBLEM - puppet last run on mw1039 is CRITICAL: Connection refused by host [12:59:25] PROBLEM - check if dhclient is running on mw1039 is CRITICAL: Connection refused by host [13:00:07] PROBLEM - Apache HTTP on mw1039 is CRITICAL: Connection refused [13:00:25] PROBLEM - DPKG on mw1039 is CRITICAL: Connection refused by host [13:00:34] PROBLEM - SSH on mw1039 is CRITICAL: Connection refused [13:00:35] PROBLEM - Disk space on mw1039 is CRITICAL: Connection refused by host [13:01:17] PROBLEM - RAID on mw1039 is CRITICAL: Connection refused by host [13:01:27] PROBLEM - check if salt-minion is running on mw1039 is CRITICAL: Connection refused by host [13:01:29] PROBLEM - check configured eth on mw1039 is CRITICAL: Connection refused by host [13:01:49] PROBLEM - nutcracker port on mw1039 is CRITICAL: Connection refused by host [13:09:09] RECOVERY - SSH on mw1039 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [13:17:33] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.014 second response time [13:20:56] RECOVERY - check if dhclient is running on mw1039 is OK: PROCS OK: 0 processes with command name dhclient [13:21:15] RECOVERY - nutcracker port on mw1039 is OK: TCP OK - 0.000 second response time on port 11212 [13:21:15] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 1 failures [13:21:34] RECOVERY - nutcracker process on mw1039 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:22:41] RECOVERY - Disk space on mw1039 is OK: DISK OK [13:24:04] RECOVERY - RAID on mw1039 is OK: OK: no RAID installed [13:24:35] RECOVERY - check configured eth on mw1039 is OK: NRPE: Unable to read output [13:24:52] RECOVERY - check if salt-minion is running on mw1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:25:06] RECOVERY - DPKG on mw1039 is OK: All packages OK [13:30:07] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 112 failures [13:32:30] (03PS1) 10Faidon Liambotis: Unbreak misc::statistics on <= precise systems [puppet] - 10https://gerrit.wikimedia.org/r/176656 [13:33:51] (03CR) 10Faidon Liambotis: [C: 032] Unbreak misc::statistics on <= precise systems [puppet] - 10https://gerrit.wikimedia.org/r/176656 (owner: 10Faidon Liambotis) [13:36:56] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:37:37] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:37:42] _joe_: since you're in a fixing HHVM alerts mood :), a bunch of "PROCS WARNING: 2 processes with command name 'hhvm'" alerts [13:37:58] <_joe_> paravoid: yes it's next on the list [13:38:07] also ocg is apparently full [13:38:08] again [13:38:10] <_joe_> when I wrote that I didn't expect we shelled out so often [13:38:26] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:38:28] <_joe_> paravoid: it was at 98% this morning and I forgot to check [13:38:36] <_joe_> only this time it's the /srv dir [13:38:39] 97%, 99% and 100% [13:38:49] also mw1190: eth0 has different negotiated speed than requested [13:39:03] and mw1039 with a bunch of errors [13:39:20] <_joe_> paravoid: that's me reimaging and scheduled downtime 
disappearing magically [13:39:33] 1039 you mean [13:39:35] <_joe_> yes [13:39:38] 1190 is a broken cable probably [13:40:32] <_joe_> yes [13:40:46] <_joe_> ok I'll take a look at ocg [13:40:49] <_joe_> gee [13:40:59] <_joe_> they probably changed the cache retention policy [13:41:41] godog: ms-be2014 runs puppet and isn't very happy about bcache [13:41:49] puppet isn't that is [13:44:00] <_joe_> !log removing cache files from ocg1001, when they're older than 3 days [13:44:06] Logged the message, Master [13:44:09] RECOVERY - check if salt-minion is running on stat1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:46:56] <_joe_> !log removing the same files from ocg1002,3 as well [13:46:58] Logged the message, Master [13:48:44] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:52:42] RECOVERY - Disk space on ocg1001 is OK: DISK OK [13:53:01] (03PS1) 10Faidon Liambotis: hhvm: remove check_procs check [puppet] - 10https://gerrit.wikimedia.org/r/176659 [13:53:04] RECOVERY - Disk space on ocg1003 is OK: DISK OK [13:53:11] _joe_: do you mind if I merge this? [13:53:26] <_joe_> not at all :) [13:54:03] <_joe_> It's quite useless now in fact [13:54:05] RECOVERY - Disk space on ocg1002 is OK: DISK OK [13:54:21] I can make it 1: [13:54:39] I guess that's better [13:54:50] I'll remove the -w 1:1 part [13:55:52] (03PS2) 10Faidon Liambotis: hhvm: remove check_procs' WARNING state [puppet] - 10https://gerrit.wikimedia.org/r/176659 [13:56:54] (03CR) 10Faidon Liambotis: [C: 032] hhvm: remove check_procs' WARNING state [puppet] - 10https://gerrit.wikimedia.org/r/176659 (owner: 10Faidon Liambotis) [14:11:24] (03PS1) 10Faidon Liambotis: ldap: fix LDAP's monitoring::service CN matching [puppet] - 10https://gerrit.wikimedia.org/r/176662 [14:11:48] alerting for 67 days, acknowledged with comment "foo bar baz" [14:12:58] (03CR) 10Faidon Liambotis: [C: 032] ldap: fix LDAP's monitoring::service CN matching [puppet] - 10https://gerrit.wikimedia.org/r/176662 (owner: 10Faidon Liambotis) [14:13:59] <_joe_> the comment was to acknowledge it was being ignored [14:16:20] <_joe_> !log repooling mw1036-mw1040 [14:16:25] Logged the message, Master [14:22:01] RECOVERY - Certificate expiration on labcontrol2001 is OK: SSL_CERT OK - X.509 certificate for ldap-codfw.wikimedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Sep 20 19:36:03 2015 GMT (expires in 293 days) [14:22:26] there we go :) [14:26:07] <_joe_> !log depooling mw1041-1046 [14:26:12] Logged the message, Master [14:27:50] SSL_CERT CRITICAL ldap-eqiad.wikimedia.org: invalid CN ('ldap-eqiad.wikimedia.org' does not match 'ldap-codfw.wikimedia.org') [14:27:57] interesting [14:32:45] (03PS1) 10Faidon Liambotis: ldap: neptunium is eqiad, virt1000 is no more. 
[puppet] - 10https://gerrit.wikimedia.org/r/176665 [14:33:31] paravoid: indeed it isn't amused, the obvious choice is having an additional parameter to have caching on/off, anyways yes it is on my radar [14:34:01] (03PS2) 10Faidon Liambotis: ldap: neptunium is eqiad, virt1000 is no more [puppet] - 10https://gerrit.wikimedia.org/r/176665 [14:37:59] (03CR) 10Faidon Liambotis: [C: 032] ldap: neptunium is eqiad, virt1000 is no more [puppet] - 10https://gerrit.wikimedia.org/r/176665 (owner: 10Faidon Liambotis) [14:48:51] RECOVERY - Certificate expiration on neptunium is OK: SSL_CERT OK - X.509 certificate for ldap-eqiad.wikimedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Sep 20 19:41:02 2015 GMT (expires in 293 days) [14:49:15] aaand there we go [14:49:44] hi manybubbles [14:50:23] paravoid: hi! [14:50:28] how are things? [14:50:49] good :) [14:50:56] had fun? [14:52:01] on my holiday? yes! lots of time with family [14:52:12] yeah :) [14:52:39] things looked exciting yesterday but otherwise pretty good while we were all out [14:53:28] <^d> g'morning paravoid, manybubbles [14:53:35] ^d: morning! [14:58:07] linked invitations are viral. you get one and you ignore it for a few days and then you are like "I should accept that" and when you do linkedin is like "do you know these 100 people?" and you are like, "yeah, mostly" and then you click "I know this person" until you get bored. its like a chain letter. [15:05:25] PROBLEM - Disk space on mw1041 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:25] PROBLEM - HHVM processes on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:56] <_joe_> grrr [15:06:41] PROBLEM - HHVM processes on mw1041 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:01] PROBLEM - puppet last run on mw1043 is CRITICAL: CRITICAL: Puppet has 112 failures [15:07:26] PROBLEM - RAID on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:42] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 112 failures [15:07:42] PROBLEM - RAID on mw1041 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:42] PROBLEM - check configured eth on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:09] PROBLEM - check if dhclient is running on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:12] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 112 failures [15:08:24] RECOVERY - Disk space on mw1041 is OK: DISK OK [15:08:24] RECOVERY - HHVM processes on mw1040 is OK: PROCS OK: 1 process with command name hhvm [15:08:44] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 8 failures [15:08:54] PROBLEM - DPKG on mw1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:08:56] _joe_: ? [15:09:00] what are all these? 
[15:09:11] <_joe_> servers being reimaged, with scheduled downtime [15:09:16] oh [15:09:18] <_joe_> but reimaging cleans them from nagios [15:09:20] heh [15:09:23] RECOVERY - HHVM processes on mw1041 is OK: PROCS OK: 1 process with command name hhvm [15:09:26] <_joe_> so, sometimes the downtime is lost [15:09:27] <_joe_> :/ [15:09:43] <_joe_> so either we find a clever way not to make naggen add those [15:09:49] <_joe_> and I think we might have [15:10:03] <_joe_> or it's going to be like this for every server we reimage [15:10:03] PROBLEM - DPKG on mw1041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:10:13] RECOVERY - RAID on mw1040 is OK: OK: no RAID installed [15:10:26] naggen is adding these only after the services on the host are provisioned, no? [15:10:34] RECOVERY - RAID on mw1041 is OK: OK: no RAID installed [15:10:37] RECOVERY - check configured eth on mw1040 is OK: NRPE: Unable to read output [15:10:49] RECOVERY - check if dhclient is running on mw1040 is OK: PROCS OK: 0 processes with command name dhclient [15:11:29] <_joe_> paravoid: well, that's true but not for everything it works [15:11:33] RECOVERY - DPKG on mw1046 is OK: All packages OK [15:11:34] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 112 failures [15:12:04] <_joe_> for example, the apache alarm depend on apache, but apache doesn't get started after the first run, as scap runs on the second one [15:12:27] <_joe_> (the first failing because we need to accept the salt password, so installing scap usually fails) [15:12:53] RECOVERY - DPKG on mw1041 is OK: All packages OK [15:13:30] <_joe_> (also, the first 2 puppet runs on an appserver take ~ 45 minutes, it's usually much less on other servers) [15:17:54] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 7 failures [15:18:36] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:37] PROBLEM - HHVM rendering on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:47] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 73135 bytes in 0.303 second response time [15:23:45] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:24:15] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:30] <_joe_> !log repooling mw1041-mw1046 [15:24:35] Logged the message, Master [15:24:55] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:25:27] (03PS1) 10Andrew Bogott: -- DRAFT -- [puppet] - 10https://gerrit.wikimedia.org/r/176670 [15:25:29] (03PS1) 10Andrew Bogott: -- DRAFT -- [puppet] - 10https://gerrit.wikimedia.org/r/176671 [15:25:46] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:25:46] <_joe_> andrewbogott: gerrit review -D :P [15:26:22] _joe_: I never remember that until it's too late :) [15:26:29] <_joe_> !log depooling mw1047-mw1052 [15:26:32] Logged the message, Master [15:28:39] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:11] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:10] (03PS5) 10BBlack: Remove old protoproxy / ssl[13]00x config / star certs [puppet] - 10https://gerrit.wikimedia.org/r/175466 [15:35:25] RECOVERY - puppet 
last run on mw1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:31] !log restarted logstash on logstash1001; log2udp events were not being processed [15:49:35] Logged the message, Master [15:51:02] manybubbles, ^d, marktraceur: Who wants to SWAT this morning? [15:51:17] anomie: i'd like to pass this morning if that is ok [15:51:25] * anomie would like to pass too [15:53:11] Not it. [15:53:15] I have to deal with car drama. [15:53:19] * cscott is never it [15:57:03] <^d> I guess me. Was hoping by keeping my mouth shut... [15:58:00] <_joe_> cscott: hey [15:58:18] <_joe_> I had to manually purge files older than 3 days from ocg100* [15:58:28] <_joe_> they had their /srv partition full [16:00:04] manybubbles, anomie, ^d, marktraceur, Glaisher: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141201T1600). [16:01:17] <^d> Glaisher: you about? [16:01:23] _joe_: orly? [16:01:36] ^d: yes, I am [16:01:44] <^d> Okay, let's do this :) [16:01:44] _joe_: let me take a look at icinga, that shouldn't happen [16:01:50] (03CR) 10Chad: [C: 032] Add 'move-subpages' right to "closer" and "filemover" groups at ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176272 (owner: 10Glaisher) [16:01:52] (03CR) 10Chad: [C: 032] Restore default configuration for ruwikisource bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176287 (owner: 10Glaisher) [16:01:54] (03CR) 10Chad: [C: 032] Modify abusefilter configuration for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176504 (owner: 10Glaisher) [16:01:57] wait what.. jouncebot pings me too now.. ^^ [16:02:08] (03Merged) 10jenkins-bot: Add 'move-subpages' right to "closer" and "filemover" groups at ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176272 (owner: 10Glaisher) [16:02:09] <^d> Maybe? [16:02:13] <^d> Hmm [16:02:17] (03Merged) 10jenkins-bot: Restore default configuration for ruwikisource bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176287 (owner: 10Glaisher) [16:02:25] (03Merged) 10jenkins-bot: Modify abusefilter configuration for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176504 (owner: 10Glaisher) [16:04:14] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [16:04:17] Logged the message, Master [16:04:32] !log demon Synchronized wmf-config/abusefilter.php: (no message) (duration: 00m 05s) [16:04:34] Logged the message, Master [16:05:10] <^d> Glaisher: All done [16:05:12] _joe_: what hapened on sunday with OCG? looks like the redis job queue was completely wiped out [16:05:25] looking at https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=PDF+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=ocg_job_status_queue&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [16:05:44] jouncebot pings every {{ircnick}} template in the deploy window (or at least is meant to) [16:05:47] ^d: All live.. 
Thanks :) [16:06:11] PROBLEM - DPKG on mw1052 is CRITICAL: Timeout while attempting connection [16:06:11] <^d> Glaisher: yw [16:06:46] PROBLEM - Disk space on mw1052 is CRITICAL: Timeout while attempting connection [16:07:18] _joe_: and https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=PDF+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=ocg_data_filesystem_utilization&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name seems to indicate that something broke in the GC three weeks ago, i'll have to look at that [16:07:41] PROBLEM - RAID on mw1052 is CRITICAL: Connection refused by host [16:08:13] PROBLEM - check configured eth on mw1052 is CRITICAL: Connection refused by host [16:08:20] PROBLEM - check if dhclient is running on mw1052 is CRITICAL: Connection refused by host [16:08:40] PROBLEM - check if salt-minion is running on mw1052 is CRITICAL: Connection refused by host [16:09:13] PROBLEM - nutcracker port on mw1052 is CRITICAL: Connection refused by host [16:09:21] PROBLEM - nutcracker process on mw1052 is CRITICAL: Connection refused by host [16:09:41] PROBLEM - puppet last run on mw1052 is CRITICAL: Connection refused by host [16:09:41] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: Puppet has 112 failures [16:09:58] <_joe_> looks like the downtime finished [16:10:15] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 112 failures [16:10:22] PROBLEM - Apache HTTP on mw1052 is CRITICAL: Connection refused [16:13:21] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.006 second response time [16:14:43] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 112 failures [16:16:13] PROBLEM - HHVM rendering on mw1047 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 5.249 second response time [16:17:08] (03PS1) 10RobH: setting mgmt ip for server heze [dns] - 10https://gerrit.wikimedia.org/r/176674 [16:17:43] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:17:46] (03CR) 10RobH: [C: 032] setting mgmt ip for server heze [dns] - 10https://gerrit.wikimedia.org/r/176674 (owner: 10RobH) [16:19:12] RECOVERY - RAID on mw1052 is OK: OK: no RAID installed [16:19:39] RECOVERY - check configured eth on mw1052 is OK: NRPE: Unable to read output [16:19:42] RECOVERY - check if dhclient is running on mw1052 is OK: PROCS OK: 0 processes with command name dhclient [16:20:02] RECOVERY - check if salt-minion is running on mw1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:20:59] RECOVERY - nutcracker port on mw1052 is OK: TCP OK - 0.000 second response time on port 11212 [16:21:58] RECOVERY - DPKG on mw1052 is OK: All packages OK [16:21:58] RECOVERY - nutcracker process on mw1052 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:22:46] ^d: Still SWATting? [16:22:51] <^d> No, all done [16:22:59] RECOVERY - Disk space on mw1052 is OK: DISK OK [16:23:07] ok, I'm going to backport https://gerrit.wikimedia.org/r/#/c/176673/ [16:24:30] <_joe_> ^d: ouch I was still reimaging servers, did you have any failures? [16:24:37] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:41] <_joe_> I'm not repooling them until tomorrow then [16:25:01] <^d> Oh hmm, 1050-52? [16:25:37] <^d> 48? 
[16:26:21] <_joe_> 47-52 [16:26:46] <_joe_> and 50-52 are still being reimaged [16:26:48] <_joe_> brb [16:27:29] RECOVERY - HHVM rendering on mw1047 is OK: HTTP OK: HTTP/1.1 200 OK - 73196 bytes in 5.151 second response time [16:27:37] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: Puppet has 112 failures [16:27:54] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 112 failures [16:28:11] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: Puppet has 112 failures [16:29:20] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:32:57] (03PS1) 10Chad: Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 [16:33:44] (03PS2) 10Chad: Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 [16:33:46] Who has live hacks on 1.25wmf10 on tin? Grr. [16:34:28] (03CR) 10BryanDavis: [C: 031] Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 (owner: 10Chad) [16:35:32] !log anomie Synchronized php-1.25wmf10/extensions/SyntaxHighlight_GeSHi/geshi/geshi.php: SWAT: Fix highly recursive number highlighting regex in GeSHi (duration: 00m 10s) [16:35:33] anomie: ^ test please [16:35:36] Logged the message, Master [16:35:40] anomie: Looks good [16:35:51] <^d> anomie: mwtidy? my guess would be tim [16:37:01] (03CR) 10Chad: [C: 032] Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 (owner: 10Chad) [16:37:14] (03Merged) 10jenkins-bot: Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 (owner: 10Chad) [16:37:15] !log anomie Synchronized php-1.25wmf9/extensions/SyntaxHighlight_GeSHi/geshi/geshi.php: SWAT: Fix highly recursive number highlighting regex in GeSHi (duration: 00m 07s) [16:37:15] anomie: ^ Test please [16:37:17] Logged the message, Master [16:37:21] <^d> looks good! [16:37:24] anomie: Good [16:38:16] !log demon Synchronized wmf-config/CommonSettings.php: opensearchxml conditional include (duration: 00m 06s) [16:38:17] Logged the message, Master [16:38:26] <^d> bd808: fix is in prod, should be in beta shortly. [16:38:29] <^d> or already. [16:38:30] <^d> dunno [16:38:36] thx ^d [16:38:39] <^d> yw [16:39:24] (03PS4) 10BryanDavis: Use hiera to configure udp2log endpoint for ::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/176191 [16:40:12] (03CR) 10BBlack: [C: 032] Remove old protoproxy / ssl[13]00x config / star certs [puppet] - 10https://gerrit.wikimedia.org/r/175466 (owner: 10BBlack) [16:40:56] bblack: \o/ \o/ \o/ [16:41:07] how are we on CPU post-SSL by the way? [16:41:47] we seem fine [16:42:15] the only thing I hate a little bit about the new situation is that we're not source-hashing for all the obvious reasons, so our rate of renegotiations for clients has surely gone up. [16:42:31] oh? [16:42:42] I hadn't realized [16:43:01] it was disabled for ulsfo even before I started working on it (well disabled in the sense that it's not set to "sh" in the LVS config) [16:43:16] I assumed that was intentional because of the issues with ipvs sh and downed/removed nodes, etc [16:44:32] arguably either way is an acceptable state to be in, they just have different suboptimal tradeoffs. if the renegotiations aren't killing us or making the client experience awful though (and that seems to be the case), I'd rather have it this way till sh is fixed though. 
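To make the trade-off bblack and paravoid are discussing concrete: at the IPVS level the difference is just the scheduler chosen for the :443 virtual service. The commands below are illustrative only; the VIP is a placeholder, and in production the scheduler is set through puppet/PyBal rather than by hand (the switch itself lands later in this log as https://gerrit.wikimedia.org/r/176692).

    # Weighted round-robin: connections from one client can land on different
    # backends, so TLS sessions cannot be resumed and full handshakes go up.
    ipvsadm -E -t 198.51.100.10:443 -s wrr

    # Source hashing: a given client IP sticks to one backend, keeping session
    # resumption cheap, at the cost of uneven load and the known problems
    # when a hashed-to backend is depooled.
    ipvsadm -E -t 198.51.100.10:443 -s sh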
[16:46:55] the most pronounced way to see the CPU effects is in the monthly graph on esams varnishes for cpu/load: e.g. http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Text+caches+esams&m=cpu_report&s=by+name&mc=2&g=load_report [16:47:40] aha [16:47:53] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:13] yeah we'll need more servers if we ever want to go ssl for everyone :) [16:48:48] well that and fixing sh will help, and maybe switching up ciphersuite to pick more efficient choices, etc [16:49:36] but in the overall, I think things fit together nicely this way. varnish is mostly about memory and i/o, and SSL is mostly about CPU. [16:49:46] <_joe_> bblack: the chiphersuite is pretty well chosen tbh [16:49:53] <_joe_> I worked on that quite a bit [16:50:16] <_joe_> and well, it surely needs to be refreshed everytime something happens in browserland [16:50:17] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:50:26] <_joe_> and I can have missed a few things [16:50:34] <_joe_> but in general, it should be good [16:50:50] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:50:51] for good clients, yes. I think for some of the ancient clients, we're currently making some security tradeoffs on perf? [16:51:02] I donno I'd have to go stare at it again to refresh my brain [16:51:32] <_joe_> bblack: surely, yes [16:51:45] <_joe_> the focus was "lightest chipher with PFS" [16:51:57] <_joe_> there are a couple of emails I sent at the time [16:52:43] <_joe_> and jzerebecki was the original patch author and can have some more insights for sure :) [16:53:56] yeah, he commented on my abandoned patch the other day as well: https://gerrit.wikimedia.org/r/#/c/170879/ [16:54:04] (re: PFS for some ancient clients) [16:56:33] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:27] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [17:02:33] !log created empty jessie-wikimedia repo on Carbon [17:02:37] Logged the message, Master [17:21:07] (03PS1) 10coren: Add labstore* to codfw DNS [dns] - 10https://gerrit.wikimedia.org/r/176685 [17:21:59] YuviPanda: ty :D [17:22:40] (03PS1) 10Vogone: Add new import sources to dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176686 [17:23:33] (03PS1) 10coren: Add codfw labs support and labstores to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/176687 [17:26:25] bblack: yes not using non-ec dhe is a trade in favor of low latency and less cpu usage and against security (pfs). but we would have to drop java6 support if we want to generate 2k rsa keys with dhe. we would have to drop IE 8 / XP to disable non-pfs ciphers and/or to disable rc4. so the only ones from ssllabs list that would use non-EC DHE would be OpenSSL 0.9.8y and Android 2.3.7 (and BingPreview Jun 2014). [17:26:33] akosiaris: Can you do a quick check of both patches? ^^ I changed your comment in the zone file because it seemed obvious you wanted /24s though. [17:27:16] I just want to make sure I didn't misunderstand you. :-) [17:27:41] Coren: sure, gimme a sec [17:28:43] jzerebecki: even though I proposed changing that, I'm kinda thinking at this point, people who care about PFS should just upgrade their browser. 
And maybe we should start putting in some banner type stuff via mediawiki: if($using_SSL && $browser_sucks) display_banner_about_insecure_browser [17:28:57] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:31:21] basically what i said above would be equivalent to https://wiki.mozilla.org/Security/Server_Side_TLS#Intermediate_compatibility_.28default.29 [17:31:46] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 30.00% above the threshold [76.8] [17:33:02] another random related thought: it would be interesting (but I'm sure not currently implemented, right?) if nginx at the SSL termination point could do something like "if(chosen_cipher =~ ...) set-header: X-SSL-Ugly-Cipher", for mediawiki to later consume for that banner, instead of trying to parse user-agent to detect the condition. [17:33:06] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:33:10] we could disable non-pfs, rc4 and not enable non-EC DHE then it is a trade off against old clients :) [17:33:59] (03CR) 10Alexandros Kosiaris: [C: 032] Add labstore* to codfw DNS [dns] - 10https://gerrit.wikimedia.org/r/176685 (owner: 10coren) [17:36:01] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 30.00% above the threshold [76.8] [17:36:06] akosiaris: Oh, don't +2 the dhcp one if you haven't noticed - I just saw I am dumb. :-) [17:36:50] (03PS2) 10coren: Add codfw labs support and labstores to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/176687 [17:37:34] * Coren is too used to typing 'eqiad' everywhere. :-) [17:38:19] (03CR) 10Alexandros Kosiaris: [C: 032] Add codfw labs support and labstores to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/176687 (owner: 10coren) [17:46:04] (03PS1) 10BBlack: Switch SSL loadbalancing to sh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/176692 [17:51:49] (03CR) 10Faidon Liambotis: [C: 031] Switch SSL loadbalancing to sh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/176692 (owner: 10BBlack) [17:56:57] (03PS1) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [17:59:57] (03CR) 10BBlack: [C: 032] Switch SSL loadbalancing to sh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/176692 (owner: 10BBlack) [18:04:48] (03CR) 10Faidon Liambotis: "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/176693 (owner: 10BryanDavis) [18:10:04] !log stopping pybal on primary eqiad LVSes to test 'sh' change for SSL (already restarted for change on backup LVSes) [18:10:06] Logged the message, Master [18:10:13] (03PS2) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [18:10:33] paravoid: My rsyslog fu only extends to cut-n-paste, but pointers to a better syntax are welcome. [18:15:20] !log ditto on pybal 'sh' stuff for esams [18:15:25] Logged the message, Master [18:19:56] (03CR) 10EBernhardson: [C: 031] "afaik relevant blockers have all been closed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [18:21:41] !log eqiad+esams LVS back to normal, with new config for 'sh' for SSL [18:21:56] \o/ [18:22:05] Logged the message, Master [18:25:23] YuviPanda: hi! 
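On bblack's X-SSL-Ugly-Cipher idea from 17:33 above: nginx already exposes the negotiated cipher as $ssl_cipher, so something in that spirit is possible without new code. A rough sketch follows; it is not the production configuration, and the header name, cipher patterns and upstream are all assumptions.

    # Classify "ugly" ciphers at the SSL terminator and pass a hint upstream,
    # so MediaWiki could show an insecure-browser banner without parsing
    # User-Agent strings. Sketch only; patterns and names are assumptions.
    map $ssl_cipher $ugly_cipher {
        default      "";
        ~RC4         "1";
        ~DES-CBC3    "1";
    }

    server {
        listen 443 ssl;
        # certs, cipher list, etc. elided
        location / {
            proxy_set_header X-SSL-Ugly-Cipher $ugly_cipher;
            proxy_pass http://127.0.0.1:80;   # placeholder backend
        }
    }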
[18:25:50] !log ulsfo LVS updated for 'sh' for SSL as well [18:25:53] Logged the message, Master [18:30:43] (03PS3) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [18:32:53] (03CR) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/176693 (owner: 10BryanDavis) [18:33:40] bd808: basically the ':programname, isequal, "hhvm" @foo' syntax can be rewitten as 'if $programname == "hhvm" then @foo' [18:33:48] google for rsyslog "new style" or something [18:33:55] the current config uses both, that's why I said it's confusing [18:34:13] the old style is deprecated but it's okay to use it [18:34:24] mixing both in the same file is not great though, but that's hardly your fault :) [18:34:33] paravoid: Yup. I think I found the reference cods [18:34:38] s/cods/docs/ [18:35:31] I don't consider this a blocker for you change [18:35:42] just saying, if you have a test rig all set up in labs... ;)) [18:36:24] bd808: btw, newer versions of rsyslog (including the one we run with HAT) have structured logging and some other cool stuff [18:36:29] an elasticsearch backend as well iir [18:36:31] iirc [18:36:39] not sure if it's logstash compatible at all though [18:37:06] Yeah I looked at that a bit. For now I think having logstash sit in between will actually be helpful. [18:37:26] We can use it to rewrite some events and ignore others [18:37:38] I'm not actually suggesting it, just fyi :) [18:37:46] I haven't researched the problem at all, whatever you think is best [18:38:07] *nod* I'm glad to have the input [18:39:23] (03PS4) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [18:43:44] (03PS1) 10Dzahn: add bmansurov to researchers admin group [puppet] - 10https://gerrit.wikimedia.org/r/176701 [18:55:56] (03PS1) 10Aaron Schulz: Added some redis queue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176706 [18:57:53] greg-g, any objects if i go now instead of 14:00 ? [18:58:10] yurikR: not now, max is doing something [18:58:18] MaxSem, ? [18:59:13] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures [18:59:41] yurikR, I have a security fix [19:00:22] MaxSem, could you ping me when done? [19:00:43] sure [19:12:31] joakino: hey! [19:14:44] yurikR, having problems with gerritcrap, go ahead [19:16:08] (03PS1) 10BryanDavis: Remove OpenSeachXml from extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176714 [19:16:30] ^d: ^ [19:18:13] MaxSem, something tells me that if i start +2ing branches, they might not merge either [19:24:32] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:25:55] Someone pls remind me which repo contains our Varnish config? [19:26:19] the repo of "bblack knows all" [19:26:57] awight: it's in ops/puppet in various places [19:28:02] bd808: thanks! 
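Picking up paravoid's rsyslog note from around 18:33: the two styles below do the same thing, and mixing them in one file is legal but (as he says) confusing. The logstash host and port are placeholders, not the real endpoint; a single @ forwards over UDP, @@ would use TCP.

    # Legacy property-based filter (the style the existing config uses):
    :programname, isequal, "hhvm"     @logstash.example.net:10514
    :programname, isequal, "apache2"  @logstash.example.net:10514

    # The same rules in the newer expression-based ("new style") syntax:
    if $programname == "hhvm" then @logstash.example.net:10514
    if $programname == "apache2" then @logstash.example.net:10514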
[19:29:12] yurikR, I've resolved my problems and deploying shortly [19:29:21] MaxSem, ok [19:41:18] !log Stashed Tim's uncommitted tidy-related changes on tin [19:41:23] Logged the message, Master [19:43:27] !log maxsem Synchronized php-1.25wmf9/extensions/Popups/: https://gerrit.wikimedia.org/r/#/c/176715/ (duration: 00m 05s) [19:43:29] Logged the message, Master [19:43:41] !log maxsem Synchronized php-1.25wmf10/extensions/Popups/: https://gerrit.wikimedia.org/r/#/c/176715/ (duration: 00m 06s) [19:43:43] Logged the message, Master [19:44:50] MaxSem: thanks for logging that [19:45:19] There's a joke in there somewhere [19:45:58] chasemp: it looks like I am not a member of https://phabricator.wikimedia.org/tag/operations/ -- I believe my name on phab is 'andrew' [19:46:19] cajoel_: found ticket, moving and taking it [19:46:27] robh: thanks [19:46:28] 8943 [19:46:32] yep [19:46:36] andrewbogott: https://phabricator.wikimedia.org/p/andrew/ ? [19:46:49] that's me! [19:46:53] done [19:47:04] thx [19:53:40] yurikR, I'm done [19:53:50] thx [19:55:00] (03CR) 10Yurik: [C: 032] Vary mdot webroot on Accept-Language, X-Subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 (owner: 10Dr0ptp4kt) [19:55:13] (03Merged) 10jenkins-bot: Vary mdot webroot on Accept-Language, X-Subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 (owner: 10Dr0ptp4kt) [19:57:48] (03CR) 10Ottomata: [C: 031] add bmansurov to researchers admin group [puppet] - 10https://gerrit.wikimedia.org/r/176701 (owner: 10Dzahn) [19:58:37] (03CR) 10Dzahn: [C: 032] "has manager approval on ticket as well and old enough" [puppet] - 10https://gerrit.wikimedia.org/r/176701 (owner: 10Dzahn) [20:00:38] !log yurik Synchronized mobilelanding.php: https://gerrit.wikimedia.org/r/#/c/175797/ (duration: 00m 06s) [20:00:40] Logged the message, Master [20:04:49] !log yurik Synchronized php-1.25wmf9/extensions/ZeroBanner/: updatidng ZeroBanner to master (duration: 00m 08s) [20:04:52] Logged the message, Master [20:05:20] !log yurik Synchronized php-1.25wmf9/extensions/ZeroPortal/: updatidng ZeroPortal to master (duration: 00m 05s) [20:05:22] Logged the message, Master [20:06:33] !log yurik Synchronized php-1.25wmf10/extensions/ZeroBanner/: updatidng ZeroBanner to master (duration: 00m 06s) [20:06:36] Logged the message, Master [20:08:47] !log yurik Synchronized php-1.25wmf10/extensions/ZeroPortal/: updatidng ZeroPortal to master (duration: 00m 05s) [20:08:50] Logged the message, Master [20:23:15] hey greg-g / roblaAWAY : who is (on a technical production level lets say) responsible for tmh hosts? Wanna take a stab at an RT ticket about group access on these hosts but need a basic idea of the privs, hosts and members needed in it :) [20:23:24] cscott ^^ [20:23:55] show me the ticket and I'll take a gander [20:24:05] https://rt.wikimedia.org/Ticket/Display.html?id=8480 [20:26:02] * greg-g looks, but is also in an oh so fun Open Enrollment (medical benefits) meeting [20:29:33] JohnFLewis: ah, tmh. [20:29:43] yeah :p [20:30:17] JohnFLewis, greg-g: i'm only embroiled in this because I tried to do a mediawiki deploy as a newish account, and found that the tmh hosts only granted access to old farts. [20:30:22] but i think that's been fixed now. [20:30:58] Reedy: hey, can I bug you about https://phabricator.wikimedia.org/T76061 ? [20:31:26] cscott: for you or is there an actual solution in place for all? [20:31:29] I'm about to workaround with a manual deployment, but wondering if there's an ETA on the cache working correctly? 
[20:31:38] I know there was an AR for you to get tmh access [20:33:40] cscott: oh I see; inclusion of the deployment group. [20:33:55] JohnFLewis: I think what is needed is the same basic level of access that is needed for all MW deploy target hosts. That ticket came up from a scap failure. [20:35:30] bd808: that was my original thought. [20:37:39] Is anyone available to work on LocalisationUpdate fail? https://phabricator.wikimedia.org/T76061 [20:39:03] (03CR) 10Yuvipanda: [C: 031] Use hiera to configure udp2log endpoint for ::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/176191 (owner: 10BryanDavis) [20:40:39] bd808: zcat'ing l10nupdate logs is like a replay of scap [20:40:41] it's amusing [20:41:10] Reedy: thanks :) [20:41:47] Reedy: heh. The control chars end up in there I guess [20:42:24] but no l10nupdate due ot Permission denied (publickey). [20:43:58] bd808: Did you say you'd fixed something like that already? [20:44:23] (03CR) 10Reedy: [C: 031] apachesync - delete sync-apache script [puppet] - 10https://gerrit.wikimedia.org/r/175884 (owner: 10Dzahn) [20:44:41] Reedy: I fixed the paths for the commands on the target hosts. Similar but different [20:45:17] I guess that presumably means it's not loading the key into the agent [20:45:41] Or that those hosts are missing the authorized_keys part? [20:45:49] It worked for a while didn't it? [20:46:14] it's 100% fail [20:46:22] JohnFLewis, bd808: yes, it was a scap failure, and I believe the RT ticket to "grant me access" was resolved by granting the appropriate puppet group access (not me specifically) although I don't remember the exact details right now. [20:47:21] cscott: looking at site.pp; deployment has access so I assume that is what the AR ticket resulted in (right mutante?) [20:48:05] bd808: Not sure. though, SAL lies as it says it succeeded [20:48:13] I guess it does, as it doesn't count for sync-dir [20:48:29] JohnFLewis: https://gerrit.wikimedia.org/r/166109 is what the RT ticket resulted in. [20:49:05] Reedy: The authorized_keys file on mw1201 for l10nupdate looks right so I'd start poking from tin [20:49:37] Reedy: It may be the new shared agent stuff on tin getting in the way somehow [20:49:40] run l10nupdate manually with a verbose flag? [20:50:07] well, really, we need the verbose at the dsh call I guess [20:50:14] cscott: so now it is the matter of whether we want a tmh-admin group for these hosts or not [20:51:16] JohnFLewis: well, from my perspective I don't care about tmh at all. i was just following the steps in https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Deployment_requirements [20:52:08] Reedy: Bah. It's scap [20:52:19] :( [20:52:49] JohnFLewis: actually, i guess I was following https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Deployment_Requirements [20:53:06] Reedy: This is clobbering SSH_AUTH_SOCK -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/cli.py#L171-L175 [20:53:49] Reedy: So that needs a patch to only replace the auth sock if the shared one is present and readable I think [20:56:56] bd808: os.path.isfile? [20:57:12] Reedy: The other way to fix it would be to change the permissions on the shared auth socket so that l10nupdate can read from it. In the long term that would be even better. [20:58:53] Reedy: os.path.exists. It could be a symlink in theory [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141201T2100). Please do the needful. 
[21:00:28] Reedy: Or more pythonically, open the file for reading and only change the env is that succeeds [21:00:33] if auth_sock is not None and os.path.exists(auth_sock): [21:00:40] heh [21:05:12] bd808: https://docs.python.org/release/2.6.6/library/os.html#os.access ? [21:06:55] Reedy: I think maybe just `with open(auth_sock, 'r'):` instead of the current test. [21:07:09] That might raise an exception ... [21:09:30] yeah, `with open()` will still raise the exception, so you need to wrap that in a try/except still [21:10:15] so really just a try: open(); setenv; except: ignore sorto of thing is needed [21:10:36] s/sorto of/sort of/ [21:12:56] jouncebot, greg-g, cmjohnson: parsoid's skipping deploy today, but i'm doing an ocg deploy during this window. [21:23:19] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 1 failures [21:28:11] (03PS5) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [21:31:22] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 1 failures [21:37:41] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:47:05] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:48:50] (03PS1) 10Ottomata: Include new class misc::statistics::packages::utilities on stat1002 and stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/176800 [21:51:01] (03CR) 10BBlack: [C: 031] Change ru.wikinews.org to HTTPS only. [puppet] - 10https://gerrit.wikimedia.org/r/173078 (owner: 10JanZerebecki) [21:52:25] (03CR) 10Ottomata: [C: 032] Include new class misc::statistics::packages::utilities on stat1002 and stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/176800 (owner: 10Ottomata) [21:58:01] (03CR) 10BryanDavis: "Testing in beta." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/176693 (owner: 10BryanDavis) [22:00:05] awight: Respected human, time to deploy WMF Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141201T2200). Please do the needful. [22:08:33] (03PS1) 10Vogone: Add new namespaces to dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176808 [22:15:32] !log awight Synchronized php-1.25wmf9/extensions/DonationInterface: push DonationInterface translations (duration: 00m 07s) [22:15:34] Logged the message, Master [22:15:47] !log awight Synchronized php-1.25wmf10/extensions/DonationInterface: push DonationInterface translations (duration: 00m 09s) [22:15:49] Logged the message, Master [22:15:57] Reedy_: was https://rt.wikimedia.org/Ticket/Display.html?id=8334 done ? [22:16:13] !log awight Synchronized php-1.25wmf9/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 06s) [22:16:16] Logged the message, Master [22:16:23] !log awight Synchronized php-1.25wmf10/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 05s) [22:16:24] Logged the message, Master [22:16:44] !log awight Synchronized php-1.25wmf9/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:16:46] Logged the message, Master [22:16:54] !log awight Synchronized php-1.25wmf10/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:16:56] Logged the message, Master [22:17:03] Coren: can haz 1.5T on labs storage ? [22:17:46] mutante: Do you need it separate from /data/project? Otherwise, 1.5T isn't so much we'd hurt. 
[22:17:58] there is /media/wikimania2014 on terbium [22:18:02] /data/project works [22:18:04] it is 1.5T and has mp4 videos [22:18:09] matanya asks if we can copy it to labs [22:18:14] because then people can convert them there [22:18:28] "people" :) [22:18:42] heh, yea. to more useful format [22:19:07] unless Reedy_ already did it, and then just dup work [22:20:00] i also see .mov [22:20:25] @terbium:/media/wikimania2014/Barbican Hall# ls [22:20:25] ls: reading directory .: Input/output error [22:20:39] mutante: Should be straightforward enough - but avoid NFS for the copy. You should place it directly in /srv/project/tools/project/ [22:20:53] lol, again [22:21:06] Coren: ok, thanks! [22:21:43] Education II - Reform.mp4: ISO Media, MPEG v4 system, version 1 [22:21:46] Education I - Medicine.mov: ISO Media, Apple QuickTime movie [22:22:32] we should just have !xddc :) [22:22:50] xdcc [22:23:28] matanya: are we making a new project for videos? [22:23:38] already have one [22:23:51] what's the name [22:23:54] * matanya doesn't remind Coren again of his needs [22:24:02] mutante: video [22:26:54] matanya: What needs? [22:27:23] (03PS1) 10Milimetric: Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/176810 [22:27:24] Coren: moar cpu for encoding [22:27:50] you said you are getting more hardware [22:29:07] maybe we should just setup Apache on terbium to let you download these files via http [22:29:33] download via http to labs ? [22:29:51] (03CR) 10Ottomata: [C: 032] Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/176810 (owner: 10Milimetric) [22:29:53] yea [22:30:01] terbium also has people.wm and noc.wm [22:30:03] works for me [22:30:06] which are public webservers [22:30:10] matanya: We gots new CPUs but it'll take a while before it's available. :-) [22:30:19] just this path where the videos are is not a docroot [22:30:28] Coren: i'll wait :) [22:30:42] i dont think i should be able to scp straight from terbium into labs [22:32:53] !log awight Synchronized php-1.25wmf9/extensions/DonationInterface: push DonationInterface translations (duration: 00m 07s) [22:32:59] !log awight Synchronized php-1.25wmf10/extensions/DonationInterface: push DonationInterface translations (duration: 00m 06s) [22:33:01] Logged the message, Master [22:33:05] !log awight Synchronized php-1.25wmf9/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 06s) [22:33:12] !log awight Synchronized php-1.25wmf10/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 06s) [22:33:15] Logged the message, Master [22:33:18] !log awight Synchronized php-1.25wmf9/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:33:25] !log awight Synchronized php-1.25wmf10/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:34:41] !log updated OCG to version a06e7c186796a6ee5d5af81e93688520abdf2596 [22:34:43] Logged the message, Master [22:36:53] <^demon|pumpkinpi> I'm not getting css/js from bits. [22:37:47] <_joe_> ^demon|pumpkinpi: can you elaborate on that? [22:38:18] !log rescaled ipvs weights for text/mobile/upload/bits to 1 (there was no differential weighting), for better sh scheduler [22:38:22] Logged the message, Master [22:38:41] <^demon|pumpkinpi> _joe_: Getting unstyled pages on enwiki. [22:38:42] and I hope that's not causing css/js from bits issue above! [22:38:48] <^demon|pumpkinpi> Very slow, generally. [22:39:17] very slowly? 
[22:39:41] <_joe_> ^demon|pumpkinpi: are they timing out, giving 500, 404, what HTTP code? [22:40:16] <^demon|pumpkinpi> Hmm, back now. [22:40:35] <_joe_> maybe some local network issue? [22:40:41] <^demon|pumpkinpi> Could be. [22:40:47] Reedy_: Nemo_bis: is it possible that the l10n cache / LocalisationUpdates is actually obscuring newer translations provided in extension message files? [22:40:48] <_joe_> we don't have recorded any 503s AFAICS [22:42:15] <^demon|pumpkinpi> _joe_: RESOLVED DUPLICATE -> {T1234: Comcast Sucks} [22:42:22] <_joe_> eheh [22:42:54] <_joe_> I was so distracted by chasing issues that I forgot the hangout opened [22:42:59] <_joe_> :) [22:43:14] * _joe_ off [22:44:32] I can sure test https://gerrit.wikimedia.org/r/#/c/176750/, at least by observing effects such as l10update starting to work or not :) -- anyone up to deploy it? [22:45:29] (03CR) 10Aaron Schulz: [C: 032] Added some redis queue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176706 (owner: 10Aaron Schulz) [22:45:44] (03Merged) 10jenkins-bot: Added some redis queue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176706 (owner: 10Aaron Schulz) [22:48:31] from enwiki: Database error A database query error has occurred. This may indicate a bug in the software. Function: AbuseFilterViewTestBatch::doTest Error: 2013 Lost connection to MySQL server during query (10.64.48.20) [22:48:35] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 1 failures [22:49:40] jackmcbarn, seems to be an one-off [22:49:58] MaxSem: that's the second time i tried it. first time was a 504 [22:50:03] i'll try it again [22:51:07] !log aaron Synchronized wmf-config/jobqueue-eqiad.php: b13eaa3f6e287e7268951a2f7e3798f994a20b28; comment tweaks (duration: 00m 05s) [22:51:12] Logged the message, Master [22:52:14] _joe_: do you think you could look at https://gerrit.wikimedia.org/r/176202 for me? (apparmor fixes for OCG) [22:52:40] <_joe_> cscott-split: it's midnight here :) [22:54:01] _joe_: excuses, excuses ;) [22:57:32] (03PS1) 10Jgreen: make icinga contactgroup 'fundraising' send to fr-tech@ [puppet] - 10https://gerrit.wikimedia.org/r/176823 [23:00:43] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:12:24] (03CR) 10Jgreen: [C: 032 V: 031] make icinga contactgroup 'fundraising' send to fr-tech@ [puppet] - 10https://gerrit.wikimedia.org/r/176823 (owner: 10Jgreen) [23:12:39] !log terbium - running rsync in screen to copy wikimania videos to labstore1001 [23:12:42] Logged the message, Master [23:27:57] paravoid: ah! [23:28:06] i finally figured out the webrequest warning in icinga. [23:28:20] nsca-client 2.9.1 in trusty is broekn [23:28:27] at least when working with older nsca server [23:28:48] only one of the worker nodes hasn't been upgraded yet [23:28:58] which is why the check sometimes succeeds [23:29:04] if the job happens to run on the old worker node [23:29:17] it will successfully send the passive check to nsca on neon [23:29:18] otherwise: [23:29:28] nsca[24056]: Dropping packet with invalid CRC32 [23:29:36] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=670373 [23:29:56] Jeff_Green: ^ take note [23:30:08] since I'm pretty sure you use passive checks and nsca for frack stuff [23:30:36] ncsa is not a great idea in general [23:30:50] weren't many choices, really, we thought hard about that one. 
[23:31:11] the job is regularly scheudled, but there's no guaruntee about when it will run [23:31:20] we needed a passive check, and we needed it to be remote [23:32:03] pssh, maybe we shouldn't use icinga for these after all :/ qchris doesn't really look at these [23:32:14] i think anyway [23:32:19] pffffff [23:33:32] paravoid: hm, what should I do? try to upgrade neon? that seems like it might cause more headaches than I am looking for [23:34:02] is it swat time yet? [23:34:21] yeah, no i think that would just break any 2.7.2 nsca-clients out there [23:34:50] ottomata: interesting [23:35:01] paravoid: why not nsca? [23:36:04] it's a bit abandoned afaik [23:36:04] anyway [23:36:09] I'm gonna go crash :) [23:36:33] k, laters [23:36:33] have a good evening! [23:36:45] 1:30am *cough* [23:36:46] "evening" [23:36:50] :) [23:37:02] ha [23:37:07] * YuviPanda has switched to a SFish timezone already, I think [23:37:10] I wasn't thinking in clock terms [23:37:12] ottomata: graphite + check_graphite? :) [23:37:26] paravoid: join the club [23:37:30] doesn't afternoon start at like 8PM there? [23:37:31] its a boolean check, really YuviPanda [23:37:36] how would I do that in graphite? [23:37:42] ottomata: 0, 1! :) [23:37:45] ha [23:37:46] well [23:37:53] although yes, it's somewhat difficult to do properly. [23:37:56] without abusing graphite [23:37:56] it needs to be passive [23:37:59] that's the real problem [23:38:15] hmm, right. [23:38:17] i want to be alerted IF there hasn't been a good partition added to hive in 1.5 hours [23:38:39] usually that would mean either nsca or snmptrap [23:38:40] so for puppet staleness I track 'time since last run' [23:38:58] there is a hadoop job that checks this, and then uses send_nsca to alert icinga that all is well [23:39:18] YuviPanda: is that via nrpe? [23:39:18] right, so one thing might be to track 'time since last good partition' as a metric in graphite. [23:39:23] just have the hadoop check send to graphite [23:39:37] hm [23:39:39] and then check_graphite runs on icinga, queries graphite for metric, and alerts. [23:40:45] no NRPE [23:40:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:41:22] that's not me, is it? [23:41:39] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:41:47] who knows at what point virt1000 is ready to be included in scap [23:41:58] can i just add it to dsh groups already? [23:42:02] hm, aye, YuviPanda, spoose that could work [23:42:17] mutante: andrewbogott would know, IIRC it was an explicit decision. [23:42:24] ottomata: that's how all checks in labs happen atm. [23:42:53] mutante: I had some security concerns about giving lots of people login on virt1000 [23:43:05] But I think that was overruled. So it's probably fine to turn it on. [23:43:08] I'd expect it to just work [23:43:23] my specific change is just this https://gerrit.wikimedia.org/r/#/c/175889/ [23:43:30] but that will add it to scap hosts [23:43:37] so next deploy..it would be in [23:43:59] https://phabricator.wikimedia.org/T70751 [23:44:45] YuviPanda: does shinkin maybe have a better way of doing this? [23:44:47] I think we can add that patch during SWAT and then watch it run? [23:44:56] eh, wow, i see this quote "Wikitech is now running from a MediaWiki version that is synced from tin and using production configuration system." [23:45:01] from Sep 4 ? 
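[editor's note] A minimal sketch of the graphite-based alternative YuviPanda floats above: instead of send_nsca, the job that validates Hive partitions pushes a "seconds since the last good partition" gauge to graphite over the plaintext protocol, and icinga alerts via check_graphite once it crosses the 1.5-hour threshold — sidestepping the broken trusty nsca-client entirely. The metric name, hostname, and port below are illustrative assumptions, not production values:

    import socket
    import time

    # Illustrative endpoint and metric namespace; the real values would
    # come from puppet, not from this sketch.
    GRAPHITE_HOST = 'graphite.example.org'
    GRAPHITE_PORT = 2003  # plaintext ("line") protocol port
    METRIC = 'analytics.webrequest.seconds_since_last_good_partition'

    def report_partition_lag(last_good_partition_ts):
        """Push the age of the newest good partition as a graphite gauge."""
        now = int(time.time())
        lag = now - int(last_good_partition_ts)
        line = '%s %d %d\n' % (METRIC, lag, now)
        conn = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                        timeout=10)
        try:
            conn.sendall(line.encode('ascii'))
        finally:
            conn.close()

On the icinga side, a check_graphite service over that metric with a critical threshold around 5400 seconds (1.5 hours) gives the same "alert if no good partition lately" semantics as the passive nsca check, but as an active check that does not depend on nsca packet compatibility.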
[23:45:09] how did it without being in dsh ? [23:45:15] OO i gotta go! [23:45:39] ottomata: sadly nope. [23:46:11] laters all! [23:47:02] mutante: I run local syncs on virt1000 [23:47:04] periodically [23:47:14] andrewbogott: ok, thanks, so i see greg-g already put it on a "workboard". that sounds good [23:47:23] andrewbogott: gotcha! [23:47:50] mutante: check out the email thread 'wikitech vs. deployment' for context [23:47:50] jouncebot: next [23:47:51] In 0 hour(s) and 12 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141202T0000) [23:51:39] Jeff_Green: you forgot to merge on master [23:51:49] fr contact group