[00:33:52] (03PS2) 10Wpmirrordev: Extend maximum allowed mediawiki version to 1.24 (Bug: 66663) [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/139413
[01:10:52] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:42] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:02] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:02] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:02] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:12] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:12] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:32] RECOVERY - DPKG on tungsten is OK: All packages OK
[01:12:42] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[01:12:52] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[01:12:52] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[01:12:52] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time
[01:13:02] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1403 seconds ago with 0 failures
[01:13:02] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[01:22:07] PROBLEM - Puppet freshness on rhenium is CRITICAL: Last successful Puppet run was Sat 05 Jul 2014 05:09:48 UTC
[01:29:48] (03PS1) 10TTO: Set up autopatrolled group for eswikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144419 (https://bugzilla.wikimedia.org/67557)
[02:13:48] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-07 02:12:44+00:00
[02:13:56] Logged the message, Master
[02:15:33] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:15:53] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:15:53] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:15:53] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:16:03] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:16:03] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:16:23] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:33] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[02:16:43] RECOVERY - DPKG on tungsten is OK: All packages OK
[02:16:43] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 387 seconds ago with 0 failures
[02:16:43] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[02:16:53] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[02:16:53] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[02:17:13] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[02:24:51] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-07 02:23:48+00:00
[02:24:57] Logged the message, Master
[02:53:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 7 02:52:10 UTC 2014 (duration 52m 9s)
[02:53:20] Logged the message, Master
[03:09:15] (03PS1) 10Springle: Some schema sanity-check scripts rescued from unversioned $HOME labyrinth on terbium. [operations/software] - 10https://gerrit.wikimedia.org/r/144421
[03:10:22] (03CR) 10Springle: [C: 032] Some schema sanity-check scripts rescued from unversioned $HOME labyrinth on terbium. [operations/software] - 10https://gerrit.wikimedia.org/r/144421 (owner: 10Springle)
[03:22:34] PROBLEM - Puppet freshness on rhenium is CRITICAL: Last successful Puppet run was Sat 05 Jul 2014 05:09:48 UTC
[04:00:00] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:00:40] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:00:50] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[04:01:30] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[05:11:59] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:59] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:11:59] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:00] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:09] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:09] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:19] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:12:19] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:29] PROBLEM - Disk space on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:29] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:39] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:49] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 2.285 second response time
[05:12:49] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[05:12:49] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[05:12:50] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[05:12:59] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[05:12:59] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1366 seconds ago with 0 failures
[05:13:09] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[05:13:09] RECOVERY - DPKG on tungsten is OK: All packages OK
[05:13:19] RECOVERY - Disk space on tungsten is OK: DISK OK
[05:13:29] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[05:13:29] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[05:22:49] PROBLEM - Puppet freshness on rhenium is CRITICAL: Last successful Puppet run was Sat 05 Jul 2014 05:09:48 UTC
[05:53:48] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Mon 07 Jul 2014 03:52:43 UTC
[06:27:53] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:12] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:22] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:28:42] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:52] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:28:52] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:28:52] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:52] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:02] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:02] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:02] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:02] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:03] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:03] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:03] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:04] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:04] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
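The "Puppet freshness" alerts above fire when a host's last successful Puppet run is too old (rhenium's dated from two days earlier). A minimal sketch of that kind of staleness check, assuming a simple age threshold; the real production threshold and check implementation are not visible in this log:

```python
from datetime import datetime, timedelta

def puppet_freshness(last_run, now, max_age=timedelta(hours=24)):
    """CRITICAL when the last successful run is older than max_age.

    max_age is an assumed threshold for illustration; the value the
    production freshness check used is not shown in the log."""
    return "CRITICAL" if now - last_run > max_age else "OK"

# rhenium's last successful run as reported above, checked at 01:22 on 7 Jul:
last = datetime(2014, 7, 5, 5, 9, 48)
now = datetime(2014, 7, 7, 1, 22, 7)
print(puppet_freshness(last, now))  # ~44 hours old, so CRITICAL
```

The separate "puppet last run" NRPE check complements this: it reports failures from the most recent run, while freshness catches hosts where Puppet has silently stopped running at all.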
[06:29:05] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:12] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:12] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:12] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:22] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:23] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:52] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[06:29:52] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[06:29:53] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[06:29:53] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[06:29:53] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[06:30:02] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1213 seconds ago with 0 failures
[06:33:32] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Mon Jul 7 06:33:30 UTC 2014
[06:39:12] PROBLEM - puppet last run on es1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:09] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:09] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:29] PROBLEM - puppet last run on db1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:49] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on es1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:03:29] RECOVERY - puppet last run on db1017 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:17:53] (03CR) 10Hashar: [C: 031] "All good :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144363 (owner: 10Matanya)
[07:21:39] PSA: I have a swift upgrade at 8utc but now in transit, will start at ~9utc
[07:23:17] PROBLEM - Puppet freshness on rhenium is CRITICAL: Last successful Puppet run was Sat 05 Jul 2014 05:09:48 UTC
[07:41:48] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:48] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[07:52:26] good morning
[07:56:06] hi hashar
[07:57:48] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:59:45] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[08:01:58] (03PS1) 10TTO: Add Foreign Word of the Day featured feed for Wiktionaries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144425 (https://bugzilla.wikimedia.org/67563)
[08:06:55] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:06:55] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:06:55] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:14:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[08:20:57] (03PS1) 10Matanya: bugzilla: update cipher_suite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144427
[08:25:55] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:25:55] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:26:55] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:29:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
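The "HTTP 5xx req/min" alerts above report what fraction of recent datapoints in a Graphite series exceed a threshold ("6.67% of data above the critical threshold [500.0]"). A minimal sketch of that kind of threshold check; the function names and the 5% trigger are assumptions for illustration, since the real check's parameters are not visible in this log:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above a threshold,
    mirroring wording like '6.67% of data above the critical
    threshold [500.0]'."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

def classify(datapoints, crit=500.0, crit_pct=5.0):
    # Assumed semantics: CRITICAL when more than crit_pct percent of
    # the datapoints exceed crit.
    return "CRITICAL" if percent_above(datapoints, crit) > crit_pct else "OK"

# One datapoint out of 15 above 500 yields the 6.67% seen in the alert:
series = [600.0] + [10.0] * 14
print("%.2f%%" % percent_above(series, 500.0), classify(series))  # 6.67% CRITICAL
```

Note that a single bad minute in a short window is enough to cross such a percentage threshold, which is consistent with how quickly these alerts flap between CRITICAL and OK in the log.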
[09:01:01] !og Jenkins manually compressing log files on gallium (done manually so I can monitor it!)
[09:02:45] (03Abandoned) 10Hashar: Simple docker module [operations/puppet] - 10https://gerrit.wikimedia.org/r/139388 (owner: 10Hashar)
[09:05:19] <_joe_> cajoel: rhenium is alarming since 3 days for like everything. Can you please acknowledge the problem if you're working on it in any way? the server seems unresponsive and console is stuck in 'stopping atop', so I'm going to power-cycle that server if I find nothing in the SAL
[09:12:12] hashar: how easy is to import some articles on ca/es betawikis?
[09:13:14] kart_: -qa :D
[09:13:24] gah.
[09:15:37] !log upgrade ms-be1002/1008 (zone1) to swift icehouse
[09:15:41] Logged the message, Master
[09:16:10] <_joe_> !log restarting rhenium, pings but no ssh since 2 days, serial console is blank and unresponsive
[09:16:15] Logged the message, Master
[09:18:07] PROBLEM - Host rhenium is DOWN: CRITICAL - Host Unreachable (208.80.154.52)
[09:18:41] <_joe_> justeh
[09:19:26] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output
[09:19:26] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[09:19:36] RECOVERY - SSH on rhenium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[09:19:36] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient
[09:19:36] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[09:19:36] RECOVERY - DPKG on rhenium is OK: All packages OK
[09:19:46] RECOVERY - Disk space on rhenium is OK: DISK OK
[09:20:06] RECOVERY - puppet disabled on rhenium is OK: OK
[09:24:06] PROBLEM - Puppet freshness on rhenium is CRITICAL: Last successful Puppet run was Sat 05 Jul 2014 05:09:48 UTC
[09:29:46] RECOVERY - Puppet freshness on rhenium is OK: puppet ran at Mon Jul 7 09:29:42 UTC 2014
[09:30:27] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:30:59] (03PS12) 10Hashar: sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698
[09:35:26] PROBLEM - RAID on db1019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[09:36:27] hola _joe_, I think i found the problem with the threshold alarms i set up on graphite: user error (user being me)
[09:40:48] !log upgrade ms-be1003/1004/1012 (zone2) to swift icehouse
[09:40:53] Logged the message, Master
[09:46:40] <_joe_> \o/
[10:03:04] they are working on swift 2.0.0 :-D
[10:12:34] ACKNOWLEDGEMENT - RAID on db1019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle RT #7808
[10:14:05] indeedly! I think they are at rc1 now
[10:25:05] (03CR) 10Alexandros Kosiaris: [C: 032] puppet: variable is in scope, no need for scope lookup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144361 (owner: 10Matanya)
[10:31:19] (03CR) 10Alexandros Kosiaris: [C: 032] firewall: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139836 (owner: 10Matanya)
[10:33:30] (03PS2) 10Filippo Giunchedi: beta: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/144363 (owner: 10Matanya)
[10:33:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] beta: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/144363 (owner: 10Matanya)
[10:35:50] (03CR) 10Alexandros Kosiaris: [C: 032] mha: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140565 (owner: 10Matanya)
[10:40:16] and it is lunch time with wife! (which is awesome)
[10:40:38] objections if I raise ops@ max_num_recipients from 10 to 20? rationale: more people in the team that can get addressed in code reviews
[10:44:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] swift: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140654 (owner: 10Matanya)
[10:58:03] <_joe_> godog: no objections from me!
[10:59:37] (03CR) 10Matanya: "Putting this change on hold per Filippo's request. I'll fix it once he finishes with swift upgrades, and rebase the rest." [operations/puppet] - 10https://gerrit.wikimedia.org/r/140654 (owner: 10Matanya)
[11:01:02] yup _joe_, done!
[11:06:06] (03CR) 10Filippo Giunchedi: [C: 031] "general plan sounds good to me!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 (owner: 10Giuseppe Lavagetto)
[11:17:38] (03PS1) 10Matanya: bugzilla: vars in scope, no need for lookup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144442
[11:27:53] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:53] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:53] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:28:43] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1071 seconds ago with 0 failures
[11:28:43] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[11:28:43] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[12:09:26] !log Updated our Jenkins job builder fork 0972985..e1ddd23
[12:09:30] Logged the message, Master
[12:11:30] !log Jenkins job builder e1ddd23 fails for us :/ Moving back to parent commit
[12:11:34] Logged the message, Master
[13:26:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[13:30:19] flow related apparently
[13:31:45] bit heavy?
[13:32:12] * Reedy beats closedmouth
[13:33:19] :)
[13:36:55] logged flow issues as https://bugzilla.wikimedia.org/show_bug.cgi?id=67592 and https://bugzilla.wikimedia.org/show_bug.cgi?id=67593
[13:39:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[13:49:38] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:57:20] (03CR) 10Chad: [C: 031] "All ready to go in ~3h :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140752 (owner: 10Chad)
[14:01:36] <^d> ottomata: Heya. Can get get elasticsearch 1.2.1 .deb forward ported to the trusty repo too? Right now it's only in precise I think.
[14:07:14] (03PS2) 10Ottomata: Eventlogging monitoring, pass true w/o quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144178 (owner: 10Nuria)
[14:07:20] (03CR) 10Ottomata: [C: 032 V: 032] Eventlogging monitoring, pass true w/o quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144178 (owner: 10Nuria)
[14:07:37] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[14:12:13] (03PS4) 10Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329
[14:12:15] (03PS1) 10Giuseppe Lavagetto: appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [operations/puppet] - 10https://gerrit.wikimedia.org/r/144453
[14:13:54] (03CR) 10jenkins-bot: [V: 04-1] appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [operations/puppet] - 10https://gerrit.wikimedia.org/r/144453 (owner: 10Giuseppe Lavagetto)
[14:14:01] <_joe_> I'm sure ori would be proud of how easy was to use his apache module
[14:14:07] :)
[14:14:07] <_joe_> but of course there are still errors
[14:14:19] surprisingly, I found a small error in the diamond module as well.
[14:14:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The comment was (partially) correct. The file was removed from the package indeed in version 1.00 but that version never made it for hardy" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144059 (owner: 10Matanya)
[14:15:19] <_joe_> YuviPanda: I just converted the mediawiki production apache config into something that fits the distro we're using, so it's a huge amount of work, errors are to be expected upon my first commit
[14:15:25] right :)
[14:15:39] _joe_: I found a small omission in diamond that was causing problems in labs, https://gerrit.wikimedia.org/r/#/c/144360/
[14:15:49] * YuviPanda feels that his quota for bugging _joe_ has been exhausted last week
[14:16:06] (03PS3) 10Giuseppe Lavagetto: nutcracker migration: allow wikidev to (re-)start nutcracker [operations/puppet] - 10https://gerrit.wikimedia.org/r/144343 (owner: 10Ori.livneh)
[14:17:44] <_joe_> YuviPanda: eheh right
[14:17:53] <_joe_> classical example of omission
[14:18:13] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker migration: allow wikidev to (re-)start nutcracker [operations/puppet] - 10https://gerrit.wikimedia.org/r/144343 (owner: 10Ori.livneh)
[14:18:20] _joe_: yeah, and I had a few things that were set to ensure absent but still were being sent. I went around looking if puppet hadn't run...
[14:18:35] <_joe_> ok :)
[14:18:51] _joe_: I shall try to bug you lesser this week :)
[14:19:05] (03CR) 10Giuseppe Lavagetto: [C: 032] diamond: Make ensure => absent work [operations/puppet] - 10https://gerrit.wikimedia.org/r/144360 (owner: 10Yuvipanda)
[14:19:12] _joe_: ty!
[14:19:54] (03PS3) 10Matanya: vimrc: remove from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/144059
[14:21:15] what happened to grrrit-wm ? needs a hug?
[14:21:25] godog: should be up, why?
[14:21:35] godog: I see messages...
[14:21:44] YuviPanda: nothing, was wondering grrrit-wm1 vs grrrit-wm
[14:21:52] godog: oh, must've been a netsplit
[14:22:02] I can restart it if it bothers you :)
[14:22:28] haha not particularly, but I've put the bot traffic as NOTICE, that's why I noticed
[14:22:36] ah :)
[14:22:39] I can add grrrit-wm1 to list of nicks
[14:22:53] :)
[14:26:58] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[14:42:13] _joe_ added the disk back to ms-be1007 but getting errors on puppet run
[14:43:35] can you take a look...it may be rebuilding the disk but the controller doesn't state it's rebuilding just online-spun up
[14:43:57] <_joe_> cmjohnson1: I think godog could be a better inspector there :)
[14:44:15] k
[14:45:02] <_joe_> but if he's not available I can step in
[14:45:12] (03PS12) 10Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597
[14:46:48] yup, taking a look cmjohnson1
[14:46:50] thanks _joe_
[14:48:44] (03PS1) 10ArielGlenn: dumps deployment setup: replace with salt-master based system [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/144457
[14:49:37] !log (Cirrus) Applying cache warmer configuration that went out last Thursday to all wikipedias.
[14:49:41] Logged the message, Master
[14:50:00] <^d> manybubbles: Speaking of config, we sent out the destructive_requires_name change Thurs.
[14:50:06] <^d> Turns out it didn't need a cluster restart.
[14:50:12] ^d: I saw! thanks
[14:51:37] (03CR) 10Giuseppe Lavagetto: "Thanks ori. This approach makes more sense and is safer." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 (owner: 10Giuseppe Lavagetto)
[14:54:42] <^d> ottomata: About?
[14:55:29] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:56:25] yup hey ^d
[14:56:28] saw your request, can do...
[14:56:34] <^d> Cool beans, thx!
[14:56:59] akosiaris: around?
[14:57:29] PROBLEM - MySQL Recent Restart Port 3308 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:26] RECOVERY - MySQL Recent Restart Port 3308 on labsdb1002 is OK: OK seconds since restart
[15:01:30] ^d, why do you want it forward ported now
[15:01:30] ?
[15:01:31] for labs?
[15:02:45] <^d> For vagrant, mostly.
[15:03:20] <^d> Hadn't thought of labs. That's a nice bonus too, sure.
[15:04:04] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: move config in puppet, work with new packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 (owner: 10Giuseppe Lavagetto)
[15:04:05] hmm, i have never done this before...
[15:04:12] no idea what's involved
[15:04:25] cmjohnson1: can't reach ms-be1007 ATM, down on purpose?
[15:04:25] do I have to rebuild the package? or can I just edit the dist somehow?
[15:04:25] hm
[15:04:48] godog no it's not
[15:04:56] ^d: are you talking about getting elasticsearch into vagrant again?
[15:04:59] <^d> Yep.
[15:05:10] cmjohnson1: ok! taking a look
[15:05:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[15:05:44] <_joe_> mmmh right when I was about to merge the new nutcracker patch
[15:06:21] <_joe_> ok, a small transient
[15:06:39] ^d: sweet
[15:06:42] ^d, you can just dpkg -i ? :)
[15:06:54] <_joe_> if we see nutcracker processes alerts, btw, that is expected while we wait for the second puppet run on neon
[15:06:55] <^d> I can work around it a couple of ways, sure.
[15:07:14] <^d> Still would be nice to fix ;-)
[15:08:01] akosiaris: do you know, can I just change deb-override and set Distribution for elasticsearch packages to trusty?
[15:08:51] !log powercycled ms-be1007, unresponsive on console and remnants of a stack trace
[15:08:56] Logged the message, Master
[15:12:26] PROBLEM - MySQL Recent Restart Port 3308 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
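On the question above about forward-porting the elasticsearch .deb from precise to trusty: in a reprepro-managed APT repository this is usually done with reprepro's `copy` command, which re-references an already uploaded package under another distribution without rebuilding it. A hedged sketch; the repository path and suite names here are assumptions for illustration, not taken from this log, and the command is only echoed rather than run:

```shell
# Sketch: forward-port an existing .deb between suites in a reprepro repo.
# All paths/suite names below are hypothetical examples.
REPO=/srv/wikimedia        # assumed reprepro base directory (-b)
PKG=elasticsearch
SRC=precise-wikimedia      # suite the package currently lives in
DST=trusty-wikimedia       # suite to copy it into

# reprepro copy <destination> <source> <package...>; echoed here as a dry run:
echo "reprepro -vb $REPO copy $DST $SRC $PKG"
```

Whether a binary built against precise actually works on trusty is a separate question (dependencies may differ), which is presumably why the "do I have to rebuild the package?" doubt comes up in the conversation.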
[15:13:16] RECOVERY - MySQL Recent Restart Port 3308 on labsdb1002 is OK: OK seconds since restart
[15:14:26] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[15:16:26] PROBLEM - MySQL Recent Restart Port 3308 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:16:39] !log reseating PEM2 cr1-eqiad
[15:16:44] Logged the message, Master
[15:17:20] jouncebot: No ping this morning? I agree there was no need for one (nothing in this morning's SWAT window), but I didn't know you realized that.
[15:17:56] PROBLEM - MySQL Slave Delay Port 3308 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:06] PROBLEM - check configured eth on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:06] PROBLEM - mysqld processes on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:06] PROBLEM - MySQL Idle Transactions Port 3306 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:06] PROBLEM - MySQL Idle Transactions Port 3308 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:11] (03CR) 10Rush: "removing myself here as it seems the cleanup is happening elsewhere :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144092 (owner: 10Dzahn)
[15:18:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[15:18:26] PROBLEM - MySQL Idle Transactions Port 3307 on labsdb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:46] RECOVERY - MySQL Slave Delay Port 3308 on labsdb1002 is OK: OK replication delay 0 seconds
[15:18:56] RECOVERY - mysqld processes on labsdb1002 is OK: PROCS OK: 3 processes with command name mysqld
[15:18:56] RECOVERY - check configured eth on labsdb1002 is OK: NRPE: Unable to read output
[15:18:56] RECOVERY - MySQL Idle Transactions Port 3306 on labsdb1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[15:18:57] RECOVERY - MySQL Idle Transactions Port 3308 on labsdb1002 is OK: OK longest blocking idle transaction sleeps for seconds
[15:19:16] RECOVERY - MySQL Idle Transactions Port 3307 on labsdb1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[15:19:16] RECOVERY - MySQL Recent Restart Port 3308 on labsdb1002 is OK: OK seconds since restart
[15:20:36] ottomata: what's the problem ?
[15:21:44] PROBLEM - nutcracker port on mw1073 is CRITICAL: Connection refused
[15:21:45] PROBLEM - nutcracker port on mw1032 is CRITICAL: Connection refused
[15:21:45] PROBLEM - nutcracker port on mw1097 is CRITICAL: Connection refused
[15:21:45] PROBLEM - nutcracker port on mw1148 is CRITICAL: Connection refused
[15:21:45] PROBLEM - nutcracker port on mw1180 is CRITICAL: Connection refused
[15:21:45] PROBLEM - nutcracker port on mw1108 is CRITICAL: Connection refused
[15:21:45] PROBLEM - nutcracker port on mw1168 is CRITICAL: Connection refused
[15:21:46] PROBLEM - nutcracker port on mw1198 is CRITICAL: Connection refused
[15:21:46] PROBLEM - nutcracker port on mw1183 is CRITICAL: Connection refused
[15:21:50] akosiaris: apertium packages comes with Trusty has weird bug and we need to recompile it and installed on beta labs.
[15:21:54] PROBLEM - nutcracker port on mw1087 is CRITICAL: Connection refused
[15:21:54] PROBLEM - nutcracker port on mw1113 is CRITICAL: Connection refused
[15:21:54] PROBLEM - nutcracker port on mw1137 is CRITICAL: Connection refused
[15:21:54] PROBLEM - nutcracker port on mw1056 is CRITICAL: Connection refused
[15:21:54] PROBLEM - nutcracker port on mw1158 is CRITICAL: Connection refused
[15:21:55] PROBLEM - nutcracker port on mw1103 is CRITICAL: Connection refused
[15:21:55] PROBLEM - nutcracker port on tmh1002 is CRITICAL: Connection refused
[15:21:56] PROBLEM - nutcracker port on mw1203 is CRITICAL: Connection refused
[15:21:56] PROBLEM - nutcracker port on mw1022 is CRITICAL: Connection refused
[15:21:57] PROBLEM - nutcracker port on mw1037 is CRITICAL: Connection refused
[15:21:57] PROBLEM - nutcracker port on mw1027 is CRITICAL: Connection refused
[15:21:58] PROBLEM - nutcracker port on mw1015 is CRITICAL: Connection refused
[15:21:58] PROBLEM - nutcracker port on mw1093 is CRITICAL: Connection refused
[15:21:59] PROBLEM - nutcracker port on mw1079 is CRITICAL: Connection refused
[15:21:59] PROBLEM - nutcracker port on terbium is CRITICAL: Connection refused
[15:22:00] PROBLEM - nutcracker port on mw1193 is CRITICAL: Connection refused
[15:22:00] PROBLEM - nutcracker port on mw1154 is CRITICAL: Connection refused
[15:22:01] PROBLEM - nutcracker port on mw1163 is CRITICAL: Connection refused
[15:22:01] PROBLEM - nutcracker port on mw1184 is CRITICAL: Connection refused
[15:22:02] PROBLEM - nutcracker port on mw1188 is CRITICAL: Connection refused
[15:22:02] PROBLEM - nutcracker port on mw1219 is CRITICAL: Connection refused
[15:22:04] PROBLEM - nutcracker port on mw1033 is CRITICAL: Connection refused
[15:22:04] PROBLEM - nutcracker port on mw1051 is CRITICAL: Connection refused
[15:22:04] PROBLEM - nutcracker port on mw1070 is CRITICAL: Connection refused
[15:22:04] PROBLEM - nutcracker port on mw1104 is CRITICAL: Connection refused
[15:22:05] PROBLEM - nutcracker port on mw1098 is CRITICAL: Connection refused
[15:22:05] PROBLEM - nutcracker port on mw1121 is CRITICAL: Connection refused
[15:22:06] PROBLEM - nutcracker port on mw1143 is CRITICAL: Connection refused
[15:22:06] PROBLEM - nutcracker port on mw1209 is CRITICAL: Connection refused
[15:22:07] PROBLEM - nutcracker port on mw1214 is CRITICAL: Connection refused
[15:22:07] PROBLEM - nutcracker port on mw1199 is CRITICAL: Connection refused
[15:22:08] PROBLEM - nutcracker port on mw1019 is CRITICAL: Connection refused
[15:22:08] PROBLEM - nutcracker port on mw1023 is CRITICAL: Connection refused
[15:22:09] PROBLEM - nutcracker port on mw1029 is CRITICAL: Connection refused
[15:22:09] akosiaris: where should I start looking at guide to add them?
[15:22:09] PROBLEM - nutcracker port on mw1090 is CRITICAL: Connection refused
[15:22:10] PROBLEM - nutcracker port on mw1165 is CRITICAL: Connection refused
[15:22:10] PROBLEM - nutcracker port on mw1074 is CRITICAL: Connection refused
[15:22:11] PROBLEM - nutcracker port on mw1181 is CRITICAL: Connection refused
[15:22:11] PROBLEM - nutcracker port on mw1047 is CRITICAL: Connection refused
[15:22:12] PROBLEM - nutcracker port on mw1133 is CRITICAL: Connection refused
[15:22:12] PROBLEM - nutcracker port on mw1194 is CRITICAL: Connection refused
[15:22:13] PROBLEM - nutcracker port on mw1127 is CRITICAL: Connection refused
[15:22:14] PROBLEM - nutcracker port on mw1001 is CRITICAL: Connection refused
[15:22:14] PROBLEM - nutcracker port on mw1010 is CRITICAL: Connection refused
[15:22:14] PROBLEM - nutcracker port on mw1057 is CRITICAL: Connection refused
[15:22:15] PROBLEM - nutcracker port on mw1110 is CRITICAL: Connection refused
[15:22:15] PROBLEM - nutcracker port on mw1016 is CRITICAL: Connection refused
[15:22:16] PROBLEM - nutcracker port on mw1155 is CRITICAL: Connection refused
[15:22:16] PROBLEM - nutcracker port on mw1105 is CRITICAL: Connection refused
[15:22:17] PROBLEM - nutcracker port on mw1169 is CRITICAL: Connection refused
[15:22:17] PROBLEM - nutcracker port on mw1101 is CRITICAL: Connection refused
[15:22:18] PROBLEM - nutcracker port on mw1215 is CRITICAL: Connection refused
[15:22:18] PROBLEM - nutcracker port on mw1139 is CRITICAL: Connection refused
[15:22:19] PROBLEM - nutcracker port on mw1210 is CRITICAL: Connection refused
[15:22:19] PROBLEM - nutcracker port on mw1204 is CRITICAL: Connection refused
[15:22:20] PROBLEM - nutcracker port on mw1071 is CRITICAL: Connection refused
[15:22:20] PROBLEM - nutcracker port on mw1034 is CRITICAL: Connection refused
[15:22:21] PROBLEM - nutcracker port on mw1116 is CRITICAL: Connection refused
[15:22:21] PROBLEM - nutcracker port on mw1085 is CRITICAL: Connection refused
[15:22:22] PROBLEM - nutcracker port on mw1159 is CRITICAL: Connection refused
[15:22:22] PROBLEM - nutcracker port on mw1185 is CRITICAL: Connection refused
[15:22:23] PROBLEM - nutcracker port on mw1095 is CRITICAL: Connection refused
[15:22:23] PROBLEM - nutcracker port on mw1220 is CRITICAL: Connection refused
[15:22:24] PROBLEM - nutcracker port on mw1020 is CRITICAL: Connection refused
[15:22:24] PROBLEM - nutcracker port on mw1030 is CRITICAL: Connection refused
[15:22:25] PROBLEM - nutcracker port on mw1064 is CRITICAL: Connection refused
[15:22:25] PROBLEM - nutcracker port on mw1075 is CRITICAL: Connection refused
[15:22:26] PROBLEM - nutcracker port on mw1081 is CRITICAL: Connection refused
[15:22:26] PROBLEM - nutcracker port on mw1111 is CRITICAL: Connection refused
[15:22:27] PROBLEM - nutcracker port on mw1128 is CRITICAL: Connection refused
[15:22:27] PROBLEM - nutcracker port on fenari is CRITICAL: Connection refused
[15:22:28] PROBLEM - nutcracker port on mw1122 is CRITICAL: Connection refused
[15:22:28] PROBLEM - nutcracker port on mw1151 is CRITICAL: Connection refused
[15:22:29] PROBLEM - nutcracker port on mw1167 is CRITICAL: Connection refused
[15:22:29] PROBLEM - nutcracker port on mw1190 is CRITICAL: Connection
refused [15:22:30] PROBLEM - nutcracker port on mw1201 is CRITICAL: Connection refused [15:22:30] PROBLEM - nutcracker port on mw1049 is CRITICAL: Connection refused [15:22:31] PROBLEM - nutcracker port on mw1017 is CRITICAL: Connection refused [15:22:31] PROBLEM - nutcracker port on mw1091 is CRITICAL: Connection refused [15:22:32] PROBLEM - nutcracker port on mw1096 is CRITICAL: Connection refused [15:22:32] PROBLEM - nutcracker port on mw1135 is CRITICAL: Connection refused [15:22:33] PROBLEM - nutcracker port on mw1146 is CRITICAL: Connection refused [15:22:33] PROBLEM - nutcracker port on mw1186 is CRITICAL: Connection refused [15:22:34] PROBLEM - nutcracker port on mw1156 is CRITICAL: Connection refused [15:22:34] PROBLEM - nutcracker port on mw1004 is CRITICAL: Connection refused [15:22:35] PROBLEM - nutcracker port on mw1024 is CRITICAL: Connection refused [15:22:35] PROBLEM - nutcracker port on mw1043 is CRITICAL: Connection refused [15:22:36] PROBLEM - nutcracker port on mw1077 is CRITICAL: Connection refused [15:22:36] PROBLEM - nutcracker port on mw1053 is CRITICAL: Connection refused [15:22:37] PROBLEM - nutcracker port on mw1107 is CRITICAL: Connection refused [15:22:37] PROBLEM - nutcracker port on mw1058 is CRITICAL: Connection refused [15:22:38] PROBLEM - nutcracker port on mw1152 is CRITICAL: Connection refused [15:22:38] PROBLEM - nutcracker port on mw1179 is CRITICAL: Connection refused [15:22:39] PROBLEM - nutcracker port on mw1102 is CRITICAL: Connection refused [15:22:39] PROBLEM - nutcracker port on mw1202 is CRITICAL: Connection refused [15:22:40] PROBLEM - nutcracker port on mw1086 is CRITICAL: Connection refused [15:22:40] PROBLEM - nutcracker port on mw1066 is CRITICAL: Connection refused [15:22:41] PROBLEM - nutcracker port on mw1125 is CRITICAL: Connection refused [15:22:41] PROBLEM - nutcracker port on mw1021 is CRITICAL: Connection refused [15:22:42] PROBLEM - nutcracker port on mw1050 is CRITICAL: Connection refused [15:22:42] 
PROBLEM - nutcracker port on mw1112 is CRITICAL: Connection refused [15:22:43] PROBLEM - nutcracker port on mw1191 is CRITICAL: Connection refused [15:22:43] PROBLEM - nutcracker port on mw1182 is CRITICAL: Connection refused [15:22:44] PROBLEM - nutcracker port on mw1171 is CRITICAL: Connection refused [15:22:44] PROBLEM - nutcracker port on mw1131 is CRITICAL: Connection refused [15:22:45] PROBLEM - nutcracker port on mw1018 is CRITICAL: Connection refused [15:22:45] PROBLEM - nutcracker port on mw1014 is CRITICAL: Connection refused [15:22:46] PROBLEM - nutcracker port on mw1136 is CRITICAL: Connection refused [15:22:46] PROBLEM - nutcracker port on mw1078 is CRITICAL: Connection refused [15:22:47] PROBLEM - nutcracker port on mw1142 is CRITICAL: Connection refused [15:22:47] PROBLEM - nutcracker port on mw1207 is CRITICAL: Connection refused [15:22:48] PROBLEM - nutcracker port on tmh1001 is CRITICAL: Connection refused [15:22:48] PROBLEM - nutcracker port on mw1083 is CRITICAL: Connection refused [15:22:49] PROBLEM - nutcracker port on mw1157 is CRITICAL: Connection refused [15:22:49] PROBLEM - nutcracker port on mw1212 is CRITICAL: Connection refused [15:22:51] <^d> ohaithere icinga-wm. [15:22:54] o [15:22:55] kart_: what kind of bugs ? 
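[Editor's note] The nutcracker alerts flooding the channel come from a plain TCP connect check against port 11212. A minimal sketch of such a probe (this is not the actual Icinga plugin; the `check_tcp` helper and its output format are merely modeled on the log lines above):

```python
import socket
import time

def check_tcp(host: str, port: int, timeout: float = 10.0) -> str:
    """Probe a TCP port the way a Nagios-style check does: report OK with
    the connect time, or CRITICAL when the connection is refused or times out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return f"TCP OK - {elapsed:.3f} second response time on port {port}"
    except OSError:
        return "CRITICAL: Connection refused"

# e.g. check_tcp("mw1017", 11212) against a live nutcracker instance
```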
[15:23:14] weird ones [15:23:24] RECOVERY - nutcracker port on mw1017 is OK: TCP OK - 0.000 second response time on port 11212 [15:23:31] no such thing as non-weird bugs :-) [15:23:32] jouncebot: next [15:23:32] In 0 hour(s) and 36 minute(s): Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140707T1600) [15:24:00] (03PS1) 10Giuseppe Lavagetto: nutcracker: correct whitespace in config, stats port [operations/puppet] - 10https://gerrit.wikimedia.org/r/144461 [15:24:03] hi greg-g [15:24:05] though the ones looking at you with that crazed look are the creepiest [15:24:17] oh wait, that's me, not the bugs :P [15:24:22] welcome back :) [15:24:28] (03CR) 10coren: [C: 032] "That's okay, but wouldn't it be fundamentally better to not have diamond log so much stuff in the first place instead?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144331 (owner: 10Yuvipanda) [15:24:29] thanks aude :) [15:24:36] got my mail? [15:25:03] (03CR) 10Alexandros Kosiaris: [C: 032] vimrc: remove from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/144059 (owner: 10Matanya) [15:25:17] <_joe_> hey I told you don't worry, that's me changing things [15:25:25] Coren: re: your suggestion, https://gerrit.wikimedia.org/r/#/c/144330/ goes halfway there :) [15:25:48] Coren: but that's not labs only, so waiting for chasemp [15:25:51] (03PS2) 10Giuseppe Lavagetto: nutcracker: correct whitespace in config, stats port [operations/puppet] - 10https://gerrit.wikimedia.org/r/144461 [15:26:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] nutcracker: correct whitespace in config, stats port [operations/puppet] - 10https://gerrit.wikimedia.org/r/144461 (owner: 10Giuseppe Lavagetto) [15:26:14] I'm about, looking at those now actually [15:26:50] aude: hmm, which one? 
/me is still digging out [15:27:34] greg-g: that weird things happen w/ git and renaming of directories when working on a mac [15:27:44] aude: yeah, sure [15:27:45] for that reason, we didn't have our branch ready thursday [15:27:50] * greg-g nods [15:27:50] so want to update it today [15:27:56] yep [15:28:00] cool [15:28:24] RECOVERY - nutcracker port on mw1190 is OK: TCP OK - 0.000 second response time on port 11212 [15:28:34] RECOVERY - nutcracker port on mw1125 is OK: TCP OK - 0.000 second response time on port 11212 [15:28:44] RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212 [15:28:48] prepare for moar spam [15:28:54] RECOVERY - nutcracker port on mw1087 is OK: TCP OK - 0.000 second response time on port 11212 [15:28:54] RECOVERY - nutcracker port on mw1079 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:04] RECOVERY - nutcracker port on mw1098 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:04] RECOVERY - nutcracker port on mw1133 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:04] RECOVERY - nutcracker port on mw1165 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:14] RECOVERY - nutcracker port on mw1034 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:19] akosiaris: apertium is wrongly linked against a different version of the pcre lib [15:29:24] RECOVERY - nutcracker port on mw1111 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:24] RECOVERY - nutcracker port on mw1151 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:24] RECOVERY - nutcracker port on mw1049 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:29] <_joe_> :) [15:29:34] RECOVERY - nutcracker port on mw1202 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:35] akosiaris: so, it doesn't work unless we recompile [15:29:41] <_joe_> more recoveries coming, sorry guys [15:29:43] Reedy: correct :) [15:29:44] RECOVERY - nutcracker 
port on mw1183 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:44] RECOVERY - nutcracker port on mw1180 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:44] RECOVERY - nutcracker port on mw1168 is OK: TCP OK - 0.000 second response time on port 11212 [15:29:58] <_joe_> that was a "-" in an erb template to cancel trailing whitespace [15:30:01] <_joe_> :( [15:30:04] RECOVERY - nutcracker port on mw1051 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:04] (03PS1) 10Jgreen: re-replace OTRS favicon with wmf (metawiki) one, Bugzilla 17271 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144462 [15:30:04] RECOVERY - nutcracker port on mw1181 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:14] RECOVERY - nutcracker port on mw1057 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:14] RECOVERY - nutcracker port on mw1159 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:15] (03PS2) 10Rush: diamond: Remove archive.log handler [operations/puppet] - 10https://gerrit.wikimedia.org/r/144330 (owner: 10Yuvipanda) [15:30:24] RECOVERY - nutcracker port on mw1030 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:24] RECOVERY - nutcracker port on mw1081 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:24] RECOVERY - nutcracker port on mw1156 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:24] RECOVERY - nutcracker port on mw1146 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:34] (03CR) 10Rush: [C: 032] diamond: Remove archive.log handler [operations/puppet] - 10https://gerrit.wikimedia.org/r/144330 (owner: 10Yuvipanda) [15:30:34] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:34] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:34] RECOVERY - nutcracker port on mw1050 is OK: TCP OK - 0.000 second response time on port 11212 
[15:30:40] (03CR) 10Rush: [V: 032] diamond: Remove archive.log handler [operations/puppet] - 10https://gerrit.wikimedia.org/r/144330 (owner: 10Yuvipanda) [15:30:43] greg-g: I often wish each day in the deployments calendar were a section I could edit. Is such tomfoolery even conceivable!? [15:30:44] RECOVERY - nutcracker port on mw1212 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:44] RECOVERY - nutcracker port on mw1198 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:46] chasemp: ty! [15:30:48] cmjohnson1: ok I guess we'll just let it fail again then (#7711) [15:30:54] RECOVERY - nutcracker port on mw1056 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:54] RECOVERY - nutcracker port on mw1188 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:54] RECOVERY - nutcracker port on mw1163 is OK: TCP OK - 0.000 second response time on port 11212 [15:30:57] akosiaris: just recompiling should be okay. [15:31:04] YuviPanda: on the make enable do its thing [15:31:04] RECOVERY - nutcracker port on mw1074 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:04] RECOVERY - nutcracker port on mw1023 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:05] RECOVERY - nutcracker port on mw1029 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:12] godog: okay..odd [15:31:14] RECOVERY - nutcracker port on mw1210 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:14] RECOVERY - nutcracker port on mw1116 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:14] I had one comment which was just to put a note in the header doc [15:31:23] kart_: how so ?
[15:31:30] that indicates basically this isn't affecting the in config enabled settings, but is removing the collector config [15:31:32] if that makes sense [15:31:34] RECOVERY - nutcracker port on mw1171 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:44] RECOVERY - nutcracker port on mw1097 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:44] RECOVERY - nutcracker port on mw1148 is OK: TCP OK - 0.000 second response time on port 11212 [15:31:45] chasemp: ah, yeah, let me do that [15:31:48] having ensure in puppet and enabled in the template and that enabled is always true, etc starts to get confusin [15:31:49] g [15:31:54] yeah [15:31:54] It looks like it's all in Lua...hrm [15:32:03] but otherwise cool stuff [15:32:07] Accursed mwalker|away and his helpfulness [15:32:24] RECOVERY - nutcracker port on mw1122 is OK: TCP OK - 0.000 second response time on port 11212 [15:32:24] RECOVERY - nutcracker port on mw1201 is OK: TCP OK - 0.000 second response time on port 11212 [15:32:34] RECOVERY - nutcracker port on mw1077 is OK: TCP OK - 0.000 second response time on port 11212 [15:32:34] RECOVERY - nutcracker port on mw1043 is OK: TCP OK - 0.000 second response time on port 11212 [15:32:36] marktraceur: it could be, theoretically I suppose, we should fooler the tom with mwalker|away [15:32:37] chasemp: btw, the puppet state sudo patch got reverted by Coren, apparently in some places (betalabs?) the code was updated but the sudo rule wasn't, which feels... very weird, since they're part of the same define [15:32:45] RECOVERY - nutcracker port on mw1032 is OK: TCP OK - 0.000 second response time on port 11212 [15:32:47] Right [15:32:54] RECOVERY - nutcracker port on mw1022 is OK: TCP OK - 0.000 second response time on port 11212 [15:32:57] !log Restarted logstash on logstash1001 because log volume looked lower than I thought it should be.
[15:33:01] (03CR) 10Jgreen: [C: 032 V: 031] re-replace OTRS favicon with wmf (metawiki) one, Bugzilla 17271 [operations/puppet] - 10https://gerrit.wikimedia.org/r/144462 (owner: 10Jgreen) [15:33:02] Logged the message, Master [15:33:04] RECOVERY - nutcracker port on mw1033 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:04] RECOVERY - nutcracker port on mw1121 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:11] YuviPanda: I can help you look this week if you want, but labs as a whole has some islands puppet master wise to be sure [15:33:14] RECOVERY - nutcracker port on mw1001 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:14] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:14] RECOVERY - nutcracker port on mw1105 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:14] RECOVERY - nutcracker port on mw1185 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:17] so IDK what the best thing to do there is [15:33:18] yet [15:33:24] RECOVERY - nutcracker port on mw1167 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:24] RECOVERY - nutcracker port on mw1186 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:28] chasemp: yeah, but even then - they should either get applied together, or not at all [15:33:33] same commit, same define [15:33:33] akosiaris: http://wiki.apertium.org/wiki/Installation_troubleshooting#Runtime_errors [15:33:34] RECOVERY - nutcracker port on mw1024 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:34] RECOVERY - nutcracker port on mw1152 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:44] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:44] RECOVERY - nutcracker port on mw1108 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:49] YuviPanda: It may simply be a case that it /did/ get applied, but that 
because of the rest of the local sudoers file it could not work. [15:33:50] YuviPanda: ah now I understand [15:33:54] RECOVERY - nutcracker port on mw1093 is OK: TCP OK - 0.000 second response time on port 11212 [15:33:55] how old is ubuntu there? [15:33:58] chasemp: 12.04 [15:33:59] any chance it's really old [15:33:59] Mostly an upstream issue, it seems. Can be fixed via a recompile. [15:34:08] Coren: but the revert never made it to betalabs, I think [15:34:09] YuviPanda: The problem in beta was that the cherry-picked version was out of date and caused a merge conflict on puppet git update. [15:34:14] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [15:34:21] bd808: aaaha! [15:34:23] that makes sense. [15:34:24] It should be fixed now. [15:34:24] RECOVERY - nutcracker port on mw1091 is OK: TCP OK - 0.000 second response time on port 11212 [15:34:34] Coren: ^ bd808 says it should be fixed now :) [15:34:39] greg-g: https://wikitech.wikimedia.org/wiki/User:MarkTraceur/Sandbox it's ugly but it works [15:34:39] YuviPanda: so w/ beta updated should be gtg? [15:34:46] akosiaris: in the future, we also plan to use the 'latest' apertium from upstream - so we want to start with 'homegrown' packages later. [15:34:50] bd808: I'll remember to kill any cherry picks that I do when they get merged [15:34:51] chasemp: I think so [15:34:54] RECOVERY - nutcracker port on mw1027 is OK: TCP OK - 0.000 second response time on port 11212 [15:34:54] RECOVERY - nutcracker port on mw1193 is OK: TCP OK - 0.000 second response time on port 11212 [15:34:54] RECOVERY - nutcracker port on mw1219 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:01] YuviPanda: I don't think it did, because the puppetmaster there got confused and I really needed to turn off the spam before gmail again started throttling email.
[15:35:04] RECOVERY - nutcracker port on mw1209 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:04] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:04] RECOVERY - nutcracker port on mw1090 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:11] * marktraceur makes bold changes etc. [15:35:14] RECOVERY - nutcracker port on mw1204 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:14] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:14] RECOVERY - nutcracker port on mw1220 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:19] Coren: yeah, that's fine. I didn't remember to uncherrypick it [15:35:19] <_joe_> he. This will go on for a few minutes [15:35:24] RECOVERY - nutcracker port on mw1064 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:34] RECOVERY - nutcracker port on mw1107 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:34] RECOVERY - nutcracker port on mw1086 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:34] RECOVERY - nutcracker port on mw1066 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:34] RECOVERY - nutcracker port on mw1112 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:45] YuviPanda: Either way, you can try again some time this week though today would be a bad day for me. [15:35:49] (03PS2) 10Ottomata: Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 [15:35:53] YuviPanda: They clean up by themselves as long as the cherry-picked version is up to date but often cause rebase conflicts if the patch had been amended in gerrit since it was picked.
[15:35:54] RECOVERY - nutcracker port on mw1113 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:54] RECOVERY - nutcracker port on mw1158 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:54] RECOVERY - nutcracker port on mw1203 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:54] RECOVERY - nutcracker port on mw1154 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:54] RECOVERY - nutcracker port on mw1037 is OK: TCP OK - 0.000 second response time on port 11212 [15:35:55] RECOVERY - nutcracker port on terbium is OK: TCP OK - 0.000 second response time on port 11212 [15:36:04] RECOVERY - nutcracker port on mw1104 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:14] RECOVERY - nutcracker port on mw1110 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:14] RECOVERY - nutcracker port on mw1215 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:14] RECOVERY - nutcracker port on mw1155 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:14] RECOVERY - nutcracker port on mw1071 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:24] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.001 second response time on port 11212 [15:36:32] * bd808 notes that there are 5 hosts in beta that have failing puppet runs at the moment. 
[15:36:34] RECOVERY - nutcracker port on mw1021 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:34] RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:44] RECOVERY - nutcracker port on tmh1001 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:44] RECOVERY - nutcracker port on mw1018 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:44] RECOVERY - nutcracker port on mw1207 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:44] RECOVERY - nutcracker port on mw1073 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:47] bd808: Anything exciting? [15:36:50] (03PS1) 10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [15:36:53] chasemp: bd808 ^ [15:36:54] RECOVERY - nutcracker port on mw1137 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:54] RECOVERY - nutcracker port on mw1103 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:02] YuviPanda: if I recall it used to be that if a collector threw an exception it would be disabled [15:37:04] RECOVERY - nutcracker port on mw1199 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:04] RECOVERY - nutcracker port on mw1047 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:04] RECOVERY - nutcracker port on mw1194 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:09] <_joe_> bd808: btw, beta could be the right place to test the new mediawiki apache config [15:37:10] so you could make it so a bad return on your sudo call threw an exception [15:37:14] RECOVERY - nutcracker port on mw1101 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:21] and then it at least wouldn't continue to spam in a weird sudo case [15:37:24] RECOVERY - nutcracker port on mw1128 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:24] RECOVERY - nutcracker 
port on fenari is OK: TCP OK - 0.000 second response time on port 11212 [15:37:26] chasemp: the spam was from the attempt to sudo itself. but yes, makes sense. [15:37:34] RECOVERY - nutcracker port on mw1179 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:34] RECOVERY - nutcracker port on mw1058 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:55] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [15:37:56] Reedy: I haven't dug into them today. Probably some leftovers from module refactorings clashing with beta only roles/classes. [15:37:57] chasemp: btw, I don't think it gets disabled [15:38:04] RECOVERY - nutcracker port on mw1070 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:04] kart_: the fact about pcre versions is correct, then everyone should have a problem with apertium and ubuntu. We could ask the maintainer to rebuild the package and have everyone benefit and not just us. As far as the "latest" apertium goes, let's try to avoid that please, for all the myriads of reasons that we avoid that with any "latest" software [15:38:04] RECOVERY - nutcracker port on mw1019 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:07] well that's silly [15:38:14] RECOVERY - nutcracker port on mw1169 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:14] RECOVERY - nutcracker port on mw1085 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:14] RECOVERY - nutcracker port on mw1095 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:19] chasemp: if you grep diamond logs from before the patch, the puppet collector tries to run every time, has a permission error, and logs it, and then is tried again next time [15:38:20] been a long time since I tried to use that feature tho :) [15:38:24] Reedy: I know there were some caused by beta scap hacks. 
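[Editor's note] The pcre mismatch kart_ describes can be confirmed by inspecting which shared libraries the installed binary actually resolves at runtime. A minimal sketch (the apertium path in the comment is an assumption; `ldd` works on any dynamically linked ELF binary):

```python
import subprocess

def linked_libs(binary: str, needle: str = "") -> list[str]:
    """Run ldd on a binary and return its shared-library lines, optionally
    filtered, e.g. needle='pcre' to see which libpcre the Trusty apertium
    package really linked against."""
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in out.splitlines() if needle in line]

# e.g. linked_libs("/usr/bin/apertium", "pcre")  # path is an assumption
```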
[15:38:24] RECOVERY - nutcracker port on mw1020 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:24] RECOVERY - nutcracker port on mw1075 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:34] RECOVERY - nutcracker port on mw1102 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:34] chasemp: so it doesn't do any good :) [15:38:34] RECOVERY - nutcracker port on mw1182 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:38] kart_: s/the fact/if the fact/ [15:38:44] RECOVERY - nutcracker port on mw1078 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:44] RECOVERY - nutcracker port on mw1157 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:54] RECOVERY - nutcracker port on tmh1002 is OK: TCP OK - 0.000 second response time on port 11212 [15:38:54] RECOVERY - nutcracker port on mw1184 is OK: TCP OK - 0.000 second response time on port 11212 [15:39:04] RECOVERY - nutcracker port on mw1214 is OK: TCP OK - 0.000 second response time on port 11212 [15:39:05] RECOVERY - nutcracker port on mw1127 is OK: TCP OK - 0.000 second response time on port 11212 [15:39:24] RECOVERY - nutcracker port on mw1096 is OK: TCP OK - 0.000 second response time on port 11212 [15:39:27] cmjohnson1: at some point the controller thought it was failed, there's "adpevents" in root's home look for seqNum: 0x00001502 [15:39:34] RECOVERY - nutcracker port on mw1191 is OK: TCP OK - 0.000 second response time on port 11212 [15:39:49] RECOVERY - nutcracker port on mw1083 is OK: TCP OK - 0.000 second response time on port 11212 [15:39:49] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [15:40:10] _joe_: Yeah. There is weirdness with beta and the apache config however because we run a forked branch of the apache repo. I'm not sure if that could be eliminated or not. 
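[Editor's note] chasemp's suggestion above is that the puppet-state collector should raise on a bad sudo return, so the scheduler can disable it once instead of logging the same permission error on every interval. A hypothetical sketch of that fail-fast idea (this is not diamond's real Collector API; the class name and the state-file command are assumptions):

```python
import subprocess

class PuppetStateCollector:
    """Hypothetical diamond-style collector that fails fast: a denied
    sudo raises instead of being logged and silently retried forever."""

    # Assumed command for reading puppet's last-run summary via sudo.
    STATE_CMD = ["sudo", "-n", "cat",
                 "/var/lib/puppet/state/last_run_summary.yaml"]

    def collect(self) -> str:
        result = subprocess.run(self.STATE_CMD, capture_output=True, text=True)
        if result.returncode != 0:
            # Raising lets the scheduler disable this collector once,
            # rather than spamming the log on every collection interval.
            raise RuntimeError(
                f"sudo failed (rc={result.returncode}): {result.stderr.strip()}"
            )
        return result.stdout
```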
We at least have very different vhost names [15:40:36] <_joe_> mmmmh [15:40:55] <_joe_> bd808: we can arrange those differences in our new puppet-based config I guess [15:41:14] <_joe_> where do I find that modified apache config? [15:41:18] akosiaris: filing bug.. [15:41:21] Oh, right, the section headers won't add an edit link *anyway* because they're in the module output. [15:41:29] Bloody MediaWiki and its reasonableness. [15:41:34] <_joe_> this could be a chance to unify beta to prod even more [15:41:35] kart_: cool, thanks [15:41:46] _joe_: https://git.wikimedia.org/log/operations%2Fapache-config/refs%2Fheads%2Fbetacluster [15:41:49] PROBLEM - nutcracker port on mw1059 is CRITICAL: Connection refused [15:41:49] PROBLEM - nutcracker port on mw1065 is CRITICAL: Connection refused [15:41:50] PROBLEM - nutcracker port on mw1129 is CRITICAL: Connection refused [15:41:50] PROBLEM - nutcracker port on mw1046 is CRITICAL: Connection refused [15:41:50] PROBLEM - nutcracker port on mw1141 is CRITICAL: Connection refused [15:41:50] PROBLEM - nutcracker port on mw1170 is CRITICAL: Connection refused [15:41:59] PROBLEM - nutcracker port on mw1117 is CRITICAL: Connection refused [15:41:59] PROBLEM - nutcracker port on mw1123 is CRITICAL: Connection refused [15:41:59] PROBLEM - nutcracker port on mw1164 is CRITICAL: Connection refused [15:41:59] PROBLEM - nutcracker port on mw1153 is CRITICAL: Connection refused [15:42:00] PROBLEM - nutcracker port on mw1060 is CRITICAL: Connection refused [15:42:08] <_joe_> shit [15:42:09] marktraceur: :) [15:42:09] PROBLEM - nutcracker port on mw1042 is CRITICAL: Connection refused [15:42:09] PROBLEM - nutcracker port on mw1054 is CRITICAL: Connection refused [15:42:09] PROBLEM - nutcracker port on mw1011 is CRITICAL: Connection refused [15:42:09] PROBLEM - nutcracker port on mw1175 is CRITICAL: Connection refused [15:42:09] PROBLEM - nutcracker port on mw1118 is CRITICAL: Connection refused [15:42:10] PROBLEM - nutcracker 
port on mw1195 is CRITICAL: Connection refused [15:42:10] PROBLEM - nutcracker port on mw1217 is CRITICAL: Connection refused [15:42:11] PROBLEM - nutcracker port on mw1025 is CRITICAL: Connection refused [15:42:11] PROBLEM - nutcracker port on mw1006 is CRITICAL: Connection refused [15:42:12] PROBLEM - nutcracker port on mw1200 is CRITICAL: Connection refused [15:42:12] PROBLEM - nutcracker port on mw1160 is CRITICAL: Connection refused [15:42:13] PROBLEM - nutcracker port on mw1189 is CRITICAL: Connection refused [15:42:13] PROBLEM - nutcracker port on mw1208 is CRITICAL: Connection refused [15:42:19] PROBLEM - nutcracker port on mw1119 is CRITICAL: Connection refused [15:42:19] PROBLEM - nutcracker port on mw1176 is CRITICAL: Connection refused [15:42:19] PROBLEM - nutcracker port on mw1166 is CRITICAL: Connection refused [15:42:19] PROBLEM - nutcracker port on mw1213 is CRITICAL: Connection refused [15:42:19] PROBLEM - nutcracker port on mw1002 is CRITICAL: Connection refused [15:42:20] PROBLEM - nutcracker port on mw1012 is CRITICAL: Connection refused [15:42:20] PROBLEM - nutcracker port on mw1039 is CRITICAL: Connection refused [15:42:21] PROBLEM - nutcracker port on mw1061 is CRITICAL: Connection refused [15:42:21] PROBLEM - nutcracker port on mw1149 is CRITICAL: Connection refused [15:42:22] PROBLEM - nutcracker port on mw1106 is CRITICAL: Connection refused [15:42:22] PROBLEM - nutcracker port on mw1172 is CRITICAL: Connection refused [15:42:29] PROBLEM - nutcracker port on mw1007 is CRITICAL: Connection refused [15:42:29] PROBLEM - nutcracker port on mw1044 is CRITICAL: Connection refused [15:42:29] PROBLEM - nutcracker port on mw1092 is CRITICAL: Connection refused [15:42:29] PROBLEM - nutcracker port on mw1114 is CRITICAL: Connection refused [15:42:29] PROBLEM - nutcracker port on mw1126 is CRITICAL: Connection refused [15:42:30] PROBLEM - nutcracker port on mw1144 is CRITICAL: Connection refused [15:42:30] PROBLEM - nutcracker port on mw1026 is 
CRITICAL: Connection refused [15:42:31] PROBLEM - nutcracker port on mw1068 is CRITICAL: Connection refused [15:42:31] PROBLEM - nutcracker port on mw1055 is CRITICAL: Connection refused [15:42:32] PROBLEM - nutcracker port on mw1099 is CRITICAL: Connection refused [15:42:32] PROBLEM - nutcracker port on mw1140 is CRITICAL: Connection refused [15:42:33] PROBLEM - nutcracker port on mw1173 is CRITICAL: Connection refused [15:42:33] PROBLEM - nutcracker port on mw1205 is CRITICAL: Connection refused [15:42:35] <_joe_> what the hell is this? [15:42:36] that doesn't seem good. [15:42:39] PROBLEM - nutcracker port on mw1008 is CRITICAL: Connection refused [15:42:39] PROBLEM - nutcracker port on mw1003 is CRITICAL: Connection refused [15:42:39] PROBLEM - nutcracker port on mw1088 is CRITICAL: Connection refused [15:42:39] PROBLEM - nutcracker port on mw1045 is CRITICAL: Connection refused [15:42:39] PROBLEM - nutcracker port on mw1120 is CRITICAL: Connection refused [15:42:40] PROBLEM - nutcracker port on mw1162 is CRITICAL: Connection refused [15:42:40] PROBLEM - nutcracker port on mw1177 is CRITICAL: Connection refused [15:42:41] PROBLEM - nutcracker port on mw1082 is CRITICAL: Connection refused [15:42:41] PROBLEM - nutcracker port on mw1069 is CRITICAL: Connection refused [15:42:42] PROBLEM - nutcracker port on mw1100 is CRITICAL: Connection refused [15:42:42] PROBLEM - nutcracker port on mw1150 is CRITICAL: Connection refused [15:42:43] PROBLEM - nutcracker port on mw1211 is CRITICAL: Connection refused [15:42:43] PROBLEM - nutcracker port on mw1145 is CRITICAL: Connection refused [15:42:44] <_joe_> cscott: it's also bogus [15:42:49] PROBLEM - nutcracker port on mw1009 is CRITICAL: Connection refused [15:42:49] PROBLEM - nutcracker port on mw1041 is CRITICAL: Connection refused [15:42:49] PROBLEM - nutcracker port on mw1052 is CRITICAL: Connection refused [15:42:49] PROBLEM - nutcracker port on mw1174 is CRITICAL: Connection refused [15:42:49] PROBLEM - nutcracker 
port on mw1206 is CRITICAL: Connection refused [15:42:50] PROBLEM - nutcracker port on mw1187 is CRITICAL: Connection refused [15:42:50] spam [15:42:54] 16:42 Ignoring ALL from icinga-wm [15:43:00] akosiaris: there are few reasons for new version. I know we don't encourage, but want to know if we maintain packages at WMF repo with newer version? [15:43:09] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:09] RECOVERY - nutcracker port on mw1200 is OK: TCP OK - 0.002 second response time on port 11212 [15:43:19] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:19] RECOVERY - nutcracker port on mw1106 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:29] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:29] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:39] RECOVERY - nutcracker port on mw1045 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:39] <_joe_> sorry I can't understand why this is happening, let me check [15:43:39] RECOVERY - nutcracker port on mw1082 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:39] RECOVERY - nutcracker port on mw1145 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:47] <_joe_> nothing is a real problem anyway [15:43:49] RECOVERY - nutcracker port on mw1174 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:49] RECOVERY - nutcracker port on mw1041 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:49] RECOVERY - nutcracker port on mw1187 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:49] RECOVERY - nutcracker port on mw1059 is OK: TCP OK - 0.000 second response time on port 11212 [15:43:50] RECOVERY - nutcracker port on mw1141 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:09] RECOVERY - nutcracker port on mw1160 
is OK: TCP OK - 0.000 second response time on port 11212 [15:44:19] RECOVERY - nutcracker port on mw1176 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:29] RECOVERY - nutcracker port on mw1026 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:29] RECOVERY - nutcracker port on mw1068 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:29] RECOVERY - nutcracker port on mw1099 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:29] RECOVERY - nutcracker port on mw1173 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:29] RECOVERY - nutcracker port on mw1205 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:39] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:39] RECOVERY - nutcracker port on mw1003 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:39] RECOVERY - nutcracker port on mw1120 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:39] RECOVERY - nutcracker port on mw1100 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:49] RECOVERY - nutcracker port on mw1046 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:59] RECOVERY - nutcracker port on mw1164 is OK: TCP OK - 0.000 second response time on port 11212 [15:44:59] RECOVERY - nutcracker port on mw1153 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:00] RECOVERY - nutcracker port on mw1060 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:09] RECOVERY - nutcracker port on mw1217 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:09] RECOVERY - nutcracker port on mw1189 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:29] RECOVERY - nutcracker port on mw1092 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:39] RECOVERY - nutcracker port on mw1088 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:39] RECOVERY - nutcracker port on mw1069 is 
OK: TCP OK - 0.000 second response time on port 11212 [15:45:39] RECOVERY - nutcracker port on mw1150 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:49] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:49] RECOVERY - nutcracker port on mw1065 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:59] RECOVERY - nutcracker port on mw1117 is OK: TCP OK - 0.000 second response time on port 11212 [15:45:59] RECOVERY - nutcracker port on mw1123 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:09] RECOVERY - nutcracker port on mw1042 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:09] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:09] RECOVERY - nutcracker port on mw1175 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:09] RECOVERY - nutcracker port on mw1118 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:09] RECOVERY - nutcracker port on mw1025 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:13] (03PS2) 10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [15:46:19] RECOVERY - nutcracker port on mw1119 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:19] RECOVERY - nutcracker port on mw1213 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:19] RECOVERY - nutcracker port on mw1166 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:19] RECOVERY - nutcracker port on mw1039 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:19] RECOVERY - nutcracker port on mw1061 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:20] RECOVERY - nutcracker port on mw1172 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:28] | grep -v 11212 [15:46:29] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second 
response time on port 11212 [15:46:29] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:49] RECOVERY - nutcracker port on mw1052 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:49] RECOVERY - nutcracker port on mw1129 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:49] RECOVERY - nutcracker port on mw1170 is OK: TCP OK - 0.000 second response time on port 11212 [15:46:59] (03PS3) 10Ottomata: Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 [15:46:59] kart_: newer version is different than latest. We can have newer (or older for that matter) package versions indeed. But doing so should be obviously justified and not cause big maintenance costs. "latest" is pretty much out of the question [15:47:00] if something important happens in here, let me know, k? [15:47:09] RECOVERY - nutcracker port on mw1054 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:19] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:29] RECOVERY - nutcracker port on mw1044 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:29] RECOVERY - nutcracker port on mw1126 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:29] RECOVERY - nutcracker port on mw1055 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:39] RECOVERY - nutcracker port on mw1177 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:39] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [15:47:39] RECOVERY - nutcracker port on mw1211 is OK: TCP OK - 0.000 second response time on port 11212 [15:48:09] RECOVERY - nutcracker port on mw1195 is OK: TCP OK - 0.000 second response time on port 11212 [15:48:09] RECOVERY - nutcracker port on mw1208 is OK: TCP OK - 0.000 second response time on port 11212 [15:48:09] (03PS3) 
10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [15:48:13] <_joe_> grrrit-wm1: yes :) [15:48:18] <_joe_> ehrrr [15:48:19] RECOVERY - nutcracker port on mw1149 is OK: TCP OK - 0.000 second response time on port 11212 [15:48:49] RECOVERY - nutcracker port on mw1206 is OK: TCP OK - 0.000 second response time on port 11212 [15:49:08] (03PS1) 10Giuseppe Lavagetto: Nutcracker: hotfix to not make nutcracker restart with every puppet run [operations/puppet] - 10https://gerrit.wikimedia.org/r/144465 [15:49:46] !log Logstash event volume looks better after restart. Probably related to bug 63490. [15:49:49] <_joe_> the icinga spam will soon be over [15:49:51] Logged the message, Master [15:50:16] <_joe_> did I already say that I f*ckin hate erb templates? [15:50:27] akosiaris: right [15:50:59] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:51:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Nutcracker: hotfix to not make nutcracker restart with every puppet run [operations/puppet] - 10https://gerrit.wikimedia.org/r/144465 (owner: 10Giuseppe Lavagetto) [15:51:27] speaking of which _joe_ I didn't get any list of templates ... [15:51:50] <_joe_> matanya: eh, I know [15:51:57] <_joe_> it's on my todo list [15:52:00] <_joe_> I swear [15:52:01] :) [15:52:09] <_joe_> I'm letting a lot of people down sorry [15:52:19] no worries, i come to help you, not disturb [15:52:42] <_joe_> btw, puppet tagged runs are awesome [15:52:49] <_joe_> matanya: you don't disturb me [15:52:57] <_joe_> I'm letting you wait [15:53:12] I guess i can find them on my own, if i knew what you are looking to fix [15:53:31] heya godog, yt? [15:53:36] ottomata: yyp [15:53:48] so, ^d has asked me to forward-port elasticsearch packages to trusty [15:53:56] and I have no idea how to do that properly ! 
:) googling is not helping much [15:54:10] i see that we have a 'trusty-wikimedia' distribution defined in our reprepro distributions file [15:54:31] we track upstreams for elasticsearch (and cloudera for that matter..i'm interested in forward porting that too) [15:54:45] precise-wikimedia says Update: ceph jenkins cloudera hwraid cassandra elasticsearch logstash hhvm [15:54:54] trusty-wikimedia just says Update: hwraid [15:55:01] can I add elasticsearch (and cloudera there)? [15:55:14] cmjohnson1: I've made sde1 unmountable on ms-be1007, the fs clearly isn't in a good state anyway and puppet will try to remount it [15:55:18] ottomata: yes, but make sure cloudera and elasticsearch have repos for trusty [15:55:23] they don't afaik [15:55:41] do they need something in the deb-overrides file then? [15:55:43] godog: ok..i can try and replace the disk if that doesn't work [15:55:47] or do I need to copy them between dists? [15:55:52] <_joe_> godog: you can unignore icinga-wm [15:55:53] <_joe_> :) [15:55:59] cloudera: [15:55:59] http://archive-primary.cloudera.com/cdh5/ubuntu/ [15:55:59] haha thanks _joe_ [15:56:13] then you can not use the updates mechanism [15:56:30] hmmm [15:56:36] let me check that [15:56:39] elastic is harder for me to tell, but i'm pretty sure it doesn't [15:56:44] cmjohnson1: if you could that'd be great, I probably did something wrong when trying to get the machine going after POST at boot and now the controller thinks it is good [15:57:07] ah..okay..i will replace it then [15:57:36] cmjohnson1: thanks!
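The 'nutcracker port' alerts that flood the earlier part of this log are icinga TCP probes against nutcracker's port 11212; a rough sketch of what such a probe boils down to (simplified — the real check_tcp plugin also reports response time and has its own thresholds):

```python
import socket

def check_tcp(host, port, timeout=10.0):
    """Rough sketch of an icinga-style TCP check, as behind the
    'nutcracker port' alerts: CRITICAL when the connection is refused
    or times out, OK otherwise. Simplified; the real check_tcp plugin
    also measures and reports the response time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 'TCP OK'
    except OSError:
        return 'CRITICAL: Connection refused or timed out'
```

Something like `check_tcp('mw1059', 11212)` would then distinguish the PROBLEM and RECOVERY states seen above.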
[15:59:42] (03CR) 10Rush: "I requested that Yuvi use sudo -l to check if the command is available and then throw a diamond error if not, this stops the sudo spam for" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:00:04] aude: The time is nigh to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140707T1600) [16:03:12] * aude here [16:04:51] (03PS4) 10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [16:05:21] (03PS5) 10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [16:05:25] ottomata: elasticsearch never packaged for ubuntu anyway. We used the debian packages. It will probably work if you just add it in the update line in distributions. Cloudera is a different story though [16:06:20] cloudera seems to be targeting precise and lucid specifically. It could work or it could not work and it could be messy [16:06:35] yep it doesn't seem particularly tied to a distribution, it ships all the jars anyway (ES packages) [16:06:43] ottomata: maybe we can wait for them to build for trusty ? [16:06:55] godog: thanks for confirming that :-) [16:07:24] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:07:26] akosiaris: i don't need cloudera trusty now, just would be nice since vagrant is going trusty [16:07:39] since i'm reinstalling production now, i was thinking about going trusty [16:07:44] but sounds risky, so i think i'll wait [16:07:47] but for elastic, ok! [16:08:00] so, i just add it to the updates line in distributions file? [16:08:09] ottomata: yup [16:08:15] ok, let's try it!
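The change being tried here (r144468) amounts to one word in reprepro's conf/distributions. A sketch of roughly what such a stanza looks like — the field values below are assumptions for illustration, not the actual file; only the Suite/Codename and the before/after Update: line come from the log:

```
Origin: Wikimedia
Label: Wikimedia
Suite: trusty-wikimedia
Codename: trusty-wikimedia
Architectures: source amd64 i386
Components: main
Update: hwraid elasticsearch
```

With that in place, `reprepro update trusty-wikimedia` pulls packages via the named update rules, assuming the `elasticsearch` rule in conf/updates isn't pinned to a suite that lacks trusty packages — which is exactly the caveat godog and akosiaris raise for cloudera.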
[16:08:19] (03PS6) 10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [16:08:20] akosiaris: np, really seems just a container and nothing else [16:08:57] cloudera packages are not so simple though. In fact they are unbuildable unless you know exactly what you are doing [16:08:59] (03PS1) 10Ottomata: Add elasticsearch to trusty updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/144468 [16:09:06] akosiaris: buildable? [16:09:08] we aren't building, right? [16:09:14] we're just importing? [16:09:18] so I am not surprised that they have not released trusty support yet [16:09:23] ah, ha [16:09:24] heheh [16:09:29] ottomata: no, I was just making a point :-) [16:09:30] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:09:33] aye [16:09:43] (03PS2) 10Ottomata: Add elasticsearch to trusty updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/144468 [16:09:45] (03CR) 10Rush: Revert "Revert "diamond: Let diamond read the puppet state file"" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:09:57] (03PS7) 10Rush: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:09:59] (03CR) 10Ottomata: [C: 032 V: 032] Add elasticsearch to trusty updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/144468 (owner: 10Ottomata) [16:10:03] chasemp: don't merge yet [16:10:05] chasemp: am testing [16:10:31] YuviPanda: I think we need to return if summary is None now [16:10:44] in collect method [16:10:48] chasemp: ah, right.
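The two review notes on r144463 — probe `sudo -l` before running the command to stop the sudo log spam, and return early from collect() when the summary is None — could combine into something like the following. This is a hypothetical sketch, not the actual collector in operations/puppet; the class name, method names, and state-file path are assumptions:

```python
import subprocess

class PuppetStateCollector:
    """Hypothetical Diamond-style collector for the puppet state file.
    self.published stands in for Diamond's real publish() method."""

    STATE_CMD = ['cat', '/var/lib/puppet/state/last_run_summary.yaml']

    def __init__(self):
        self.published = {}

    def can_sudo(self):
        # 'sudo -n -l <cmd>' exits non-zero when the command is not
        # permitted, letting us bail out instead of spamming auth logs.
        try:
            return subprocess.call(
                ['sudo', '-n', '-l'] + self.STATE_CMD,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL) == 0
        except OSError:
            return False

    def get_summary(self):
        # The real collector would run STATE_CMD via sudo and parse the
        # YAML; stubbed out here so the guard below can be exercised.
        if not self.can_sudo():
            return None
        return None

    def collect(self):
        summary = self.get_summary()
        if summary is None:
            # chasemp's comment from the log: return instead of
            # crashing when the state file could not be read.
            return
        for name, value in summary.get('resources', {}).items():
            self.published['puppet.resources.%s' % name] = value
```

A subclass whose get_summary() returns a parsed dict would then publish one metric per resource counter, while a host where sudo denies the command publishes nothing at all.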
[16:11:27] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:11:48] !log aude Started scap: Update Wikidata to mw1.24-wmf12 branch for group0 wikis [16:11:53] Logged the message, Master [16:12:08] (03PS8) 10Yuvipanda: Revert "Revert "diamond: Let diamond read the puppet state file"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [16:13:00] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 1 failures [16:15:45] ^d, i think it's working, i'm booting up a new trusty vagrant box to check [16:15:50] but if you have one you can apt-get update and see [16:16:10] <^d> Trying [16:17:02] marktraceur, currently the deployment calendar assumes the list is unordered and it sorts it itself [16:17:16] if we broke out each day into an independent table we could do section editing I suspect [16:17:26] Yeah that's probably what it would take [16:17:29] We order it manually anyway [16:17:46] <^d> ottomata: Worked \o/ [16:18:03] <^d> ty ty ty! [16:18:55] (03PS9) 10Yuvipanda: diamond: Let diamond read the puppet state file (take 2) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 [16:18:56] chasemp: ^ done! [16:19:42] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:20:32] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 531 seconds ago with 0 failures [16:20:42] ^d: oh! I can start working again? [16:20:59] (03CR) 10Rush: [C: 032 V: 032] diamond: Let diamond read the puppet state file (take 2) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144463 (owner: 10Yuvipanda) [16:22:39] !log (Cirrus) load tested commons and eswiki over the last hour - both look fine. [16:22:42] ^d: ^^ [16:22:45] Logged the message, Master [16:25:23] awesome, glad it worked ^d!
[16:25:40] (03PS3) 10Gage: Feed logs from ssl terminators again into webstatscollector's filter [operations/puppet] - 10https://gerrit.wikimedia.org/r/143775 (https://bugzilla.wikimedia.org/67456) (owner: 10QChris) [16:27:20] (03CR) 10Gage: [C: 032] "Discussed on IRC" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143775 (https://bugzilla.wikimedia.org/67456) (owner: 10QChris) [16:27:42] jgage: when you apply that, make sure to keep an eye on oxygen's packet loss and cpu load stats [16:27:47] Thanks jgage. [16:27:48] those machines are notoriously annoying [16:28:31] s/machines/udp2log/ [16:29:23] jgage: when you get a sec, would much appreciate a review of this too: https://gerrit.wikimedia.org/r/#/c/144242/ [16:29:24] not urgent though [16:29:34] need to get machines all back up before we apply anyway! [16:30:14] (03PS3) 10Andrew Bogott: Modify nova role to better support labs uses. [operations/puppet] - 10https://gerrit.wikimedia.org/r/141836 [16:30:22] <^d> manybubbles: Awesome, and yes you can :) [16:30:41] bleh, oxygen has been up for >1yr and has 11 defunct processes [16:31:17] sure ottomata, as soon as i finish this merge for qchris [16:33:01] yeah, no hurries on that at all jgage [16:33:12] i'm heading to DC after ops meeting today [16:33:17] and then will be at DC with chris tomorrow [16:33:23] so won't get to it til wed or thurs anyway [16:33:59] ok.
I'm moving ULSFO on weds so i'll be a bit preoccupied with that beforehand anyway [16:34:21] !log aude Finished scap: Update Wikidata to mw1.24-wmf12 branch for group0 wikis (duration: 22m 33s) [16:34:27] Logged the message, Master [16:34:51] yay [16:36:54] * aude done, all looks good [16:41:35] (03PS8) 10BBlack: Make GeoIP lookup code safer [operations/puppet] - 10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [16:42:00] (03CR) 10BBlack: [C: 032 V: 032] Make GeoIP lookup code safer [operations/puppet] - 10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [16:46:02] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 2 failures [16:46:02] PROBLEM - puppet last run on cp1039 is CRITICAL: CRITICAL: Puppet has 1 failures [16:46:22] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet has 1 failures [16:46:52] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: Puppet has 1 failures [16:47:12] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 2 failures [16:47:20] yeah that's me :P [16:47:32] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [16:47:37] (03PS1) 10BBlack: Varnish GeoIP: trivial post-merge fixup for 5daa89ee [operations/puppet] - 10https://gerrit.wikimedia.org/r/144474 [16:47:49] (03CR) 10BBlack: [C: 032 V: 032] Varnish GeoIP: trivial post-merge fixup for 5daa89ee [operations/puppet] - 10https://gerrit.wikimedia.org/r/144474 (owner: 10BBlack) [16:48:22] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:12] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:12] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:12] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [16:49:27] (03PS1) 10Jforrester: Enable VisualEditor by default 
on Portuguese Wikiversity [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144476 (https://bugzilla.wikimedia.org/67582) [16:49:42] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [16:50:12] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [16:50:12] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures [16:50:22] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures [16:50:32] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [16:51:12] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:51:12] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet has 1 failures [16:51:12] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:51:29] akosiaris: fixed issue by NMU :) [16:51:42] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures [16:52:02] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 2 failures [16:52:12] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Puppet has 1 failures [16:52:12] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [16:52:52] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [16:53:12] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [16:53:12] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures [16:53:12] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet has 1 failures [16:54:56] grabbing some lunch, back on for ops meeting.. 
[16:56:02] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:57:02] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:57:12] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:58:12] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:00:04] manybubbles, ^d: The time is nigh to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140707T1700) [17:00:22] <^d> manybubbles: I'm ready. You? [17:00:33] ^d: sure - you wanna sync today? [17:00:40] <^d> Yep, already in the right places. [17:00:52] (03CR) 10Chad: [C: 032] Move commons over to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140752 (owner: 10Chad) [17:00:59] (03Merged) 10jenkins-bot: Move commons over to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140752 (owner: 10Chad) [17:01:40] (03PS3) 10Milimetric: Add settings to throttle scheduled runs [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/144154 [17:02:55] !log demon Synchronized wmf-config/InitialiseSettings.php: Cirrus on commons as primary (duration: 00m 04s) [17:02:59] Logged the message, Master [17:04:45] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:05:05] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:05:25] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:06:25] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:06:26] RECOVERY - puppet last run on amssq35 is OK: OK: 
Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:06:45] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:07:05] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:07:15] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:07:25] ori: ping [17:07:26] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:07:35] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:07:38] bblack: hey [17:07:45] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:07:55] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:07:55] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:08:08] ori: do you remember this? https://gerrit.wikimedia.org/r/#/c/130256/ [17:08:25] ^d: looking good. I say we call that one a non-event unless someone freaks out [17:08:32] bblack: yep [17:08:46] <^d> manybubbles: Yeah, all logs look sane, ganglia's happy. [17:08:48] did i break something? [17:08:51] <^d> I left notice on the commons VP. [17:08:58] ^d: thanks! [17:08:58] mobile varnishes aren't actually doing geoip lookup, in spite of that change (because enable_geoiplookup is checked inside role-specific templates, and the mobile role doesn't have it, and even if you copy/pasted from the text-role template, the cookie stuff would differ a little I think?) [17:09:39] bblack: hm. it's been a while. let me investigate; just a moment.
[17:09:45] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:09:45] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:09:45] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:09:55] templates/varnish/text-frontend.inc.vcl.erb has a call to geoip_cookie, but mobile-frontend.inc.vcl.erb doesn't, in other words [17:10:15] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:10:43] (03PS2) 10BBlack: Handle ZERO's new carrier ip subnets [operations/puppet] - 10https://gerrit.wikimedia.org/r/144131 (owner: 10Yurik) [17:10:45] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:11:05] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:11:25] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:11:42] bblack: yes, you're right [17:11:49] bblack: would you like me to submit a patch? [17:12:14] sure. maybe for now just copy from text-frontend, although maybe the orig-cookie bit isn't the same? [17:12:36] it's making me think we should work on refactoring and unifying our vcl templates more [17:12:41] but that's a whole other matter [17:13:03] a vmod is probably in order [17:13:09] * bblack stabs ori [17:13:17] ;) [17:16:44] bblack: why would the orig-cookie not be the same? [17:18:14] the req.http.orig-cookie code is specific to text-frontend/text-common, and mobile-frontend has something with req.http.X-Orig-Cookie [17:18:24] ori: ^ [17:19:32] wow, am i reading this correctly? do we not cache cookied reqs on mobile?
[17:19:59] mobile-backend.inc.vcl.erb L26-29 [17:20:25] (03CR) 10BBlack: [C: 04-1] "I don't think this behaves as intended. The regsub for creating X-ZEROSFX is always going to return a non-empty string for a non-empty X-" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144131 (owner: 10Yurik) [17:21:40] oh no, i see how this works [17:21:42] ok, got it [17:22:20] bblack, not sure what you mean [17:23:18] yurikR: if X-CS is just "500-05", X_ZEROSFX will be "500-05", not the empty string. regsub's regex won't match anything, so it just doesn't replace anything in the string. [17:23:31] bblack, are you sure? [17:23:36] bblack, http://www.regexr.com/ [17:24:01] it's not about the regex itself (which I haven't checked in depth anyways), it's about the regsub function. [17:24:34] regsub does a substitution on the input and then returns the modified input. strings that don't match your regsub regex will simply be returned unmodified. [17:24:51] bblack, but the regex will always match [17:26:26] yurikR: ok, I see what you mean now. not very intuitive, but it does work :) [17:27:01] bblack, you know what they say - if you try to solve a problem with a regex, you have two problems :) [17:27:30] I got 99 problems and regexes are at least 93 of them. [17:28:27] (03CR) 10BBlack: [C: 032] "Nevermind that, I misunderstood what the regex was doing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144131 (owner: 10Yurik) [17:29:16] (03PS1) 10Ori.livneh: Set GeoIP cookie in mobile-frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/144485 [17:29:57] ^ bblack [17:32:56] (03PS2) 10BBlack: Set GeoIP cookie in mobile-frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/144485 (owner: 10Ori.livneh) [17:34:38] (03CR) 10BBlack: [C: 032] Set GeoIP cookie in mobile-frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/144485 (owner: 10Ori.livneh) [17:36:28] _joe_: oh wow i see you already performed the nutcracker migration? 
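The point bblack concedes above is that VCL's regsub() returns its input unchanged when the pattern doesn't match — it never yields an empty string for non-empty input. Python's re.sub has the same semantics, so the behaviour can be sketched there (the regex below is illustrative; the actual X-CS pattern from the ZERO patch isn't quoted in the log):

```python
import re

def regsub(value, pattern, replacement):
    """Mimics VCL regsub(): substitute where the pattern matches,
    otherwise hand the input back untouched."""
    return re.sub(pattern, replacement, value)

# No match: "500-05" comes back as-is, not as the empty string.
assert regsub("500-05", r"-banner$", "") == "500-05"
# Match: the suffix is stripped.
assert regsub("500-05-banner", r"-banner$", "") == "500-05"
```

This is why comparing the regsub() result against the original header value works as a "did the pattern match?" test, which is what the X-ZEROSFX logic relies on.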
[17:36:38] or half of it, at least [17:37:55] (03CR) 10Krinkle: [C: 031] Add OBSOLETE [operations/debs/testswarm] - 10https://gerrit.wikimedia.org/r/143635 (owner: 10Krinkle) [17:39:24] (03CR) 10Krinkle: "I don't know if Ops wants to preserve the repo, but this is obsolete. This was a collection of scripts to assist in TestSwarm (not TestSwa" [operations/debs/testswarm] - 10https://gerrit.wikimedia.org/r/143635 (owner: 10Krinkle) [17:39:42] ^d: If you're up for nuking another repo, this one's ripe. [17:40:03] <^d> testswarm? [17:40:06] Yep [17:41:47] <^d> No route to host? [17:41:48] <^d> wtf. [17:42:08] <^d> Oh, port typo. [17:42:08] <^d> I hate that port. [17:42:36] <^d> Krinkle: Done [17:42:49] thx [17:43:01] <^d> yw [17:43:40] greg-g: if no one is doing anything right now, can I take care of https://bugzilla.wikimedia.org/show_bug.cgi?id=67548 ? should take about 5 min [17:45:44] legoktm: yep [17:51:29] (03CR) 10BBlack: [C: 031] "For the record, I also think we should pull defaulting, barring anyone coming up with a security argument against it (and if there is one," [operations/puppet] - 10https://gerrit.wikimedia.org/r/143940 (owner: 10Ori.livneh) [17:52:28] (03CR) 10BBlack: "s/IETF/ASF/ in the comment above!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143940 (owner: 10Ori.livneh) [17:52:50] !log deleted rows in centralauth's localnames and localuser tables for bug 67548 [17:52:55] Logged the message, Master [17:53:42] greg-g: finished [17:54:35] legoktm: there's more [17:54:45] wat [17:54:45] Tim eg. 
added foundationwiki to his account :P [17:54:53] I deleted those too [17:54:55] ah [17:55:48] legoktm: ty sir [18:08:24] kart_: yippi :-) [18:28:53] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Complete puppet failure [18:32:26] akosiaris: you're right, it's flapping :) [18:38:53] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [18:39:06] andrewbogott: so the flurry of alerts seems to be puppetmaster being overwhelmed. I think I am gonna revert the 20 mins puppet change for now to lessen the load, schedule upgrading them to trusty since ruby 1.9/2.X is expected to provide some quite cool performance improvements. I tried reproducing the race and have failed up to now [18:39:41] akosiaris: sounds good, I'll keep an eye out for alerts. [18:41:35] (03CR) 10Matanya: "Toolserver is now gone. Can this be merged please?" [operations/dns] - 10https://gerrit.wikimedia.org/r/118480 (owner: 10Matanya) [18:44:23] hashar: gallium is Precise ? [18:46:00] matanya: yes [18:46:05] thanks akosiaris [18:48:29] matanya: yes sir [18:48:45] matanya: it used to be Lucid and got upgraded via some magic command [18:48:52] hashar: i would like to break ci a bit [18:48:59] is this ok with you :) [18:49:02] ? [18:52:03] hashar: i guess you want details ? [18:54:36] matanya: yes :D [18:54:54] matanya: I am on vacations next week and still have a lot to do so I am unlikely to be very helpful :D [18:55:16] so, basically, you have firewalls configured on role level and not on node level as it should be [18:55:37] i would to move it over, but don't really want to break every thing [18:55:46] and hey, enjoy your vacation! [18:56:11] matanya: I have some firewall rules set on labs instances iirc [18:56:25] right [18:56:45] i.e. 
class role::ci::slave::labs::common which is applied on labs and calls contint::firewall::labs [18:56:56] yes, i care less for that [18:56:57] something similar happens for beta iirc [18:57:14] i'm looking more at modules/contint/manifests/firewall.pp [18:57:20] ah [18:58:28] well it is applied on gallium in site.pp [18:58:41] and in roles being used on labs [18:58:41] moreover, can't you allow access to labs on node level? all you need to do is: ferm rule site.pp for gallium [19:00:52] ideally I would move the rules in the modules next to the software that need them [19:01:06] modules/contint/manifests/firewall.pp has a bunch of rules which are for Zuul [19:01:22] hashar: i'll refactor that a bit and you can -/+ it as you see fits, fair? [19:01:24] ideally we would have them in the modules/zuul maybe [19:01:35] sounds good [19:01:41] but don't invest too much time in it :-] [19:01:42] that is very rough [19:01:57] no, the rules goes on roles, and modules should be re-useable [19:02:03] ah [19:02:24] well think about contint module has my subset of roles :-D But yeah I agree [19:02:37] if you had time, i could explain reasoning [19:02:39] having the ferm rules on role will be nice. [19:03:49] matanya: for history purposes, the rules used to be iptables in misc::contint::test [19:04:03] and I moved everything form misc::contint under the contint module [19:04:12] then later on we started the roles (iirc) [19:04:21] so yeah more cleanup needed. Your change will be welcome [19:04:27] :) [19:08:45] hashar: one more clarification : web access you allow in firewall.pp is for zuul or jenkins or something else ? [19:11:45] matanya: gallium has apache which serves http://integration.wikimedia.org/ which behind the misc varnish [19:11:58] matanya: so some role that get the contint website installed needs port 80 [19:12:38] so why did you open Jenkins on port 8080 on 127.0.0.1 ? 
[19:12:50] matanya: apache on gallium act as a proxy for the jenkins daemon which listens on port 8080 and for the Zuul daemon which is an embedded web service on port 8001 [19:12:57] that seems redundant [19:13:35] the chain is roughly: https://integration.wikimedia.org/zuul/status.json -> misc varnish ---- 8001 ---> gallium ---> Zuul daemon [19:13:38] you can just put it all behind misc-varnish, no ? [19:13:58] and for jenkins: http://integration.wikimedia.org/ci/ --> misc varnish ---- 8080 --> gallium --> Jenkins [19:14:00] oh no [19:14:04] there is apache in between grr [19:14:21] the chain is roughly: https://integration.wikimedia.org/zuul/status.json -> misc varnish ---- 80 ---> gallium ---> apache proxy --- 8001 ---> Zuul daemon [19:14:59] ok, now i get it [19:15:04] i'll amend [19:15:08] sorry that is a bit of a mess :/ [19:15:59] the rules git-daemon_internal and ytterbium_ssh are for the Zuul part that handles merges of Gerrit patches. Will have to move it to another role eventually [19:16:14] I have some patches pending for Zuul roles [19:17:15] ok, i'll create a first round of patches, and we will move forward if it looks sane [19:54:58] (03PS1) 10Matanya: ci firewall: move ferm rules to role level and firewall to node level [operations/puppet] - 10https://gerrit.wikimedia.org/r/144503 [19:55:04] hashar: ^ :) [19:59:00] (03PS2) 10Matanya: ci firewall: move ferm rules to role level and firewall to node level [operations/puppet] - 10https://gerrit.wikimedia.org/r/144503 [20:00:04] gwicke, subbu, cscott: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140707T2000) [20:03:44] !log deployed Parsoid 8ef7b6fe [20:03:48] Logged the message, Master [20:06:48] PROBLEM - check configured eth on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:06:49] PROBLEM - MySQL disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:06:49] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:06:49] PROBLEM - puppet last run on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:06:49] PROBLEM - RAID on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:06:58] PROBLEM - mysqld processes on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:07:48] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [20:07:48] RECOVERY - MySQL Recent Restart on es1001 is OK: OK 17821357 seconds since restart [20:07:48] RECOVERY - check configured eth on es1001 is OK: NRPE: Unable to read output [20:07:48] RECOVERY - puppet last run on es1001 is OK: OK: Puppet is currently enabled, last run 264 seconds ago with 0 failures [20:07:48] RECOVERY - RAID on es1001 is OK: OK: optimal, 1 logical, 2 physical [20:07:49] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [20:10:48] PROBLEM - check configured eth on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:48] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:48] PROBLEM - MySQL disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:48] PROBLEM - check if dhclient is running on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:48] PROBLEM - puppet last run on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:49] PROBLEM - RAID on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:11:38] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [20:11:38] RECOVERY - MySQL Recent Restart on es1001 is OK: OK 17821590 seconds since restart [20:11:38] RECOVERY - check if dhclient is running on es1001 is OK: PROCS OK: 0 processes with command name dhclient [20:11:39] RECOVERY - check configured eth on es1001 is OK: NRPE: Unable to read output [20:11:39] RECOVERY - puppet last run on es1001 is OK: OK: Puppet is currently enabled, last run 499 seconds ago with 0 failures [20:11:39] RECOVERY - RAID on es1001 is OK: OK: optimal, 1 logical, 2 physical [20:12:17] (03PS1) 10Rush: legalpad fix outbound email [operations/puppet] - 10https://gerrit.wikimedia.org/r/144508 [20:14:14] (03CR) 10Rush: [C: 032 V: 032] "make puppet match reality" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144508 (owner: 10Rush) [20:17:22] chasemp: hmm, diamond module needs a little more work, I think. changing a collector doesn't seem to restart diamond, for some reason. [20:17:34] (I wasn't getting metrics from most machines until I restarted diamond there) [20:17:48] PROBLEM - check configured eth on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:17:48] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:17:49] PROBLEM - MySQL disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:17:49] PROBLEM - puppet last run on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:17:49] PROBLEM - RAID on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:17:49] YuviPanda: you could do that or there is a default "read the collector config again" timer [20:17:56] as restarting diamond is invasive to collection [20:17:57] chasemp: oh? there is? [20:18:02] chasemp: yeah, it is. 
[20:18:22] I'm always torn on which way to go [20:18:33] I don't like the non-deterministic nature of "it'll catch up eventually" [20:18:36] but yeah [20:18:38] RECOVERY - MySQL Recent Restart on es1001 is OK: OK 17822010 seconds since restart [20:18:38] RECOVERY - check configured eth on es1001 is OK: NRPE: Unable to read output [20:18:38] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [20:18:39] RECOVERY - puppet last run on es1001 is OK: OK: Puppet is currently enabled, last run 919 seconds ago with 0 failures [20:18:39] RECOVERY - RAID on es1001 is OK: OK: optimal, 1 logical, 2 physical [20:18:52] chasemp: ah, hmm. it's set to 3600 [20:18:52] collectors_reload_interval = 3600 [20:19:02] assuming that's seconds, that's an hour [20:19:06] seems excessive :) [20:19:20] heh [20:19:29] adding new collectors tho is kind of a novelty don't want it working too hard [20:19:32] idk, 15 minutes? [20:19:40] otherwise just restart if you need it sooner [20:19:46] 15mins sounds good to me :) [20:19:48] shall I submit a patch? [20:19:52] sure [20:23:39] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [20:28:09] (03PS1) 10Ori.livneh: Make mw1041 connect to nutcracker on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144574 [20:28:11] (03PS1) 10Ori.livneh: Make app servers connect to nutcracker on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144575 [20:28:20] _joe_: ^^ [20:29:01] <_joe_> ori: if you're here to monitor the situation, we may go live with the first one [20:29:17] <_joe_> and I'll see the results tomorrow morning and go on with the second one [20:29:36] (03CR) 10Ori.livneh: [C: 032] "<_joe_> ori: if you're here to monitor the situation, we may go live with the first one" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144574 (owner: 10Ori.livneh) [20:29:43] (03CR) 10Awight: [C: 04-2] "Waiting on extension deployment." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141607 (owner: 10Awight) [20:29:45] (03Merged) 10jenkins-bot: Make mw1041 connect to nutcracker on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144574 (owner: 10Ori.livneh) [20:29:49] (03PS1) 10Yuvipanda: diamond: Make diamond reload collectors every 15min [operations/puppet] - 10https://gerrit.wikimedia.org/r/144576 [20:29:49] <_joe_> oh it's mediawiki-config :) [20:30:08] <_joe_> so I can't deploy it whenever I want right? [20:30:13] chasemp: https://gerrit.wikimedia.org/r/#/c/144576/ patched :) [20:31:11] !log ori Synchronized wmf-config/mc.php: Iea24b092b: Make mw1041 connect to nutcracker on port 11212 (duration: 00m 09s) [20:31:16] Logged the message, Master [20:31:47] (03CR) 10Rush: [C: 032] diamond: Make diamond reload collectors every 15min [operations/puppet] - 10https://gerrit.wikimedia.org/r/144576 (owner: 10Yuvipanda) [20:31:59] chasemp: ty [20:32:39] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [20:33:11] chasemp: btw, if you remember our discussion last week about http://cabotapp.com/ - I looked at their code, and it seems a big, ugh. Very, uh, web developer style. Config is stored as django models, and even that's not properly normalized. That + the lack of the ability to store config in git easily means it's mostly out :( I'll look out for other solutions [20:33:29] yeah sounds like a no go then [20:33:41] did you ever look at Kale? [20:33:45] chasemp: no [20:33:59] http://codeascraft.com/2013/06/11/introducing-kale/ [20:34:12] yeah it's a few tools then https://github.com/etsy/oculus [20:34:18] https://github.com/etsy/skyline [20:34:19] oh, you pointed it to me the previous time. [20:34:24] all part of the etsy suite [20:34:28] but no idea if it's any good [20:35:27] chasemp: yeah, I'll take a look. [20:35:48] chasemp: backup solution is icinga + check_graphite, but somehow that feels a bit overly complex. 
I haven't fully checked that yet. [20:37:45] chasemp: looking at skyline now [21:08:10] chasemp: checked ou Tessera that you linked to earlier. while quite nice, it also tries to let users create dashboards from the UI, which presents the twin problems of auth + no version control [21:09:12] bleh, could we just neuter that ability and use standard auth? [21:10:01] PROBLEM - puppet disabled on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:01] PROBLEM - mysqld processes on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:02] PROBLEM - Disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:30] chasemp: so way to do it would be to have apache or something put auth on just the edit URLs, but from what I can see that's not easily isolatable either. AJAX and all. [21:10:50] yeah it's mostly client side [21:10:51] RECOVERY - puppet disabled on es1001 is OK: OK [21:10:52] RECOVERY - Disk space on es1001 is OK: DISK OK [21:10:52] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [21:10:59] alright well, I love the ui anyway [21:12:11] chasemp: :D I guess not many have our needs of publicly accessible but privately editable [21:12:37] it has two things going for it tho, it's really nicely done and it's not ruby [21:12:40] ;) [21:12:46] chasemp: :D indeed! [21:13:03] 'installation is super simple, this is just a heroku app!' [21:21:12] hey, ops folks: isn't the role::mediawiki-install::labs puppet class supposed to keep my instance running a relatively-recent version of mediawiki? [21:21:33] cscott: #wikimedia-labs would probably know [21:22:42] cscott: you'll have to run git pull every so often [21:22:59] labs-vagrant is also a thing you can use instead. 
makes extensions, etc easy as well [21:23:35] (03PS1) 10Rush: auth login message [operations/puppet] - 10https://gerrit.wikimedia.org/r/144581 [21:24:56] (03PS2) 10Rush: phabricator login screen message [operations/puppet] - 10https://gerrit.wikimedia.org/r/144581 [21:25:07] (03CR) 10Rush: [C: 032 V: 032] phabricator login screen message [operations/puppet] - 10https://gerrit.wikimedia.org/r/144581 (owner: 10Rush) [21:25:54] PROBLEM - check configured eth on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:04] PROBLEM - puppet disabled on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:04] PROBLEM - mysqld processes on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:04] PROBLEM - Disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:39] (03PS4) 10Krinkle: Disable $wgLegacyJavaScriptGlobals for test2wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: 10Withoutaname) [21:26:44] RECOVERY - check configured eth on es1001 is OK: NRPE: Unable to read output [21:26:54] RECOVERY - puppet disabled on es1001 is OK: OK [21:26:54] RECOVERY - Disk space on es1001 is OK: DISK OK [21:26:54] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [21:26:59] (03CR) 10Krinkle: [C: 031] "Ready to be scheduled and deployed as far as I'm concerned. 
::ship-it::" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: 10Withoutaname) [21:29:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [21:40:48] (03CR) 10Mwalker: [C: 031] filter changes to support messages from Hadoop [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage) [21:43:02] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:55:51] YuviPanda|zz: does your NovaProxy screw with cookies? i can't seem to create a new wiki account on togetherjs.wmflabs.org [21:56:00] cscott: yeah, pretty sure it does [21:56:05] it gives me "The user account was not created, as we could not confirm its source. Ensure you have cookies enabled, reload this page and try again." [21:56:18] I can login easily to http://sugarfrosties.wmflabs.org/ [21:56:33] i'm trying to *create* an account, though. [21:56:42] cscott: I just did that at sugarfrosties, works fine [21:57:01] why do wmf servers hate me? [21:57:13] too awesome for them, I suppose [21:57:33] cscott: some other server might be stripping cookies... [21:57:34] YuviPanda|zz: you don't seem to have QuestyCaptcha installed on sugarfrosties [21:57:48] cscott: oh? I don't think I do [21:58:19] the puppet mediawiki install role adds "require_once( "$IP/extensions/Nuke/Nuke.php" ); [21:58:19] require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" ); [21:58:19] require_once( "$IP/extensions/ConfirmEdit/QuestyCaptcha.php" ); [21:58:19] " in the part of the LocalSettings.php which I can't modify [21:58:47] hmm, maybe that is a problem? [21:59:08] cscott: I've never used that role. labs_vagrant instead is what I've had [22:01:00] * cscott grumbles [22:01:16] (03PS1) 10Nemo bis: Amend last commonsuploads additions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144590 [22:01:45] hm. 
i turned off the captcha, but i'm still getting the "could not confirm its source" error [22:01:55] i need to give up for today. :( [22:02:12] Reedy: https://gerrit.wikimedia.org/r/#/c/144590/ [22:04:21] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:05:11] RECOVERY - MySQL Recent Restart on es1001 is OK: OK 17828407 seconds since restart [22:30:40] chasemp: around? [22:31:49] (03PS1) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [22:32:34] (03CR) 10Ori.livneh: [C: 032] Make app servers connect to nutcracker on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144575 (owner: 10Ori.livneh) [22:32:36] (03CR) 10CSteipp: "Thanks for working on this, it sounds like doc.wikimedia.org should get fixed before we make this change (unless there's some rush to do i" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143940 (owner: 10Ori.livneh) [22:34:25] (03CR) 10Ori.livneh: [V: 032] Make app servers connect to nutcracker on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144575 (owner: 10Ori.livneh) [22:37:16] (03CR) 10Jforrester: "Scheduled for this afternoon's SWAT (in 22 minutes' time)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: 10Withoutaname) [22:41:14] !log ori Synchronized wmf-config/mc.php: I8b66e9339: Make app servers connect to nutcracker on port 11212 (duration: 00m 03s) [22:41:19] Logged the message, Master [22:42:14] _joe_: ^ [22:42:17] looks good [22:42:19] <_joe_> ori: :) [22:42:23] <_joe_> now I can go to bed [22:48:07] hey is there any way I can get some new task as a new contributor? i've already done some puppet lint fixes, happy to do more but some variety would be nice [22:48:20] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:49:10] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [22:49:22] Coren ping on this: https://gerrit.wikimedia.org/r/#/c/142542/ -- let me know if I need to update anything [23:00:05] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140707T2300) [23:00:10] * MaxSem volunteers [23:00:21] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [23:00:25] MaxSem: thank you [23:01:05] YuviPanda: looking for me [23:01:27] chasemp: heya! so... looks like the biggest labs box we got can't really handle all of labs' metrics, sadly. [23:01:40] _joe_: are you still there by any chance? [23:01:57] chasemp: at least, that's what I *think*. it kept dropping lots of data points, etc. iowait was about 45%. [23:02:04] <_joe_> ori: yes, but mostly off [23:02:09] hmmm yeah. not shocking I suppose. [23:02:15] <_joe_> ori: problems? [23:02:24] well it could be a couple things, rate of new metrics, etc [23:02:30] is it stable or big influx currently? 
[23:02:34] (03CR) 10MaxSem: [C: 032] Enable VisualEditor by default on Portuguese Wikiversity [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144476 (https://bugzilla.wikimedia.org/67582) (owner: 10Jforrester) [23:02:38] _joe_: the nutcracker puppetization is causing log churn because the order of keys in a hash is unspecified (and thus the generated file is different from one run to another) [23:02:42] (03Merged) 10jenkins-bot: Enable VisualEditor by default on Portuguese Wikiversity [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144476 (https://bugzilla.wikimedia.org/67582) (owner: 10Jforrester) [23:02:54] _joe_: we have a puppet function, ordered_json, for such cases [23:02:54] (03CR) 10MaxSem: [C: 032] Disable $wgLegacyJavaScriptGlobals for test2wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: 10Withoutaname) [23:02:57] chasemp: indeed. pretty stable. a few metrics have been removed, actually. [23:02:58] <_joe_> ori: yes I know [23:03:01] (03Merged) 10jenkins-bot: Disable $wgLegacyJavaScriptGlobals for test2wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: 10Withoutaname) [23:03:07] <_joe_> ori: it's a small change [23:03:13] _joe_: i can take care of it [23:03:19] YuviPanda: thoughts? what would you like to do [23:03:22] _joe_: just wanted to make sure you're aware and are cool with me making that change [23:03:23] chasemp: of course, *right now* it is a huge influx, since I disabled puppet on that machine, reduced resolution for data older than 7 weeks, and deleted all the old whisper files. to see how this copes now. [23:03:33] <_joe_> ori: it's putting a .sorted.each in the erb [23:03:33] ah [23:03:47] <_joe_> I had a very very busy day and I did not get to do that [23:03:56] _joe_: i'll fix. sorry for the ping. rest well! [23:03:57] chasemp: I don't know. 
easiest option is to just have graphite pick up data from betalabs and toollabs, and then add more people if they want. [23:04:04] I don't know how it works, I mean if this is a service provided to labs....can it not be within the labs universe, if that makes sense. its own physical box. this is a coren or andrewbogott question [23:04:16] I can tell you we have a similar issue in prod and godog is looking at backing it with cassandra [23:04:25] since the whisper scheme is fraught with these issues [23:04:30] <_joe_> ori: np :) [23:04:35] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#q,144476,n,z & https://gerrit.wikimedia.org/r/#q,139569,n,z (duration: 00m 03s) [23:04:39] chasemp: it *can* be a physical box in the labsnet, yeah. but I guess that'll go through a lot of things (procurement, etc) [23:04:41] Logged the message, Master [23:05:19] chasemp: cassandra instead of whisper you mean? [23:05:44] https://github.com/pyr/cyanite [23:05:48] yeah, am there atm [23:05:53] hmm, clojure [23:06:00] yeah... [23:06:15] James_F, please verify ^^^ [23:06:27] there is also native graphite clustering, which is a kind of pseudo-sharding of metrics across multiple backends with a canonical api host [23:06:42] (03PS1) 10Ori.livneh: nutcracker: ensure config keys are sorted to avoid log churn [operations/puppet] - 10https://gerrit.wikimedia.org/r/144605 [23:07:02] chasemp: right. run relay on one host, caches on multiple hosts... 
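The native graphite clustering chasemp describes (one carbon-relay in front, several carbon-cache backends behind it) works by hashing each metric name to a stable backend. A rough sketch of the routing idea, with made-up hostnames; carbon's real relay uses a consistent-hash ring with replica points rather than the plain modulo shown here:

```python
import hashlib

# Hypothetical backends; carbon-relay's DESTINATIONS setting plays this role.
backends = ["graphite-a:2004", "graphite-b:2004", "graphite-c:2004"]

def route(metric):
    # Hash the metric name so every sample for a given series always
    # lands on the same carbon-cache, keeping each whisper file whole.
    digest = hashlib.md5(metric.encode("utf-8")).hexdigest()
    return backends[int(digest, 16) % len(backends)]

print(route("tools.bastion.cpu.total.user"))
```

One caveat: plain modulo reshuffles most series whenever a backend is added or removed, while a consistent-hash ring limits that churn, which matters once whisper files already exist on disk.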
[23:07:52] (03PS2) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [23:08:03] chasemp: problem with that is that right now the graphite role is the same for labs and prod, and I'm wary of making such a drastic change to it [23:08:35] (03CR) 10Ori.livneh: [C: 032 V: 032] "synced up with giuseppe about this" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144605 (owner: 10Ori.livneh) [23:08:47] understood, I...can't make any guarantee's there, but I can tell you we are pretty much where your at in prod [23:08:51] and asking the same questions [23:09:08] chasemp: i can move mwprof to a different machine if it'll help [23:09:20] so if the cassandra thing worked out great, or some other scheme, but aside from that scaling back is the only thing [23:09:57] ori: it's not a terrible idea but if we are disk bound (probably?) won't do much I don't think [23:10:15] are we in fact disk-bound? i have looked at tungsten recently [23:10:20] *haven't [23:10:36] pure conjecture on my part, I haven't dug in either [23:10:43] but we should probably embargo new metrics until we do [23:10:47] or lots them anyway [23:10:55] yeah, +1 [23:11:01] jackmcbarn, what's the issue in production your changes nominated for SWAT are fixing? [23:11:02] my poking around on diamond-collector tells me it is disk bound, but I'm new at this sortof thing [23:11:15] but it's a virtual disk anyway [23:11:21] MaxSem: server logs filling up with warnings [23:12:10] don't see anything like that in logs... 
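The nutcracker fix merged above (144605) addresses the churn _joe_ described: hash iteration order is unspecified, so the rendered config file differed between puppet runs even with identical data, producing a spurious diff every run. The cure is to emit keys in sorted order in the ERB template (or via the ordered_json puppet function mentioned earlier). The same idea in Python, for illustration only; the pool names below are hypothetical:

```python
import json

# Hypothetical shard map; the real template renders nutcracker's
# config from a puppet hash of memcached pool definitions.
pools = {"mc2": "10.64.0.2:11211", "mc1": "10.64.0.1:11211", "mc3": "10.64.0.3:11211"}

# Emitting keys in sorted order makes the output deterministic, so
# identical input data always produces a byte-identical file:
print(json.dumps(pools, sort_keys=True))
```

With unsorted iteration two runs over the same dict may serialize in different orders; with sort_keys the output is stable and config-management diffs go quiet.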
[23:12:21] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [23:12:31] dammit [23:12:44] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#q,144476,n,z & https://gerrit.wikimedia.org/r/#q,139569,n,z (duration: 00m 05s) [23:12:50] YuviPanda: ask godog when he's about how serious he is / when he was going to explore this [23:12:53] MaxSem: the change that caused the problem was added in wmf11, and when that hit enwiki, it became really noticeable and was backed out of that branch [23:12:56] maybe he could POC it in your setup [23:13:01] but when wmf12 hits enwiki, if it's not fixed by then, it will act up again [23:13:07] chasemp: alright. I'll be happy to help with a cluster setup if needed. [23:13:11] ah, ok [23:13:21] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [23:13:30] YuviPanda: sorry that's not much help, it's an unanswered question [23:13:35] chasemp: another alternative is to write Minimal* variants of all collectors, and send only metrics we're interested in. [23:13:40] clearer commit message would've helped:) [23:13:43] chasemp: 'tis ok, am happy it's not something sutpid I did :) [23:13:44] *stupid [23:13:48] James_F, now deployed for realz [23:13:54] chasemp: do you know what's godog's TZ? [23:14:06] YuviPanda: honestly, that is what I did before, but on a physical box it was a high bar in the vm probably much lower [23:14:08] ireland [23:14:19] UTC or +1 or +2 ? [23:14:23] ah, good enough. [23:14:55] chasemp: the 'Minimal'? yeah. Network, Disk, CPU, etc all have things that we can probably reduce a fair bit. [23:14:57] jackmcbarn, so it's not actually needed in wmf11? [23:15:06] MaxSem: right [23:15:14] that's why I wrote the cpu percentage stuff instead of per item for cpu, etc [23:15:15] (i didn't submit it for wmf11, did i?) [23:15:22] some of it got upstreamed, some of it not at the time [23:15:26] MaxSem: Thanks! 
[23:15:47] chasemp: hmm, so if I write a set of 'minimal'* ones that capture just the metrics, would you be ok merging those? [23:16:15] thinking on it, real PITA road to travel there [23:16:25] how minimal I guess? [23:16:34] you could almost do one collector if it was a very small subset [23:16:38] labsCollector [23:16:39] idk [23:17:12] hmm, brr. [23:17:41] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [23:17:46] !log maxsem Synchronized php-1.24wmf12/extensions/GWToolset: (no message) (duration: 00m 04s) [23:17:51] Logged the message, Master [23:18:02] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin [23:18:10] MaxSem: Did you sync the dblists for VisualEditor? They seem unaffected. [23:18:21] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:18:39] !log maxsem Synchronized php-1.24wmf11/extensions/GWToolset: (no message) (duration: 00m 03s) [23:18:41] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [23:18:45] Logged the message, Master [23:19:01] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin [23:19:08] chasemp: alright, I'll talk to godog tomorrow. Thanks for the help! [23:19:13] !log maxsem Synchronized php-1.24wmf12/extensions/GWToolset: (no message) (duration: 00m 03s) [23:19:17] chasemp: also, will yo be at Wikimania? [23:19:18] Logged the message, Master [23:19:29] bawolff, please verify ^^^ :) [23:19:39] YuviPanda: sounds good on the first, let me know and yes on the second [23:19:49] chasemp: cool! :) [23:20:37] MaxSem: just a second [23:21:57] MaxSem: Could you sync visualeditor-default.dblist please? [23:22:35] !log maxsem Synchronized visualeditor-default.dblist: (no message) (duration: 00m 03s) [23:22:40] Logged the message, Master [23:22:43] Ta. [23:22:44] James_F, ^ [23:22:56] And, indeed, working. 
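A single "labsCollector" along the lines chasemp sketches, publishing only a small fixed subset of metrics instead of everything the stock CPU/Disk/Network collectors emit, would look roughly like this. Diamond collectors are Python classes; the base class here is a stand-in so the sketch runs standalone (a real one subclasses diamond.collector.Collector), and the metric names are invented:

```python
import os

class Collector:
    # Stand-in for diamond.collector.Collector so this sketch is
    # self-contained; a real collector inherits publish() from diamond.
    def publish(self, name, value):
        print("%s %s" % (name, value))

class MinimalLabsCollector(Collector):
    # Publish only a handful of cheap, high-value metrics rather than
    # the full per-device output of the stock collectors.
    def collect(self):
        load1, load5, load15 = os.getloadavg()
        self.publish("loadavg.01", load1)
        self.publish("loadavg.05", load5)
        self.publish("loadavg.15", load15)

MinimalLabsCollector().collect()
```

The trade-off discussed above applies: one small collector keeps disk and network load down, but every new metric then means editing it rather than just enabling a stock collector.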
[23:23:03] MaxSem: Both patches now working fine. Thanks! [23:26:25] YuviPanda: maybe this is a good wikimania hackathon thing :) [23:26:33] chasemp: :D yeah [23:26:38] chasemp: was just looking at InfluxDB [23:26:53] chasemp: HEY WE CAN USE MONGODB FOR THIS! :) [23:27:11] (03PS1) 10Mwalker: New Variables for OCG Service [operations/puppet] - 10https://gerrit.wikimedia.org/r/144610 [23:27:49] chasemp: but atm, the most attractive solution to me is to just get metrics from tools and betalabs. That'll cut down metric count by about 60% or something [23:28:01] makes sense [23:28:41] this is a new one on me https://github.com/graphite-ng/graphite-ng [23:28:45] chasemp: of course, then the problem becomes how do we get rid of diamond from all the other nodes? :) ensure absent, I suppose [23:29:32] springle: is there anything like ishmael was in terms of getting what queries are taking the most time? [23:29:37] yes you can also discard things not from where you expect so no one restarts mass diamond and floods you [23:29:38] etc, etc [23:29:41] heh, report.py and graphite wonky too [23:29:49] * AaronSchulz feels blind ;) [23:29:53] chasemp: right, but that'll still generate useless network data. 
[23:30:08] it's just a failsafe for disk usage really
[23:30:25] hmm, true
[23:33:07] (03PS2) 10Mwalker: New Variables for OCG Service [operations/puppet] - 10https://gerrit.wikimedia.org/r/144610
[23:34:53] (03PS1) 10Ori.livneh: Deploy jobrunner to MW job runners via git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612
[23:35:33] MaxSem: confirmed, it works
[23:36:47] (03CR) 10Aaron Schulz: [C: 031] Deploy jobrunner to MW job runners via git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612 (owner: 10Ori.livneh)
[23:37:32] (03PS1) 10Yuvipanda: diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615
[23:37:52] (03PS2) 10Yuvipanda: diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615
[23:38:03] !log maxsem Synchronized php-1.24wmf12/includes/StubObject.php: https://gerrit.wikimedia.org/r/#/c/144509/ (duration: 00m 03s)
[23:38:09] Logged the message, Master
[23:38:34] why are there still so many local accounts without SUL created?
[23:38:46] !log maxsem Synchronized php-1.24wmf12/extensions/ParserFunctions/: https://gerrit.wikimedia.org/r/#q,144510,n,z (duration: 00m 03s)
[23:38:51] Logged the message, Master
[23:38:59] jackmcbarn, deployed ^^^
[23:39:09] MaxSem: thanks
[23:40:02] (03CR) 10jenkins-bot: [V: 04-1] diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 (owner: 10Yuvipanda)
[23:40:14] jackmcbarn, any ways to test it?
[23:40:38] MaxSem: the only one we know to repro uses eval.php, which probably isn't a good idea on production
[23:40:38] <^d> I'm gonna be doing some gerrit maintenance in about 20 minutes. Plz don't panic.
[23:40:43] <^d> Or panic, just don't panic @ me
[23:41:18] (03PS3) 10Yuvipanda: diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615
[23:41:25] ^d, just don't make us panic so we don't panic and not panic at you causing more panic
[23:42:21] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.013 second response time
[23:42:30] Merlissimo: Don't think there are many of those actually
[23:43:02] 895 on dewiki last 30 days
[23:44:03] at some point we will disallow that
[23:44:18] sadly we're not there yet
[23:45:18] (03PS4) 10Yuvipanda: diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615
[23:45:30] chasemp: ^ should do it
[23:45:34] 2726 accounts not attached to SUL were active in the last 30 days. Still having new ones doesn't solve the problem
[23:45:39] (03CR) 10Helder.wiki: "FYI: I asked the users to test their personal scripts on test2wiki now that bug" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: 10Withoutaname)
[23:46:07] haven't seen this done before: $enabled = member($labs_enabled_projects, $::instanceproject)
[23:46:09] clever
[23:46:25] chasemp: :) let me add some docs
[23:46:28] I think your enabled tho should not be quoted
[23:46:30] in default
[23:47:21] chasemp: right.
[23:49:06] bd808, are you around?
[23:49:26] mwalker: Yup. what's up?
[23:49:37] I'm wondering if you know anything about an unstaged change in beta puppet that disables HHVM
[23:49:54] mwalker: Yeah I just made it seconds ago
[23:50:09] ah; so I should wait a bit until I start testing my puppet patch
[23:50:21] can you let me know when you're done playing?
[23:50:39] I'll clean it up...
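(The `member()` line chasemp quotes above is the `member` function from puppetlabs-stdlib, which returns a bare boolean indicating list membership. A hedged sketch of the pattern being discussed, and of the quoting point: the project list contents and the `diamond` class name here are assumptions for illustration, not the actual contents of change 144615; only the `member()` line is quoted from the channel.)

```puppet
# Sketch only: list contents and class name are hypothetical.
$labs_enabled_projects = ['tools', 'deployment-prep', 'graphite']

# member() from puppetlabs-stdlib yields a real boolean.
$enabled = member($labs_enabled_projects, $::instanceproject)

# chasemp's quoting point: a default of 'false' (quoted) is a
# non-empty string, and non-empty strings are truthy in Puppet,
# so a quoted default can never evaluate to false.
if $enabled {
    include diamond
}
```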
[23:50:54] no rush
[23:50:57] (03PS5) 10Yuvipanda: diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615
[23:50:57] I have other things I can be doing
[23:51:01] chasemp: had to add a couple of notifys to get diamond to restart. testing again now
[23:51:33] mwalker: I'm done there. You can cherry-pick away. :)
[23:53:54] bd808, ah; I love it when puppet changes just apply cleanly
[23:53:59] I'm done
[23:54:02] all yours again
[23:55:16] mwalker: Thanks. One more patch + cherry-pick and I hope to be done in there for the night
[23:57:56] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt and applied on deployment-apache0[12], deployment-bastion and deployment-rsync01." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 (owner: 10BryanDavis)
[23:59:01] <^d> How to force a puppet run these days?
[23:59:08] puppet agent -tv
[23:59:13] <^d> agent -tv, thx
[23:59:44] chasemp: uh oh, bad news. setting enabled = false doesn't actually disable anything
[23:59:52] nor does setting enable = true enable anything
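(YuviPanda's two observations above, needing explicit notifys to get diamond to restart, and `enabled = false` not actually disabling anything, come down to the same Puppet behavior: Puppet only converges state you declare, so a flag does nothing unless the manifest maps it onto the `ensure`/`enable` attributes of a `service` resource, and config changes only restart a service that is notified of them. A hedged sketch of that mapping; the class, parameter, and file names are hypothetical, not the real operations/puppet diamond module.)

```puppet
# Hypothetical sketch, not the actual diamond module in operations/puppet.
class diamond ($enabled = true) {
    # Map the flag onto concrete service state. Without this mapping,
    # flipping $enabled to false leaves an already-running service untouched.
    $svc_ensure = $enabled ? {
        true    => 'running',
        default => 'stopped',
    }

    file { '/etc/diamond/diamond.conf':
        ensure => file,
        # The "couple of notifys": restart diamond when its config changes.
        notify => Service['diamond'],
    }

    service { 'diamond':
        ensure => $svc_ensure,
        enable => $enabled,  # start on boot only when enabled
    }
}
```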