[00:09:46] jamesofur, did you say you had a lock that didn't get logged?
[00:10:16] Krenair: I did, yes, I ended up undoing it and then redoing it to get it in the system (though you can still tell it's screwy since you see the 'unlock' but not the original lock
[00:10:23] sigh
[00:13:33] 7Puppet, 6operations, 10Continuous-Integration, 6Labs: Error "Duplicate declaration: File[/etc/ssh/userkeys] is already declared in file /private/modules/passwords/manifests/init.pp:36; cannot redeclare at /modules/ssh/manifests/server.pp:31" - https://phabricator.wikimedia.org/T92752#1119403 (10Krinkle) ...
[00:16:22] 7Puppet, 6operations, 10Continuous-Integration, 6Labs: Error "Duplicate declaration: File[/etc/ssh/userkeys] is already declared in file /private/modules/passwords/manifests/init.pp:36; cannot redeclare at /modules/ssh/manifests/server.pp:31" - https://phabricator.wikimedia.org/T92752#1119411 (10Krinkle) ...
[00:18:07] RECOVERY - uWSGI web apps on graphite2001 is OK: OK: All defined uWSGI apps are runnning.
[00:19:54] (03CR) 10Krinkle: "Those could merely be hits preventing AbuseFilter/SpamBlacklist from being triggered if wgSpamRegex is run first. Worth investigating." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196782 (https://phabricator.wikimedia.org/T50491) (owner: 10Glaisher)
[00:21:46] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running.
[00:28:27] (03CR) 10Krinkle: [C: 031] "Was caused by T92752. It works now." [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn)
[00:44:04] (03PS1) 10BBlack: depool cp107[12] backends [puppet] - 10https://gerrit.wikimedia.org/r/196848
[00:44:06] (03PS1) 10BBlack: wmf-reimage: disable agent daemon before enabling puppet [puppet] - 10https://gerrit.wikimedia.org/r/196849
[00:44:26] (03CR) 10BBlack: [C: 032 V: 032] depool cp107[12] backends [puppet] - 10https://gerrit.wikimedia.org/r/196848 (owner: 10BBlack)
[00:53:49] (03PS2) 10BBlack: wmf-reimage: disable agent daemon before enabling puppet [puppet] - 10https://gerrit.wikimedia.org/r/196849
[00:53:51] (03PS1) 10BBlack: depool cp107[34] backends [puppet] - 10https://gerrit.wikimedia.org/r/196850
[00:54:16] (03CR) 10BBlack: [C: 032 V: 032] depool cp107[34] backends [puppet] - 10https://gerrit.wikimedia.org/r/196850 (owner: 10BBlack)
[02:09:14] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 04s)
[02:09:22] Logged the message, Master
[02:10:21] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-15 02:09:18+00:00
[02:10:25] Logged the message, Master
[02:10:48] !log l10nupdate Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 00m 03s)
[02:10:51] Logged the message, Master
[02:11:55] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-15 02:10:51+00:00
[02:11:59] Logged the message, Master
[02:23:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Mar 15 02:22:41 UTC 2015 (duration 22m 40s)
[02:23:50] Logged the message, Master
[02:37:38] (03PS2) 10Krinkle: Don't set up the job queue for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190406 (owner: 10Ori.livneh)
[02:38:06] 7Puppet, 6Labs: puppetmaster::gitsync should update labs/private repository as well - https://phabricator.wikimedia.org/T92756#1119472 (10scfc) 3NEW
[03:34:06] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:07] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:47] PROBLEM - puppet last run on iodine is CRITICAL: CRITICAL: Puppet has 1 failures
[03:51:27] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[03:52:36] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:54:27] RECOVERY - puppet last run on iodine is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:28:36] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:56] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:07] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:27] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[06:29:27] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:27] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:46:08] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:46] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:16:07] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:33:37] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:33:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0]
[08:33:16] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0]
[08:46:07] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:46:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:25:20] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1119689 (10Matanya) So can we declare this consensus ? If so, i'll add this to the style guide, and fix accordingly.
[12:37:57] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning.
[12:41:36] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
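The "HTTP 5xx req/min" alerts above fire when a given fraction of Graphite datapoints exceed the critical threshold (14.29% above 500.0). A minimal sketch of that percentage computation, for illustration only — this is not the actual check_graphite code, and the sample values are invented:

```python
def pct_above(datapoints, threshold):
    """Percentage of datapoints strictly above the threshold."""
    if not datapoints:
        return 0.0
    hits = sum(1 for v in datapoints if v > threshold)
    return 100.0 * hits / len(datapoints)

# One sample out of seven above 500 reproduces the 14.29% from the alert.
rate = pct_above([620, 180, 210, 140, 90, 300, 250], 500.0)
```

Note that 1/7 is exactly the 14.29% reported, which suggests the check was looking at a window of seven datapoints.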
[13:48:07] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:16] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 2.279 second response time
[15:12:44] (03PS1) 10Steinsplitter: cleanup: upload has been disabled on outrechwiki, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196885
[15:40:06] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1119761 (10Multichill) I'm happy I only put asset tags on the servers and no labels yet ;-)
[15:50:27] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:01:30] 6operations, 3HTTPS-by-default: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1119771 (10GWicke) @bblack, delaying the second Parsoid cache by a week should be fine. If things go well we should have VE use RESTBase instead for all wikipedias by the end of next week (pos...
[16:08:06] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[16:18:06] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[16:18:54] 6operations: Digitally sign cortado video player java applet - https://phabricator.wikimedia.org/T83995#1119790 (10Aklapper)
[16:18:56] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: sign cortado applet so that it works for people with outdated java - https://phabricator.wikimedia.org/T62287#1119791 (10Aklapper)
[16:36:46] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[16:41:35] (03CR) 10GWicke: "@ottomata, I agree with you that a submodule would be nicer & that changing the puppet compiler to somehow support testing submodule check" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans)
[16:57:24] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1119846 (10GWicke) p:5Triage>3Normal
[17:21:15] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1119890 (10GWicke) Some notes: - We should lock down access to the JMX port, which sadly can't be bound to localhost only. I think a reasonable option would be to on...
[17:28:09] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1119900 (10BBlack) I don't think I'm alone in abhorring all the git submodules in our puppet repo. In practice they're a real pain for day-to-day operation...
[17:28:30] (03CR) 10BBlack: [C: 031] "I'm all for killing any submodules we can in ops/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans)
[17:31:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1119904 (10BBlack) Re: notes: * If JMX is only needed from localhost, we could limit that with local iptables rules on the servers fairly easily. * We're readying ips...
[17:34:11] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1119914 (10GWicke) I'm a bit worried that with a password the usage of nodetool will become quite complicated, even if only accessing localhost. Setting...
[17:34:52] (03CR) 10GWicke: "I'm no longer that convinced that a password is better than locking this down at the network level. See https://phabricator.wikimedia.org/" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans)
[17:39:07] 7Puppet, 6Labs: puppet-run is confused by stale lock files - https://phabricator.wikimedia.org/T92766#1119929 (10scfc) 3NEW
[17:39:55] (03CR) 10Tim Landscheidt: "This check is confused by stale lock files; cf. T92766." [puppet] - 10https://gerrit.wikimedia.org/r/196162 (owner: 10BBlack)
[18:06:23] (03CR) 10Ori.livneh: "It really is that painful, yes." [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans)
[18:08:05] (03CR) 10Ori.livneh: [C: 031] move cassandra submodule into puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans)
[18:17:57] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning.
[18:21:37] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
[18:26:09] 6operations, 5Patch-For-Review: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#1120002 (10faidon) @RobH, what's left to be done here?
[19:43:33] (03PS1) 10Yuvipanda: domainproxy: Keep the legend living [puppet] - 10https://gerrit.wikimedia.org/r/196893
[19:44:35] (03PS2) 10Yuvipanda: domainproxy: Keep the legend living [puppet] - 10https://gerrit.wikimedia.org/r/196893
[19:44:58] (03PS3) 10Yuvipanda: domainproxy: Keep the legend living [puppet] - 10https://gerrit.wikimedia.org/r/196893
[19:46:10] (03CR) 10Yuvipanda: [C: 032] domainproxy: Keep the legend living [puppet] - 10https://gerrit.wikimedia.org/r/196893 (owner: 10Yuvipanda)
[19:46:32] (03CR) 10Quiddity: [C: 031] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/196893 (owner: 10Yuvipanda)
[19:57:56] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning.
[20:01:27] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
[20:40:21] hmm, is wikitech login broken? It loads and loads and loads and give me an error, that the password is incorrect (yes, password and username is correct :))?
[20:49:28] FlorianSW, confirmed. I logged-out, and now cannot log-in again. >.<
[20:49:49] FlorianSW, seems broken indeed
[20:50:15] Krenair, quiddity damn :D thanks for testing! Should i open a task in phab for it?
[20:50:54] let's see what I can do first
[20:51:50] * quiddity goes to drink tea on the porch, in the wind.
[20:52:19] :D ok :P
[20:59:15] okay
[20:59:38] so I had to modify the suggested troubleshooting config so that only I could trigger the logging >_>
[21:01:26] 2015-03-15 20:58:59 silver labswiki: 2.1.0 OpenStackNovaController::restCall fullurl: http://virt1000.wikimedia.org:35357/v2.0/tokens
[21:01:26] 2015-03-15 21:00:44 silver labswiki: 2.1.0 OpenStackNovaController::authenticate return code: 0
[21:01:30] that takes a long time
[21:03:12] https://github.com/wikimedia/mediawiki-extensions-OpenStackManager/blob/master/nova/OpenStackNovaController.php#L710
[21:04:35] and that would be... keystone's url
[21:05:32] and I can't go there
[21:05:40] so... YuviPanda, around?
[21:13:22] akosiaris, jgage?
[21:15:58] (03PS3) 10BBlack: wmf-reimage: disable agent daemon before enabling puppet [puppet] - 10https://gerrit.wikimedia.org/r/196849
[21:16:11] (03CR) 10BBlack: [C: 032 V: 032] wmf-reimage: disable agent daemon before enabling puppet [puppet] - 10https://gerrit.wikimedia.org/r/196849 (owner: 10BBlack)
[21:17:57] bblack perhaps?
[21:18:07] hmmm?
[21:18:17] know anything about unbreaking keystone auth for wikitech?
[21:19:03] not really, but I do see that once I logged out I can't log back in there
[21:19:40] I followed https://wikitech.wikimedia.org/wiki/Wikitech#Troubleshooting and found that the HTTP call to keystone (on virt1000) is not working as expected
[21:19:40] any idea how long it's been broken?
[21:20:02] usually when things break, someone broke them :)
[21:20:17] nope, I found out because FlorianSW said he couldn't log in here
[21:21:16] bblack, Krenair sorry, no, i just had to login today again (normally i save all sessions on this pc), so i don't know, when it doesn't worked anymore :(
[21:26:10] bblack, maybe virt1000.eqiad.wmnet:/var/log/keystone/keystone.log will tell you?
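The hung restCall above is a POST to Keystone's v2.0 tokens endpoint. A minimal offline sketch of what that token request looks like, following the standard Keystone v2.0 API rather than the OpenStackManager code itself; the credentials and tenant name here are placeholders:

```python
import json
import urllib.request

# Build (but do not send) a Keystone v2.0 token request like the one
# OpenStackNovaController::restCall issues. Placeholder credentials.
def build_token_request(base_url, username, password, tenant):
    payload = {
        "auth": {
            "passwordCredentials": {"username": username, "password": password},
            "tenantName": tenant,
        }
    }
    return urllib.request.Request(
        base_url + "/v2.0/tokens",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_token_request(
    "http://virt1000.wikimedia.org:35357", "user", "secret", "tenant"
)
```

Sending such a request (e.g. with `urllib.request.urlopen(req, timeout=10)`) from silver and watching it hang is essentially what the troubleshooting steps reproduced here.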
[21:28:03] not really
[21:28:09] it's full of spammy warnings, but apparently that's normal :/
[21:28:31] (keystone.token.controllers): 2015-03-14 06:16:25,530 WARNING User novaadmin is unauthorized for tenant puppet3-diffs
[21:28:34] (keystone.common.wsgi): 2015-03-14 06:16:25,530 WARNING Authorization failed. User novaadmin is unauthorized for tenant puppet3-diffs from 208.80.154.18
[21:28:37] (keystone.common.controller): 2015-03-14 06:16:31,498 WARNING RBAC: Bypassing authorization
[21:28:40] (keystone.common.wsgi): 2015-03-14 06:18:18,256 WARNING Authorization failed. Could not find project, jitsi. from 208.80.154.18
[21:28:48] mostly those, over and over all the time
[21:29:30] back on the 12th, there is an LDAP connect error in there
[21:29:47] surely someone would've noticed if it's been broken since then, though
[21:29:53] (03PS1) 10Hoo man: Add base::firewall on silver [puppet] - 10https://gerrit.wikimedia.org/r/196961
[21:30:07] I don't think ldap is broken
[21:31:03] also possibly related: [Fri Mar 13 12:42:00 2015] init: keystone main process (14804) killed by TERM signal
[21:31:13] but it seems to have been restarted by init when that happened
[21:35:04] keystone is python, and python-requests is using some (I'm guessing manually-applied) odd version, which leads apt-get upgrade to say it would downgrade it if I said yes :P
[21:35:14] various other python packages are due for upgrade in general
[21:35:24] doubtful it's related, just saying
[21:35:56] restarting keystone seems to have fixed it
[21:36:26] !log restarted keystone service on virt1001 to fix wikitech login, still no idea why that was necessary or what was broken
[21:36:32] Logged the message, Master
[21:38:14] thanks bblack!
[21:38:19] I hate software like that, where you have no solid indication of what's really wrong, but just kicking it in the head fixes things.
[21:38:24] feels like rebooting old Windows machines
[21:38:28] np
[21:38:44] FlorianSW, quiddity ^
[21:39:07] (03PS2) 10Hoo man: Add base::firewall on silver [puppet] - 10https://gerrit.wikimedia.org/r/196961
[21:40:29] thx bblack & Krenair
[21:47:06] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[21:47:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[21:49:12] 7Puppet, 6Labs: puppet-run is confused by stale lock files - https://phabricator.wikimedia.org/T92766#1120168 (10BBlack) It's to protect against a scenario I've run into repeatedly on fresh machine installs: while the admin is executing the initial puppet run to configure the host (which can take several minut...
[22:24:35] (03PS1) 10BBlack: puppet-run lock: check running pid + cmdline match T92766 [puppet] - 10https://gerrit.wikimedia.org/r/196963
[22:28:19] (03CR) 10BryanDavis: [C: 031] "Seems to work in my test environment." [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm)
[22:28:52] (03CR) 10BBlack: [C: 032] puppet-run lock: check running pid + cmdline match T92766 [puppet] - 10https://gerrit.wikimedia.org/r/196963 (owner: 10BBlack)
[22:32:17] 7Puppet, 6Labs, 5Patch-For-Review: puppet-run is confused by stale lock files - https://phabricator.wikimedia.org/T92766#1120213 (10BBlack) Note the above only implements (a) and (b); it lacks the mtime check, but that may not prove necessary anyways.
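BBlack's fix for T92766 makes puppet-run treat a lock file as stale unless the recorded pid is still alive and its command line matches. A rough Python illustration of that pid-plus-cmdline check (the real change in ops/puppet is a shell script, and the file layout below is hypothetical, assuming a lock file that contains just a pid and a Linux /proc filesystem):

```python
def lock_is_stale(lockfile, expect="puppet-run"):
    """True if the lock's pid is missing, dead, or reused by another command.

    Assumed layout: the lock file holds a single pid. On Linux,
    /proc/<pid>/cmdline holds the NUL-separated argv of a live process.
    """
    try:
        pid = int(open(lockfile).read().strip())
    except (OSError, ValueError):
        return True  # no lock file, or unreadable/garbled contents
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().decode(errors="replace")
    except OSError:
        return True  # pid no longer running
    # pid exists but belongs to some unrelated process (pid reuse)
    return expect not in cmdline
```

As BBlack notes on the task, this covers the "pid dead" and "pid reused" cases but not an mtime-based timeout.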
[22:34:34] (03PS1) 10Hoo man: Let hooft search all domains like the other bastions [puppet] - 10https://gerrit.wikimedia.org/r/196964
[22:41:13] (03PS2) 10Hoo man: Clean up bastionhost domain_search [puppet] - 10https://gerrit.wikimedia.org/r/196964
[23:21:56] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[23:21:56] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK: OK: No anomaly detected
[23:27:45] ParsoidCacheUpdateJobOnDependencyChange: 148016 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
[23:27:47] on wikidata
[23:31:55] (03CR) 10Hoo man: "I'm using a private server as ssh proxy now to connect to bast1001, but that's not terribly nice." [puppet] - 10https://gerrit.wikimedia.org/r/196964 (owner: 10Hoo man)
[23:36:03] gwicke, ^
[23:44:52] hoo: here?
[23:45:10] sure
[23:45:22] can I has traceroutes to hooft + bast1001?
[23:47:53] hoo, hey. any idea how jobrunners are set up?
[23:48:26] paravoid: PMed
[23:48:52] I know of many having such problems from the EU when accessing gerrit/git.wm.o/labs/...
[23:49:55] what kind of problems?
[23:50:41] Krenair: vaguely
[23:50:58] paravoid: Extrem slowness, ssh freezing for a bit
[23:51:05] https://www.mediawiki.org/wiki/Gerrit/Advanced_usage#ssh_proxy_to_gerrit Even documented
[23:51:42] Very annoying in interactive things like mysql, or even when tailing large logs
[23:52:11] I've experienced ssh freezing, but not extreme slowness
[23:52:30] krenair@fluorine:/a/mw-log$ tail runJobs.log -n 1000 | grep ParsoidCacheUpdateJobOnDependencyChange | grep wikidatawiki -c
[23:52:30] 10
[23:53:06] so it's definitely processing them
[23:53:33] I *guess* it's triggering them for all item changes, which is pointless as those aren't even wikitext
[23:54:08] It's definitely running them on item pages.
[23:54:14] maybe we should add a check to Extension:Parsoid
[23:54:32] Probably a regression
[23:54:37] hoo: is it a recent issue?
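Krenair's `tail | grep | grep -c` pipeline above counts recent runJobs.log lines matching both the job type and the wiki. The same AND-filter, sketched generically in Python; the sample lines below are invented stand-ins for real runJobs.log entries:

```python
def count_matches(lines, *needles):
    """Count lines containing every needle, like `grep A | grep B -c`."""
    return sum(1 for line in lines if all(n in line for n in needles))

# Invented sample lines standing in for runJobs.log entries:
log = [
    "2015-03-15 ParsoidCacheUpdateJobOnDependencyChange Q42 wikidatawiki STARTING",
    "2015-03-15 refreshLinks Main_Page enwiki STARTING",
]
hits = count_matches(log, "ParsoidCacheUpdateJobOnDependencyChange", "wikidatawiki")
```

With the real log, a count of 10 per 1000 recent lines is what led Krenair to conclude the queue was being drained, just slowly relative to the 148016 queued jobs.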
[23:55:33] It has been there for quite some time... but earlier today it became almost unbearable (had to wait up to several seconds for the remote to react)
[23:55:46] but not now?
[23:56:03] if that happens again, can you run traceroute or even better mtr and send them to me?
[23:58:43] yeah, sure
[23:59:08] Did that in the past, but weren't interesting