[00:04:30] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [00:51:39] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 315 seconds [00:52:18] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 354 seconds [00:52:58] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:18] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [02:15:14] !log LocalisationUpdate completed (1.25wmf7) at 2014-11-17 02:15:14+00:00 [02:15:23] Logged the message, Master [02:27:25] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-17 02:27:25+00:00 [02:27:31] Logged the message, Master [03:43:27] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [04:00:49] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.018 second response time [04:09:48] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet fail [04:09:59] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.020 second response time [04:15:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 17 04:15:03 UTC 2014 (duration 15m 2s) [04:15:07] Logged the message, Master [04:29:19] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:37:39] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [05:12:11] (03PS1) 10Tim Starling: Remove coloured gdb prompt [puppet] - 10https://gerrit.wikimedia.org/r/173752 [05:30:31] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures [05:33:33] (03CR) 10Ori.livneh: [C: 031] "This shouldn't happen if you use \001 and \002 to enclose non-printing characters (which I did). 
I can't reproduce the issue. But this is " [puppet] - 10https://gerrit.wikimedia.org/r/173752 (owner: 10Tim Starling) [05:43:03] ori: also, https://bugzilla.wikimedia.org/show_bug.cgi?id=73479 (Eventlogging puppet failure on labs because intertwined with ganglia) [05:43:38] * ori looks [05:45:22] YuviPanda: did labs use to have ganglia? [05:45:35] it had gmond, but didn't really have an aggregator. [05:45:53] and that was eating up memory in instances, so we killed it. Not the most thorough job in hindsight. [05:46:04] since these have been failing ever since (about a month now) [05:46:28] why not add an aggregator? [05:46:45] we used to have an aggregator on a labs instance, and that just died very, very quickly. [05:46:57] so would need to be on a physical host. [05:47:17] and I don't know if I like that idea too much [05:48:13] ori: we added a hiera variable called 'has_ganglia' and been using that [05:48:22] not the most elegant of solutions. [05:50:50] ori: hmm, toollabs has a workaround now, deployment-prep doesn't (for core dumps), and I can't figure out a nice way to disable them forever. [05:50:56] * YuviPanda goes to apply workaround to deployment-prep too [05:51:09] one core dump fills up /var [05:54:34] YuviPanda: sysctl::parameters { 'disable core dumps': values => { 'kern.coredump' => 0, }, } [05:54:35] ori: also hhvm coredumps on deployment-prep, moved to /home/yuvipanda (if you want to have a look) [05:54:39] ... [05:54:39] wat [05:54:45] why didn't I actually find that? [05:55:03] * YuviPanda feels super dumb now [05:55:30] find what? 
it's not in the repo, i'm just showing you how to do it [05:55:57] yeah [05:55:58] I mean [05:56:02] find that parameter [05:56:15] I was looking into limits.conf and got lost there [05:56:27] you could do it there as well [05:57:04] oh damn, I just found out about limits.d [05:57:08] doubly feeling dumb now [05:57:21] ...or there, yeah :) [05:57:22] since I was spending time trying to figure out how to manage the limits.conf file without conflicts [05:57:33] file_line [05:57:44] in stdlib [05:57:49] triply dumb now. [05:57:54] ok, that was only 3 hours, so not so bad [05:58:02] plus feeling dumb is good, since that means you're learning. [05:58:04] * YuviPanda consoles self. [05:58:08] ori: thanks! [05:58:16] I'll make a patch now [05:58:44] ori: thanks! [05:59:00] no problem, thanks for fixing [05:59:05] :) [05:59:19] just fixed deployment-stream [05:59:27] * YuviPanda wonders if he can get all of deployment-prep green today [06:04:44] YuviPanda: having a dummy gmond package / service might be easier, in that it would not require changing individual modules like the has_ganglia var does [06:05:30] service { 'ganglia-monitor': start => '/bin/true', stop => '/bin/true', } [06:05:42] hmm, right. [06:05:46] i think you also need provider => base [06:09:57] having the separation might also be nice, tho. [06:11:43] why... do we have a lucid instance? [06:13:00] oh man, we've a host called udplog? [06:13:18] goddamit, gmond again [06:13:58] why... do we have a lucid instance? 
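The two tricks ori sketches in the exchange above — blocking core dumps and stubbing out gmond — could be written out roughly as follows. This is a hedged sketch against the Puppet 3 / puppetlabs-stdlib APIs of the era; the resource titles and the `* hard core 0` limit line are illustrative, not taken from the repo.

```puppet
# Disable core dumps via a single managed line in limits.conf, using
# file_line from puppetlabs-stdlib (no need to own the whole file,
# which was the conflict YuviPanda was worried about).
file_line { 'disable-core-dumps':
    path  => '/etc/security/limits.conf',
    line  => '* hard core 0',
    match => '^\*\s+hard\s+core\s+',
}

# Stub out ganglia-monitor on hosts without an aggregator:
# provider => base with no-op start/stop commands keeps Puppet from
# looking for a real init script, so modules that declare the service
# still compile.
service { 'ganglia-monitor':
    ensure   => stopped,
    provider => base,
    start    => '/bin/true',
    stop     => '/bin/true',
    status   => '/bin/false',
}
```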
[06:14:06] i guess we'll be porting it to debian squeeze soon [06:14:12] hehe :) [06:14:24] trolololo [06:14:58] * YuviPanda emails qa list [06:28:13] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:34] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:34] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:13] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:45:53] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:33] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:55] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:04] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [07:07:13] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [07:39:13] (03PS5) 10Giuseppe Lavagetto: mediawiki: simplify apache config [puppet] - 10https://gerrit.wikimedia.org/r/170300 [07:39:22] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: simplify apache config [puppet] - 10https://gerrit.wikimedia.org/r/170300 (owner: 10Giuseppe Lavagetto) [08:04:38] (03PS1) 10Giuseppe Lavagetto: Fix whitespace in RewriteCond [puppet] - 10https://gerrit.wikimedia.org/r/173756 [08:05:20] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix whitespace in RewriteCond [puppet] - 10https://gerrit.wikimedia.org/r/173756 (owner: 10Giuseppe Lavagetto) [08:13:08] going to upgrade Jenkins in a few minutes, it will be unavailable for a few. 
[08:15:04] <_joe_> hashar: ok [08:21:16] !log Upgrading Jenkins [08:21:23] Logged the message, Master [08:28:40] PROBLEM - Apache HTTP on mw1129 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50412 bytes in 0.035 second response time [08:30:04] Respected human, time to deploy Infra upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T0830). Please do the needful. [08:30:27] <_joe_> mmmh [08:30:55] !log reimaging cerium [08:31:00] Logged the message, Master [08:31:36] <_joe_> PHP Fatal error: Base lambda function for closure not found in /srv/mediawiki/php-1.25wmf7/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php on line 18 [08:31:40] <_joe_> sigh [08:31:40] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [08:31:46] USE ALL THE FEATURES? [08:31:49] akosiaris: you got a shoutout in [08:32:09] <_joe_> ori: hi [08:32:21] ori: yeah I noticed :-) [08:32:40] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [08:32:50] base lambda function for closure does not implement abstract factory interface stream protocol [08:32:55] _joe_: hey, morning [08:33:27] <_joe_> every time we reload apache a to of those errors spawn [08:33:35] <_joe_> it's an apc issue I guess [08:34:18] <_joe_> ori: I'd move to 25% of anons on HHVM at 5PM UTC [08:34:34] * YuviPanda adds _joe_ to https://phabricator.wikimedia.org/T1291 for after firefighting, for opinions. I can / shall implement [08:34:36] <_joe_> that should be your 9 AM? [08:35:16] <_joe_> YuviPanda: I'll take a look today [08:35:17] something very peculiar about our blog btw is that one may click the RSS button expecting to get the RSS feed for the blog but quite the contrary, it lists the "soon to be added :P" RSS feeds we want to ? [08:35:23] _joe_: thanks! [08:35:26] [08:35:29] hiera file for labs vide stuff [08:35:37] can we do 6pm? 
that gives the caffeine a bit of time to get absorbed [08:35:52] <_joe_> 6PM UTC? ok [08:36:01] <_joe_> but you'll have to do most monitoring [08:36:02] 6PM PST! [08:36:06] <_joe_> afterwards [08:36:24] <_joe_> as it's 7 PM here, and I have ops and mwcore in the evening [08:36:36] <_joe_> but it's super-cool for me [08:36:55] no ops today :) [08:37:15] etheerrrrpaaaad! [08:37:25] also I wonder if we should switch ops meetings to mumble. lower bandwidth! [08:37:26] <_joe_> paravoid: oh right [08:37:50] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [08:39:17] !log Jenkins upgraded [08:39:20] Logged the message, Master [08:40:10] PROBLEM - RAID on cerium is CRITICAL: Connection refused by host [08:40:20] PROBLEM - check if dhclient is running on cerium is CRITICAL: Connection refused by host [08:40:29] PROBLEM - check configured eth on cerium is CRITICAL: Connection refused by host [08:40:39] PROBLEM - SSH on cerium is CRITICAL: Connection refused [08:40:39] PROBLEM - check if salt-minion is running on cerium is CRITICAL: Connection refused by host [08:40:39] PROBLEM - DPKG on cerium is CRITICAL: Connection refused by host [08:41:09] PROBLEM - Disk space on cerium is CRITICAL: Connection refused by host [08:41:10] PROBLEM - puppet last run on cerium is CRITICAL: Connection refused by host [08:41:12] that reimaging might have been way too fast... [08:41:23] neon did not have enough time to get the puppet changes... [08:41:38] I may have overautomated stuff [08:41:48] hah [08:41:57] or you underautomated it ;) [08:42:20] hehe [08:42:29] true... it is always based on the POV [08:42:45] so let's switch POV... how to automate neon getting the changes.... [08:43:18] <_joe_> akosiaris: did you automate more upon wmf-reimage? 
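The "neon did not have enough time to get the puppet changes" problem akosiaris hits above follows from how the icinga config is assembled: each monitored host declares its checks, but they only materialize in puppet_services.cfg when the monitoring host itself runs Puppet. In production this was done by naggen querying the puppet database rather than native collection, but the timing issue is the same as in this classic exported-resources sketch (file and service names assumed for illustration):

```puppet
# On the icinga host: realize every exported nagios_service check.
# The generated file only changes when *this* node runs its agent, so
# a freshly reimaged host's checks lag behind until the collector's
# next Puppet run.
Nagios_service <<| |>> {
    target => '/etc/icinga/puppet_services.cfg',
    notify => Service['icinga'],
}
```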
[08:43:23] (03PS1) 10Ori.livneh: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 [08:43:33] _joe_: yeah, pushing commit now [08:43:41] _joe_: could also do it now :) [08:44:05] i'm more alert now than i will be then [08:44:11] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.347 second response time [08:44:15] <_joe_> ori: ahah, ok cool [08:44:32] YuviPanda: that one's for you ^^ [08:44:39] yeah, thanks! :) [08:44:48] now question is realm branching vs feature branching [08:45:09] let me test it anywa [08:45:09] y [08:45:36] testing shmesting, the code looks pretty [08:45:49] it must be right [08:46:02] :D [08:46:40] ori: aha, so this patch also hit the same thing my similar patch hit [08:46:40] Error: Failed to apply catalog: Could not find dependent Service[eventlogging/init] for File[/etc/eventlogging.d/consumers/mysql-m2-master] at /etc/puppet/modules/eventlogging/manifests/service/consumer.pp:50 [08:46:51] <_joe_> lol [08:46:53] which is a bit weird, and I gave up on friday night [08:47:03] hmmmmmm [08:47:15] my patch being https://gerrit.wikimedia.org/r/#/c/173634/, and the last patchset there is stupid, at which point I realized I should sleep. [08:48:16] (03PS1) 10Giuseppe Lavagetto: 25% of anonymous traffic to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 [08:49:28] (03PS2) 10Ori.livneh: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 [08:49:58] !log if there is any oddity with Jenkins/Zuul please poke me. I am on IRC all day today [08:50:03] Logged the message, Master [08:50:09] (03CR) 10Ori.livneh: [C: 031] "1 in 4 anonymous users agree." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 (owner: 10Giuseppe Lavagetto) [08:51:13] (03CR) 10Giuseppe Lavagetto: [C: 032] 25% of anonymous traffic to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 (owner: 10Giuseppe Lavagetto) [08:51:21] (03Merged) 10jenkins-bot: 25% of anonymous traffic to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 (owner: 10Giuseppe Lavagetto) [08:51:23] does one know when more hardware for labs coming in ? [08:52:08] labs is virtual, so it depends on the Cloud [08:52:20] peace be upon it [08:52:58] (sorry. no, i don't.) [08:53:00] PROBLEM - NTP on cerium is CRITICAL: NTP CRITICAL: No response from NTP server [08:53:22] mutante: http://www.weather.com/weather/today/37.540726,-77.436050?par=googleonebox,looks like more clouds are coming to eqiad [08:53:27] !log oblivian Synchronized wmf-config/CommonSettings.php: Open HHVM to 25% of anons (duration: 00m 06s) [08:53:29] Logged the message, Master [08:54:30] matanya: RobH / coren / andrewbogott would know about new hardware order [08:54:47] why do you ask? [08:55:11] I need more horse power, and andrew told me it should arrive, but didn;t tell me the date [08:55:27] so i'm holding my cod changes for now [08:55:30] code [08:55:54] how much more horse power do you need ? [08:56:11] about 4 more cores, and 16 GB RAM [08:56:14] what for? is this the video testing system? [08:56:19] we should already have that [08:56:20] yes, it is YuviPanda [08:56:27] video encoding [08:56:29] what YuviPanda said [08:56:39] not conf system [08:56:41] holding my cod: http://bostonfishingcharters.com/images/mikem45cod.jpg [08:56:54] ahaha [08:56:56] yes, that cod ori ! :D [08:56:57] ori: PS2 ran into the error my patch ran into after I fixed that first error, which is... 
[08:56:57] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/nagios/plugins/check_eventlogging_jobs] for Nrpe::Monitor_service[eventlogging] at /etc/puppet/manifests/role/eventlogging.pp:171 [08:57:10] I... should've said earlier. [08:58:07] akosiaris paravoid : https://commons.wikimedia.org/wiki/File:The_Hobo_1917_OLIVER_BABE_HARDY_BILLY_WEST_Arvid_E_Gillstrom.webm <- took about a day to re-encode [08:59:01] mutante: labs cores are wimpier than real cores [08:59:26] YuviPanda: mutante != matanya :) [08:59:45] gah [08:59:51] I've been making that for about a year now [09:00:19] https://commons.wikimedia.org/wiki/File:Out_West_1918_FATTY_ARBUCKLE_BUSTER_KEATON.webm <-- also about a day [09:00:40] what are you encoding them to? [09:00:42] H264? [09:00:47] (03PS3) 10Ori.livneh: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 [09:00:59] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [09:01:05] webm [09:01:10] from mp4 [09:01:18] aaaaha [09:01:21] YuviPanda: try PS3? [09:01:26] ori: yup, just cherry picked [09:01:32] thanks much [09:01:36] Stream #0:0 -> #0:0 (h264 -> libvpx) [09:01:36] Stream #0:1 -> #0:1 (aac -> libvorbis) [09:01:47] mutante: ah, youtube -> commons? [09:01:51] yes [09:01:56] and internet archive [09:02:01] ori: yup, that runs cleanly :) [09:02:04] and vimeo [09:02:11] and million other sources [09:02:20] YuviPanda: (shameless plug) do you know about pcc? [09:02:20] matanya: anyway, if 16G + 4Cores is what you need, we can increase your project's limit now [09:02:22] and it should be ok [09:02:34] ori: the C compiler? [09:02:36] andrewbogott request not to do it [09:02:43] matanya: aaaaho, I see. [09:02:50] ori: wait, puppet compiler? 
[09:02:52] i'm already beyond limits [09:03:00] YuviPanda: no :) there's a cli interface for _joe_'s catalog compiler in operations/puppet repo root [09:03:02] using 12 cores and 16 GB RAM [09:03:07] ori: aha! [09:03:21] nice [09:03:22] YuviPanda: https://asciinema.org/a/11986 [09:04:21] woah nice! [09:04:24] <_joe_> "asciinema" is pure genius as a domain [09:04:25] asciinema is also nice :) [09:04:27] <_joe_> YuviPanda: it is [09:04:27] yeah [09:04:40] * YuviPanda presumes ori built asciinema too [09:04:47] i wish [09:05:06] slacker [09:05:11] :) [09:05:24] ori: mind if I convert the realm branching to feature flag branching? [09:06:11] yeah, i'm not totally sold on that approach yet [09:06:33] hmm, are you worried about flag proliferation? [09:06:34] it's an interesting idea, though, and it may be right [09:07:30] well, the thing i don't love about it is that it makes things less explicit [09:08:07] <_joe_> ori: how? [09:08:08] if you have a role class from which you toggle bits on and off, there is an authoritative place to go to to read about the setup [09:08:31] <_joe_> I think _that_ should be done in the module classes, usually [09:08:38] <_joe_> or, we could use environments [09:09:11] <_joe_> so we have role::something in the base path [09:09:44] if i have module Foo and I activate module Bar, how should I know to expect module Foo to enable some previously-excluded functionality because it now feature-detects module Bar? [09:09:49] <_joe_> then we have that plus the prod specific stuff in the production enfironment [09:10:00] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [09:10:10] that is the worst host name [09:10:13] <_joe_> uh? 
parse error [09:10:13] !log praseodymium reimaging [09:10:18] Logged the message, Master [09:12:03] _joe_: feature-detection means modules which are already provisioned on a host and which one may consider stable can react to the introduction of a new, seemingly unrelated module [09:13:18] <_joe_> ori: nope, once you use hiera [09:13:26] <_joe_> and properly namespace variables [09:13:42] +1 hiera, has_ganglia isn't set in ganglia but in hiera [09:13:47] <_joe_> or maybe I didn't get your point, which is really probable [09:15:10] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [09:17:29] PROBLEM - check configured eth on praseodymium is CRITICAL: Connection refused by host [09:17:30] PROBLEM - RAID on praseodymium is CRITICAL: Connection refused by host [09:17:40] PROBLEM - check if dhclient is running on praseodymium is CRITICAL: Connection refused by host [09:17:50] PROBLEM - puppet last run on praseodymium is CRITICAL: Connection refused by host [09:17:50] PROBLEM - SSH on praseodymium is CRITICAL: Connection refused [09:18:10] PROBLEM - DPKG on praseodymium is CRITICAL: Connection refused by host [09:18:11] PROBLEM - check if salt-minion is running on praseodymium is CRITICAL: Connection refused by host [09:29:40] PROBLEM - NTP on praseodymium is CRITICAL: NTP CRITICAL: No response from NTP server [09:35:11] (03Abandoned) 10Yuvipanda: eventlogging: Make ganglia usage optional [puppet] - 10https://gerrit.wikimedia.org/r/173634 (https://bugzilla.wikimedia.org/73479) (owner: 10Yuvipanda) [09:37:34] (03CR) 10Nikerabbit: Add read only configuration for ElasticSearchTTMServer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 (owner: 10Nikerabbit) [09:39:27] jzerebecki: ping? 
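The `has_ganglia` approach YuviPanda and _joe_ converge on above looks roughly like this from a module's point of view (class names hypothetical): the toggle lives in hiera, properly namespaced per _joe_'s point, defaulting to true for production while the labs hierarchy sets it to false.

```puppet
# Hypothetical consumer of the hiera toggle: guard the ganglia wiring
# so labs hosts (has_ganglia: false in their hiera hierarchy) skip it
# without the module hard-coding any knowledge of the realm.
class eventlogging::monitoring {
    $has_ganglia = hiera('has_ganglia', true)

    if $has_ganglia {
        include eventlogging::monitoring::ganglia
    }
}
```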
[09:40:36] (03PS1) 10KartikMistry: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 [09:41:55] (03PS2) 10Nikerabbit: Add read only configuration for ElasticSearchTTMServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 [09:42:40] (03PS2) 10KartikMistry: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 [09:44:44] (03PS3) 10Nikerabbit: Add read only configuration for ElasticSearchTTMServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 [09:46:43] (03PS1) 10Nikerabbit: Group translate-proofr was removed from Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173766 [09:48:01] (03CR) 10Nemo bis: [C: 031] Group translate-proofr was removed from Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173766 (owner: 10Nikerabbit) [09:50:59] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [09:51:17] <_joe_> !log restarting mw1187, all apache children stuck in apc_pthreadmutex_lock() [09:51:20] Logged the message, Master [10:05:52] RECOVERY - Disk space on db1017 is OK: DISK OK [10:07:02] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [10:07:48] hm [10:16:00] (03PS2) 10Ori.livneh: allow multiple sudo::user grants for same user [puppet] - 10https://gerrit.wikimedia.org/r/173629 [10:16:07] (03CR) 10Ori.livneh: [C: 032 V: 032] allow multiple sudo::user grants for same user [puppet] - 10https://gerrit.wikimedia.org/r/173629 (owner: 10Ori.livneh) [10:17:35] alright, that's all of betacluster accounted for, I think [10:17:47] now to see why tools-webproxy thinks its tools [10:18:17] (03PS2) 10Ori.livneh: keyholder: add icinga check [puppet] - 10https://gerrit.wikimedia.org/r/173633 [10:20:18] (03CR) 10Ori.livneh: [C: 032 V: 032] keyholder: add icinga check [puppet] - 10https://gerrit.wikimedia.org/r/173633 (owner: 10Ori.livneh) 
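The sudo change ori merges above ("allow multiple sudo::user grants for same user") implies resources keyed by their title rather than by the account name. A hypothetical usage, assuming the define accepts `user` and `privileges` parameters:

```puppet
# Two independent grants for the same account: distinct resource
# titles avoid a duplicate-declaration error at compile time, while
# both still render sudoers entries for mwdeploy.
sudo::user { 'mwdeploy_scap':
    user       => 'mwdeploy',
    privileges => ['ALL = (mwdeploy) NOPASSWD: ALL'],
}

sudo::user { 'mwdeploy_restart_hhvm':
    user       => 'mwdeploy',
    privileges => ['ALL = (root) NOPASSWD: /usr/sbin/service hhvm restart'],
}
```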
[10:20:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [10:25:28] i'm going to verify alerts for keyholder (shared deployment ssh agent daemon) work by clearing keys from the agent, if it works we'll get an alert for tin [10:29:31] (03CR) 10Nikerabbit: Add support for woff2 files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [10:37:31] (03PS3) 10KartikMistry: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 [10:39:08] (03CR) 10Giuseppe Lavagetto: [C: 031] Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [10:51:32] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [10:54:27] <_joe_> kart_: I can +2 that change if you don't need someone else's review [10:56:41] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [11:01:23] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (100.00%) [11:03:22] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [11:04:12] RECOVERY - SSH on praseodymium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:04:22] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [11:14:53] RECOVERY - check configured eth on praseodymium is OK: NRPE: Unable to read output [11:15:11] RECOVERY - Disk space on praseodymium is OK: DISK OK [11:15:11] RECOVERY - RAID on praseodymium is OK: OK: no disks configured for RAID [11:15:31] RECOVERY - check if dhclient is running on praseodymium is OK: PROCS OK: 0 processes with command name dhclient [11:15:41] RECOVERY - DPKG on praseodymium is OK: All packages OK [11:15:42] RECOVERY - check if salt-minion is running on praseodymium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:16:31] RECOVERY - puppet 
last run on praseodymium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:17:41] RECOVERY - SSH on cerium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:21:01] (03PS1) 10Ori.livneh: Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 [11:21:09] (03PS2) 10Ori.livneh: Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 [11:21:26] (03CR) 10Ori.livneh: [C: 032] Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 (owner: 10Ori.livneh) [11:21:36] (03Merged) 10jenkins-bot: Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 (owner: 10Ori.livneh) [11:22:32] !log ori Synchronized wmf-config/InitialiseSettings.php: Ied0a7ab4b: Route Bug40009 logs to fluorine (duration: 00m 07s) [11:22:38] Logged the message, Master [11:22:41] RECOVERY - Disk space on cerium is OK: DISK OK [11:22:51] RECOVERY - check if dhclient is running on cerium is OK: PROCS OK: 0 processes with command name dhclient [11:22:52] RECOVERY - RAID on cerium is OK: OK: no disks configured for RAID [11:23:02] RECOVERY - DPKG on cerium is OK: All packages OK [11:23:02] PROBLEM - Host xenon is DOWN: PING CRITICAL - Packet loss = 100% [11:23:02] RECOVERY - check if salt-minion is running on cerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:23:11] RECOVERY - check configured eth on cerium is OK: NRPE: Unable to read output [11:23:43] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet has 1 failures [11:25:04] hashar: mediawiki-vendor-integration tests run for nearly six minutes -- is it not possible to skip them on changes that aren't related? [11:25:20] Warning: Duplicate definition found for service 'etherpad.wikimedia.org' on host 'zirconium' :-( [11:25:34] I wonder for how long icinga has not been picking up changes... 
[11:25:52] <_joe_> akosiaris: a few days tops [11:26:09] <_joe_> I've been doing changes on icinga until thursday and I keep an eye on that [11:26:13] too much already ... [11:26:52] that does it... I 'll write an icinga check to check icinga today... [11:27:18] <_joe_> akosiaris: that's because people think your work ends with puppet-merge [11:27:28] <_joe_> but yes, a check may help us [11:27:29] !log ori Synchronized php-1.25wmf8/includes/Import.php: Icc19961fd: 'Debugging statements to try to diagnose bug 40009' (duration: 00m 08s) [11:27:32] Logged the message, Master [11:27:57] <_joe_> akosiaris: also, it could be some form of puppet failure here [11:28:06] <_joe_> and I could avoid that with naggen [11:28:11] RECOVERY - Host xenon is UP: PING OK - Packet loss = 0%, RTA = 3.22 ms [11:30:21] PROBLEM - check configured eth on xenon is CRITICAL: Connection refused by host [11:30:21] PROBLEM - check if salt-minion is running on xenon is CRITICAL: Connection refused by host [11:30:32] PROBLEM - puppet last run on xenon is CRITICAL: Connection refused by host [11:30:52] PROBLEM - DPKG on xenon is CRITICAL: Timeout while attempting connection [11:31:01] PROBLEM - check if dhclient is running on xenon is CRITICAL: Timeout while attempting connection [11:31:14] PROBLEM - Disk space on xenon is CRITICAL: Timeout while attempting connection [11:31:14] PROBLEM - RAID on xenon is CRITICAL: Timeout while attempting connection [11:31:22] PROBLEM - SSH on xenon is CRITICAL: Connection timed out [11:33:21] RECOVERY - SSH on xenon is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:33:31] RECOVERY - NTP on praseodymium is OK: NTP OK: Offset -0.00564622879 secs [11:41:01] RECOVERY - NTP on cerium is OK: NTP OK: Offset -0.04714632034 secs [11:42:11] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:43:03] PROBLEM - NTP on xenon is CRITICAL: NTP CRITICAL: No response from NTP server [11:49:37] 
(03PS1) 10Filippo Giunchedi: rename db1017 into graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/173789 [11:51:33] RECOVERY - Disk space on xenon is OK: DISK OK [11:51:33] RECOVERY - RAID on xenon is OK: OK: no disks configured for RAID [11:51:52] RECOVERY - check if salt-minion is running on xenon is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:51:52] RECOVERY - check configured eth on xenon is OK: NRPE: Unable to read output [11:52:23] RECOVERY - DPKG on xenon is OK: All packages OK [11:52:31] RECOVERY - check if dhclient is running on xenon is OK: PROCS OK: 0 processes with command name dhclient [12:04:12] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:05:32] (03PS2) 10Filippo Giunchedi: rename db1017 into graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/173789 [12:05:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] rename db1017 into graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/173789 (owner: 10Filippo Giunchedi) [12:08:33] (03PS1) 10Filippo Giunchedi: eqiad: rename db1017 to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/173790 [12:08:56] (03PS2) 10Filippo Giunchedi: eqiad: rename db1017 to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/173790 [12:09:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] eqiad: rename db1017 to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/173790 (owner: 10Filippo Giunchedi) [12:09:41] RECOVERY - NTP on xenon is OK: NTP OK: Offset -0.01121199131 secs [12:12:42] <_joe_> akosiaris: did you unbreak icinga? [12:13:20] _joe_: not yet... fighting with it... something really weird is going on... [12:13:31] <_joe_> akosiaris: can I help maybe? [12:13:42] it started complaining the main config file name looks suspicious [12:14:05] you might... I 've narrowed it down to something in puppet_services.cfg [12:14:09] not sure what yet... 
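The duplicate akosiaris tracks down shortly after this comes from role/etherpad.pp declaring two monitor_service resources with the same description; icinga keys a service on host plus description, so the generated stanzas collide. Schematically (the check commands here are invented for illustration):

```puppet
# Before the dedup patch: unique Puppet resource titles, but identical
# descriptions -> two identically named icinga services on zirconium,
# hence "Duplicate definition found for service 'etherpad.wikimedia.org'".
monitor_service { 'etherpad-http':
    description   => 'etherpad.wikimedia.org',
    check_command => 'check_http_url!etherpad.wikimedia.org!/',
}
monitor_service { 'etherpad-https':
    description   => 'etherpad.wikimedia.org',   # duplicate description
    check_command => 'check_https_url!etherpad.wikimedia.org!/',
}
# The fix is simply to make each description distinct, e.g.
# 'etherpad.wikimedia.org' and 'etherpad.wikimedia.org HTTPS'.
```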
[12:14:26] <_joe_> icinga -v /etc/icinga/icinga.cfg seems to say it's all right [12:14:38] now... wait a bit [12:14:43] (03PS1) 10Alexandros Kosiaris: Fixes for raid0-lvm partman config [puppet] - 10https://gerrit.wikimedia.org/r/173792 [12:14:50] I am repopulating puppet_services.cfg [12:14:58] <_joe_> how? [12:15:01] well running puppet and puppet does that but anyway... [12:15:22] I emptied the file on purpose... [12:15:42] <_joe_> ok [12:15:48] <_joe_> if we have duplicate defs [12:16:01] * YuviPanda has enjoyed the shinken approach, which ends up with far less total amount of config files [12:16:05] <_joe_> I think the only place to look for them is the puppet database [12:16:53] I think I found something weird as well in the puppet tree [12:16:59] but it should have bitten us long ago [12:17:09] <_joe_> akosiaris: maybe it's my doing [12:17:10] manifests/role/etherpad.pp [12:17:16] <_joe_> oh no ok [12:17:18] the two monitor_service resources [12:17:26] got the same description [12:17:32] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [12:17:49] !log final reboot for xenon, cerium, praseodymium after a dist-upgrade -y [12:17:51] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [12:17:57] Logged the message, Master [12:18:01] PROBLEM - Host xenon is DOWN: PING CRITICAL - Packet loss = 100% [12:18:09] <_joe_> akosiaris: mmmh [12:18:32] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [12:18:32] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [12:18:32] RECOVERY - Host xenon is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [12:18:45] ok, the first error is back now [12:18:53] Warning: Duplicate definition found for service 'etherpad.wikimedia.org' on host 'zirconium' (config file '/etc/icinga/puppet_services.cfg' [12:18:56] fixing that [12:20:10] (03CR) 10Alexandros Kosiaris: [C: 032] Fixes for raid0-lvm partman config [puppet] - 10https://gerrit.wikimedia.org/r/173792 
(owner: 10Alexandros Kosiaris) [12:22:53] (03PS1) 10Alexandros Kosiaris: Deduplicate etherpad's monitor_service checks [puppet] - 10https://gerrit.wikimedia.org/r/173793 [12:23:50] <_joe_> akosiaris: ok, if having the same description is a problem, our define is broken [12:24:04] (03CR) 10Giuseppe Lavagetto: [C: 031] Deduplicate etherpad's monitor_service checks [puppet] - 10https://gerrit.wikimedia.org/r/173793 (owner: 10Alexandros Kosiaris) [12:25:46] (03CR) 10Alexandros Kosiaris: [C: 032] Deduplicate etherpad's monitor_service checks [puppet] - 10https://gerrit.wikimedia.org/r/173793 (owner: 10Alexandros Kosiaris) [12:26:27] hmmm [12:26:29] Alexandros Kosiaris^O: Deduplicate etherpad's monitor_service checks (40236f3) [12:26:34] notice the ^O [12:26:43] and then... [12:26:45] From https://gerrit.wikimedia.org/r/p/operations/puppet [12:26:45] c7832b8..5a0c2d1 production -> origin/production [12:26:45] *** Please tell me who you are. [12:26:49] interesting... [12:27:02] gerrit ? dafuq are you doing ? [12:34:27] _joe_: yes you might be of help after all [12:34:37] so after merging the above fix [12:34:42] ***> The name of the main configuration file looks suspicious... [12:34:53] <_joe_> oh that [12:35:01] <_joe_> I've already seen that [12:35:14] <_joe_> 1 sec [12:35:19] and mv puppet_services.cfg keep.cfg ; grep -v servicegroups keep.cfg > puppet_services.cfg [12:35:22] fixed it [12:35:36] so the problem is the empty servicegroups directive on every single entry [12:36:15] <_joe_> mmmh [12:36:28] <_joe_> not sure I understood that [12:36:52] look at the keep.cfg file at neon:/etc/icinga/keep.cfg [12:37:05] every single stanza has an empty servicegroups directive... [12:37:10] something is wrong there [12:37:17] now icinga's reporting is awful [12:37:20] <_joe_> who creates that file? [12:37:25] nobody [12:37:32] I did just now.. [12:37:50] we can remove it, all I wanted is a copy of puppet_services.cfg [12:38:10] it is not even parsed or something. 
Next time I will call it lala.cfg :P [12:38:11] <_joe_> ok so, why are servicegroups empty? [12:38:18] <_joe_> ahah ok [12:38:26] that is what I am searching now ... [12:38:32] <_joe_> this is most likely a bug in naggen [12:38:39] <_joe_> lemme check that [12:38:42] maybe some hiera change ? [12:38:52] <_joe_> akosiaris: no, I just rechecked that [12:38:58] <_joe_> the old logic was [12:39:14] <_joe_> https://github.com/wikimedia/operations-puppet/blob/8918dd86536e3e9f4fe17bd2934820b0693e0290/manifests/nagios.pp [12:39:44] <_joe_> so only monitor_service instances with a) nagios_group set globally or b) $group set explicitly [12:39:48] <_joe_> would get one [12:39:55] <_joe_> I suggest a small change [12:40:08] $group = hiera('nagios_group', undef), [12:40:10] right ? [12:40:11] <_joe_> but lemme commit another change first [12:40:31] <_joe_> akosiaris: yes that's the translation of the old code you find at the link I posted [12:41:32] (03PS1) 10ArielGlenn: stat1002, 1003 access for bmansurov (rt #8852) [puppet] - 10https://gerrit.wikimedia.org/r/173794 [12:41:38] <_joe_> the oldest version, before any of my puppet3 changes, does the same [12:41:43] <_joe_> https://github.com/wikimedia/operations-puppet/blob/165356c0f16edca88754683830c51d6489df47eb/manifests/nagios.pp#L86 [12:42:15] <_joe_> so, I have an easy fix [12:42:25] <_joe_> but let me do something else first [12:42:44] sure, going to lunch in the meantine :-) [12:43:35] (03PS2) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 [12:44:41] (03CR) 10ArielGlenn: [C: 032] stat1002, 1003 access for bmansurov (rt #8852) [puppet] - 10https://gerrit.wikimedia.org/r/173794 (owner: 10ArielGlenn) [12:45:36] <_joe_> icinga.cfg:cfg_file=/etc/nagios/puppet_servicegroups.cfg [12:45:40] * _joe_ facepalms [12:45:48] <_joe_> notice the directory [12:46:13] (03PS3) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] 
- 10https://gerrit.wikimedia.org/r/173322 [12:48:05] (03PS1) 10ArielGlenn: Revert "stat1002, 1003 access for bmansurov (rt #8852)" [puppet] - 10https://gerrit.wikimedia.org/r/173798 [12:48:44] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [12:48:55] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: puppet fail [12:50:50] ignore that, I'm not reverting it (I don't think) [12:51:10] unless that change somehow broke puppet but I don't see how [12:56:45] (03CR) 10QChris: "Changes in aggregator repository have been merged." [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [13:08:00] apergos: search outage on en.wiki [13:08:01] An error has occurred while searching: Pool queue is full [13:08:20] ugh [13:08:37] probably should poke ^demon|away and manybubbles too :) [13:08:44] <_joe_> matanya: no [13:08:53] <_joe_> matanya: starting on wednesday, maybe [13:09:03] means ? [13:09:09] still lsearhd ? [13:09:20] * lsearchd [13:09:26] <_joe_> yes [13:09:48] ok, so whoever should address this, FYI :) [13:10:20] <_joe_> matanya: btw cannot reproduce [13:10:34] I was about to ask if it's the same on retry [13:10:34] every 5 or so searches now [13:10:40] mm [13:12:17] and now totally fine [13:12:20] <_joe_> I still need to get one [13:12:28] miracles [13:13:26] I'll take a short hiccup over a lasting problem any day [13:15:06] <_joe_> I'll grab some lunch as well, bbl [13:15:19] thanks both [13:35:35] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:38:36] !log ran puppetstoredconfigclean.rb on db1017, it must have been missed in the rename [13:38:41] Logged the message, Master [13:39:30] anything removed by that which shouldn't be, will get put back on the graphite puppet run [13:41:45] (03Abandoned) 10ArielGlenn: Revert "stat1002, 1003 access for bmansurov (rt #8852)" [puppet] -
10https://gerrit.wikimedia.org/r/173798 (owner: 10ArielGlenn) [13:42:44] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:49:51] (03PS1) 10ArielGlenn: stat1003 access for rmoen (rt #8870) [puppet] - 10https://gerrit.wikimedia.org/r/173811 [13:50:18] (03PS2) 10ArielGlenn: stat1003 access for rmoen (rt #8870) and update ssh key [puppet] - 10https://gerrit.wikimedia.org/r/173811 [13:52:38] (03CR) 10ArielGlenn: "don't merge til I have checked in with the user about the key update" [puppet] - 10https://gerrit.wikimedia.org/r/173811 (owner: 10ArielGlenn) [14:02:19] I need to step out for a bit, back in a little while [14:20:31] <^demon|away> Freaking lsearchd :( [14:20:34] <^demon|away> What's up? [14:21:48] <^d> Oh, pool queue. [14:31:02] !log Jenkins/Zuul: disconnected/reconnected Jenkins Gearman client [14:31:06] Logged the message, Master [14:35:22] ^d: what does it mean anyway? the queue of the pool is full ? [14:35:37] <^d> matanya: Yep, too many people trying to go swimming :) [14:35:47] :) [14:36:16] <^d> https://wikitech.wikimedia.org/wiki/PoolCounter [14:38:46] thanks ^d [14:38:54] <^d> yw [15:01:07] (03PS4) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 [15:01:09] (03PS1) 10Giuseppe Lavagetto: icinga: give a non-null default group to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173825 [15:01:11] (03PS1) 10Giuseppe Lavagetto: monitoring: add config class [puppet] - 10https://gerrit.wikimedia.org/r/173826 [15:04:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:17:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:20:54] (03PS5) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 [15:47:22] <^d> 
Reedy: You about? [15:49:15] (03CR) 10Giuseppe Lavagetto: [C: 032] nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 (owner: 10Giuseppe Lavagetto) [15:50:04] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/#/c/173825/ [15:50:15] <_joe_> this should fix the absent servicegroup somehow [15:59:03] <^d> jamesofur: I left some comments on your patch :) [16:00:02] oh? Must have been recent [16:00:05] anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T1600). [16:00:05] * jamesofur looks [16:00:08] <^d> Like, 10 minutes ago :) [16:00:15] <^d> I got swat jouncebot [16:00:16] ah, yup, I just got to office [16:01:16] (03CR) 10Chad: [C: 032] Send more update jobs to Elasticsearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173347 (owner: 10Manybubbles) [16:01:48] (03Merged) 10jenkins-bot: Send more update jobs to Elasticsearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173347 (owner: 10Manybubbles) [16:03:04] ^d: totally expensive, but necessary from what I can tell, this is stuff mostly reusing another script that exists already (but allowing me to do the mainspace) and this script itself existed in the past. I can try to make these changes if we need too :-/ but I need to get this out like... now [16:03:31] !log demon Synchronized wmf-config/CirrusSearch-common.php: more jobs (duration: 00m 04s) [16:03:34] creating voter lists has always been expensive sadly :( sometimes running ages [16:03:35] Logged the message, Master [16:04:12] (which is why I think they originally wrote it to use the slave for reading) [16:04:20] <^d> *nod* [16:04:21] <^d> True. [16:05:43] <^d> In that case, just add the wfWaitForSlaves() call after the insert() and I think it'll be fine. 
[16:05:49] * jamesofur nods [16:08:51] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: give a non-null default group to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173825 (owner: 10Giuseppe Lavagetto) [16:09:05] * ^d looks for things to stab jenkins with [16:09:27] !log Renamed job mediawiki-vendor-integration to mediawiki-phpunit {{bug|72787}} [16:09:29] Logged the message, Master [16:11:18] !log demon Synchronized php-1.25wmf7/extensions/CirrusSearch: (no message) (duration: 00m 05s) [16:11:21] Logged the message, Master [16:11:22] ^d: https://gerrit.wikimedia.org/r/#/c/172889/ look right? [16:11:28] !log demon Synchronized php-1.25wmf8/extensions/CirrusSearch: (no message) (duration: 00m 04s) [16:11:30] Logged the message, Master [16:11:45] (and what's the easiest way to do that for the cherry picks? Re cherry pick or amend each one quickly?) [16:12:19] <^d> neither really. whichever you can do easiest. [16:12:29] <^d> merged into master. [16:12:36] for something this quick the amend is pretty quick, /fixes [16:12:40] appreciate it! [16:12:54] <^d> well, +2'd. [16:12:58] <^d> :) [16:13:01] heh [16:13:02] <^d> jenkins shall merge sometime. [16:13:06] it shall [16:13:16] our robot overlords [16:18:45] <^d> jamesofur: I'm doing the cherry picks to core for you :) [16:19:17] ^d: why thank you :) if you can point me to anything that I did wrong I'd appreciate it ;) [16:19:35] <^d> You did the cherry picks to the branches just right, no worries there. [16:19:53] <^d> I'm just doing the submodule updates for core.
[16:19:59] <^d> (cherry pick was bad word, sorry) [16:20:25] ahh cool [16:20:41] no worries ;) I generally assume I only know a bit of it and likely screwed something up ;) [16:29:56] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [16:36:16] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:38:05] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 1.00% above the threshold [60.0] [16:38:06] !log demon Synchronized php-1.25wmf7/extensions/SecurePoll/: (no message) (duration: 00m 05s) [16:38:09] Logged the message, Master [16:38:17] !log demon Synchronized php-1.25wmf8/extensions/SecurePoll/: (no message) (duration: 00m 05s) [16:38:20] Logged the message, Master [16:38:21] <^d> jamesofur: Ok you're all live [16:38:26] <3 [16:38:31] thanks much [16:38:58] !log upload etherpad-lite_1.4.1-1 on apt.wikimedia.org [16:39:01] Logged the message, Master [16:42:24] <^d> jamesofur: yw [16:44:36] springle: are you on top of the labs replication issues that came up over the weekend? [16:50:38] Hm… Coren, same question? [16:51:32] andrewbogott: I'm next to it, giving it a frowny look - but I wanted Sean to opine first before I started trying to mess with it. [16:51:50] I know he's been battling some issues with replication and I don't want to risk destroying data he needs. [16:52:38] Coren: OK… what about the bugs about missing tables &c? [16:52:47] Or is that somehow the same issue? [16:53:08] It's not; that just needs an update to maintain-replicas to add the new stuff. That's on my todo for today. [16:53:34] ok! [16:54:27] I was having a hard time deciding if "Um, it's the weekend, take it easy" was a fair or unfair response to those email threads. [16:55:11] andrewbogott: I have made some effort to actually take days off on weekend some of the time lately; otherwise ima burn out. 
[16:56:22] Well, clearly you get to take weekends off :) I'm just wondering if, in the long run, a weekend outage (or replag, or whatever) is part of the expected and advertised toollabs package, or if we need to figure out some way to respond to them. [16:57:13] * YuviPanda should spend some time learning about our replica setup as well [16:57:14] andrewbogott: It might be reasonable to talk to Mark and discuss wether it makes sense to stagger our work weeks now that there is three of us. [16:57:40] Well, that or officially announce that tools support is bank-hours only :) [16:57:47] YuviPanda: so should I [16:58:00] We can't do 24/24, but 7/7 coverage seems like a reasonable objective. [16:58:04] yeah [16:58:11] * YuviPanda has been trying to take weekends off too [16:58:31] YuviPanda: Right, we just need to define "weekend" right. :-) [16:58:38] hehe [16:58:39] true [16:58:41] Maybe we should have a 'how to fix common labs issues' session in January so that we can get other Ops on board with this stuff. (Well, and also Yuvi and me) [16:58:46] yeah [16:58:57] I don't know enough about SGE, for instance [16:59:27] Sounds good. I'll probably have a ~1h primer for the gridengine for all staffers who want to learn. [16:59:37] apergos: hey! I was told you created the deployment-lucid-salt instance. is it still being used? can't ssh... [17:03:55] Coren, YuviPanda, I'd appreciate a security review of https://gerrit.wikimedia.org/r/#/c/173066/ [17:04:07] Not urgent though [17:07:30] (03CR) 10Yuvipanda: "More minor nits - use single quotes on python strings unless required." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [17:15:58] (03PS1) 10Awight: Set new banner dispatcher vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 [17:22:59] (03PS4) 10Yuvipanda: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [17:24:35] (03PS5) 10Yuvipanda: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [17:24:41] ori: ^ I've changed it to use the feature flag for now, to be consistent with other merged patches. we can change later if necessary. [17:24:52] andrewbogott: Coren can you +1? ^ [17:25:27] (03PS3) 10Yuvipanda: memcached: Make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/173510 [17:26:43] YuviPanda: um, multitasking, but yes, shortly [17:26:45] (03CR) 10Yuvipanda: [C: 032] memcached: Make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/173510 (owner: 10Yuvipanda) [17:26:48] andrewbogott: ty [17:27:29] wtf puppet merge [17:27:33] just keeps printing 'y' on my terminal [17:28:01] Something is piping 'yes' into something else that died. [17:28:17] hmm [17:28:20] worked fine now [17:37:14] hmmm, uhhhhh [17:37:25] analytics1003 is down again [17:37:29] [200423.847532] CPU: 5 PID: 4033 Comm: kafkatee Tainted: G I 3.13.0-39-generic #66-Ubuntu [17:43:44] ori: kern.coredump = 0 is a BSD thing, not present on Linux :( [17:43:50] * YuviPanda lets current workaround stay [17:44:13] "tainted"? [17:44:18] i know, right [17:44:37] weird that it points out kafkatee...i'm not sure what kernel modules or hackery kafkatee could do [17:44:54] ... that normally means you have a kernel module that's not openseource. wth did it come from? 
[17:45:04] Coren, context: i upgraded this machine to Trusty last week [17:45:06] <_joe_> guys, that's a tag [17:45:09] its been being weird since then [17:45:13] <_joe_> "Tainted: G" [17:45:15] also, this is syslog [17:45:24] sorry [17:45:25] haha [17:45:28] cisco [17:45:35] not syslog [17:45:46] (they both start with the same syllable? not sure why I typed that) [17:45:48] * YuviPanda makes Syslog branded servers [17:46:00] <_joe_> and bwt ottomata [17:46:15] so much for being off, _joe_ :) [17:46:25] <_joe_> 'G' means no non-gpl modules are loaded [17:46:25] ok, so, Tainted: G is not relevant? [17:46:35] yeah, just googled for that [17:46:50] that's just informative? [17:47:06] so, maybe this output is just showing that kafkatee running on CPU 5 caused some kernel crash [17:47:09] i will gist the whole output [17:47:09] <_joe_> and Comm: kafkatee doesn't mean it's a cpu module :) [17:47:18] <_joe_> ottomata: exactly [17:47:24] https://gist.github.com/anonymous/c77ed0d899d9af85e787 [17:47:55] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 3.72 ms [17:48:20] do we have any documentation for 'how do I access a prod machine' for people? [17:48:27] people -> WMF employees who just got access to things [17:48:36] <_joe_> ottomata: Oops: 0000 [#1] SMP [17:49:09] ? [17:50:26] bah, i can't schedule downtime in icinga... [17:50:29] Not Authorized [17:50:31] <_joe_> ottomata: this is a kernel oops, happened in executing kafkatee, and you have the call trace [17:50:31] 'grrr [17:50:46] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:56] _joe_: aye, i sent this off to magnus a bit ago [17:54:46] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:55:24] _joe_: can you do things in icinga? or am i the only one having problems [17:55:33] e.g. 
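A side note on the "Tainted: G ... I" header _joe_ decodes above: the letters render the kernel's taint bitmask, which is also exported numerically. A minimal sketch of reading it and decoding bit 0 (assumes a Linux /proc; falls back to 0 elsewhere):

```shell
# 'Tainted: G ... I' in an oops is the taint bitmask rendered as
# letters. The mask is exported at /proc/sys/kernel/tainted; fall back
# to 0 when the file is unavailable (non-Linux, restricted container).
taint=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo 0)

# Bit 0 is TAINT_PROPRIETARY_MODULE: 'P' when set, 'G' when clear
# (all loaded modules are GPL-compatible). The 'I' seen in the gist is
# bit 11, TAINT_FIRMWARE_WORKAROUND (platform firmware bug).
if [ $(( taint & 1 )) -eq 0 ]; then flag=G; else flag=P; fi
echo "taint=$taint module_flag=$flag"
```

As the channel concludes, the flags are purely informative: "Comm: kafkatee" names the process running when the oops fired, not a culprit module.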
try to access the 'View Config' link at the bottom of the left nav [17:55:59] i'd log out and back in, but i'm not sure how [17:57:13] !log starting upgrade to trusty of analytics1013 (having trouble scheduling downtime in icinga right now) [17:57:16] Hi, anyone with knowledge of RL and Varnish caching wanna look over a significant change in how CentralNotice chooses banners? Enabling change is https://gerrit.wikimedia.org/r/#/c/173220/, lots of tests in recent history in CN master, more tests coming... thanks in advance :) [17:57:20] Logged the message, Master [17:58:06] YuviPanda: ori: anyone else: ^ ? [17:58:20] you're looking for bblack :) [17:59:59] YuviPanda: Are you still waiting on reviews for anything? Or did I miss my chance? [18:00:13] andrewbogott: https://gerrit.wikimedia.org/r/#/c/173758/ [18:00:14] :) [18:00:18] YuviPanda: ah thanks! [18:00:27] bblack: ^ ? [18:02:02] Whoah, you have to actually invoke the hiera() function to do lookups? Somehow that's not what I expected... [18:02:19] andrewbogott: no, you do that for just 'raw' variables [18:02:24] andrewbogott: for regular params you don't [18:02:31] andrewbogott: and since has_ganglia is used in a bunch of places... [18:02:35] I'm not sure I understand the difference [18:03:00] Lookups only happen for params, but not for $vars? [18:03:09] (If so, then I have to rewrite my patch yet again) [18:04:22] andrewbogott: for vars I think you've to explicitly set them? [18:04:29] ok [18:04:32] andrewbogott: _joe_ would know better, but params are class params, and vars you define inside... [18:08:25] Hi RoanKattouw..! Somehow I thought you weren't here... wanna peek at https://gerrit.wikimedia.org/r/#/c/173220/ , especially for potential caching/infrastructure-related issues? part of a series of changes for making the final CentralNotice banner selection on the client... see preceding changes for tests and related stuff...
[18:09:50] AndyRussG: Adam merged it 10 mins ago but I will look [18:10:19] RoanKattouw: thanks! Yes it is merged, but it's a no-op until we switch a config variable [18:10:24] AndyRussG: Right now though I'm in a meeting and right after that I'm getting on a plane (with wifi though) so it'll be an hour or two [18:11:18] RoanKattouw: woo, fantastic, thanks! Yeah understandably airports and planes would come first, fer sure :) [18:11:21] andrewbogott: that patch is already live on deployment-prep tho :) [18:14:28] YuviPanda: yes I did and no you can't unless you get in as root [18:14:43] ssh to that instance doesn't work cause it can't mount the keys (wrong nfs version, thank you lucid) [18:14:46] apergos: hmm, is it still needed? [18:15:13] I would like to keep it in play because until the last lucid server is dead that's how I'll test salt upgrades in labs [18:15:16] PROBLEM - Host db1017 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:16] PROBLEM - DPKG on analytics1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:15:45] apergos: ah, hmm, ok. [18:15:57] it's got puppet failures and no diamond, so shows up red on shinken [18:16:00] is ok, I guess [18:16:10] well, I don't know if it has puppet failures :) [18:16:12] definitely no diamond [18:16:18] that's why I asked andrew to create the instance. we have two servers left I think with lucid so once they go... [18:16:25] oh I know it has failures. [18:16:40] failure number 1: nfs mount of public keys share. [18:16:54] failure number two: can't set up the ssh stuff cause nfs mount. [18:17:05] don't remember the other failures :-D [18:17:05] (db1017 down is me btw) [18:17:11] bd808: i submitted a PR to git-deploy/trebuchet over the weekend, don't know if you saw my github ping [18:17:21] um I thought... db1017 no longer exists? or? [18:17:25] apergos: heh :) [18:17:31] cscott: I did indeed. [18:17:41] apergos: do we have an ETA on lucid dying?
[18:17:43] like, [18:17:49] 'months', 'years', 'days', 'hours' [18:17:49] apergos: yes renamed today [18:18:13] yeah that's what I mean, renamed so nothing should be answering to db1017 now ... ? [18:18:31] YuviPanda: I don't know, nickel dying was a help, think sodium might run it [18:18:36] that would be the worst [18:18:41] heh [18:18:42] YuviPanda: Mark said it will be gone before the LTS ends (March 2015 ish) [18:18:47] aaah [18:18:54] apergos: sodium does :) [18:19:16] For sodium, it's dependent on codfw. [18:19:40] apergos: not sure why icinga didn't pick up the new host, but anyways it has the same address [18:20:29] same IP I know, I had to stomp on the puppet stored resources to clean that up [18:20:38] but name should be gone from everything [18:20:58] chasemp: btw, me and legoktm are going to finish up ircnotifier today/tomorrow, use that instead of ircecho for shinken, and then migrate the other bots to it. [18:21:09] we're kind of maintainers for almost all the IRC bots somehow. 
[18:25:39] PROBLEM - puppet last run on analytics1013 is CRITICAL: Connection refused by host [18:27:28] PROBLEM - check configured eth on analytics1013 is CRITICAL: Connection refused by host [18:27:28] PROBLEM - check if salt-minion is running on analytics1013 is CRITICAL: Connection refused by host [18:27:35] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: Connection refused by host [18:27:39] PROBLEM - check if dhclient is running on analytics1013 is CRITICAL: Connection refused by host [18:27:46] PROBLEM - Hadoop NodeManager on analytics1013 is CRITICAL: Connection refused by host [18:27:55] PROBLEM - Disk space on analytics1013 is CRITICAL: Connection refused by host [18:27:55] PROBLEM - RAID on analytics1013 is CRITICAL: Connection refused by host [18:28:58] gone, back in a little while again [18:29:18] (03CR) 10Andrew Bogott: [C: 031] "<| |> ~> creeps me out, but this seems fine :)" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:29:20] (03PS3) 10Ori.livneh: Allow multiple instances [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 [18:30:35] RECOVERY - check configured eth on analytics1013 is OK: NRPE: Unable to read output [18:30:36] RECOVERY - check if salt-minion is running on analytics1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:30:36] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [18:30:36] RECOVERY - check if dhclient is running on analytics1013 is OK: PROCS OK: 0 processes with command name dhclient [18:30:44] (03CR) 10Ori.livneh: [C: 04-1] "Not sold on the feature flag approach, would like to think it over" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:30:55] RECOVERY - Hadoop NodeManager on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:30:56] 
RECOVERY - Disk space on analytics1013 is OK: DISK OK [18:30:57] RECOVERY - RAID on analytics1013 is OK: OK: no disks configured for RAID [18:31:46] (03CR) 10Yuvipanda: "Alright. It's still cherry-picked on betacluster, so 'tis ok :) Should we start an ops@ thread?" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:32:23] (03CR) 10Ori.livneh: "Yes -- would you mine describing the approach in an email to the list?" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:33:08] YuviPanda: god speed then [18:33:21] (03CR) 10Yuvipanda: "yup, doing now." [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:38:45] RECOVERY - DPKG on analytics1013 is OK: All packages OK [18:41:43] (03CR) 10Ottomata: "Cool. Couple of comments inline." (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [18:43:22] (03CR) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [18:43:42] (03CR) 10CSteipp: [C: 031] "Shouldn't be too significant of a performance impact, and yay for cutting down on work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173519 (owner: 10Hoo man) [18:44:02] (03PS4) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 [18:44:07] (03CR) 10Yuvipanda: Allow sshd to pull ssh keys from ldap on Trusty. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [18:48:40] (03PS1) 10Yuvipanda: diamond: Explicitly set method to use for determining hostname [puppet] - 10https://gerrit.wikimedia.org/r/173870 [18:49:32] (03PS2) 10Awight: Set new banner dispatcher vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 [18:50:05] PROBLEM - puppet last run on analytics1013 is CRITICAL: Timeout while attempting connection [18:51:26] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [18:51:40] andrewbogott: also added you to https://gerrit.wikimedia.org/r/#/c/173870/ [18:53:00] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms [18:53:38] (03CR) 10Andrew Bogott: [C: 031] "This seems hard to argue" [puppet] - 10https://gerrit.wikimedia.org/r/173870 (owner: 10Yuvipanda) [18:54:08] (03CR) 10Yuvipanda: [C: 032] diamond: Explicitly set method to use for determining hostname [puppet] - 10https://gerrit.wikimedia.org/r/173870 (owner: 10Yuvipanda) [18:55:05] PROBLEM - Hadoop NodeManager on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:55:36] (03PS3) 10Awight: Set new banner dispatcher vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 [18:55:55] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [18:57:06] RECOVERY - puppet last run on analytics1013 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [19:00:32] hmm, jgage, yt? [19:00:41] i'm having trouble starting hadoop workers, looks like a Gelf problem [19:00:57] (03CR) 10AndyRussG: [C: 031] "Cool!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 (owner: 10Awight) [19:00:59] ok, I merge a change, and I suddenly get millions of diamond sudo failures...
[19:01:05] but change is completely unrelated. [19:01:11] and I've been getting the sudo failures earlier as well [19:01:13] just less frequently [19:01:16] PROBLEM - puppet last run on analytics1013 is CRITICAL: CRITICAL: Puppet has 2 failures [19:01:16] (failures for ipvsadm) [19:03:35] hmm, they might all just be triggered by the diamond restart [19:04:05] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:07:15] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:09:34] ottomata: hm, gelf problem you say? :( [19:09:55] wooo cronspam! [19:09:56] yes [19:09:57] that's interesting because i didn't have any problems when i rebooted workers last week [19:10:09] ah, well this is post trusty upgrade [19:10:12] oho [19:10:13] but, i can't find json-simple on the classpath [19:10:20] java.lang.NoClassDefFoundError: org/json/simple/JSONValue [19:10:25] so it can't start daemons [19:10:32] hmph weird. in precise it wants libjson-simple-java [19:10:35] lemme look [19:11:07] yeah, it still installed [19:11:08] and the .jar is there [19:11:08] i just don't see how it gets onto the classpath of the hadoop daemons [19:11:09] see analytics1013 [19:11:23] ok [19:12:14] hm symlink breakage [19:12:22] /usr/share/java/json_simple.jar -> json-simple.jar [19:13:12] (03PS1) 10Rush: phab don't try to preview icon/x-icon [puppet] - 10https://gerrit.wikimedia.org/r/173875 [19:14:10] oh [19:14:12] eh? [19:14:38] jgage, where is json_simple.jar used?
jgage: those symlinks seem ok to me [19:15:21] oh [19:16:47] i symlink it into /usr/lib/hadoop/lib/ [19:17:01] i purged and reinstalled the package and the self-referential symlink came back [19:17:08] checking upstream package [19:18:10] (03PS1) 10Kaldari: Set wgMFUseWikibaseDescription to true on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173878 [19:19:08] ah, jgage did the version change? [19:19:09] is that why? [19:19:17] maybe you should symlink without the version name? [19:19:38] probably just to [19:19:40] /usr/share/java/json_simple.jar [19:19:45] or /usr/share/java/json-simple.jar [19:20:23] yeah [19:20:28] they changed from _ to - [19:20:42] aye, but also I see your symlink is /usr/lib/hadoop/lib/json_simple-1.1.jar -> /usr/share/java/json_simple-1.1.jar [19:20:48] maybe instead you should just do [19:21:00] /usr/lib/hadoop/lib/json-simple.jar -> /usr/share/java/json-simple.jar [19:21:06] i see that the _ .jar exists too [19:21:07] yeah, i will fix that [19:21:14] it looks like it is just because it was upgraded [19:21:16] to 1.1.1 [19:22:19] yeah. patch coming in a sec. [19:23:24] (03CR) 10Andrew Bogott: [C: 032] openstack: folsom -> havana as default version [puppet] - 10https://gerrit.wikimedia.org/r/173460 (owner: 10Dzahn) [19:23:37] (03CR) 10Chad: [C: 031] gerrit templates: fix jenkins/lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/173475 (owner: 10Dzahn) [19:25:19] ottomata i think i should do /usr/lib/hadoop/lib/json_simple.jar -> /usr/share/java/json_simple.jar so that it works on precise + trusty and trusty contains json_simple.jar -> json-simple.jar [19:26:21] (03PS1) 10Gage: hadoop: gelf: libjson: support precise + trusty [puppet] - 10https://gerrit.wikimedia.org/r/173883 [19:27:03] "[Diffusion] [Committed] rOPSPUPPETb49f9877f433: Merge "openstack: folsom -> havana as default version" into production" <- what the heck is this?
[19:27:05] ok [19:27:12] (03PS2) 10Ori.livneh: Update EventLogging listener IP for labs [puppet] - 10https://gerrit.wikimedia.org/r/173352 [19:27:17] not sure of the difference between _ and - in this case, but they both seem to be present [19:27:17] so sure [19:27:18] (03CR) 10Ori.livneh: [C: 032 V: 032] Update EventLogging listener IP for labs [puppet] - 10https://gerrit.wikimedia.org/r/173352 (owner: 10Ori.livneh) [19:27:37] jgage: thanks. [19:27:37] (03PS2) 10Ottomata: hadoop: gelf: libjson: support precise + trusty [puppet] - 10https://gerrit.wikimedia.org/r/173883 (owner: 10Gage) [19:27:44] (03CR) 10Ottomata: [C: 032 V: 032] hadoop: gelf: libjson: support precise + trusty [puppet] - 10https://gerrit.wikimedia.org/r/173883 (owner: 10Gage) [19:27:54] yay [19:27:58] (hopefully) [19:29:29] (03PS1) 10Yuvipanda: shinken: Increase thresholds for free space warnings [puppet] - 10https://gerrit.wikimedia.org/r/173887 [19:29:35] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:29:45] RECOVERY - puppet last run on analytics1013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:29:45] RECOVERY - Hadoop NodeManager on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:29:52] yay [19:29:55] phew! 
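The version-agnostic symlink scheme jgage and ottomata settle on above can be checked end to end without touching the real /usr. A minimal sketch in a throwaway directory (the 1.1.1 jar version matches the upgrade discussed; the exact paths are illustrative):

```shell
# Throwaway tree mirroring the layout under discussion.
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/share/java" "$tmp/usr/lib/hadoop/lib"
touch "$tmp/usr/share/java/json-simple-1.1.1.jar"

# Trusty ships json-simple.jar; a compat link keeps the precise-era
# json_simple.jar name working, and the hadoop classpath link points
# at the version-less name so package upgrades can't strand it.
ln -s json-simple-1.1.1.jar "$tmp/usr/share/java/json-simple.jar"
ln -s json-simple.jar       "$tmp/usr/share/java/json_simple.jar"
ln -s ../../../share/java/json_simple.jar \
      "$tmp/usr/lib/hadoop/lib/json_simple.jar"

# A dangling or self-referential link would show up here:
dangling=$(find "$tmp/usr" -xtype l | wc -l | tr -d ' ')
target=$(basename "$(readlink -f "$tmp/usr/lib/hadoop/lib/json_simple.jar")")
echo "dangling=$dangling target=$target"
```

The `find -xtype l` test is the quick way to catch the self-referential breakage seen after the purge/reinstall, since it flags any symlink that fails to resolve.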
[19:31:20] phew, ok aside from that, this upgrade went pretty easy
[19:31:29] i'm going to give that 5 or 10 minutes to chill and then move on with more workers
[19:31:38] awesome
[19:33:15] PROBLEM - puppet last run on search1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:33:47] (Abandoned) Andrew Bogott: Move openstack_version and use_neutron into hiera [puppet] - https://gerrit.wikimedia.org/r/173294 (owner: Andrew Bogott)
[19:36:58] who was used google webmaster tools before
[19:36:59] has
[19:37:06] (PS4) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[19:37:20] (PS5) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[19:39:08] (PS6) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[19:44:15] (CR) Aklapper: [C: 1] "List looks good to me (removes the two offending entries from the default list of files.viewable-mime-types and files.image-mime-types)" [puppet] - https://gerrit.wikimedia.org/r/173875 (owner: Rush)
[19:46:44] (PS2) Rush: phab don't try to preview icon/x-icon [puppet] - https://gerrit.wikimedia.org/r/173875
[19:47:28] (PS2) Yuvipanda: shinken: Increase thresholds for free space warnings [puppet] - https://gerrit.wikimedia.org/r/173887
[19:51:35] RECOVERY - puppet last run on search1015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:58:48] ebernhardson: :)
[19:58:50] ebernhardson: so, sudo on labs
[19:58:56] ebernhardson: you need to sudo su first before sudo -u ing
[19:59:10] users can't become other users, you need to become root first
[19:59:15] there's a setting on wikitech to toggle this
[19:59:22] YuviPanda: yea, the annoying thing is it's the inverse of my vagrant
[19:59:28] hahaha
[19:59:41] you can set it projectwide on your project
[19:59:44] under sudo policy
[19:59:48] YuviPanda: so i have to remember to `sudo -u www-data blah blah` on one, and `sudo su www-data -c '...'` on the other
[19:59:51] ahh, that would be better :)
[20:00:05] awight, AndyRussG, ejegg: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T2000). Please do the needful.
[20:00:09] ebernhardson: yeah, it was annoying me too, so I annoyed andrewbogot.t enough until he implemented it :)
[20:02:03] !log starting upgrade of analytics1014 to trusty
[20:02:08] Logged the message, Master
[20:02:33] (PS1) Andrew Bogott: Move the openstack_version setting hiera. [puppet] - https://gerrit.wikimedia.org/r/173904
[20:03:16] (CR) jenkins-bot: [V: -1] Move the openstack_version setting hiera. [puppet] - https://gerrit.wikimedia.org/r/173904 (owner: Andrew Bogott)
[20:04:54] (CR) Giuseppe Lavagetto: "You can either do this (strongly suggested), use individual files for the various hosts, or define a global variable in nodes.pp, which is" [puppet] - https://gerrit.wikimedia.org/r/171741 (owner: GWicke)
[20:05:05] (PS2) Andrew Bogott: Move the openstack_version setting hiera. [puppet] - https://gerrit.wikimedia.org/r/173904
[20:05:12] AndyRussG: So I looked at that commit, the ResourceLoadery parts look pretty straightforward to me. Anything in particular going on there?
[20:05:30] (PS7) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[20:05:52] Hi RoanKattouw... yes, one sec :) I can explain the setup
[20:08:15] PROBLEM - DPKG on analytics1014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:08:18] (PS8) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[20:09:22] RoanKattouw: wrt AndyRussG's patch, the only thing I'm worried about is being very certain that the data module response will not be cached in any surprising way.
[20:09:48] RoanKattouw: ^ yes what awight said :)
[20:10:17] (CR) Aklapper: "Ignore my part about bug-attachment.wikimedia.org, let's just ignore that so it'd still work." [puppet] - https://gerrit.wikimedia.org/r/173483 (owner: 20after4)
[20:10:29] It's sending data in via getScript that we want to be sure doesn't get cached for more than 15 minutes or so (I think it's expected to be less, tho)
[20:10:52] OK let me look at that getScript call, I think I remember seeing one
[20:11:18] RoanKattouw: CNBannerChoiceDataResourceLoaderModule
[20:11:34] AFAICT you are not setting a ?version= query parameter
[20:11:41] Which means RL should give you 5-minute caching headers
[20:12:39] (PS9) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[20:17:26] PROBLEM - puppet last run on analytics1014 is CRITICAL: Connection refused by host
[20:17:44] (PS4) Ori.livneh: memcached: tidy [puppet] - https://gerrit.wikimedia.org/r/171153
[20:19:12] dammit, debug something for 1h, find out it is because you typed 'bytes' when it should be 'byte'
[20:19:13] grr
[20:19:36] PROBLEM - Hadoop DataNode on analytics1014 is CRITICAL: Connection refused by host
[20:19:46] PROBLEM - RAID on analytics1014 is CRITICAL: Connection refused by host
[20:19:46] PROBLEM - check if dhclient is running on analytics1014 is CRITICAL: Connection refused by host
[20:19:46] PROBLEM - Hadoop NodeManager on analytics1014 is CRITICAL: Connection refused by host
[20:19:55] PROBLEM - check if salt-minion is running on analytics1014 is CRITICAL: Connection refused by host
[20:19:56] PROBLEM - Disk space on analytics1014 is CRITICAL: Connection refused by host
[20:20:06] PROBLEM - check configured eth on analytics1014 is CRITICAL: Connection refused by host
[20:20:07] (PS3) Yuvipanda: shinken: Fix freespace warnings [puppet] - https://gerrit.wikimedia.org/r/173887
[20:23:06] RECOVERY - check configured eth on analytics1014 is OK: NRPE: Unable to read output
[20:23:36] RECOVERY - Hadoop DataNode on analytics1014 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[20:23:46] RECOVERY - RAID on analytics1014 is OK: OK: no disks configured for RAID
[20:23:55] RECOVERY - check if dhclient is running on analytics1014 is OK: PROCS OK: 0 processes with command name dhclient
[20:23:56] RECOVERY - check if salt-minion is running on analytics1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:23:56] RECOVERY - Hadoop NodeManager on analytics1014 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[20:23:56] RECOVERY - Disk space on analytics1014 is OK: DISK OK
[20:24:32] (CR) Ori.livneh: [C: 2] memcached: tidy [puppet] - https://gerrit.wikimedia.org/r/171153 (owner: Ori.livneh)
[20:31:24] haha! and now the freespace stuff works
[20:32:15] RoanKattouw: Krinkle: we're not explicitly setting any version param, but when I load it up in the browser it does take a version param for loading modules
[20:32:24] the module I mean
[20:32:59] AndyRussG: The version param is the timestamp of when the module was generated by a load.php request. It's not declared somewhere explicitly.
[20:33:05] Sure but that's for the code, not the data, right?
[20:33:34] However it is important that the code (including any embedded data) will not refresh, unless and only if the logic in the module class allows it to detect a change and give a new timestamp.
[20:34:08] which module is this and how is it loaded?
[20:34:35] RECOVERY - DPKG on analytics1014 is OK: All packages OK
[20:34:37] Krinkle: here's a patch, now near the tip of master, that has it: https://gerrit.wikimedia.org/r/#/c/173220/
[20:34:39] (CR) Yuvipanda: [C: 2] shinken: Fix freespace warnings [puppet] - https://gerrit.wikimedia.org/r/173887 (owner: Yuvipanda)
[20:34:45] PROBLEM - puppet last run on analytics1014 is CRITICAL: Connection refused by host
[20:36:20] Krinkle: RoanKattouw: in CentralNotice, includes/CNBannerChoiceDataResourceLoaderModule
[20:36:26] It's not file-based
[20:36:42] https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/master/includes/CNBannerChoiceDataResourceLoaderModule.php
[20:36:53] rather it dynamically creates some JSON, which is what it sends in via getScript()
[20:37:06] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[20:37:17] AndyRussG: Mind if I do a quick CR?
[20:37:28] Krinkle: please go ahead!
[20:37:55] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[20:38:08] As it stands, awight and ejegg are just deploying to meta, mediawiki and test and aa.wikibooks as we speak
[20:38:09] Avoid computing any data in the constructor method. That's used by the startup module, which pretty much has 0 overlap with the main module request, so there shouldn't be anything in there other than processing the parameters from the resourceModule array (if any)
[20:38:20] Ah, in fact the 'http' member isn't even used?
[20:39:01] Krinkle: ah woops yes that constructor should have been removed :(
[20:39:09] !log ejegg Synchronized php-1.25wmf8/extensions/CentralNotice/: Update CentralNotice for client-side banner choice (duration: 00m 03s)
[20:39:13] Logged the message, Master
[20:42:45] RoanKattouw: Krinkle: thanks! Can you see any other possible issues? Especially, the idea is to avoid inadvertently serving clients stale banner data, though any other issues hiding in there would also be great to hear of :)
[20:42:45] (CR) Ejegg: [C: 2] Set new banner dispatcher vars [mediawiki-config] - https://gerrit.wikimedia.org/r/173849 (owner: Awight)
[20:44:13] Hmm it looks like this module should use getModifiedTimestamp()
[20:44:34] AndyRussG: Krinkle created that API, so maybe he can introduce you to it while I have a meeting
[20:45:34] AndyRussG: Indeed, it's missing the most critical method for ResourceLoader's design. This must not be deployed in its current state.
[20:48:31] !log ejegg Synchronized wmf-config: (no message) (duration: 00m 03s)
[20:48:33] Logged the message, Master
[20:48:35] Krinkle: thank you! The current deploy actually still disables this module via a config variable, except on testwiki and aa.wikibooks (a disabled wiki used for CN testing)
[20:48:50] (PS1) Hashar: Fix dependencies for tox 'cover' env [debs/pybal] - https://gerrit.wikimedia.org/r/173914
[20:49:07] (CR) Hashar: Move tests to pybal.test; use Twisted's test runner (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/173086 (owner: Ori.livneh)
[20:49:41] Krinkle: RoanKattouw: we have another deploy slot in an hour, though
[20:50:36] AndyRussG: Added comments inline.
[20:56:55] I hear second-hand reports that fr.wikisource is getting an unusual rate of 500s; is there a place to look for per-project 500s? Such a small project would be drowned in the noise and barely register as a statistical blip in the global stats.
[20:58:17] Coren: I wonder if it is being caused by the spike in db traffic caused by a particular djvu file in commons that's used by frwiki
[20:58:27] bblack (or akosiaris?) was investigating...
[20:58:41] Coren: have you tried https://logstash.wikimedia.org ? :d
[20:59:05] Hm, two issues with the same small project? That smells like a not-coincidence.
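An editor's aside: the caching problem Krinkle flags above (a module whose script embeds live banner data, but whose URL carries no data-derived version) can be sketched outside of PHP. The following toy model in Python is not the actual ResourceLoader API, and all names in it are hypothetical; it only illustrates the mechanism that getModifiedTimestamp() provides, namely deriving the module version from the data itself so that a data change produces a new URL and busts long-lived caches:

```python
import hashlib
import json

class DataModule:
    """Toy model of a data-driven ResourceLoader-style module."""

    def __init__(self, fetch_data):
        self._fetch_data = fetch_data  # callable returning the live banner data

    def get_script(self):
        # Embed the data as JSON in the generated script, as the CentralNotice
        # module does via getScript().
        return "mw.config.set('choices', %s);" % json.dumps(self._fetch_data())

    def get_version(self):
        # Stand-in for getModifiedTimestamp(): derive the version from the
        # current data, so the version changes exactly when the data changes.
        blob = json.dumps(self._fetch_data(), sort_keys=True).encode()
        return hashlib.sha1(blob).hexdigest()[:8]

    def url(self):
        # Without a ?version= parameter the response only gets short (~5 min)
        # caching headers and can be served stale; with a data-derived version,
        # a change yields a brand-new, safely cacheable URL.
        return "/load.php?modules=choices&version=" + self.get_version()
```

The point of the sketch: mutating the data and asking for the URL again gives a different `version=`, while unchanged data keeps the URL stable, which is the cache behaviour the discussion above is after.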
[21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T2100). Please do the needful.
[21:00:33] Coren: there is a bunch of DBQueryError :(
[21:00:38] hashar: I know of no way to filter per project.
[21:00:50] add a filter
[21:00:51] must
[21:00:57] query: wikidb:frwikisource
[21:01:03] Aha.
[21:01:10] I don't know how to share the url :(
[21:01:26] Coren: https://logstash.wikimedia.org/#dashboard/temp/djpwO4KBSuafgeTWWTg46Q
[21:01:29] hashar: No, that works - it's the 'wikidb:' bit I did not know
[21:01:30] with run jobs excluded
[21:01:34] Coren: see also thread '[Ops] appserver<->mysql traffic spikes past two days '
[21:02:06] (CR) Rush: [C: 2 V: 2] Reapply Bugzilla XML-RPC API workaround for a short while again [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173516 (owner: Aklapper)
[21:02:37] there is a Lock wait timeout exceeded; try restarting transaction (10.64.16.27) on the user table :/
[21:03:48] Coren: when you click on a row, there are more details shown, each field can be used as a new filter with a single click
[21:05:01] jouncebot: on it!
[21:05:25] RECOVERY - puppet last run on analytics1014 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[21:06:57] (CR) Dzahn: [C: -1] "what Ori said, you need to include passwords::mysql::phabricator" [puppet] - https://gerrit.wikimedia.org/r/173483 (owner: 20after4)
[21:10:57] (CR) Dzahn: "what Ori said, you need to include passwords::mysql::phabricator" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173483 (owner: 20after4)
[21:14:03] (CR) Dzahn: [C: 2] gerrit templates: fix jenkins/lint warnings [puppet] - https://gerrit.wikimedia.org/r/173475 (owner: Dzahn)
[21:31:27] (PS10) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[21:35:41] (PS11) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[21:46:56] (PS1) Awight: Enable new CentralNotice features on beta.wmflabs [mediawiki-config] - https://gerrit.wikimedia.org/r/173973
[21:47:17] (CR) Awight: [C: 2] Enable new CentralNotice features on beta.wmflabs [mediawiki-config] - https://gerrit.wikimedia.org/r/173973 (owner: Awight)
[21:47:27] (Merged) jenkins-bot: Enable new CentralNotice features on beta.wmflabs [mediawiki-config] - https://gerrit.wikimedia.org/r/173973 (owner: Awight)
[21:55:23] (CR) Hashar: "puppet will bails out because it doesn't know about jenkins-deploy :-/ Might want to reduce the tmpfs from 512 to 128." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle)
[21:57:20] (CR) Ottomata: [C: 2 V: 2] Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418 (owner: Ori.livneh)
[21:57:36] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
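An editor's aside: the state that trips this alert (a working copy where `git status` looks clean, yet HEAD is behind origin because a change was merged upstream but never pulled into the deployment checkout) is easy to reproduce. A minimal sketch in Python, assuming git (recent enough for `init -b`) is on PATH; paths and the commit message are illustrative:

```python
import os
import subprocess
import tempfile

def git(cwd, *args):
    """Run a git command with throwaway identity config; return its stdout."""
    return subprocess.run(
        ["git", "-c", "user.email=ops@example.org", "-c", "user.name=ops", *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    ).stdout

top = tempfile.mkdtemp()
origin = os.path.join(top, "origin")
staging = os.path.join(top, "staging")

os.mkdir(origin)
git(origin, "init", "-b", "master")  # pin the branch name (git >= 2.28)
git(origin, "commit", "--allow-empty", "-m", "base")
git(top, "clone", "origin", "staging")

# A change is merged upstream but never pulled into the deploy checkout:
git(origin, "commit", "--allow-empty", "-m",
    "Enable new CentralNotice features on beta.wmflabs")
git(staging, "fetch", "origin")

# The working tree reports no local changes...
clean = git(staging, "status", "--porcelain")
# ...yet the range the check inspects still lists one pending commit:
pending = git(staging, "log", "--pretty=oneline", "HEAD..origin/master").splitlines()
```

Here `clean` is empty while `pending` holds one line, which is exactly the combination awight runs into below: `git status` has nothing to say about commits that exist only on the remote-tracking branch.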
[21:59:12] Hmm ^^ I don't see any changes, using "git status"
[21:59:59] awight:
[22:00:01] [tin:/srv/mediawiki-staging/wmf-config] $ git log --pretty=oneline HEAD..origin/master
[22:00:01] 443dd91a4674a776b00fcb55a5ef8a1542b0f51b Enable new CentralNotice features on beta.wmflabs
[22:00:04] awight, AndyRussG, ejegg: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T2200). Please do the needful.
[22:00:11] <_joe_> awight: submodules?
[22:01:37] _joe_: maybe, git submodule status reports: 197eb243543e88827401ee47f69cf98bdbfd0cf9 docroot/bits/WikipediaMobileFirefoxOS (heads/master-14-g197eb24)
[22:01:46] leftovers on tin would be my fault, let me see
[22:01:57] ejegg: I didn't see any fwiw
[22:02:01] oh, not sure where that would have come from
[22:02:11] ori: ok thanks
[22:02:28] ejegg: ori explained -- it was my betalabs commit
[22:03:05] oh, ok.
[22:03:34] !log awight Synchronized wmf-config: Enable new CentralNotice features on beta.wmflabs (duration: 00m 07s)
[22:03:37] Logged the message, Master
[22:03:55] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[22:06:57] (CR) Dzahn: [C: 2] "since https://gerrit.wikimedia.org/r/#/c/173460/ has been merged this should be just fine. there is puppet code like "source => "puppet://"" [puppet] - https://gerrit.wikimedia.org/r/173459 (owner: Dzahn)
[22:14:47] greg-g, apergos, et al: just fyi, i'm still working on the parsoid deploy. we ran into some issues getting the deploy commit prepared.
[22:15:01] but i kicked jenkins' butt and now it looks like i'm just about ready to press go on the deploy
[22:15:23] I admit I'm not really here at this point (being midnight here)
[22:16:06] cscott: we're not deploying yet, feel free
[22:16:11] cscott: please ping when you're done!
[22:16:14] thanks.
[22:16:18] will do.
[22:16:20] k
[22:19:22] (CR) Andrew Bogott: "Are you sure it's a problem? I would expect that user => just takes a string and applies it blindly; I don't think it creates a dependenc" [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle)
[22:20:34] (CR) Giuseppe Lavagetto: [C: 1] "LGTM, but - shouldn't we extend this to allo monitors anyway for consistency?" [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:24:28] (CR) Dzahn: [C: 2] "harmless alignments that fix noise in jenkins/compiler output" [puppet] - https://gerrit.wikimedia.org/r/173478 (owner: Dzahn)
[22:25:43] (CR) Dzahn: "bd808, this is good, right?" [puppet] - https://gerrit.wikimedia.org/r/173469 (owner: Dzahn)
[22:26:11] (CR) BryanDavis: [C: 1] logstash beta: remove pmtpa, do TODO [puppet] - https://gerrit.wikimedia.org/r/173469 (owner: Dzahn)
[22:26:34] (CR) Dzahn: [C: 2] logstash beta: remove pmtpa, do TODO [puppet] - https://gerrit.wikimedia.org/r/173469 (owner: Dzahn)
[22:29:12] springle: I was wondering... I think a new trigger caused a deadlock this morning, and I was hoping you could CR for that issue before we have it happen again, by chance. https://gerrit.wikimedia.org/r/#/c/173768/
[22:29:34] (PS1) Dzahn: delete class ldap::client::autofs [puppet] - https://gerrit.wikimedia.org/r/173991
[22:31:47] (CR) Ori.livneh: "@Giuseppe: yes, we should apply it to all monitors, but I'll make the change to each one as I write tests for them." [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:36:11] (CR) Ori.livneh: [C: 2] make monitor constructor accept a custom reactor object [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:36:29] (Merged) jenkins-bot: make monitor constructor accept a custom reactor object [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:36:36] we have this "facilities.pp", it has PDU monitoring and a tiny unused class for cameras. first i was about to make a facilities module, but then.. move PDU stuff to a monitoring module instead? (and the camera thing can go or be its own module)
[22:40:55] (PS1) Dzahn: delete class facilities::dc-cam-transcoder [puppet] - https://gerrit.wikimedia.org/r/173996
[22:48:13] Can a hiera db refer to a fact or a node variable? I have a var that should be $::ipaddress_eth0 on some nodes but hard-coded on others.
[22:50:19] !log updated Parsoid to version 819b2cf4
[22:50:24] Logged the message, Master
[22:52:51] awight: ok, done. thanks for being patient!
[22:53:07] cscott: awesome!
[22:53:56] manybubbles: the parsoid job throughput seems to have dropped quite a bit since the CirrusSearch job change deployment: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1416264366&g=cpu_report&z=large
[22:53:59] (PS1) Rush: bugzilla handles characters that are invalid for api [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173998
[22:54:16] gwicke: we're certainly going to be using some cpu
[22:54:24] are we crushing you somehow?
[22:54:49] I'm not sure, the parsoid jobs don't really use cpu on the job runners at all
[22:55:14] they just do a handful of http requests per job
[22:55:56] there were some similar job throughput jitters last week, I didn't really look into it then
[22:59:11] (PS2) Dzahn: delete class facilities::dc-cam-transcoder [puppet] - https://gerrit.wikimedia.org/r/173996
[22:59:13] (PS1) Dzahn: kill facilities.pp, move to nagios_common [puppet] - https://gerrit.wikimedia.org/r/173999
[23:00:17] manybubbles: still browsing ganglia for load on the job runners
[23:01:20] gwicke: we're actually eating into the backlog of jobs pretty slowly too.
[23:01:32] https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[23:02:58] YuviPanda: ^ killing facilities.pp up there, moving stuff to nagios_common
[23:03:09] one less place that has monitoring stuff
[23:09:07] (CR) Krinkle: "See inline comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle)
[23:16:13] anybody taken a look at this monitoring processor from stackexchange based on opentsdb? http://bosun.org/
[23:21:26] (PS1) Dzahn: remove gluster from ganglia config [puppet] - https://gerrit.wikimedia.org/r/174002
[23:21:30] (CR) Aklapper: "Makes sense as covered in https://phabricator.wikimedia.org/T815#20784" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173998 (owner: Rush)
[23:21:46] (CR) Tim Starling: "The problem occurs when you press the down arrow, in order to select a history entry more recent than the one you are currently viewing." [puppet] - https://gerrit.wikimedia.org/r/173752 (owner: Tim Starling)
[23:22:00] (CR) Dzahn: [C: 2] remove gluster from ganglia config [puppet] - https://gerrit.wikimedia.org/r/174002 (owner: Dzahn)
[23:22:33] (Abandoned) Dzahn: remove glusterfs and pmtpa remnants [puppet] - https://gerrit.wikimedia.org/r/173349 (owner: Dzahn)
[23:25:50] (CR) Dzahn: [C: -2] "i'd do these instead:" [puppet] - https://gerrit.wikimedia.org/r/171493 (owner: Dzahn)
[23:26:02] (Abandoned) Dzahn: (WIP) facilities: move to module [puppet] - https://gerrit.wikimedia.org/r/171493 (owner: Dzahn)
[23:27:28] (CR) Dzahn: [C: 1] "yea, per " \x0e might need to be added"" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173998 (owner: Rush)
[23:28:53] (CR) Ori.livneh: "Were you only seeing this on osmium? Its gdbinit had an older version of this code. If it was only on osmium, see if you can still reprodu" [puppet] - https://gerrit.wikimedia.org/r/173752 (owner: Tim Starling)
[23:30:09] (Abandoned) Ori.livneh: Remove coloured gdb prompt [puppet] - https://gerrit.wikimedia.org/r/173752 (owner: Tim Starling)
[23:35:49] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s)
[23:35:54] Logged the message, Master
[23:40:19] nevermind! i just read the config and found the interface-range part, now i see what i need to do :)
[23:40:22] oop
[23:42:13] (PS1) Awight: Enable client banner choice on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/174007
[23:45:27] greg-g: Is it possible to deploy to just the Group1 wikis?
[23:47:14] anyone ^^
[23:47:43] not exactly. what do you want to do?
[23:47:53] ori: I see, yeah I'm reading https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Switch_group1_wikis_to_VERSION
[23:47:55] awight: this https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Switch_group1_wikis_to_VERSION?
[23:47:58] hehe
[23:48:09] You owe me a chocolate bar!
[23:48:28] ori: The idea was to push a sort of limited CentralNotice deploy, but it looks like that involves switching versions, which is very much not what I'm going to do.
[23:48:43] use a feature flag
[23:48:46] (PS2) Andrew Bogott: delete class ldap::client::autofs [puppet] - https://gerrit.wikimedia.org/r/173991 (owner: Dzahn)
[23:48:48] oooh
[23:49:28] ori: yes we have that...
[23:49:52] in InitialiseSettings.php, 'wgCentralNoticeMyFoo' => array( 'default' => false, 'testwiki' => true, /* etc. */ )
[23:50:09] then in CommonSettings.php, if ( $wgCentralNoticeMyFoo ) { /* enable something */ }
[23:51:24] ori: we've set that up already, but even disabled, this code has some side-effects, cos it includes a ResourceLoader module.
[23:51:37] So we were trying to figure out how to deploy to just aa-wikibooks, for example
[23:51:46] but without all of the other 1.25wmf7-based wikis
[23:51:49] I see that's not possible.
[23:51:56] ...
[23:58:40] (PS2) Awight: Enable client banner choice on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/174007
[23:58:55] (CR) Awight: [C: 2 V: 2] Enable client banner choice on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/174007 (owner: Awight)
[23:59:26] !log awight Synchronized wmf-config: Enable new CentralNotice features on mediawikiwiki (duration: 00m 04s)
[23:59:29] Logged the message, Master
[23:59:55] AndyRussG: ejegg: K4-713: ok we should be deployed to mediawiki
[23:59:55] (PS1) Dzahn: gerrit role: add ssh::server listening on other IP [puppet] - https://gerrit.wikimedia.org/r/174015
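An editor's aside: the per-wiki feature-flag lookup ori describes above can be modeled in a few lines of Python. This is a simplification (it ignores dblist tags and the rest of the real wgConf lookup order); the flag name comes from ori's hypothetical example, and the `aawikibooks` entry is added here following the aa-wikibooks discussion, not taken from any real config:

```python
def get_setting(per_wiki, wiki):
    """Resolve a per-wiki flag: an exact wiki entry wins, else 'default'.
    Mirrors the shape of an InitialiseSettings.php-style entry."""
    return per_wiki.get(wiki, per_wiki['default'])

# ori's example, transcribed: off everywhere except the test wikis.
wgCentralNoticeMyFoo = {
    'default': False,
    'testwiki': True,
    'aawikibooks': True,  # hypothetical addition for the CN test wiki
}

if get_setting(wgCentralNoticeMyFoo, 'testwiki'):
    pass  # enable something, as in the CommonSettings.php branch above
```

The "not possible" part of the exchange is then just this: the flag can gate behavior per wiki, but code that runs unconditionally at load time (like registering a ResourceLoader module) ships to every wiki on that MediaWiki version regardless of the flag's value.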