[00:04:30] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [00:51:39] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 315 seconds [00:52:18] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 354 seconds [00:52:58] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:18] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [02:15:14] !log LocalisationUpdate completed (1.25wmf7) at 2014-11-17 02:15:14+00:00 [02:15:23] Logged the message, Master [02:27:25] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-17 02:27:25+00:00 [02:27:31] Logged the message, Master [03:43:27] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [04:00:49] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.018 second response time [04:09:48] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet fail [04:09:59] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.020 second response time [04:15:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 17 04:15:03 UTC 2014 (duration 15m 2s) [04:15:07] Logged the message, Master [04:29:19] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:37:39] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [05:12:11] (03PS1) 10Tim Starling: Remove coloured gdb prompt [puppet] - 10https://gerrit.wikimedia.org/r/173752 [05:30:31] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures [05:33:33] (03CR) 10Ori.livneh: [C: 031] "This shouldn't happen if you use \001 and \002 to enclose non-printing characters (which I did). 
I can't reproduce the issue. But this is " [puppet] - 10https://gerrit.wikimedia.org/r/173752 (owner: 10Tim Starling) [05:43:03] ori: also, https://bugzilla.wikimedia.org/show_bug.cgi?id=73479 (Eventlogging puppet failure on labs because intertwined with ganglia) [05:43:38] * ori looks [05:45:22] YuviPanda: did labs use to have ganglia? [05:45:35] it had gmond, but didn't really have an aggregator. [05:45:53] and that was eating up memory in instances, so we killed it. Not the most thorough job in hindsight. [05:46:04] since these have been failing ever since (about a month now) [05:46:28] why not add an aggregator? [05:46:45] we used to have an aggregator on a labs instance, and that just died very, very quickly. [05:46:57] so would need to be on a physical host. [05:47:17] and I don't know if I like that idea too much [05:48:13] ori: we added a hiera variable called 'has_ganglia' and been using that [05:48:22] not the most elegant of solutions. [05:50:50] ori: hmm, toollabs has a workaround now, deployment-prep doesn't (for core dumps), and I can't figure out a nice way to disable them forever. [05:50:56] * YuviPanda goes to apply workaround to deployment-prep too [05:51:09] one core dump fills up /var [05:54:34] YuviPanda: sysctl::parameters { 'disable core dumps': values => { 'kern.coredump' => 0, }, } [05:54:35] ori: also hhvm coredumps on deployment-prep, moved to /home/yuvipanda (if you want to have a look) [05:54:39] ... [05:54:39] wat [05:54:45] why didn't I actually find that? [05:55:03] * YuviPanda feels super dumb now [05:55:30] find what? 
it's not in the repo, i'm just showing you how to do it [05:55:57] yeah [05:55:58] I mean [05:56:02] find that parameter [05:56:15] I was looking into limits.conf and got lost there [05:56:27] you could do it there as well [05:57:04] oh damn, I just found out about limits.d [05:57:08] doubly feeling dumb now [05:57:21] ...or there, yeah :) [05:57:22] since I was spending time trying to figure out how to manage the limits.conf file without conflicts [05:57:33] file_line [05:57:44] in stdlib [05:57:49] triply dumb now. [05:57:54] ok, that was only 3 hours, so not so bad [05:58:02] plus feeling dumb is good, since that means you're learning. [05:58:04] * YuviPanda consoles self. [05:58:08] ori: thanks! [05:58:16] I'll make a patch now [05:58:44] ori: thanks! [05:59:00] no problem, thanks for fixing [05:59:05] :) [05:59:19] just fixed deployment-stream [05:59:27] * YuviPanda wonders if he can get all of deployment-prep green today [06:04:44] YuviPanda: having a dummy gmond package / service might be easier, in that it would not require changing individual modules like the has_ganglia var does [06:05:30] service { 'ganglia-monitor': start => '/bin/true', stop => '/bin/true', } [06:05:42] hmm, right. [06:05:46] i think you also need provider => base [06:09:57] having the separation might also be nice, tho. [06:11:43] why... do we have a lucid instance? [06:13:00] oh man, we've a host called udplog? [06:13:18] goddamit, gmond again [06:13:58] why... do we have a lucid instance? 
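The two tricks ori sketches in the exchange above — blocking core dumps and stubbing out gmond — could be written out roughly as follows. This is a hedged sketch against the Puppet 3 / puppetlabs-stdlib APIs of the era; the resource titles and the `* hard core 0` limit line are illustrative, not taken from the repo.

```puppet
# Disable core dumps via a single managed line in limits.conf, using
# file_line from puppetlabs-stdlib (no need to own the whole file,
# which was the conflict YuviPanda was worried about).
file_line { 'disable-core-dumps':
    path  => '/etc/security/limits.conf',
    line  => '* hard core 0',
    match => '^\*\s+hard\s+core\s+',
}

# Stub out ganglia-monitor on hosts without an aggregator:
# provider => base with no-op start/stop commands keeps Puppet from
# looking for a real init script, so modules that declare the service
# still compile.
service { 'ganglia-monitor':
    ensure   => stopped,
    provider => base,
    start    => '/bin/true',
    stop     => '/bin/true',
    status   => '/bin/false',
}
```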
[06:14:06] i guess we'll be porting it to debian squeeze soon [06:14:12] hehe :) [06:14:24] trolololo [06:14:58] * YuviPanda emails qa list [06:28:13] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:34] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:34] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:13] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:45:53] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:33] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:55] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:04] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [07:07:13] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [07:39:13] (03PS5) 10Giuseppe Lavagetto: mediawiki: simplify apache config [puppet] - 10https://gerrit.wikimedia.org/r/170300 [07:39:22] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: simplify apache config [puppet] - 10https://gerrit.wikimedia.org/r/170300 (owner: 10Giuseppe Lavagetto) [08:04:38] (03PS1) 10Giuseppe Lavagetto: Fix whitespace in RewriteCond [puppet] - 10https://gerrit.wikimedia.org/r/173756 [08:05:20] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix whitespace in RewriteCond [puppet] - 10https://gerrit.wikimedia.org/r/173756 (owner: 10Giuseppe Lavagetto) [08:13:08] going to upgrade Jenkins in a few minutes, it will be unavailable for a few. 
[08:15:04] <_joe_> hashar: ok [08:21:16] !log Upgrading Jenkins [08:21:23] Logged the message, Master [08:28:40] PROBLEM - Apache HTTP on mw1129 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50412 bytes in 0.035 second response time [08:30:04] Respected human, time to deploy Infra upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T0830). Please do the needful. [08:30:27] <_joe_> mmmh [08:30:55] !log reimaging cerium [08:31:00] Logged the message, Master [08:31:36] <_joe_> PHP Fatal error: Base lambda function for closure not found in /srv/mediawiki/php-1.25wmf7/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php on line 18 [08:31:40] <_joe_> sigh [08:31:40] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [08:31:46] USE ALL THE FEATURES? [08:31:49] akosiaris: you got a shoutout in [08:32:09] <_joe_> ori: hi [08:32:21] ori: yeah I noticed :-) [08:32:40] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [08:32:50] base lambda function for closure does not implement abstract factory interface stream protocol [08:32:55] _joe_: hey, morning [08:33:27] <_joe_> every time we reload apache a to of those errors spawn [08:33:35] <_joe_> it's an apc issue I guess [08:34:18] <_joe_> ori: I'd move to 25% of anons on HHVM at 5PM UTC [08:34:34] * YuviPanda adds _joe_ to https://phabricator.wikimedia.org/T1291 for after firefighting, for opinions. I can / shall implement [08:34:36] <_joe_> that should be your 9 AM? [08:35:16] <_joe_> YuviPanda: I'll take a look today [08:35:17] something very peculiar about our blog btw is that one may click the RSS button expecting to get the RSS feed for the blog but quite the contrary, it lists the "soon to be added :P" RSS feeds we want to ? [08:35:23] _joe_: thanks! [08:35:26] [08:35:29] hiera file for labs vide stuff [08:35:37] can we do 6pm? 
that gives the caffeine a bit of time to get absorbed [08:35:52] <_joe_> 6PM UTC? ok [08:36:01] <_joe_> but you'll have to do most monitoring [08:36:02] 6PM PST! [08:36:06] <_joe_> afterwards [08:36:24] <_joe_> as it's 7 PM here, and I have ops and mwcore in the evening [08:36:36] <_joe_> but it's super-cool for me [08:36:55] no ops today :) [08:37:15] etheerrrrpaaaad! [08:37:25] also I wonder if we should switch ops meetings to mumble. lower bandwidth! [08:37:26] <_joe_> paravoid: oh right [08:37:50] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [08:39:17] !log Jenkins upgraded [08:39:20] Logged the message, Master [08:40:10] PROBLEM - RAID on cerium is CRITICAL: Connection refused by host [08:40:20] PROBLEM - check if dhclient is running on cerium is CRITICAL: Connection refused by host [08:40:29] PROBLEM - check configured eth on cerium is CRITICAL: Connection refused by host [08:40:39] PROBLEM - SSH on cerium is CRITICAL: Connection refused [08:40:39] PROBLEM - check if salt-minion is running on cerium is CRITICAL: Connection refused by host [08:40:39] PROBLEM - DPKG on cerium is CRITICAL: Connection refused by host [08:41:09] PROBLEM - Disk space on cerium is CRITICAL: Connection refused by host [08:41:10] PROBLEM - puppet last run on cerium is CRITICAL: Connection refused by host [08:41:12] that reimaging might have been way too fast... [08:41:23] neon did not have enough time to get the puppet changes... [08:41:38] I may have overautomated stuff [08:41:48] hah [08:41:57] or you underautomated it ;) [08:42:20] hehe [08:42:29] true... it is always based on the POV [08:42:45] so let's switch POV... how to automate neon getting the changes.... [08:43:18] <_joe_> akosiaris: did you automate more upon wmf-reimage? 
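The "neon did not have enough time to get the puppet changes" problem akosiaris hits above follows from how the icinga config is assembled: each monitored host declares its checks, but they only materialize in puppet_services.cfg when the monitoring host itself runs Puppet. In production this was done by naggen querying the puppet database rather than native collection, but the timing issue is the same as in this classic exported-resources sketch (file and service names assumed for illustration):

```puppet
# On the icinga host: realize every exported nagios_service check.
# The generated file only changes when *this* node runs its agent, so
# a freshly reimaged host's checks lag behind until the collector's
# next Puppet run.
Nagios_service <<| |>> {
    target => '/etc/icinga/puppet_services.cfg',
    notify => Service['icinga'],
}
```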
[08:43:23] (03PS1) 10Ori.livneh: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 [08:43:33] _joe_: yeah, pushing commit now [08:43:41] _joe_: could also do it now :) [08:44:05] i'm more alert now than i will be then [08:44:11] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.347 second response time [08:44:15] <_joe_> ori: ahah, ok cool [08:44:32] YuviPanda: that one's for you ^^ [08:44:39] yeah, thanks! :) [08:44:48] now question is realm branching vs feature branching [08:45:09] let me test it anywa [08:45:09] y [08:45:36] testing shmesting, the code looks pretty [08:45:49] it must be right [08:46:02] :D [08:46:40] ori: aha, so this patch also hit the same thing my similar patch hit [08:46:40] Error: Failed to apply catalog: Could not find dependent Service[eventlogging/init] for File[/etc/eventlogging.d/consumers/mysql-m2-master] at /etc/puppet/modules/eventlogging/manifests/service/consumer.pp:50 [08:46:51] <_joe_> lol [08:46:53] which is a bit weird, and I gave up on friday night [08:47:03] hmmmmmm [08:47:15] my patch being https://gerrit.wikimedia.org/r/#/c/173634/, and the last patchset there is stupid, at which point I realized I should sleep. [08:48:16] (03PS1) 10Giuseppe Lavagetto: 25% of anonymous traffic to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 [08:49:28] (03PS2) 10Ori.livneh: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 [08:49:58] !log if there is any oddity with Jenkins/Zuul please poke me. I am on IRC all day today [08:50:03] Logged the message, Master [08:50:09] (03CR) 10Ori.livneh: [C: 031] "1 in 4 anonymous users agree." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 (owner: 10Giuseppe Lavagetto) [08:51:13] (03CR) 10Giuseppe Lavagetto: [C: 032] 25% of anonymous traffic to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 (owner: 10Giuseppe Lavagetto) [08:51:21] (03Merged) 10jenkins-bot: 25% of anonymous traffic to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173759 (owner: 10Giuseppe Lavagetto) [08:51:23] does one know when more hardware for labs coming in ? [08:52:08] labs is virtual, so it depends on the Cloud [08:52:20] peace be upon it [08:52:58] (sorry. no, i don't.) [08:53:00] PROBLEM - NTP on cerium is CRITICAL: NTP CRITICAL: No response from NTP server [08:53:22] mutante: http://www.weather.com/weather/today/37.540726,-77.436050?par=googleonebox,looks like more clouds are coming to eqiad [08:53:27] !log oblivian Synchronized wmf-config/CommonSettings.php: Open HHVM to 25% of anons (duration: 00m 06s) [08:53:29] Logged the message, Master [08:54:30] matanya: RobH / coren / andrewbogott would know about new hardware order [08:54:47] why do you ask? [08:55:11] I need more horse power, and andrew told me it should arrive, but didn;t tell me the date [08:55:27] so i'm holding my cod changes for now [08:55:30] code [08:55:54] how much more horse power do you need ? [08:56:11] about 4 more cores, and 16 GB RAM [08:56:14] what for? is this the video testing system? [08:56:19] we should already have that [08:56:20] yes, it is YuviPanda [08:56:27] video encoding [08:56:29] what YuviPanda said [08:56:39] not conf system [08:56:41] holding my cod: http://bostonfishingcharters.com/images/mikem45cod.jpg [08:56:54] ahaha [08:56:56] yes, that cod ori ! :D [08:56:57] ori: PS2 ran into the error my patch ran into after I fixed that first error, which is... 
[08:56:57] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/nagios/plugins/check_eventlogging_jobs] for Nrpe::Monitor_service[eventlogging] at /etc/puppet/manifests/role/eventlogging.pp:171 [08:57:10] I... should've said earlier. [08:58:07] akosiaris paravoid : https://commons.wikimedia.org/wiki/File:The_Hobo_1917_OLIVER_BABE_HARDY_BILLY_WEST_Arvid_E_Gillstrom.webm <- took about a day to re-encode [08:59:01] mutante: labs cores are wimpier than real cores [08:59:26] YuviPanda: mutante != matanya :) [08:59:45] gah [08:59:51] I've been making that for about a year now [09:00:19] https://commons.wikimedia.org/wiki/File:Out_West_1918_FATTY_ARBUCKLE_BUSTER_KEATON.webm <-- also about a day [09:00:40] what are you encoding them to? [09:00:42] H264? [09:00:47] (03PS3) 10Ori.livneh: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 [09:00:59] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [09:01:05] webm [09:01:10] from mp4 [09:01:18] aaaaha [09:01:21] YuviPanda: try PS3? [09:01:26] ori: yup, just cherry picked [09:01:32] thanks much [09:01:36] Stream #0:0 -> #0:0 (h264 -> libvpx) [09:01:36] Stream #0:1 -> #0:1 (aac -> libvorbis) [09:01:47] mutante: ah, youtube -> commons? [09:01:51] yes [09:01:56] and internet archive [09:02:01] ori: yup, that runs cleanly :) [09:02:04] and vimeo [09:02:11] and million other sources [09:02:20] YuviPanda: (shameless plug) do you know about pcc? [09:02:20] matanya: anyway, if 16G + 4Cores is what you need, we can increase your project's limit now [09:02:22] and it should be ok [09:02:34] ori: the C compiler? [09:02:36] andrewbogott request not to do it [09:02:43] matanya: aaaaho, I see. [09:02:50] ori: wait, puppet compiler? 
[09:02:52] i'm already beyond limits [09:03:00] YuviPanda: no :) there's a cli interface for _joe_'s catalog compiler in operations/puppet repo root [09:03:02] using 12 cores and 16 GB RAM [09:03:07] ori: aha! [09:03:21] nice [09:03:22] YuviPanda: https://asciinema.org/a/11986 [09:04:21] woah nice! [09:04:24] <_joe_> "asciinema" is pure genius as a domain [09:04:25] asciinema is also nice :) [09:04:27] <_joe_> YuviPanda: it is [09:04:27] yeah [09:04:40] * YuviPanda presumes ori built asciinema too [09:04:47] i wish [09:05:06] slacker [09:05:11] :) [09:05:24] ori: mind if I convert the realm branching to feature flag branching? [09:06:11] yeah, i'm not totally sold on that approach yet [09:06:33] hmm, are you worried about flag proliferation? [09:06:34] it's an interesting idea, though, and it may be right [09:07:30] well, the thing i don't love about it is that it makes things less explicit [09:08:07] <_joe_> ori: how? [09:08:08] if you have a role class from which you toggle bits on and off, there is an authoritative place to go to to read about the setup [09:08:31] <_joe_> I think _that_ should be done in the module classes, usually [09:08:38] <_joe_> or, we could use environments [09:09:11] <_joe_> so we have role::something in the base path [09:09:44] if i have module Foo and I activate module Bar, how should I know to expect module Foo to enable some previously-excluded functionality because it now feature-detects module Bar? [09:09:49] <_joe_> then we have that plus the prod specific stuff in the production enfironment [09:10:00] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [09:10:10] that is the worst host name [09:10:13] <_joe_> uh? 
parse error [09:10:13] !log praseodymium reimaging [09:10:18] Logged the message, Master [09:12:03] _joe_: feature-detection means modules which are already provisioned on a host and which one may consider stable can react to the introduction of a new, seemingly unrelated module [09:13:18] <_joe_> ori: nope, once you use hiera [09:13:26] <_joe_> and properly namespace variables [09:13:42] +1 hiera, has_ganglia isn't set in ganglia but in hiera [09:13:47] <_joe_> or maybe I didn't get your point, which is really probable [09:15:10] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [09:17:29] PROBLEM - check configured eth on praseodymium is CRITICAL: Connection refused by host [09:17:30] PROBLEM - RAID on praseodymium is CRITICAL: Connection refused by host [09:17:40] PROBLEM - check if dhclient is running on praseodymium is CRITICAL: Connection refused by host [09:17:50] PROBLEM - puppet last run on praseodymium is CRITICAL: Connection refused by host [09:17:50] PROBLEM - SSH on praseodymium is CRITICAL: Connection refused [09:18:10] PROBLEM - DPKG on praseodymium is CRITICAL: Connection refused by host [09:18:11] PROBLEM - check if salt-minion is running on praseodymium is CRITICAL: Connection refused by host [09:29:40] PROBLEM - NTP on praseodymium is CRITICAL: NTP CRITICAL: No response from NTP server [09:35:11] (03Abandoned) 10Yuvipanda: eventlogging: Make ganglia usage optional [puppet] - 10https://gerrit.wikimedia.org/r/173634 (https://bugzilla.wikimedia.org/73479) (owner: 10Yuvipanda) [09:37:34] (03CR) 10Nikerabbit: Add read only configuration for ElasticSearchTTMServer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 (owner: 10Nikerabbit) [09:39:27] jzerebecki: ping? 
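The `has_ganglia` approach YuviPanda and _joe_ converge on above looks roughly like this from a module's point of view (class names hypothetical): the toggle lives in hiera, properly namespaced per _joe_'s point, defaulting to true for production while the labs hierarchy sets it to false.

```puppet
# Hypothetical consumer of the hiera toggle: guard the ganglia wiring
# so labs hosts (has_ganglia: false in their hiera hierarchy) skip it
# without the module hard-coding any knowledge of the realm.
class eventlogging::monitoring {
    $has_ganglia = hiera('has_ganglia', true)

    if $has_ganglia {
        include eventlogging::monitoring::ganglia
    }
}
```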
[09:40:36] (03PS1) 10KartikMistry: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 [09:41:55] (03PS2) 10Nikerabbit: Add read only configuration for ElasticSearchTTMServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 [09:42:40] (03PS2) 10KartikMistry: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 [09:44:44] (03PS3) 10Nikerabbit: Add read only configuration for ElasticSearchTTMServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 [09:46:43] (03PS1) 10Nikerabbit: Group translate-proofr was removed from Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173766 [09:48:01] (03CR) 10Nemo bis: [C: 031] Group translate-proofr was removed from Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173766 (owner: 10Nikerabbit) [09:50:59] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [09:51:17] <_joe_> !log restarting mw1187, all apache children stuck in apc_pthreadmutex_lock() [09:51:20] Logged the message, Master [10:05:52] RECOVERY - Disk space on db1017 is OK: DISK OK [10:07:02] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [10:07:48] hm [10:16:00] (03PS2) 10Ori.livneh: allow multiple sudo::user grants for same user [puppet] - 10https://gerrit.wikimedia.org/r/173629 [10:16:07] (03CR) 10Ori.livneh: [C: 032 V: 032] allow multiple sudo::user grants for same user [puppet] - 10https://gerrit.wikimedia.org/r/173629 (owner: 10Ori.livneh) [10:17:35] alright, that's all of betacluster accounted for, I think [10:17:47] now to see why tools-webproxy thinks its tools [10:18:17] (03PS2) 10Ori.livneh: keyholder: add icinga check [puppet] - 10https://gerrit.wikimedia.org/r/173633 [10:20:18] (03CR) 10Ori.livneh: [C: 032 V: 032] keyholder: add icinga check [puppet] - 10https://gerrit.wikimedia.org/r/173633 (owner: 10Ori.livneh) 
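The sudo change ori merges above ("allow multiple sudo::user grants for same user") implies resources keyed by their title rather than by the account name. A hypothetical usage, assuming the define accepts `user` and `privileges` parameters:

```puppet
# Two independent grants for the same account: distinct resource
# titles avoid a duplicate-declaration error at compile time, while
# both still render sudoers entries for mwdeploy.
sudo::user { 'mwdeploy_scap':
    user       => 'mwdeploy',
    privileges => ['ALL = (mwdeploy) NOPASSWD: ALL'],
}

sudo::user { 'mwdeploy_restart_hhvm':
    user       => 'mwdeploy',
    privileges => ['ALL = (root) NOPASSWD: /usr/sbin/service hhvm restart'],
}
```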
[10:20:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [10:25:28] i'm going to verify alerts for keyholder (shared deployment ssh agent daemon) work by clearing keys from the agent, if it works we'll get an alert for tin [10:29:31] (03CR) 10Nikerabbit: Add support for woff2 files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [10:37:31] (03PS3) 10KartikMistry: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 [10:39:08] (03CR) 10Giuseppe Lavagetto: [C: 031] Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [10:51:32] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [10:54:27] <_joe_> kart_: I can +2 that change if you don't need someone else's review [10:56:41] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [11:01:23] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (100.00%) [11:03:22] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [11:04:12] RECOVERY - SSH on praseodymium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:04:22] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [11:14:53] RECOVERY - check configured eth on praseodymium is OK: NRPE: Unable to read output [11:15:11] RECOVERY - Disk space on praseodymium is OK: DISK OK [11:15:11] RECOVERY - RAID on praseodymium is OK: OK: no disks configured for RAID [11:15:31] RECOVERY - check if dhclient is running on praseodymium is OK: PROCS OK: 0 processes with command name dhclient [11:15:41] RECOVERY - DPKG on praseodymium is OK: All packages OK [11:15:42] RECOVERY - check if salt-minion is running on praseodymium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:16:31] RECOVERY - puppet 
last run on praseodymium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:17:41] RECOVERY - SSH on cerium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:21:01] (03PS1) 10Ori.livneh: Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 [11:21:09] (03PS2) 10Ori.livneh: Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 [11:21:26] (03CR) 10Ori.livneh: [C: 032] Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 (owner: 10Ori.livneh) [11:21:36] (03Merged) 10jenkins-bot: Route Bug40009 logs to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173784 (owner: 10Ori.livneh) [11:22:32] !log ori Synchronized wmf-config/InitialiseSettings.php: Ied0a7ab4b: Route Bug40009 logs to fluorine (duration: 00m 07s) [11:22:38] Logged the message, Master [11:22:41] RECOVERY - Disk space on cerium is OK: DISK OK [11:22:51] RECOVERY - check if dhclient is running on cerium is OK: PROCS OK: 0 processes with command name dhclient [11:22:52] RECOVERY - RAID on cerium is OK: OK: no disks configured for RAID [11:23:02] RECOVERY - DPKG on cerium is OK: All packages OK [11:23:02] PROBLEM - Host xenon is DOWN: PING CRITICAL - Packet loss = 100% [11:23:02] RECOVERY - check if salt-minion is running on cerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:23:11] RECOVERY - check configured eth on cerium is OK: NRPE: Unable to read output [11:23:43] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet has 1 failures [11:25:04] hashar: mediawiki-vendor-integration tests run for nearly six minutes -- is it not possible to skip them on changes that aren't related? [11:25:20] Warning: Duplicate definition found for service 'etherpad.wikimedia.org' on host 'zirconium' :-( [11:25:34] I wonder for how long icinga has not been picking up changes... 
[11:25:52] <_joe_> akosiaris: a few days tops [11:26:09] <_joe_> I've been doing changes on icinga until thursday and I keep an eye on that [11:26:13] too much already ... [11:26:52] that does it... I 'll write an icinga check to check icinga today... [11:27:18] <_joe_> akosiaris: that's because people think your work ends with puppet-merge [11:27:28] <_joe_> but yes, a check may help us [11:27:29] !log ori Synchronized php-1.25wmf8/includes/Import.php: Icc19961fd: 'Debugging statements to try to diagnose bug 40009' (duration: 00m 08s) [11:27:32] Logged the message, Master [11:27:57] <_joe_> akosiaris: also, it could be some form of puppet failure here [11:28:06] <_joe_> and I could avoid that with naggen [11:28:11] RECOVERY - Host xenon is UP: PING OK - Packet loss = 0%, RTA = 3.22 ms [11:30:21] PROBLEM - check configured eth on xenon is CRITICAL: Connection refused by host [11:30:21] PROBLEM - check if salt-minion is running on xenon is CRITICAL: Connection refused by host [11:30:32] PROBLEM - puppet last run on xenon is CRITICAL: Connection refused by host [11:30:52] PROBLEM - DPKG on xenon is CRITICAL: Timeout while attempting connection [11:31:01] PROBLEM - check if dhclient is running on xenon is CRITICAL: Timeout while attempting connection [11:31:14] PROBLEM - Disk space on xenon is CRITICAL: Timeout while attempting connection [11:31:14] PROBLEM - RAID on xenon is CRITICAL: Timeout while attempting connection [11:31:22] PROBLEM - SSH on xenon is CRITICAL: Connection timed out [11:33:21] RECOVERY - SSH on xenon is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:33:31] RECOVERY - NTP on praseodymium is OK: NTP OK: Offset -0.00564622879 secs [11:41:01] RECOVERY - NTP on cerium is OK: NTP OK: Offset -0.04714632034 secs [11:42:11] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:43:03] PROBLEM - NTP on xenon is CRITICAL: NTP CRITICAL: No response from NTP server [11:49:37] 
(03PS1) 10Filippo Giunchedi: rename db1017 into graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/173789 [11:51:33] RECOVERY - Disk space on xenon is OK: DISK OK [11:51:33] RECOVERY - RAID on xenon is OK: OK: no disks configured for RAID [11:51:52] RECOVERY - check if salt-minion is running on xenon is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:51:52] RECOVERY - check configured eth on xenon is OK: NRPE: Unable to read output [11:52:23] RECOVERY - DPKG on xenon is OK: All packages OK [11:52:31] RECOVERY - check if dhclient is running on xenon is OK: PROCS OK: 0 processes with command name dhclient [12:04:12] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:05:32] (03PS2) 10Filippo Giunchedi: rename db1017 into graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/173789 [12:05:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] rename db1017 into graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/173789 (owner: 10Filippo Giunchedi) [12:08:33] (03PS1) 10Filippo Giunchedi: eqiad: rename db1017 to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/173790 [12:08:56] (03PS2) 10Filippo Giunchedi: eqiad: rename db1017 to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/173790 [12:09:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] eqiad: rename db1017 to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/173790 (owner: 10Filippo Giunchedi) [12:09:41] RECOVERY - NTP on xenon is OK: NTP OK: Offset -0.01121199131 secs [12:12:42] <_joe_> akosiaris: did you unbreak icinga? [12:13:20] _joe_: not yet... fighting with it... something really weird is going on... [12:13:31] <_joe_> akosiaris: can I help maybe? [12:13:42] it started complaining the main config file name looks suspicious [12:14:05] you might... I 've narrowed it down to something in puppet_services.cfg [12:14:09] not sure what yet... 
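The duplicate akosiaris tracks down shortly after this comes from role/etherpad.pp declaring two monitor_service resources with the same description; icinga keys a service on host plus description, so the generated stanzas collide. Schematically (the check commands here are invented for illustration):

```puppet
# Before the dedup patch: unique Puppet resource titles, but identical
# descriptions -> two identically named icinga services on zirconium,
# hence "Duplicate definition found for service 'etherpad.wikimedia.org'".
monitor_service { 'etherpad-http':
    description   => 'etherpad.wikimedia.org',
    check_command => 'check_http_url!etherpad.wikimedia.org!/',
}
monitor_service { 'etherpad-https':
    description   => 'etherpad.wikimedia.org',   # duplicate description
    check_command => 'check_https_url!etherpad.wikimedia.org!/',
}
# The fix is simply to make each description distinct, e.g.
# 'etherpad.wikimedia.org' and 'etherpad.wikimedia.org HTTPS'.
```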
[12:14:26] <_joe_> icinga -v /etc/icinga/icinga.cfg seems to say it's all right [12:14:38] now... wait a bit [12:14:43] (03PS1) 10Alexandros Kosiaris: Fixes for raid0-lvm partman config [puppet] - 10https://gerrit.wikimedia.org/r/173792 [12:14:50] I am repopulating puppet_services.cfg [12:14:58] <_joe_> how? [12:15:01] well running puppet and puppet does that but anyway... [12:15:22] I emptied the file on purpose... [12:15:42] <_joe_> ok [12:15:48] <_joe_> if we have duplicate defs [12:16:01] * YuviPanda has enjoyed the shinken approach, which ends up with far less total amount of config files [12:16:05] <_joe_> I think the only place to look for them is the puppet database [12:16:53] I think I found something weird as well in the puppet tree [12:16:59] but it should have bitten us long ago [12:17:09] <_joe_> akosiaris: maybe it's my doing [12:17:10] manifests/role/etherpad.pp [12:17:16] <_joe_> oh no ok [12:17:18] the two monitor_service resources [12:17:26] got the same description [12:17:32] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [12:17:49] !log final reboot for xenon, cerium, praseodymium after a dist-upgrade -y [12:17:51] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [12:17:57] Logged the message, Master [12:18:01] PROBLEM - Host xenon is DOWN: PING CRITICAL - Packet loss = 100% [12:18:09] <_joe_> akosiaris: mmmh [12:18:32] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [12:18:32] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [12:18:32] RECOVERY - Host xenon is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [12:18:45] ok, the first error is back now [12:18:53] Warning: Duplicate definition found for service 'etherpad.wikimedia.org' on host 'zirconium' (config file '/etc/icinga/puppet_services.cfg' [12:18:56] fixing that [12:20:10] (03CR) 10Alexandros Kosiaris: [C: 032] Fixes for raid0-lvm partman config [puppet] - 10https://gerrit.wikimedia.org/r/173792 
(owner: 10Alexandros Kosiaris) [12:22:53] (03PS1) 10Alexandros Kosiaris: Deduplicate etherpad's monitor_service checks [puppet] - 10https://gerrit.wikimedia.org/r/173793 [12:23:50] <_joe_> akosiaris: ok, if having the same description is a problem, our define is broken [12:24:04] (03CR) 10Giuseppe Lavagetto: [C: 031] Deduplicate etherpad's monitor_service checks [puppet] - 10https://gerrit.wikimedia.org/r/173793 (owner: 10Alexandros Kosiaris) [12:25:46] (03CR) 10Alexandros Kosiaris: [C: 032] Deduplicate etherpad's monitor_service checks [puppet] - 10https://gerrit.wikimedia.org/r/173793 (owner: 10Alexandros Kosiaris) [12:26:27] hmmm [12:26:29] Alexandros Kosiaris^O: Deduplicate etherpad's monitor_service checks (40236f3) [12:26:34] notice the ^O [12:26:43] and then... [12:26:45] From https://gerrit.wikimedia.org/r/p/operations/puppet [12:26:45] c7832b8..5a0c2d1 production -> origin/production [12:26:45] *** Please tell me who you are. [12:26:49] interesting... [12:27:02] gerrit ? dafuq are you doing ? [12:34:27] _joe_: yes you might be of help after all [12:34:37] so after merging the above fix [12:34:42] ***> The name of the main configuration file looks suspicious... [12:34:53] <_joe_> oh that [12:35:01] <_joe_> I've already seen that [12:35:14] <_joe_> 1 sec [12:35:19] and mv puppet_services.cfg keep.cfg ; grep -v servicegroups keep.cfg > puppet_services.cfg [12:35:22] fixed it [12:35:36] so the problem is the empty servicegroups directive on every single entry [12:36:15] <_joe_> mmmh [12:36:28] <_joe_> not sure I understood that [12:36:52] look at the keep.cfg file at neon:/etc/icinga/keep.cfg [12:37:05] every single stanza has an empty servicegroups directive... [12:37:10] something is wrong there [12:37:17] now icinga's reporting is awful [12:37:20] <_joe_> who creates that file? [12:37:25] nobody [12:37:32] I did just now.. [12:37:50] we can remove it, all I wanted is a copy of puppet_services.cfg [12:38:10] it is not even parsed or something. 
Next time I will call it lala.cfg :P [12:38:11] <_joe_> ok so, why are servicegroups empty? [12:38:18] <_joe_> ahah ok [12:38:26] that is what I am searching now ... [12:38:32] <_joe_> this is most likely a bug in naggen [12:38:39] <_joe_> lemme check that [12:38:42] maybe some hiera change ? [12:38:52] <_joe_> akosiaris: no, I just rechecked that [12:38:58] <_joe_> the old logic was [12:39:14] <_joe_> https://github.com/wikimedia/operations-puppet/blob/8918dd86536e3e9f4fe17bd2934820b0693e0290/manifests/nagios.pp [12:39:44] <_joe_> so only monitor_service instances with a) nagios_group set globally or b) $group set explicitly [12:39:48] <_joe_> would get one [12:39:55] <_joe_> I suggest a small change [12:40:08] $group = hiera('nagios_group', undef), [12:40:10] right ? [12:40:11] <_joe_> but lemme commit another change first [12:40:31] <_joe_> akosiaris: yes that's the translation of the old code you find at the link I posted [12:41:32] (03PS1) 10ArielGlenn: stat1002, 1003 access for bmansurov (rt #8852) [puppet] - 10https://gerrit.wikimedia.org/r/173794 [12:41:38] <_joe_> the oldest version, before any of my puppet3 changes, does the same [12:41:43] <_joe_> https://github.com/wikimedia/operations-puppet/blob/165356c0f16edca88754683830c51d6489df47eb/manifests/nagios.pp#L86 [12:42:15] <_joe_> so, I have an easy fix [12:42:25] <_joe_> but let me do something else first [12:42:44] sure, going to lunch in the meantine :-) [12:43:35] (03PS2) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 [12:44:41] (03CR) 10ArielGlenn: [C: 032] stat1002, 1003 access for bmansurov (rt #8852) [puppet] - 10https://gerrit.wikimedia.org/r/173794 (owner: 10ArielGlenn) [12:45:36] <_joe_> icinga.cfg:cfg_file=/etc/nagios/puppet_servicegroups.cfg [12:45:40] * _joe_ facepalms [12:45:48] <_joe_> notice the directory [12:46:13] (03PS3) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] 
- 10https://gerrit.wikimedia.org/r/173322 [12:48:05] (03PS1) 10ArielGlenn: Revert "stat1002, 1003 access for bmansurov (rt #8852)" [puppet] - 10https://gerrit.wikimedia.org/r/173798 [12:48:44] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [12:48:55] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: puppet fail [12:50:50] ignore that, I'm not reverting it (I don't think) [12:51:10] unless that change somehow broke puppet but I don't see how [12:56:45] (03CR) 10QChris: "Changes in aggregator repository have been merged." [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [13:08:00] apergos: search outage on en.wiki [13:08:01] An error has occurred while searching: Pool queue is full [13:08:20] ugh [13:08:37] probably should poke ^demon|away and manybubbles too :) [13:08:44] <_joe_> matanya: no [13:08:53] <_joe_> matanya: starting on wednesday, maybe [13:09:03] means ? [13:09:09] still lsearhd ? [13:09:20] * lsearchd [13:09:26] <_joe_> yes [13:09:48] ok, so whoever should address this, FYI :) [13:10:20] <_joe_> matanya: btw cannot reproduce [13:10:34] I was about to ask if it's the same on retry [13:10:34] every 5 or so searches now [13:10:40] mm [13:12:17] and now totally fine [13:12:20] <_joe_> I still need to get one [13:12:28] miracles [13:13:26] I'll take a short hiccup over a lasting problem any day [13:15:06] <_joe_> I'll grab some lunch as well, bbl [13:15:19] thanks both [13:35:35] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:38:36] !log ran puppetstoredconfigclean.rb on db1017, it must have been missed in the rename [13:38:41] Logged the message, Master [13:39:30] anything removed by that which shouldn't be, will get put back on the graphite puppet run [13:41:45] (03Abandoned) 10ArielGlenn: Revert "stat1002, 1003 access for bmansurov (rt #8852)" [puppet] -
10https://gerrit.wikimedia.org/r/173798 (owner: 10ArielGlenn) [13:42:44] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:49:51] (03PS1) 10ArielGlenn: stat1003 access for rmoen (rt #8870) [puppet] - 10https://gerrit.wikimedia.org/r/173811 [13:50:18] (03PS2) 10ArielGlenn: stat1003 access for rmoen (rt #8870) and update ssh key [puppet] - 10https://gerrit.wikimedia.org/r/173811 [13:52:38] (03CR) 10ArielGlenn: "don't merge til I have checked in with the user about the key update" [puppet] - 10https://gerrit.wikimedia.org/r/173811 (owner: 10ArielGlenn) [14:02:19] I need to step out for a bit, back in a little while [14:20:31] <^demon|away> Freaking lsearchd :( [14:20:34] <^demon|away> What's up? [14:21:48] <^d> Oh, pool queue. [14:31:02] !log Jenkins/Zuul: disconnected/reconnected Jenkins Gearman client [14:31:06] Logged the message, Master [14:35:22] ^d: what does it mean anyway? the queue of the pool is full ? [14:35:37] <^d> matanya: Yep, too many people trying to go swimming :) [14:35:47] :) [14:36:16] <^d> https://wikitech.wikimedia.org/wiki/PoolCounter [14:38:46] thanks ^d [14:38:54] <^d> yw [15:01:07] (03PS4) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 [15:01:09] (03PS1) 10Giuseppe Lavagetto: icinga: give a non-null default group to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173825 [15:01:11] (03PS1) 10Giuseppe Lavagetto: monitoring: add config class [puppet] - 10https://gerrit.wikimedia.org/r/173826 [15:04:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:17:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:20:54] (03PS5) 10Giuseppe Lavagetto: nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 [15:47:22] <^d> 
Reedy: You about? [15:49:15] (03CR) 10Giuseppe Lavagetto: [C: 032] nagios: convert monitor_service to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173322 (owner: 10Giuseppe Lavagetto) [15:50:04] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/#/c/173825/ [15:50:15] <_joe_> this should fix the absent servicegroup somehow [15:59:03] <^d> jamesofur: I left some comments on your patch :) [16:00:02] oh? Must have been recent [16:00:05] anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T1600). [16:00:05] * jamesofur looks [16:00:08] <^d> Like, 10 minutes ago :) [16:00:15] <^d> I got swat jouncebot [16:00:16] ah, yup, I just got to office [16:01:16] (03CR) 10Chad: [C: 032] Send more update jobs to Elasticsearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173347 (owner: 10Manybubbles) [16:01:48] (03Merged) 10jenkins-bot: Send more update jobs to Elasticsearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173347 (owner: 10Manybubbles) [16:03:04] ^d: totally expensive, but necessary from what I can tell, this is stuff mostly reusing another script that exists already (but allowing me to do the mainspace) and this script itself existed in the past. I can try to make these changes if we need too :-/ but I need to get this out like... now [16:03:31] !log demon Synchronized wmf-config/CirrusSearch-common.php: more jobs (duration: 00m 04s) [16:03:34] creating voter lists has always been expensive sadly :( sometimes running ages [16:03:35] Logged the message, Master [16:04:12] (which is why I think they originally wrote it to use the slave for reading) [16:04:20] <^d> *nod* [16:04:21] <^d> True. [16:05:43] <^d> In that case, just add the wfWaitForSlaves() call after the insert() and I think it'll be fine. 
[16:05:49] * jamesofur nods [16:08:51] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: give a non-null default group to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/173825 (owner: 10Giuseppe Lavagetto) [16:09:05] * ^d looks for things to stab jenkins with [16:09:27] !log Renamed job mediawiki-vendor-integration to mediawiki-phpunit {{bug|72787}} [16:09:29] Logged the message, Master [16:11:18] !log demon Synchronized php-1.25wmf7/extensions/CirrusSearch: (no message) (duration: 00m 05s) [16:11:21] Logged the message, Master [16:11:22] ^d: https://gerrit.wikimedia.org/r/#/c/172889/ look right? [16:11:28] !log demon Synchronized php-1.25wmf8/extensions/CirrusSearch: (no message) (duration: 00m 04s) [16:11:30] Logged the message, Master [16:11:45] (and what's the easiest way to do that for the cherry picks? Re cherry pick or amend each one quickly?) [16:12:19] <^d> neither really. whichever you can do easiest. [16:12:29] <^d> merged into master. [16:12:36] for something this quick the amend is pretty quick, /fixes [16:12:40] appreciate it! [16:12:54] <^d> well, +2'd. [16:12:58] <^d> :) [16:13:01] heh [16:13:02] <^d> jenkins shall merge sometime. [16:13:06] it shall [16:13:16] our robot overlords [16:18:45] <^d> jamesofur: I'm doing the cherry picks to core for you :) [16:19:17] ^d: why thank you :) if you can point me to anything that I did wrong I'd appreciate it ;) [16:19:35] <^d> You did the cherry picks to the branches just right, no worries there. [16:19:53] <^d> I'm just doing the submodule updates for core.
[16:19:59] <^d> (cherry pick was bad word, sorry) [16:20:25] ahh cool [16:20:41] no worries ;) I generally assume I only know a bit of it and likely screwed something up ;) [16:29:56] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [16:36:16] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:38:05] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 1.00% above the threshold [60.0] [16:38:06] !log demon Synchronized php-1.25wmf7/extensions/SecurePoll/: (no message) (duration: 00m 05s) [16:38:09] Logged the message, Master [16:38:17] !log demon Synchronized php-1.25wmf8/extensions/SecurePoll/: (no message) (duration: 00m 05s) [16:38:20] Logged the message, Master [16:38:21] <^d> jamesofur: Ok you're all live [16:38:26] <3 [16:38:31] thanks much [16:38:58] !log upload etherpad-lite_1.4.1-1 on apt.wikimedia.org [16:39:01] Logged the message, Master [16:42:24] <^d> jamesofur: yw [16:44:36] springle: are you on top of the labs replication issues that came up over the weekend? [16:50:38] Hm… Coren, same question? [16:51:32] andrewbogott: I'm next to it, giving it a frowny look - but I wanted Sean to opine first before I started trying to mess with it. [16:51:50] I know he's been battling some issues with replication and I don't want to risk destroying data he needs. [16:52:38] Coren: OK… what about the bugs about missing tables &c? [16:52:47] Or is that somehow the same issue? [16:53:08] It's not; that just needs an update to maintain-replicas to add the new stuff. That's on my todo for today. [16:53:34] ok! [16:54:27] I was having a hard time deciding if "Um, it's the weekend, take it easy" was a fair or unfair response to those email threads. [16:55:11] andrewbogott: I have made some effort to actually take days off on weekend some of the time lately; otherwise ima burn out. 
[16:56:22] Well, clearly you get to take weekends off :) I'm just wondering if, in the long run, a weekend outage (or replag, or whatever) is part of the expected and advertised toollabs package, or if we need to figure out some way to respond to them. [16:57:13] * YuviPanda should spend some time learning about our replica setup as well [16:57:14] andrewbogott: It might be reasonable to talk to Mark and discuss wether it makes sense to stagger our work weeks now that there is three of us. [16:57:40] Well, that or officially announce that tools support is bank-hours only :) [16:57:47] YuviPanda: so should I [16:58:00] We can't do 24/24, but 7/7 coverage seems like a reasonable objective. [16:58:04] yeah [16:58:11] * YuviPanda has been trying to take weekends off too [16:58:31] YuviPanda: Right, we just need to define "weekend" right. :-) [16:58:38] hehe [16:58:39] true [16:58:41] Maybe we should have a 'how to fix common labs issues' session in January so that we can get other Ops on board with this stuff. (Well, and also Yuvi and me) [16:58:46] yeah [16:58:57] I don't know enough about SGE, for instance [16:59:27] Sounds good. I'll probably have a ~1h primer for the gridengine for all staffers who want to learn. [16:59:37] apergos: hey! I was told you created the deployment-lucid-salt instance. is it still being used? can't ssh... [17:03:55] Coren, YuviPanda, I'd appreciate a security review of https://gerrit.wikimedia.org/r/#/c/173066/ [17:04:07] Not urgent though [17:07:30] (03CR) 10Yuvipanda: "More minor nits - use single quotes on python strings unless required." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [17:15:58] (03PS1) 10Awight: Set new banner dispatcher vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 [17:22:59] (03PS4) 10Yuvipanda: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [17:24:35] (03PS5) 10Yuvipanda: eventlogging: couple less tightly to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [17:24:41] ori: ^ I've changed it to use the feature flag for now, to be consistent with other merged patches. we can change later if necessary. [17:24:52] andrewbogott: Coren can you +1? ^ [17:25:27] (03PS3) 10Yuvipanda: memcached: Make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/173510 [17:26:43] YuviPanda: um, multitasking, but yes, shortly [17:26:45] (03CR) 10Yuvipanda: [C: 032] memcached: Make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/173510 (owner: 10Yuvipanda) [17:26:48] andrewbogott: ty [17:27:29] wtf puppet merge [17:27:33] just keeps printing 'y' on my terminal [17:28:01] Something is piping 'yes' into something else that died. [17:28:17] hmm [17:28:20] worked fine now [17:37:14] hmmm, uhhhhh [17:37:25] analytics1003 is down again [17:37:29] [200423.847532] CPU: 5 PID: 4033 Comm: kafkatee Tainted: G I 3.13.0-39-generic #66-Ubuntu [17:43:44] ori: kern.coredump = 0 is a BSD thing, not present on Linux :( [17:43:50] * YuviPanda lets current workaround stay [17:44:13] "tainted"? [17:44:18] i know, right [17:44:37] weird that it points out kafkatee...i'm not sure what kernel modules or hackery kafkatee could do [17:44:54] ... that normally means you have a kernel module that's not openseource. wth did it come from? 
[17:45:04] Coren, context: i upgraded this machine to Trusty last week [17:45:06] <_joe_> guys, that's a tag [17:45:09] its been being weird since then [17:45:13] <_joe_> "Tainted: G" [17:45:15] also, this is syslog [17:45:24] sorry [17:45:25] haha [17:45:28] cisco [17:45:35] not syslog [17:45:46] (they both start with the same syllable? not sure why I typed that) [17:45:48] * YuviPanda makes Syslog branded servers [17:46:00] <_joe_> and bwt ottomata [17:46:15] so much for being off, _joe_ :) [17:46:25] <_joe_> 'G' means no non-gpl modules are loaded [17:46:25] ok, so, Tainted: G is not relevant? [17:46:35] yeah, just googled for that [17:46:50] that's just informative? [17:47:06] so, maybe this output is just showing that kafkatee running on CPU 5 caused some kernel crash [17:47:09] i will gist the whole output [17:47:09] <_joe_> and Comm: kafkatee doesn't mean it's a cpu module :) [17:47:18] <_joe_> ottomata: exactly [17:47:24] https://gist.github.com/anonymous/c77ed0d899d9af85e787 [17:47:55] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 3.72 ms [17:48:20] do we have any documentation for 'how do I access a prod machine' for people? [17:48:27] people -> WMF employees who just got access to things [17:48:36] <_joe_> ottomata: Oops: 0000 [#1] SMP [17:49:09] ? [17:50:26] bah, i can't schedule downtime in icinga... [17:50:29] Not Authorized [17:50:31] <_joe_> ottomata: this is a kernel oops, happened in executing kafkatee, and you have the call trace [17:50:31] 'grrr [17:50:46] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:56] _joe_: aye, i sent this off to magnus a bit ago [17:54:46] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:55:24] _joe_: can you do things in icinga? or am i the only one having problems [17:55:33] e.g. 
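A side note on the "Tainted: G ... I" header _joe_ decodes above: the letters render the kernel's taint bitmask, which is also exported numerically. A minimal sketch of reading it and decoding bit 0 (assumes a Linux /proc; falls back to 0 elsewhere):

```shell
# 'Tainted: G ... I' in an oops is the taint bitmask rendered as
# letters. The mask is exported at /proc/sys/kernel/tainted; fall back
# to 0 when the file is unavailable (non-Linux, restricted container).
taint=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo 0)

# Bit 0 is TAINT_PROPRIETARY_MODULE: 'P' when set, 'G' when clear
# (all loaded modules are GPL-compatible). The 'I' seen in the gist is
# bit 11, TAINT_FIRMWARE_WORKAROUND (platform firmware bug).
if [ $(( taint & 1 )) -eq 0 ]; then flag=G; else flag=P; fi
echo "taint=$taint module_flag=$flag"
```

As the channel concludes, the flags are purely informative: "Comm: kafkatee" names the process running when the oops fired, not a culprit module.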
try to access the 'View Config' link at the bottom of the left nav [17:55:59] i'd log out and back in, but i'm not sure how [17:57:13] !log starting upgrade to trusty of analytics1013 (having trouble scheduling downtime in icinga right now) [17:57:16] Hi, anyone with knowledge of RL and Varnish caching wanna look over a significant change in how CentralNotice chooses banners? Enabling change is https://gerrit.wikimedia.org/r/#/c/173220/, lots of tests in recent history in CN master, more tests coming... thanks in advance :) [17:57:20] Logged the message, Master [17:58:06] YuviPanda: ori: anyone else: ^ ? [17:58:20] you're looking for bblack :) [17:59:59] YuviPanda: Are you still waiting on reviews for anything? Or did I miss my chance? [18:00:13] andrewbogott: https://gerrit.wikimedia.org/r/#/c/173758/ [18:00:14] :) [18:00:18] YuviPanda: ah thanks! [18:00:27] bblack: ^ ? [18:02:02] Whoah, you have to actually invoke the hiera() function to do lookups? Somehow that's not what I expected... [18:02:19] andrewbogott: no, you do that for just 'raw' variables [18:02:24] andrewbogott: for regular params you don't [18:02:31] andrewbogott: and since has_ganglia is used in a bunch of places... [18:02:35] I'm not sure I understand the difference [18:03:00] Lookups only happen for params, but not for $vars? [18:03:09] (If so, then I have to rewrite my patch yet again) [18:04:22] andrewbogott: for vars I think you've to explicitly set them? [18:04:29] ok [18:04:32] andrewbogott: _joe_ would know better, but params are class params, and vars you define inside... [18:08:25] Hi RoanKattouw..! Somehow I thought you weren't here... wanna peek at https://gerrit.wikimedia.org/r/#/c/173220/ , especially for potential caching/infrastructure-related issues? part of a series of changes for making the final CentralNotice banner selection on the client... see preceding changes for tests and related stuff...
[18:09:50] AndyRussG: Adam merged it 10 mins ago but I will look [18:10:19] RoanKattouw: thanks! Yes it is merged, but it's a no-op until we switch a config variable [18:10:24] AndyRussG: Right now though I'm in a meeting and right after that I'm getting on a plane (with wifi though) so it'll be an hour or two [18:11:18] RoanKattouw: woo, fantastic, thanks! Yeah understandably airports and planes would come first, fer sure :) [18:11:21] andrewbogott: that patch is already live on deployment-prep tho :) [18:14:28] YuviPanda: yes I did and no you can't unless you get in as root [18:14:43] ssh to that instance doesn't work cause it can't mount the keys (wrong nfs version, thank you lucid) [18:14:46] apergos: hmm, is it still needed? [18:15:13] I would like to keep it in play because until the last lucid server is dead that's how I'll test salt upgrades in labs [18:15:16] PROBLEM - Host db1017 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:16] PROBLEM - DPKG on analytics1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:15:45] apergos: ah, hmm, ok. [18:15:57] it's got puppet failures and no diamond, so shows up red on shinken [18:16:00] is ok, I guess [18:16:10] well, I don't know if it has puppet failures :) [18:16:12] definitely no diamond [18:16:18] that's why I asked andrew to create the instance. we have two servers left I think with lucid so once they go... [18:16:25] oh I know it has failures. [18:16:40] failure number 1: nfs mount of public keys share. [18:16:54] failure number two: can't set up the ssh stuff cause nfs mount. [18:17:05] don't remember the other failures :-D [18:17:05] (db1017 down is me btw) [18:17:11] bd808: i submitted a PR to git-deploy/trebuchet over the weekend, don't know if you saw my github ping [18:17:21] um I thought... db1017 no longer exists? or? [18:17:25] apergos: heh :) [18:17:31] cscott: I did indeed. [18:17:41] apergos: do we have an ETA on lucid dying?
[18:17:43] like, [18:17:49] 'months', 'years', 'days', 'hours' [18:17:49] apergos: yes renamed today [18:18:13] yeah that's what I mean, renamed so nothing should be answering to db1017 now ... ? [18:18:31] YuviPanda: I don't know, nickel dying was a help, think sodium might run it [18:18:36] that would be the worst [18:18:41] heh [18:18:42] YuviPanda: Mark said it will be gone before the LTS ends (March 2015 ish) [18:18:47] aaah [18:18:54] apergos: sodium does :) [18:19:16] For sodium, it's dependent on codfw. [18:19:40] apergos: not sure why icinga didn't pick up the new host, but anyways it has the same address [18:20:29] same IP I know, I had to stomp on the puppet stored resources to clean that up [18:20:38] but name should be gone from everything [18:20:58] chasemp: btw, me and legoktm are going to finish up ircnotifier today/tomorrow, use that instead of ircecho for shinken, and then migrate the other bots to it. [18:21:09] we're kind of maintainers for almost all the IRC bots somehow. 
[18:25:39] PROBLEM - puppet last run on analytics1013 is CRITICAL: Connection refused by host [18:27:28] PROBLEM - check configured eth on analytics1013 is CRITICAL: Connection refused by host [18:27:28] PROBLEM - check if salt-minion is running on analytics1013 is CRITICAL: Connection refused by host [18:27:35] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: Connection refused by host [18:27:39] PROBLEM - check if dhclient is running on analytics1013 is CRITICAL: Connection refused by host [18:27:46] PROBLEM - Hadoop NodeManager on analytics1013 is CRITICAL: Connection refused by host [18:27:55] PROBLEM - Disk space on analytics1013 is CRITICAL: Connection refused by host [18:27:55] PROBLEM - RAID on analytics1013 is CRITICAL: Connection refused by host [18:28:58] gone, back in a little while again [18:29:18] (03CR) 10Andrew Bogott: [C: 031] "<| |> ~> creeps me out, but this seems fine :)" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:29:20] (03PS3) 10Ori.livneh: Allow multiple instances [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 [18:30:35] RECOVERY - check configured eth on analytics1013 is OK: NRPE: Unable to read output [18:30:36] RECOVERY - check if salt-minion is running on analytics1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:30:36] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [18:30:36] RECOVERY - check if dhclient is running on analytics1013 is OK: PROCS OK: 0 processes with command name dhclient [18:30:44] (03CR) 10Ori.livneh: [C: 04-1] "Not sold on the feature flag approach, would like to think it over" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:30:55] RECOVERY - Hadoop NodeManager on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:30:56] 
RECOVERY - Disk space on analytics1013 is OK: DISK OK [18:30:57] RECOVERY - RAID on analytics1013 is OK: OK: no disks configured for RAID [18:31:46] (03CR) 10Yuvipanda: "Alright. It's still cherry-picked on betacluster, so 'tis ok :) Should we start an ops@ thread?" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:32:23] (03CR) 10Ori.livneh: "Yes -- would you mine describing the approach in an email to the list?" [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:33:08] YuviPanda: god speed then [18:33:21] (03CR) 10Yuvipanda: "yup, doing now." [puppet] - 10https://gerrit.wikimedia.org/r/173758 (owner: 10Ori.livneh) [18:38:45] RECOVERY - DPKG on analytics1013 is OK: All packages OK [18:41:43] (03CR) 10Ottomata: "Cool. Couple of comments inline." (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [18:43:22] (03CR) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [18:43:42] (03CR) 10CSteipp: [C: 031] "Shouldn't be too significant of a performance impact, and yay for cutting down on work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173519 (owner: 10Hoo man) [18:44:02] (03PS4) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 [18:44:07] (03CR) 10Yuvipanda: Allow sshd to pull ssh keys from ldap on Trusty. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [18:48:40] (03PS1) 10Yuvipanda: diamond: Explicitly set method to use for determining hostname [puppet] - 10https://gerrit.wikimedia.org/r/173870 [18:49:32] (03PS2) 10Awight: Set new banner dispatcher vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 [18:50:05] PROBLEM - puppet last run on analytics1013 is CRITICAL: Timeout while attempting connection [18:51:26] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [18:51:40] andrewbogott: also added you to https://gerrit.wikimedia.org/r/#/c/173870/ [18:53:00] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms [18:53:38] (03CR) 10Andrew Bogott: [C: 031] "This seems hard to argue" [puppet] - 10https://gerrit.wikimedia.org/r/173870 (owner: 10Yuvipanda) [18:54:08] (03CR) 10Yuvipanda: [C: 032] diamond: Explicitly set method to use for determining hostname [puppet] - 10https://gerrit.wikimedia.org/r/173870 (owner: 10Yuvipanda) [18:55:05] PROBLEM - Hadoop NodeManager on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:55:36] (03PS3) 10Awight: Set new banner dispatcher vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 [18:55:55] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [18:57:06] RECOVERY - puppet last run on analytics1013 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [19:00:32] hmm, jgage, yt? [19:00:41] i'm having trouble starting hadoop workers, looks like a Gelf problem [19:00:57] (03CR) 10AndyRussG: [C: 031] "Cool!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173849 (owner: 10Awight) [19:00:59] ok, I merge a change, and I suddenly get millions of diamond sudo failures...
[19:01:05] but change is completely unrelated. [19:01:11] and I've been getting the sudo failures earlier as well [19:01:13] just less frequently [19:01:16] PROBLEM - puppet last run on analytics1013 is CRITICAL: CRITICAL: Puppet has 2 failures [19:01:16] (failures for ipvsadm) [19:03:35] hmm, they might all just be triggered by the diamond restart [19:04:05] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:07:15] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:09:34] ottomata: hm, gelf problem you say? :( [19:09:55] wooo cronspam! [19:09:56] yes [19:09:57] that's interesting because i didn't have any problems when i rebooted workers last week [19:10:09] ah, well this is post trusty upgrade [19:10:12] oho [19:10:13] but, i can't find json-simple on the classpath [19:10:20] java.lang.NoClassDefFoundError: org/json/simple/JSONValue [19:10:25] so it can't start daemons [19:10:32] hmph weird. in precise it wants libjson-simple-java [19:10:35] lemme look [19:11:07] yeah, it still installed [19:11:08] and the .jar is there [19:11:08] i just don't see how it gets onto the classpath of the hadoop daemons [19:11:09] see analytics1013 [19:11:23] ok [19:12:14] hm symlink breakage [19:12:22] /usr/share/java/json_simple.jar -> json-simple.jar [19:13:12] (03PS1) 10Rush: phab don't try to preview icon/x-icon [puppet] - 10https://gerrit.wikimedia.org/r/173875 [19:14:10] oh [19:14:12] eh? [19:14:38] jgage, where is json_simple.jar used?
jgage: those symlinks seem ok to me [19:15:21] oh [19:16:47] i symlink it into /usr/lib/hadoop/lib/ [19:17:01] i purged and reinstalled the package and the self-referential symlink came back [19:17:08] checking upstream package [19:18:10] (03PS1) 10Kaldari: Set wgMFUseWikibaseDescription to true on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173878 [19:19:08] ah, jgage did the version change? [19:19:09] is that why? [19:19:17] maybe you should symlink without the version name? [19:19:38] probably just to [19:19:40] /usr/share/java/json_simple.jar [19:19:45] or /usr/share/java/json-simple.jar [19:20:23] yeah [19:20:28] they changed from _ to - [19:20:42] aye, but also I see your symlink is /usr/lib/hadoop/lib/json_simple-1.1.jar -> /usr/share/java/json_simple-1.1.jar [19:20:48] maybe instead you should just do [19:21:00] /usr/lib/hadoop/lib/json-simple.jar -> /usr/share/java/json-simple.jar [19:21:06] i see that the _ .jar exists too [19:21:07] yeah, i will fix that [19:21:14] it looks like it is just because it was upgraded [19:21:16] to 1.1.1 [19:22:19] yeah. patch coming in a sec. [19:23:24] (03CR) 10Andrew Bogott: [C: 032] openstack: folsom -> havana as default version [puppet] - 10https://gerrit.wikimedia.org/r/173460 (owner: 10Dzahn) [19:23:37] (03CR) 10Chad: [C: 031] gerrit templates: fix jenkins/lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/173475 (owner: 10Dzahn) [19:25:19] ottomata i think i should do /usr/lib/hadoop/lib/json_simple.jar -> /usr/share/java/json_simple.jar so that it works on precise + trusty and trusty contains json_simple.jar -> json-simple.jar [19:26:21] (03PS1) 10Gage: hadoop: gelf: libjson: support precise + trusty [puppet] - 10https://gerrit.wikimedia.org/r/173883 [19:27:03] "[Diffusion] [Committed] rOPSPUPPETb49f9877f433: Merge "openstack: folsom -> havana as default version" into production" <- what the heck is this?
[19:27:05] ok [19:27:12] (03PS2) 10Ori.livneh: Update EventLogging listener IP for labs [puppet] - 10https://gerrit.wikimedia.org/r/173352 [19:27:17] not sure of the difference between _ and - in this case, but they both seem to be present [19:27:17] so sure [19:27:18] (03CR) 10Ori.livneh: [C: 032 V: 032] Update EventLogging listener IP for labs [puppet] - 10https://gerrit.wikimedia.org/r/173352 (owner: 10Ori.livneh) [19:27:37] jgage: thanks. [19:27:37] (03PS2) 10Ottomata: hadoop: gelf: libjson: support precise + trusty [puppet] - 10https://gerrit.wikimedia.org/r/173883 (owner: 10Gage) [19:27:44] (03CR) 10Ottomata: [C: 032 V: 032] hadoop: gelf: libjson: support precise + trusty [puppet] - 10https://gerrit.wikimedia.org/r/173883 (owner: 10Gage) [19:27:54] yay [19:27:58] (hopefully) [19:29:29] (03PS1) 10Yuvipanda: shinken: Increase thresholds for free space warnings [puppet] - 10https://gerrit.wikimedia.org/r/173887 [19:29:35] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:29:45] RECOVERY - puppet last run on analytics1013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:29:45] RECOVERY - Hadoop NodeManager on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:29:52] yay [19:29:55] phew! 
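The version-agnostic symlink scheme jgage and ottomata settle on above can be checked end to end without touching the real /usr. A minimal sketch in a throwaway directory (the 1.1.1 jar version matches the upgrade discussed; the exact paths are illustrative):

```shell
# Throwaway tree mirroring the layout under discussion.
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/share/java" "$tmp/usr/lib/hadoop/lib"
touch "$tmp/usr/share/java/json-simple-1.1.1.jar"

# Trusty ships json-simple.jar; a compat link keeps the precise-era
# json_simple.jar name working, and the hadoop classpath link points
# at the version-less name so package upgrades can't strand it.
ln -s json-simple-1.1.1.jar "$tmp/usr/share/java/json-simple.jar"
ln -s json-simple.jar       "$tmp/usr/share/java/json_simple.jar"
ln -s ../../../share/java/json_simple.jar \
      "$tmp/usr/lib/hadoop/lib/json_simple.jar"

# A dangling or self-referential link would show up here:
dangling=$(find "$tmp/usr" -xtype l | wc -l | tr -d ' ')
target=$(basename "$(readlink -f "$tmp/usr/lib/hadoop/lib/json_simple.jar")")
echo "dangling=$dangling target=$target"
```

The `find -xtype l` test is the quick way to catch the self-referential breakage seen after the purge/reinstall, since it flags any symlink that fails to resolve.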
[19:31:20] phew, ok aside from that, this upgrade went pretty easy
[19:31:29] i'm going to give that 5 or 10 minutes to chill and then move on with more workers
[19:31:38] awesome
[19:33:15] PROBLEM - puppet last run on search1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:33:47] (Abandoned) Andrew Bogott: Move openstack_version and use_neutron into hiera [puppet] - https://gerrit.wikimedia.org/r/173294 (owner: Andrew Bogott)
[19:36:58] who was used google webmaster tools before
[19:36:59] has
[19:37:06] (PS4) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[19:37:20] (PS5) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[19:39:08] (PS6) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[19:44:15] (CR) Aklapper: [C: 1] "List looks good to me (removes the two offending entries from the default list of files.viewable-mime-types and files.image-mime-types)" [puppet] - https://gerrit.wikimedia.org/r/173875 (owner: Rush)
[19:46:44] (PS2) Rush: phab don't try to preview icon/x-icon [puppet] - https://gerrit.wikimedia.org/r/173875
[19:47:28] (PS2) Yuvipanda: shinken: Increase thresholds for free space warnings [puppet] - https://gerrit.wikimedia.org/r/173887
[19:51:35] RECOVERY - puppet last run on search1015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:58:48] ebernhardson: :)
[19:58:50] ebernhardson: so, sudo on labs
[19:58:56] ebernhardson: you need to sudo su first before sudo -u ing
[19:59:10] users can't become other users, you need to become root first
[19:59:15] there's a setting on wikitech to toggle this
[19:59:22] YuviPanda: yea, the annoying thing is it's the inverse of my vagrant
[19:59:28] hahaha
[19:59:41] you can set it projectwide on your project
[19:59:44] under sudo policy
[19:59:48] YuviPanda: so i have to remember to `sudo -u www-data blah blah` on one, and `sudo su www-data -c '...'` on the other
[19:59:51] ahh, that would be better :)
[20:00:05] awight, AndyRussG, ejegg: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T2000). Please do the needful.
[20:00:09] ebernhardson: yeah, it was annoying me too, so I annoyed andrewbogot.t enough until he implemented it :)
[20:02:03] !log starting upgrade of analytics1014 to trusty
[20:02:08] Logged the message, Master
[20:02:33] (PS1) Andrew Bogott: Move the openstack_version setting hiera. [puppet] - https://gerrit.wikimedia.org/r/173904
[20:03:16] (CR) jenkins-bot: [V: -1] Move the openstack_version setting hiera. [puppet] - https://gerrit.wikimedia.org/r/173904 (owner: Andrew Bogott)
[20:04:54] (CR) Giuseppe Lavagetto: "You can either do this (strongly suggested), use individual files for the various hosts, or define a global variable in nodes.pp, which is" [puppet] - https://gerrit.wikimedia.org/r/171741 (owner: GWicke)
[20:05:05] (PS2) Andrew Bogott: Move the openstack_version setting hiera. [puppet] - https://gerrit.wikimedia.org/r/173904
[20:05:12] AndyRussG: So I looked at that commit, the ResourceLoadery parts look pretty straightforward to me. Anything in particular going on there?
[20:05:30] (PS7) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[20:05:52] Hi RoanKattouw... yes, one sec :) I can explain the setup
[20:08:15] PROBLEM - DPKG on analytics1014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:08:18] (PS8) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[20:09:22] RoanKattouw: wrt AndyRussG's patch, the only thing I'm worried about is being very certain that the data module response will not be cached in any surprising way.
[20:09:48] RoanKattouw: ^ yes what awight said :)
[20:10:17] (CR) Aklapper: "Ignore my part about bug-attachment.wikimedia.org, let's just ignore that so it'd still work." [puppet] - https://gerrit.wikimedia.org/r/173483 (owner: 20after4)
[20:10:29] It's sending data in via getScript that we want to be sure doesn't get cached for more than 15 minutes or so (I think it's expected to be less, tho)
[20:10:52] OK let me look at that getScript call, I think I remember seeing one
[20:11:18] RoanKattouw: CNBannerChoiceDataResourceLoaderModule
[20:11:34] AFAICT you are not setting a ?version= query parameter
[20:11:41] Which means RL should give you 5-minute caching headers
[20:12:39] (PS9) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[20:17:26] PROBLEM - puppet last run on analytics1014 is CRITICAL: Connection refused by host
[20:17:44] (PS4) Ori.livneh: memcached: tidy [puppet] - https://gerrit.wikimedia.org/r/171153
[20:19:12] dammit, debug something for 1h, find out it is because you typed 'bytes' when it should be 'byte'
[20:19:13] grr
[20:19:36] PROBLEM - Hadoop DataNode on analytics1014 is CRITICAL: Connection refused by host
[20:19:46] PROBLEM - RAID on analytics1014 is CRITICAL: Connection refused by host
[20:19:46] PROBLEM - check if dhclient is running on analytics1014 is CRITICAL: Connection refused by host
[20:19:46] PROBLEM - Hadoop NodeManager on analytics1014 is CRITICAL: Connection refused by host
[20:19:55] PROBLEM - check if salt-minion is running on analytics1014 is CRITICAL: Connection refused by host
[20:19:56] PROBLEM - Disk space on analytics1014 is CRITICAL: Connection refused by host
[20:20:06] PROBLEM - check configured eth on analytics1014 is CRITICAL: Connection refused by host
[20:20:07] (PS3) Yuvipanda: shinken: Fix freespace warnings [puppet] - https://gerrit.wikimedia.org/r/173887
[20:23:06] RECOVERY - check configured eth on analytics1014 is OK: NRPE: Unable to read output
[20:23:36] RECOVERY - Hadoop DataNode on analytics1014 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[20:23:46] RECOVERY - RAID on analytics1014 is OK: OK: no disks configured for RAID
[20:23:55] RECOVERY - check if dhclient is running on analytics1014 is OK: PROCS OK: 0 processes with command name dhclient
[20:23:56] RECOVERY - check if salt-minion is running on analytics1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:23:56] RECOVERY - Hadoop NodeManager on analytics1014 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[20:23:56] RECOVERY - Disk space on analytics1014 is OK: DISK OK
[20:24:32] (CR) Ori.livneh: [C: 2] memcached: tidy [puppet] - https://gerrit.wikimedia.org/r/171153 (owner: Ori.livneh)
[20:31:24] haha! and now the freespace stuff works
[20:32:15] RoanKattouw: Krinkle: we're not explicitly setting any version param, but when I load it up in the browser it does take a version param for loading modules
[20:32:24] the module I mean
[20:32:59] AndyRussG: The version param is the timestamp of when the module was generated by a load.php request. It's not declared somewhere explicitly.
[20:33:05] Sure but that's for the code, not the data, right?
[20:33:34] However it is important that the code (including any embedded data) will not refresh, unless and only if the logic in the module class allows it to detect a change and give a new timestamp.
[20:34:08] which module is this and how is it loaded?
[20:34:35] RECOVERY - DPKG on analytics1014 is OK: All packages OK
[20:34:37] Krinkle: here's a patch, now near the tip of master, that has it: https://gerrit.wikimedia.org/r/#/c/173220/
[20:34:39] (CR) Yuvipanda: [C: 2] shinken: Fix freespace warnings [puppet] - https://gerrit.wikimedia.org/r/173887 (owner: Yuvipanda)
[20:34:45] PROBLEM - puppet last run on analytics1014 is CRITICAL: Connection refused by host
[20:36:20] Krinkle: RoanKattouw: in CentralNotice, includes/CNBannerChoiceDataResourceLoaderModule
[20:36:26] It's not file-based
[20:36:42] https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/master/includes/CNBannerChoiceDataResourceLoaderModule.php
[20:36:53] rather it dynamically creates some JSON, which is what it sends in via getScript()
[20:37:06] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[20:37:17] AndyRussG: Mind if I do a quick CR?
[20:37:28] Krinkle: please go ahead!
[20:37:55] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[20:38:08] As it stands, awight and ejegg are just deploying to meta, mediawiki and test and aa.wikibooks as we speak
[20:38:09] Avoid computing any data in the constructor method. That's used by the startup module, which pretty much has 0 overlap with the main module request, so there shouldn't be anything in there other than processing the parameters from the resourceModule array (if any)
[20:38:20] Ah, in fact the 'http' member isn't even used?
[20:39:01] Krinkle: ah woops yes that constructor should have been removed :(
[20:39:09] !log ejegg Synchronized php-1.25wmf8/extensions/CentralNotice/: Update CentralNotice for client-side banner choice (duration: 00m 03s)
[20:39:13] Logged the message, Master
[20:42:45] RoanKattouw: Krinkle: thanks! Can you see any other possible issues? Especially, the idea is to avoid inadvertently serving clients stale banner data, though any other issues hiding in there would also be great to hear of :)
[20:42:45] (CR) Ejegg: [C: 2] Set new banner dispatcher vars [mediawiki-config] - https://gerrit.wikimedia.org/r/173849 (owner: Awight)
[20:44:13] Hmm it looks like this module should use getModifiedTimestamp()
[20:44:34] AndyRussG: Krinkle created that API, so maybe he can introduce you to it while I have a meeting
[20:45:34] AndyRussG: Indeed, it's missing the most critical method for ResourceLoader's design. This must not be deployed in its current state.
[20:48:31] !log ejegg Synchronized wmf-config: (no message) (duration: 00m 03s)
[20:48:33] Logged the message, Master
[20:48:35] Krinkle: thank you! The current deploy actually still disables this module via a config variable, except on testwiki and aa.wikibooks (a disabled wiki used for CN testing)
[20:48:50] (PS1) Hashar: Fix dependencies for tox 'cover' env [debs/pybal] - https://gerrit.wikimedia.org/r/173914
[20:49:07] (CR) Hashar: Move tests to pybal.test; use Twisted's test runner (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/173086 (owner: Ori.livneh)
[20:49:41] Krinkle: RoanKattouw: we have another deploy slot in an hour, though
[20:50:36] AndyRussG: Added comments inline.
[20:56:55] I hear second-hand reports that fr.wikisource is getting an unusual rate of 500s; is there a place to look for per-project 500s? Such a small project would be drowned in the noise and barely register as a statistical blip in the global stats.
[20:58:17] Coren: I wonder if it is being caused by the spike in db traffic caused by a particular djvu file in commons that's used by frwiki
[20:58:27] bblack (or akosiaris?) was investigating...
[20:58:41] Coren: have you tried https://logstash.wikimedia.org ? :d
[20:59:05] Hm, two issues with the same small project? That smells like a not-coincidence.
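An editor's aside: the caching problem Krinkle flags above (a module whose script embeds live banner data, but whose URL carries no data-derived version) can be sketched outside of PHP. The following toy model in Python is not the actual ResourceLoader API, and all names in it are hypothetical; it only illustrates the mechanism that getModifiedTimestamp() provides, namely deriving the module version from the data itself so that a data change produces a new URL and busts long-lived caches:

```python
import hashlib
import json

class DataModule:
    """Toy model of a data-driven ResourceLoader-style module."""

    def __init__(self, fetch_data):
        self._fetch_data = fetch_data  # callable returning the live banner data

    def get_script(self):
        # Embed the data as JSON in the generated script, as the CentralNotice
        # module does via getScript().
        return "mw.config.set('choices', %s);" % json.dumps(self._fetch_data())

    def get_version(self):
        # Stand-in for getModifiedTimestamp(): derive the version from the
        # current data, so the version changes exactly when the data changes.
        blob = json.dumps(self._fetch_data(), sort_keys=True).encode()
        return hashlib.sha1(blob).hexdigest()[:8]

    def url(self):
        # Without a ?version= parameter the response only gets short (~5 min)
        # caching headers and can be served stale; with a data-derived version,
        # a change yields a brand-new, safely cacheable URL.
        return "/load.php?modules=choices&version=" + self.get_version()
```

The point of the sketch: mutating the data and asking for the URL again gives a different `version=`, while unchanged data keeps the URL stable, which is the cache behaviour the discussion above is after.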
[21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T2100). Please do the needful.
[21:00:33] Coren: there is a bunch of DBQueryError :(
[21:00:38] hashar: I know of no way to filter per project.
[21:00:50] add a filter
[21:00:51] must
[21:00:57] query: wikidb:frwikisource
[21:01:03] Aha.
[21:01:10] I don't know how to share the url :(
[21:01:26] Coren: https://logstash.wikimedia.org/#dashboard/temp/djpwO4KBSuafgeTWWTg46Q
[21:01:29] hashar: No, that works - it's the 'wikidb:' bit I did not know
[21:01:30] with run jobs excluded
[21:01:34] Coren: see also thread '[Ops] appserver<->mysql traffic spikes past two days '
[21:02:06] (CR) Rush: [C: 2 V: 2] Reapply Bugzilla XML-RPC API workaround for a short while again [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173516 (owner: Aklapper)
[21:02:37] there is a Lock wait timeout exceeded; try restarting transaction (10.64.16.27) on the user table :/
[21:03:48] Coren: when you click on a row, there are more details shown, each field can be used as a new filter with a single click
[21:05:01] jouncebot: on it!
[21:05:25] RECOVERY - puppet last run on analytics1014 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[21:06:57] (CR) Dzahn: [C: -1] "what Ori said, you need to include passwords::mysql::phabricator" [puppet] - https://gerrit.wikimedia.org/r/173483 (owner: 20after4)
[21:10:57] (CR) Dzahn: "what Ori said, you need to include passwords::mysql::phabricator" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173483 (owner: 20after4)
[21:14:03] (CR) Dzahn: [C: 2] gerrit templates: fix jenkins/lint warnings [puppet] - https://gerrit.wikimedia.org/r/173475 (owner: Dzahn)
[21:31:27] (PS10) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[21:35:41] (PS11) Ori.livneh: Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418
[21:46:56] (PS1) Awight: Enable new CentralNotice features on beta.wmflabs [mediawiki-config] - https://gerrit.wikimedia.org/r/173973
[21:47:17] (CR) Awight: [C: 2] Enable new CentralNotice features on beta.wmflabs [mediawiki-config] - https://gerrit.wikimedia.org/r/173973 (owner: Awight)
[21:47:27] (Merged) jenkins-bot: Enable new CentralNotice features on beta.wmflabs [mediawiki-config] - https://gerrit.wikimedia.org/r/173973 (owner: Awight)
[21:55:23] (CR) Hashar: "puppet will bails out because it doesn't know about jenkins-deploy :-/ Might want to reduce the tmpfs from 512 to 128." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle)
[21:57:20] (CR) Ottomata: [C: 2 V: 2] Allow multiple instances [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/172418 (owner: Ori.livneh)
[21:57:36] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
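An editor's aside: the state that trips this alert (a working copy where `git status` looks clean, yet HEAD is behind origin because a change was merged upstream but never pulled into the deployment checkout) is easy to reproduce. A minimal sketch in Python, assuming git (recent enough for `init -b`) is on PATH; paths and the commit message are illustrative:

```python
import os
import subprocess
import tempfile

def git(cwd, *args):
    """Run a git command with throwaway identity config; return its stdout."""
    return subprocess.run(
        ["git", "-c", "user.email=ops@example.org", "-c", "user.name=ops", *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    ).stdout

top = tempfile.mkdtemp()
origin = os.path.join(top, "origin")
staging = os.path.join(top, "staging")

os.mkdir(origin)
git(origin, "init", "-b", "master")  # pin the branch name (git >= 2.28)
git(origin, "commit", "--allow-empty", "-m", "base")
git(top, "clone", "origin", "staging")

# A change is merged upstream but never pulled into the deploy checkout:
git(origin, "commit", "--allow-empty", "-m",
    "Enable new CentralNotice features on beta.wmflabs")
git(staging, "fetch", "origin")

# The working tree reports no local changes...
clean = git(staging, "status", "--porcelain")
# ...yet the range the check inspects still lists one pending commit:
pending = git(staging, "log", "--pretty=oneline", "HEAD..origin/master").splitlines()
```

Here `clean` is empty while `pending` holds one line, which is exactly the combination awight runs into below: `git status` has nothing to say about commits that exist only on the remote-tracking branch.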
[21:59:12] Hmm ^^ I don't see any changes, using "git status"
[21:59:59] awight:
[22:00:01] [tin:/srv/mediawiki-staging/wmf-config] $ git log --pretty=oneline HEAD..origin/master
[22:00:01] 443dd91a4674a776b00fcb55a5ef8a1542b0f51b Enable new CentralNotice features on beta.wmflabs
[22:00:04] awight, AndyRussG, ejegg: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141117T2200). Please do the needful.
[22:00:11] <_joe_> awight: submodules?
[22:01:37] _joe_: maybe, git submodule status reports: 197eb243543e88827401ee47f69cf98bdbfd0cf9 docroot/bits/WikipediaMobileFirefoxOS (heads/master-14-g197eb24)
[22:01:46] leftovers on tin would be my fault, let me see
[22:01:57] ejegg: I didn't see any fwiw
[22:02:01] oh, not sure where that would have come from
[22:02:11] ori: ok thanks
[22:02:28] ejegg: ori explained -- it was my betalabs commit
[22:03:05] oh, ok.
[22:03:34] !log awight Synchronized wmf-config: Enable new CentralNotice features on beta.wmflabs (duration: 00m 07s)
[22:03:37] Logged the message, Master
[22:03:55] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[22:06:57] (CR) Dzahn: [C: 2] "since https://gerrit.wikimedia.org/r/#/c/173460/ has been merged this should be just fine. there is puppet code like "source => "puppet://"" [puppet] - https://gerrit.wikimedia.org/r/173459 (owner: Dzahn)
[22:14:47] greg-g, apergos, et al: just fyi, i'm still working on the parsoid deploy. we ran into some issues getting the deploy commit prepared.
[22:15:01] but i kicked jenkins' butt and now it looks like i'm just about ready to press go on the deploy
[22:15:23] I admit I'm not really here at this point (being midnight here)
[22:16:06] cscott: we're not deploying yet, feel free
[22:16:11] cscott: please ping when you're done!
[22:16:14] thanks.
[22:16:18] will do.
[22:16:20] k
[22:19:22] (CR) Andrew Bogott: "Are you sure it's a problem? I would expect that user => just takes a string and applies it blindly; I don't think it creates a dependenc" [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle)
[22:20:34] (CR) Giuseppe Lavagetto: [C: 1] "LGTM, but - shouldn't we extend this to allo monitors anyway for consistency?" [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:24:28] (CR) Dzahn: [C: 2] "harmless alignments that fix noise in jenkins/compiler output" [puppet] - https://gerrit.wikimedia.org/r/173478 (owner: Dzahn)
[22:25:43] (CR) Dzahn: "bd808, this is good, right?" [puppet] - https://gerrit.wikimedia.org/r/173469 (owner: Dzahn)
[22:26:11] (CR) BryanDavis: [C: 1] logstash beta: remove pmtpa, do TODO [puppet] - https://gerrit.wikimedia.org/r/173469 (owner: Dzahn)
[22:26:34] (CR) Dzahn: [C: 2] logstash beta: remove pmtpa, do TODO [puppet] - https://gerrit.wikimedia.org/r/173469 (owner: Dzahn)
[22:29:12] springle: I was wondering... I think a new trigger caused a deadlock this morning, and I was hoping you could CR for that issue before we have it happen again, by chance. https://gerrit.wikimedia.org/r/#/c/173768/
[22:29:34] (PS1) Dzahn: delete class ldap::client::autofs [puppet] - https://gerrit.wikimedia.org/r/173991
[22:31:47] (CR) Ori.livneh: "@Giuseppe: yes, we should apply it to all monitors, but I'll make the change to each one as I write tests for them." [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:36:11] (CR) Ori.livneh: [C: 2] make monitor constructor accept a custom reactor object [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:36:29] (Merged) jenkins-bot: make monitor constructor accept a custom reactor object [debs/pybal] - https://gerrit.wikimedia.org/r/173673 (owner: Ori.livneh)
[22:36:36] we have this "facilities.pp", it has PDU monitoring and a tiny unused class for cameras. first i was about to make a facilities module, but then.. move PDU stuff to a monitoring module instead? (and the camera thing can go or be its own module)
[22:40:55] (PS1) Dzahn: delete class facilities::dc-cam-transcoder [puppet] - https://gerrit.wikimedia.org/r/173996
[22:48:13] Can a hiera db refer to a fact or a node variable? I have a var that should be $::ipaddress_eth0 on some nodes but hard-coded on others.
[22:50:19] !log updated Parsoid to version 819b2cf4
[22:50:24] Logged the message, Master
[22:52:51] awight: ok, done. thanks for being patient!
[22:53:07] cscott: awesome!
[22:53:56] manybubbles: the parsoid job throughput seems to have dropped quite a bit since the CirrusSearch job change deployment: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1416264366&g=cpu_report&z=large
[22:53:59] (PS1) Rush: bugzilla handles characters that are invalid for api [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173998
[22:54:16] gwicke: we're certainly going to be using some cpu
[22:54:24] are we crushing you somehow?
[22:54:49] I'm not sure, the parsoid jobs don't really use cpu on the job runners at all
[22:55:14] they just do a handful of http requests per job
[22:55:56] there were some similar job throughput jitters last week, I didn't really look into it then
[22:59:11] (PS2) Dzahn: delete class facilities::dc-cam-transcoder [puppet] - https://gerrit.wikimedia.org/r/173996
[22:59:13] (PS1) Dzahn: kill facilities.pp, move to nagios_common [puppet] - https://gerrit.wikimedia.org/r/173999
[23:00:17] manybubbles: still browsing ganglia for load on the job runners
[23:01:20] gwicke: we're actually eating into the backlog of jobs pretty slowly too.
[23:01:32] https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[23:02:58] YuviPanda: ^ killing facilities.pp up there, moving stuff to nagios_common
[23:03:09] one less place that has monitoring stuff
[23:09:07] (CR) Krinkle: "See inline comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle)
[23:16:13] anybody taken a look at this monitoring processor from stackexchange based on opentsdb? http://bosun.org/
[23:21:26] (PS1) Dzahn: remove gluster from ganglia config [puppet] - https://gerrit.wikimedia.org/r/174002
[23:21:30] (CR) Aklapper: "Makes sense as covered in https://phabricator.wikimedia.org/T815#20784" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173998 (owner: Rush)
[23:21:46] (CR) Tim Starling: "The problem occurs when you press the down arrow, in order to select a history entry more recent than the one you are currently viewing." [puppet] - https://gerrit.wikimedia.org/r/173752 (owner: Tim Starling)
[23:22:00] (CR) Dzahn: [C: 2] remove gluster from ganglia config [puppet] - https://gerrit.wikimedia.org/r/174002 (owner: Dzahn)
[23:22:33] (Abandoned) Dzahn: remove glusterfs and pmtpa remnants [puppet] - https://gerrit.wikimedia.org/r/173349 (owner: Dzahn)
[23:25:50] (CR) Dzahn: [C: -2] "i'd do these instead:" [puppet] - https://gerrit.wikimedia.org/r/171493 (owner: Dzahn)
[23:26:02] (Abandoned) Dzahn: (WIP) facilities: move to module [puppet] - https://gerrit.wikimedia.org/r/171493 (owner: Dzahn)
[23:27:28] (CR) Dzahn: [C: 1] "yea, per " \x0e might need to be added"" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/173998 (owner: Rush)
[23:28:53] (CR) Ori.livneh: "Were you only seeing this on osmium? Its gdbinit had an older version of this code. If it was only on osmium, see if you can still reprodu" [puppet] - https://gerrit.wikimedia.org/r/173752 (owner: Tim Starling)
[23:30:09] (Abandoned) Ori.livneh: Remove coloured gdb prompt [puppet] - https://gerrit.wikimedia.org/r/173752 (owner: Tim Starling)
[23:35:49] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s)
[23:35:54] Logged the message, Master
[23:40:19] nevermind! i just read the config and found the interface-range part, now i see what i need to do :)
[23:40:22] oop
[23:42:13] (PS1) Awight: Enable client banner choice on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/174007
[23:45:27] greg-g: Is it possible to deploy to just the Group1 wikis?
[23:47:14] anyone ^^
[23:47:43] not exactly. what do you want to do?
[23:47:53] ori: I see, yeah I'm reading https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Switch_group1_wikis_to_VERSION
[23:47:55] awight: this https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Switch_group1_wikis_to_VERSION?
[23:47:58] hehe
[23:48:09] You owe me a chocolate bar!
[23:48:28] ori: The idea was to push a sort of limited CentralNotice deploy, but it looks like that involves switching versions, which is very much not what I'm going to do.
[23:48:43] use a feature flag
[23:48:46] (PS2) Andrew Bogott: delete class ldap::client::autofs [puppet] - https://gerrit.wikimedia.org/r/173991 (owner: Dzahn)
[23:48:48] oooh
[23:49:28] ori: yes we have that...
[23:49:52] in InitialiseSettings.php, 'wgCentralNoticeMyFoo' => array( 'default' => false, 'testwiki' => true, /* etc. */ )
[23:50:09] then in CommonSettings.php, if ( $wgCentralNoticeMyFoo ) { /* enable something */ }
[23:51:24] ori: we've set that up already, but even disabled, this code has some side-effects, cos it includes a ResourceLoader module.
[23:51:37] So we were trying to figure out how to deploy to just aa-wikibooks, for example
[23:51:46] but without all of the other 1.25wmf7-based wikis
[23:51:49] I see that's not possible.
[23:51:56] ...
[23:58:40] (PS2) Awight: Enable client banner choice on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/174007
[23:58:55] (CR) Awight: [C: 2 V: 2] Enable client banner choice on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/174007 (owner: Awight)
[23:59:26] !log awight Synchronized wmf-config: Enable new CentralNotice features on mediawikiwiki (duration: 00m 04s)
[23:59:29] Logged the message, Master
[23:59:55] AndyRussG: ejegg: K4-713: ok we should be deployed to mediawiki
[23:59:55] (PS1) Dzahn: gerrit role: add ssh::server listening on other IP [puppet] - https://gerrit.wikimedia.org/r/174015
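An editor's aside: the per-wiki feature-flag lookup ori describes above can be modeled in a few lines of Python. This is a simplification (it ignores dblist tags and the rest of the real wgConf lookup order); the flag name comes from ori's hypothetical example, and the `aawikibooks` entry is added here following the aa-wikibooks discussion, not taken from any real config:

```python
def get_setting(per_wiki, wiki):
    """Resolve a per-wiki flag: an exact wiki entry wins, else 'default'.
    Mirrors the shape of an InitialiseSettings.php-style entry."""
    return per_wiki.get(wiki, per_wiki['default'])

# ori's example, transcribed: off everywhere except the test wikis.
wgCentralNoticeMyFoo = {
    'default': False,
    'testwiki': True,
    'aawikibooks': True,  # hypothetical addition for the CN test wiki
}

if get_setting(wgCentralNoticeMyFoo, 'testwiki'):
    pass  # enable something, as in the CommonSettings.php branch above
```

The "not possible" part of the exchange is then just this: the flag can gate behavior per wiki, but code that runs unconditionally at load time (like registering a ResourceLoader module) ships to every wiki on that MediaWiki version regardless of the flag's value.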