[00:01:52] New patchset: Ori.livneh; "Stop gerrit's commit link detection from mangling MediaWiki Urls" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64502 [00:02:17] ops: could we merge this? it drives users insane ^ [00:02:49] New patchset: Jdlrobson; "Enable Nearby on Commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70352 [00:03:23] * ori-l pounces on andrewbogott [00:03:44] add me as a reviewer? If I'm not already? [00:04:07] andrewbogott: done. there's a +1 from ^demon which my rebase hid. [00:14:25] New review: Andrew Bogott; "I'm taking your word for this since regexps make me angry" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/64502 [00:14:25] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64502 [00:29:25] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70121 [01:01:40] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.00673019886 secs [01:08:53] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 195 seconds [01:09:53] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:15:03] PROBLEM - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [01:20:13] PROBLEM - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [01:20:14] New patchset: Ori.livneh; "wmgMFLogEvents => true on beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70366 [01:20:44] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [01:20:53] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 198 seconds [01:20:53] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70366 [01:22:51] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Iac6a3f2f4d: wmgMFLogEvents => true on beta labs' [01:23:01] Logged the message, Master [01:24:53] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:25:03] PROBLEM - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [01:29:04] ACKNOWLEDGEMENT - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) Matt Walker Ya ya ya, I broke it -- yay replication fail. Emailing Jeff for him to deal with in the morning. [01:31:40] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.005141377449 secs [01:32:40] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.005178689957 secs [01:48:43] !log krinkle synchronized php-1.22wmf8/extensions/VisualEditor/ '{{gerrit|If96bed727ba7af7}}' [01:48:53] Logged the message, Master [01:50:17] James_F: ^ [01:50:24] !log krinkle synchronized php-1.22wmf7/extensions/VisualEditor/ '{{gerrit|If96bed727ba7af7}}' [01:50:28] Krinkle: I noticed. :-) [01:50:32] Logged the message, Master [02:07:29] !log LocalisationUpdate completed (1.22wmf8) at Tue Jun 25 02:07:28 UTC 2013 [02:07:37] Logged the message, Master [02:13:22] !log LocalisationUpdate completed (1.22wmf7) at Tue Jun 25 02:13:22 UTC 2013 [02:13:35] Logged the message, Master [02:26:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 25 02:26:25 UTC 2013 [02:26:33] Logged the message, Master [02:47:51] New review: Hazard-SJ; "I didn't manually add those changes, they might have been pulled in." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65860 [03:06:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [03:06:57] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:57] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:57] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:57] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:57] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:57] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [03:06:58] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [03:06:59] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:59] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:06:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [04:00:13] RECOVERY - check_mysql on db1025 is OK: Uptime: 2813802 Threads: 1 Questions: 55042899 Slow queries: 50832 Opens: 51343 Flush tables: 2 Open tables: 64 Queries per second avg: 19.561 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [04:17:38] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [04:32:45] New patchset: Spage; "Rewrite gerrit gitweb URLs to new gitblit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70370 [04:56:52] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:03:50] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [05:09:28] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:26:32] New review: Aklapper; "Daniel: So let's set up the alias and try?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562 [05:59:19] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [06:00:19] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [06:24:14] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:38:18] RECOVERY - search indices - check lucene status page on search1009 is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.004 second response time [07:08:41] apergos: morning [07:08:45] morning [07:09:07] yay on the obsolete [07:09:18] moved a bunch of solaris crapola in there [07:09:22] felt reaaal good [07:09:36] they still show up in search though [07:09:42] I'm searching "zfs" [07:10:45] want to do Media server & friends too? [07:11:14] the redirs will show up but only as redirs that happen to have that word in the title [07:11:39] any that don't won't show up and the obsolete pages themselves won't either [07:11:52] yeah I'm still working my way through things [07:11:57] there's a lot of cruft in here [07:13:40] yeah, a lot of cruft indeed [07:14:23] thanks for that :) [07:15:46] yw [07:28:06] New patchset: Nikerabbit; "ULS deployment phase 3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69643 [07:39:22] New patchset: Tpt; "Clean the aliases of Proofread Page managed namespaces" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70377 [07:50:45] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:51:09] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:54:02] I wonder if that's a real outage... [07:56:21] doubtful, looks from ganglia like search in pmtpa is still the active cluster [07:58:31] hmm [08:01:49] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.006300568581 secs [08:08:55] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69643 [08:12:03] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'ULS phase3' [08:12:10] Logged the message, Master [08:14:14] New review: Ori.livneh; "Appears to have worked! See this patch's commit message, as well as:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64502 [08:25:18] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [08:26:19] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [08:27:18] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [08:29:18] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [08:32:59] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.007350087166 secs [09:16:53] New patchset: Mark Bergsma; "Prepare Parsoid cache manifests for new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70387 [09:16:53] New patchset: Mark Bergsma; "Setup cp1045 and cp1058 as Parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70388 [09:19:20] New patchset: Mark Bergsma; "Prepare Parsoid cache manifests for new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70387 [09:19:21] New patchset: Mark Bergsma; "Setup cp1045 and cp1058 as Parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70388 [09:20:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70387 [09:20:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70388 [09:39:21] PROBLEM - Disk space on cp1048 is CRITICAL: Connection refused by host [09:39:41] PROBLEM - RAID on cp1048 is CRITICAL: Connection refused by host [09:39:52] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:11] PROBLEM - DPKG on cp1048 is CRITICAL: Connection refused by host [09:40:52] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [09:41:43] mark: morning have you scheduled the varnish media storage renaming from https://gerrit.wikimedia.org/r/#/c/68153/ ? [09:41:56] you mentioned it needs a restart of all backends [09:42:23] hashar: i'll just do that on the new servers only [09:43:13] using a configuration parameter i guess [09:43:24] not even [09:43:26] just with a hack [09:44:12] * YuviPanda wonders if anyone can help merge https://gerrit.wikimedia.org/r/#/c/70064/4 [09:44:15] I cced myself to the change, I need it to adapt the role::cache::upload class on beta [09:44:28] i probably won't merge this one [09:44:54] i can give you a headsup before I merge the relevant change [09:45:54] well you originally wanted to have the media storage named the same in beta and production :D [09:46:16] to get rid of the ugly vdb in the manifests [09:47:44] yes [09:49:55] mark: so if you don't merge the change, that is kind of blocking my role::cache::upload change isn't it ? : -] [09:50:56] no I mean I will merge a similar change [09:51:07] well I can just send a new patchset to this one I guess [09:51:31] PROBLEM - NTP on cp1048 is CRITICAL: NTP CRITICAL: No response from NTP server [09:52:44] ok :) [09:52:52] will rebase mine once you are done [09:53:15] I most probably have to tweak it a lot since you refactored a bit of the varnish/cache classes [09:54:35] yeah [09:54:47] i was right when I said it was better to hold off for a few days I think eh ;) [09:58:24] New patchset: Mark Bergsma; "Slightly up the cache paritition size / free space ratio" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70392 [10:00:03] New patchset: Mark Bergsma; "Slightly up the cache paritition size / free space ratio" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70392 [10:00:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70392 [10:03:51] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:03] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [10:11:33] PROBLEM - SSH on cp1045 is CRITICAL: Connection refused [10:11:33] PROBLEM - DPKG on cp1045 is CRITICAL: Connection refused by host [10:11:45] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused [10:11:45] PROBLEM - Disk space on cp1045 is CRITICAL: Connection refused by host [10:11:55] PROBLEM - Varnish HTTP parsoid-frontend on cp1045 is CRITICAL: Connection refused [10:12:03] PROBLEM - RAID on cp1045 is CRITICAL: Connection refused by host [10:15:34] RECOVERY - SSH on cp1045 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [10:20:33] RECOVERY - DPKG on cp1045 is OK: All packages OK [10:20:43] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.002 second response time [10:20:43] RECOVERY - Disk space on cp1045 is OK: DISK OK [10:21:03] RECOVERY - RAID on cp1045 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:23:53] RECOVERY - Varnish HTTP parsoid-frontend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.003 second response time [10:26:23] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused [10:29:53] PROBLEM - NTP on cp1045 is CRITICAL: NTP CRITICAL: Offset unknown [10:34:23] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time [10:39:02] RECOVERY - NTP on cp1045 is OK: NTP OK: Offset -0.01299083233 secs [10:43:13] PROBLEM - Host cp1058 is DOWN: CRITICAL - Plugin timed out after 15 seconds [10:44:11] PROBLEM - Host cp1045 is DOWN: CRITICAL - Plugin timed out after 15 seconds [10:45:01] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [10:45:23] RECOVERY - Host cp1058 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [10:53:13] apergos: I see you're still cleaning up stuff [10:53:22] need any clarification from me? [10:55:13] no, I'm about ready to call that done for the day and move on to other stuff actually [10:55:36] !log Pooled cp1045 and cp1058 Varnish frontends in the parsoid cache pool [10:55:44] Logged the message, Master [10:57:08] okay [11:04:09] New patchset: Mark Bergsma; "Replace the parsoidcache backends with cp1045/cp1058" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70405 [11:04:43] so [11:04:52] I have a ceph nagios health check [11:05:25] that calls ceph --id nagios --keyring /var/lib/somewhere/ceph-nagios.key health [11:05:41] New patchset: Mark Bergsma; "Replace the parsoidcache backends with cp1045/cp1058" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70405 [11:05:54] plus a ceph::key { } to create that key, owned by root:nagios 0440 [11:06:11] where does that belong now? :) [11:06:20] modules/ceph/manifests/nagios.pp? :) [11:06:32] yes [11:06:50] where else? [11:07:01] all the other nagios checks are under icinga.pp [11:07:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70405 [11:07:10] I also have to configure nrpe for the extra command [11:07:17] and the icinga manifests provide no definitions for that [11:07:24] there is one somewhere [11:07:30] nrpe::monitor_command or something [11:07:52] nrpe::monitor_service [11:07:57] right [11:16:46] !log jenkins: migrating extensions jobs to use gallium slave ( lint , phpcs-HEAD ) or be bound on master (test extensions) [11:16:55] Logged the message, Master [11:19:21] argh, nrpe runs as user icinga now [11:19:25] why oh why [11:44:16] !log Depooled old parsoid cache servers - parsoidcache migration complete [11:44:25] Logged the message, Master [11:48:25] New patchset: Faidon; "nrpe: revert to the nagios user/paths, clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407 [11:48:30] akosiaris: ^^ [11:48:51] untested [11:51:07] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out [11:51:44] New patchset: Mark Bergsma; "Prepare the mobile cache manifests for the new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70409 [11:51:44] New patchset: Mark Bergsma; "Install cp104[67] and cp1059/cp1060 as mobile caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70410 [11:51:47] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [11:52:34] apergos: any idea how we can get rid of the redirects? [11:52:55] I'd left them in until we find out which articles actually have links to the articles [11:53:01] that will take a little work [11:53:35] if you like I can lookk into that tomorrow morning [11:55:09] New patchset: Mark Bergsma; "Add default selector value" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70411 [11:56:45] apergos: harrse Reedy for a bit of AWB magic [11:56:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70409 [11:57:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70410 [11:57:28] is he an awb user? (I don't, I know it supposedly runs ok wunder wine but meh) [11:57:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70411 [11:58:26] peachey|laptop__: [11:58:45] apergos: he wrote it iirc [11:59:22] really? [11:59:56] huh, I know one of the other authors [12:00:47] well I guess they are both 'developers', not the original author [12:01:37] New patchset: Faidon; "nrpe: revert to the nagios user/paths, clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407 [12:02:56] paravoid: will it get a nagios user still ? [12:03:59] ah nagios-npre package creates it :) [12:04:39] yep [12:04:46] basically nagios-nrpe-server does everything [12:04:52] I'm not sure why we've reimplemented the wheel here [12:04:55] it's very puzzling [12:05:14] no idea :D [12:08:32] New patchset: Mark Bergsma; "Add new mobile caches, remove old/nonexistant" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70412 [12:09:43] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70412 [12:11:14] !log mark synchronized wmf-config/squid.php [12:11:23] Logged the message, Master [12:13:02] hm, so puppet cleanup is broken [12:13:14] knsq* still are on puppet's database [12:13:17] and nagios [12:15:01] were they added to the decommissioning list? [12:15:02] I didn't ;) [12:15:06] they were [12:15:10] ok [12:16:31] also affects a number of other hosts [12:16:45] singer, zinc etc. [12:17:12] also, shouldn't decom_servers.sh also puppet cert revoke/clean [12:17:38] no [12:17:46] then those decommissioned servers can't run puppet anymore [12:18:19] it's still best if they run puppet while decommissioned (at least in theory) [12:18:45] root@sockpuppet:~# ls -la /var/lib/git/operations/puppet/manifests/decommissioning.pp [12:18:46] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [12:18:48] -rw-r--r-- 1 root root 1395 Jan 26 2012 /var/lib/git/operations/puppet/manifests/decommissioning.pp [12:18:51] bwahaha [12:19:02] doesn't stafford run it?\ [12:19:24] maybe, let me check [12:19:38] so, if they run puppet [12:19:44] they'll get readded to stored configs [12:19:50] nope [12:19:54] nope? [12:19:56] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [12:20:03] the manifests check whether the host is in the decommissioned list [12:20:10] (at least, in theory ;) [12:20:22] oh sure, there will be some stuff added [12:20:32] but that's no big deal, that can get cleaned up after the host is really down [12:20:43] found it [12:21:48] someone decided to convert double quotes to single quotes [12:22:14] obviously very important [12:22:36] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [12:23:31] de4926d3 [12:23:36] PROBLEM - Host cp1059 is DOWN: PING CRITICAL - Packet loss = 100% [12:23:36] PROBLEM - Host cp1060 is DOWN: PING CRITICAL - Packet loss = 100% [12:23:38] Author: RobH [12:23:38] Date: Fri May 10 11:39:44 2013 -0700 [12:23:38] barium reclaim [12:23:48] that's the entirety of the commit msg [12:23:56] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [12:24:35] 'wikinews-lb.wikimedia.org'? [12:24:35] wtf? [12:24:36] RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [12:24:53] it was there before too [12:24:56] RECOVERY - Host cp1060 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [12:25:59] New patchset: J; "Increase transcode timeout for 720p uploads > 1h" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70413 [12:26:39] !log jenkins: all mediawiki extension jobs got updated. [12:26:48] Logged the message, Master [12:27:23] New patchset: Faidon; "Fix decomissioning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70414 [12:27:38] so mediawiki is configured with allowed XFF ips for our proxies [12:27:41] but those are currently all ipv4 [12:27:53] what if a varnish, which supports ipv6, adds its IPv6 ip to the XFF list on an ipv6 request? [12:28:41] erm? [12:28:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:23] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70414 [12:29:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.526 second response time [12:30:55] so, when do we actually revoke certs? [12:30:57] never? :) [12:31:01] never [12:32:43] can't we say that whenever we add something to decom.pp we should also poweroff the machine and revoke the cert? [12:33:05] you could say that [12:33:09] good luck [12:33:38] I'm wondering/asking if it will work in practice [12:34:07] i doubt it [12:34:09] exec { 'poweroff': path => '/sbin', command => 'poweroff' } [12:34:10] :P [12:36:28] !log Pooled the new eqiad mobile frontend caches with lowest weight in PyBal [12:36:36] Logged the message, Master [12:40:06] New patchset: Mark Bergsma; "Pass the storage parameter to mobile frontends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70415 [12:41:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70415 [12:45:50] so much time wasted waiting for puppet to run [13:00:59] New patchset: Mark Bergsma; "Temporarily split the mobile cluster in old and new" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70416 [13:03:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70416 [13:07:48] why does wiktionary not normalize file names i.e. http://en.wiktionary.org/wiki/File:attraction.ogg is not redirected to http://en.wiktionary.org/wiki/File:Attraction.ogg while this happens i.e. for en.wikipedia.org/wiki/File:attraction.ogg [13:07:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [13:07:56] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:56] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:56] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:56] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:56] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:57] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [13:07:57] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [13:07:58] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [13:12:52] j^, because words that begin with a capital may be different than those which start with lower case [13:13:24] not just within the same language but across languages too [13:17:41] New review: coren; "I'm no all that familiar with Reddis either, but that patch safely defaults to a noop if there are n..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/70064 [13:19:06] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70064 [13:20:52] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70212 [13:22:15] New patchset: Faidon; "nrpe: revert to the nagios user/paths, clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407 [13:23:22] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407 [13:23:43] puppet is slooooowwww atm [13:24:00] let's see if I broke everything [13:24:07] New review: coren; "Simple enough" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70103 [13:24:08] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70103 [13:26:26] oh god [13:26:37] - require => [ Package[nagios-nrpe-server], File["/etc/icinga/nrpe_local.cfg"], File["/usr/lib/nagios/plugins/check_dpkg"] ], [13:26:40] - subscribe => File["/etc/icinga/nrpe_local.cfg"], [13:26:43] oh dear god [13:27:07] that was there before, but I left it [13:30:05] New patchset: Helder.wiki; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982 [13:30:26] -allowed_hosts=127.0.0.1,208.80.154.14,208.80.152.161 [13:30:27] - [13:30:27] +allowed_hosts=127.0.0.1 [13:30:27] + [13:30:30] was that intended? :) [13:30:39] New patchset: Helder.wiki; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982 [13:30:59] yes [13:31:16] we have allowed_hosts in nrpe_local.cfg [13:31:20] ok [13:31:29] that is a template expansion too [13:31:39] I basically reverted nrpe.cfg to the version shipped by the package [13:31:44] good [13:31:51] thanks for cleaning up [13:31:54] i hated that [13:32:03] as the default one includes both nrpe_local.cfg and /etc/nagios/nrpe.d/ [13:32:10] I'm not done yet [13:32:14] it's all very hairy to clean up, unfortunately [13:32:39] so now for example, the pid file path has changed and the init script doesn't think nrpe is running [13:32:55] migrate to upstart while you're at it [13:33:11] I think I might wait for all that to be applied everywhere and just salt killall nrpe [13:33:23] hehe [13:33:29] no, the plan is to remove the init file from puppet completely [13:33:33] the one shipped by the package works just fine [13:33:41] ok [13:33:44] we had it copied to basically do s/nagios/icinga/ [13:33:54] whyyy [13:34:09] for both /etc/nagios -> /etc/icinga and for the user as well [13:34:11] I don't know! [13:34:23] yeah that broke base installs for months [13:34:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:44] so, I copied the package's init script to puppet as for puppet to revert everything to what the package ships [13:34:55] and then I'll just remove nrpe.cfg and the init script from puppet [13:35:19] which of course will break offline hosts when I do, which is why I was looking at decom :) [13:36:25] sounds good [13:36:48] and I'll remove /home/icinga too [13:36:49] :) [13:36:58] it kills me every time I see that [13:38:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.927 second response time [13:41:19] New patchset: Faidon; "nrpe: don't require/subscribe on the same resource" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70418 [13:41:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:27] where's Tim to debug why puppet is having those load spikes again [13:42:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70418 [13:42:23] because you unbroke it probably ;) [13:42:41] anyway, require/subscribe is not bad [13:42:59] there was no point in this case [13:44:48] i see vhtcpd using as much cpu as varnish! [13:45:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.874 second response time [13:45:19] and not 0% I presume? [13:45:22] time to give varnish a bit more load ;-) [13:47:15] this nrpe thing all needs a rewrite [13:51:45] ok, I'll let puppet run and I'll cleanup after this interview [13:51:48] that should be enough time :) [13:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:36] apergos: for shared repo files thats of cause not working, wondering if File namespace should also always be capitalized [13:54:06] I see your point but since it's case sensitive for everything else, users there will know to capitalize for the File: names [13:54:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:54:49] apergos: i.e. http://mg.wiktionary.org/wiki/him mg is using some template that does not capitalize [13:57:51] I can't actually tell where on that page but I'll take your word for it [13:58:12] it seems like that's something that should be worked out with the local users, about wha the template should do, if they have local uploads, etc [14:00:21] no local uploads [14:00:33] its just looking if word.ogg exists and embeds it as an audio sample [14:00:47] the linked page embeds a video called Him.ogg [14:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:29] so its a) wrong b) shared repo him.ogg->Him.ogg (will fix TMH to handle that case if thats what should happen) [14:01:51] as long as there's no local uploads, I don'tsee why they wouldn't have the template automatically ucase the name of the media file it's looking for [14:02:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [14:02:41] New patchset: Mark Bergsma; "Update ganglia aggregators for parsoid/mobile caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70420 [14:03:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70420 [14:16:47] New patchset: Yuvipanda; "Minor documentation fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70422 [14:17:36] New review: coren; "Tpyo fxi." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70422 [14:17:36] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70422 [14:18:22] New patchset: Alex Monk; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982 [14:18:25] mark: Okay to merge 70420 [14:18:29] mark: ? [14:19:09] New patchset: Alex Monk; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982 [14:19:34] checking [14:19:57] oh, sorry [14:19:57] yes [14:20:42] oo, heya mark [14:20:57] does this mean we should start capturing traffic from these hosts? [14:20:58] https://gerrit.wikimedia.org/r/#/c/70410/ [14:21:39] ottomata: why would't you be already? [14:22:45] just heard about it becauase drdee saw the commit and asked me [14:22:53] we capture based on source hostnames right now [14:23:12] and so far we only have captured traffic from the mobile varnishes [14:24:05]