[00:01:52] New patchset: Ori.livneh; "Stop gerrit's commit link detection from mangling MediaWiki Urls" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64502
[00:02:17] ops: could we merge this? it drives users insane ^
[00:02:49] New patchset: Jdlrobson; "Enable Nearby on Commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70352
[00:03:23] * ori-l pounces on andrewbogott
[00:03:44] add me as a reviewer? If I'm not already?
[00:04:07] andrewbogott: done. there's a +1 from ^demon which my rebase hid.
[00:14:25] New review: Andrew Bogott; "I'm taking your word for this since regexps make me angry" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/64502
[00:14:25] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64502
[00:29:25] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70121
[01:01:40] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.00673019886 secs
[01:08:53] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 195 seconds
[01:09:53] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[01:15:03] PROBLEM - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:20:13] PROBLEM - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:20:14] New patchset: Ori.livneh; "wmgMFLogEvents => true on beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70366
[01:20:44] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[01:20:53] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 198 seconds
[01:20:53] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70366
[01:22:51] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Iac6a3f2f4d: wmgMFLogEvents => true on beta labs'
[01:23:01] Logged the message, Master
[01:24:53] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[01:25:03] PROBLEM - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:29:04] ACKNOWLEDGEMENT - check_mysql on db1025 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) Matt Walker Ya ya ya, I broke it -- yay replication fail. Emailing Jeff for him to deal with in the morning.
[01:31:40] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.005141377449 secs
[01:32:40] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.005178689957 secs
[01:48:43] !log krinkle synchronized php-1.22wmf8/extensions/VisualEditor/ '{{gerrit|If96bed727ba7af7}}'
[01:48:53] Logged the message, Master
[01:50:17] James_F: ^
[01:50:24] !log krinkle synchronized php-1.22wmf7/extensions/VisualEditor/ '{{gerrit|If96bed727ba7af7}}'
[01:50:28] Krinkle: I noticed. :-)
[01:50:32] Logged the message, Master
[02:07:29] !log LocalisationUpdate completed (1.22wmf8) at Tue Jun 25 02:07:28 UTC 2013
[02:07:37] Logged the message, Master
[02:13:22] !log LocalisationUpdate completed (1.22wmf7) at Tue Jun 25 02:13:22 UTC 2013
[02:13:35] Logged the message, Master
[02:26:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 25 02:26:25 UTC 2013
[02:26:33] Logged the message, Master
[02:47:51] New review: Hazard-SJ; "I didn't manually add those changes, they might have been pulled in." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65860
[03:06:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:57] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:57] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:57] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:57] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:57] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:57] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:58] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:59] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:59] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[03:06:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:00:13] RECOVERY - check_mysql on db1025 is OK: Uptime: 2813802 Threads: 1 Questions: 55042899 Slow queries: 50832 Opens: 51343 Flush tables: 2 Open tables: 64 Queries per second avg: 19.561 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[04:17:38] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584
[04:32:45] New patchset: Spage; "Rewrite gerrit gitweb URLs to new gitblit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70370
[04:56:52] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:03:50] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[05:09:28] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:26:32] New review: Aklapper; "Daniel: So let's set up the alias and try?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562
[05:59:19] PROBLEM - SSH on gadolinium is CRITICAL: Server answer:
[06:00:19] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[06:24:14] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours
[06:38:18] RECOVERY - search indices - check lucene status page on search1009 is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.004 second response time
[07:08:41] apergos: morning
[07:08:45] morning
[07:09:07] yay on the obsolete
[07:09:18] moved a bunch of solaris crapola in there
[07:09:22] felt reaaal good
[07:09:36] they still show up in search though
[07:09:42] I'm searching "zfs"
[07:10:45] want to do Media server & friends too?
[07:11:14] the redirs will show up but only as redirs that happen to have that word in the title
[07:11:39] any that don't won't show up and the obsolete pages themselves won't either
[07:11:52] yeah I'm still working my way through things
[07:11:57] there's a lot of cruft in here
[07:13:40] yeah, a lot of cruft indeed
[07:14:23] thanks for that :)
[07:15:46] yw
[07:28:06] New patchset: Nikerabbit; "ULS deployment phase 3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69643
[07:39:22] New patchset: Tpt; "Clean the aliases of Proofread Page managed namespaces" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70377
[07:50:45] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:51:09] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:54:02] I wonder if that's a real outage...
[07:56:21] doubtful, looks from ganglia like search in pmtpa is still the active cluster
[07:58:31] hmm
[08:01:49] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.006300568581 secs
[08:08:55] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69643
[08:12:03] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'ULS phase3'
[08:12:10] Logged the message, Master
[08:14:14] New review: Ori.livneh; "Appears to have worked! See this patch's commit message, as well as:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64502
[08:25:18] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[08:26:19] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[08:27:18] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[08:29:18] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[08:32:59] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.007350087166 secs
[09:16:53] New patchset: Mark Bergsma; "Prepare Parsoid cache manifests for new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70387
[09:16:53] New patchset: Mark Bergsma; "Setup cp1045 and cp1058 as Parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70388
[09:19:20] New patchset: Mark Bergsma; "Prepare Parsoid cache manifests for new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70387
[09:19:21] New patchset: Mark Bergsma; "Setup cp1045 and cp1058 as Parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70388
[09:20:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70387
[09:20:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70388
[09:39:21] PROBLEM - Disk space on cp1048 is CRITICAL: Connection refused by host
[09:39:41] PROBLEM - RAID on cp1048 is CRITICAL: Connection refused by host
[09:39:52] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100%
[09:40:11] PROBLEM - DPKG on cp1048 is CRITICAL: Connection refused by host
[09:40:52] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[09:41:43] mark: morning, have you scheduled the varnish media storage renaming from https://gerrit.wikimedia.org/r/#/c/68153/ ?
[09:41:56] you mentioned it needs a restart of all backends
[09:42:23] hashar: i'll just do that on the new servers only
[09:43:13] using a configuration parameter i guess
[09:43:24] not even
[09:43:26] just with a hack
[09:44:12] * YuviPanda wonders if anyone can help merge https://gerrit.wikimedia.org/r/#/c/70064/4
[09:44:15] I cced myself to the change, I need it to adapt the role::cache::upload class on beta
[09:44:28] i probably won't merge this one
[09:44:54] i can give you a headsup before I merge the relevant change
[09:45:54] well you originally wanted to have the media storage named the same in beta and production :D
[09:46:16] to get rid of the ugly vdb in the manifests
[09:47:44] yes
[09:49:55] mark: so if you don't merge the change, that is kind of blocking my role::cache::upload change, isn't it? :-]
[09:50:56] no I mean I will merge a similar change
[09:51:07] well I can just send a new patchset to this one I guess
[09:51:31] PROBLEM - NTP on cp1048 is CRITICAL: NTP CRITICAL: No response from NTP server
[09:52:44] ok :)
[09:52:52] will rebase mine once you are done
[09:53:15] I most probably have to tweak it a lot since you refactored a bit of the varnish/cache classes
[09:54:35] yeah
[09:54:47] i was right when I said it was better to hold off for a few days I think eh ;)
[09:58:24] New patchset: Mark Bergsma; "Slightly up the cache paritition size / free space ratio" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70392
[10:00:03] New patchset: Mark Bergsma; "Slightly up the cache paritition size / free space ratio" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70392
[10:00:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70392
[10:03:51] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100%
[10:09:03] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[10:11:33] PROBLEM - SSH on cp1045 is CRITICAL: Connection refused
[10:11:33] PROBLEM - DPKG on cp1045 is CRITICAL: Connection refused by host
[10:11:45] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused
[10:11:45] PROBLEM - Disk space on cp1045 is CRITICAL: Connection refused by host
[10:11:55] PROBLEM - Varnish HTTP parsoid-frontend on cp1045 is CRITICAL: Connection refused
[10:12:03] PROBLEM - RAID on cp1045 is CRITICAL: Connection refused by host
[10:15:34] RECOVERY - SSH on cp1045 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[10:20:33] RECOVERY - DPKG on cp1045 is OK: All packages OK
[10:20:43] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.002 second response time
[10:20:43] RECOVERY - Disk space on cp1045 is OK: DISK OK
[10:21:03] RECOVERY - RAID on cp1045 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[10:23:53] RECOVERY - Varnish HTTP parsoid-frontend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.003 second response time
[10:26:23] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused
[10:29:53] PROBLEM - NTP on cp1045 is CRITICAL: NTP CRITICAL: Offset unknown
[10:34:23] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time
[10:39:02] RECOVERY - NTP on cp1045 is OK: NTP OK: Offset -0.01299083233 secs
[10:43:13] PROBLEM - Host cp1058 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[10:44:11] PROBLEM - Host cp1045 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[10:45:01] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[10:45:23] RECOVERY - Host cp1058 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:53:13] apergos: I see you're still cleaning up stuff
[10:53:22] need any clarification from me?
[10:55:13] no, I'm about ready to call that done for the day and move on to other stuff actually
[10:55:36] !log Pooled cp1045 and cp1058 Varnish frontends in the parsoid cache pool
[10:55:44] Logged the message, Master
[10:57:08] okay
[11:04:09] New patchset: Mark Bergsma; "Replace the parsoidcache backends with cp1045/cp1058" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70405
[11:04:43] so
[11:04:52] I have a ceph nagios health check
[11:05:25] that calls ceph --id nagios --keyring /var/lib/somewhere/ceph-nagios.key health
[11:05:41] New patchset: Mark Bergsma; "Replace the parsoidcache backends with cp1045/cp1058" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70405
[11:05:54] plus a ceph::key { } to create that key, owned by root:nagios 0440
[11:06:11] where does that belong now? :)
[11:06:20] modules/ceph/manifests/nagios.pp? :)
[11:06:32] yes
[11:06:50] where else?
[11:07:01] all the other nagios checks are under icinga.pp
[11:07:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70405
[11:07:10] I also have to configure nrpe for the extra command
[11:07:17] and the icinga manifests provide no definitions for that
[11:07:24] there is one somewhere
[11:07:30] nrpe::monitor_command or something
[11:07:52] nrpe::monitor_service
[11:07:57] right
[11:16:46] !log jenkins: migrating extensions jobs to use gallium slave ( lint , phpcs-HEAD ) or be bound on master (test extensions)
[11:16:55] Logged the message, Master
[11:19:21] argh, nrpe runs as user icinga now
[11:19:25] why oh why
[11:44:16] !log Depooled old parsoid cache servers - parsoidcache migration complete
[11:44:25] Logged the message, Master
[11:48:25] New patchset: Faidon; "nrpe: revert to the nagios user/paths, clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407
[11:48:30] akosiaris: ^^
[11:48:51] untested
[11:51:07] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out
[11:51:44] New patchset: Mark Bergsma; "Prepare the mobile cache manifests for the new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70409
[11:51:44] New patchset: Mark Bergsma; "Install cp104[67] and cp1059/cp1060 as mobile caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70410
[11:51:47] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out
[11:52:34] apergos: any idea how we can get rid of the redirects?
[11:52:55] I'd left them in until we find out which articles actually have links to the articles
[11:53:01] that will take a little work
[11:53:35] if you like I can look into that tomorrow morning
[11:55:09] New patchset: Mark Bergsma; "Add default selector value" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70411
[11:56:45] apergos: harass Reedy for a bit of AWB magic
[11:56:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70409
[11:57:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70410
[11:57:28] is he an awb user? (I don't, I know it supposedly runs ok under wine but meh)
[11:57:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70411
[11:58:26] peachey|laptop__:
[11:58:45] apergos: he wrote it iirc
[11:59:22] really?
[11:59:56] huh, I know one of the other authors
[12:00:47] well I guess they are both 'developers', not the original author
[12:01:37] New patchset: Faidon; "nrpe: revert to the nagios user/paths, clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407
[12:02:56] paravoid: will it get a nagios user still?
[12:03:59] ah, the nagios-nrpe package creates it :)
[12:04:39] yep
[12:04:46] basically nagios-nrpe-server does everything
[12:04:52] I'm not sure why we've reimplemented the wheel here
[12:04:55] it's very puzzling
[12:05:14] no idea :D
[12:08:32] New patchset: Mark Bergsma; "Add new mobile caches, remove old/nonexistant" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70412
[12:09:43] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70412
[12:11:14] !log mark synchronized wmf-config/squid.php
[12:11:23] Logged the message, Master
[12:13:02] hm, so puppet cleanup is broken
[12:13:14] knsq* still are on puppet's database
[12:13:17] and nagios
[12:15:01] were they added to the decommissioning list?
[12:15:02] I didn't ;)
[12:15:06] they were
[12:15:10] ok
[12:16:31] also affects a number of other hosts
[12:16:45] singer, zinc etc.
[12:17:12] also, shouldn't decom_servers.sh also puppet cert revoke/clean?
[12:17:38] no
[12:17:46] then those decommissioned servers can't run puppet anymore
[12:18:19] it's still best if they run puppet while decommissioned (at least in theory)
[12:18:45] root@sockpuppet:~# ls -la /var/lib/git/operations/puppet/manifests/decommissioning.pp
[12:18:46] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[12:18:48] -rw-r--r-- 1 root root 1395 Jan 26 2012 /var/lib/git/operations/puppet/manifests/decommissioning.pp
[12:18:51] bwahaha
[12:19:02] doesn't stafford run it?
[12:19:24] maybe, let me check
[12:19:38] so, if they run puppet
[12:19:44] they'll get readded to stored configs
[12:19:50] nope
[12:19:54] nope?
[12:19:56] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[12:20:03] the manifests check whether the host is in the decommissioned list
[12:20:10] (at least, in theory ;)
[12:20:22] oh sure, there will be some stuff added
[12:20:32] but that's no big deal, that can get cleaned up after the host is really down
[12:20:43] found it
[12:21:48] someone decided to convert double quotes to single quotes
[12:22:14] obviously very important
[12:22:36] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[12:23:31] de4926d3
[12:23:36] PROBLEM - Host cp1059 is DOWN: PING CRITICAL - Packet loss = 100%
[12:23:36] PROBLEM - Host cp1060 is DOWN: PING CRITICAL - Packet loss = 100%
[12:23:38] Author: RobH
[12:23:38] Date: Fri May 10 11:39:44 2013 -0700
[12:23:38] barium reclaim
[12:23:48] that's the entirety of the commit msg
[12:23:56] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[12:24:35] 'wikinews-lb.wikimedia.org'?
[12:24:35] wtf?
[12:24:36] RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[12:24:53] it was there before too
[12:24:56] RECOVERY - Host cp1060 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[12:25:59] New patchset: J; "Increase transcode timeout for 720p uploads > 1h" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70413
[12:26:39] !log jenkins: all mediawiki extension jobs got updated.
[12:26:48] Logged the message, Master
[12:27:23] New patchset: Faidon; "Fix decomissioning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70414
[12:27:38] so mediawiki is configured with allowed XFF ips for our proxies
[12:27:41] but those are currently all ipv4
[12:27:53] what if a varnish, which supports ipv6, adds its IPv6 ip to the XFF list on an ipv6 request?
[12:28:41] erm?
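The XFF question above comes down to how the X-Forwarded-For chain is walked: starting from the connecting peer, each hop is only skipped if it appears in the trusted-proxy list, so a proxy that adds its IPv6 address to a chain checked against an IPv4-only list stops the walk there. A minimal sketch of that general right-to-left walk (the addresses and list below are illustrative, not MediaWiki's actual configuration or implementation):

```python
import ipaddress

# Illustrative trusted-proxy list: IPv4 only, as described in the discussion.
TRUSTED_PROXIES = {"208.80.152.161", "10.64.0.1"}

def effective_client_ip(remote_addr, xff_header):
    """Walk X-Forwarded-For right-to-left, skipping trusted proxies.

    Returns the first address that is not a trusted proxy; a proxy whose
    address (e.g. its IPv6 one) is missing from the list is treated as
    the client, and everything to its left is ignored.
    """
    chain = [ip.strip() for ip in xff_header.split(",")] if xff_header else []
    candidate = remote_addr
    while chain:
        if ipaddress.ip_address(candidate).compressed not in TRUSTED_PROXIES:
            break  # untrusted hop: don't look further left
        candidate = chain.pop()
    return candidate

# A trusted IPv4 varnish hop is skipped, so the real client IP is recovered:
print(effective_client_ip("208.80.152.161", "198.51.100.7"))   # 198.51.100.7
# An IPv6 hop absent from the IPv4-only list is *not* skipped:
print(effective_client_ip("2620:0:860:1::1", "198.51.100.7"))  # 2620:0:860:1::1
```

With an IPv4-only list, requests arriving over IPv6 would thus be attributed to the proxy itself, which is exactly the concern raised.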
[12:28:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:29:23] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70414
[12:29:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.526 second response time
[12:30:55] so, when do we actually revoke certs?
[12:30:57] never? :)
[12:31:01] never
[12:32:43] can't we say that whenever we add something to decom.pp we should also poweroff the machine and revoke the cert?
[12:33:05] you could say that
[12:33:09] good luck
[12:33:38] I'm wondering/asking if it will work in practice
[12:34:07] i doubt it
[12:34:09] exec { 'poweroff': path => '/sbin', command => 'poweroff' }
[12:34:10] :P
[12:36:28] !log Pooled the new eqiad mobile frontend caches with lowest weight in PyBal
[12:36:36] Logged the message, Master
[12:40:06] New patchset: Mark Bergsma; "Pass the storage parameter to mobile frontends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70415
[12:41:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70415
[12:45:50] so much time wasted waiting for puppet to run
[13:00:59] New patchset: Mark Bergsma; "Temporarily split the mobile cluster in old and new" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70416
[13:03:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70416
[13:07:48] why does wiktionary not normalize file names? i.e. http://en.wiktionary.org/wiki/File:attraction.ogg is not redirected to http://en.wiktionary.org/wiki/File:Attraction.ogg, while this happens for en.wikipedia.org/wiki/File:attraction.ogg
[13:07:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:56] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:56] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:56] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:56] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:56] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:57] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:57] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:58] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[13:07:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:12:52] j^, because words that begin with a capital may be different than those which start with lower case
[13:13:24] not just within the same language but across languages too
[13:17:41] New review: coren; "I'm no all that familiar with Reddis either, but that patch safely defaults to a noop if there are n..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/70064
[13:19:06] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70064
[13:20:52] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70212
[13:22:15] New patchset: Faidon; "nrpe: revert to the nagios user/paths, clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407
[13:23:22] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70407
[13:23:43] puppet is slooooowwww atm
[13:24:00] let's see if I broke everything
[13:24:07] New review: coren; "Simple enough" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70103
[13:24:08] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70103
[13:26:26] oh god
[13:26:37] - require => [ Package[nagios-nrpe-server], File["/etc/icinga/nrpe_local.cfg"], File["/usr/lib/nagios/plugins/check_dpkg"] ],
[13:26:40] - subscribe => File["/etc/icinga/nrpe_local.cfg"],
[13:26:43] oh dear god
[13:27:07] that was there before, but I left it
[13:30:05] New patchset: Helder.wiki; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[13:30:26] -allowed_hosts=127.0.0.1,208.80.154.14,208.80.152.161
[13:30:27] -
[13:30:27] +allowed_hosts=127.0.0.1
[13:30:27] +
[13:30:30] was that intended? :)
[13:30:39] New patchset: Helder.wiki; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[13:30:59] yes
[13:31:16] we have allowed_hosts in nrpe_local.cfg
[13:31:20] ok
[13:31:29] that is a template expansion too
[13:31:39] I basically reverted nrpe.cfg to the version shipped by the package
[13:31:44] good
[13:31:51] thanks for cleaning up
[13:31:54] i hated that
[13:32:03] as the default one includes both nrpe_local.cfg and /etc/nagios/nrpe.d/
[13:32:10] I'm not done yet
[13:32:14] it's all very hairy to clean up, unfortunately
[13:32:39] so now for example, the pid file path has changed and the init script doesn't think nrpe is running
[13:32:55] migrate to upstart while you're at it
[13:33:11] I think I might wait for all that to be applied everywhere and just salt killall nrpe
[13:33:23] hehe
[13:33:29] no, the plan is to remove the init file from puppet completely
[13:33:33] the one shipped by the package works just fine
[13:33:41] ok
[13:33:44] we had it copied to basically do s/nagios/icinga/
[13:33:54] whyyy
[13:34:09] for both /etc/nagios -> /etc/icinga and for the user as well
[13:34:11] I don't know!
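For context on the nrpe.cfg revert being discussed: the stock config ends with include directives, so site-local settings such as allowed_hosts can live in a separate file instead of a forked copy of the whole nrpe.cfg. A sketch of the fragments involved, using the paths and the allowed_hosts line quoted above (exact stock contents may differ by release):

```
# Tail of a stock-style /etc/nagios/nrpe.cfg: pull in local overrides
# rather than maintaining a patched copy of the whole file.
include=/etc/nagios/nrpe_local.cfg
include_dir=/etc/nagios/nrpe.d/

# /etc/nagios/nrpe_local.cfg -- site-local, puppet-templated settings;
# the IPs are the ones shown in the quoted diff.
allowed_hosts=127.0.0.1,208.80.154.14,208.80.152.161
```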
[13:34:23] yeah that broke base installs for months
[13:34:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:34:44] so, I copied the package's init script to puppet so that puppet reverts everything to what the package ships
[13:34:55] and then I'll just remove nrpe.cfg and the init script from puppet
[13:35:19] which of course will break offline hosts when I do, which is why I was looking at decom :)
[13:36:25] sounds good
[13:36:48] and I'll remove /home/icinga too
[13:36:49] :)
[13:36:58] it kills me every time I see that
[13:38:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.927 second response time
[13:41:19] New patchset: Faidon; "nrpe: don't require/subscribe on the same resource" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70418
[13:41:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:27] where's Tim to debug why puppet is having those load spikes again
[13:42:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70418
[13:42:23] because you unbroke it probably ;)
[13:42:41] anyway, require/subscribe is not bad
[13:42:59] there was no point in this case
[13:44:48] i see vhtcpd using as much cpu as varnish!
[13:45:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.874 second response time
[13:45:19] and not 0% I presume?
[13:45:22] time to give varnish a bit more load ;-)
[13:47:15] this nrpe thing all needs a rewrite
[13:51:45] ok, I'll let puppet run and I'll clean up after this interview
[13:51:48] that should be enough time :)
[13:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:36] apergos: for shared repo files that's of course not working; wondering if the File namespace should also always be capitalized
[13:54:06] I see your point, but since it's case sensitive for everything else, users there will know to capitalize the File: names
[13:54:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[13:54:49] apergos: i.e. http://mg.wiktionary.org/wiki/him mg is using some template that does not capitalize
[13:57:51] I can't actually tell where on that page but I'll take your word for it
[13:58:12] it seems like that's something that should be worked out with the local users, about what the template should do, if they have local uploads, etc
[14:00:21] no local uploads
[14:00:33] it's just looking if word.ogg exists and embeds it as an audio sample
[14:00:47] the linked page embeds a video called Him.ogg
[14:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:01:29] so it's a) wrong b) shared repo him.ogg->Him.ogg (will fix TMH to handle that case if that's what should happen)
[14:01:51] as long as there's no local uploads, I don't see why they wouldn't have the template automatically ucase the name of the media file it's looking for
[14:02:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[14:02:41] New patchset: Mark Bergsma; "Update ganglia aggregators for parsoid/mobile caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70420
[14:03:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70420
[14:16:47] New patchset: Yuvipanda; "Minor documentation fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70422
[14:17:36] New review: coren; "Tpyo fxi." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70422
[14:17:36] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70422
[14:18:22] New patchset: Alex Monk; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[14:18:25] mark: Okay to merge 70420
[14:18:29] mark: ?
[14:19:09] New patchset: Alex Monk; "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[14:19:34] checking
[14:19:57] oh, sorry
[14:19:57] yes
[14:20:42] oo, heya mark
[14:20:57] does this mean we should start capturing traffic from these hosts?
[14:20:58] https://gerrit.wikimedia.org/r/#/c/70410/
[14:21:39] ottomata: why wouldn't you be already?
[14:22:45] just heard about it because drdee saw the commit and asked me
[14:22:53] we capture based on source hostnames right now
[14:23:12] and so far we only have captured traffic from the mobile varnishes
[14:24:05] !log reedy synchronized php-1.22wmf8/extensions/ProofreadPage/
[14:24:12] Logged the message, Master
[14:26:16] also, mark, in a more nitpicky q, there is a node regex for cp1041-1044 a few lines above that; should we just add these hosts to that node regex? the content of the nodes looks the same
[14:30:02] no, those are going away
[14:31:31] ok
[14:31:49] but anyway, yes, these new servers are serving 10% of all mobile traffic already
[14:32:16] New patchset: Tim Landscheidt; "Make qacct usable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70425
[14:32:35] ok, we'll start scooping up their data too then
[14:32:36] thanks
[14:32:53] more varnish server changes on the way
[14:33:04] upload tomorrow or so
[14:33:11] text will be deployed soon
[14:33:15] esams will get mobile caches
[14:33:24] anyone who knows how salt signs keys?
[14:33:24] ulsfo caching center will get deployed soon too
[14:33:31] no clue
[14:33:35] bleh
[14:35:38] mark, cool, good to know, do you know what the hostnames of the new esams mobile cache servers will be?
[14:35:50] cp3011-3014
[14:36:03] they're in puppet already, but not serving traffic atm
[14:36:13] great, i'll just add those now and when they do we'll get the traffic
[14:36:38] you may want to make use of the $active_nodes array in role/cache.pp
[14:36:49] that's what we do for monitoring them also
[14:37:08] although I must say that at this very moment that would be defeated, as I had to pull a little trick to split the eqiad mobile cluster in old and new
[14:37:11] so there's eqiad-old and eqiad
[14:37:49] aye ok
[14:37:49] danke
[14:38:54] wouldn't it be better to tag this traffic inside the log line or something?
[14:38:57] instead of relying on hostnames
[14:39:57] can we call the same parameterized class twice in puppet ?
[14:40:20] no
[14:40:20] such as: class { 'my::class': foo => bar } and later: class { 'my::class': foo => zag }
[14:40:29] I am doomed :/
[14:40:52] puppet is declarative... you can not declare anything twice
[14:41:09] you can create a define instead of a class
[14:41:14] what is it that you're trying to do?
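The 70418 cleanup above targets a common Puppet redundancy: `subscribe` already creates the same dependency edge as `require`, and additionally refreshes the resource when the subscribed file changes, so naming the same resource in both lists is pointless. A hypothetical sketch of the pattern (resource titles here are illustrative, not the actual manifest):

```puppet
# 'subscribe' orders this service after the file AND restarts it when
# the file changes; a 'require' on the same File resource adds nothing.
service { 'nagios-nrpe-server':
  ensure    => running,
  require   => Package['nagios-nrpe-server'],
  subscribe => File['/etc/nagios/nrpe_local.cfg'],
  # require => File['/etc/nagios/nrpe_local.cfg'],  # redundant with subscribe
}
```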
[14:41:24] creating two tmpfs :) [14:42:29] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982 [14:43:40] New patchset: Hashar; "contint: vary tmpfs conf on master and slaves" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70427 [14:43:44] paravoid: here is my lame patch : https://gerrit.wikimedia.org/r/70427 [14:44:15] mark, yeah i'm sure it would, if you can think of a good way to tag it. Maybe another sub field in the X-Analytics header? [14:44:29] PROBLEM - RAID on wtp1003 is CRITICAL: Connection refused by host [14:44:30] PROBLEM - RAID on ms-be3 is CRITICAL: Connection refused by host [14:44:37] !log restarting nrpe on all hosts [14:44:39] RECOVERY - twemproxy process on terbium is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [14:44:40] RECOVERY - twemproxy process on tmh2 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [14:44:42] let's see... 
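The class-versus-define issue raised above (Puppet refuses to declare a parameterized class twice, so two tmpfs mounts can't come from one class) is conventionally solved with a defined type, which can be instantiated any number of times. A rough sketch only — the define name, mount points, and sizes here are made up for illustration and are not taken from the actual change:

```puppet
# A define may be instantiated multiple times, unlike a parameterized
# class, which Puppet only allows to be declared once per catalog.
define contint::tmpfs_mount(
    $mount_point,
    $size = '512M',
) {
    mount { $mount_point:
        ensure  => mounted,
        device  => 'tmpfs',
        fstype  => 'tmpfs',
        options => "defaults,size=${size}",
    }
}

# Two tmpfs mounts from the same define -- this would be a duplicate
# declaration error if contint::tmpfs_mount were a parameterized class:
contint::tmpfs_mount { 'master-tmpfs':
    mount_point => '/var/lib/jenkins/tmpfs',
    size        => '512M',
}
contint::tmpfs_mount { 'slave-tmpfs':
    mount_point => '/srv/tmpfs',
    size        => '256M',
}
```

The resource title ('master-tmpfs', 'slave-tmpfs') is what makes each instance unique, which is exactly the knob a class declaration lacks.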
[14:44:47] Logged the message, Master [14:44:49] RECOVERY - twemproxy process on tmh1002 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [14:45:09] RECOVERY - twemproxy process on tmh1001 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [14:46:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.053 second response time [14:47:48] New patchset: Hashar; "contint: vary tmpfs conf on master and slaves" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70427 [14:48:20] what a mess [14:49:59] PROBLEM - RAID on labsdb1001 is CRITICAL: Connection refused by host [14:51:21] puppet apply --noop --modulepath=modules manifests/site.pp manifests/foo.pp [14:51:21] :-) [14:51:27] finally found out how to locally test something [14:52:18] oh yeah, you didn't know that? [14:52:19] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: Connection refused by host [14:52:29] PROBLEM - DPKG on cp1041 is CRITICAL: Connection refused by host [14:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:39] PROBLEM - Disk space on cp1041 is CRITICAL: Connection refused by host [14:52:49] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: Connection refused by host [14:52:59] PROBLEM - RAID on cp1041 is CRITICAL: Connection refused by host [14:52:59] paravoid: nop [14:53:07] !log deluser icinga; rm -rf /home/icinga on all hosts but neon [14:53:07] paravoid: I was running puppetd -tv on an instance :( [14:53:15] paravoid, how strongly do you feel about my breaking https://gerrit.wikimedia.org/r/#/c/68584/ ? Changing that exim stuff requires me to modify the ldap node entries for labs; I'd prefer not to do it twice. 
[14:53:15] Logged the message, Master [14:53:33] Of course, if we merge both patches at once then that's not a problem, but if both patches have to be merged at once then… it seems more like one patch :) [14:53:51] um… *breaking up, I mean [14:54:02] breaking up into what? [14:54:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.165 second response time [14:54:30] yesterday you said (I think) that you want one patch that creates roles in manifests/roles and then a second patch that moves said roles into a role module [14:54:47] (which does seem better, but for the inconvenience) [14:55:12] oh yeah, I'd prefer seeing this use manifests/role, if we were to start using modules/mwrole I'd prefer seeing this in a separate commit [14:55:24] also, I don't understand what "mw" is for, this isn't mediawiki :) [14:55:28] New patchset: BBlack; "Add stat-checking for reloads, fix up error logging" [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/70428 [14:55:37] but let's have this discussion on a separate patch, not tied with exim [14:55:44] oh, good point, should be 'wm' :) [14:55:57] just use existing conversions for exim I'd say [14:56:25] I don't understand what you mean by 'existing conversions' [14:56:29] er sorry [14:56:32] conventions [14:56:35] oh, I see [14:56:42] i.e. manifests/role/ [14:56:50] Change merged: BBlack; [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/70428 [14:57:42] paravoid, and for each of the upcoming dozens of patches that create new modules… will each of those also be in two steps, creating new roles and then moving them? [14:57:59] no [14:58:25] introducing mwrole or wmrole or whatever is a separate action and deserves a separate commit I think [14:58:33] after that we can start moving things over there instead of manifests/role [14:58:44] New review: coren; "Should suffice for the job at hand." 
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70425 [14:58:44] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70425 [14:58:53] but it feels weird commenting on e.g. the name mwrole in an exim changeset [14:59:06] Ah, so, ok, how about if I make a patch that creates wmroles and moves some other arbitrary role there… then the exim patch can remain as is and be applied after that one [14:59:09] mwrole? [14:59:19] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa [14:59:27] mark, type, should be wmrole [14:59:29] RECOVERY - DPKG on cp1041 is OK: All packages OK [14:59:39] RECOVERY - Disk space on cp1041 is OK: DISK OK [14:59:40] yes, that works too, I suggested the opposite to not have exim be blocked on the other changeset :) [14:59:49] is this to avoid conflict between manifests/role and the module? [14:59:49] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [14:59:54] *typo dammit [14:59:59] RECOVERY - RAID on cp1041 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:00:15] mark, yeah, exactly. [15:00:28] We can't really have a module called 'role' unless we move every role at once [15:00:38] can't we just move roles over at the end? [15:00:49] or include everything from the role module for now? [15:01:02] anyway, meeting now [15:01:32] I hate puppet [15:01:44] including everything from the role module won't work well I think [15:01:45] I always end up being stuck with the same bug [15:01:55] but moving roles over at the end doesn't sound like a bad idea to me [15:02:14] Well… right now I am actually testing things as I work on them. [15:02:24] A great renaming at the end would require a lot of faith.
[15:02:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:42] !log updated Parsoid to 5ef05e6c9 [15:02:50] Logged the message, Master [15:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.163 second response time [15:04:20] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [15:11:07] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [15:15:11] ^demon: we're having more chats on the RFC page about exactly what we should do with the content. Mostly it seems like the right thing to do (certainly for now, maybe forever) is to expand all the templates and hand it over to solr. HTML or plain text might just be an implementation detail for you or I to decide. [15:16:18] New patchset: Andrew Bogott; "Move gerrit roles to new 'wmrole' module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70429 [15:17:10] <^demon> manybubbles: I think HTML will be better. It's way easier to work with than wikitext, we can reuse the parser cache, and is in general more portable. [15:17:54] ^demon: fine by me. would you like to do the php half of getting it in there and then I can do the solr configuration portion to suck it back out of the index? [15:18:41] <^demon> Ok, so Solr will strip HTML at indexing time, and return plaintext? [15:18:59] I _think_ that is the right thing for it to do. [15:19:11] <^demon> Sounds completely reasonable to me. [15:19:44] sweet! if you get it sending html to solr I get solr stripping it back out and doing the right thing with it. [15:20:04] <^demon> okie dokie [15:20:06] in the mean time I'm doing research into transclusion and talking on the talk page. [15:20:12] Stripping is the right thing to do in many social occasions as well. [15:20:14] Wait, what are we talking about?
:-) [15:20:38] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [15:20:38] <^demon> What could possibly go wrong with any situation that begins with "Well first you strip" [15:20:59] New patchset: Hashar; "contint: vary tmpfs conf on master and slaves" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70427 [15:21:23] Police, mostly. [15:21:29] I [15:21:32] New review: Andrew Bogott; "Please notify me on IRC before merging -- I'll need to change some node definitions in ldap for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70429 [15:21:39] I'm thinking of street festivals mostly. [15:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time [15:23:24] New review: Hashar; "Finally have something more or less working. Tried it using a lame manifest:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70427 [15:23:32] I am off [15:23:34] see you later tonight [15:24:50] !log jenkins: raising # of executors on 'gallium' slave node from 4 to 8 [15:24:58] Logged the message, Master [15:25:06] paravoid, as requested: https://gerrit.wikimedia.org/r/#/c/70429/ [15:26:22] ^demon & manybubbles, how are you going to update Solr - push or pull? [15:26:40] *sigh* [15:27:10] MaxSem: "update solr" is unclear to me at this point - do you mean add documents to the index? [15:27:16] yep [15:27:22] HTTP POST baby [15:28:00] the batch indexing is also http post, all from the php extension. [15:28:19] <^demon> MaxSem: We're using the SearchUpdate stuff...so when an edit/delete/etc is done, the index is updated as a DeferredUpdate.
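The indexing flow sketched above (MediaWiki POSTs HTML documents to Solr as a deferred update; Solr strips the HTML to plain text at index time) could look roughly like this; the endpoint URL and field names are assumptions for illustration, not the extension's actual schema:

```python
import json

# Illustrative only: the URL and the "id"/"title"/"text" fields are
# assumed, not taken from the actual Solr schema under discussion.
SOLR_UPDATE_URL = "http://localhost:8983/solr/wiki/update"

def build_update_payload(pages):
    """Build the JSON body for a Solr add. The raw HTML goes into
    'text'; stripping it to plain text is left to Solr's index-time
    analysis chain (e.g. an HTML-strip char filter), per the chat."""
    return json.dumps([
        {"id": page["id"], "title": page["title"], "text": page["html"]}
        for page in pages
    ])

def post_update(payload):
    """In the real deferred update this would be an HTTP POST of
    `payload` to SOLR_UPDATE_URL with Content-Type application/json;
    omitted here so the sketch needs no live Solr server."""
    pass

payload = build_update_payload(
    [{"id": "enwiki:12", "title": "Him", "html": "<p>audio sample</p>"}]
)
```

Both the per-edit deferred update and the batch maintenance script described above would funnel through the same POST path, differing only in how many documents each payload carries.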
[15:28:21] Solr works better when you think of it as a fancy relational database - you act on it [15:28:23] definitely post - but which way, e.g. from a dedicated daemon or from MW when pages change? [15:29:20] I'm having trouble with thinking this way, all the relational features are buried under a shitload of XML:P [15:34:23] MaxSem: fair. we'll be doing the post in process (unless we find that is a disaster) with DeferredUpdate like ^demon said. [15:35:03] MaxSem: The batch update'll be from a php maintenance script so pretty much the same. [15:37:26] paravoid, *sigh*? [15:37:39] nothing [15:39:05] <^demon> manybubbles: https://gerrit.wikimedia.org/r/#/c/70432/ [15:39:40] he wants me to spend some time on ElasticSearch. Rightly so. He might be sighing about something else, but if he was sighing about that I wouldn't blame him. [15:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time [15:47:57] ^demon: question about +2 and how to use it in the stage where we are: I'm certainly not qualified to know all the ramifications of the code you write in the plugin. Not yet, at any rate. Should I just +2 this on the assumption that we'll get another review at some point in the future (we need it anyway for the times before we were using gerrit)? [15:48:33] <^demon> I think we can afford to be liberal with merging at the moment, we're still iterating and liable to break things. [15:49:03] merged then [15:49:08] I figured [15:51:26] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:52:26] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:53:13] New patchset: Ottomata; "Updating monitoring for analytics udp2log hosts."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70436 [15:53:32] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70436 [15:53:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:52] <^demon> manybubbles: When I get cache hits, I'm averaging ~60 page/sec now \o/ [15:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [15:54:38] ^demon: I'll take it. we can get that running on 10 cpus and run through enwiki in a day and a half or so if we have to. [15:58:19] <^demon> http://p.defau.lt/?SYuQZp5sRSZ4RFAxpzeCtQ - this is on labs, with cache completely warm [15:59:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.354 second response time [16:06:18] New review: Aaron Schulz; "OK, but this timeout doesn't cause jobs to be aborted in progress (though someday it might and shoul..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/70413 [16:08:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.930 second response time [16:13:13] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [16:17:01] New patchset: J; "rename $wmgUseVipsTest to $wmgUseVips" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70439 [16:25:03] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [16:28:32] New patchset: Cmcmahon; "Enable VE experimental mode on test2wiki per Bug 49963" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70440 [16:29:01] New review: Alex Monk; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982 [16:29:32] New review: Cmcmahon; "Long-time fan, first-time committer. I'm not sure I did this correctly, please review." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70440 [16:29:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.167 second response time [16:38:13] paravoid: quick question: let's say there's a new tool that makes uploading images to commons in bulk really easy. What's an order of magnitude in total size (in gb/tb) that we'd like a little heads up on? :-) [16:38:41] The tool is for the GLAM sector, btw [16:42:58] not just size but number (scalers) [16:53:31] apergos: ah, right, forgot about that... [16:53:52] (and I was just replying to a VipsScaler bug.... heh) [16:53:57] :-) [16:54:24] apergos: is paravoid the right person to ask re scaling limits, too? 
[16:54:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70439 [16:55:59] !log reedy synchronized wmf-config/ [16:56:04] I dunno if there is a right person but he likely has a better sense of it than anyone else [16:56:05] Change abandoned: J; "https://gerrit.wikimedia.org/r/70441 for the change in TMH in that case." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70413 [16:56:07] Logged the message, Master [16:56:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:58] apergos: good enough for me, thanks! :) [16:57:10] New review: J; "confused about maxtime now, what is it used for? possibly TMH needs a lower value." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70413 [16:58:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [17:03:30] New review: Aaron Schulz; "The -t parameter just means that runJobs.php will terminate if that much time has passed (and it onl..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70413 [17:06:52] greg-g: hard to say [17:07:12] greg-g: we have about 45T of data right now [17:07:28] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [17:07:32] greg-g: iirc, half of them are thumbnails [17:07:59] greg-g: object count is http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&m=swift_object_count&h=Swift+pmtpa+prod&c=Swift+pmtpa [17:08:11] (appending &trend=1 gives the current trend too) [17:08:29] http://ganglia.wikimedia.org/latest/graph_all_periods.php?m=swift_object_change&z=small&h=Swift+pmtpa+prod&c=Swift+pmtpa&r=hour is the object change [17:09:05] New review: J; "whats a good value for the TMH case, where this runs through jobs-loop.sh and should process as many..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70413 [17:09:07] note that we add thumbs but never expire them, so this is one of the reasons it's monotonically increasing [17:09:11] Coren: ? [17:09:31] Coren: we tend to have nicknames up there, easier for people to know who to ping [17:09:49] I know, this is why I changed it back after CT told me. :-) [17:16:19] greg-g_: did you see my replies or did you fall out from irc? [17:18:09] * AaronSchulz likes how clear-profile does not work in eqiad [17:18:27] !log aaron cleared profiling data [17:18:35] Logged the message, Master [17:18:52] ahh, works from tin actually [17:22:34] ^demon: so I've had a look at transcluded pages and there are really two parts to it. We've got the first part: including templates. [17:23:11] the second part is reindexing pages that include the transcluded pages. which needs some thought. [17:23:37] <^demon> Huh? [17:25:00] say you are straight including page S in page T. If page S changes we should reindex page T so we pick up the changes in S that are included in it. [17:25:15] Well, if we want to do that, that is. [17:27:47] paravoid: got em, reading/looking now [17:29:38] paravoid: so, I guess one question is: how much free space do we have on the machines that host images/thumbs now? [17:30:47] that's not the right question :) [17:31:42] probably not, but is "one" ;) [17:31:46] we are about 60% full [17:31:59] so, how do I respond to a guy asking me "when should I give you all a warning?" [17:32:13] but we have budgeted for more capacity this coming FY [17:32:16] * greg-g_ nods [17:32:43] I pinged robla when we were budgeting to ask about any large projects but no one really can know if e.g.
video will take off [17:32:55] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:57] right right [17:33:01] all guess work [17:33:10] so we have budgeted for quite a bit more than we have, I think doubling our current capacity was what we ended up with [17:33:37] well, we have current trends, but there's a strategy to increase videos which can affect storage a lot [17:33:58] right [17:34:04] so, anyway [17:34:14] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [17:34:17] any guidelines I could share? :-) [17:34:32] can I reverse the question? [17:34:36] what do they want to do? :) [17:34:42] he wasn't happy with my "we'll probably be fine" answer :) [17:34:54] paravoid: basically, he's just playing cautious. Full story...: [17:35:16] people rarely do that, so I appreciate that! [17:35:52] there's this tool that is being developed that will help museums/libraries/etc to upload large batches of images. There are no limits in place right now re size or # of files. [17:36:10] some museums/libraries/etc have LOTS of images they'd be willing to share. Like lots. tons. [17:36:35] but, some might not have that many, so we don't want everyone to give us a heads up, but only the ones that are exceptional [17:36:40] so, to define exceptional :) [17:37:01] heh [17:40:17] New review: Dzahn; "has now been fixed via:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2446 [17:42:15] What is "mayapple"? [17:43:19] Coren: a server in esams that is part of toolserver [17:43:41] it's in rack OE10 [17:43:41] Do we have dracs over there?
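The transclusion problem raised earlier (page S changes, so every page T that transcludes S needs reindexing) reduces to a backlink lookup. A toy sketch only: the in-memory map below stands in for MediaWiki's templatelinks table, and it handles a single level of transclusion (real transclusion can nest):

```python
from collections import defaultdict

# Toy stand-in for the templatelinks table: maps a page to the set of
# pages that transclude it. One level only; nesting is not followed.
transcluded_in = defaultdict(set)

def record_transclusion(source, target):
    """Note that `target` transcludes `source`
    (i.e. {{source}} appears in target)."""
    transcluded_in[source].add(target)

def pages_to_reindex(changed_page):
    """The changed page itself, plus every page that transcludes it."""
    return {changed_page} | transcluded_in[changed_page]

record_transclusion("S", "T")
record_transclusion("S", "U")
```

Editing S then yields a reindex set of S, T, and U, while editing a leaf page like T only reindexes T itself — which is the asymmetry that "needs some thought" in the chat.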
[17:44:11] i don't think it's a dell, but lemme look [17:44:16] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/70033 [17:45:46] Coren: i don't think we do [17:45:50] Coren: http://wikimedia.7.x6.nabble.com/mayapple-config-amp-cron-jobs-td4998314.html [17:46:12] http://mayapple.toolserver.org/ [17:47:16] Coren: pretty sure that ticket needs actual hands in dc [17:47:44] mutante: Yeah, I was hoping we had a mgmt console available for it. Who do we normally poke to wander in the ams dc? [17:48:12] Coren: Mark :p [17:50:50] Anyone know about the search issue on e.g. enwiki? [17:50:59] "Pool queue is full" [17:51:03] Coren: i think he'll go back soon anyways to remove knsq stuff, but ask him [17:52:30] https://wikitech.wikimedia.org/wiki/File:Knams-multihomed.png [17:52:34] mutante: I will, if he's going there anyways he might as well take a look. [17:52:52] nods [17:53:42] I mean to say, search is currently working on 1/5 tries (unscientific experiment) so clearly something is up, it'd be great to look in [17:55:09] hmm [17:55:47] yrsh … was just coming for that question too [17:55:57] though so far it's seeming more like 1/3 [17:57:16] New patchset: Demon; "Up gitblit cache to 7 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70448 [17:58:01] New review: Demon; "We've still got tons of heap to spare, and this will improve cache hits." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70448 [17:58:48] weird, all of the search related poolcounter locks i see are for search pool 3 (ACQ4ME _lucene:host:10.2.1.13) which isn't enwiki related.. not seeing any for search pool 1 [17:59:44] nm, that appears to just be related to how the requests are hashed amongst poolcounter daemons [18:01:20] <^demon> How many are queued? 
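The PoolCounter observation above (all the search locks for one key appearing on a single daemon, which turned out to be "just related to how the requests are hashed amongst poolcounter daemons") is what key-based server selection looks like. A deterministic toy version — the addresses are made up and the hash function is chosen only for reproducibility, not the production client's actual algorithm:

```python
import zlib

# Made-up daemon addresses; the real client's server list and hash
# function may differ -- this only shows the shape of the behaviour.
DAEMONS = ["10.2.1.13", "10.2.1.14", "10.2.1.15"]

def daemon_for_key(key, daemons=DAEMONS):
    """Every lock request for the same key lands on the same daemon,
    so one key's locks always cluster on one server."""
    return daemons[zlib.crc32(key.encode()) % len(daemons)]
```

Because the mapping depends only on the key, seeing all of one pool's locks on a single host says nothing about load distribution overall — exactly the false alarm in the chat.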
[18:01:40] switching channels [18:06:28] hello [18:08:26] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [18:08:54] New patchset: Pyoungmeister; "moving en and prefix search traffic back to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70450 [18:13:20] !log reedy synchronized php-1.22wmf8/extensions/UploadWizard [18:13:28] Logged the message, Master [18:18:27] New patchset: Dzahn; "fix sorting in largest_html, up changelog file that was missing 2.2, minor tab fix" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/70453 [18:25:25] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70450 [18:26:18] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:38] !log py synchronized wmf-config/lucene-production.php 'moving en and prefix search traffic back to eqiad' [18:26:42] Logged the message, Master [18:27:16] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [18:28:17] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [18:29:08] New patchset: Reedy; "Update php symlink from wmf4 to wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70454 [18:29:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70454 [18:30:18] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [18:31:20] New patchset: Pyoungmeister; "Revert "moving en and prefix search traffic back to eqiad"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70455 [18:31:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:07] New patchset: Reedy; "Remove 1.22wmf3 and 1.22wmf4" [operations/mediawiki-config] (master) - 
https://gerrit.wikimedia.org/r/70456 [18:32:14] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70455 [18:32:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [18:32:26] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70456 [18:33:02] !log py synchronized wmf-config/lucene-production.php 'moving en and prefix search traffic back to pmtpa' [18:33:10] Logged the message, Master [18:36:18] PROBLEM - Puppet freshness on search22 is CRITICAL: No successful Puppet run in the last 10 hours [18:37:15] Can someone please run on tin? rm -rf /a/common/php-1.22wmf3/ [18:39:49] OK [18:40:23] Reedy: Done [18:40:30] Thanks [18:47:28] New patchset: QChris; "Use force push when replicating gerrit repos to antimony (gitblit)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70457 [18:47:51] ^demon: ^ would fix the problem for antimony. [18:48:10] But I haven't found anything for gallium [18:51:31] <^demon> That might be it. [18:51:37] <^demon> We'll go with that, it's not a huge deal.
[19:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [19:03:40] !log reedy synchronized docroot and w [19:03:48] Logged the message, Master [19:03:54] New review: QChris; "Quote from #wikimedia-ops:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70457 [19:08:11] New patchset: Jdlrobson; "Stop handshaking with commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70458 [19:08:58] New patchset: Jdlrobson; "Stop handshaking with commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70458 [19:11:45] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [19:17:13] New patchset: Alex Monk; "Fix links to Gitweb in highlight.php on noc.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [19:19:34] New review: Reedy; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [19:22:08] New patchset: Alex Monk; "Fix links to Gitweb in highlight.php on noc.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [19:25:16] !log upgrading adminbot to 1.7.4 on wikitech-static [19:25:24] Logged the message, Master [19:26:12] !log reedy synchronized . [19:28:32] Logged the message, Master [19:29:10] * Reedy pets ori-l [19:30:56] hm [19:31:04] is morebots dead? heh [19:31:06] that's bad [19:31:28] !log test [19:31:38] Logged the message, Master [19:31:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:55] it propagated to wikitech [19:32:04] but not twitter [19:32:08] indeed [19:32:17] ah [19:32:17] did you set the config settings for twitter? 
[19:32:19] it's not enabled [19:32:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [19:32:37] I set up the oauth stuff, but forgot to set twitter to true [19:32:52] yeah. i didn't have access to the account so i couldn't generate the keys myself. but i tried to make it painless by documenting the process thoroughly in the config file [19:32:57] !log test [19:33:06] Logged the message, Master [19:33:11] working [19:33:17] woot! :) [19:33:21] thanks Ryan_Lane! [19:33:24] good job :) [19:33:37] Ryan_Lane: looks like it posted twice? [19:33:49] to the wiki [19:33:55] greg-g: to wikitech? no, Ryan_Lane !log test-ed twice [19:33:57] nah, I logged test twice [19:34:06] oh, missed that, I don't see it [19:34:39] oh, there it is [19:34:51] * greg-g can reed [19:37:04] New review: Hashar; "Awesome, thank you mutante!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2446 [19:37:27] did we just drop identi.ca already? I don't see the tests here: http://identi.ca/wikimediatech [19:38:38] hm, maybe Ryan_Lane dropped it in anticipation of the API being disabled [19:38:43] as they migrate to pump.io [19:38:52] I did disable it [19:39:38] otherwise it'll post twice, right? [19:39:41] to twitter [19:39:42] nope [19:39:56] since we have a bridge? [19:40:02] Evan P didn't update the identi.ca -> twitter bridge when it broke due to twitter api v1.1 either :) [19:40:09] heh [19:40:11] the identi.ca twitter bridge is disabled at the moment, but there are vague plans to re-enable it at some point in the future when they're on pump.io [19:40:19] so it could conceivably be re-enabled [19:40:29] since, well, the conversion to pump.io and all [19:40:34] no sense in working on old code that'll only be around for a week or so [19:40:38] right [19:40:47] so, should we bother to keep posting to identi.ca?
[19:40:55] probably not, IMO [19:40:58] ok [19:41:03] :( it's how I keep track [19:41:15] but I'll loose it when identi.ca switches to pump.io "soon" anyways :( [19:41:18] lose [19:41:37] greg-g: we can add pump.io support; i don't think it's that hard [19:41:46] s'ok *snif* [19:41:52] heh [19:42:40] IMO evan made a mistake in announcing the migration before nailing down the API specs and ensuring there are compatible libraries for major scripting languages [19:42:47] agreed [19:43:16] and then did not follow through with the forced migration, which is accommodating on the one hand but also makes it hard to base your plans on the platform [19:43:28] yeah [19:43:59] i kinda hate twitter, don't even have an account, but i had to agree with ryan [19:44:02] he's got a lot of concerns all intermingled, one of the main ones being that running identi.ca costs him a lot of money/mo and the migration to pump.io will drastically reduce that [19:44:22] (like, costs him way more than I thought) [19:50:14] expensi.ca [19:51:43] indeed [19:51:48] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:52:48] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [20:10:35] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70458 [20:11:01] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70352 [20:12:08] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [20:31:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:46:53] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:46:53] PROBLEM - LVS HTTP IPv6 on
wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:47:05] :( [20:47:43] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 95630 bytes in 0.520 second response time [20:47:43] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 95630 bytes in 0.782 second response time [20:57:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [21:06:36] scapping... [21:10:02] uh oh [21:15:12] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:15:21] Logged the message, Master [21:16:43] RECOVERY - LVS Lucene on search-pool2.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [21:17:07] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [21:17:16] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 0.004 second response time on port 8123 [21:17:18] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [21:17:41] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [21:17:43] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [21:18:05] !log restarted pybal on lvs1003 [21:18:14] Logged the message, Mistress of the network gear. 
[21:21:25] New patchset: Pyoungmeister; "for real moving en and prefix search traffic back to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70543 [21:22:13] New patchset: Pyoungmeister; "for real moving en and prefix search traffic back to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70543 [21:23:31] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70543 [21:24:20] !log py synchronized wmf-config/lucene-production.php 'moving en and prefix search traffic back to eqiad for realz' [21:24:28] Logged the message, Master [21:27:46] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:27:56] Logged the message, Master [21:31:32] New patchset: Pyoungmeister; "moving search pools 2 and 3 to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70547 [21:31:36] greg-g: note how fatals shot up [21:31:54] Change abandoned: Pyoungmeister; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [21:31:56] greg-g: http://ur1.ca/edq1f [21:32:16] greg-g: I brought it up on #wikimedia-mobile ; we think it's just a side-effect of scap [21:32:33] that is, rsync not being atomic [21:32:37] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70547 [21:33:29] !log py synchronized wmf-config/lucene-production.php 'moving pool2 and 3 search traffic back to eqiad for realz' [21:33:34] greg-g: i think we should strive to fix that. Tim explained to krinkle a while back how he'd go about making scap atomic, there may be a bug with the details [21:33:38] Logged the message, Master [21:36:08] ori-l: eek, yeah [21:36:31] Krinkle: do you remember if there was a bug/something with what Tim's idea was? 
[21:36:34] re above [21:38:17] greg-g: awjr mentioned that the file cited in the brief spike of fatals was indeed removed in that deployment. that could ostensibly be fixed by using rsync with '--delete-delay', but a precondition for that is https://gerrit.wikimedia.org/r/#/c/57890/ [21:38:50] i should double check that my statement was true :) [21:39:16] im fairly positive it is; checking now [21:41:04] yeah, that file was indeed removed during this deployment [21:42:19] huh [21:42:21] well then [21:42:39] greg-g: there's https://bugzilla.wikimedia.org/show_bug.cgi?id=20085 [21:43:04] from 2009; the last comment from 2011 (by platonides) is basically what tim suggested [21:43:35] i.e., syncing to a versioned directory name and then atomically updating a symlink so that it points to the newest one [21:44:10] New patchset: Pyoungmeister; "moving search pools 4 and 5 to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70548 [21:44:30] yeah, I've heard rumblings of doing that for a while [21:44:34] doesn't git-deploy solve this? 
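[Editor's note] The versioned-directory-plus-symlink scheme discussed above (Tim's suggestion, and Platonides' last comment on bug 20085) can be illustrated with a small shell sketch. Paths and the `deploy` helper are hypothetical, not the actual scap code; the point is that the swap is a single rename(2), so readers never observe a half-synced tree:

```shell
#!/bin/sh
# Sketch of an atomic deploy: sync each release into its own versioned
# directory, then flip a 'current' symlink in one rename(2) call.
set -e
root=$(mktemp -d)
mkdir -p "$root/releases"

deploy() {
    ver="$1"
    dst="$root/releases/$ver"
    mkdir -p "$dst"
    # A real deploy would populate the directory non-atomically here,
    # e.g.: rsync -a --delete-delay build/ "$dst/"
    echo "$ver" > "$dst/REVISION"
    # Build the new symlink under a temporary name, then rename it over
    # the live one. rename(2) replaces the old link atomically.
    ln -sfn "$dst" "$root/current.tmp"
    mv -T "$root/current.tmp" "$root/current"
}

deploy v1
deploy v2
readlink "$root/current"
```

(`mv -T` is GNU coreutils; it treats the destination as a normal file so the symlink itself is replaced rather than descended into.)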
[21:44:39] greg-g: I'm gonna need an LD today [21:44:45] RoanKattouw: well then [21:44:53] RoanKattouw: :) ok [21:45:17] greg-g: it might; i don't know [21:45:22] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70548 [21:46:08] !log py synchronized wmf-config/lucene-production.php 'moving pool4 and 5 search traffic back to eqiad for realz' [21:46:17] Logged the message, Master [21:47:50] greg-g, git-deploy is vaporware so far:) [21:48:11] ori-l: logically it makes sense (git fetch, the checkout, or whatever) [21:48:14] MaxSem: sure [21:48:25] well, kind of [21:48:42] MaxSem: I use it to deploy EventLogging; I think it's used for parsoid too [21:48:52] I mean, one is fixes to the old system and another is a new system with hopefully improved architecture (I could be wrong here in my perspective) [21:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [21:56:03] greg-g: I'm not sure git fixes this particular problem. It's true that 'git fetch' allows you to pull changes without touching the working directory, but then you still need to update the working directory when you are ready to switch over, and I don't think 'git checkout' is atomic [21:56:15] ah [21:56:43] hmm, is it something that hphpvm can address?
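[Editor's note] The `git fetch` vs. working-directory point made above can be demonstrated with a throwaway repo (a sketch with temporary paths; it assumes nothing about the real deployment setup). Fetch downloads objects without touching the working tree; only the later fast-forward step, which rewrites files one at a time, changes what is served — and that step is not atomic by itself:

```shell
#!/bin/sh
# Two-phase git update: fetch (safe, worktree untouched), then a
# fast-forward merge (rewrites files one by one, hence not atomic).
set -e
work=$(mktemp -d)
git -c init.defaultBranch=main init -q "$work/origin"
( cd "$work/origin" \
  && echo one > file \
  && git add file \
  && git -c user.email=x@y -c user.name=x commit -qm v1 )
git clone -q "$work/origin" "$work/deploy"
( cd "$work/origin" \
  && echo two > file \
  && git -c user.email=x@y -c user.name=x commit -aqm v2 )
cd "$work/deploy"
git fetch -q origin                 # phase 1: objects arrive, worktree unchanged
before=$(cat file)                  # still "one"
git merge -q --ff-only FETCH_HEAD   # phase 2: worktree rewritten file by file
after=$(cat file)                   # now "two"
echo "$before -> $after"
```

This is why the symlink swap (or an equivalent rename-based switch) is still needed on top of git: phase 2 has the same mid-sync visibility window that scap's rsync does.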
[21:57:02] MaxSem: I think the symlink solution is probably simplest [21:57:05] like, cache every file, then reload after a command [21:57:53] dunno [21:59:06] let's go simple first [21:59:12] hphp isn't for a while, anywho [22:09:38] PROBLEM - Puppet freshness on mw1116 is CRITICAL: No successful Puppet run in the last 10 hours [22:10:38] PROBLEM - Puppet freshness on search21 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [22:31:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [22:35:52] New patchset: MaxSem; "Don't display mobile view for some foundationwiki pages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70557 [22:36:19] New review: MaxSem; "Waiting for dependency." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/70557 [22:37:56] greg-g: ori-l: scap atomic TimStarling [22:39:02] Krinkle: keyword Krinkle ? [22:39:42] greg-g: trying to follow up on what you were talking about [22:39:57] you were asking about whether there was a bug with it that may be the reason we don't have it yet? [22:40:09] afaik there wasn't a problem with it, at least not that I know TimStarling mentioned at the time. [22:40:27] I suppose there's a slight initial cost of setting it up, but other than that it seemed like a pretty solid plan [22:41:36] so "it" in that sentence is a real thing already written or an idea?
[22:43:26] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:43:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:22] (for those playing along at home, Krinkle and the VE team are arguing about something in real life) [22:44:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [22:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:52:45] the reason we don't have it is because nobody has done it yet [22:53:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:54:14] greg-g: what is "it"? [23:01:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.005 second response time [23:04:57] AaronSchulz: that's kind of my question :) [23:06:52] for another day... 
[23:07:55] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [23:07:55] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:55] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:55] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:55] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:55] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:55] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [23:07:56] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [23:07:57] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:58] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [23:09:45] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [23:10:08] No one else doing a lightning deploy? [23:11:20] * RoanKattouw deploys [23:23:02] anyone know what labs instance Nik/manybubbles has been using? [23:24:43] !log catrope synchronized php-1.22wmf7/extensions/VisualEditor 'Updating VisualEditor to master' [23:24:53] Logged the message, Master [23:25:09] !log catrope synchronized php-1.22wmf8/extensions/VisualEditor 'Updating VisualEditor to master' [23:25:17] Logged the message, Master [23:31:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.730 second response time