[00:00:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [00:08:09] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:08:09] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:08:09] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:09:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [00:10:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [00:13:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [00:14:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 24 seconds [00:33:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [00:35:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [01:09:09] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:18:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [01:20:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [01:48:13] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [01:50:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [02:08:13] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [02:10:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [02:10:46] !log LocalisationUpdate completed (1.22wmf2) at Sun Apr 21 02:10:46 UTC 2013 [02:10:55] Logged the message, Master [02:16:57] !log LocalisationUpdate completed (1.22wmf1) at Sun Apr 21 02:16:56 UTC 2013 [02:17:04] Logged the message, Master [02:23:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [02:24:14] 
RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [02:33:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [02:35:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [02:37:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Apr 21 02:37:21 UTC 2013 [02:37:28] Logged the message, Master [02:53:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [02:54:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [03:08:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [03:10:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [03:18:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [03:20:17] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [03:43:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [03:45:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [03:48:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 185 seconds [03:50:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [03:53:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [03:55:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [03:58:13] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [04:00:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [04:13:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [04:15:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [04:43:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [04:45:15] 
RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [05:18:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [05:22:58] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 20 seconds [05:48:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [05:53:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [05:55:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [06:13:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [06:18:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [06:19:58] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [06:23:26] New patchset: Ori.livneh; "Create self-standing IPython Notebook Puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [06:23:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [06:29:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 236 seconds [06:34:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [06:35:50] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [06:35:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [06:36:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:36:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [06:37:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time [06:56:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:58:20] RECOVERY - Puppetmaster HTTPS on stafford is 
OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [07:02:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.158 second response time [07:16:42] New review: Faidon; "First of all, we're deprecating /a in favor of /srv." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [07:24:39] New review: Faidon; "Looks perfect, thanks." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60094 [07:26:17] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:30:10] New review: Faidon; "Very thorough job and verbose commit message, nice :)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/60187 [07:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.164 second response time [08:01:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time [08:15:13] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [08:22:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.164 second response time [08:44:11] New review: Ori.livneh; "@Faidon: Yep, good points. I'll update the patch." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [10:00:32] New patchset: Ori.livneh; "Create self-standing IPython Notebook Puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [10:00:32] New patchset: Ori.livneh; "Use Upstart rather than supervisor to manage IPython" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60094 [10:08:11] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [10:08:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [10:08:12] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [10:10:19] New review: Ori.livneh; "PS2 is more consistent about naming things. Note that IPython Notebook is an extension of IPython, w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [10:14:59] New review: Ori.livneh; "(Thanks for the review, by the way!)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [11:09:57] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [11:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:33:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [13:54:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 205 seconds [13:55:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [14:01:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:02:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [14:45:15] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [14:56:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [15:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:56:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [16:01:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:03:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [16:26:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [16:36:09] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [16:36:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:26:39] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:44:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 201 seconds [17:45:17] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds [18:15:29] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [18:19:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: 
CRIT replication delay 204 seconds [18:21:49] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [18:27:09] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [18:28:40] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [18:28:49] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds [18:29:09] PROBLEM - SSH on caesium is CRITICAL: Server answer: [18:29:09] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [18:31:49] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [18:34:09] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:36:31] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:36:31] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:36:41] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:49:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 204 seconds [18:53:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [18:54:48] New patchset: Andrew Bogott; "Add manage-nfs-volumes-daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60083 [18:58:42] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 219 seconds [19:01:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [19:44:13] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [19:45:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [19:58:13] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [20:00:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [20:08:36] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [20:08:36] PROBLEM - Puppet freshness on 
lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [20:08:37] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:44:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 212 seconds [20:45:04] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [21:10:07] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:18:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 181 seconds [21:20:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [21:25:45] New patchset: coren; "Add manage-nfs-volumes-daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60083 [21:26:08] Gah! [21:26:18] How can I base a changeset on yours, andrewbogott [21:26:22] ? [21:26:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:41] Ah, no, that actually worked. [21:26:47] Coren, I think that if you preserve Change-Id it should just work [21:27:08] How did you test on labstore3, incidentally? Manual installs? [21:27:39] My test script is probably still sitting in /usr/local/lib [21:27:43] although the one in gerrit is cleaner. [21:28:00] andrewbogott: Want to do a once-over my addition? [21:28:01] The script has the 'dry-run' member which keeps it from running amok [21:28:06] sure. [21:28:08] I left dry-run atm [21:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [21:28:58] oops, it's rebased so hard to see the diff. Which files did you change? [21:29:43] Review allows you to pick the revision to compare with; click the '3' in the first column's header. [21:29:51] Um... [21:29:53] Did you try that? [21:29:59] Yeah, it works for me. [21:30:15] If '3' is selected in the first column and '4' in the second. 
[21:30:29] Changed: added sudo to sync-export in the daemon [21:30:51] And the user to project-nfs-storage-service in openstack.pp [21:31:00] ~line 217 [21:32:11] Ah, I see, in the per-file diff you mean. It's the 'old version history' popup that is crazy [21:34:26] Do you want to actually merge that patch now, or keep it as a work in progress? [21:34:36] (We would want to add a site.pp line before merging.) [21:35:35] Hm. Lemme add the class in site.pp and we can merge this. [21:36:55] New patchset: coren; "Add manage-nfs-volumes-daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60083 [21:37:35] It's safe since it keeps dry-run to true [21:37:55] Yeah, although at some point we'll have to turn it loose :) [21:38:14] andrewbogott: I'll do it by hand once before two puppet runs to check it out [21:38:30] But this way, we know the default reverts to safe. [21:38:56] Oh, we don't want the script to run on 3 and 4, though, only on one of 'em [21:39:08] They're never both on at the same time. [21:39:19] ? [21:39:25] What do you mean 'on'? [21:39:26] PROBLEM - SSH on pdf3 is CRITICAL: Server answer: [21:39:50] Actually turned on. They're cold standbys [21:40:17] They're both wired to the same physical shelves. [21:40:50] They won't even mount the filesystems at all automatically on boot. [21:41:12] By design. [21:41:22] Oh… so the redundancy is only via raid? [21:42:06] * Coren nods. It's raid 6 [21:42:10] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60083 [21:42:26] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [21:42:26] You doing the merge on sockpuppet? [21:42:49] Yeah. It's raid 6; with two sets of controllers and two servers. That thing isn't going down unless the machine room catches on fire. :-) [21:43:05] Or, we like ignore the array failing. :-) [21:43:08] Yep, merged. [21:44:23] * Coren runs puppet. 
[21:47:26] notice: /Stage[main]/Openstack::Project-nfs-storage-service/Service[manage-nfs-volumes]/ensure: ensure changed 'stopped' to 'running' [21:47:51] that's good, right? [21:48:25] Yep. [21:53:10] Would even have worked if the upstart script didn't hardcode the logfile to the wrong home. :-) [21:53:22] oops [21:53:30] Heh. Fixing. :-) [21:56:54] New patchset: coren; "Minor bugfix to manage-nfs-volumes upstart script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60224 [21:58:07] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60224 [21:58:44] merged [22:04:28] Not quite working. Odd; as far as I can tell the sudoers is right but it's also not working. Odd. [22:04:43] Which part is misbehaving? [22:05:20] 'sudo rmdir /srv/testlabs/home' asks for a password, but: [22:05:20] [22:05:20] User nfsmanager may run the following commands on this host: [22:05:20] (root) NOPASSWD: /bin/mkdir -p /srv/* [22:05:20] (root) NOPASSWD: /bin/rmdir /srv/* [22:05:20] (root) NOPASSWD: /usr/local/sbin/sync-exports [22:06:29] Dafu? [22:06:38] When I sudo by hand to nfsmanager it works. [22:07:27] * Coren boggles. [22:07:34] So maybe the script isn't really running as nfsmanager? [22:08:26] Looks like it actually does. [22:08:40] It's not running at all atm, right? [22:09:02] Yeah, I stopped it. [22:10:21] I think '*' wildcard expansion in sudoers doesn't include forward slashes [22:10:41] ori-l: I did when I tried it by hand from the nfsmanager user. [22:10:41] so rmdir /srv/testlabs would work, but rmdir /srv/testlabs/home wouldn't [22:10:51] oh. hah. [22:11:06] Oh, the sudo policy doesn't apply recursively? [22:11:25] That'll be tricky [22:11:32] "recursively?" [22:11:40] I mean, on subdirs [22:11:58] No, it does. * does include forward slashes. [22:12:08] At least it does when I try it: [22:12:10] man sudoers : "Note that a forward slash ('/') will not be matched by wildcards used in the path name." 
[22:12:28] "When matching the command line arguments, however, a slash [22:12:28] does get matched by wildcards. " [22:12:35] root@labstore3:/var/lib/nfsmanager# sudo -iu nfsmanager [22:12:35] nfsmanager@labstore3:~$ sudo rmdir /srv/faux/home [22:12:35] rmdir: failed to remove `/srv/faux/home': No such file or directory [22:12:56] Ah. Works in arguments. [22:13:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [22:13:57] Oh, hm. [22:13:58] so it _should_ work [22:14:05] OK, do you want to update the script or shall I? [22:14:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds [22:14:46] No, it's just behaving oddly. It works when I run it from the command line as nfsmanager, but not with su. [22:15:35] Look at this: [22:15:39] 995 5599 2.0 0.0 94860 12140 pts/0 T 22:15 0:00 | \_ /usr/bin/python /usr/local/sbin/manage-nfs-volume [22:15:40] root 5600 0.0 0.0 35200 1552 pts/0 T 22:15 0:00 | | \_ sudo rmdir /srv/testlabs/home [22:15:50] Waits on a password, but: [22:16:02] nfsmanager@labstore3:~$ rmdir /srv/testlabs/home [22:16:02] rmdir: failed to remove `/srv/testlabs/home': No such file or directory [22:16:03] Does manage-nfs-volumes-daemon-2 work any better? [22:16:41] Aha. It does. [22:16:53] That sort of makes sense. [22:17:00] So, hang on, I will submit a patch. [22:19:30] andrewbogott: Odd thing; as far as I can tell it properly created every project directory, but it only added tools-{home,project} to the exports. [22:19:49] Yes, because it only adds exports for volumes that actually exist. [22:19:58] Since it's in dry-run mode it doesn't create those other volumes, so… no exports [22:20:15] Aha, right! [22:20:30] New patchset: Andrew Bogott; "Break up our 'sudo' commands into separate args." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60226 [22:20:48] ^ look right? [22:21:03] New review: coren; "Full of win." 
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60226 [22:21:28] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60226 [22:22:00] ok, merged on sockpuppet [22:22:05] ... did you just do the puppet-merge? [22:22:08] So you have. [22:22:47] puppetd -tv [22:23:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [22:23:48] works, then exits? [22:24:00] * Coren tries by hand. [22:24:02] It should persist... [22:24:39] File "/usr/local/sbin/manage-nfs-volumes-daemon", line 218, in update_exports [22:24:39] volpath, permissions, fsid) [22:24:40] TypeError: not all arguments converted during string formatting [22:25:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [22:26:11] Last log entry: 04/21/2013 - 22:24:14 - Updated exports for publicdata, home [22:26:16] ok, again, -2 should work better [22:26:32] (this one was just copy/paste error) [22:26:55] So it does. [22:27:13] This time I'm going to wait until everything works before puppetizing :) [22:27:25] * Coren tries it for real [22:28:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [22:29:19] labstore3 and labstore4 have separate raids, right? [22:29:39] PROBLEM - SSH on cp1043 is CRITICAL: Server answer: [22:30:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [22:30:21] andrewbogott: No, it's the same raid. Well, they also have local disks, but the actual storage is on the shared disks. [22:30:33] Ah, ok. 
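Note on change 60226 ("Break up our 'sudo' commands into separate args."), merged at 22:21:28: the log never shows the daemon's code, so what follows is only a sketch of the general idea, not the actual patch. It assumes the script launches sudo as a child process from Python; the helper name `sudo_argv` is hypothetical. The point is that each word of the command becomes its own argv element, which is the form a rule such as `/bin/rmdir /srv/*` sees when the command is typed by hand.

```python
import shlex


def sudo_argv(command, *args):
    """Build an argv list with every word as a separate element.

    Hypothetical helper for illustration; not taken from the real daemon.
    """
    return ["sudo", command] + list(args)


# The command as it appears in the process listing at 22:15:40.
single_string = "sudo rmdir /srv/testlabs/home"

# The same command as separate arguments.
argv = sudo_argv("rmdir", "/srv/testlabs/home")

# shlex.split breaks a command string into the equivalent argv list.
split_from_string = shlex.split(single_string)

# To actually run it (not executed in this sketch):
#   subprocess.call(argv)
```

With the list form there is no dependence on how a single string gets split before it reaches sudo.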
[22:30:38] RECOVERY - SSH on cp1043 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:30:45] That makes more sense than what I was imagining [22:31:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [22:33:18] PROBLEM - SSH on caesium is CRITICAL: Server answer: [22:34:18] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:38:46] New patchset: Andrew Bogott; "Remove an unwanted arg from a string format." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60228 [22:43:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [22:44:50] Coren, everything behaving OK? [22:45:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [22:47:30] andrewbogott: Kinda, there still seems to be kinks with creation. Trying to work 'em out now. [22:48:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [22:50:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [22:51:28] Odd. Some of 'em, the script doesn't even try the mkdir. Most work. [22:51:56] Maybe I'm not iterating through the projects correctly. [22:51:59] What's an example of something it's missing? 
[22:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [22:53:28] math [22:53:52] Also haproxy [22:54:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [22:54:58] Coren, look at this page: https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=configureproject&projectname=math [22:55:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [22:55:14] Creation of projects and home is configurable. For that project, creation is disabled. [23:01:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:03:37] Aha. It can't export the volumes because it has no volumes to export. It makes sense. :-) [23:04:26] Noisy in the logs, though. [23:04:52] Otherwise, then everything is fine. Want to update the script with the -2 version? [23:05:02] And turn off dry run? [23:05:12] What log spam are you seeing? [23:05:38] "Unable to set exports for math, project because we can't find it." and so on. [23:07:12] Hm, that seems unnecessary... [23:13:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 181 seconds [23:13:26] New patchset: Andrew Bogott; "Turn off dry_run, and fix a few small errors." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60228 [23:14:02] Coren: ^ [23:15:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [23:17:17] PROBLEM - SSH on caesium is CRITICAL: Server answer: [23:18:17] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:20:43] New review: coren; "It oozes correctness." 
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60228 [23:20:44] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60228 [23:23:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [23:24:45] Less log spam now? [23:24:45] andrewbogott: That's a shitload of exports. :-) [23:24:52] yeah [23:24:54] Yep. [23:25:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [23:25:14] Works perfectly. [23:25:28] That's good! [23:25:36] I'll let you poke a bit more, then I'm going to dinner :) [23:25:54] Hm. All that's left is to puppetize the NFS automounts and we can deploy. It's local to tools- atm [23:25:56] Good eats. [23:28:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [23:28:47] Cool -- catch you later! [23:30:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [23:37:00] New patchset: Reedy; "Remove $wgMemCachedInstanceSize" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60231 [23:38:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [23:39:12] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds
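Note on the traceback at 22:24:39: "TypeError: not all arguments converted during string formatting" is the error Python raises when a %-style format is given more values than it has placeholders, which fits the first title of change 60228 ("Remove an unwanted arg from a string format."). A minimal sketch follows; the format string and the values are made up for illustration, since the log shows only the argument names `volpath, permissions, fsid` and not the real line in manage-nfs-volumes-daemon.

```python
# Hypothetical values; the real script's data is not shown in the log.
volpath = "/srv/testlabs/home"
permissions = "rw,no_subtree_check"
fsid = 42

# Assumed format with only two placeholders (illustration only).
EXPORT_FORMAT = "%s *(%s)"


def format_export_buggy():
    # Three values for two placeholders: raises TypeError.
    return EXPORT_FORMAT % (volpath, permissions, fsid)


def format_export_fixed():
    # The number of values matches the number of placeholders.
    return EXPORT_FORMAT % (volpath, permissions)


try:
    format_export_buggy()
    buggy_error = None
except TypeError as err:
    buggy_error = str(err)

fixed_line = format_export_fixed()
```

Either dropping the extra value or adding a matching placeholder resolves the error; the patch title suggests the former was chosen.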