[00:02:07] Aha [00:02:16] I need to make sure I write this down somewhere ;) [00:02:27] 1095 /build/buildd/php5-5.4.9/ext/exif/exif.c: No such file or directory. [00:02:32] ignore that [00:02:51] So it's exif_read_data() [00:03:54] It's there in 5.4.9 locally and the clusters PHP 5.3.10-1ubuntu3.6+wmf1 [00:05:43] * bd808 runs in fear from (void *)[1] and it's implications [00:07:15] Interesting, no 5.3.10 branch on github for php source [00:08:04] http://archive.ubuntu.com/ubuntu/pool/main/p/php5/php5_5.3.10.orig.tar.gz [00:08:24] http://archive.ubuntu.com/ubuntu/pool/main/p/php5/php5_5.3.10-1ubuntu3.8.diff.gz may have patches that change those line too [00:09:23] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [00:11:23] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:23] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:23] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:23] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:23] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:23] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:24] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:24] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:25] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:25] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:26] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:26] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:27] PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:27] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:28] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:28] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:28] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:29] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:58] Reedy: do you have an image I can also test with? [00:12:30] yeah [00:12:41] https://noc.wikimedia.org/~reedy/segfault.tar.gz [00:13:58] large :) [00:14:03] hotel wifi, heh [00:14:21] Reedy: If you haven't figured this out already, you can get the full source tree via `apt-get source php5` [00:14:38] The php exif ext module code looks to be very similar between the 2 versions [00:14:42] line 1095 is identical [00:15:56] Line 1095 has moved to 1085 in master, but still the same [00:16:02] You need to back up the stack to see what's feeding garbage to that cast helper. [00:16:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=55541 [00:18:43] then, what do you run? importImages? 
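A minimal standalone sketch of the kind of test case being discussed above for the exif_read_data() crash, assuming one of the TIFFs from segfault.tar.gz has been extracted locally (the path below is a placeholder, and the call shape follows the `exif_read_data( $this->file, 0, true );` line bd808 quotes later from includes/media/Exif.php):

```php
<?php
// Minimal sketch of a standalone repro for the exif_read_data() segfault
// discussed above. The path is a placeholder for one of the TIFFs extracted
// from segfault.tar.gz; nothing MediaWiki-specific is needed.
$file = '/tmp/uploads/example.tif';

if ( !extension_loaded( 'exif' ) ) {
	die( "exif extension not loaded\n" );
}
if ( !is_readable( $file ) ) {
	die( "cannot read $file\n" );
}

echo 'PHP ', PHP_VERSION, "\n";

// Roughly the same call as MediaWiki's includes/media/Exif.php:
// all sections, returned as arrays.
$data = exif_read_data( $file, 0, true );

// Reaching this line means the interpreter survived the call.
var_dump( $data === false ? false : array_keys( $data ) );
```

If the crash reproduces outside MediaWiki, running the script under gdb (`gdb --args php test-exif.php`) should show the same php_ifd_get16u frame without going through importImages.php at all.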
[00:18:44] php_ifd_get16u (value=0xfffffffffa3fb318, motorola_intel=0) at [00:19:04] php maintenance/importImages.php --comment-ext=txt --user=Reedy /tmp/uploads --overwrite [00:19:17] extract the tar.gz [00:20:15] in prod? [00:20:46] in prod? [00:20:55] in prod? [00:20:57] are you running this in production? [00:20:58] :) [00:21:07] on terbium [00:21:08] sudo -u apache mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=LSHuploadBot /media/external/keepers/tmp [00:23:07] easier to do it locally with root and can grab files from the interwebs at large [00:24:13] not for me :P [00:24:29] I don't have a working mediawiki locally [00:24:31] apt-get install mediawiki [00:28:33] I wonder if you can narrow it down and call php_ifd_get16u [00:30:08] it's not php_ifd_get16u that is buggy [00:30:14] it's what it's being passed to it [00:30:28] sure [00:30:33] I was meaning a test case [00:30:42] rather than going via mediawiki, php, hell and back [00:30:56] (gdb) zbacktrace [00:30:57] [0xf7edbfb0] exif_read_data() /usr/local/apache/common-local/php-1.22wmf20/includes/media/Exif.php:302 [00:30:59] [0xf7edba88] __construct() /usr/local/apache/common-local/php-1.22wmf20/includes/media/BitmapMetadataHandler.php:268 [00:31:02] [0xf7eda120] Tiff() /usr/local/apache/common-local/php-1.22wmf20/extensions/PagedTiffHandler/PagedTiffHandler.image.php:174 [00:31:09] so, yeah, you can just isolate stuff from exif.php [00:31:53] Interesting, I don't see PagedTiffHandler on mine [00:32:03] that's terbium [00:32:15] yeah [00:32:23] I thought I had it installed, apparently not [00:33:50] the rest of the stack seems very wrong too [00:34:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [00:34:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 12.255 second response time [00:49:22] Reedy: I can't recreate the crash in my vagrant vm. importImages brings it in just fine. :( [00:50:04] $ php -v [00:50:05] PHP 5.3.10-1ubuntu3.7 with Suhosin-Patch (cli) (built: Jul 15 2013 18:05:44) [00:51:19] yay, computers [00:52:09] Somewhat confused why it works fine under valgrind locally too [00:54:03] Pointer cleanliness in php extensions is a black art. There are ton of heisenbugs that disappear under valgrind/gdb. [00:54:29] I've chased a lot of them over the years with very few sucesses. [00:55:24] I even accidentally became the maintainer of a pecl extension because of one. :) [00:55:45] * bd808 should find someone to pawn that off on. [00:58:39] At least it's very much not a MediaWiki bug [00:58:48] Though, I guess needs upstreaming [00:59:05] Can we upload > 100 MB attachments to bugs.php.net? :D [01:00:00] If you can't you should file a bug about it [01:01:08] Or just stuff it in a github public repo with a test script to reproduce. [01:02:42] I think `$data = exif_read_data( $this->file, 0, true );` in your minimum test case if I understand the trace Faidon gave. [01:02:59] * bd808 just got called to dinner [01:06:31] As it's currently 2am and I have no desire to start logging PHP bugs... 
I'll make a note of it and deal with it post sleep [01:49:23] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [01:59:53] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [02:06:54] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [02:18:06] !log LocalisationUpdate completed (1.22wmf19) at Thu Oct 10 02:18:06 UTC 2013 [02:18:23] Logged the message, Master [02:28:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 10 02:28:22 UTC 2013 [02:28:35] Logged the message, Master [03:26:14] (03CR) 10Springle: [C: 031] Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 (owner: 10Andrew Bogott) [04:08:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [04:08:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.957 second response time [04:16:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [04:19:26] !log upgrading db1007 to precise + mariadb [04:19:40] Logged the message, Master [04:19:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 11.823 second response time [04:40:13] (03PS1) 10Springle: db1007 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/88922 [04:41:16] (03CR) 10Springle: [C: 032] db1007 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/88922 (owner: 10Springle) [04:45:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [04:49:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.067 second response time [04:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [04:55:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 20.974 second response time [05:02:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:03:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.606 second response time [05:08:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:09:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.778 second response time [05:16:36] !log start xtrabackup clone db1039 to db1007 [05:16:48] Logged the message, Master [05:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [05:55:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 20.624 second response time [06:01:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [06:02:25] (03PS1) 10Springle: db1044 install mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/88928 [06:04:10] (03CR) 10Springle: [C: 032] db1044 install mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/88928 (owner: 10Springle) [06:08:43] RECOVERY - Puppet freshness on ms-be8 is OK: puppet ran at Thu Oct 10 06:08:39 UTC 2013 
[06:10:53] mark: (neon) Oct 10 06:06:27 neon puppet-agent[32114]: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass ip_address to Monitor_service_lvs_http[wikibooks-lb.eqiad.wikimedia.org] at /etc/puppet/manifests/lvs.pp:984 on node neon.wikimedia.org [06:12:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.400 second response time [06:14:53] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Thu Oct 10 06:14:50 UTC 2013 [06:25:05] (03PS1) 10Springle: db1045 to s5 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88929 [06:29:08] (03CR) 10Springle: [C: 032] db1045 to s5 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88929 (owner: 10Springle) [06:33:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [06:35:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.275 second response time [06:48:06] (03PS1) 10Springle: warm up db1039 in s7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88932 [06:48:43] (03CR) 10Springle: [C: 032] warm up db1039 in s7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88932 (owner: 10Springle) [06:50:26] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1039 in s7' [06:50:39] Logged the message, Master [06:59:22] (03PS1) 10Ryan Lane: Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 [06:59:26] ori-l: ^^ [06:59:49] (03CR) 10jenkins-bot: [V: 04-1] Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 (owner: 10Ryan Lane) [06:59:50] that's one step towards being able to split the config of repos into individual parts [07:00:26] ooo i'll review, i'm in a puppet mindset [07:00:53] it's a relatively large change, sorry about that [07:01:10] hard to change the entire config hash and pillar structure without it being large [07:01:39] (03PS1) 10Ori.livneh: Log /proc/diskstats metrics to Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/88935 [07:03:02] (03PS2) 10Ryan Lane: Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 [07:03:27] (03CR) 10jenkins-bot: [V: 04-1] Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 (owner: 10Ryan Lane) [07:04:18] it looks cleaner, definitely [07:04:50] yep. it's nice to actually be able to work on this when I'm not totally rushed :) [07:06:07] (03PS3) 10Ryan Lane: Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 [07:06:32] (03CR) 10jenkins-bot: [V: 04-1] Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 (owner: 10Ryan Lane) [07:06:35] -_- [07:11:53] rawr. I have no fucking clue what's broken in this file [07:13:37] ah [07:14:13] (03PS4) 10Ryan Lane: Change deploy repo config to repo => config [operations/puppet] - 10https://gerrit.wikimedia.org/r/88934 [07:16:41] jenkins is happy [07:17:15] yep. I'm going to do some testing in the sartoris project [07:26:09] ok, scratch that, i can't review it this late, it's too big for my head :P [07:26:18] i need to read up more about sartoris i think [07:47:15] ori-l: no worries. 
I'm going to do testing in the labs project tomorrow anyway, I may have patchsets to follow [07:58:59] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [08:13:38] (03PS1) 10ArielGlenn: db1039 (s7) back to normal weight [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88937 [08:14:18] (03PS2) 10ArielGlenn: db1039 (s7) to normal weight [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88937 [08:14:51] (03CR) 10ArielGlenn: [C: 032] db1039 (s7) to normal weight [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88937 (owner: 10ArielGlenn) [08:16:29] !log ariel synchronized wmf-config/db-eqiad.php 'db1039 (s7) to normal weight in pool' [08:16:45] Logged the message, Master [08:18:59] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [08:20:59] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 hours [08:20:59] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:59] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:59] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:30] (03CR) 10Hashar: "pep8 errors in files/ganglia/plugin are ignored. There is a .pep8 there and Jenkins run pep8 on a per directory basis :-]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88935 (owner: 10Ori.livneh) [08:24:18] (03CR) 10Hashar: "Ori proposed another diskstat plugin in https://gerrit.wikimedia.org/r/#/c/88935/ . So I guess we have to pick one :] Follow up on the ot" [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:25:49] (03PS1) 10ArielGlenn: depool db1024 (s7) for upgrade/conversion to mariadb [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88939 [08:26:37] (03CR) 10ArielGlenn: [C: 032] depool db1024 (s7) for upgrade/conversion to mariadb [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88939 (owner: 10ArielGlenn) [08:27:51] !log ariel synchronized wmf-config/db-eqiad.php 'depool db1024 (s7) for conversion to mariadb' [08:28:02] Logged the message, Master [08:28:31] (03Abandoned) 10Ori.livneh: Log /proc/diskstats metrics to Ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/88935 (owner: 10Ori.livneh) [08:29:42] (03CR) 10Ori.livneh: "Ah, no, we should go with this patch; I forgot that it existed somehow even though I just looked at it the other day. I abandoned mine, le" [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:36:00] (03CR) 10Hashar: [C: 031] "The reporting system is nice. 
We could probably do something similar for the SNMP trap that is used to monitor whether puppet is running " [operations/puppet] - 10https://gerrit.wikimedia.org/r/88888 (owner: 10Ori.livneh) [09:06:11] (03PS1) 10ArielGlenn: db1024 -> file_per_table, mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/88941 [09:07:23] (03CR) 10ArielGlenn: [C: 032] db1024 -> file_per_table, mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/88941 (owner: 10ArielGlenn) [10:03:09] PROBLEM - Disk space on cp1061 is CRITICAL: DISK CRITICAL - free space: /srv/sdb3 12357 MB (3% inode=99%): [10:05:37] (03PS1) 10ArielGlenn: get rid of last references to search1-2, searchidx1 in dsh groups (decommed) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88950 [10:06:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:06:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.646 second response time [10:06:59] (03CR) 10ArielGlenn: [C: 032] get rid of last references to search1-2, searchidx1 in dsh groups (decommed) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88950 (owner: 10ArielGlenn) [10:09:59] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [10:11:59] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [10:11:59] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [10:11:59] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [10:11:59] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [10:11:59] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [10:11:59] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:00] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:00] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:00] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:01] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:01] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:02] PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:02] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:03] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:03] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:04] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:05] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [10:12:05] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [10:26:44] (03PS1) 10ArielGlenn: remove search1-12 and searchidx1 from dns, decommed (see rt 2897) [operations/dns] - 10https://gerrit.wikimedia.org/r/88954 [10:44:17] (03PS1) 10ArielGlenn: current dhcpd.conf requires linux-host-entries.ttyS1-9600, add one [operations/puppet] - 
10https://gerrit.wikimedia.org/r/88956 [11:24:19] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:19] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [11:33:06] (03PS1) 10Mark Bergsma: Add ulsfo BGP peer addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/88959 [11:34:26] (03CR) 10Mark Bergsma: [C: 032] Add ulsfo BGP peer addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/88959 (owner: 10Mark Bergsma) [11:43:07] mark: what else do we need to do to have ulsfo ready to receive traffic ? [11:43:30] i'm working on pybal BGP peering with the routers [11:43:36] having the other router configured would also be good [11:43:45] then we should test, the ssl part especially [11:43:49] and we should be ready [11:43:58] current plan is to put traffic on it early next week [11:44:45] how can I help ? [11:48:36] you can help testing later :) [11:48:59] ok ... ping me then :-) [11:52:04] !log upgraded php5 packages from php5_5.3.10-1ubuntu3.6+wmf1 to php5_5.3.10-1ubuntu3.8+wmf1 on apt.wikimedia.org [11:52:19] Logged the message, Master [11:53:38] (03PS1) 10Mark Bergsma: Fix order [operations/dns] - 10https://gerrit.wikimedia.org/r/88962 [11:54:05] (03CR) 10Mark Bergsma: [C: 032] Fix order [operations/dns] - 10https://gerrit.wikimedia.org/r/88962 (owner: 10Mark Bergsma) [12:01:56] interesting [12:02:04] cr1-ulsfo dropped off the net as soon as I restarted PyBal ;) [12:03:41] er [12:03:47] and the serial console server is segfaulting :) [12:04:26] !log Rebooting scs-ulsfo, pmshell is segfaulting [12:04:40] Logged the message, Master [12:06:56] didn't fix it [12:15:32] heh ok [13:28:16] (03PS1) 10Dzahn: fix broken links to wikis in stats tables [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/88980 [13:29:35] (03PS2) 10Dzahn: fix broken links to wikis in stats tables [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/88980 [13:30:39] (03CR) 10Dzahn: [C: 032] fix broken links to wikis in stats tables [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/88980 (owner: 10Dzahn) [13:36:54] (03CR) 10Dzahn: "(2 comments)" [operations/dns] - 10https://gerrit.wikimedia.org/r/88954 (owner: 10ArielGlenn) [13:37:41] ugh, no I just can't read [13:40:29] (03PS2) 10ArielGlenn: remove search1-12 and searchidx1 from dns, decommed (see rt 2897) [operations/dns] - 10https://gerrit.wikimedia.org/r/88954 [13:55:46] (03CR) 10Dzahn: [C: 031] remove search1-12 and searchidx1 from dns, decommed (see rt 2897) [operations/dns] - 10https://gerrit.wikimedia.org/r/88954 (owner: 10ArielGlenn) [14:01:11] !g I13082597cd921966a7fae0d5c67ff4359d032dda [14:01:11] https://gerrit.wikimedia.org/r/#q,I13082597cd921966a7fae0d5c67ff4359d032dda,n,z [14:14:40] (03CR) 10Akosiaris: [C: 032] "Good work. I just ran catalog compile tests for all db* (+ various others) hosts and with no errors. We should probably however revisit th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 (owner: 10Andrew Bogott) [14:24:45] !log Setup cr1-ulsfo:ae0 <--> cr2-ulsfo:ae0 [14:24:51] (03CR) 10Akosiaris: [C: 032] contint: fetch slave scripts on slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/87058 (owner: 10Hashar) [14:25:00] !log Setup OSPF and OSPF3 on cr1-ulsfo:ae0.2 <--> cr2-ulsfo:ae0.2 [14:25:00] Logged the message, Master [14:25:12] Logged the message, Master [14:32:36] akosiaris: hey :-] regarding the jenkins slave scripts being published on slaves. 
I got to use git::clone latest [14:32:59] akosiaris: but will eventually migrate to git-deploy whenever I found out how to use it :-] [14:33:16] akosiaris: i noticed Ryan Lane send a patch to tweak the git-deploy manifest and make it easier to add a new project. [14:33:19] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:05] yes he did... but I am not sure git-deploy is supposed to be run automated [14:34:10] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [14:34:50] which is what you want here .. right ? [14:35:39] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:19] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [14:37:19] (03Abandoned) 10Akosiaris: move check-raid.py from base/files/monitoring/ to nrpe/plugins/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 (owner: 10Dzahn) [14:38:14] !log Configured AS65003 ulsfo BGP confederation on cr1-ulsfo and cr2-ulsfo [14:38:22] !log Setup iBGP between cr1-ulsfo and cr2-ulsfo [14:38:26] Logged the message, Master [14:38:39] Logged the message, Master [14:38:59] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [14:40:48] akosiaris: looking at spec job [14:40:52] akosiaris: err rspec [14:41:19] no jenkins expert so I assume it is fine [14:41:32] (03CR) 10Yurik: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [14:41:59] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.221 second response time [14:42:03] akosiaris: I am a bit concerned by the time it takes for them to run [14:42:28] !jenkins operations-puppet-spec [14:42:45] !jenkins is https://integration.wikimedia.org/ci/job/$1 [14:42:46] Key was added [14:42:49] !jenkins operations-puppet-spec [14:42:50] https://integration.wikimedia.org/ci/job/operations-puppet-spec [14:43:35] i see 2 secs... [14:43:54] yeah need to tweak it [14:43:55] https://integration.wikimedia.org/ci/job/operations-puppet-spec/3/console [14:43:57] grmblbl [14:44:53] huh... what a nice java exception... [14:45:04] yeah the git plugin attempts to rewrite the submodule urls [14:45:40] !bug 42953 [14:45:40] https://bugzilla.wikimedia.org/42953 [14:46:12] ah... yeah i remember that.... [14:46:47] !wikitech is https://wikitech.wikimedia.org/w/index.php?search=$1 [14:46:47] This key already exist - remove it, if you want to change it [14:47:07] akosiaris: yeah we got it by that already [14:47:08] solved with I2dc0ad5fcb51d7720475eae70f91466a78a0fe2e [14:47:18] !wikitech [14:47:18] http://wikitech.wikimedia.org/view/$1 [14:47:22] working now https://integration.wikimedia.org/ci/job/operations-puppet-spec/4/console [14:48:13] !icinga is https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=$1 [14:48:13] Key was added [14:48:18] !icinga carbon [14:48:18] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=carbon [14:50:32] akosiaris: the rspec only takes 25 seconds ! [14:51:17] hashar: yes... because it seems to stop at the first module [14:51:20] stdblib [14:51:25] stdlib* [14:51:40] i never saw the other 3 modules tests being run [14:52:00] and again.. what 25 secs ? i only see 2 secs at the output .... [14:52:18] https://integration.wikimedia.org/ci/job/operations-puppet-spec/4/ [14:52:29] on the top right, it shows the duration of the build [14:53:28] a ok ... 
i had not click on full log [14:54:34] 14:47:05 Invoking tests on module bacula [14:54:34] 14:47:05 rake aborted! [14:54:34] 14:47:05 Don't know how to build task 'spec_standalone' [14:54:47] so... i should add a rake target 'spec_standalone?' [14:54:56] spec_standalone that is [14:55:37] if you look at the rake file at the root of the repo [14:55:41] it finds modules [14:55:45] and system('rake spec_standalone') [14:55:51] I am not sure where that commands come from [14:56:01] require 'puppetlabs_spec_helper/rake_tasks' [14:56:04] maybe here ... [14:56:16] from modules/apache/Rakefile [14:56:32] ah from rspec-puppet when doing rspec-puppet-init I think [14:56:51] the modules Rakefile have: [14:56:55] require 'rubygems' [14:56:56] require 'puppetlabs_spec_helper/rake_tasks' [14:57:08] I guess the last require provide the spec_standalone [14:57:11] yes it is in the rake_tasks.rb [14:57:21] hmmm [14:57:23] (03PS1) 10coren: Tool Labs: puppetize the new webnode type instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/88996 [14:57:32] akosiaris: http://paste.openstack.org/show/48217/ [14:57:34] for stdlib [14:58:01] I can't remember the exact details, got hacked up with andrew during the summer [14:58:14] andrewbogott: puppet rspec talking [14:58:36] that is a different target from what rspec-puppet-init creates... [14:58:39] I think that's all on hold pending a proper jail to run the tests in [14:58:58] But, I will read the backscroll in a minute [14:59:34] yeah jailing [14:59:40] but we could run them for trusted users [15:00:59] !log Corrected wrong interface addresses on cr1-ulsfo:ae0 sub-units [15:01:12] Logged the message, Master [15:02:21] (03PS2) 10coren: Tool Labs: puppetize the new webnode type instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/88996 [15:02:51] ok, I've read the backscroll but still don't know what we're discussing :) [15:03:24] (03CR) 10coren: [C: 032] Tool Labs: puppetize the new webnode type instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/88996 (owner: 10coren) [15:04:11] we were thinking about adding rspec tests... so i asked what remains to be done [15:04:23] one is the jailing... hashar solved some other issues [15:04:42] with git submodules and now i am looking at the rake targets [15:04:47] !log removed mysql-server-5.5 package from stat1 to ensure that it doesn't get used now that it's no longer maintained by puppet. [15:04:58] Logged the message, Master [15:05:06] rspec-puppet-init creates a Rakefile that does not have rspec_standalone [15:05:42] and, who calls rspec-puppet-init? [15:05:51] IIRC we were just doing 'rake spec' in the top-level puppet dir? [15:05:59] yup [15:06:10] a user that populates a module for the very first time [15:06:12] that recurse in submodules and calls 'rake spec_standalone' whenever a rake file is found there [15:06:18] that being me in this case :-) [15:06:28] the modules coming from puppet labs have that task defined [15:06:41] Oh… I see. [15:06:53] they apparently use something else that rspec-puppet [15:07:06] or have puppet labs has its own wrapper on top of it [15:07:14] So maybe that should just be handled with a 'howto' guide to create the rakefile by hand? [15:07:26] I guess so [15:07:35] I am not sure whether we have any documentation written yet though :( [15:07:44] I am pretty sure I haven't written any [15:07:48] I'm sure we don't! 
[15:08:00] good [15:08:01] (03PS1) 10Odder: (bug 54828) Enable FlaggedRevs for Portuguese Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89001 [15:08:06] that is part of our culture (see bug 1) [15:08:15] oh [15:08:16] no [15:08:17] wait [15:08:18] https://wikitech.wikimedia.org/wiki/Puppet_coding#Rake_tests [15:08:20] \O/ [15:08:46] Hey, nonzero documentation! [15:08:48] the dream is alive [15:09:13] * hashar blames andrewbogott https://wikitech.wikimedia.org/w/index.php?title=Puppet_coding&diff=77308&oldid=76155 [15:09:40] * andrewbogott wouldn't call himself a hero [15:09:42] we will have to split that [[Puppet coding]] in smaller part one day [15:10:39] andrewbogott: I disagree. Every single line of code earn you a "I wrote doc" badge, and after enough badges you will be consider a hero even against your will [15:11:16] deploying rspec triggering [15:12:07] !log jenkins : triggering operations-puppet-rspec (non voting) on operations/puppet.git {{gerrit|89000}} [15:12:16] Logged the message, Master [15:12:51] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 [15:13:56] Coren, you're using lighttpd in tool-labs? Because I have an item on my todo list about purging all lighttpd use from puppet :( [15:14:22] hashar: So, does that mean tests will be run on commit now? [15:14:33] Or only when requested by hand? Or...? [15:15:14] (03PS2) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 [15:15:17] (03PS1) 10coren: Tool Labs: fix race condition on webnodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/89005 [15:15:23] andrewbogott: it will be run for anyone whitelisted. [15:15:24] (03PS1) 10Andrew Bogott: Fix mysql module so it can work w/out mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/89006 [15:15:33] oh, ok. That seems good. [15:15:34] andrewbogott: which is the usual volunteers, wmde and wikimedia folks. [15:15:34] andrewbogott: I'm not using any classes for it though. [15:15:47] andrewbogott: that will vote +2 regardless of spec result [15:16:03] andrewbogott: BUT untrusted volunteers will not have spec run for them. So Jenkins will only vote +1. [15:16:07] andrewbogott: will mail ops list about it [15:16:24] https://integration.wikimedia.org/ci/job/operations-puppet-spec/5/console : FAILURE in 26s (non-voting) [15:16:26] Coren, can you tell me more? Or should I just read my damn email? [15:16:28] that is faster than the validate job [15:16:47] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [15:17:02] andrewbogott: It's discussed on labs-l. Basically, I'm using lighttpd so that every tool has its own webserver; apache and ngnix are way too heavy for that. [15:17:05] mutante question about rss extension [15:17:08] (03CR) 10Andrew Bogott: "Yes, OK, maybe this module is slated for removal, but in the meantime I need it to work AT ALL so I can figure out what it does." [operations/puppet] - 10https://gerrit.wikimedia.org/r/89006 (owner: 10Andrew Bogott) [15:17:20] (03CR) 10coren: [C: 032] Tool Labs: fix race condition on webnodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/89005 (owner: 10coren) [15:17:24] is it possible to whitelist the domain but not the exact path? [15:17:56] andrewbogott: Why the seek-and-destroy? [15:18:23] drdee: i don't know yet, didn't implement the whitelist part.. 
which wiki are you on [15:18:43] Coren: OK, so all you're doing is installing the debian package? [15:19:03] I don't have a dog in the fight, but mostly whenever I mentioned lighttpd I get a strong "Don't use that, use nginx" from the room. [15:19:08] andrewbogott: Yep. [15:19:47] Coren: Might be worth running an email past the ops to see if anyone has a legit argument against it. [15:20:02] (03CR) 10Andrew Bogott: [C: 032] Fix mysql module so it can work w/out mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/89006 (owner: 10Andrew Bogott) [15:20:33] Coren, if you're not relying on lighttpd base classes though, then I really don't care at all :) [15:21:05] I do not. It's really just a matter of installing the deb, and a script to generate per-tool config file. :-) [15:21:36] e.g. https://wikitech.wikimedia.org/wiki/Puppet_Todo#lighttpd [15:23:04] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [15:23:10] (03PS3) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 [15:24:16] yeah message [15:24:17] https://integration.wikimedia.org/ci/job/operations-puppet-spec/8/console : Experimental unit tests, please ignore. in 25s (non-voting) [15:32:29] (03PS1) 10coren: Tool Labs: more fixes to the lighttpd startup [operations/puppet] - 10https://gerrit.wikimedia.org/r/89009 [15:32:58] akosiaris: andrewbogott mailed ops about the rspec job. I need to rush out to get my daughter back home [15:33:08] ok, thanks [15:33:10] :-) [15:33:11] akosiaris: andrewbogott: if there is any issue, I should be back online in a bit more than 3 hours [15:33:16] but that should be fine [15:33:18] ok, thanks [15:33:20] the job is not voting [15:33:29] and sorry to have forgotten to deploy that :( [15:33:38] (03CR) 10coren: [C: 032] Tool Labs: more fixes to the lighttpd startup [operations/puppet] - 10https://gerrit.wikimedia.org/r/89009 (owner: 10coren) [15:36:39] I am off [15:36:43] dad time [15:37:49] j #indonesia [15:46:19] !log Configured cr2-ulsfo:ae1 with all sub interfaces and VRRP [15:46:31] Logged the message, Master [15:46:33] !log Imported firewall ACLs on cr2-ulsfo, activated on lo0.0 [15:46:48] Logged the message, Master [15:49:53] eh, it seems there is a global issue with broken categories in MW [15:50:39] ru.wp " putnik> Everybody see arabic letters instead of russian." .. sv.wp "< Stryn> it's been broken now over 3 hours" [16:02:06] (03PS3) 10Andrew Bogott: Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 [16:02:50] (03CR) 10jenkins-bot: [V: 04-1] Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 (owner: 10Andrew Bogott) [16:06:12] Does anyone want to hazard a guess about what this is supposed to do? [16:06:19] if $db_cluster =~ /^fundraisingdb$/ { [16:06:19] $mysql_myisam = true [16:06:20] } [16:06:30] I'm guessing that that =~ is meant to be a comparison and not an assignment :) [16:08:29] Hm, actually that's documented as a comparison. So then why does 'parser validate' think it's an assignment? [16:14:21] andrewbogott: it's the $::db_clusters = { .. that it's barfing on, I think [16:14:44] even though 'puppet parser validate' is citing the block below it [16:14:48] You're right -- it was misreporting the line number [16:15:46] And I failed to notice that I had both $db_clusters and $db_cluster in there... 
[16:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:16:53] that is confusing, but not the issue the parser is reporting [16:16:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.733 second response time [16:17:01] it's that you're reaching out of scope with that assignment [16:17:22] (03PS4) 10Andrew Bogott: Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 [16:17:29] Right, when I was linting I added a :: qualifier to db_clusters which is a local variable because I misread it as db_cluster which is a global [16:17:56] ah, makes sense [16:19:03] (03CR) 10Andrew Bogott: "Now linted! You may henceforth address me as "Cap'n Quotemark"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 (owner: 10Andrew Bogott) [16:44:55] (03PS1) 10coren: Tool Labs: fix race condition (for real) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89014 [16:45:37] (03CR) 10coren: [C: 032] Tool Labs: fix race condition (for real) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89014 (owner: 10coren) [17:00:07] Interesting stats: webgrid-01 has 31 web servers running yet: [17:00:07] Cpu(s): 0.2%us, 0.2%sy, 0.0%ni, 99.5%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st [17:00:07] Mem: 16435580k total, 1618028k used, 14817552k free, 90136k buffers [17:00:18] * Coren grins. [17:00:26] *MUCH* more efficient use of resources. [17:02:21] (03PS5) 10Andrew Bogott: Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 [17:03:31] !log reedy synchronized php-1.22wmf21 'staging' [17:03:49] Logged the message, Master [17:06:27] !log reedy synchronized docroot and w [17:06:38] Logged the message, Master [17:09:55] (03PS1) 10Reedy: Add docroot stuff for 1.22wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89018 [17:09:56] (03PS1) 10Reedy: All wikipedias to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89019 [17:09:57] (03PS1) 10Reedy: testwiki, test2wiki, mediawikiwiki, testwikidatawiki and loginwiki to 1.22wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89020 [17:09:58] (03PS1) 10Reedy: Add phase1 dblist for laziness [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89021 [17:10:13] (03CR) 10Reedy: [C: 032] Add docroot stuff for 1.22wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89018 (owner: 10Reedy) [17:10:52] (03Merged) 10jenkins-bot: Add docroot stuff for 1.22wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89018 (owner: 10Reedy) [17:16:07] !log reedy Started syncing Wikimedia installation... : testwiki to 1.22wmf21 and build and sync l10ncache [17:16:17] Logged the message, Master [17:19:40] Reedy: thoughts re: 'static-current' symlink? [17:23:14] andrewbogott: ping ? [17:23:24] Ryan_Lane: is there a way to restart the parsoid instances through salt without pushing out new code? [17:24:14] mutante, what's up? 
[17:24:29] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:53] andrewbogott: a) made this change https://wikitech.wikimedia.org/w/index.php?title=Puppet&diff=85616&oldid=83886 b) have an issue syncing change in private repo [17:25:18] it says on stafford it's Already up-to-date when i pull [17:25:19] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.146 second response time [17:25:20] but it's not [17:25:22] Oh, that's wrong... [17:25:27] dang [17:26:00] the diagram is right but the text is wrong, fixing... [17:26:05] but .. i.. ok [17:26:08] :) [17:26:09] meanwhile, probably you can push your change and then it'll sync on stafford [17:26:19] PROBLEM - DPKG on labstore4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:26:54] ori-l: Needs manually updating [17:27:02] andrewbogott: but the diagram has /root/private and that's gone [17:27:03] in the meantime I'm sure I committed in /root/private when I was trrying to fix up my contacts/icinga issue [17:27:04] like the php symlink [17:27:19] RECOVERY - DPKG on labstore4 is OK: All packages OK [17:27:31] Reedy: right. so I'm basically asking if you would be willing to take that on as part of the new branch checklist. [17:27:40] mutante, it's on sockpuppet, not stafford [17:27:46] ori-l: It's nothing to do with the new branch checklist [17:28:08] It's something to do AFTER all wikis have moved off the old branch (what I do with php when I remember) [17:28:22] Reedy: btw, did you remember about Mobile today? [17:28:44] mutante, ok, I think the text of that page is correct now...? [17:28:48] * andrewbogott wheel-wars with self [17:28:54] remember what? [17:29:27] andrewbogott: ah, yes. the confusion was with /root/puppet which moved [17:29:48] mutante: right… the thing that was once /root/puppet now lives in gerrit. For private we have no such luxury. [17:30:01] andrewbogott: thanks, looked at your edit [17:30:08] yep [17:30:08] um mutante I think that's wrong, the /root/private is actually the real repo and [17:30:22] Sorry that I changed that page to say the exact opposite of what, even then, I knew to be the truth. Can't much explain that. [17:30:25] the post commit hook syncs it up to /var/lib/git/operations/private [17:30:33] apergos: Indeed, I just updated the page to say that. [17:30:36] I think. [17:30:40] ok thanks [17:30:41] Reedy: about how they're goign to ride the train with the new branch this time instead of doing their own deploys [17:30:57] apergos: but I've lost all credibility on this issue so I'd advise that you check my work :) [17:31:03] :-D [17:31:14] I don't look at the pictures, just the text [17:31:26] but I did just read the post commit hook so I'm sure about that [17:31:43] Hopefully the pictures and text agree with each other now [17:31:53] yep looks good (the text) [17:32:11] pics look good too [17:32:29] mutante, sorted now? [17:33:26] if i can just make a new commit and sync it, yea [17:33:57] tries [17:34:54] !log reedy Finished syncing Wikimedia installation... 
: testwiki to 1.22wmf21 and build and sync l10ncache [17:35:08] Logged the message, Master [17:36:34] (03PS1) 10Anomie: Ensure certain perl modules (and git) are installed on exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/89023 [17:37:07] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.22wmf20 [17:37:19] Logged the message, Master [17:38:05] (03PS1) 10Ori.livneh: Add 'static-current' w/symlinks to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89024 [17:38:09] (03CR) 10coren: [C: 032] "LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/89023 (owner: 10Anomie) [17:38:21] andrewbogott: looks all good, alright [17:42:59] (03CR) 10Reedy: "I guess we should write a script taking a parameter of "version" to delete and recreate all of these more dynamic symlinks" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89024 (owner: 10Ori.livneh) [17:45:22] (03CR) 10Ori.livneh: "Or alternately: http://httpd.apache.org/docs/current/rewrite/rewritemap.html" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/89024 (owner: 10Ori.livneh) [17:46:16] (03CR) 10Eranroz: [C: 031] "(Please merge)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87645 (owner: 10Jforrester) [17:48:52] (03PS3) 10Dzahn: redirect pk.wikimedia.org to meta community page [operations/apache-config] - 10https://gerrit.wikimedia.org/r/86652 [17:50:46] (03CR) 10Chad: "I added git to the base packages a long time ago. Do those not get applied here?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/89023 (owner: 10Anomie) [17:55:26] Hm… on my todo list I have Peter as in charge of modularizing puppetmaster.pp. Did someone else officially inherit that task? [17:56:58] andrewbogott: could you review https://gerrit.wikimedia.org/r/#/c/88888/ possibly? [17:57:18] * andrewbogott reads [17:59:42] woo, pluginsyn used for something [17:59:59] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [18:02:47] ori-l: Are you confident that ::puppet_config_dir is defined on our hosts? Seems reasonable but it doesn't look like we've used it before [18:03:05] Oh, nm, you have a fact that defines it! [18:03:58] Ryan_Lane: NFS switch in progress. [18:04:11] Man, there are a LOT more projects now using NFS than I thought! [18:04:21] heh [18:06:08] * Coren watches in despair as even a 'catch up' rsync doesn't seem like it's about to finish any time soon. [18:07:11] ori-l, this looks good, but do you mind talking me through it a bit? I take it that turning on 'reports=true' causes anything in lib/puppets/reports to be executed at the end of each run? [18:08:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [18:08:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.980 second response time [18:10:27] andrewbogott: 'report=true' turns on reporting generally, and 'reports' specifies the report handler to use [18:10:50] andrewbogott: in statsdb.rb, line 7: Puppet::Reports.register_report(:statsd) do [18:10:59] Ryan_Lane: i have added he.wiki to the HTTPS stuff for annons. Thanks for the efforts [18:11:04] this registers the code below it as the 'statsd' reporter [18:11:12] matanya: saw :) thanks [18:11:49] ori-l: So when does that code get executed to perform the registration? Does that happen during the facts phase? 
[18:12:01] ori-l: Also, how/when is self.metrics populated? [18:13:08] andrewbogott: re: self.metrics, that's part of what puppet provides reporters [18:13:31] the code to execute the registration happens during initial bootstrapping, yes. i haven't seen a race condition if that's what you mean (where the referenced reporter doesn't exist yet) [18:13:34] Is it weird that it's part of self. rather than passed in as an arg? [18:13:48] It's done by inheritance somehow? [18:14:07] that's how james turnbull does it in all the example reporters cited in the official docs [18:14:28] Ah, I'm sure that it works, since you've tested :) I'm just curious. [18:14:35] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.22wmf20 [18:15:16] ori-l, do you want this merged right away, or do you want me to do it when you have a block of time to watch it roll out? [18:15:29] block o' time! [18:15:43] is "i want you to have a block of time right away" an option? :D [18:15:53] just kidding, yes, it's not urgent [18:16:19] I can do it today, ping me when you get back from lunch. [18:16:33] * andrewbogott just reached a stopping point anyway [18:16:40] ok, will do [18:19:59] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [18:20:59] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [18:21:59] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 hours [18:21:59] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [18:23:59] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [18:23:59] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [18:30:03] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: phase1 wikis to 1.22wmf21 [18:31:19] ick [18:32:24] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: phase1 wikis back to 1.22wmf20 [18:32:41] lols [18:32:47] on https://www.mediawiki.org/wiki/Annoying_little_bugs , i got: "This version of MobileFrontend requires MediaWiki 1.22, you have 1.22wmf21. You can download a more appropriate version from https://www.mediawiki.org/wiki/Special:ExtensionDistributor/MobileFrontend" [18:32:49] known? [18:33:22] Already reverted [18:34:00] ah thanks [18:35:52] <_david_> ^d, Thanks! [18:37:57] sorry Reedy :( [18:41:21] !log reedy synchronized php-1.22wmf21/extensions/ [18:42:52] ugh, sad [18:43:08] of course the mobile riding the train wouldn't go smoothly the first time, oh well, kinks to be worked out [18:47:19] !log reedy synchronized php-1.22wmf20/extensions/MobileFrontend [18:47:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: phase1 wikis back to 1.22wmf21 [18:50:36] !log reedy synchronized wmf-config/ [18:51:41] enwiki is always the last to be updated, right? [18:51:48] sort of [18:52:23] is there a canonical Last Wiki? [18:52:37] !log reedy synchronized php-1.22wmf21/extensions/MobileFrontend/MobileFrontend.php [18:53:04] Nope [18:53:21] Reedy: looksl ike that worked [18:53:22] All wikipedias are done at the same time (well, bar closed oneS) [18:54:36] ori-l: Don't bother [18:54:42] Just use /usr/local/apache/common-local/php/resources [18:54:43] etc [18:55:01] 1 place to update [18:55:25] duh, yes, that's better. 
[18:58:01] !log reedy synchronized wmf-config/ [18:59:50] * aude sighs [19:00:01] Reedy: if you can update localisation, that would be nice [19:00:11] * aude sees <wikibase-sitelinks-sitename-columnheading-special> on test wikidata [19:00:20] not urgent though [19:02:53] Reedy: https://gerrit.wikimedia.org/r/#/c/89024/ ? [19:25:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [19:25:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 20.147 second response time [19:35:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [19:35:14] !log LocalisationUpdate completed (1.22wmf20) at Thu Oct 10 19:35:14 UTC 2013 [19:36:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.533 second response time [19:46:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [19:48:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 14.479 second response time [19:51:56] ah manybubbles, i have a 1:1 with ken now [19:52:00] looks like i've been double booked [19:52:04] or, in 10 minutes [19:52:34] ottomata: k. I don't have much other than trying to get that rt ticket moving in some way [19:52:54] k [19:58:33] ottomata: we can move if need be [20:04:05] ack [20:04:09] hey it hink its ok [20:07:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:07:11] !log LocalisationUpdate completed (1.22wmf21) at Thu Oct 10 20:07:11 UTC 2013 [20:07:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 17.826 second response time [20:10:59] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [20:12:59] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:59] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:59] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:59] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:59] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:59] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:00] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:01] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:01] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:01] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:01] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:02] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:03] PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:03] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:04] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours 
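The MobileFrontend error above ("requires MediaWiki 1.22, you have 1.22wmf21") looks like a version-string comparison tripping over the wmf branch suffix. A quick, non-authoritative way to see how PHP orders those two strings — the '1.22' minimum is taken from the error text, not from MobileFrontend's actual check, which is not shown in the log — is a one-off script like this:

```php
<?php
// See how version_compare() orders a wmf branch version against a plain
// release number. '1.22' comes from the error message above; the exact
// constraint MobileFrontend tests against is an assumption here.
$have = '1.22wmf21';
$need = '1.22';

var_dump( version_compare( $have, $need ) );        // -1, 0 or 1
var_dump( version_compare( $have, $need, '>=' ) );  // the condition the error implies is failing
```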
[20:13:04] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:04] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:05] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:59] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: CRIT replication delay 327 seconds [20:15:19] PROBLEM - MySQL Slave Delay on db45 is CRITICAL: CRIT replication delay 346 seconds [20:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:16:37] andrewbogott: I'm around, whenever you want to go for it [20:16:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.269 second response time [20:17:51] ori-l, OK, I'll merge right now. [20:17:59] tungsten is already up and listening? [20:18:14] yep [20:18:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 10 20:18:49 UTC 2013 [20:25:33] andrewbogott: did you merge it? [20:25:55] I did. There's an order-of-operations thing which may or may not be an issue, watching that now... [20:27:25] manybubbles: when searching for 'key metrics' on mediawiki i get "[f3c100ca] 2013-10-10 20:26:56: Fatal exception of type MWException" [20:27:39] drdee_: well that isn't good [20:27:49] but searching 'foo' is fine [20:27:59] manybubbles / drdee_: 2013-10-10 20:26:56 mw1076 mediawikiwiki: [f3c100ca] /w/index.php?search=key+metrics&button=&title=Special%3ASearch Exception from line 187 of /usr/local/apache/common-local/php-1.22wmf21/extensions/Translate/TranslateHooks.php: A reached the parser. This should not happen [20:28:21] thanks ori-l [20:28:27] you are blazing fast [20:28:53] ori-l: does that come with a stacktrace of some sort? [20:30:49] yes, it's hilariously long because it logs parameters passed to the failing function, and in this case it includes the page source [20:31:10] i'll pm you the link, sec [20:31:21] nice [20:33:49] hah, page source [20:38:18] ori-l could you pm me the link as well (just to satisfy my curiosity) [20:39:35] drdee_: sure. I went over it and it doesn't contain private data, so I can just paste it here: https://dpaste.de/uD2J/raw/ [20:39:41] k [20:40:08] Networking question: is it possible to send UPD packets directly to hooft.esams.wikimedia.org from terbium.eqiad.wmnet? [20:40:33] Ping doesn't work but that didn't surprise me too much [20:40:43] s/UPD/UDP/ [20:41:27] probably, but I'm not sure. I'd ask LeslieCarr or mark [20:41:57] LeslieCarr: Do you have a couple minutes to chat about UDP from eqiad to esams? [20:43:05] LeslieCarr: This is follow up to questions I posed a week or so ago re: sending HTCP purges to esams without purging eqiad as well [20:46:54] ori-l: did you get a report from magnesium? [20:48:00] andrewbogott: not yet, but it takes a bit for statsd to flush to graphite [20:48:35] ok. Most hosts will take two puppet runs before they send stats, but I forced multiple runs on magnesium. [20:48:37] bd808: yes, it's possible [20:48:41] i tested [20:48:53] ori-l: sweet [20:48:54] (and on stafford, but stafford has other issues.) 
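Regarding bd808's question above about sending UDP from terbium.eqiad.wmnet to hooft.esams.wikimedia.org: a throwaway reachability check could look like the sketch below. Port 4827 (the standard HTCP port) and the payload are illustrative assumptions; this is not a well-formed HTCP packet, so arrival is best confirmed with tcpdump on the receiving side rather than by expecting a reply.

```php
<?php
// Fire a single raw UDP datagram at the far end to confirm reachability.
// Not a real HTCP purge packet; the port and payload are illustrative.
$target = 'udp://hooft.esams.wikimedia.org:4827';

$sock = stream_socket_client( $target, $errno, $errstr );
if ( $sock === false ) {
	die( "could not create socket: $errstr ($errno)\n" );
}

$payload = 'udp-test-' . time();
$sent = fwrite( $sock, $payload );

// UDP is fire-and-forget: a successful write only means the datagram left
// this host, not that it arrived in esams.
echo 'wrote ', var_export( $sent, true ), " bytes to $target\n";
fclose( $sock );
```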
[20:49:27] andrewbogott: ok, i'm keeping a lookout :) [20:49:32] i'll let you know as soon as i see something [20:49:41] ori-l: I guess I could have done that as well :) [20:49:43] or as soon as you don't :) [20:50:21] I haven't logged into a host in esams yet; I'll put that on my todo soon list [20:50:46] i hear amsterdam is lovely this time of year [20:51:29] ori-l: I'd love to go sometime. Trappist beer tour is on my bucket list [20:53:00] You probably can't actually log onto any of the hosts there.. [20:53:48] Reedy: is there a way I can see varnishncsa output from there? [20:55:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:55:34] No idea, sorry [20:55:51] Reedy: no worries. [20:56:13] this page doesn't render: https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/July [20:56:35] so, technically (but only in a very technical sense) this isn't a CirrusSearch bug [20:56:43] still, I found something CirrusSearch was doing wrong [20:58:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.652 second response time [20:59:47] anyone else want to have a look at the stack trace for f109662a [21:00:03] CirrusSearch is trying to render a page and it blows up. [21:00:16] when I manually go to the page, it blows up too: 3005745c [21:00:21] sadness [21:03:01] Cheer up manybubbles, you found a bug! There are many other bugs, but this one is yours. [21:03:30] bd808: thanks! I'm wondering if I should pawn it off on Nikerabbit, though [21:04:04] Giving found bugs to others is one of the greatest joys both parties can experience. [21:04:30] * drdee_ smiles [21:05:12] is this the cirrussearch-translate bug again? [21:05:32] or just obligatory mention of the relevant quip I've found a bug I've found hundreds... I usually report them to bugzilla.wikimedia.org [21:07:07] * bd808 stops channeling a drill sergeant on acid [21:08:52] Nemo_bis: this time I think CirrusSearch "discovered" a translate bug [21:09:19] Nemo_bis: cirrussearch has a bug that drdee_ found (it renders all pages in the search results) [21:09:48] Nemo_bis: when it tries to render a particular page it blows up. but that page blows up when you go right to it. [21:11:16] manybubbles: I think it's just [[Flow]] messed up, as Special:ExpandTemplates show [21:11:21] will sort it out after coffee [21:11:31] andrewbogott: https://gerrit.wikimedia.org/r/#/c/89112/ [21:12:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:12:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.382 second response time [21:13:35] ori-l, abstract ruby-question… a function just returns whatever the last line evaluates to? 
[21:13:47] yes [21:14:00] It might take a while for me to not hate that [21:14:28] andrewbogott: i usually dislike implicit, tricky stuff like that too, but most style guides recommend it: https://github.com/bbatsov/ruby-style-guide [21:14:33] so i do it in the interest of being idiomatic [21:14:45] manybubbles, bd808, fixed https://www.mediawiki.org/w/index.php?title=Echo_(Notifications)&diff=798854&oldid=796644 [21:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:17:43] hey, I just saw that [21:18:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.689 second response time [21:18:18] I'm surprised you can do stuff in the wiki that just makes it crash like that [21:19:01] manybubbles: you need to read more of the php code :) [21:19:24] parser functions are like that [21:19:24] bd808: doesn't most of it have a try { } catch (EVERYTHING) around it? [21:19:33] scary stuff [21:20:00] we don't even know how comes the parser doesn't explode normally, let alone broken usages [21:20:22] I have seen very few catch blocks in the corners I've visited so far [21:21:44] when a page explodes in your face always remember to look at the source with Special:expandTemplates and the like ;) [21:22:20] then you only need to know the right syntax for all the few dozens locally installed extensions' tags :D [21:25:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:28:42] !log stopping StatsD service on tungsten for a couple of minutes to debug [21:28:59] RECOVERY - MySQL Replication Heartbeat on db45 is OK: OK replication delay 0 seconds [21:29:19] RECOVERY - MySQL Slave Delay on db45 is OK: OK replication delay 0 seconds [21:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 15.644 second response time [21:33:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:41:57] ori-l: morebots AWOL [21:42:04] labs is down, apparently [21:42:09] AGAIN [21:42:20] why would morebots be on labs btw? [21:42:30] andrewbogott: sigh. one more. https://gerrit.wikimedia.org/r/#/c/89119/ [21:43:17] Nemo_bis: it was on rackspace before (same as wikitech-static) but some idle connection killer was causing it to hang [21:43:48] ori-l: what's the bug number to bring it to production again? [21:44:39] what do you mean by 'bring it to production'? [21:49:57] if you mean migrate it to the production cluster, there is no such plan afaik [21:50:10] if you mean the 'morebots is out again' bug, i don't remember the number [21:52:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 20.989 second response time [21:56:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:56:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 22.156 second response time [22:03:20] TimStarling: The upgrade to 5.3.10-1ubuntu3.6+wmf1 today has broken category collation. [22:04:01] different ICU? [22:04:34] ii libicu42 4.2.1-3ubuntu0.10.04.1 International Components for Unicode [22:04:34] ii libicu48 4.8.1.1-3 International Components for Unicode [22:04:37] let me find it in the backscroll [22:04:59] Had paravoid backported changes to php5-intl to use libicu48? 
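Not the actual CirrusSearch or Translate code, but a sketch of the guard being talked about here: wrap the render done for indexing in a try/catch so that one unparseable page (broken parser-function or extension-tag usage) gets logged and skipped instead of taking the whole job down. The function name and log group are made up for illustration:

    // Hypothetical helper, not CirrusSearch's real code path.
    function renderForIndexing( WikiPage $page, ParserOptions $opts ) {
        try {
            $out = $page->getParserOutput( $opts );
            return $out ? $out->getText() : null;
        } catch ( MWException $e ) {
            // Skip and log; the page itself still needs fixing by hand, and
            // Special:ExpandTemplates is the tool for finding the bad tag.
            wfDebugLog( 'search-index', 'Unrenderable page ' .
                $page->getTitle()->getPrefixedText() . ': ' . $e->getMessage() );
            return null;
        }
    }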
[22:05:22] bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=55565 [22:05:56] http://paste.debian.net/plain/55306 [22:06:14] Newer version of libicu42 was installed [22:06:36] Ignore the newer part [22:08:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:09:17] according to bug 46036, it was all switched to 4.8 [22:09:25] who recompiled the package? [22:12:02] [2208][tstarling@tin:~]$ mwscript eval.php --wiki=enwiki [22:12:02] > echo INTL_ICU_VERSION; [22:12:02] 4.2.1 [22:12:22] c.f. https://bugzilla.wikimedia.org/show_bug.cgi?id=46036#c4 [22:12:31] https://rt.wikimedia.org/Ticket/Display.html?id=5912 [22:12:35] Alex [22:13:08] So it is php5-intl using the wrong libicu version [22:14:52] I have half a mind to text one of them about this [22:15:32] where is the version control for it? "ssh gerrit gerrit ls-projects | grep debs" shows nothing relevant [22:16:08] would i be getting in the way if i sync JS code in php-1.22wmf20/resources and an extension? [22:16:32] no [22:16:37] thanks [22:17:06] No idea, I was also looking for it earlier [22:21:02] I'm getting the source packages [22:23:24] !log olivneh synchronized php-1.22wmf21/extensions/WikimediaEvents 'Updating WikimediaEvents to log DOM retrieval timing for VE' [22:23:37] well, there's no version specified in the control file [22:23:40] !log olivneh synchronized php-1.22wmf20/resources/mediawiki/mediawiki.inspect.js 'Updating 1.22wmf20 for mw.loader.inspect (1/3)' [22:23:53] it was probably just built on an unclean labs instance, instead of in pbuilder [22:23:57] !log olivneh synchronized php-1.22wmf20/resources/mediawiki/mediawiki.js 'Updating 1.22wmf20 for mw.loader.inspect (2/3)' [22:24:13] !log olivneh synchronized php-1.22wmf20/resources/Resources.php 'Updating 1.22wmf20 for mw.loader.inspect (3/3)' [22:24:23] done [22:25:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.123 second response time [22:26:57] hey [22:26:59] what's up? [22:27:13] the new PHP package is screwed up [22:27:23] built against the old ICU [22:27:36] I was about to rebuild it myself [22:27:40] ok [22:27:42] alex built it [22:27:45] I wasn't involved [22:27:53] they tested it with hashar on betalabs afaik [22:27:57] but obviously not collation [22:28:14] tell them to use pbuilder next time [22:28:28] oh really? :) [22:28:46] are you building it or should I? [22:29:00] you can do it, my pbuilder labs instance is not responding [22:29:13] yeah I don't use labs for such things [22:29:32] smart man [22:29:45] I was told to use labs and stupidly followed that advice :) [22:30:30] I used to have a dedicated build server but it got thrown in the bin [22:31:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:31:14] or donated to charity or something [22:31:16] labs is mostly down now, so not best timing :) [22:31:26] is the package okay otherwise? do you know? [22:31:37] Nothing else has been reported [22:32:55] andrewbogott: it works [22:33:19] andrewbogott: e.g. https://graphite.wikimedia.org/render?from=-2hours&until=now&width=500&height=380&target=stats.timers.puppet.total.median [22:33:47] what is the Y axis? 
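A quick way to double-check the rebuilt package, along the same lines as the eval.php session above: confirm INTL_ICU_VERSION and that a Collator actually sorts as expected, since wrong category collation is what surfaced the libicu mismatch in the first place. The locale and sample words are arbitrary, not taken from bug 55565:

    echo 'intl built against ICU ' . INTL_ICU_VERSION . "\n";
    $coll = new Collator( 'sv_SE' );
    $words = array( 'örn', 'zebra', 'apa' );
    $coll->sort( $words );
    // Swedish collation puts å/ä/ö after z, so 'örn' should sort last; a
    // surprising order here is a hint the extension is linked against the
    // wrong libicu, regardless of what dpkg says about installed packages.
    print_r( $words );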
[22:33:54] extent to which i hate puppet [22:33:56] 'thousands of minutes' I could believe :( [22:34:19] I fail to see the point of having puppet metrics [22:34:21] I don't mind them [22:34:47] they're fancy but I don't think we can use them somehow [22:34:49] primarily for testing proposed optimizations for the puppetmaster setup [22:34:56] paravoid, we all know that puppet is unacceptably slow, but we don't currently know /how/ unacceptable! [22:35:14] rob asked yesterday if to enable hyperthreading on puppetmaster, for example [22:35:14] it's unacceptably slow because we have an overloaded server [22:35:27] we just need to scale the box up [22:35:32] to multiple workers [22:35:52] and upgrade to puppet 3 at some point which is reported to be much much better performance-wise [22:36:10] reported by people who measure such things :) [22:36:35] we measure cpu usage, I think we'll know too [22:36:46] YuviPanda: is the API down? [22:36:57] andrewbogott: NFS took it, probably [22:37:07] ah, right… taht's still happening? [22:37:17] nginx is up https://metrics.wmflabs.org/ [22:37:25] andrewbogott: don't think it was resolved yet [22:37:33] morebots is still out [22:37:43] so is gerrit bot [22:37:46] ok [22:38:24] andrewbogott: I can't ssh into the box (hangs when trying to access home, I bet) [22:38:26] so yeah [22:38:48] ok, this patch will just have to wait until tomorrow for testing. [22:39:12] awww [22:39:18] andrewbogott: re: the Y-axis, that might be some arithmetic fail on my part, hang on [22:39:59] PROBLEM - SSH on lvs4001 is CRITICAL: Server answer: [22:39:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.615 second response time [22:40:59] RECOVERY - SSH on lvs4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:41:25] ori-l: You can fix the math or add a label, either way :) [22:41:30] 'thousands of thousandths of...' [22:46:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:51:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.314 second response time [22:54:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [23:01:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.190 second response time [23:04:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out [23:05:44] php surely takes a while to build [23:18:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.599 second response time [23:23:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [23:29:06] !log reprepro include php5 5.3.10-1ubuntu3.8+wmf2 (wmf1 rebuilt in a clean environment, i.e. with precise's libicu) [23:30:49] no morebots [23:30:56] NFS dead [23:30:59] morebots dead [23:31:00] RIP NFS [23:31:06] are Coren/Ryan aware of this? [23:31:19] paravoid: coren is doing maintenance [23:31:21] paravoid: planned / scheduled maintenance [23:31:25] oh, sorry [23:31:30] and it's taken longer than anticipated [23:31:44] Ryan_Lane: Here is one (1) understatement token. [23:31:51] haha [23:31:51] :) [23:31:57] well, I wanted to be nice about it :) [23:32:05] what's wrong? anything I could help with? 
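For context on the stats.timers.puppet.total.median graph and the flush delay ori-l mentions: statsd timers are plain-text UDP datagrams whose value is conventionally in milliseconds (which may be part of the Y-axis confusion), and statsd only forwards aggregated values to graphite on its flush interval. A sketch of a sender follows; the host name, port and sample value are placeholders rather than the production config:

    // Placeholders throughout: host, port and the sample timing value.
    function sendTiming( $metric, $ms, $host = 'tungsten.eqiad.wmnet', $port = 8125 ) {
        $payload = sprintf( '%s:%d|ms', $metric, $ms );
        $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
        if ( $sock === false ) {
            return;
        }
        // Fire and forget; statsd turns a 'puppet.total' timer into the
        // stats.timers.puppet.total.* series and flushes to graphite on its
        // own interval, which is the short delay mentioned above.
        socket_sendto( $sock, $payload, strlen( $payload ), 0, gethostbyname( $host ), $port );
        socket_close( $sock );
    }
    sendTiming( 'puppet.total', 183000 ); // a ~3 minute puppet run, in milliseconds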
[23:32:11] it's just rsync taking ages [23:32:15] or brainstorm with [23:43:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.559 second response time [23:45:40] Ryan_Lane: parsoid could use a restart [23:45:57] oh? [23:46:20] we are pushing them quite hard currently, and there are some hanging workers [23:46:32] a fix is in rt testing, but not yet deployed [23:46:34] ok, I'll restart them, batched at 5 [23:46:38] salt -b 5 -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid [23:46:57] Ryan_Lane: can I run that command too? [23:46:57] done [23:47:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [23:47:13] Ryan_Lane: thanks! [23:47:18] no, but I put it in the channel, just in case others wanted to see how I did it [23:47:29] parsoid is non-standard in how it needs to be restarted with salt [23:47:37] ah, k [23:47:39] because of the init script [23:48:06] the abomination of an init script [23:48:20] heh [23:48:21] yep [23:49:22] gwicke: this is the upstart script i use for mwvagrant: https://dpaste.de/dC84/raw/ [23:49:40] production may be a lot more complex, i'm not too familiar with the setup. but sharing the link in case it's useful. [23:51:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.742 second response time [23:54:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [23:58:07] ori-l: thanks, saw that before [23:58:59] ori-l: added to https://bugzilla.wikimedia.org/show_bug.cgi?id=53723