[00:00:48] (PS1) Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - https://gerrit.wikimedia.org/r/88261
[00:02:39] (PS2) Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - https://gerrit.wikimedia.org/r/88261
[00:08:02] (PS3) Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - https://gerrit.wikimedia.org/r/88261
[00:10:41] paravoid: link?
[00:10:50] graphite.wm.org :)
[00:11:10] !log all ams-ix traffic is now on cr2-esams
[00:11:23] Logged the message, Mistress of the network gear.
[00:11:33] under the "stats" category
[00:13:03] I see
[00:16:10] PROBLEM - Puppet freshness on osm-cp1001 is CRITICAL: No successful Puppet run in the last 10 hours
[02:06:00] Meh, hopefully one day this will be easier and doesn't require mocking parts of core and wmf-config: https://gist.github.com/Krinkle/6878261
[02:06:12] get site configuration values for all (or some) wmf wikis
[02:06:34] https://gist.github.com/Krinkle/6878261#file-usage-log
[02:07:40] there is a getConfiguration.php maintenance script though. Which can build a cache per wiki (though that contains private data, so needs to be masked)
[02:07:51] and there is sitematrix API which provides some of this data
[02:08:32] what private data?
[02:08:44] it exports all global variables
[02:08:53] that match /^wm?g/
[02:09:03] so there's some passwords, certain tokens/salts and whatnot
[02:09:52] hah
[02:13:54] !log LocalisationUpdate completed (1.22wmf20) at Tue Oct 8 02:13:54 UTC 2013
[02:14:14] Logged the message, Master
[02:18:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[02:19:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.172 second response time
[02:23:04] (CR) Springle: [C: 1] Labs DB: Add labsdb-side second line of defense [operations/software] - https://gerrit.wikimedia.org/r/88149 (owner: coren)
[02:24:37] (CR) coren: [C: 2 V: 2] "A +1 from Sean is like unto a touch from God, and good enough for me. :-)" [operations/software] - https://gerrit.wikimedia.org/r/88149 (owner: coren)
[02:25:27] !log LocalisationUpdate completed (1.22wmf19) at Tue Oct 8 02:25:27 UTC 2013
[02:25:38] Logged the message, Master
[02:27:13] PROBLEM - Puppet freshness on srv193 is CRITICAL: No successful Puppet run in the last 10 hours
[02:31:53] !log start online optimize logging indexes s4 & s5
[02:32:04] Logged the message, Master
[02:32:30] (PS2) coren: Add templateeditor right, group, and restriction [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88196
[02:32:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[02:33:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.151 second response time
[02:35:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Oct 8 02:35:56 UTC 2013
[02:36:09] Logged the message, Master
[02:36:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[02:37:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in
29.596 second response time
[03:20:27] !log rmmod nf_conntrack on db1002 causing mass mysql connect failure
[03:20:41] Logged the message, Master
[03:23:19] !log xtrabackup clone s6 db1039 to db1022
[03:23:35] Logged the message, Master
[03:28:35] (PS1) Springle: s6 db1015 to full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88400
[03:29:04] (CR) Springle: [C: 2] s6 db1015 to full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88400 (owner: Springle)
[03:30:07] !log springle synchronized wmf-config/db-eqiad.php 's6 db1015 to full steam'
[03:30:19] Logged the message, Master
[04:12:58] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[04:44:53] what's up with the slaves of Wikidata?
[04:47:23] which ones?
[04:48:00] https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=
[04:49:57] * jeremyb suddenly wants ishmael...
[04:50:18] i wonder if springle's there?
[04:50:24] is he AEST?
[04:50:28] no more lag
[04:53:43] s5 (db1021 mainly) has been having issues. there's an optimize process running on logging tables right now, and i'm arranging for a spare slave
[04:53:49] jeremyb: Jasper_Deng ^
[04:54:16] oh, the optimize
[04:54:23] i was focusing on the xtrabackup :P
[04:54:53] springle: s5 is much better now in dbtree
[04:54:55] issues == slow queries causing max connections to be hit. related to https://bugzilla.wikimedia.org/show_bug.cgi?id=54876
[04:55:12] *click*
[04:55:20] springle: is it AEST?
[04:55:34] my tz? yes
[04:56:28] huh.
there's no A according to my system (wheezy)
[04:56:47] $ for TZ in America/Los_Angeles Europe/Athens Australia/Sydney; do zdump -v "$TZ" | fgrep ' 2013 ' | tail -n 1; done | while read TZ line; do printf '%-23s %s\n' "$TZ" "$line"; done
[04:56:51] America/Los_Angeles Sun Nov 3 09:00:00 2013 UTC = Sun Nov 3 01:00:00 2013 PST isdst=0 gmtoff=-28800
[04:56:54] Europe/Athens Sun Oct 27 01:00:00 2013 UTC = Sun Oct 27 03:00:00 2013 EET isdst=0 gmtoff=7200
[04:56:57] Australia/Sydney Sat Oct 5 16:00:00 2013 UTC = Sun Oct 6 03:00:00 2013 EST isdst=1 gmtoff=39600
[04:59:07] i use Australia/Brisbane which is EST +1000. AEST is only an abbreviation i think http://www.timeanddate.com/library/abbreviations/timezones/au/est.html
[05:01:56] ok, so brisbane hasn't had DST since 1992
[05:11:04] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[05:28:04] RECOVERY - Puppet freshness on srv193 is OK: puppet ran at Tue Oct 8 05:27:54 UTC 2013
[05:37:04] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 05:36:57 UTC 2013
[05:38:04] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[06:06:24] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 06:06:17 UTC 2013
[06:07:04] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[06:36:24] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 06:36:15 UTC 2013
[06:37:04] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[06:37:10] !log fixed up salt on mw1046, mw1072, mw1173
[06:37:26] Logged the message, Master
[06:45:34] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:36] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:37] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL
- Packet loss = 100%
[06:46:24] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.62 ms
[06:46:26] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.62 ms
[06:46:28] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.24 ms
[06:47:30] hm
[07:05:54] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 07:05:44 UTC 2013
[07:06:04] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[07:11:48] (PS1) ArielGlenn: depool db1040 for upgrade, conversion to mariadb [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88411
[07:16:52] (CR) ArielGlenn: [C: 2] depool db1040 for upgrade, conversion to mariadb [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88411 (owner: ArielGlenn)
[07:20:20] !log ariel synchronized wmf-config/db-eqiad.php 'depool db1040 for upgrade/conversion to mariadb'
[07:20:34] Logged the message, Master
[07:37:02] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 07:37:00 UTC 2013
[07:38:02] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[07:39:02] PROBLEM - mysqld processes on db1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[07:50:53] (PS1) Springle: warm up db1039 in s6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88413
[07:51:22] (CR) Springle: [C: 2] warm up db1039 in s6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88413 (owner: Springle)
[07:52:02] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1039 in s6'
[07:52:15] Logged the message, Master
[08:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[08:21:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30
seconds
[08:22:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 10.891 second response time
[08:28:40] I wish PHP would stop segfaulting on terbium
[08:28:41] Importing Svepmanschett_-_Livrustkammaren_-_34386-negative.tif.../usr/local/bin/mwscript: line 18: 906 Segmentation fault php "$MW_COMMON_DIR_USE/multiversion/MWScript.php" "$@"
[08:35:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 08:35:51 UTC 2013
[08:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[08:41:17] (PS1) ArielGlenn: fix up nagios contact name for ariel [operations/puppet] - https://gerrit.wikimedia.org/r/88418
[08:42:11] (CR) ArielGlenn: [C: 2] fix up nagios contact name for ariel [operations/puppet] - https://gerrit.wikimedia.org/r/88418 (owner: ArielGlenn)
[08:46:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[08:49:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 14.948 second response time
[09:08:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 09:08:06 UTC 2013
[09:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[09:08:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[09:09:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.027 second response time
[09:11:25] !log installing package upgrades on calcium (cameras)
[09:11:38] Logged the message, Master
[09:14:41] mutante: ???
[09:14:49] cameras ?
[09:14:57] DC cameras ?
[09:15:15] yes
[09:15:37] i started going through a list of servers where it's not very obvious from site.pp what their status is
[09:15:52] and this turned out to be the camera server setup by Rob
[09:16:14] aha...
please document :-) :-) :-)
[09:16:30] yep ,ok:)
[09:28:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[09:31:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.226 second response time
[09:31:51] (PS1) Odder: (bug 48480) Remove EmailCapture extension settings [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88424
[09:37:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 09:37:30 UTC 2013
[09:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[09:39:48] !log puppetstoredconflicean .. killiam williams.wikimedia.org .. done
[09:40:00] Logged the message, Master
[09:40:36] arg cant type on this
[09:40:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[09:42:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.113 second response time
[09:43:51] !log installing package upgrades on antimony (gitblit, gerrit repl.)
[09:44:02] Logged the message, Master
[09:57:23] (PS1) Reedy: Add ttf-sinhala-lklug to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/88430
[09:57:41] mutante: ^ want an easy one? :)
[09:58:17] Reedy_: ah a font .. yea
[09:58:52] easy bug is easy
[09:59:57] Reedy: N: Can't select versions from package 'ttf-sinhala-lklug' as it is purely virtual
[10:00:12] fonts-lklug-sinhala - Unicode Sinhala font by Lanka Linux User Group
[10:00:32] http://packages.ubuntu.com/precise/ttf-sinhala-lklug
[10:00:38] Hmm
[10:00:40] Are the others?
[10:00:42] * Reedy looks
[10:01:10] http://packages.ubuntu.com/precise/fonts-lklug-sinhala
[10:01:32] Ahh
[10:01:41] It would look like there's a few like this http://packages.ubuntu.com/precise/ttf-sil-yi
[10:01:49] Which is a transitional package
[10:02:32] Inst defoma (0.11.12ubuntu1 Ubuntu:12.04/precise [all])
[10:02:32] Inst fonts-lklug-sinhala (0.5.4-1 Ubuntu:12.04/precise [all])
[10:02:32] Inst x-ttcidfont-conf (32+nmu2 Ubuntu:12.04/precise [all])
[10:02:41] <- that's what it would do
[10:02:53] for install ttf-sinhala-lklug
[10:03:37] I'll modify that commit
[10:03:47] eh and exactly the same for install fonts-lklug-sinhala
[10:03:48] and see about cleaning the others up then to the newer packages
[10:03:52] (CR) Akosiaris: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/88430 (owner: Reedy)
[10:04:04] * Reedy eyes akosiaris
[10:04:05] ah
[10:04:06] but did we expect we still need defoma
[10:04:13] so... same talk here :-)
[10:04:13] i mean, we already have all those fonts
[10:04:34] looked on mw75
[10:04:50] We can just stick defoma at the top of the list ;)
[10:07:09] waits for gerrit to publish a comment
[10:07:18] isn't defoma deprecated ?
[10:07:37] i have no such package on wheezy...
[10:07:44] http://packages.ubuntu.com/precise/defoma
[10:07:50] Doesn't seem to be in ubuntu land
[10:07:54] Maintainer: Ubuntu Developers
[10:07:57] Original-Maintainer: Debian QA Group
[10:08:02] http://packages.ubuntu.com/saucy/defoma
[10:08:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 10:08:01 UTC 2013
[10:08:10] so... not in saucy...
[10:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[10:08:38] which means that in the next LTS no defoma
[10:08:40] disappeared in raring
[10:08:53] nice...
[10:08:54] (CR) Dzahn: "root@mw75:~# apt-get -s install ttf-sinhala-lklug" [operations/puppet] - https://gerrit.wikimedia.org/r/88430 (owner: Reedy)
[10:08:58] let's please avoid it :-)
[10:09:20] why would it want defoma for this specific package
[10:09:28] while all the other packages are already installed
[10:09:33] and didnt pull it
[10:09:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[10:10:15] The defoma dependency just disappears
[10:10:16] http://packages.ubuntu.com/raring/ttf-sinhala-lklug
[10:11:18] Reedy: easy bug turns into a general cleanup again :)
[10:11:28] just install the virtual package!!!!!
[10:11:29] :p
[10:11:57] https://wiki.debian.org/OldPkgRemovals
[10:12:16] Could dump the list into etherpad and update them all
[10:12:52] ttf-sinhala-lkmug ?
[10:13:17] mutante: maybe because of recommends x-ttcidfont-conf
[10:13:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 12.472 second response time
[10:14:00] try a --no-install-recommends (we should have it in apt though)
[10:14:27] (CR) Dzahn: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/88430 (owner: Reedy)
[10:15:21] root@mw75:~# apt-get install -s --no-install-recommends ttf-sinhala-lkmug
[10:15:27] Inst fonts-lklug-sinhala (0.5.4-1 Ubuntu:12.04/precise [all])
[10:15:27] Inst ttf-sinhala-lkmug (0.5.4-1 Ubuntu:12.04/precise [all])
[10:15:32] <- that looks good
[10:15:53] same with "lklug" vs. "lkmug"
[10:16:03] Inst fonts-lklug-sinhala (0.5.4-1 Ubuntu:12.04/precise [all])
[10:16:14] gets you that but not the ttf- package
[10:16:15] PROBLEM - Puppet freshness on osm-cp1001 is CRITICAL: No successful Puppet run in the last 10 hours
[10:18:16] I am afraid that list needs cleanup... even worse... I am afraid that for every font it will be a different name in 10.04 and 12.04 ...
:-(
[10:18:33] (CR) Dzahn: "< akosiaris> mutante: maybe because of recommends x-ttcidfont-conf" [operations/puppet] - https://gerrit.wikimedia.org/r/88430 (owner: Reedy)
[10:18:52] I don't think we've any machines on 10.04 this would be a problem for...
[10:19:11] i was hoping to hear that :-)
[10:19:26] but i am surprised i actually did
[10:19:38] I know we do have some 10.04 machines around still
[10:19:46] But nothing that's mass installed
[10:19:54] If we did, mutante would be finding a big stick to hit them with
[10:20:24] don't suppose we have a nice way to setting no recommends in the manifest?
[10:20:35] not in the manifest
[10:20:36] but
[10:21:18] $ cat apt.conf.d/90no-recommends.conf
[10:21:18] APT::Install-Recommends "0";
[10:21:28] http://serverfault.com/questions/280405/installing-open-vm-tools-in-ubuntu-via-puppet-whats-the-lesser-evil
[10:22:00] stabby
[10:23:58] http://projects.puppetlabs.com/issues/1766
[10:24:04] Control of recommends installation with aptitude provider
[10:24:09] we should have no problem anyway
[10:24:11] Added by Nick Phillips almost 5 years ago. Updated almost 3 years ago.
[10:24:17] we don't use the aptitude provider
[10:24:32] and the apt provider does not install recommends ... problem solved :-)
[10:24:33] ah true
[10:24:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[10:24:50] 2 levels of dupe http://projects.puppetlabs.com/issues/4113
[10:24:52] wait, i didn't use aptitude to test though
[10:24:57] just apt-get install
[10:25:24] which is not what puppet does... it runs apt-get install -R (or --no-recommends or smt)...
[10:25:31] sweet
[10:25:39] then problem solved indeed
[10:26:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.790 second response time
[10:27:33] so..it's ttf-sinhala-lkmug
[10:27:37] Reedy:
[10:29:53] Ya?
[10:30:27] hai
[10:31:29] Reedy: just change it to -lkmug instead of -lklug and we can merge it as puppet will not install the recommends
[10:33:56] (PS2) Reedy: Add ttf-sinhala-lklug to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/88430
[10:34:15] Should I make a commit to update and bypass all those virtual packages?
[10:34:25] Aghh, commit summary
[10:35:02] (PS3) Reedy: Add ttf-sinhala-lkmug to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/88430
[10:35:10] (PS1) ArielGlenn: db1040 -> mariadb and file_per_table [operations/puppet] - https://gerrit.wikimedia.org/r/88436
[10:35:50] (CR) Dzahn: [C: 1] Add ttf-sinhala-lkmug to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/88430 (owner: Reedy)
[10:36:13] (CR) Dzahn: [C: 2] Add ttf-sinhala-lkmug to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/88430 (owner: Reedy)
[10:36:24] (CR) ArielGlenn: [C: 2] db1040 -> mariadb and file_per_table [operations/puppet] - https://gerrit.wikimedia.org/r/88436 (owner: ArielGlenn)
[10:39:22] Reedy: re: patchset to update all. i think akosiaris would like that (some time)
[10:39:34] looks at puppet on an imagescaler
[10:43:11] They're not all transitional
[10:43:11] http://packages.ubuntu.com/precise/ttf-alee
[10:43:12] :/
[10:44:24] * Reedy starts looking
[10:46:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 10:46:13 UTC 2013
[10:46:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[10:47:27] notice: /Stage[main]/Imagescaler::Packages::Fonts/Package[ttf-sinhala-lkmug]/ensure: ensure changed 'purged' to 'latest'
[10:52:55] akosiaris: arr..
we still got defoma
[10:53:00] the actual commandline from puppet:
[10:53:04] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install ttf-sinhala-lkmug
[10:53:23] Coren: https://bugzilla.wikimedia.org/show_bug.cgi?id=49350#c4
[11:01:45] (PS1) Reedy: Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441
[11:02:06] (CR) jenkins-bot: [V: -1] Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441 (owner: Reedy)
[11:03:30] (PS2) Reedy: Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441
[11:06:17] (PS3) Reedy: Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441
[11:06:30] bloody whitespace
[11:06:40] (CR) jenkins-bot: [V: -1] Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441 (owner: Reedy)
[11:07:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 11:07:22 UTC 2013
[11:07:32] (PS4) Reedy: Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441
[11:07:34] the package is on all scalers for now they can test if bug is fixed
[11:07:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[11:11:27] Thanks
[11:20:34] what is wrong with the virtual packages ? :(
[11:35:52] mutante: arg...
[11:37:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 11:37:05 UTC 2013
[11:37:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[11:38:22] mutante: could we use salt to remove defoma ?
[11:38:36] Wouldn't puppet just end up installing it again?
[11:38:54] yes but i am writing a patch to avoid that next time
[11:38:59] aha :)
[11:39:19] Out of interest, what would be the easiest way to find out how much out traffic dataset2 has a month?
[11:39:22] akosiaris: yes, dsh is also easy enough since there is an existing group for image_scalers
[11:39:33] while i dont think we have grains for that
[11:39:49] and the hostnames are just mw's
[11:39:50] mutante: ok... dsh it is
[11:40:12] I'll handle it
[11:40:19] thx
[11:41:07] bash history of fenari root
[11:51:03] :-)
[11:51:25] (PS1) Akosiaris: apt-get should not install recommends [operations/puppet] - https://gerrit.wikimedia.org/r/88446
[11:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[11:58:41] Reedy: er ganglia graphs?
[11:59:26] maybe not, that's going to count other network traffic
[12:00:00] no... /modules/wikidata_singlenode/files/simple-elements.xml must die
[12:00:23] reason: all of a sudden i cant easily grep for our server names in puppet repo :p
[12:00:38] because it's an unrelated file full of element names
[12:02:29] paravoid: are you a user on host "copper"?
(swiftrepl)
[12:08:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.420 second response time
[12:13:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[12:14:08] | grep -v simple-elements :-P
[12:14:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 12.844 second response time
[12:14:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 12:14:46 UTC 2013
[12:15:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[12:17:45] !log installing package upgrades on ekrem (IRC)
[12:18:00] Logged the message, Master
[12:18:18] (CR) Akosiaris: [C: 2] apt-get should not install recommends [operations/puppet] - https://gerrit.wikimedia.org/r/88446 (owner: Akosiaris)
[12:20:20] created new wikitech pages for ekrem, chromium, capella, carbon, calcium, ..
[12:21:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[12:22:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.688 second response time
[12:23:55] mutante: yippi
[12:25:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[12:30:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.767 second response time
[12:33:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out
[12:34:59] !log apt-get purge defoma on image_scalers dsh group
[12:35:14] Logged the message, Master
[12:35:21] !log install package upgrades on gallium (jenkins)
[12:35:34] Logged the message, Master
[12:35:42] mutante: argh
[12:35:46] mutante: are you upgrading jenkins ?
:D
[12:36:10] ah no
[12:36:27] hashar: no, just python, linux-tools-common, some libs
[12:36:34] w3m :p
[12:36:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.349 second response time
[12:36:43] :D
[12:37:48] pfff
[12:37:52] I restarted jenkins by mistake
[12:37:58] !log restarted Jenkins by mistake :-(
[12:38:14] Logged the message, Master
[12:38:19] sorry if i should have asked, they seemed harmless and unrelated
[12:39:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 12:38:57 UTC 2013
[12:39:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out
[12:39:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[12:40:38] mutante: yeah yeah that is harmless usually :-]
[12:40:51] mutante: just wanted to make sure jenkins was not going to restart
[12:40:55] but I did kill it by mistake :-((((((((((((
[12:41:17] :p no worries, i wouldn't touch jenkins itself without pinging you at least
[12:43:51] actually..
instead of those wikitech templates for servers we could use wikidata right away :p
[12:49:36] !log installing package upgrades on formey
[12:49:46] Logged the message, Master
[12:52:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 20.908 second response time
[12:52:38] hashar: ^ while we're at it, that upgraded and reloaded apache on formey (svn/gerrit)
[12:52:48] status OK
[12:59:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[13:00:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.961 second response time
[13:03:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[13:07:47] (PS1) Akosiaris: elasticsearch plugins in git-deploy [operations/puppet] - https://gerrit.wikimedia.org/r/88455
[13:09:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 13:08:59 UTC 2013
[13:09:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[13:13:11] https://wikitech.wikimedia.org/w/index.php?title=Template%3AServer&diff=85340&oldid=70064
[13:13:49] https://wikitech.wikimedia.org/wiki/Special:NewPages
[13:15:38] (CR) Odder: "See I3aee7b08501435f1248037eba66912436720bf4d for a follow-up." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88424 (owner: Odder)
[13:20:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 11.391 second response time
[13:22:16] mutante: I am not sure what is left on formey
[13:22:37] mutante: ahhh svn
[13:22:37] hashar: svn and gerrit repl. destination
[13:23:10] mutante: it might be a preproduction gerrit install.
[13:23:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[13:23:46] mutante: we will have to migrate it to eqiad eventually :-D
[13:24:22] role::deployment::test ?
[13:25:19] hashar: at least the last svn users are gone :P
[13:25:25] pywiki
[13:25:29] yup
[13:25:35] mutante: yes that exists too
[13:25:44] you might want to follow up with Chad. I get access on formey but I merely use it to query ldap
[13:25:56] Ryan created it while giving me a crash course on git-deploy
[13:27:36] akosiaris: helium could use an apt-get upgrade (it's the bacula box)
[13:27:45] mutante: go for it
[13:27:49] kk
[13:28:09] !log installing package upgrades on helium
[13:28:20] Logged the message, Master
[13:29:10] mutante: we can upgrade carbon anytime since brewster is still primary tftp server
[13:30:04] cmjohnson1: alright, cool so i guess reinstall ?
[13:30:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.261 second response time
[13:30:53] I don't see why we couldn't
[13:31:02] ok
[13:32:43] mark: do you want me to move dysprosium to c8? or wait for new 10g rack in D?
[13:33:06] is it in the way of anything right now?
[13:33:16] no
[13:33:17] we haven't really assigned it to anything new yet
[13:33:26] i'd say leave it for now
[13:33:52] kk
[13:35:03] (PS2) Akosiaris: elasticsearch plugins in git-deploy [operations/puppet] - https://gerrit.wikimedia.org/r/88455
[13:37:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 13:37:40 UTC 2013
[13:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[13:38:48] !log installing package upgrades on hooper
[13:39:00] Logged the message, Master
[13:42:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[13:46:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.218 second response time
[13:52:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[13:53:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.202 second response time
[13:58:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[14:00:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.354 second response time
[14:08:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 14:08:45 UTC 2013
[14:09:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[14:20:49] where is mark?
[14:21:59] Snaps: need him in particular?
[14:22:26] also, there's 3 different marks. you mean the one that goes by just "mark"?
[14:22:37] Mr Bergsma
[14:22:58] Im mostly curious, but I wouldnt mind him reviewing some stuff.
[14:26:53] Is it in gerrit?
[14:27:03] (PS1) ArielGlenn: warm up db1040 after upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88468
[14:27:42] Reedy: yup.
https://gerrit.wikimedia.org/r/#/c/88234/
[14:27:53] I want that reviewed and merged so that ottomata can continue his testing
[14:30:37] (CR) ArielGlenn: [C: 2] warm up db1040 after upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/88468 (owner: ArielGlenn)
[14:30:40] I note he's marked as /away currently
[14:31:06] (PS1) Hashar: graphite: upstream doc in storage-aggregation.conf [operations/puppet] - https://gerrit.wikimedia.org/r/88469
[14:31:07] (PS1) Hashar: graphite: make sure we aggregate min/max/count properly [operations/puppet] - https://gerrit.wikimedia.org/r/88470
[14:31:08] (PS1) Hashar: graphite: tweak statsd aggregation [operations/puppet] - https://gerrit.wikimedia.org/r/88471
[14:32:05] (CR) Hashar: "That one is easy enough :)" [operations/puppet] - https://gerrit.wikimedia.org/r/88469 (owner: Hashar)
[14:34:28] !log ariel synchronized wmf-config/db-eqiad.php 'warm up db1040 (s6) after upgrade'
[14:34:40] Logged the message, Master
[14:36:24] (CR) Hashar: "That doesn't set the aggregation methods for percentile, it apparently has some performances impacts."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88471 (owner: 10Hashar) [14:39:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 14:38:58 UTC 2013 [14:39:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [14:54:16] (03PS1) 10Jgreen: additional rsa/ssh key for awight [operations/puppet] - 10https://gerrit.wikimedia.org/r/88477 [14:54:34] (03CR) 10jenkins-bot: [V: 04-1] additional rsa/ssh key for awight [operations/puppet] - 10https://gerrit.wikimedia.org/r/88477 (owner: 10Jgreen) [14:57:55] (03PS2) 10Jgreen: additional rsa/ssh key for awight [operations/puppet] - 10https://gerrit.wikimedia.org/r/88477 [14:59:48] (03CR) 10Jgreen: [C: 032 V: 031] additional rsa/ssh key for awight [operations/puppet] - 10https://gerrit.wikimedia.org/r/88477 (owner: 10Jgreen) [15:07:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 15:07:48 UTC 2013 [15:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [15:08:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:09:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.019 second response time [15:14:35] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:16:25] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [15:19:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:20:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.714 second response time [15:34:03] (03CR) 10Mark Bergsma: [C: 032] Use LRU hash for logline cache to avoid memory leak [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/88234 (owner: 10Edenhill) [15:34:11] (03CR) 10Mark Bergsma: [V: 032] Use LRU hash for logline cache to avoid memory leak [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/88234 (owner: 10Edenhill) [15:34:38] thanks mark! [15:37:26] :) [15:37:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 15:37:30 UTC 2013 [15:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [15:40:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:43:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 21.526 second response time [15:54:56] !log fixed up salt clients on analytics 1005-6, 1009-27 [15:55:09] Logged the message, Master [15:57:45] How do you invalidate the varnish caches when pages are modified? [15:59:03] (03PS1) 10Mark Bergsma: Move mediawiki.org and wikimediafoundation.org to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/88490 [15:59:32] Snaps: with multicast udp messages (purge requests) that contain the url [15:59:42] mark: ok, cool. [15:59:58] using the HTCP protocol [16:00:05] is that sourced by the web frontends or the backend storage thingie? 
[16:01:18] by the mediawiki application servers on edits [16:01:33] okay, thanks [16:02:32] (03CR) 10Mark Bergsma: [C: 032] Move mediawiki.org and wikimediafoundation.org to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/88490 (owner: 10Mark Bergsma) [16:07:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 16:07:44 UTC 2013 [16:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [16:12:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:13:46] (03PS4) 10Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 [16:14:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.297 second response time [16:17:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:18:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 14.990 second response time [16:26:05] (03PS3) 10Reedy: Labs: Turn off secure login on loginwiki due to untrusted SSL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87045 (owner: 10Mattflaschen) [16:33:01] hey paravoid, ok, i'm rebuilding librdkafka again, looking closer at the symbols thing [16:33:08] this: [16:33:09] https://gist.github.com/ottomata/6869713#file-gistfile1-diff [16:33:16] is a diff of your .symbols file [16:33:26] with the current output of dpkg-gensymbols [16:33:58] can I just replace the .symbols file with the output of dpkg-gensymbols? 
(it works if I do) [16:34:00] (03PS3) 10Reedy: Use $wgMessageFileList [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84898 [16:34:11] Snaps: I can build it like that ^ [16:34:14] so i'll just do that for now [16:34:16] (03PS4) 10Reedy: Use $wgMessageFileList [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84898 [16:34:47] (03Abandoned) 10Reedy: Remove usage of list-file in mergeMessageFileList.php [operations/puppet] - 10https://gerrit.wikimedia.org/r/84900 (owner: 10Reedy) [16:35:11] (03PS5) 10Reedy: Use $wgMessageFileList [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84898 [16:35:15] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 311 seconds [16:35:15] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 315 seconds [16:35:26] (03CR) 10Reedy: [C: 032] Use $wgExtensionEntryPointListFiles [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84898 (owner: 10Reedy) [16:35:35] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 333 seconds [16:35:45] ottomata: I actually fixed the package on the plane [16:35:55] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 346 seconds [16:35:55] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 347 seconds [16:36:11] oh! [16:36:13] yeah? [16:36:38] committed paravoid? [16:36:41] pushed? [16:36:52] not pushed yet [16:37:00] I literally just came online again [16:37:04] oh ok! 
[16:37:44] (03CR) 10jenkins-bot: [V: 04-1] Use $wgExtensionEntryPointListFiles [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84898 (owner: 10Reedy) [16:38:10] (03Merged) 10jenkins-bot: Use $wgExtensionEntryPointListFiles [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84898 (owner: 10Reedy) [16:38:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 16:38:06 UTC 2013 [16:38:32] (03PS1) 10Chad: We want 2 replicates per shard in production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88503 [16:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [16:41:11] (03PS1) 10Reedy: Change Echo for Flow in extension-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88504 [16:42:11] (03CR) 10Reedy: [C: 032] Change Echo for Flow in extension-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88504 (owner: 10Reedy) [16:43:26] (03CR) 10Greg Grossmeier: [C: 031] Labs: Turn off secure login on loginwiki due to untrusted SSL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87045 (owner: 10Mattflaschen) [16:43:37] (03Merged) 10jenkins-bot: Change Echo for Flow in extension-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88504 (owner: 10Reedy) [16:43:52] hey, someone with +2 in operations/mediawiki-config wanna merge that change I just +1'd there? [16:44:24] (03CR) 10Ori.livneh: [C: 032] Labs: Turn off secure login on loginwiki due to untrusted SSL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87045 (owner: 10Mattflaschen) [16:44:27] my +2 hammer isn't big enough [16:44:31] thanks ori-l [16:44:46] chrismcmahon: ^^ [16:44:47] (03Merged) 10jenkins-bot: Labs: Turn off secure login on loginwiki due to untrusted SSL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87045 (owner: 10Mattflaschen) [16:45:23] should i sync it? 
[16:45:50] ottomata: pushed [16:46:17] Don't really need to... [16:46:20] beta should do it itself [16:46:33] Pulled onto tin [16:46:34] ori-l: what Reedy said [16:46:40] "wait 3 or so minutes" [16:47:13] yes, but the next person to deploy has to reason about that [16:47:27] to deploy? [16:47:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:47:45] oh, if it's pulled onto tin then fine [16:48:06] !log reedy synchronized wmf-config/ [16:48:23] Logged the message, Master [16:48:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.359 second response time [16:49:02] heh, and HTTPS icinga warning goes off, timely [16:49:13] (03PS1) 10Chad: Fix path to check_elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 [16:50:35] greg-g: that's for stafford (puppet) [16:50:44] Ryan_Lane: I know, just timely [16:52:35] (03PS1) 10Mark Bergsma: Move the remaining non-wikipedia projects to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/88509 [16:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:54:30] i guess i'll do that one tomorrow [16:54:41] or I might have to start paying attention to deployment windows [16:55:09] (03PS1) 10ArielGlenn: normal weights now for db1040 (s6) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88510 [16:56:16] (03CR) 10ArielGlenn: [C: 032] normal weights now for db1040 (s6) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88510 (owner: 10ArielGlenn) [16:56:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 15.332 second response time [16:58:07] !log ariel synchronized wmf-config/db-eqiad.php 'db1046 (s6) back to normal weight in pool' [16:58:17] Logged the message, Master [17:01:49] (03CR) 10Ryan Lane: [C: 04-1] "(1 comment)" [operations/puppet] - 
10https://gerrit.wikimedia.org/r/88455 (owner: 10Akosiaris) [17:02:19] thanks paravoid [17:02:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [17:04:48] (03CR) 10RobH: [C: 031] delete zwinger and zwinger2 from wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/88113 (owner: 10Dzahn) [17:05:29] thank git [17:05:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.828 second response time [17:06:02] i recall being on an an airplane, doing a ton of puppet manifest work on the plane [17:06:14] and at the end hoping we wouldn't crash, "because that would be such a waste if I can't push!" [17:06:21] hahahaha [17:06:33] RobH (on the same plane) said my priorities were fucked up ;) [17:06:37] (03Abandoned) 10Chad: Add elasticsearch plugins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82673 (owner: 10Manybubbles) [17:07:10] (03CR) 10Manybubbles: "Might as well make a wmg out of this so we can set it to something other than 2 for other wikis." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88503 (owner: 10Chad) [17:08:07] it was on the way to the new orleans hackathon [17:08:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 17:08:09 UTC 2013 [17:08:15] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 118 seconds [17:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [17:08:35] RECOVERY - MySQL Slave Delay on db37 is OK: OK replication delay 0 seconds [17:08:55] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [17:08:55] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [17:09:15] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [17:10:14] (03PS2) 10Chad: We want 2 replicates per shard in production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88503 [17:10:16] ottomata: on an unrelated note, how's openjdk? [17:10:31] mark: I had the same feeling on the way to amsterdam with a shitload of OpenStackManager code [17:10:43] :) [17:11:08] I had like 3 decent sized features written, and that would have sucked :D [17:11:17] heh [17:11:35] don't need life insurance but code insurance [17:11:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [17:12:07] "git push plane-black-box" [17:12:27] !log olivneh synchronized php-1.22wmf20/extensions/WikimediaEvents 'CoreEvents -> WikimediaEvents' [17:12:39] Logged the message, Master [17:12:44] :D [17:12:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.380 second response time [17:13:13] mark: Next time, make sure you have an USB key to store this stuff on; they're surpringly robust and would probably survive a crash you wouldn't. 
:-) [17:13:31] or my SSD wouldn't ;) [17:13:49] oh, good as far as I can tell paravoid [17:13:50] no problems yet [17:14:06] we aren't running many jobs in hadoop yet, but I have been running some longish running hive queries [17:14:08] they are fine [17:15:25] !log olivneh synchronized php-1.22wmf19/extensions/WikimediaEvents 'CoreEvents -> WikimediaEvents' [17:15:37] Logged the message, Master [17:15:50] (03CR) 10Ori.livneh: [C: 032] Prefer WikimediaEvents to CoreEvents, now that the extension has been renamed. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88199 (owner: 10Ori.livneh) [17:16:33] (03Merged) 10jenkins-bot: Prefer WikimediaEvents to CoreEvents, now that the extension has been renamed. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88199 (owner: 10Ori.livneh) [17:18:08] !log olivneh synchronized wmf-config/CommonSettings.php 'CoreEvents -> WikimediaEvents' [17:18:21] Logged the message, Master [17:19:33] (03CR) 10Manybubbles: [C: 031] "Looks good to me. Someone else will have to approve because I can't +2 this repo." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88503 (owner: 10Chad) [17:20:00] springle-away: if you have some time I have a crazy plan to run by you [17:20:42] (03PS1) 10Ori.livneh: Complete CoreEvents -> WikimediaEvents switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88515 [17:21:15] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:17] (03CR) 10Ori.livneh: [C: 032] Complete CoreEvents -> WikimediaEvents switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88515 (owner: 10Ori.livneh) [17:21:27] (03Merged) 10jenkins-bot: Complete CoreEvents -> WikimediaEvents switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88515 (owner: 10Ori.livneh) [17:22:45] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:50] !log olivneh synchronized wmf-config/CommonSettings.php 'CoreEvents -> WikimediaEvents' [17:23:00] (03PS1) 10Cmjohnson: setting up amssq48-62 as text varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/88516 [17:23:03] Logged the message, Master [17:23:28] !log olivneh synchronized wmf-config/extension-list 'CoreEvents -> WikimediaEvents' [17:23:40] Logged the message, Master [17:24:26] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:29] (03CR) 10Chad: [C: 032] We want 2 replicates per shard in production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88503 (owner: 10Chad) [17:24:39] (03Merged) 10jenkins-bot: We want 2 replicates per shard in production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88503 (owner: 10Chad) [17:24:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [17:25:05] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.815 second response time [17:25:15] !log demon synchronized 
wmf-config/InitialiseSettings.php '2 shards per replica in elastic' [17:25:26] Logged the message, Master [17:25:28] <^d> manybubbles: Went ahead and synced that ^. Now we don't have any worries when the extension change goes out [17:25:47] good [17:26:19] Something br0ke? I'm not seeing any JS or CSS on enwiki atm. [17:26:25] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.698 second response time [17:26:35] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:35] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.221 second response time [17:26:38] <^d> Coren: DanielK was saying bits is slow in -tech. [17:26:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.469 second response time [17:27:10] ^d: Times out for me, atm. [17:27:30] <^d> No problems for me here in the US. [17:27:30] same -- timeouts [17:27:52] <^d> (At home, not on office network today) [17:28:15] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:25] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.456 second response time [17:28:26] jebus -- someone deployed a cache busting request -- https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=user.groups&only=styles&skin=vector&user=Mwalker+%28WMF%29&version=20130929T103650Z&* [17:29:27] <^d> Whoops and there goes styles for me. 
[17:29:56] mwalker: the request for the user module is always user-specific [17:29:59] (03CR) 10Mark Bergsma: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88516 (owner: 10Cmjohnson) [17:30:05] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.533 second response time [17:30:06] ah; didn't know that [17:30:15] in any case it's taking > 20 seconds to load [17:30:25] <^d> http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Bits%2520application%2520servers%2520eqiad&tab=m&vn= - load has been increasing [17:30:37] <^d> And a network dropoff. [17:31:10] ouch...yeah, something bad is happening [17:31:17] indeed [17:31:35] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:58] no styling on mw.org !!!??? [17:32:13] mark ^ [17:32:25] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.123 second response time [17:32:26] eep, i'm here as well... [17:32:30] reports of bits misbehaving [17:32:42] Yeah, it times out for me. [17:32:49] looks like it started just before 17:20 [17:33:02] 19:18:07 !log olivneh synchronized wmf-config/CommonSettings.php 'CoreEvents -> WikimediaEvents' [17:33:08] perhaps? [17:33:27] Hm. This coincides exactly with the 100% test banner thing. [17:33:36] there's no static asset payload to that extension [17:33:59] but the extension that it replaces had a JS module, so it would update the URL for static assets to remove the now-unused module [17:34:04] * 7:23am-8:30am Pacific (UTC 16:23 - 15:30) [17:34:11] which would mean a higher cache miss, but that's normal, we do that all the time [17:34:33] Oh wait, I fail timezone calculation. [17:34:51] * Coren shuts up now. [17:35:43] ori-l: there was more than just a rename? 
[17:35:54] i see lots of eventlogging and lots of bannercontroller [17:36:00] sent as cache misses to the bits apaches [17:36:14] greg-g: it dropped the https logging; you +1d that yesterday [17:36:40] right, so just the rename and droppign that? [17:36:44] I haven't touched banner controller in weeks [17:37:48] yes; number of events is normal [17:37:49] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=EventLogging&vl=events+%2F+sec&x=&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=%5E%28eventlogging_client-side-events%7Ceventlogging_server-side-events%7Ceventlogging_all-events%29%24&gtype=stack&glegend=show&aggregate=1 [17:38:07] weekly view: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=EventLogging&vl=events+%2F+sec&x=&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=%5E%28eventlogging_client-side-events%7Ceventlogging_server-side-events%7Ceventlogging_all-events%29%24&gtype=stack&glegend=show&aggregate=1 [17:38:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 17:38:05 UTC 2013 [17:38:17] hey mark, have you had a chance to look at https://lists.wikimedia.org/mailman/private/ops/2013-October/024280.html ? 
[17:38:33] dr0ptp4kt: one second, site wonkiness ;) [17:38:34] dr0ptp4kt: there's a bits issue going on at the moment [17:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [17:38:54] greg-g, ori-l, thx, will bug mark later [17:39:52] it looks like the apaches are recovering [17:42:46] cache hit rate is back on bits varnish boxes [17:43:30] I think that was plausibly caused by me dropping a module that was referenced on every page [17:43:43] thus caused the RL URLs to change [17:43:53] s/plausibly/likely/ [17:44:29] very likely [17:44:31] wow [17:44:35] That seems odd, though [17:44:37] Varnish should have cached it almost immediately [17:44:41] I hadn't anticipated that because I was focused on the fact that the sync would mean less JS gets loaded & less events get sent [17:44:50] mark: Were there a higher than usual number of cache misses for load.php URLs? [17:44:57] yes [17:45:30] but in general yeah, I haven't seen this behavior before when adding/removing modules [17:46:05] Yeah it seems very strange [17:46:14] It shouldn't have upset the cluster for this long [17:46:35] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=varnish.cache_miss&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [17:46:52] Hmm, actually you know what [17:46:55] depends on how many different URLs there are [17:47:00] ori-l: How hard did you drop the module? [17:47:10] a bits cache has roughly 75k objects in cache [17:47:12] Did you just stop loading it, or did you actually make it stop existing? [17:47:21] I made it stop existing [17:47:29] Right, that makes sense then [17:47:37] can you explain? [17:47:39] Because a bunch of clients will have the old URL still, referencing the now-missing module [17:47:48] And guess what ResourceLoader does to the caching headers when part of the response is an error [17:48:04] facepalm [17:48:05] hah. 
[17:48:08] i also don't recall you mentioning this yeterday ori [17:48:13] or did I miss it? [17:48:14] maxage=300, fortunately, not CC: private [17:48:28] But still, 300 < 2592000 [17:48:32] mark: mentioning what yesterday? [17:48:38] this change [17:49:04] Hm, how do I make the MediaWiki notification system send me HTTPS links in its e-mails? [17:49:17] (on Wikimedia) [17:49:31] it wasn't planned; i asked to do it during the LD window and greg-g okayed it, but in hindsight i did not represent it well and i did not think through the cache effect [17:50:19] RoanKattouw, that's not the first time such changes are made - why it didn't cause an overload before? [17:50:36] We normally don't just delete a module, do we? [17:50:48] there are no guidelines in place for that [17:50:54] When I've renamed modules I've always kept the old name as a redirect [17:51:00] mark: I ok'd it since I was under the impression it was just an extension rename (I forgot about the removal of the HTTPS logging that was done yesterday but not deployed) [17:51:11] twkozlowski: Right now I don't think you can? Ask csteipp, he'll know [17:51:22] greg-g: you didn't forget; i didn't mention it, which is my bad [17:51:23] twkozlowski: patches welcome ;) [17:51:32] ori-l: well, both, but sure [17:52:12] the rename may have caused the load as well? [17:52:21] both yeah [17:52:25] RoanKattouw, and users on wiki can cause such problems by creating a default gadget and then deleting it [17:52:25] :/ [17:52:29] mark: good to know [17:52:30] it did [17:52:50] MaxSem: Right :( [17:53:00] Yeah if my theory is right then that would be a consequence of that [17:53:16] So I hope I'm not right and something else was going on [17:53:22] mark, greg-g -- sorry. i was focussed on the fact that the change meant one fewer module will get loaded and one fewer event get sent, so it didn't strike me as a change that would increase load anywhere. 
[17:53:54] ok, let's just make sure that we're all very careful about any asset presence/name changes that are used on lots of pages [17:53:58] FTR I am not actually sure that a missing module causes maxage=300. It probably shouldn't [17:54:16] mark: very much so, yes. thanks. [17:54:20] is there a way to log maybe 1/1000th of cache misses on bits? [17:54:35] mark: Do you have data on what the CC headers' max-age and s-maxage were during that wave of cache misses? [17:54:40] right now this is the distribution of CC: [17:54:41] 7516.50 TxHeader Cache-Control: max-age=2592000 [17:54:42] 2018.10 TxHeader Cache-Control: private, max-age=86400, s-maxage=0 [17:54:42] 36.00 TxHeader Cache-Control: s-maxage=2678400, max-age=2678400 [17:54:48] i don't know what it was earlier [17:55:13] huh [17:55:15] And that's on bits? [17:55:22] that's on one bits box yes [17:55:27] cp1056 [17:55:30] Is that CC headers coming from Apache into Varnish, or coming out of Varnish into the world? [17:55:52] that's what varnish sends to the client, but iirc that's not overridden on bits, is it... [17:56:17] yeah should be the same [17:56:17] Well that distribution doesn't look like something ResourceLoader would generate [17:56:26] max-age=300 is conspicuously absent [17:57:09] bits varnish doesn't override CC [17:57:25] note that s-maxage=0 [17:57:37] that does prevent varnish from caching at all [17:57:47] Yeah that one is weird [17:57:59] But it's also max-age=86400 and I'm pretty sure we're configured to serve 300 in that case [17:58:33] I curled an RL URL that is referenced on the logged-out view of enwiki's main page and one of the requests was a cache miss, fwiw [17:58:50] ori-l: What was the CC header in the respose? 
[17:58:55] cache hit rate is 99.66% right now though [17:58:56] I got Cache-control:public, max-age=300, s-maxage=300 [17:58:59] was a bit lower earlier [17:58:59] Cache-control: public, max-age=300, s-maxage=300 [17:59:27] And Cache-control:public, max-age=2592000, s-maxage=2592000 for some other responses [17:59:59] Those are the two different kinds of headers that my browser observes from bits, and neither of those values is in mark's list [18:00:20] Though I guess I'm hitting esams bits right now, not eqiad [18:00:23] i've added and removed modules dozens of times and i've never seen something like this happen [18:00:27] arghhh [18:00:31] mediawiki sends Cache-control [18:00:34] no Cache-Control [18:00:42] GROAN [18:00:51] * ori-l headdesks [18:00:54] Good job, MediaWiki [18:00:58] 12229.30 TxHeader Cache-control: public, max-age=300, s-maxage=300 [18:00:58] 8718.30 TxHeader Cache-control: public, max-age=2592000, s-maxage=2592000 [18:00:58] 7597.30 TxHeader Cache-Control: max-age=2592000 [18:00:58] 1967.00 TxHeader Cache-Control: private, max-age=86400, s-maxage=0 [18:00:58] 33.60 TxHeader Cache-Control: s-maxage=2678400, max-age=2678400 [18:00:59] 0.87 TxHeader Cache-control: private, no-cache, must-revalidate [18:01:14] that makes more sense doesn't it [18:01:27] Yes [18:01:31] That makes a heck of a lot more sense [18:01:32] * Krinkle votes for lowercase headers all the way. Google style. [18:01:56] And that's a high number of 300/300 responses [18:02:20] Although of course I suppose that by their nature 300/300 resources are requested more often [18:02:31] And we don't know what the baseline looks like [18:04:45] (03CR) 10Andrew Bogott: "I'm pretty sure this was right before... do you have tests that show otherwise?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [18:05:55] anyway, i don't know how many url names changed effectively [18:06:04] but with 75k objects, it takes a while to reretrieve those from apaches [18:06:13] that can easily explain 15 mins of pain [18:06:45] yes. i accept the blame here. :/ [18:06:49] note though how there was a peak roughly every 5 mins in that apache graph [18:07:00] so that would point to max-age=300 I guess [18:07:03] Yea [18:07:07] $wgResourceModules['schema.HttpsSupport'] = $wgResourceModules['ext.coreEvents.httpsSupport'] = array() [18:07:09] ? [18:07:15] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [18:07:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 18:07:25 UTC 2013 [18:07:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [18:07:36] MaxSem: $wgResourceModules['schema.HttpsSupport'] = array( 'dependencies' => array( 'ext.coreEvents.httpsSupport' ); [18:07:39] but yeah, not enough data to definitely confirm [18:07:47] Let me check what happens when requesting a missing module [18:08:14] RoanKattouw, that's two different modules [18:09:07] So I was wrong, when requesting a nonexistent module there is no 300/300 response [18:09:08] At least not in master [18:09:15] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 
hours [18:09:15] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:16] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:16] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:17] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:17] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:18] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:18] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:19] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:20] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:20] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:20] PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:21] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:21] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:22] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:22] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:23] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:23] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:24] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [18:09:24] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No 
successful Puppet run in the last 10 hours [18:09:33] GJ icinga-wm [18:10:07] note that bits normally has a very good hit rate, well over 99% [18:10:11] so that means very few misses [18:10:17] Yeah [18:10:22] What was the hit rate like during that spike? [18:10:47] well check: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=varnish.cache_miss&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:10:53] so normally those boxes do like 25 misses a second [18:10:55] (each) [18:11:09] so about 100 total [18:11:16] Right [18:11:23] if that rises to 2000 as in that graph... :) [18:11:36] what happened at 18:00? [18:11:44] yeah I wonder too [18:11:51] So that's a ~90% hit rate then during that spike [18:11:54] 80-90 [18:12:28] (03CR) 10Chad: "I didn't import this or do anything with it, I just noticed that /usr/lib/nagios/plugins/check_elasticsearch didn't seem to exist on the n" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [18:13:28] hrm, just 4 bits app servers is a bit low [18:13:42] we can't really provision for a 20x load increase [18:13:50] but I'd say we can do better than 4 app servers in eqiad :P [18:14:37] (03CR) 10Andrew Bogott: [C: 04-1] "I'll set up a test -- meanwhile, -1." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [18:16:47] or perhaps we should just send RL requests to the general app server pool [18:18:23] We could do that [18:18:49] what was the rationale for separating them in the first place? [18:18:57] I think we originally made it a separate cluster because we were paranoid about load increases taking down the main pool [18:18:59] doable in 1 minute. 
shall I?:P [18:19:36] I'm happy that we started out this way because it allowed me to identify some performance gains in RL [18:19:47] yeah, the isolation was good especially in the beginning [18:19:52] it's still easier to spot problems this way [18:20:02] but it doesn't provide as good behaviour during spikes like these [18:20:04] what will remain on bits once RL is gone? [18:20:13] we're not talking about that max [18:20:33] MaxSem: it's whether RL requests get handled by the same apaches that render pages [18:20:42] good luck sending RL requests to text (squids) in 1 minute, it'll just break :) [18:20:43] I was able to see the precise CPU load caused by ResourceLoader on a Ganglia graph, and could see that a certain fix I deployed cut the load in half :) [18:20:45] ah, varnish backends [18:21:49] (03PS1) 10Odder: (bug 55342) Add an alias for NS_USER_TALK on kowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88529 [18:22:21] RoanKattouw: right, but you could differentiate CPU load generated by load.php reqs in other ways too [18:22:39] Jenkin-bot -1 this patch with some unit-test failure, the failure is not related to what I am deploying, is it okay I +2 it manually? 
https://gerrit.wikimedia.org/r/#/c/88520/ [18:23:01] ori-l: sure, but none of that tooling existed when RL got deployed ;) [18:23:06] ori-l: Not very practically with WMF technology in early 2011 [18:23:07] well there was some profiling [18:23:27] Although back then we did have job runners on the main cluster, which we reniced down so we'd get different colors in the graphs [18:23:34] heh [18:23:39] ok, food is ready [18:23:39] (and because renicing them down is generally a good idea, I guess) [18:25:02] bsitu, I personally merge manually because tests are useless for prod branches [18:25:07] OK, I'm still at home, going to head into the office, bbiab [18:25:07] so yes, it's OK [18:25:08] thanks for poking [18:25:09] but, going forward to do that diagnostics you don't need the separate cluster, right? [18:25:21] MaxSem: thx, :) [18:25:53] bsitu: it got a +1 upon recheck [18:25:55] so the problem is moot [18:26:06] ori-l: yeah [18:26:13] <^d> That's a lame timeout error. [18:26:23] <^d> We could maybe up the timeout on that test. [18:26:54] or make the tests faster [18:27:31] hphp! [18:27:49] <^d> Tests run hella fast on hhvm. [18:27:56] that'd be an idea [18:27:58] <^d> ~1.5min for the full suite. [18:28:28] have dual test runners at the beginning of our use of hphp, one with apache, one with hphp, make apache voting, but see where we need to fix things in hphp [18:28:44] something is still fishy [18:28:56] <^d> Also it's hhvm now. hphp is the old hphpc/hphpi crap :) [18:29:05] whatevs [18:29:07] ori-l: ? [18:29:10] with bits? 
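The "80-90" hit-rate estimate from the 18:10-18:11 exchange can be reproduced with a little arithmetic. This is a rough sketch, not a measurement: the log only gives about 100 misses/s in total normally, about 2000 misses/s during the spike, and a normal hit rate "well over 99%". The baseline hit rates tried below (99% and 99.5%) and the idea that total request volume stayed roughly constant during the spike are assumptions, and the function names are made up for illustration.

```python
def total_rps_from_baseline(miss_rps: float, baseline_hit_rate: float) -> float:
    """Infer total requests/s from a miss rate and a known hit rate."""
    return miss_rps / (1.0 - baseline_hit_rate)


def hit_rate(total_rps: float, miss_rps: float) -> float:
    """Fraction of requests answered from cache."""
    return 1.0 - miss_rps / total_rps


NORMAL_MISSES = 100.0   # "about 100 total" misses/s, from the log
SPIKE_MISSES = 2000.0   # "if that rises to 2000 as in that graph", from the log

if __name__ == "__main__":
    # Assumed baselines: the log only says "well over 99%".
    for baseline in (0.99, 0.995):
        total = total_rps_from_baseline(NORMAL_MISSES, baseline)
        spike = hit_rate(total, SPIKE_MISSES)
        print(f"baseline {baseline:.1%}: ~{total:.0f} req/s, "
              f"spike hit rate ~{spike:.0%}")
```

Under those assumptions a 20x rise in misses drags the hit rate from above 99% down to somewhere between 80% and 90%, which matches the range quoted in the channel.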
[18:29:26] things are OK now I mean, but the explanation still doesn't sit well with me [18:29:31] !log reedy synchronized php-1.22wmf20/includes/Wiki.php [18:29:41] bbiab [18:29:46] Logged the message, Master [18:29:50] ori-l: go forth and ponder [18:29:54] will run scap in a second [18:30:27] btw, for those following along, this one is indeed a planned deploy ;) [18:32:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [18:35:22] (03CR) 10Andrew Bogott: "Could not find dependency Package[icinga] for File[/usr/lib/nagios/plugins/check_elasticsearch] at /etc/puppet/modules/elasticsearch/manif" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [18:36:41] !log bsitu Started syncing Wikimedia installation... : Update Echo and Thanks to Master [18:36:55] Logged the message, Master [18:38:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 18:38:11 UTC 2013 [18:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [18:40:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.194 second response time [18:43:37] (03CR) 10Andrew Bogott: "The only right way to fix this dependency is to have icinga in a module. Which I might do soon anyway..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [18:53:08] (03PS2) 10Cmjohnson: setting up amssq48-62 as text varnish remove ssl role [operations/puppet] - 10https://gerrit.wikimedia.org/r/88516 [18:55:10] !log bsitu Finished syncing Wikimedia installation... 
: Update Echo and Thanks to Master [18:55:20] Logged the message, Master [18:56:25] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [18:57:10] (03CR) 10Bsitu: [C: 032] Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88246 (owner: 10Bsitu) [18:57:33] (03Merged) 10jenkins-bot: Enable Echo and Thanks on Various wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88246 (owner: 10Bsitu) [18:59:35] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:04:03] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo and Thanks on various wikis' [19:04:24] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on various wikis' [19:06:25] Logged the message, Master [19:06:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 19:06:21 UTC 2013 [19:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [19:06:37] Logged the message, Master [19:11:11] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo and Thanks on various wikis' [19:11:24] Logged the message, Master [19:16:25] !log bsitu synchronized php-1.22wmf19/extensions/Echo 'Touch Js file' [19:16:38] Logged the message, Master [19:37:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 19:37:38 UTC 2013 [19:38:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [19:42:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [19:43:10] (03CR) 10Cmjohnson: [C: 032] setting up amssq48-62 as text varnish remove ssl role [operations/puppet] - 10https://gerrit.wikimedia.org/r/88516 (owner: 10Cmjohnson) [19:56:20] (03Abandoned) 10Dr0ptp4kt: Adding Wikipedia Zero automation testing servers to XFF whitelist. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/74509 (owner: 10Dr0ptp4kt) [19:56:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.691 second response time [19:59:55] RECOVERY - Host mw1125 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:02:25] PROBLEM - RAID on mw1125 is CRITICAL: Connection refused by host [20:02:26] PROBLEM - Disk space on mw1125 is CRITICAL: Connection refused by host [20:02:45] PROBLEM - SSH on mw1125 is CRITICAL: Connection refused [20:04:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:05:12] !log maxsem synchronized php-1.22wmf19/extensions/MobileFrontend/javascripts/loggingSchemas/MobileWebClickTracking.js [20:05:25] Logged the message, Master [20:08:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 20:08:20 UTC 2013 [20:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [20:10:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.914 second response time [20:14:35] PROBLEM - NTP on mw1125 is CRITICAL: NTP CRITICAL: No response from NTP server [20:17:15] PROBLEM - Puppet freshness on osm-cp1001 is CRITICAL: No successful Puppet run in the last 10 hours [20:27:02] !log maxsem synchronized php-1.22wmf20/extensions/MobileFrontend/javascripts/loggingSchemas/MobileWebClickTracking.js [20:27:15] Logged the message, Master [20:38:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 20:38:36 UTC 2013 [20:39:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [21:00:16] (03CR) 10Ori.livneh: [C: 031] graphite: upstream doc in storage-aggregation.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/88469 (owner: 10Hashar) [21:06:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 21:06:50 
UTC 2013 [21:07:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [21:16:12] (03PS1) 10Reedy: wmgConfigDir to wmfConfigDir [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88638 [21:17:19] (03CR) 10Reedy: [C: 032] wmgConfigDir to wmfConfigDir [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88638 (owner: 10Reedy) [21:18:15] PROBLEM - Puppet freshness on mw1125 is CRITICAL: No successful Puppet run in the last 10 hours [21:18:34] (03PS1) 10Odder: (bug 54826) Enable EducationProgram on the Spanish Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88639 [21:23:05] (03Merged) 10jenkins-bot: wmgConfigDir to wmfConfigDir [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88638 (owner: 10Reedy) [21:23:15] (03PS5) 10BBlack: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [21:26:34] (03PS1) 10Odder: (bug 54223) Enable EducationProgram on the Czech Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88641 [21:35:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 21:35:22 UTC 2013 [21:35:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [21:40:48] !log aaron synchronized php-1.22wmf20/includes/WikiPage.php '337272176758b06c0ed9fd8859d358730152c155' [21:41:04] Logged the message, Master [21:44:00] !log aaron synchronized php-1.22wmf20/includes/LinksUpdate.php '337272176758b06c0ed9fd8859d358730152c155' [21:44:11] Logged the message, Master [21:47:32] <^d> Someone mind merging a super-easy puppet change for me? [21:51:09] (03PS1) 10Andrew Bogott: Remove unused files from files/mysql and templates/mysql. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88646 [21:54:57] (03PS1) 10Ryan Lane: Remove debhelpers for apparmor from labs image [operations/puppet] - 10https://gerrit.wikimedia.org/r/88648 [21:57:05] anyone have experience with dynamic ganglia metric modules? [22:02:41] mwalker: i'm not sure what you mean by 'dynamic', but anyways, sure -- just ask [22:03:38] basically -- I have a python metric module loaded by gmond -- it is running and is at least returning values to gmond -- but they do not show up in the xml generated when I query it over telnet [22:03:56] and now I'm stuck [22:06:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 22:05:58 UTC 2013 [22:06:32] I know that values sent using the gmetric tool will show up -- but they'll appear ungrouped which is why I'm attempting to extend gmond [22:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [22:07:30] mwalker: can i see your script somewhere? [22:08:28] mwalker: also, have you tried running gmond in the foreground and seeing the debug output? [22:09:04] i.e., service ganglia-monitor stop, then run /usr/sbin/gmond --debug=999 [22:09:52] http://pastebin.com/73bzCVfT and http://pastebin.com/wRY81vBM [22:10:00] and ya; the debug output lists the module [22:14:59] * ori-l looks [22:17:31] mwalker: 'value_unit': 'float', [22:17:44] I tried value_unit: uint as well [22:17:49] same story [22:18:03] yes, you should have tried value_type :D [22:18:48] if that's it... [22:18:52] :) [22:19:22] gaaaaaaaaah [22:19:26] heheh [22:19:52] debug output for this would've been nice! /me wonders how hard it would be to add it [22:20:57] there's a good chance gmetad was complaining about it in syslog on the receiving node [22:21:14] are you logging to ganglia.wikimedia.org? 
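For reference, the fix discussed at 22:17-22:18 comes down to one descriptor key: gmond's Python metric modules declare the data type under 'value_type', while 'units' is only a display label. Below is a minimal sketch of such a module, assuming the standard gmond Python module interface (metric_init returning a list of descriptor dicts, plus metric_cleanup). The metric name, group, and callback are hypothetical placeholders, not the contents of the pastebin scripts, which the log does not show.

```python
import time


def read_example_value(name):
    """Callback that gmond invokes with the metric name; returns the value.

    Placeholder reading: a real module would sample whatever it monitors.
    """
    return float(time.time() % 60.0)


def metric_init(params):
    """Called once by gmond at load time; returns the metric descriptors."""
    descriptor = {
        'name': 'example_metric',       # hypothetical metric name
        'call_back': read_example_value,
        'time_max': 60,
        'value_type': 'float',          # data type: the key the log's module had as 'value_unit'
        'units': 'seconds',             # display label only, not a type
        'slope': 'both',
        'format': '%f',                 # must agree with value_type
        'description': 'Hypothetical example metric',
        'groups': 'example',            # grouping, which plain gmetric values lacked per the log
    }
    return [descriptor]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release in this sketch."""
    pass


if __name__ == '__main__':
    # Stand-alone check outside gmond: print each metric once.
    for d in metric_init({}):
        print(d['name'], d['call_back'](d['name']))
```

Running the file directly exercises the callback without gmond, which is a quick way to catch a misspelled descriptor key before loading the module and reading the --debug output.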
[22:21:20] no; this is just on my local [22:21:47] I didn't see any messages from gmetad -- but maybe it doesnt log to /var/log/syslog [22:22:39] check /var/log/messages [22:23:49] doesn't exist on my local -- everything goes to syslog unless it's filtered out elsewhere [22:24:57] on precise it ends up in /var/log/messages by dint of the defaults set in /etc/rsyslog.d (IIRC) [22:25:15] Bitten by the dint. [22:25:28] !log moved fxp0 on cr1-esams into MGMT logical system [22:25:42] Logged the message, Mistress of the network gear. [22:26:27] thx lesliecarr [22:26:34] :) [22:26:54] !log reinstalling amssq48-62 [22:27:06] Logged the message, Master [22:28:06] woot [22:31:55] PROBLEM - SSH on amssq48 is CRITICAL: Connection timed out [22:35:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 22:35:44 UTC 2013 [22:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [22:36:45] RECOVERY - SSH on mw1125 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:37:35] PROBLEM - SSH on amssq49 is CRITICAL: Connection refused [22:40:25] PROBLEM - SSH on amssq50 is CRITICAL: Connection refused [22:42:25] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [22:43:45] PROBLEM - NTP on amssq48 is CRITICAL: NTP CRITICAL: No response from NTP server [22:46:10] (03CR) 10Ori.livneh: [C: 031] "LGTM. Thanks for the patch." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88424 (owner: 10Odder) [22:47:45] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 88.47 ms [22:49:35] PROBLEM - NTP on amssq49 is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:45] PROBLEM - SSH on amssq51 is CRITICAL: Connection refused [22:50:55] RECOVERY - SSH on amssq48 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:50:55] PROBLEM - SSH on amssq52 is CRITICAL: Connection refused [22:52:15] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [22:52:25] PROBLEM - NTP on amssq50 is CRITICAL: NTP CRITICAL: No response from NTP server [22:52:35] RECOVERY - SSH on amssq49 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:54:35] RECOVERY - Puppet freshness on mw1125 is OK: puppet ran at Tue Oct 8 22:54:30 UTC 2013 [22:55:25] RECOVERY - SSH on amssq50 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:56:35] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [23:00:45] RECOVERY - SSH on amssq51 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:01:45] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 91.53 ms [23:02:15] PROBLEM - NTP on amssq51 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:25] RECOVERY - DPKG on mw1125 is OK: All packages OK [23:02:26] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed [23:02:26] RECOVERY - Disk space on mw1125 is OK: DISK OK [23:02:35] PROBLEM - NTP on amssq52 is CRITICAL: NTP CRITICAL: No response from NTP server [23:03:02] ^d, I'm trying to git-review to the labs-private repo, and getting 'internal server error' [23:03:09] Is that something interesting, or just a dumb mistake on my end? [23:04:08] <^d> Well that's weird. 
[23:04:12] <^d> '[2013-10-08 22:58:30,905] ERROR com.google.gerrit.server.git.ReceiveCommits : Only 0 of 1 new change refs created in labs/private; aborting' [23:04:15] PROBLEM - Apache HTTP on mw1125 is CRITICAL: Connection refused [23:04:25] PROBLEM - SSH on amssq53 is CRITICAL: Connection refused [23:04:58] \o/ something interesting! [23:05:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 23:05:45 UTC 2013 [23:06:00] StevenW / superm401: You doing an LD right now? [23:06:05] RECOVERY - SSH on amssq52 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:06:09] James_F, yeah. [23:06:20] We haven't officially started, but I believe StevenW okayed it with greg-g [23:06:29] I'll let you know when we're done. [23:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [23:06:36] superm401: Yeah; we're OK'ed too, waiting on you. :-) [23:09:45] superm401: yessir, you and VE [23:13:21] It'd be nice if someone could deploy the fix to the 504 errors. 
[23:13:41] https://bugzilla.wikimedia.org/show_bug.cgi?id=54876#c9 [23:15:25] RECOVERY - SSH on amssq53 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:16:25] PROBLEM - NTP on amssq53 is CRITICAL: NTP CRITICAL: No response from NTP server [23:18:35] RECOVERY - NTP on mw1125 is OK: NTP OK: Offset -0.0004680156708 secs [23:20:10] (03PS1) 10Mattflaschen: Enable OB6 A/B test, with gating variable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88662 [23:20:15] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:23] (03CR) 10Mattflaschen: [C: 032] Enable OB6 A/B test, with gating variable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88662 (owner: 10Mattflaschen) [23:20:34] (03Merged) 10jenkins-bot: Enable OB6 A/B test, with gating variable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88662 (owner: 10Mattflaschen) [23:22:15] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:25] RECOVERY - Host amssq54 is UP: PING OK - Packet loss = 0%, RTA = 91.73 ms [23:27:05] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:25] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 91.94 ms [23:27:32] !log mflaschen synchronized wmf-config/CommonSettings.php 'A/B test for GettingStarted' [23:27:44] Logged the message, Master [23:27:45] PROBLEM - SSH on amssq54 is CRITICAL: Connection refused [23:28:02] !log mflaschen synchronized wmf-config/InitialiseSettings.php 'A/B test for GettingStarted' [23:28:17] Logged the message, Master [23:28:24] (03PS1) 10Andrew Bogott: Move mysql_wmf into a module. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 [23:28:26] PROBLEM - SSH on amssq56 is CRITICAL: Connection refused [23:28:45] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [23:29:25] PROBLEM - SSH on amssq55 is CRITICAL: Connection refused [23:29:28] !log mflaschen synchronized php-1.22wmf19/extensions/GettingStarted/ 'Deploy GettingStarted to wmf19 for A/B test' [23:29:41] Logged the message, Master [23:29:54] !log mflaschen synchronized php-1.22wmf20/extensions/GettingStarted/ 'Deploy GettingStarted to wmf20 for A/B test' [23:30:03] James_F, done. [23:30:08] Logged the message, Master [23:30:27] superm401: Thanks! rmoen, you're up. [23:30:29] Thanks ;) [23:32:15] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 90.02 ms [23:32:43] (03CR) 10jenkins-bot: [V: 04-1] Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 (owner: 10Andrew Bogott) [23:33:55] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 90.12 ms [23:34:15] PROBLEM - SSH on amssq57 is CRITICAL: Connection refused [23:34:25] PROBLEM - SSH on amssq59 is CRITICAL: Connection refused [23:34:35] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [23:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Tue Oct 8 23:35:29 UTC 2013 [23:36:05] PROBLEM - SSH on amssq58 is CRITICAL: Connection refused [23:36:26] PROBLEM - SSH on amssq60 is CRITICAL: Connection refused [23:36:33] (03PS2) 10Andrew Bogott: Move mysql_wmf into a module. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 [23:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [23:38:25] PROBLEM - SSH on amssq62 is CRITICAL: Connection refused [23:38:45] RECOVERY - SSH on amssq54 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:39:07] (03CR) 10Andrew Bogott: [C: 032] Remove the trivial class base::mwclient [operations/puppet] - 10https://gerrit.wikimedia.org/r/88214 (owner: 10Andrew Bogott) [23:39:35] PROBLEM - NTP on amssq54 is CRITICAL: NTP CRITICAL: No response from NTP server [23:39:45] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 90.13 ms [23:40:25] PROBLEM - NTP on amssq56 is CRITICAL: NTP CRITICAL: No response from NTP server [23:41:26] RECOVERY - SSH on amssq55 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:41:35] PROBLEM - NTP on amssq55 is CRITICAL: NTP CRITICAL: No response from NTP server [23:42:05] PROBLEM - SSH on amssq61 is CRITICAL: Connection refused [23:44:25] RECOVERY - SSH on amssq56 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:45:16] RECOVERY - SSH on amssq57 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:46:25] PROBLEM - NTP on amssq59 is CRITICAL: NTP CRITICAL: No response from NTP server [23:46:47] PROBLEM - NTP on amssq57 is CRITICAL: NTP CRITICAL: No response from NTP server [23:47:05] RECOVERY - SSH on amssq58 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:48:05] PROBLEM - NTP on amssq58 is CRITICAL: NTP CRITICAL: No response from NTP server [23:48:55] PROBLEM - NTP on amssq60 is CRITICAL: NTP CRITICAL: No response from NTP server [23:49:25] RECOVERY - SSH on amssq59 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:50:25] PROBLEM - NTP on amssq62 is CRITICAL: NTP CRITICAL: No response from NTP server [23:51:35] RECOVERY - SSH on amssq60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:53:05] 
RECOVERY - SSH on amssq61 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:53:16] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.110 second response time [23:53:55] PROBLEM - NTP on amssq61 is CRITICAL: NTP CRITICAL: No response from NTP server [23:54:25] RECOVERY - SSH on amssq62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:55:55] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [23:58:37] (03PS1) 10Cmjohnson: Revert "removing mw1125 from dsh files- new hard drive has been installed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88674 [23:59:20] (03PS2) 10Cmjohnson: Revert "removing mw1125 from dsh files- new hard drive has been installed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88674 [23:59:43] (03CR) 10Cmjohnson: [C: 032] Revert "removing mw1125 from dsh files- new hard drive has been installed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88674 (owner: 10Cmjohnson)