[00:06:11] (03PS1) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[00:23:51] (03CR) 10Dzahn: [C: 032] "labs-tested" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108674 (owner: 10Dzahn)
[00:27:34] ooh. sigh, of course minus that one sigh for style reasons
[00:27:55] come on gerrit
[00:28:52] (03PS1) 10Dzahn: rename planet::site to planet::apachesite [operations/puppet] - 10https://gerrit.wikimedia.org/r/118224
[00:30:21] (03CR) 10Dzahn: [C: 032] rename planet::site to planet::apachesite [operations/puppet] - 10https://gerrit.wikimedia.org/r/118224 (owner: 10Dzahn)
[00:35:19] anyone around want to help me test pushbot?
[00:35:25] just /join #gregtest
[00:45:27] (03CR) 10Dzahn: "ran on zirconium. planet still fine. apache graceful'ed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108674 (owner: 10Dzahn)
[00:59:03] bd808: Re https://bugzilla.wikimedia.org/show_bug.cgi?id=60166 (ntpd warnings): Do you have access to an "ordinary" server to see if this problem is limited to Labs?
[00:59:50] scfc_de: I don't think I have access to tail /var/log/syslog
[01:00:27] Wait, I do. I have sudo on logstash nodes
[01:00:36] * bd808 goes to check
[01:02:44] scfc_de: `grep ntpd /var/log/syslog` returns nothing on logstash1001.
[01:03:36] (03PS1) 10Rush: CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226
[01:04:06] bd808: Okay, thanks, then I'll submit a patch that only adds "interface listen ipv4" for Labs instances.
[01:04:20] cool
[01:11:28] (03CR) 10Dzahn: [C: 04-1] CST time period (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush)
[01:16:49] (03PS2) 10Rush: CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226
[01:18:22] (03CR) 10Dzahn: [C: 031] CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush)
[01:28:14] (03PS1) 10MaxSem: Detect mobile site by X-Subdomain instead of X-WAP [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118227
[01:33:01] (03PS3) 10Tim Landscheidt: Authorize rush for Icinga, add him to contact group sms and add CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush)
[02:03:58] (03PS1) 10Springle: S1 deploy db1061 db1062, S2 deploy db1063 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118228
[02:05:50] (03CR) 10Springle: [C: 032] S1 deploy db1061 db1062, S2 deploy db1063 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118228 (owner: 10Springle)
[02:10:25] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-12 02:10:25+00:00
[02:10:37] Logged the message, Master
[02:15:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[02:16:32] (03PS1) 10Springle: S1 uses only mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/118229
[02:17:54] (03CR) 10Springle: [C: 032] S1 uses only mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/118229 (owner: 10Springle)
[02:18:12] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-12 02:18:12+00:00
[02:18:23] Logged the message, Master
[02:26:43] (03PS1) 10Springle: DB nodes whitespace [operations/puppet] - 10https://gerrit.wikimedia.org/r/118230
[02:32:49] !log springle synchronized wmf-config/db-eqiad.php 's1 drop load during xtrabackup clone db1051 to db1061'
[02:32:57] Logged the message, Master
[02:35:18] PROBLEM - mysqld processes on db1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:35:28] PROBLEM - mysqld processes on db1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:36:11] oh, hi neon
[02:43:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Mar 12 02:42:59 UTC 2014 (duration 42m 58s)
[02:43:13] Logged the message, Master
[02:48:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[03:31:17] (03CR) 10Springle: [C: 032] DB nodes whitespace [operations/puppet] - 10https://gerrit.wikimedia.org/r/118230 (owner: 10Springle)
[03:33:23] (03PS1) 10Springle: Default innodb_file_per_table=OFF for S3 as that shard has many more tables than the others (OFF reduces file handle usage, among other things). [operations/puppet] - 10https://gerrit.wikimedia.org/r/118232
[03:35:01] (03CR) 10Springle: [C: 032] Default innodb_file_per_table=OFF for S3 as that shard has many more tables than the others (OFF reduces file handle usage, among other thin [operations/puppet] - 10https://gerrit.wikimedia.org/r/118232 (owner: 10Springle)
[04:49:32] (03PS1) 10Springle: import dbtree hotfix changes, recent and old, made on fenari [operations/software] - 10https://gerrit.wikimedia.org/r/118235
[04:50:33] (03CR) 10Springle: [C: 032] import dbtree hotfix changes, recent and old, made on fenari [operations/software] - 10https://gerrit.wikimedia.org/r/118235 (owner: 10Springle)
[04:52:34] !log springle synchronized wmf-config/db-eqiad.php 's1 db1051 warm up'
[04:52:43] Logged the message, Master
[05:15:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[05:17:45] (03PS1) 10Springle: MariaDB uses ARIA instead of MyISAM for implicit temporary tables that hit disk for whatever reason. The default 128M page cache is a bottleneck particularly on slaves in 'vslow' load balancing group. [operations/puppet] - 10https://gerrit.wikimedia.org/r/118236
[05:19:46] (03CR) 10Springle: [C: 032] MariaDB uses ARIA instead of MyISAM for implicit temporary tables that hit disk for whatever reason. The default 128M page cache is a bottlene [operations/puppet] - 10https://gerrit.wikimedia.org/r/118236 (owner: 10Springle)
[05:34:25] (03PS1) 10Springle: s1 pool db1061 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118237
[05:34:59] (03CR) 10Springle: [C: 032] s1 pool db1061 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118237 (owner: 10Springle)
[05:35:12] (03Merged) 10jenkins-bot: s1 pool db1061 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118237 (owner: 10Springle)
[05:38:48] !log springle synchronized wmf-config/db-eqiad.php 's1 pool db1061, warm up'
[05:38:57] Logged the message, Master
[05:49:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[06:18:03] !log xtrabackup clone db1018 to db1063
[06:18:16] Logged the message, Master
[08:16:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[08:40:15] mutante: just out of curiosity
[08:40:21] https://meta.wikimedia.org/w/index.php?title=Talk:Planet_Wikimedia&diff=next&oldid=5744292
[08:40:24] why the dot?
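The ntpd fix settled on at [00:59]–[01:04] narrows ntpd to IPv4 so Labs instances stop logging interface warnings. A minimal sketch of what such a change amounts to on a single instance, assuming ntpd 4.2.6+ syntax; the real fix went through the puppet template, whose path is not shown in the log:

```sh
# Restrict ntpd to IPv4 sockets, then restart and re-check syslog with
# the same grep bd808 used above.
echo 'interface listen ipv4' | sudo tee -a /etc/ntp.conf
sudo service ntp restart
grep ntpd /var/log/syslog | tail
```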
[08:46:22] hello
[08:49:06] hi hashar
[08:50:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[08:53:22] (03PS1) 10Matanya: apache :remove hardy var from appserver envvars [operations/puppet] - 10https://gerrit.wikimedia.org/r/118240
[09:11:37] (03PS1) 10Matanya: bckup: remove db10 from disklist [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241
[09:12:29] !log Jenkins broken again! Good morning.
[09:12:39] Logged the message, Master
[09:13:47] !log restarting Jenkins
[09:13:58] Logged the message, Master
[09:15:53] !log kill -9 of Jenkins since it is unresponsive
[09:16:15] Logged the message, Master
[09:21:30] hashar: time to move to travis?
[09:21:54] no third parties
[09:23:01] (03CR) 10Alexandros Kosiaris: [C: 032] apache :remove hardy var from appserver envvars [operations/puppet] - 10https://gerrit.wikimedia.org/r/118240 (owner: 10Matanya)
[09:23:18] morning joy :)
[09:23:38] (03PS1) 10Hashar: contint: prevent SISTRIX Crawler from browsing Jenkins [operations/puppet] - 10https://gerrit.wikimedia.org/r/118242
[09:23:40] (03PS1) 10Hashar: contint: deny access to RSS feeds on main Jenkins page [operations/puppet] - 10https://gerrit.wikimedia.org/r/118243
[09:23:51] akosiaris: got some more changes to merge for you if you dont mind ^^^^
[09:24:05] akosiaris: related to Jenkins being hit by web crawler and killing its webprocess
[09:24:33] another java app being killed by a web crawler... why am I not surprised ?
[09:25:21] :-]
[09:25:22] (03CR) 10Alexandros Kosiaris: [C: 032] contint: deny access to RSS feeds on main Jenkins page [operations/puppet] - 10https://gerrit.wikimedia.org/r/118243 (owner: 10Hashar)
[09:25:32] it is missing a pool counter on some heavy CPU operations
[09:25:48] + gallium has very slow IO since we upgraded it to Precise :-(
[09:26:10] (03CR) 10Alexandros Kosiaris: [C: 032] contint: prevent SISTRIX Crawler from browsing Jenkins [operations/puppet] - 10https://gerrit.wikimedia.org/r/118242 (owner: 10Hashar)
[09:26:36] huh ?
[09:26:46] how is that possible ?
[09:26:53] no idea
[09:27:10] various ops looked at it and we could not find the issue
[09:27:19] the workaround has been to migrate some heavy I/O operations to happen on tmpfs
[09:27:27] but Jenkins still uses the disk
[09:27:42] I guess that will be fixed up whenever gallium is replaced :]
[09:28:04] reloaded apache. thanks!
[09:31:27] :)
[09:33:53] hashar: only I/O ?
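A quick way to verify a crawler blacklist like the one merged at [09:26] from the outside; the URL is real, but the expected status codes and the exact User-Agent string are assumptions:

```sh
# Request the Jenkins UI with the crawler's User-Agent and print only
# the HTTP status: 403 means the env-var blacklist matched, 200 means
# it did not -- which is exactly what surfaces next in the log.
curl -s -o /dev/null -w '%{http_code}\n' \
    -A 'SISTRIX Crawler' https://integration.wikimedia.org/ci/
```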
[09:34:12] and the stupid regex does not work :-(
[09:42:14] (03PS1) 10Hashar: contint: typo in env var prevented blacklisting [operations/puppet] - 10https://gerrit.wikimedia.org/r/118244
[09:42:31] akosiaris: and the mandatory follow up, I made a typo sorry https://gerrit.wikimedia.org/r/118244
[09:43:29] (03CR) 10Alexandros Kosiaris: [C: 032] contint: typo in env var prevented blacklisting [operations/puppet] - 10https://gerrit.wikimedia.org/r/118244 (owner: 10Hashar)
[09:44:40] will restart Jenkins once again :(
[09:44:57] !log rerestarting Jenkins
[09:45:07] Logged the message, Master
[09:47:26] (03PS1) 10Springle: s2 pool db1060 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245
[09:50:03] (03PS2) 10Springle: s2 pool db1063 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245
[09:50:25] (03CR) 10Springle: [C: 032] s2 pool db1063 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245 (owner: 10Springle)
[09:50:32] (03Merged) 10jenkins-bot: s2 pool db1063 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245 (owner: 10Springle)
[09:51:42] !log springle synchronized wmf-config/db-eqiad.php 's2 pool db1063, warm up'
[09:52:52] Logged the message, Master
[09:56:52] akosiaris: any point in having two files which are exactly the same in dsh group? apaches==apaches-eqiad
[09:57:12] i would remove the latter, unless i'm missing something
[09:57:25] they just became the same i suppose ?
[09:57:39] due to tamp apaches being shutdown just this week ?
[09:57:43] tampa*
[09:57:46] yes
[09:57:52] that is my point
[09:58:35] so, can i remove it? what do you recommend ?
[09:58:43] we are months away from having a new DC with apaches ready
[09:59:01] and i hope dsh will be out by then :)
[09:59:19] yeah, I hope too but I am not so optimistic
[09:59:41] will leave it for now
[09:59:53] * matanya is on a cleanup mania
[10:00:34] I never liked those different sets of files
[10:01:02] that would easily end up having say machine A but not B because someone forgot to update both
[10:01:23] anyway don't remove it for now, probably people still do dsh -g apaches
[10:01:34] sure
[10:01:36] and dsh -g apaches-eqiad
[10:01:53] you know what ? wanna link them ?
[10:02:05] so that both exist but only one needs to be updated ?
[10:02:13] and we split it up again if the need arises
[10:02:26] that was my first thought
[10:02:36] but i fear it will cause confusion
[10:02:56] if one will try to edit the link, it will be weird
[10:03:05] unless it is configured in puppet
[10:03:42] hmm let me see
[10:17:36] !log started s4 dump for toolserver on db72 /a
[10:17:46] Logged the message, Master
[10:19:58] (03PS1) 10Hashar: contint: retab to get rid of tabulations [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248
[10:29:11] MaxSem: hey, when https://gerrit.wikimedia.org/r/#/c/116147/ gets deployed, could you ping me?
[10:29:51] MaxSem: or just tell me now when it's scheduled for :) is it next tuesday?
[10:29:54] I always get confused
[10:30:13] paravoid, it will be on the normal train, so next Thursday it will be everywhere
[10:32:03] ok
[10:32:12] thanks
[10:32:15] do we store apache logs in logstash /ES ?
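On the dsh thread at [09:56]–[10:03]: the link akosiaris floats keeps both group names valid while leaving a single file to maintain. A minimal sketch, assuming the stock /etc/dsh/group layout (the actual group directory on the deployment host is not shown in the log):

```sh
# Make apaches-eqiad an alias of apaches so only one host list is ever
# edited; splitting them again later is just replacing the symlink
# with a real file.
cd /etc/dsh/group
sudo ln -sf apaches apaches-eqiad
dsh -g apaches -- uptime          # both invocations now read
dsh -g apaches-eqiad -- uptime    # the same host list
```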
[10:32:27] (03CR) 10MaxSem: [C: 032] Detect mobile site by X-Subdomain instead of X-WAP [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118227 (owner: 10MaxSem)
[10:32:35] (03Merged) 10jenkins-bot: Detect mobile site by X-Subdomain instead of X-WAP [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118227 (owner: 10MaxSem)
[10:32:37] heh I was about to ask that
[10:32:48] meanwhile, small preparation^^^
[10:32:59] and then we need to wait a month to remove it from varnish... :)
[10:33:10] since existing cache objects Vary on X-WAP
[10:33:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:33:41] fun
[10:34:16] i guess ori would know, but he seems absent
[10:34:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.742 second response time
[10:35:15] matanya: ori works from home this week and most probably went back to a normal sleep schedule
[10:35:40] i don't think he knows what that means
[10:36:08] (03PS1) 10Faidon Liambotis: varnish: don't set X-WAP on mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249
[10:36:45] paravoid: btw, would love if you can do security scanning on https://gerrit.wikimedia.org/r/#/c/116945/
[10:36:52] (03CR) 10Faidon Liambotis: [C: 04-2] "To be merged 31 days after March 20th." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249 (owner: 10Faidon Liambotis)
[10:36:54] (03PS1) 10ArielGlenn: snapshot module in use on a pmtpa snap host [operations/puppet] - 10https://gerrit.wikimedia.org/r/118250
[10:37:40] matanya: sean would be the person to review this, adding him
[10:37:42] paravoid, can we just ban objects with X-WAP: yes?
[10:37:47] thanks
[10:38:45] MaxSem: that's not the point, the point is that even for X-WAP: no objects, MW returned Vary: X-WAP, so Varnish has recorded that in the cache objects it stored
[10:39:11] mmm, and will result in cache miss...
[10:39:15] so when a request comes without an X-WAP header, it'll assume that the cache object isn't valid for this request and will do a fetch from the backend
[10:39:24] yup
[10:39:40] (03CR) 10ArielGlenn: [C: 032] snapshot module in use on a pmtpa snap host [operations/puppet] - 10https://gerrit.wikimedia.org/r/118250 (owner: 10ArielGlenn)
[10:39:40] if we only had a hex editor for cache:P
[10:39:59] so we need to have mediawiki just emit cache objects with no Vary: X-WAP for our TTL
[10:40:04] then remove the header
[10:40:09] that's 31 days, apparently
[10:40:55] by the way, was there any noticeable change in cache hit ratio when we killed WAP in VCL?
[10:41:08] !log maxsem synchronized wmf-config/mobile.php 'https://gerrit.wikimedia.org/r/118227'
[10:41:13] I didn't observe it but I wouldn't suppose so
[10:41:26] who uses WAP anyway...
[10:41:39] Logged the message, Master
[10:42:55] well, before I fixed that regex in September, WAP was served in ~10% of requests
[10:43:12] after that, in 0.1% indeed
[10:44:29] lol, really?
[10:44:30] 10%?
[10:44:56] (The "channel logs" link in the topic doesn't work btw.)
[10:46:15] it does pajz, you need to remove the %
[10:47:36] matanya, hm? http://ur1.ca/edq22 gives me "The requested URL /~wm-bot/logs/#wikimedia-operations/ was not found on this server"
[10:50:21] oh, i see now pajz weird, it was open on my browser, didn't notice it is old
[10:50:25] sorry
[10:51:25] pajz: http://tools.wmflabs.org/wm-bot/logs/
[10:54:52] pajz: see http://lists.wikimedia.org/pipermail/wikitech-l/2014-March/075102.html
[10:55:44] I see, thanks.
[10:58:09] (03PS1) 10Alexandros Kosiaris: volatile puppet reqs should go to frontend only [operations/puppet] - 10https://gerrit.wikimedia.org/r/118251
[11:00:43] (03CR) 10Alexandros Kosiaris: [C: 032] volatile puppet reqs should go to frontend only [operations/puppet] - 10https://gerrit.wikimedia.org/r/118251 (owner: 10Alexandros Kosiaris)
[11:04:10] (03PS1) 10ArielGlenn: fix snapshot module dependency cycle [operations/puppet] - 10https://gerrit.wikimedia.org/r/118252
[11:06:35] (03CR) 10ArielGlenn: [C: 032] fix snapshot module dependency cycle [operations/puppet] - 10https://gerrit.wikimedia.org/r/118252 (owner: 10ArielGlenn)
[11:17:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[11:27:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Quite the contrary. It is a breaking change with error:" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya)
[11:32:21] (03PS7) 10Matanya: webserver: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454
[11:33:21] (03PS6) 10Alexandros Kosiaris: ganglia_new: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107128 (owner: 10Matanya)
[11:48:53] (03PS1) 10ArielGlenn: all snapshot hosts using snapshot module [operations/puppet] - 10https://gerrit.wikimedia.org/r/118258
[11:50:56] (03CR) 10ArielGlenn: [C: 032] all snapshot hosts using snapshot module [operations/puppet] - 10https://gerrit.wikimedia.org/r/118258 (owner: 10ArielGlenn)
[11:51:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[11:55:01] (03PS1) 10ArielGlenn: remove snapshots.pp manifest, no longer used [operations/puppet] - 10https://gerrit.wikimedia.org/r/118260
[11:56:49] (03CR) 10ArielGlenn: [C: 032] remove snapshots.pp manifest, no longer used [operations/puppet] - 10https://gerrit.wikimedia.org/r/118260 (owner: 10ArielGlenn)
[12:07:28] hmm let me see <-- did you see something interesting ?
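Back to the mobile-cache thread at [10:38]–[10:40]: the Vary interaction paravoid describes can be observed from the edge with curl. While stored objects still carry Vary: X-WAP, a request lacking the header cannot be answered from an object cached with it. The hostname is real; which headers the cache echoes back is an assumption:

```sh
# Fetch the same page with and without an X-WAP header and compare the
# Vary and X-Cache response headers; under Vary: X-WAP the two requests
# hit different cache objects even though the bodies are identical.
curl -sI -H 'X-WAP: no' http://en.m.wikipedia.org/wiki/Main_Page | grep -iE '^(vary|x-cache)'
curl -sI http://en.m.wikipedia.org/wiki/Main_Page | grep -iE '^(vary|x-cache)'
```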
[12:58:52] (03PS1) 10Cmjohnson: Relocaing mw1201 -1203, This is the dns change [operations/dns] - 10https://gerrit.wikimedia.org/r/118270
[12:59:19] !log shutting down and relocating mw1201, mw1202, mw1203 to d5-eqiad
[13:00:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:01:12] Logged the message, Master
[13:03:38] PROBLEM - Host mw1201 is DOWN: PING CRITICAL - Packet loss = 100%
[13:04:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.158 second response time
[13:05:38] PROBLEM - HTTP on fenari is CRITICAL: Connection timed out
[13:07:28] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.073 second response time
[13:07:38] PROBLEM - Host mw1202 is DOWN: PING CRITICAL - Packet loss = 100%
[13:08:08] (03PS2) 10Faidon Liambotis: Relocating mw1201-1203 to row D [operations/dns] - 10https://gerrit.wikimedia.org/r/118270 (owner: 10Cmjohnson)
[13:10:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:12:08] PROBLEM - Host mw1203 is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:23] wikitech times out on me
[13:13:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.314 second response time
[13:27:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:28] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.022 second response time
[13:31:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:38:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.924 second response time
[13:41:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:43:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:44:28] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 4.471 second response time
[13:49:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.836 second response time
[13:50:52] (03CR) 10Ottomata: "Ah thanks Alex, I saw the numbers change but not the booleans." [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya)
[13:52:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.265 second response time
[13:57:06] apergos: do you know the status of https://rt.wikimedia.org/Ticket/Display.html?id=4196 ?
[13:58:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.494 second response time
[14:02:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:08:28] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 5.430 second response time
[14:13:13] matanya: nope, never heard an update after I asked
[14:13:25] too bad :/
[14:15:51] Coren: see virt0 alerts above; also, can you fix puppet on labstore4 & labstore1001?
[14:16:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:16:48] paravoid: They're disabled on purpose during migration; to be reenabled/killed off next week. Ima go see why the http is flapping on virt0 now.
[14:17:38] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 6.908 second response time
[14:17:46] I told you last time, we don't disable puppet for large periods of time
[14:17:51] do your changes in puppet
[14:18:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[14:19:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.126 second response time
[14:20:17] Oh fun; wikitech is being hammered by some sort of spambot.
[14:20:30] (03CR) 10Cmjohnson: [C: 032] Relocating mw1201-1203 to row D [operations/dns] - 10https://gerrit.wikimedia.org/r/118270 (owner: 10Cmjohnson)
[14:20:59] Coren: why aren't those changes done in puppet?
[14:21:48] !log deploying new swift ring @ eqiad, setting weight from 100 to 2000 on all disks
[14:22:11] (so that we can add the new boxes with weight 3000)
[14:22:25] this needs a rebalance, so it'll take some time
[14:22:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:41] Reassigned 442 (0.67%) partitions. Balance is now 0.83.
[14:22:45] mark: The short of it: because the settings are mutually incompatible. I either need a shitton of conditional crap for two weeks, or disable puppet on at least one. (On the other, I could actually just add the temporary ssh keys and reenable puppet)
[14:23:05] you can at least disable all the labs classes and keep base active
[14:23:15] that's better than disabling puppet completely
[14:23:15] mark: Ah, yes, that I could do.
[14:23:36] * Coren goes do that then.
[14:23:39] thanks
[14:23:56] :)
[14:24:40] Logged the message, Master
[14:28:35] hi bd808, do you know if apache logs are in elastic search?
[14:29:07] i.e. brought there by logstash
[14:29:22] matanya: The fatals log is in logstash but not the access log
[14:30:23] thanks bd808, i'm asking in regard to files/icinga/check_bad_apaches
[14:30:32] * bd808 looks
[14:30:55] which might better be replaced by logstash nagios plugin check instead of this hacky way
[14:31:01] !log springle synchronized wmf-config/db-eqiad.php 's2 db1063 full steam'
[14:32:54] matanya: That script counting segfaults and ??? per host and reporting those with >10 occurrences?
[14:33:28] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.030 second response time
[14:33:37] I'm not sure what event writes "Allowedd" into the apache log stream
[14:33:37] not sure what is Allowedd
[14:33:52] does it exist in prod's syslog ?
[14:34:45] Ryan_Lane: wrote this check, might be wise to ask him
[14:35:09] matanya: The apache log feed in logstash does show segfaults. I made a dashboard for that.
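Stepping back to the swift ring change logged at [14:21]: raising every existing weight from 100 to 2000 leaves headroom to add the new boxes at 3000 without handing them a disproportionate share of partitions. A hedged sketch of the ring-builder flow; the builder filename, zone, address and device are illustrative, and only the weights come from the log:

```sh
# Reweight an existing device (repeated, or looped, per device), add a
# device on one of the new boxes, then rebalance; rebalance prints the
# "Reassigned ... partitions" summary quoted in the log.
swift-ring-builder object.builder set_weight d0 2000
swift-ring-builder object.builder add z4-10.64.0.100:6000/sda1 3000
swift-ring-builder object.builder rebalance
```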
[14:35:26] anyway, it makes sense to me to replace those with alert
[14:35:37] But logstash needs work before it's ready to be critical infrastructure
[14:35:47] just pointing out
[14:36:06] i think it might be a wise step in the right direction
[14:36:09] Specifically this HA bug needs a resolution https://bugzilla.wikimedia.org/show_bug.cgi?id=61785
[14:36:13] * bd808 nods
[14:36:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:36:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:36:50] logstash -> graphite -> icinga would be a good toolchain to have working
[14:37:24] yes, definitely
[14:38:05] apergos: Did my clean up of l10n cache files yesterday help with your disk space issues?
[14:38:20] greg-g: around?
[14:39:21] as for HA, in my $day_job i multicast for the shippers, so any server in the logging cluster that is listening gets the message
[14:40:00] and a small tool verifies that ES doesn't get duplicated messages in the ES gateway
[14:40:13] (03CR) 10Springle: [C: 04-2] "-2 only until I can test and roll this out near the start of my day, then be around to watch icinga :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116945 (owner: 10Matanya)
[14:40:45] springle: -1 would be enough :P
[14:41:25] heh ok.. -1 in web ui says there is a problem, -2 says do not submit
[14:41:28] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.460 second response time
[14:41:40] matanya: That would be nice. The way that we send from udp2log right now has "issues" in that a single log event can be split across multiple packets and has to be reassembled in logstash. It would be nice to replace that too.
[14:42:28] though the kafka solution will probably work too
[14:43:09] It needs Ops love. I won't have time to think about it for at least a month :(
[14:43:47] is anyone deploying?
[14:43:55] * bd808 is not
[14:44:04] what are we deploying?
[14:44:16] since i know nothing about udp2log, i can't really replace it :)
[14:44:20] aude: We? Nothing... but I broke some stuff in CentralAuth :P
[14:44:29] ok :)
[14:44:31] Not myself, but by code review :/
[14:45:11] ok, will go for it
[14:48:02] matanya: I hooked into udp2log because it was there. :) The logs I'm getting from it right now are files on fluorine that could be processed with a different shipper.
[14:48:49] bd808: would it be better to use logstash instead ?
[14:50:37] matanya: Possibly, or something a little lighter like lumberjack or beaver
[14:50:40] bd808: yes, it looks much better, thank you
[14:51:11] apergos: Cool. I'll try to make that cleanup a regular part of the deploy train
[14:51:13] daw, wikitech down?
[14:51:15] lumberjack/beaver makes a lot of sense to me
[14:51:32] ottomata: being hit by spambots
[14:51:42] Blame springle for being around
[14:52:06] wasn't me
[14:52:11] (whatever it is)
[14:52:18] :D
[14:52:20] man spambots, I'm trying to use that
[14:52:22] buncha jerks
[14:52:40] What for? https://wikitech-static.wikimedia.org/ any use?
[14:52:56] naw, i was trying to edit and upload files
[14:52:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[14:53:40] !log hoo synchronized php-1.23wmf17/extensions/CentralAuth/ 'Fix global account deletion'
[14:54:08] !log syncing to mw120[1-3] failed
[14:54:19] error?
[14:54:21] oh no
[14:54:25] mw1202: ssh: connect to host mw1202 port 22: Connection timed out
[14:54:33] mw1201: ssh: connect to host mw1201 port 22: No route to host
[14:54:37] mw1203: ssh: connect to host mw1203 port 22: Connection timed out
[14:54:38] thanks for fixing this hoo
[14:54:39] hoo: mw1201-1203 are down right
[14:54:40] now
[14:54:45] ah ok :)
[14:55:33] mh, the bot logging to wikitech is down?
[14:55:41] wikitech is down
[14:55:58] :P
[14:56:01] ah...
[14:56:54] And there you go
[15:00:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.093 second response time
[15:02:45] RECOVERY - Host mw1201 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[15:02:55] PROBLEM - twemproxy process on mw1201 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker
[15:02:55] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:03:05] PROBLEM - Apache HTTP on mw1201 is CRITICAL: Connection refused
[15:03:15] PROBLEM - NTP on mw1201 is CRITICAL: NTP CRITICAL: Offset unknown
[15:03:26] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:05:15] RECOVERY - NTP on mw1201 is OK: NTP OK: Offset -0.001842737198 secs
[15:05:36] (03PS1) 10coren: Disable labsness of labstore[34] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118287
[15:06:15] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.861 second response time
[15:07:15] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.075 second response time
[15:09:35] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.687 second response time
[15:12:25] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:35] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:38] (03CR) 10coren: [C: 04-1] "Having created the user in LDAP (which is indeed the correct thing to do), it's not necessary to create it locally with generic::systemuse" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar)
[15:13:45] (03CR) 10coren: [C: 032] "This patch's primary feature is to shut icinga up." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118287 (owner: 10coren)
[15:18:05] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Mar 12 15:17:58 UTC 2014
[15:21:25] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 3.920 second response time
[15:22:09] (03PS1) 10Cmjohnson: Renaming db1065 and 1065 to dbstore1001 and dbstore1002 [operations/dns] - 10https://gerrit.wikimedia.org/r/118290
[15:22:15] (03PS1) 10Alexandros Kosiaris: Silence netmapper_update cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/118291
[15:22:58] (03CR) 10Cmjohnson: [C: 032] Renaming db1065 and 1065 to dbstore1001 and dbstore1002 [operations/dns] - 10https://gerrit.wikimedia.org/r/118290 (owner: 10Cmjohnson)
[15:25:28] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time
[15:25:28] akosiaris: regarding carbon. I added 2 3TB disks to it that you can manipulate however you need. The OS is still on the same 2 500GB disks.
[15:26:25] I still have tftp going to brewster from eqiad for the time being. Let me know what else you need
[15:26:27] cmjohnson1: OK thanks. I'll install the OS on those two 3TB disks. I might need you to unplug at some point those 2 500Gb disks and plug the other 2 3TB.
[15:26:38] cmjohnson1: ok will do
[15:27:01] akosiaris: read the comment from paravoid https://rt.wikimedia.org/Ticket/Display.html?id=6801
[15:27:18] yeah I had the same problem with labsdbs
[15:27:27] I have a working config for GPT already
[15:27:59] oh..cool...do you want me to add the other 2 3TB disks we have...rob purchased 4 total for it.
[15:28:25] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time
[15:28:26] sure. When do you want to do it ?
[15:28:39] I can do it now if you like
[15:28:47] go ahead then :-)
[15:32:25] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10)
[15:33:37] RECOVERY - Host mw1202 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[15:33:37] RECOVERY - Host mw1203 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[15:35:37] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:36:18] akosiaris: new disks are there..all yours
[15:36:27] PROBLEM - twemproxy process on mw1203 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker
[15:36:27] PROBLEM - twemproxy process on mw1202 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker
[15:38:47] thanks
[15:43:12] bblack: mind if I merge https://gerrit.wikimedia.org/r/#/c/118291 ? I suppose you are not using those emails for something right ?
[15:43:59] !log ms-be1005 going down to fix mgmt
[15:44:10] ACKNOWLEDGEMENT - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris reinstalling with precise and more disks
[15:46:37] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:27] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:47:35] (03PS3) 10Hashar: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071
[15:48:17] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[15:49:08] welcome back carbon
[15:50:11] akosiaris: if carbon is ok now, can i move brewster stuff to it?
[15:50:24] it is not ok
[15:50:37] PROBLEM - Squid on carbon is CRITICAL: Connection timed out
[15:50:37] PROBLEM - HTTP on carbon is CRITICAL: No route to host
[15:50:57] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[15:52:27] (03PS2) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[15:57:27] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 8.835 second response time
[15:59:27] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.419 second response time
[15:59:58] hoo|away: I am now :)
[16:04:27] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
[16:04:47] (03CR) 10MaxSem: [C: 031] varnish: don't set X-WAP on mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249 (owner: 10Faidon Liambotis)
[16:07:52] akosiaris: yes, go ahead! :)
[16:08:16] (03CR) 10Alexandros Kosiaris: [C: 032] Silence netmapper_update cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/118291 (owner: 10Alexandros Kosiaris)
[16:08:28] who would be the person to ask about ldap ?
[16:08:46] i want to move formey's ldap stuff to eqiad
[16:08:54] any machine for that?
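On the carbon reinstall above: paravoid's RT comment and akosiaris's "working config for GPT" both point at the same constraint, namely that msdos partition tables stop at 2 TiB, so the new 3 TB disks need GPT labels (in the installer, via a partman recipe). A minimal manual equivalent, with an illustrative device name:

```sh
# Label a >2TiB disk with GPT and create one full-size partition; an
# msdos label would silently cap the usable space at 2TiB.
sudo parted -s /dev/sdc mklabel gpt
sudo parted -s /dev/sdc mkpart primary ext4 1MiB 100%
sudo mkfs.ext4 /dev/sdc1
```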
[16:09:54] !log initiating controlled shutdown of analytics1021 kafka broker to do some load testing and also fix runtime java version
[16:10:02] Logged the message, Master
[16:13:37] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 25.0
[16:13:57] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[16:14:54] (^^FYI, I know, I'm doing this on purpose! :) )
[16:15:42] reedy: twemproxy won't start on mw1202/mw1203...i tried manually starting and it failed...any ideas?
[16:16:17] PROBLEM - DPKG on analytics1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:22:17] RECOVERY - DPKG on analytics1021 is OK: All packages OK
[16:22:18] cmjohnson1: file /usr/local/apache/common/wmf-config/twemproxy-eqiad.yaml doesn't exist
[16:24:34] cmjohnson1: Reedy killed that file in https://gerrit.wikimedia.org/r/#/c/116036/ ? Not sure why.
[16:26:14] maybe because the tmh server is in tampa but will wait till he returns..thx
[16:27:57] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[16:28:17] (03CR) 10Manybubbles: "Its been a while and we don't really have a solution yet? Is it still coming very soon? Can/should we hedge our bets on when very soon w" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95333 (owner: 10Matanya)
[16:28:51] cmjohnson1: There is still a twemproxy.yaml. Sam may have assumed that "normal" multiversion file loading was used for twemproxy. I don't know anything technical about it.
[16:29:39] If it was php using multiversion the code would ask for the realm specific version of the twemproxy.yaml file and that would look for twemproxy-<suffix>.yaml and fall back to twemproxy.yaml.
[16:30:27] Where <suffix> could be (production,labs) and/or (eqiad,pmtpa,…)
[16:32:57] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0
[16:41:37] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[16:48:16] cmjohnson1: bd808 https://noc.wikimedia.org/conf/highlight.php?file=twemproxy.yaml
[16:48:47] * bd808 nods
[16:48:56] If there isn't a datacentre suffixed version, it should fall back to using that
[16:48:59] "should"
[16:49:19] IIRC the eqiad and that one (and maybe even the tampa one) were identical
[16:49:44] Of course, that's in MediaWiki land
[16:49:51] in twemproxy land... it might be looking for the suffixed version
[16:49:58] I think that "should" is the bit that may be under question. Where's the twemproxy code?
Aha
[16:51:03] modules/generic/files/upstart/twemproxy.conf: [ -f "/usr/local/apache/common/wmf-config/twemproxy-$(cat /etc/wikimedia-site)".yaml ] || { stop; exit 0; }
[16:51:28] I'll move it back
[16:51:33] sweet
[16:52:51] Reedy: Or you could fix the upstart script to do a multiversion style fallback
[16:54:15] (03PS1) 10Reedy: twemproxy isn't mediawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118296
[16:54:19] Maybe/probably should do that as well
[16:54:39] (03PS4) 10coren: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar)
[16:54:51] (03CR) 10Reedy: [C: 032] twemproxy isn't mediawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118296 (owner: 10Reedy)
[16:54:59] (03Merged) 10jenkins-bot: twemproxy isn't mediawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118296 (owner: 10Reedy)
[16:55:57] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2128.57690698
[16:56:00] !log reedy synchronized wmf-config/ 'I6f1f3f8af5b97aa0e537fbae308ce27b28071894'
[16:56:09] Logged the message, Master
[16:56:42] mw1201: @ERROR: access denied to common from mw1201.eqiad.wmnet (10.64.48.33)
[16:56:49] !log reedy synchronized docroot and w
[16:56:51] mw1021, mw1022, mw1023
[16:56:57] Logged the message, Master
[16:57:15] mw1201, mw1202, mw1203
[16:57:17] even
[17:02:09] !log reedy synchronized wmf-config/ 'noop'
[17:02:16] Logged the message, Master
[17:03:03] mw1201: @ERROR: access denied to common from mw1201.eqiad.wmnet (10.64.48.33)
[17:03:03] mw1201: rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
[17:03:03] mw1202: @ERROR: access denied to common from mw1202.eqiad.wmnet (10.64.48.34)
[17:03:03] mw1202: rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
[17:03:03] mw1203: @ERROR: access denied to common from mw1203.eqiad.wmnet (10.64.48.35)
[17:03:04] mw1203: rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
[17:05:47] (03PS1) 10Jgreen: relocating drush on aluminium from /srv/drush to /usr/local/drush [operations/puppet] - 10https://gerrit.wikimedia.org/r/118298
[17:07:25] (03CR) 10Jgreen: [C: 032 V: 031] relocating drush on aluminium from /srv/drush to /usr/local/drush [operations/puppet] - 10https://gerrit.wikimedia.org/r/118298 (owner: 10Jgreen)
[17:14:24] greg-g: we will deploy, just running a few min behind
[17:15:07] yurikR: :) k
[17:15:53] root@mw1202:~# /sbin/start twemproxy
[17:15:53] start: Job failed to start
[17:16:21] reedy ^
[17:17:06] cmjohnson1: I guess related to those rsync errors above
[17:17:19] /usr/local/apache/common/wmf-config/ isn't up to date
[17:18:09] (03PS1) 10Tim Landscheidt: ntp: Work around Labs network error [operations/puppet] - 10https://gerrit.wikimedia.org/r/118301
[17:18:10] sync-common is the usual fix for that, but that won't work on those machines
[17:18:21] Reedy: should yurikR hold off for now?
[17:18:57] Nope, should be no need. It's just 3 specific apaches with issues
[17:19:01] kk
[17:20:16] Let's see
[17:20:19] reedy: i moved them to a new row and they have different ip's now.. that shouldn't affect anything
[17:20:20] They've been moved racks
[17:23:24] Reedy: https://gerrit.wikimedia.org/r/#/c/101889/
[17:23:30] will this be merged or should I abandon it?
[17:23:31] hosts allow = 10.0.0.0/16 10.64.0.0/22 10.64.16.0/24 208.80.152.0/22 10.64.32.0/22
[17:24:14] (03Abandoned) 10Odder: Remove Flow from Meta-Wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder)
[17:24:16] odder: dot in "./modules/planet/templates/feeds/"? By that i want to say "when you're in the operations/puppet repo root directory", because that's what you'd use with "cd"
[17:24:31] cmjohnson1: /etc/rsyncd.conf needs updating on tin
[17:24:53] doesn't allow 10.64.48.33
[17:26:03] I think..
[17:26:39] (03CR) 10Hashar: "Not sure /home/l10nupdate really need to be defined in puppet since I have created the l10nupdate user in wikitech :-]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar)
[17:27:42] !log yurik synchronized php-1.23wmf16/extensions/ZeroRatedMobileAccess/
[17:27:44] hashar: hi, i'll merge your retab change when you're ready, looks very no-op to me
[17:27:49] Logged the message, Master
[17:27:54] mutante: should be :]
[17:28:00] hashar: and i have another question after that:)
[17:28:27] Reedy: was syncing and got access denied on 3 servers, known?
[17:28:32] [17:25:35] That ACL list comes from misc::deployment::scap_primary in manifests/misc/deployment.pp
[17:28:36] yurikR: yep
[17:28:37] yurikR: Yup, what we're discussing
[17:28:46] sry, was worried i crashed prod :)
[17:29:01] in the rsyncd.conf file, that range for row D is not there
[17:29:11] as allowed
[17:29:52] (03CR) 10Dzahn: [C: 031] "looks very no-op to me, any time hashar is online:)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248 (owner: 10Hashar)
[17:29:57] mutante: what you wanted to ask?
[17:30:51] cmjohnson1: The CIDR range for row D needs to be put in the rsync ACL list from misc::deployment::scap_primary in manifests/misc/deployment.pp
[17:31:15] hashar: last night i looked closer at the "experimental" jenkins tests after i merged a "convert to module" change.. and it lists a couple of our modules as "are NOT passing tests"
[17:31:24] * bd808 is lost in a maze of twisty little passages
[17:31:24] hashar: and i want to find out what is wrong with them
[17:31:39] hosts_allow => ['10.0.0.0/16', '10.64.0.0/22', '10.64.16.0/24', '208.80.152.0/22', '10.64.32.0/22'];
[17:31:45] those are bacula,install-server,mysql,nrpe,openstack,postgresql and stdlib
[17:31:53] the others are apparently passing all tests
[17:32:03] mutante: yeah that is something we attempted to work on with alexandros and andrew B a while back
[17:32:14] that is rspec tests
[17:32:28] (03Abandoned) 10Odder: Enable web fonts by default on Hebrew Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115153 (owner: 10Odder)
[17:32:30] do we already know what the actual errors are?
[17:32:37] or, where do i get more detail for that
[17:32:39] the calling procedure I think
[17:32:46] well
[17:32:58] there are not any real errors (aka on my laptop the rake tests are OK)
[17:33:27] but we have not finalized what jenkins expects to find and how to call rake etc etc
[17:33:33] aha
[17:33:54] some differences between stdlib (which I suppose we need to update at some point)
[17:33:55] (03Abandoned) 10Odder: Add Malayalam aliases for NS_MODULE, NS_MODULE_TALK [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101889 (owner: 10Odder)
[17:33:58] and the rest
[17:34:00] well, i think in that case, you already convinced me to not worry right now:)
[17:34:20] (03PS1) 10Reedy: Add 10.64.48.0/22 to scap allowed hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/118309
[17:34:25] cmjohnson1: ^
[17:34:36] but please leave the experimental tests in, it was nice to see how they did not fail for my module
[17:34:37] mutante: it is a quite cool concept though, albeit a bit overly verbose
[17:34:42] yes
[17:34:48] reedy cool...was about to do it..so thx
[17:35:01] mutante: did you even have rspec tests in that module ?
[17:35:35] akosiaris: no :p i see
[17:35:45] :-)
[17:35:52] !log yurik synchronized php-1.23wmf17/extensions/ZeroRatedMobileAccess/
[17:35:54] nothing to fail then :-)
[17:35:58] akosiaris: but i got this screen https://integration.wikimedia.org/ci/job/operations-puppet-spec/4387/
[17:36:00] Logged the message, Master
[17:36:06] and it's called puppet-spec
[17:37:02] greg-g: done
[17:37:02] and it says FAILURE, wth, i was sure i saw one with just SUCCESS
[17:37:34] yeah i am gonna fix those at some point
[17:37:42] got to follow the stream i suppose
[17:38:37] hashar: oh, one more question about doc.wm.org. is there a feature to make it forget old stuff?
[17:38:46] yurikR: cool
[17:39:27] mutante: what do you mean? :-]
[17:39:30] hashar: here's what i mean, when i go to module "planet" there, it shows old things, that don't actually exist in git anymore
[17:39:37] but just existed in the past
[17:39:47] so i guess what is needed is some kind of refresh
[17:40:38] hashar: f.e. under "planet" there is class "planet::venus" and "planet::languages" those are not actual classes anymore, but the others are
[17:40:49] mutante: maybe the job is broken
[17:41:16] hashar: want Bugzilla?
[17:41:22] mutante: https://integration.wikimedia.org/ci/job/operations-puppet-doc/
[17:41:24] apparently it works
[17:41:58] hashar: https://doc.wikimedia.org/puppet/files/srv/org/wikimedia/doc/puppetsource/modules/planet/manifests/venus_pp.html
[17:42:09] last update: Wed Apr 10 12:13:03 +0000 2013
[17:42:46] hashar: i probably messed it up because it was a module in the past, it got reverted, and later it came back
[17:42:59] as a different thing but with the same name
[17:43:19] there were months in between though
[17:44:12] hashar: but i can totally put that on a bug and leave you alone with it for now and we merge the retab:)
[17:45:26] mutante: I have no clue honestly
[17:45:49] cmjohnson1: If you merge, deploy, run puppet on tin.. You should just need to run "sync-common -D master_rsync:tin.eqiad.wmnet" on the 3 apaches before trying to start twemproxy again
[17:45:51] https://doc.wikimedia.org/puppet/modules/fr_planet.html was last touched on April 10th
[17:46:28] almost a year :-)
[17:46:45] yeah
[17:46:47] yes, that was the "old" module
[17:46:51] that broke things
[17:46:52] I have no idea why it gets included though
[17:46:55] and now it's the new module
[17:46:59] but we're using the same name
[17:47:04] and that confused it somehow
[17:47:12] (03CR) 10Cmjohnson: [C: 032] Add 10.64.48.0/22 to scap allowed hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/118309 (owner: 10Reedy)
[17:47:31] maybe puppet rdoc is smart enough to include html files on the disk
[17:47:44] that's what made me say "make it forget old things"
[17:47:49] some kind of cache files?
[17:47:57] no idea
[17:48:02] ok, let's worry later?
[17:48:14] i dont want to look at puppet rdoc code
[17:48:25] i mean, the existing classes are in docs. that's what matters
[17:48:31] understood
[17:48:34] we could generate the doc to some staging directory then use rsync --delete to get rid of the old files
[17:49:38] find with 'mtime'?
[17:49:41] mutante: the job is named operations-puppet-doc and the shell script is defined in jenkins job builder at http://git.wikimedia.org/blob/integration%2Fjenkins-job-builder-config.git/df4e21412cb044cd53cc846f40ba675a4c40b77b/operations-puppet.yaml#L80
[17:50:26] hashar: ok!
[17:50:47] PROBLEM - SSH on carbon is CRITICAL: Connection refused
[17:50:57] the class is no longer in puppet but puppet rdoc seems to look at the HTML files existing on disk and include them
[17:51:24] so yeah, we need to generate doc to a temp dir then rsync --delete to the target dir ( /srv/org/wikimedia/doc/puppetsource )
[17:51:27] PROBLEM - NTP on carbon is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:29] file a bug :-]
[17:51:29] i can just delete the HTML then?
[17:51:36] will do, nod
[17:51:45] let's move on to next thing:)
[17:51:57] i will file it
[17:53:02] so yea, ready to look at puppet run on gallium? wanna get it done with?
[17:53:27] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[17:53:37] i can also do it, i don't expect any changes
[17:54:34] fleeing sorry
[17:54:45] mutante the file{} change is harmless go for it :]
[17:54:47] *wave*
[17:55:23] the file{} change? now which one did he mean.. ok...
[17:58:05] reedy: i see this on the puppet update on tin http://p.defau.lt/?JV2GWmCuFxfXqEn3OI6Vrg
[18:02:27] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC
[18:03:27] RECOVERY - Disk space on virt10 is OK: DISK OK
[18:07:47] cmjohnson1: Did you try running `/etc/init.d/rsync start` manually to see if it explains the error? Or look for a log file?
[18:13:33] bd808...thx that fixed it
[18:13:57] RECOVERY - twemproxy process on mw1201 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[18:17:39] (03PS5) 10Ottomata: Adding archiva module [operations/puppet] - 10https://gerrit.wikimedia.org/r/117024
[18:21:27] RECOVERY - twemproxy process on mw1203 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[18:22:27] RECOVERY - twemproxy process on mw1202 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[18:23:18] I am having massive speed issues on wiki today... any idea what it is? (Mostly on meta right now) in theory it could be the office internet but it's been happening for a couple days now (worse today than it has been) and not on any other sites.
[18:25:15] jamesofur: Meta's been slower for me too recently - but hey - always blame office internet :p
[18:25:49] It's really nasty :-/ pages are taking multiple minutes to load or save and sometimes just stopping and timing out
[18:26:15] It's mostly trouble saving with me. Browsing is not too bad.
[18:26:18] I know people on enWiki have been reporting issues the past week but they are all on the east coast so not sure same issue
[18:29:10] (03PS1) 10Cmjohnson: adding dbstore1001 and 1002 to dhcp file [operations/puppet] - 10https://gerrit.wikimedia.org/r/118324
[18:29:47] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100%
[18:29:50] (03CR) 10Cmjohnson: [C: 032] adding dbstore1001 and 1002 to dhcp file [operations/puppet] - 10https://gerrit.wikimedia.org/r/118324 (owner: 10Cmjohnson)
[18:31:31] (03CR) 10Brion VIBBER: [C: 04-1] Set up configuration for App indexing (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:36:30] (03PS3) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[18:37:40] (03CR) 10Denny Vrandecic: "Removed superfluous scheme." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:37:43] (03CR) 10Hoo man: [C: 04-1] Set up configuration for App indexing (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:39:52] (03PS4) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[18:40:10] (03CR) 10Denny Vrandecic: "Fixed whitespace." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:40:17] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[18:47:07] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds
[18:47:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[18:49:07] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[18:51:27] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:50:34 PM UTC
[18:54:27] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202036)
[18:55:27] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000
[18:57:17] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100%
[18:57:50] (03PS1) 10Jdlrobson: Get office wiki mobile image uploads enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118328
[18:59:27] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204622)
[19:03:20] greg-g: we found a minor bug in our cookie setting, can we lightning depl (would take about 10 min)
[19:03:22] https://gerrit.wikimedia.org/r/#/c/118326/
[19:03:25] (03CR) 10Dzahn: [C: 032] contint: retab to get rid of tabulations [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248 (owner: 10Hashar)
[19:03:37] PROBLEM - MySQL Idle Transactions on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds
[19:04:37] RECOVERY - MySQL Idle Transactions on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[19:04:43] yurikR: yeah, go for it
[19:04:55] next thing is parsoid in 55 minutes
[19:05:34] watches puppet run on gallium (contint)
[19:05:37] after retab change
[19:06:35] !log graceful apache on gallium
[19:06:39] looks all ok to me
[19:06:44] Logged the message, Master
[19:06:53] (03CR) 10Matanya: Adding archiva module (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/117024 (owner: 10Ottomata)
[19:07:47] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[19:07:56] (03CR) 10Dzahn: "ran successfully on gallium. zuul/doc/etc. Apache sites changed content. graceful'ed apache. looks still good" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248 (owner: 10Hashar)
[19:08:05] mutante: can you advise on https://rt.wikimedia.org/Ticket/Display.html?id=6134 ?
[19:09:08] matanya: no, if you see me being the last one to comment that is usually because i don't have the answer, heh
[19:09:15] gotcha, here i go
[19:09:35] mutante: thanks too bad, who can have it?
[19:09:43] matanya: have what?
[19:09:55] matanya: which part do you want advice on? which host it's going to be on?
the answer :)
[19:10:01] yes
[19:10:06] if i had it it'd be on the ticket already
[19:10:11] suew
[19:10:17] sure,
[19:10:25] looking for the right person to ask
[19:10:30] yurikR: right
[19:11:29] it's just client though, not an LDAP server
[19:11:30] matanya:
[19:11:44] !log yurik synchronized php-1.23wmf17/extensions/ZeroRatedMobileAccess/
[19:11:53] Logged the message, Master
[19:12:02] !log yurik synchronized php-1.23wmf16/extensions/ZeroRatedMobileAccess/
[19:12:10] Logged the message, Master
[19:12:11] greg-g: thx
[19:12:13] done
[19:13:45] * matanya is looking for a server willing to host it
[19:24:37] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:03] i'd suggest ops list to find a home versus irc. we can assign a system, but it may need to live someplace special since its for doing ldap stuff
[19:27:22] matanya: plus you can always ask the RT triage duty person
[19:27:31] (in fact they should be the first person you can poke)
[19:27:48] * matanya pokes bblack
[19:28:39] Ideally then they can gauge the severity, and if they need to find someone immediately, turf to email list, or turf to weekly ops meeting
[19:28:53] thanks robh
[19:29:05] and if you are already here: https://rt.wikimedia.org/Ticket/Display.html?id=4839 :)
[19:29:14] can you please explain this ^
[19:29:45] those are terbium job scripts that development has to update to not be nfs mount dependent
[19:30:04] sorry, hume job scripts
[19:30:07] that cannot go on terbium for that reason
[19:30:15] is this dependency in puppet?
[19:30:25] it should link to another ticket, lets see
[19:30:37] oh, it is https://rt.wikimedia.org/Ticket/Display.html?id=4792
[19:31:15] wow, some really old cruft linked in there, heh
[19:31:47] yes, that is why i'm quite lost :)
[19:31:57] (03PS1) 10BryanDavis: Add scap-purge-l10n-cache to /usr/local/bin [operations/puppet] - 10https://gerrit.wikimedia.org/r/118338
[19:32:06] just trying to be useful in getting out of tampa stuff
[19:32:35] ty matanya
[19:32:47] i cleared up two of them
[19:32:51] :)
[19:32:55] we dont need to worry about the apache stuff or the ip stuff
[19:33:03] since its going away, they are no longer relevant
[19:33:15] apaches are off anyway now
[19:33:34] well, those jobs affected all apaches, not just tampa
[19:33:37] but terbium does it now
[19:34:39] looking on manifests/site.pp they are all disabled on hume
[19:34:49] enabled=>false
[19:35:07] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[19:35:57] huh
[19:36:00] and enabled on terbium
[19:36:06] soooo, someone did this already it seems =]
[19:37:39] yup
[19:37:41] (03PS2) 10MaxSem: Upload to non-CentralAuth wikis locally from mobile [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118328 (owner: 10Jdlrobson)
[19:37:52] this is my main source of confusion
[19:37:52] nice catch, resolved
[19:38:04] it was fixed by devs, prolly ^d
[19:38:11] whoever did it wasnt in RT is all
[19:38:22] since its dev script and all, they handled the fix
[19:38:48] i assume it was chad since he fixed the majority of the hume/terbium scripts before that ticket, heh
[19:40:26] <^d> Only thing missing on terbium is something we never puppetized.
[19:40:32] <^d> All the puppetized ones have been moved.
[19:40:49] <^d> https://gerrit.wikimedia.org/r/#/c/74591/
[19:40:57] so once this is done. we can say byebye
we can say byebye [19:42:27] <^d> It's the last thing I know of, but we should double check ;-) [19:43:30] i now realize it's that time of day i'm supposed to shove food into my face [19:43:37] <^d> indeed it is. [19:44:14] i didn't wanna just go afk to get food when i was moments ago actively participating in discussion without saying so ;] [19:44:27] * robh does just that [19:44:58] if it's not puppetized it will just disappear at some point [19:45:19] <^d> Hence the patch ;-) [19:45:36] oh, it's the fix for that? didn't realize, nice [19:46:01] got it, cool [19:46:22] matanya: what do you say about contint retab, hah [19:46:35] i know you've been waiting for it:) [19:46:57] sec, reviewing the former patch [19:47:05] ^d: i prefer z :) [19:47:30] (03CR) 10Matanya: Properly puppeti[sz]e purge-checkuser (038 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/74591 (owner: 10Reedy) [19:47:43] 8 :) [19:47:57] i know, i comment too much [19:48:16] didn't say that, you're just fast [19:48:49] jenkins bot vs. matanya bot?:) [19:48:49] oh, you merged the retab one [19:48:58] yes, bold, eh? [19:49:04] but it did nothing [19:49:17] well, replaced the apache site content, but all just tabs/spaces [19:49:58] mutante: i'm well known among the stewards for being the fastest steward around, go and ask in #wikimedia-stewards who is the fastest :P [19:50:36] matanya: ok:) btw, planet module :p [19:50:51] i rest my case :P [19:51:06] 12:50 < mutante> who's the fastest steward? [19:51:06] 12:50 < Jurgen|busy> matanya [19:51:08] haha:) [19:53:43] well, point made :D [19:54:35] matanya: https://gerrit.wikimedia.org/r/#/c/108674/ one less in misc, next is RT then [19:54:57] mutante: i loved the site --> apachesite there :) [19:55:01] and great work! [19:55:38] matanya: thanks, the only minor mistake was doing that in a separate change [19:55:38] we should get rid of such confusing variables [19:55:58] well, as long as nothing broke we are all happy, eh? [19:56:12] that was my thinking when you talked to akosiaris about the $site variable name [19:56:31] in this case it was just the file and would have worked fine, but i agree, it might be confusing to have site.pp [19:56:40] yes [19:57:11] i would like to see https://gerrit.wikimedia.org/r/#/c/111189/ merged one day [19:58:01] as for RT, i left unanswered comments :) [19:58:37] ah, thanks for pointing out there are new comments, on it:) [19:59:09] i understand "please add monitoring" [19:59:17] but i don't understand "please add monitoring for the role" [19:59:38] bblack, we'll need a root to restart the parsoids as part of our deployment [19:59:55] mutante: read it as "please add monitoring in here" [20:00:31] at the exact place the comment stands ... [20:01:52] gwicke: If you tell me which keys to press, I'll do it. :-) [20:02:28] !log deployed Parsoid 004c7acc with deploy f97820a2; restart todo [20:02:35] Logged the message, Master [20:02:37] Coren, moment.. [20:03:08] on the salt master: salt-run deploy.restart 'parsoid/deploy' '10%' [20:03:12] (from https://bugzilla.wikimedia.org/show_bug.cgi?id=61882) [20:03:32] Ah, that. Sure thing, just say when. [20:03:38] now is good [20:03:46] I'm done [20:03:47] dammit, outage @ $day_job, bbl [20:05:00] !log restarting parsoids as requested by gwicke [20:05:08] Logged the message, Master [20:05:20] gwicke: Restart completed [20:05:42] Coren, thanks! [20:05:49] (03PS1) 10Dan-nl: adding Amsterdam Museum to the wgCopyUploadsDomains array.
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 [20:06:56] bblack, do you need anything else from me for https://rt.wikimedia.org/Ticket/Display.html?id=6961 ? [20:07:48] (03CR) 10Matanya: WIP - turn RT from misc/* into puppet module (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116064 (owner: 10Dzahn) [20:08:06] 4 minutes, not bad [20:08:49] !change 111189 | chasemp [20:09:00] ooh, that used to work :p [20:09:04] gwicke: I don't think so [20:09:11] chasemp: https://gerrit.wikimedia.org/r/#/c/111189/4 [20:09:56] bblack, ok; would be awesome to get this done before our next deploy next Monday [20:11:26] matanya: ack @ monitoring, i'm gonna say checking HTTPS with certificate expiry from external is all we need [20:11:38] and thought we already had it [20:11:43] checking though [20:12:02] mutante: check http too maybe? [20:15:13] so it turns out that trebuchet is broken again [20:15:32] our deploy did not actually update the submodule with the code [20:15:45] which means that the parsoid deploy has been a no-op [20:15:55] Ryan_Lane1: ping [20:19:08] no-op on monday, the 10th, and no-op today according to http://parsoid-lb.eqiad.wikimedia.org/_version which shows a deployed version from March 3rd. [20:20:46] just sent a mail as well [20:21:22] is there now anybody else who can help to fix trebuchet issues, or is it still only Ryan? [20:22:55] Table 'mediawikiwiki.wikilove_log' doesn't exist (10.64.16.27) [20:27:04] opened https://github.com/trebuchet-deploy/trigger/issues/26 [20:27:55] gwicke: Do you know if the salt logs end up anywhere that I could see them? I'm not sure I can troubleshoot blind [20:28:39] bd808, I sadly don't know much about how salt is set up or works [20:29:00] I am around [20:29:05] but know very little about trebuchet [20:29:07] https://wikitech.wikimedia.org/wiki/Salt is rather sparse [20:29:51] paravoid: Can you look for salt logs related to the deploy that gwicke just tried and see if there are error messages? [20:30:16] I'm wondering if this is a "dies silently" problem or something that the logs can point us to the fix for [20:30:44] gwicke: Are these new submodules or just missing updates to ones that were already checked out? [20:30:56] bd808, they are existing submodules [20:31:05] nothing relevant in the salt logs [20:31:20] before the latest upgrade there was a similar bug that Ryan eventually fixed [20:31:26] now it seems to have come back [20:31:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:32:23] paravoid: Thanks for looking [20:33:05] gwicke: Is it the "src" submodule in parsoid/deploy that's failing to update on the hosts? [20:33:09] just found this link in my mail: https://gerrit.wikimedia.org/r/#/c/112605/ [20:33:15] bd808, yes [20:34:56] that code still seems to be in puppet [20:35:21] so maybe it's another issue this time [20:38:40] * bd808 can't log into the wpt* hosts to look at the state of /srv/deployment/parsoid/deploy [20:38:54] it's wtp* [20:39:08] I am logged into wtp1020 [20:39:13] my topo in irc, not ssh [20:39:22] *typo BLARGH [20:39:22] * YuviPanda topos bd808 [20:39:47] gwicke: What does git log look like there compared to tin? [20:39:51] bd808, the src checkout is basically still at the state before the last two deploys [20:40:11] commit 98936e7a78f4d63f39d524c35ee5ba46b6ac3177 [20:40:18] that was deployed last Wednesday [20:40:41] before Ryan's work on trebuchet [20:40:58] the deploy checkout is up to date though [20:41:05] Ok.
That helps [20:45:13] bd808, gwicke, s/last Wednesday/last Monday/ .. just being nitpicky for the record :). [20:45:42] hmm, yeah [20:46:15] https://www.mediawiki.org/wiki/Parsoid/Deployments [20:46:29] paravoid: In the salt logs/salt config can you see what command trebuchet-trigger is calling? [20:46:45] * bd808 is learning trebuchet in real-time [20:47:07] no, the master logs doesn't log the commands it runs [20:47:12] s/logs// [20:47:16] boo [20:47:21] ok. thanks again [20:47:54] How about sudoer logs from the target hosts? [20:48:09] http://p.defau.lt/?DkDrmqZBYebEi6m6bAgrvw [20:48:13] that's the entirety of the log [20:49:03] salt runs as root, doesn't use sudo [20:49:13] That looks promising actually. [20:49:22] root@wtp1001:/var/log/salt# cat minion [20:49:22] root@wtp1001:/var/log/salt# [20:49:51] how is non-root access to salt actually managed? setuid in front-end scripts? [20:49:56] "Failed matching available minions with glob pattern" [20:50:07] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [20:50:37] gwicke: Yeah there are sudoers entries that allow particular salt commands by non-root [20:51:02] aha [20:52:11] gwicke: See manifests/role/deployment.pp for a whole bunch of them [20:53:56] thanks [20:54:13] (03CR) 10Dzahn: [C: 031] Authorize rush for Icinga, add him to contact group sms and add CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush) [20:54:15] `/usr/bin/salt-call -l quiet publish.runner deploy.checkout *` would be the checkout phase of trebuchet [20:54:27] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [20:55:17] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [20:56:22] Those minion matching errors in the log paravoid pasted look suspicious, but I don't know salt guts enough to say for sure [20:56:50] The timestamps and hosts match the trebuchet call though [20:56:51] sadly working around it by rsyncing over the submodule won't work either as the git version we are running has a bug in its submodule handling (see https://bugzilla.wikimedia.org/show_bug.cgi?id=62519) [20:57:30] gwicke: Can you use dsh to force the right git submodule update call? [20:57:49] I don't have enough rights for that afaik [20:58:19] in theory git submodule update --init should do it [20:58:33] Surely someone does. [20:58:38] in /var/lib/parsoid/deploy [20:58:42] on all the wtp hosts [20:59:16] our deploy window is over in two minutes [20:59:22] so I guess that's it for now [20:59:38] I've got the next block… I'm willing to help this get fixed. [20:59:45] bleh, search backend is showing all sorts of warnings in fatal mon [20:59:46] I'm just going to be testing no-op scap [21:00:30] I have 30 more minutes before a meeting, but can't really contribute much [21:01:01] `dsh -g parsoid -M -F 4 -- 'cd /srv/deployment/parsoid/deploy; git submodule update --init --recursive'` [21:01:10] gwicke: Does that look right? [21:01:29] yes, that looks ok [21:01:41] That would run on 4 nodes at a time [21:01:49] let me check the owner of the files [21:01:58] all owned by root [21:02:05] * bd808 nods [21:02:27] so you need a root such as paravoid to run it (hint, hint) [21:02:47] or Coren [21:02:47] if you're sure that's a good idea... [21:03:06] it would should get us unstuck for now [21:03:08] ori, did you remove the send geoip cookie to everyone vmod?
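
What bd808 and gwicke are checking on wtp1020 boils down to a few git commands; a sketch using only standard git plus the paths, the commit sha1, and the salt invocation quoted in-channel:

    cd /srv/deployment/parsoid/deploy
    git log --oneline -1       # the deploy repo tip: up to date after the deploy
    git submodule status src   # the checked-out submodule: still 98936e7a, two deploys behind
    # trebuchet's checkout phase, per the quote above, runs through salt:
    #   /usr/bin/salt-call -l quiet publish.runner deploy.checkout *

A deploy repo that is current while the src submodule sha1 lags behind is exactly the stale-submodule symptom being debugged here.
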
[21:03:17] s/would// [21:03:27] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [21:03:33] gwicke: Then you'd need the restart salt command run, correct? [21:03:47] yeah, which is also broken for us ;( [21:04:37] so as root, from the salt master, after updating the code: salt-run deploy.restart 'parsoid/deploy' '10%' [21:04:43] sigh [21:05:57] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 877.133362 [21:05:57] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1067.766724 [21:06:07] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 933.799988 [21:06:17] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 901.5 [21:06:47] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 655.133362 [21:07:07] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 806.366638 [21:07:17] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 948.599976 [21:07:17] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 780.56665 [21:07:29] ottomata: ^^ [21:08:05] whoaaaa [21:08:14] hmmk..... [21:09:31] paravoid, are you running those commands? [21:12:46] I am not [21:13:18] I'm about to join the same meeting you are and I'm not sure if I should just run commands I don't understand on the whole parsoid cluster right now [21:13:50] ok [21:14:07] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds [21:14:41] !log welcome to root, Chase, added key to root-auth-keys [21:14:49] Logged the message, Master [21:15:07] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds [21:15:47] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:48] chasemp: want to review a change for me? [21:15:57] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:57] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:57] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:59] analytics1021 keeps losing its connection to zookeeper occasionally [21:16:07] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:16:17] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:16:17] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:16:17] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:17:03] I don't have time atm; if it's the sudo stuff, it hit my radar. sometime when I come out of my shell as a beautiful butterfly, if it's around I will be digging into it.
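
Assembled in one place, the manual recovery gwicke spelled out above is two commands, both run as root; a sketch that assumes the 'parsoid' dsh group and the salt master setup exactly as quoted in-channel:

    # update the stuck submodule across the cluster, 4 hosts at a time
    dsh -g parsoid -M -F 4 -- \
        'cd /srv/deployment/parsoid/deploy; git submodule update --init --recursive'
    # then the rolling restart, 10% of the minions per batch
    salt-run deploy.restart 'parsoid/deploy' '10%'
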
[21:17:20] matanya: i already forwarded the sudo change [21:17:27] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [21:17:56] matanya: give him a grace period:) [21:18:13] thanks mutante, thought about the other one, but that is cool too [21:18:18] he's not on RT duty yet! [21:18:41] matanya: it fit in, because he is interested in the sudo stuff [21:18:44] hey, guys, the best way to learn is to jump in the water, isn't it? :P [21:18:59] oh, great [21:19:21] bd808, see the security channel [21:22:47] mutante: you pointed me at https://rt.wikimedia.org/Ticket/Display.html?id=6583 which i don't have permission to view in https://rt.wikimedia.org/Ticket/Display.html?id=6144 [21:22:56] (03CR) 10Rush: [C: 032] Authorize rush for Icinga, add him to contact group sms and add CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush) [21:24:16] and there's the first merge [21:24:28] chasemp: checking puppet run on neon because that is icinga [21:24:33] * matanya points out that changing user access and adding functionality shouldn't be on the same patch [21:24:37] and the same run should add your key as well [21:24:56] the timeperiod is just for him [21:25:02] still [21:25:04] but meh [21:25:07] wouldn't it also be for brandon? [21:25:13] there were 3 separate changes [21:25:16] they got squashed [21:25:43] oh well, just a nitpick, don't take it so seriously [21:25:58] due to my learning about gerrit [21:26:08] yea, an example for rebase actually [21:26:23] chasemp == rush, yes? [21:26:28] agreed [21:26:37] * matanya is lost in all the nicks [21:26:45] !cp [21:26:52] a bot is missing [21:27:02] doesn't work anymore [21:27:18] :( [21:27:48] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [21:27:54] anyway, mutante what about the ticket question above? [21:28:21] greg-g: Can I squeeze in my scap test now? [21:28:55] yeah [21:29:03] matanya: that's cause it's in the procurement queue, would be a question for RobH or cmjohnson1 [21:29:14] .join bd808|deploy [21:29:32] .ready or something [21:29:54] .good [21:29:56] We removed volunteer rights to view the queue, because they were responding to vendors [21:30:04] !log bd808 Started scap: no-diff scap to test script changes [21:30:12] Logged the message, Master [21:30:12] so if it's in procurement, that's why you cannot view it matanya [21:30:28] not so critical [21:30:34] just for follow up [21:30:46] it's being ordered, but the folks involved know what's up [21:30:57] thanks [21:31:08] so you can just disregard it yep [21:31:27] * matanya forgot that ticket for now [21:31:43] matanya: Does it happen to show you the linked ticket? [21:31:51] [] [21:31:53] robh: we need to try and rush those...i would like to get them before April 1 [21:32:03] ie: can you see on https://rt.wikimedia.org/Ticket/Display.html?id=6144 that it has a link to 6583 or not at all? [21:32:14] cmjohnson1: it's not waiting on me, it's in approvals. [21:32:29] robh: i see the number but it's not clickable [21:32:35] and no subject eh? [21:32:39] no [21:32:46] ahh, that would be nice. i wonder if i can tweak it [21:32:53] just 6583: (Mark Bergsma) [] [21:32:53] i update the subject of my procurement tickets with the status [21:33:10] ie: quote requested, quote in review, purchase approval pending, ordered, shipped, etc...
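
The zookeeper session flapping that jgage and ottomata dig into just below can be quantified straight from the broker log; a sketch assuming the /var/log/kafka/kafka.log path tailed later in the channel, where the grep pattern is an assumption about the zkclient message format rather than a confirmed string:

    # count disconnect/reconnect cycles recorded so far
    grep -ci 'zookeeper state changed' /var/log/kafka/kafka.log
    # watch them arrive live (reportedly every 5 to 15 seconds)
    tail -f /var/log/kafka/kafka.log | grep --line-buffered -i zookeeper
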
[21:33:20] i'll see if i can't tweak it to show the subject [21:33:22] that would be neat [21:33:39] mostly for logstash i wanted that [21:33:55] my english breaks, seems like time for bed [21:38:19] is OIT ever on this channel? [21:39:38] !seen jkrauska [21:39:49] matanya: ^ if the bot was still here [21:40:01] if [21:40:05] matanya: cndiv is here in some fashion [21:40:29] i remember i needed to ask them about sanger [21:40:51] but never did, apparently [21:41:47] one more point mutante [21:41:58] you said app servers in tampa are out [21:42:06] so ersch|tarin can now go as well [21:42:34] if i'm right, i'll create a patch for this [21:43:03] jgage, you around? [21:43:53] hi [21:43:57] chasemp: now puppet ran on neon, and it added your contact [21:44:11] hey, i'm seeing some really weird kafka zookeeper problems right now [21:44:16] oh really [21:44:17] do tell [21:44:20] i would love some other eyes on it [21:44:39] https://rt.wikimedia.org/Ticket/Display.html?id=6877 [21:44:47] chasemp: so, icinga config has reloaded, and the paging should be active [21:44:48] That was the open files limit thing, I changed the name of the ticket [21:44:50] and put the relevant logs there [21:44:50] enjoy :p [21:44:52] jgage^ [21:44:58] * jgage looks [21:45:04] so, analytics1021 is losing its connection to zookeeper every few seconds [21:45:15] could somebody break the site really quick to test paging for Chase? .. NOT :) [21:45:36] these are causing ISR flaps and replica lag [21:46:00] mutante: Don't jinx my scap :) [21:46:06] which can cause produce errors as high traffic producer queues fill up while kafka is figuring out where logs should go [21:46:10] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafka&hide-hf=false [21:46:20] see Replica lag and ISRShrinks at the bottom of that page [21:46:25] matanya: ersch/tarin = poolcounter = yes, correct [21:46:59] still trying to figure out what is going on, but kafka broker traffic-in levels are super low right now [21:47:01] any clue why I'm downloading from gerrit at 4 KiB/s from the office? [21:47:12] mwalker: wifi? [21:47:21] good call [21:47:33] ottomata: wow ok, i'll dive in. i assume those "lumps" in the top graph are camus? [21:47:51] camus only affects Out rates [21:47:55] hrm yeah [21:48:05] that lump in those top graphs I am not sure about [21:48:14] bd808|deploy: taking a bit, eh? [21:48:39] ottomata, any recent changes? the current situation looks like the first time this has happened. [21:48:40] greg-g: l10n changes and rsync seems slower than usual today [21:49:01] yes this is new to me, i have seen this zk error before [21:49:04] but not this frequently [21:49:29] also, i was alerted to this because eqiad bits varnishkafkas had errors [21:49:29] greg-g: Lots of 3 to 3.5 minute syncs instead of my expectation of 2-2.5m [21:49:46] and i have not seen that before [21:49:53] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [21:50:04] I'll be excited to see what happens when I kill a bunch of old branches tomorrow [21:51:06] jgage, hm today I did take down analytics1021 for a few minutes [21:51:15] to check some load levels on an22 if it was the leader of the broker [21:51:18] leader of the partitions* [21:51:19] sorry [21:51:24] manybubbles: ^d "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later."
on wikitech [21:51:27] ottomata: ^ [21:51:43] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:50:34 PM UTC [21:51:47] !log bd808 Finished scap: no-diff scap to test script changes (duration: 21m 42s) [21:51:50] ottomata hmm no recent reboots, same kernel on both, disk space and loadavg ok, dmesg ok.. [21:51:55] Logged the message, Master [21:52:16] jgage, notice how analytics1021 is losing its zk connection constantly [21:52:18] it can't hold on to it [21:52:22] that is the problem, we need to figure out why [21:52:24] hrmmmm [21:52:25] greg-g: Ok if I go again? [21:52:28] tail -f /var/log/kafka/kafka.log [21:52:38] it loses and reconnects about every 5 or 15 seconds [21:53:14] jgage, i'm going to promote an22 to leader by doing a controlled shutdown of an21 [21:53:15] objections? [21:53:28] no i was going to suggest the same [21:53:42] k [21:53:46] mutante, hmm; not related to wifi actually... it's still stalling out -- though both times it's been on the CentralAuth repo [21:54:06] !log initiated controlled shutdown of an21, promoting an22 to leader of all partitions [21:54:10] !log bd808 Started scap: another no-diff scap to test script changes [21:54:14] Logged the message, Master [21:54:19] bd808|deploy: yeah [21:54:22] Logged the message, Master [21:54:23] :) [21:55:03] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [21:56:11] hmm, jgage, lots of produce requests still going to an21…varnishkafka should have figured out that an22 is the leader by now [21:56:31] (03PS1) 10Odder: Account creation throttle for ptwiki outreach event [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 [21:56:53] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4470.166504 [21:57:03] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1641.599976 [21:57:03] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 6089.5 [21:57:03] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5913.100098 [21:57:04] hey ops, is mediawiki-announce no longer forwarded to mediawiki-l? [21:57:13] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2191.300049 [21:57:20] those are bits [21:57:21] i'm pretty sure it used to be, and i'm pretty sure it is no longer. [21:57:23] both esams and eqiad [21:57:23] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 15.0 [21:57:23] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1046.366699 [21:57:33] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3849.266602 [21:57:36] !log bd808 Finished scap: another no-diff scap to test script changes (duration: 03m 25s) [21:57:43] Logged the message, Master [21:57:53] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:57:57] greg-g: This time most hosts finished rsync in < 10s except fenari.
It took 1:46 [21:58:03] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:58:03] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:58:06] jgage, now that an22 is the leader, i'm going to restart bits varnishkafkas, hopefully clearing out the queues and forcing them to get broker topic metadata [21:58:13] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:58:14] i don't want to see any produce requests on an21 [21:58:15] ottomata, ok [21:58:16] greg-g: Also {{done}} [21:58:23] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4093.93335 [21:58:23] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:59:33] ok, that is looking saner [21:59:34] ottomata i searched for that error but the only mailing list discussion is wrt a bug allegedly fixed in 3.3.4, we have 3.3.5 [21:59:47] i found this one [21:59:47] http://zookeeper-user.578899.n2.nabble.com/Ping-and-client-session-timeouts-td5085129.html [21:59:52] yeah [21:59:55] which says maybe GC problems on zk JVMs [22:00:03] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:00:03] but i dunno [22:00:06] oh i was reading http://zookeeper-user.578899.n2.nabble.com/zk-keeps-disconnecting-and-reconnecting-td6717952.html [22:00:20] bd808: nice [22:00:23] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:00:33] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:01:00] ok, full messagesinpersec is back to normal [22:01:02] greg-g: The back-to-back test points pretty strongly to l10n being the cause of scap's slowness [22:01:04] all traffic going to an22 [22:01:38] bd808: yepper [22:02:10] The core sync was ~15m with l10n changes and ~3m without [22:02:19] jgage, it still looks like the zookeeper connection flapping has stopped too [22:02:22] or at least lessoned [22:02:25] lessened [22:03:23] so mysterious [22:04:33] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Wed Mar 12 22:04:26 UTC 2014 [22:05:05] yeah, that zk connection is a big problem, that should not happen [22:05:09] i don't know what is going on there [22:05:10] yikes, hm [22:05:18] ok, jgage, i am supposed to run to dinner now [22:05:24] and I am taking tomorrow and friday off... [22:05:26] HMMM [22:05:32] :D [22:05:54] i can check on this in the morning though [22:05:55] hm [22:05:57] bd808: btw, data from the l10nupdate commit: http://paste.debian.net/87382/ [22:06:48] greg-g: Why in the heck does that take so long? [22:06:50] (03CR) 10Jeremyb: [C: 04-1] "I know WHOIS says 186.193.0/20 but is that valid syntax for whatever will be parsing the range? I'd use 186.193.0.0/20 to be safe."
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 (owner: 10Odder) [22:07:02] ok ottomata, i will keep investigating and send you anything i find [22:09:01] jgage, the zk conn does still time out [22:09:10] it just happened 30 secs ago [22:09:22] ok jgage [22:09:28] gah [22:09:30] so, just in case you have to do a controlled-shutdown [22:09:33] i've added this doc: [22:09:33] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Kafka/Administration#Safe_Broker_Restarts [22:09:39] thank you [22:09:40] see the Note: at the bottom of that section [22:09:45] k [22:10:00] the command doesn't exist in the version of kafka we have installed (I added it to the .deb package recently) [22:10:07] so you can just do it manually in that hacky way there [22:10:22] ok [22:10:47] ok so [22:10:58] i'm not sure if we should promote an21 back to leader of its partitions now [22:11:12] i'd say, go ahead and do that if you want to experiment or feel comfortable doing so [22:11:20] (via preferred-replica-election) [22:11:32] you could restart the broker first if you wanted to [22:11:33] dunno [22:12:45] ok, i'm going to run to dinner [22:13:25] mutante: tarin is the ganglia gmetad source for misc in pmtpa. leave it around, or kill it? [22:13:52] (03CR) 10Jeremyb: "one data point on 3 octet CIDR ranges:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 (owner: 10Odder) [22:13:56] ok ottomata [22:17:07] (03PS2) 10Lcarr: Initial commit of pmacct module and role [operations/puppet] - 10https://gerrit.wikimedia.org/r/115345 (owner: 10Jkrauska) [22:18:57] matanya: i don't know [22:25:58] (03PS1) 10Matanya: ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 [22:27:39] (03CR) 10jenkins-bot: [V: 04-1] ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 (owner: 10Matanya) [22:27:53] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 946.847896141 [22:28:46] (03PS2) 10Matanya: ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 [22:29:05] way too late to succeed on the first push [22:30:58] (03CR) 10Lydia Pintscher: "Alright. Let's do that then and work on the proper solution after that."
[operations/apache-config] - 10https://gerrit.wikimedia.org/r/113972 (owner: 10Thiemo Mättig (WMDE)) [22:31:58] hm analytics1021 and 1022 have different versions of kafka, 0.8.0-1 vs 0.8.0-2 [22:32:05] (03PS3) 10Matanya: ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 [22:36:04] (03PS1) 10Matanya: ersch: decom [operations/dns] - 10https://gerrit.wikimedia.org/r/118380 [22:41:08] ok, akosiaris_away added that, i'll need to ask him tomorrow [22:49:23] PROBLEM - Kafka Broker Replica Lag on analytics1021 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 10134023.0 [22:50:15] (03PS2) 10Dzahn: backup: remove db10 from disklist [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241 (owner: 10Matanya) [22:51:53] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2481.96610372 [22:52:13] (03CR) 10Dzahn: [C: 031] "db10.pmtpa.wmnet - 100% packet loss - also remove from DNS" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241 (owner: 10Matanya) [22:52:33] (03CR) 10Dzahn: [C: 032] "db10.pmtpa.wmnet - 100% packet loss - also remove from DNS" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241 (owner: 10Matanya) [22:52:41] removing from dns now [22:56:03] (03PS1) 10Matanya: db10: decom [operations/dns] - 10https://gerrit.wikimedia.org/r/118388 [23:06:23] RECOVERY - Kafka Broker Replica Lag on analytics1021 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 830375.0 [23:08:23] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [23:16:15] greg-g, am I free to deploy? [23:17:19] greg-g: Charge him to deploy, then he's not free :D [23:17:49] JohnLewis: don't you know greg-g gets paid per deploy? that is his salary base [23:18:08] matanya: He gets paid twice then. [23:19:13] (base_salary * mw_version * time_to_scap)?:) [23:19:56] !log mwalker synchronized php-1.23wmf16/extensions/CentralNotice [23:20:05] Logged the message, Master [23:20:22] mutante: does he get anything extra per wmf branch? [23:20:55] !log mwalker synchronized php-1.23wmf17/extensions/CentralNotice [23:21:03] Logged the message, Master [23:21:42] cndiv: are you around? [23:22:20] (03CR) 10Dzahn: [C: 031] "nice, as the bug reporter and per the "upstream bug" comments there i support this work around" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118301 (owner: 10Tim Landscheidt) [23:33:53] mwalker: good :) [23:37:19] ya; it didn't work though :( [23:37:23] I have a follow-on patch [23:37:25] greg-g, ^ [23:37:45] https://gerrit.wikimedia.org/r/#/c/118393/ [23:39:08] hah [23:39:16] kk [23:45:01] greg-g, can I deploy my follow-on fix? [23:45:27] they're building in jenkins for core right now [23:46:00] mwalker: yeah [23:46:17] thank ya [23:46:32] matanya: I'm back, but I'm afraid I don't recognize your nick. Who's this? [23:46:50] cndiv: just a volunteer [23:47:11] cndiv: wanted to ask about https://rt.wikimedia.org/Ticket/Display.html?id=6163 [23:47:16] matanya: Hi there. :-) What's up? [23:47:26] nice to meet you, all cool [23:47:45] matanya: Let me see if I remember my username to RT...
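
Two loose ends from the kafka incident above lend themselves to one-liners; a sketch in which the 'analytics' dsh group name and the zookeeper address are assumptions, while kafka-preferred-replica-election.sh is the stock kafka 0.8 admin script ottomata alludes to with "(via preferred-replica-election)":

    # confirm the 0.8.0-1 vs 0.8.0-2 package skew jgage spotted, across all brokers
    dsh -g analytics -M -- 'dpkg-query -W kafka'
    # hand the partitions back to an21 once it holds its zk session again
    kafka-preferred-replica-election.sh --zookeeper ZK_HOST:2181
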
[23:49:38] !log mwalker synchronized php-1.23wmf16/extensions/CentralNotice/ [23:49:47] Logged the message, Master [23:49:54] !log mwalker synchronized php-1.23wmf17/extensions/CentralNotice/ [23:50:02] Logged the message, Master [23:54:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [23:55:32] greg-g, that time it worked [23:55:41] so I'm done
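
Each "!log mwalker synchronized ..." entry above corresponds to one sync invocation on tin; a hedged sketch of what that likely looked like, assuming the /a/common staging layout and sync-dir tooling of the era:

    # from the MediaWiki staging area on tin
    cd /a/common
    sync-dir php-1.23wmf17/extensions/CentralNotice 'CentralNotice follow-on cookie fix'

Per the entries above, the sync tool rsyncs the named directory out to the apaches and then reports to the server admin log, which is where the "Logged the message, Master" acknowledgements come from.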