[00:06:11] (03PS1) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[00:23:51] (03CR) 10Dzahn: [C: 032] "labs-tested" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108674 (owner: 10Dzahn)
[00:27:34] ooh. sigh, of course minus that one sigh for style reasons
[00:27:55] come on gerrit
[00:28:52] (03PS1) 10Dzahn: rename planet::site to planet::apachesite [operations/puppet] - 10https://gerrit.wikimedia.org/r/118224
[00:30:21] (03CR) 10Dzahn: [C: 032] rename planet::site to planet::apachesite [operations/puppet] - 10https://gerrit.wikimedia.org/r/118224 (owner: 10Dzahn)
[00:35:19] anyone around want to help me test pushbot?
[00:35:25] just /join #gregtest
[00:45:27] (03CR) 10Dzahn: "ran on zirconium. planet still fine. apache graceful'ed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108674 (owner: 10Dzahn)
[00:59:03] bd808: Re https://bugzilla.wikimedia.org/show_bug.cgi?id=60166 (ntpd warnings): Do you have access to an "ordinary" server to see if this problem is limited to Labs?
[00:59:50] scfc_de: I don't think I have access to tail /var/log/syslog
[01:00:27] Wait, I do. I have sudo on logstash nodes
[01:00:36] * bd808 goes to check
[01:02:44] scfc_de: `grep ntpd /var/log/syslog` returns nothing on logstash1001.
[01:03:36] (03PS1) 10Rush: CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226
[01:04:06] bd808: Okay, thanks, then I'll submit a patch that only adds "interface listen ipv4" for Labs instances.
[01:04:20] cool
[01:11:28] (03CR) 10Dzahn: [C: 04-1] CST time period (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush)
[01:16:49] (03PS2) 10Rush: CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226
[01:18:22] (03CR) 10Dzahn: [C: 031] CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush)
[01:28:14] (03PS1) 10MaxSem: Detect mobile site by X-Subdomain instead of X-WAP [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118227
[01:33:01] (03PS3) 10Tim Landscheidt: Authorize rush for Icinga, add him to contact group sms and add CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush)
[02:03:58] (03PS1) 10Springle: S1 deploy db1061 db1062, S2 deploy db1063 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118228
[02:05:50] (03CR) 10Springle: [C: 032] S1 deploy db1061 db1062, S2 deploy db1063 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118228 (owner: 10Springle)
[02:10:25] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-12 02:10:25+00:00
[02:10:37] Logged the message, Master
[02:15:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[02:16:32] (03PS1) 10Springle: S1 uses only mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/118229
[02:17:54] (03CR) 10Springle: [C: 032] S1 uses only mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/118229 (owner: 10Springle)
[02:18:12] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-12 02:18:12+00:00
[02:18:23] Logged the message, Master
[02:26:43] (03PS1) 10Springle: DB nodes whitespace [operations/puppet] - 10https://gerrit.wikimedia.org/r/118230
[02:32:49] !log springle synchronized wmf-config/db-eqiad.php 's1 drop load during xtrabackup clone db1051 to db1061'
[02:32:57] Logged the message, Master
[02:35:18] PROBLEM - mysqld processes on db1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:35:28] PROBLEM - mysqld processes on db1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:36:11] oh, hi neon
[02:43:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Mar 12 02:42:59 UTC 2014 (duration 42m 58s)
[02:43:13] Logged the message, Master
[02:48:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[03:31:17] (03CR) 10Springle: [C: 032] DB nodes whitespace [operations/puppet] - 10https://gerrit.wikimedia.org/r/118230 (owner: 10Springle)
[03:33:23] (03PS1) 10Springle: Default innodb_file_per_table=OFF for S3 as that shard has many more tables than the others (OFF reduces file handle usage, among other things). [operations/puppet] - 10https://gerrit.wikimedia.org/r/118232
[03:35:01] (03CR) 10Springle: [C: 032] Default innodb_file_per_table=OFF for S3 as that shard has many more tables than the others (OFF reduces file handle usage, among other thin [operations/puppet] - 10https://gerrit.wikimedia.org/r/118232 (owner: 10Springle)
[04:49:32] (03PS1) 10Springle: import dbtree hotfix changes, recent and old, made on fenari [operations/software] - 10https://gerrit.wikimedia.org/r/118235
[04:50:33] (03CR) 10Springle: [C: 032] import dbtree hotfix changes, recent and old, made on fenari [operations/software] - 10https://gerrit.wikimedia.org/r/118235 (owner: 10Springle)
[04:52:34] !log springle synchronized wmf-config/db-eqiad.php 's1 db1051 warm up'
[04:52:43] Logged the message, Master
[05:15:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[05:17:45] (03PS1) 10Springle: MariaDB uses ARIA instead of MyISAM for implicit temporary tables that hit disk for whatever reason. The default 128M page cache is a bottleneck particularly on slaves in 'vslow' load balancing group. [operations/puppet] - 10https://gerrit.wikimedia.org/r/118236
[05:19:46] (03CR) 10Springle: [C: 032] MariaDB uses ARIA instead of MyISAM for implicit temporary tables that hit disk for whatever reason. The default 128M page cache is a bottlene [operations/puppet] - 10https://gerrit.wikimedia.org/r/118236 (owner: 10Springle)
[05:34:25] (03PS1) 10Springle: s1 pool db1061 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118237
[05:34:59] (03CR) 10Springle: [C: 032] s1 pool db1061 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118237 (owner: 10Springle)
[05:35:12] (03Merged) 10jenkins-bot: s1 pool db1061 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118237 (owner: 10Springle)
[05:38:48] !log springle synchronized wmf-config/db-eqiad.php 's1 pool db1061, warm up'
[05:38:57] Logged the message, Master
[05:49:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[06:18:03] !log xtrabackup clone db1018 to db1063
[06:18:16] Logged the message, Master
[08:16:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[08:40:15] mutante: just out of curiosity
[08:40:21] https://meta.wikimedia.org/w/index.php?title=Talk:Planet_Wikimedia&diff=next&oldid=5744292
[08:40:24] why the dot?
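The ntpd fix settled on at [00:59]–[01:04] narrows ntpd to IPv4 so Labs instances stop logging interface warnings. A minimal sketch of what such a change amounts to on a single instance, assuming ntpd 4.2.6+ syntax; the real fix went through the puppet template, whose path is not shown in the log:

```sh
# Restrict ntpd to IPv4 sockets, then restart and re-check syslog with
# the same grep bd808 used above.
echo 'interface listen ipv4' | sudo tee -a /etc/ntp.conf
sudo service ntp restart
grep ntpd /var/log/syslog | tail
```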
[08:46:22] hello
[08:49:06] hi hashar
[08:50:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[08:53:22] (03PS1) 10Matanya: apache :remove hardy var from appserver envvars [operations/puppet] - 10https://gerrit.wikimedia.org/r/118240
[09:11:37] (03PS1) 10Matanya: bckup: remove db10 from disklist [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241
[09:12:29] !log Jenkins broken again! Good morning.
[09:12:39] Logged the message, Master
[09:13:47] !log restarting Jenkins
[09:13:58] Logged the message, Master
[09:15:53] !log kill -9 of Jenkins since it is unresponsive
[09:16:15] Logged the message, Master
[09:21:30] hashar: time to move to travis?
[09:21:54] no third parties
[09:23:01] (03CR) 10Alexandros Kosiaris: [C: 032] apache :remove hardy var from appserver envvars [operations/puppet] - 10https://gerrit.wikimedia.org/r/118240 (owner: 10Matanya)
[09:23:18] morning joy :)
[09:23:38] (03PS1) 10Hashar: contint: prevent SISTRIX Crawler from browsing Jenkins [operations/puppet] - 10https://gerrit.wikimedia.org/r/118242
[09:23:40] (03PS1) 10Hashar: contint: deny access to RSS feeds on main Jenkins page [operations/puppet] - 10https://gerrit.wikimedia.org/r/118243
[09:23:51] akosiaris: got some more changes to merge for you if you dont mind ^^^^
[09:24:05] akosiaris: related to Jenkins being hit by web crawler and killing its webprocess
[09:24:33] another java app being killed by a web crawler... why am I not surprised ?
[09:25:21] :-]
[09:25:22] (03CR) 10Alexandros Kosiaris: [C: 032] contint: deny access to RSS feeds on main Jenkins page [operations/puppet] - 10https://gerrit.wikimedia.org/r/118243 (owner: 10Hashar)
[09:25:32] it is missing a pool counter on some heavy CPU operations
[09:25:48] + gallium has very slow IO since we upgraded it to Precise :-(
[09:26:10] (03CR) 10Alexandros Kosiaris: [C: 032] contint: prevent SISTRIX Crawler from browsing Jenkins [operations/puppet] - 10https://gerrit.wikimedia.org/r/118242 (owner: 10Hashar)
[09:26:36] huh ?
[09:26:46] how is that possible ?
[09:26:53] no idea
[09:27:10] various ops looked at it and we could not find the issue
[09:27:19] the workaround has been to migrate some heavy I/O operations to happen on tmpfs
[09:27:27] but Jenkins still uses the disk
[09:27:42] I guess that will be fixed up whenever gallium is replaced :]
[09:28:04] reloaded apache. thanks!
[09:31:27] :)
[09:33:53] hashar: only I/O ?
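A quick way to verify a crawler blacklist like the one merged at [09:26] from the outside; the URL is real, but the expected status codes and the exact User-Agent string are assumptions:

```sh
# Request the Jenkins UI with the crawler's User-Agent and print only
# the HTTP status: 403 means the env-var blacklist matched, 200 means
# it did not -- which is exactly what surfaces next in the log.
curl -s -o /dev/null -w '%{http_code}\n' \
    -A 'SISTRIX Crawler' https://integration.wikimedia.org/ci/
```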
[09:34:12] and the stupid regex does not work :-(
[09:42:14] (03PS1) 10Hashar: contint: typo in env var prevented blacklisting [operations/puppet] - 10https://gerrit.wikimedia.org/r/118244
[09:42:31] akosiaris: and the mandatory follow up, I made a typo sorry https://gerrit.wikimedia.org/r/118244
[09:43:29] (03CR) 10Alexandros Kosiaris: [C: 032] contint: typo in env var prevented blacklisting [operations/puppet] - 10https://gerrit.wikimedia.org/r/118244 (owner: 10Hashar)
[09:44:40] will restart Jenkins once again :(
[09:44:57] !log rerestarting Jenkins
[09:45:07] Logged the message, Master
[09:47:26] (03PS1) 10Springle: s2 pool db1060 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245
[09:50:03] (03PS2) 10Springle: s2 pool db1063 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245
[09:50:25] (03CR) 10Springle: [C: 032] s2 pool db1063 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245 (owner: 10Springle)
[09:50:32] (03Merged) 10jenkins-bot: s2 pool db1063 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118245 (owner: 10Springle)
[09:51:42] !log springle synchronized wmf-config/db-eqiad.php 's2 pool db1063, warm up'
[09:52:52] Logged the message, Master
[09:56:52] akosiaris: any point in having two files which are exactly the same in dsh group? apaches==apaches-eqiad
[09:57:12] i would remove the latter, unless i'm missing something
[09:57:25] they just became the same i suppose ?
[09:57:39] due to tamp apaches being shutdown just this week ?
[09:57:43] tampa*
[09:57:46] yes
[09:57:52] that is my point
[09:58:35] so, can i remove it? what do you recommend ?
[09:58:43] we are months away from having a new DC with apaches ready
[09:59:01] and i hope dsh will be out by then :)
[09:59:19] yeah, I hope too but I am not so optimistic
[09:59:41] will leave it for now
[09:59:53] * matanya is on a cleanup mania
[10:00:34] I never liked those different sets of files
[10:01:02] that would easily end up having say machine A but not B because someone forgot to update both
[10:01:23] anyway don't remove it for now, probably people still do dsh -g apaches
[10:01:34] sure
[10:01:36] and dsh -g apaches-eqiad
[10:01:53] you know what ? wanna link them ?
[10:02:05] so that both exist but only one needs to be updated ?
[10:02:13] and we split it up again if the need arises
[10:02:26] that was my first thought
[10:02:36] but i fear it will cause confusion
[10:02:56] if one will try to edit the link, it will be weird
[10:03:05] unless it is configured in puppet
[10:03:42] hmm let me see
[10:17:36] !log started s4 dump for toolserver on db72 /a
[10:17:46] Logged the message, Master
[10:19:58] (03PS1) 10Hashar: contint: retab to get rid of tabulations [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248
[10:29:11] MaxSem: hey, when https://gerrit.wikimedia.org/r/#/c/116147/ gets deployed, could you ping me?
[10:29:51] MaxSem: or just tell me now when it's scheduled for :) is it next tuesday?
[10:29:54] I always get confused
[10:30:13] paravoid, it will be on the normal train, so next Thursday it will be everywhere
[10:32:03] ok
[10:32:12] thanks
[10:32:15] do we store apache logs in logstash /ES ?
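On the dsh thread at [09:56]–[10:03]: the link akosiaris floats keeps both group names valid while leaving a single file to maintain. A minimal sketch, assuming the stock /etc/dsh/group layout (the actual group directory on the deployment host is not shown in the log):

```sh
# Make apaches-eqiad an alias of apaches so only one host list is ever
# edited; splitting them again later is just replacing the symlink
# with a real file.
cd /etc/dsh/group
sudo ln -sf apaches apaches-eqiad
dsh -g apaches -- uptime          # both invocations now read
dsh -g apaches-eqiad -- uptime    # the same host list
```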
[10:32:27] (03CR) 10MaxSem: [C: 032] Detect mobile site by X-Subdomain instead of X-WAP [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118227 (owner: 10MaxSem)
[10:32:35] (03Merged) 10jenkins-bot: Detect mobile site by X-Subdomain instead of X-WAP [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118227 (owner: 10MaxSem)
[10:32:37] heh I was about to ask that
[10:32:48] meanwhile, small preparation^^^
[10:32:59] and then we need to wait a month to remove it from varnish... :)
[10:33:10] since existing cache objects Vary on X-WAP
[10:33:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:33:41] fun
[10:34:16] i guess ori would know, but he seems absent
[10:34:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.742 second response time
[10:35:15] matanya: ori works from home this week and most probably went back to a normal sleep schedule
[10:35:40] i don't think he knows what that means
[10:36:08] (03PS1) 10Faidon Liambotis: varnish: don't set X-WAP on mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249
[10:36:45] paravoid: btw, would love if you can do security scanning on https://gerrit.wikimedia.org/r/#/c/116945/
[10:36:52] (03CR) 10Faidon Liambotis: [C: 04-2] "To be merged 31 days after March 20th." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249 (owner: 10Faidon Liambotis)
[10:36:54] (03PS1) 10ArielGlenn: snapshot module in use on a pmtpa snap host [operations/puppet] - 10https://gerrit.wikimedia.org/r/118250
[10:37:40] matanya: sean would be the person to review this, adding him
[10:37:42] paravoid, can we just ban objects with X-WAP: yes?
[10:37:47] thanks
[10:38:45] MaxSem: that's not the point, the point is that even for X-WAP: no objects, MW returned Vary: X-WAP, so Varnish has recorded that in the cache objects it stored
[10:39:11] mmm, and will result in cache miss...
[10:39:15] so when a request comes without an X-WAP header, it'll assume that the cache object isn't valid for this request and will do a fetch from the backend
[10:39:24] yup
[10:39:40] (03CR) 10ArielGlenn: [C: 032] snapshot module in use on a pmtpa snap host [operations/puppet] - 10https://gerrit.wikimedia.org/r/118250 (owner: 10ArielGlenn)
[10:39:40] if we only had a hex editor for cache:P
[10:39:59] so we need to have mediawiki just emit cache objects with no Vary: X-WAP for our TTL
[10:40:04] then remove the header
[10:40:09] that's 31 days, apparently
[10:40:55] by the way, was there any noticeable change in cache hit ratio when we killed WAP in VCL?
[10:41:08] !log maxsem synchronized wmf-config/mobile.php 'https://gerrit.wikimedia.org/r/118227'
[10:41:13] I didn't observe it but I wouldn't suppose so
[10:41:26] who uses WAP anyway...
[10:41:39] Logged the message, Master
[10:42:55] well, before I fixed that regex in September, WAP was served in ~10% of requests
[10:43:12] after that, in 0.1% indeed
[10:44:29] lol, really?
[10:44:30] 10%?
[10:44:56] (The "channel logs" link in the topic doesn't work btw.)
[10:46:15] it does pajz, you need to remove the %
[10:47:36] matanya, hm? http://ur1.ca/edq22 gives me "The requested URL /~wm-bot/logs/#wikimedia-operations/ was not found on this server"
[10:50:21] oh, i see now pajz weird, it was open on my browser, didn't notice it is old
[10:50:25] sorry
[10:51:25] pajz: http://tools.wmflabs.org/wm-bot/logs/
[10:54:52] pajz: see http://lists.wikimedia.org/pipermail/wikitech-l/2014-March/075102.html
[10:55:44] I see, thanks.
[10:58:09] (03PS1) 10Alexandros Kosiaris: volatile puppet reqs should go to frontend only [operations/puppet] - 10https://gerrit.wikimedia.org/r/118251
[11:00:43] (03CR) 10Alexandros Kosiaris: [C: 032] volatile puppet reqs should go to frontend only [operations/puppet] - 10https://gerrit.wikimedia.org/r/118251 (owner: 10Alexandros Kosiaris)
[11:04:10] (03PS1) 10ArielGlenn: fix snapshot module dependency cycle [operations/puppet] - 10https://gerrit.wikimedia.org/r/118252
[11:06:35] (03CR) 10ArielGlenn: [C: 032] fix snapshot module dependency cycle [operations/puppet] - 10https://gerrit.wikimedia.org/r/118252 (owner: 10ArielGlenn)
[11:17:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[11:27:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Quite the contrary. It is a breaking change with error:" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya)
[11:32:21] (03PS7) 10Matanya: webserver: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454
[11:33:21] (03PS6) 10Alexandros Kosiaris: ganglia_new: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107128 (owner: 10Matanya)
[11:48:53] (03PS1) 10ArielGlenn: all snapshot hosts using snapshot module [operations/puppet] - 10https://gerrit.wikimedia.org/r/118258
[11:50:56] (03CR) 10ArielGlenn: [C: 032] all snapshot hosts using snapshot module [operations/puppet] - 10https://gerrit.wikimedia.org/r/118258 (owner: 10ArielGlenn)
[11:51:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[11:55:01] (03PS1) 10ArielGlenn: remove snapshots.pp manifest, no longer used [operations/puppet] - 10https://gerrit.wikimedia.org/r/118260
[11:56:49] (03CR) 10ArielGlenn: [C: 032] remove snapshots.pp manifest, no longer used [operations/puppet] - 10https://gerrit.wikimedia.org/r/118260 (owner: 10ArielGlenn)
[12:07:28] hmm let me see <-- did you see something interesting ?
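Back to the mobile-cache thread at [10:38]–[10:40]: the Vary interaction paravoid describes can be observed from the edge with curl. While stored objects still carry Vary: X-WAP, a request lacking the header cannot be answered from an object cached with it. The hostname is real; which headers the cache echoes back is an assumption:

```sh
# Fetch the same page with and without an X-WAP header and compare the
# Vary and X-Cache response headers; under Vary: X-WAP the two requests
# hit different cache objects even though the bodies are identical.
curl -sI -H 'X-WAP: no' http://en.m.wikipedia.org/wiki/Main_Page | grep -iE '^(vary|x-cache)'
curl -sI http://en.m.wikipedia.org/wiki/Main_Page | grep -iE '^(vary|x-cache)'
```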
[12:58:52] (03PS1) 10Cmjohnson: Relocaing mw1201 -1203, This is the dns change [operations/dns] - 10https://gerrit.wikimedia.org/r/118270
[12:59:19] !log shutting down and relocating mw1201, mw1202, mw1203 to d5-eqiad
[13:00:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:01:12] Logged the message, Master
[13:03:38] PROBLEM - Host mw1201 is DOWN: PING CRITICAL - Packet loss = 100%
[13:04:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.158 second response time
[13:05:38] PROBLEM - HTTP on fenari is CRITICAL: Connection timed out
[13:07:28] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.073 second response time
[13:07:38] PROBLEM - Host mw1202 is DOWN: PING CRITICAL - Packet loss = 100%
[13:08:08] (03PS2) 10Faidon Liambotis: Relocating mw1201-1203 to row D [operations/dns] - 10https://gerrit.wikimedia.org/r/118270 (owner: 10Cmjohnson)
[13:10:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:12:08] PROBLEM - Host mw1203 is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:23] wikitech times out on me
[13:13:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.314 second response time
[13:27:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:28] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.022 second response time
[13:31:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:38:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.924 second response time
[13:41:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:43:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:44:28] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 4.471 second response time
[13:49:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.836 second response time
[13:50:52] (03CR) 10Ottomata: "Ah thanks Alex, I saw the numbers change but not the booleans." [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya)
[13:52:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.265 second response time
[13:57:06] apergos: do you know the status of https://rt.wikimedia.org/Ticket/Display.html?id=4196 ?
[13:58:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.494 second response time
[14:02:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:08:28] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 5.430 second response time
[14:13:13] matanya: nope, never heard an update after I asked
[14:13:25] too bad :/
[14:15:51] Coren: see virt0 alerts above; also, can you fix puppet on labstore4 & labstore1001?
[14:16:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:16:48] paravoid: They're disabled on purpose during migration; to be reenabled/killed off next week. Ima go see why the http is flapping on virt0 now.
[14:17:38] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 6.908 second response time
[14:17:46] I told you last time, we don't disable puppet for large periods of time
[14:17:51] do your changes in puppet
[14:18:58] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[14:19:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.126 second response time
[14:20:17] Oh fun; wikitech is being hammered by some sort of spambot.
[14:20:30] (03CR) 10Cmjohnson: [C: 032] Relocating mw1201-1203 to row D [operations/dns] - 10https://gerrit.wikimedia.org/r/118270 (owner: 10Cmjohnson)
[14:20:59] Coren: why aren't those changes done in puppet?
[14:21:48] !log deploying new swift ring @ eqiad, setting weight from 100 to 2000 on all disks
[14:22:11] (so that we can add the new boxes with weight 3000)
[14:22:25] this needs a rebalance, so it'll take some time
[14:22:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:41] Reassigned 442 (0.67%) partitions. Balance is now 0.83.
[14:22:45] mark: The short of it: because the settings are mutually incompatible. I either need a shitton of conditional crap for two weeks, or disable puppet on at least one. (On the other, I could actually just add the temporary ssh keys and reenable puppet)
[14:23:05] you can at least disable all the labs classes and keep base active
[14:23:15] that's better than disabling puppet completely
[14:23:15] mark: Ah, yes, that I could do.
[14:23:36] * Coren goes do that then.
[14:23:39] thanks
[14:23:56] :)
[14:24:40] Logged the message, Master
[14:28:35] hi bd808, do you know if apache logs are in elastic search?
[14:29:07] i.e. brought there by logstash
[14:29:22] matanya: The fatals log is in logstash but not the access log
[14:30:23] thanks bd808, i'm asking in regard to files/icinga/check_bad_apaches
[14:30:32] * bd808 looks
[14:30:55] which might better be replaced by logstash nagios plugin check instead of this hacky way
[14:31:01] !log springle synchronized wmf-config/db-eqiad.php 's2 db1063 full steam'
[14:32:54] matanya: That script counting segfaults and ??? per host and reporting those with >10 occurrences?
[14:33:28] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.030 second response time
[14:33:37] I'm not sure what event writes "Allowedd" into the apache log stream
[14:33:37] not sure what is Allowedd
[14:33:52] does it exist in prod's syslog ?
[14:34:45] Ryan_Lane: wrote this check, might be wise to ask him
[14:35:09] matanya: The apache log feed in logstash does show segfaults. I made a dashboard for that.
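Stepping back to the swift ring change logged at [14:21]: raising every existing weight from 100 to 2000 leaves headroom to add the new boxes at 3000 without handing them a disproportionate share of partitions. A hedged sketch of the ring-builder flow; the builder filename, zone, address and device are illustrative, and only the weights come from the log:

```sh
# Reweight an existing device (repeated, or looped, per device), add a
# device on one of the new boxes, then rebalance; rebalance prints the
# "Reassigned ... partitions" summary quoted in the log.
swift-ring-builder object.builder set_weight d0 2000
swift-ring-builder object.builder add z4-10.64.0.100:6000/sda1 3000
swift-ring-builder object.builder rebalance
```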
[14:35:26] anyway, it makes sense to me to replace those with alert
[14:35:37] But logstash needs work before it's ready to be critical infrastructure
[14:35:47] just pointing out
[14:36:06] i think it might be a wise step in the right direction
[14:36:09] Specifically this HA bug needs a resolution https://bugzilla.wikimedia.org/show_bug.cgi?id=61785
[14:36:13] * bd808 nods
[14:36:38] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:36:38] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:36:50] logstash -> graphite -> icinga would be a good toolchain to have working
[14:37:24] yes, definitely
[14:38:05] apergos: Did my clean up of l10n cache files yesterday help with your disk space issues?
[14:38:20] greg-g: around?
[14:39:21] as for HA, in my $day_job i multicast for the shippers, so any server in the logging cluster that is listening gets the message
[14:40:00] and a small tool verifies that ES doesn't get duplicated messages in the ES gateway
[14:40:13] (03CR) 10Springle: [C: 04-2] "-2 only until I can test and roll this out near the start of my day, then be around to watch icinga :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116945 (owner: 10Matanya)
[14:40:45] springle: -1 would be enough :P
[14:41:25] heh ok.. -1 in web ui says there is a problem, -2 says do not submit
[14:41:28] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.460 second response time
[14:41:40] matanya: That would be nice. The way that we send from udp2log right now has "issues" in that a single log event can be split across multiple packets and has to be reassembled in logstash. It would be nice to replace that too.
[14:42:28] though the kafka solution will probably work too
[14:43:09] It needs Ops love. I won't have time to think about it for at least a month :(
[14:43:47] is anyone deploying?
[14:43:55] * bd808 is not
[14:44:04] what are we deploying?
[14:44:16] since i know nothing about udp2log, i can't really replace it :)
[14:44:20] aude: We? Nothing... but I broke some stuff in CentralAuth :P
[14:44:29] ok :)
[14:44:31] Not myself, but by code review :/
[14:45:11] ok, will go for it
[14:48:02] matanya: I hooked into udp2log because it was there. :) The logs I'm getting from it right now are files on fluorine that could be processed with a different shipper.
[14:48:49] bd808: would it be better to use logstash instead ?
[14:50:37] matanya: Possibly, or something a little lighter like lumberjack or beaver
[14:50:40] bd808: yes, it looks much better, thank you
[14:51:11] apergos: Cool. I'll try to make that cleanup a regular part of the deploy train
[14:51:13] daw, wikitech down?
[14:51:15] lumberjack/beaver makes a lot of sense to me
[14:51:32] ottomata: being hit by spambots
[14:51:42] Blame springle for being around
[14:52:06] wasn't me
[14:52:11] (whatever it is)
[14:52:18] :D
[14:52:20] man spambots, I'm trying to use that
[14:52:22] buncha jerks
[14:52:40] What for? https://wikitech-static.wikimedia.org/ any use?
[14:52:56] naw, i was trying to edit and upload files
[14:52:58] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[14:53:40] !log hoo synchronized php-1.23wmf17/extensions/CentralAuth/ 'Fix global account deletion'
[14:54:08] !log syncing to mw120[1-3] failed
[14:54:19] error?
[14:54:21] oh no
[14:54:25] mw1202: ssh: connect to host mw1202 port 22: Connection timed out
[14:54:33] mw1201: ssh: connect to host mw1201 port 22: No route to host
[14:54:37] mw1203: ssh: connect to host mw1203 port 22: Connection timed out
[14:54:38] thanks for fixing this hoo
[14:54:39] hoo: mw1201-1203 are down right
[14:54:40] now
[14:54:45] ah ok :)
[14:55:33] mh, the bot logging to wikitech is down?
[14:55:41] wikitech is down
[14:55:58] :P
[14:56:01] ah...
[14:56:54] And there you go
[15:00:38] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.093 second response time
[15:02:45] RECOVERY - Host mw1201 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[15:02:55] PROBLEM - twemproxy process on mw1201 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker
[15:02:55] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:03:05] PROBLEM - Apache HTTP on mw1201 is CRITICAL: Connection refused
[15:03:15] PROBLEM - NTP on mw1201 is CRITICAL: NTP CRITICAL: Offset unknown
[15:03:26] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:05:15] RECOVERY - NTP on mw1201 is OK: NTP OK: Offset -0.001842737198 secs
[15:05:36] (03PS1) 10coren: Disable labsness of labstore[34] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118287
[15:06:15] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.861 second response time
[15:07:15] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.075 second response time
[15:09:35] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.687 second response time
[15:12:25] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:35] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:38] (03CR) 10coren: [C: 04-1] "Having created the user in LDAP (which is indeed the correct thing to do), it's not necessary to create it locally with generic::systemuse" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar)
[15:13:45] (03CR) 10coren: [C: 032] "This patch's primary feature is to shut icinga up." [operations/puppet] - 10https://gerrit.wikimedia.org/r/118287 (owner: 10coren)
[15:18:05] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Mar 12 15:17:58 UTC 2014
[15:21:25] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 3.920 second response time
[15:22:09] (03PS1) 10Cmjohnson: Renaming db1065 and 1065 to dbstore1001 and dbstore1002 [operations/dns] - 10https://gerrit.wikimedia.org/r/118290
[15:22:15] (03PS1) 10Alexandros Kosiaris: Silence netmapper_update cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/118291
[15:22:58] (03CR) 10Cmjohnson: [C: 032] Renaming db1065 and 1065 to dbstore1001 and dbstore1002 [operations/dns] - 10https://gerrit.wikimedia.org/r/118290 (owner: 10Cmjohnson)
[15:25:28] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time
[15:25:28] akosiaris: regarding carbon. I added 2 3TB disks to it that you can manipulate however you need. The OS is still on the same 2 500GB disks.
[15:26:25] I still have tftp going to brewster from eqiad for the time being. Let me know what else you need
[15:26:27] cmjohnson1: OK thanks. I'll install the OS on those two 3TB disks. I might need you to unplug at some point those 2 500Gb disks and plug the other 2 3TB.
[15:26:38] cmjohnson1: ok will do
[15:27:01] akosiaris: read the comment from paravoid https://rt.wikimedia.org/Ticket/Display.html?id=6801
[15:27:18] yeah I had the same problem with labsdbs
[15:27:27] I have a working config for GPT already
[15:27:59] oh..cool...do you want me to add the other 2 3TB disks we have...rob purchased 4 total for it.
[15:28:25] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time
[15:28:26] sure. When do you want to do it ?
[15:28:39] I can do it now if you like
[15:28:47] go ahead then :-)
[15:32:25] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10)
[15:33:37] RECOVERY - Host mw1202 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[15:33:37] RECOVERY - Host mw1203 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[15:35:37] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:36:18] akosiaris: new disks are there..all yours
[15:36:27] PROBLEM - twemproxy process on mw1203 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker
[15:36:27] PROBLEM - twemproxy process on mw1202 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker
[15:38:47] thanks
[15:43:12] bblack: mind if I merge https://gerrit.wikimedia.org/r/#/c/118291 ? I suppose you are not using those emails for something right ?
[15:43:59] !log ms-be1005 going down to fix mgmt
[15:44:10] ACKNOWLEDGEMENT - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris reinstalling with precise and more disks
[15:46:37] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:27] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:47:35] (03PS3) 10Hashar: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071
[15:48:17] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[15:49:08] welcome back carbon
[15:50:11] akosiaris: if carbon is ok now, can i move brewster stuff to it?
[15:50:24] it is not ok
[15:50:37] PROBLEM - Squid on carbon is CRITICAL: Connection timed out
[15:50:37] PROBLEM - HTTP on carbon is CRITICAL: No route to host
[15:50:57] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[15:52:27] (03PS2) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[15:57:27] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 8.835 second response time
[15:59:27] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.419 second response time
[15:59:58] hoo|away: I am now :)
[16:04:27] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
[16:04:47] (03CR) 10MaxSem: [C: 031] varnish: don't set X-WAP on mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249 (owner: 10Faidon Liambotis)
[16:07:52] akosiaris: yes, go ahead! :)
[16:08:16] (03CR) 10Alexandros Kosiaris: [C: 032] Silence netmapper_update cron [operations/puppet] - 10https://gerrit.wikimedia.org/r/118291 (owner: 10Alexandros Kosiaris)
[16:08:28] who would be the person to ask about ldap ?
[16:08:46] i want to move formey's ldap stuff to eqiad
[16:08:54] any machine for that?
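On the carbon reinstall above: paravoid's RT comment and akosiaris's "working config for GPT" both point at the same constraint, namely that msdos partition tables stop at 2 TiB, so the new 3 TB disks need GPT labels (in the installer, via a partman recipe). A minimal manual equivalent, with an illustrative device name:

```sh
# Label a >2TiB disk with GPT and create one full-size partition; an
# msdos label would silently cap the usable space at 2TiB.
sudo parted -s /dev/sdc mklabel gpt
sudo parted -s /dev/sdc mkpart primary ext4 1MiB 100%
sudo mkfs.ext4 /dev/sdc1
```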
[16:09:54] !log initiating controlled shutdown of analytics1021 kafka broker to do some load testing and also fix runtime java version
[16:10:02] Logged the message, Master
[16:13:37] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 25.0
[16:13:57] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[16:14:54] (^^FYI, I know, I'm doing this on purpose! :) )
[16:15:42] reedy: twemproxy won't start on mw1202/mw1203...i tried manually starting and it failed...any ideas?
[16:16:17] PROBLEM - DPKG on analytics1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:22:17] RECOVERY - DPKG on analytics1021 is OK: All packages OK
[16:22:18] cmjohnson1: file /usr/local/apache/common/wmf-config/twemproxy-eqiad.yaml doesn't exist
[16:24:34] cmjohnson1: Reedy killed that file in https://gerrit.wikimedia.org/r/#/c/116036/ ? Not sure why.
[16:26:14] maybe because the tmh server is in tampa but will wait till he returns..thx
[16:27:57] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[16:28:17] (03CR) 10Manybubbles: "Its been a while and we don't really have a solution yet? Is it still coming very soon? Can/should we hedge our bets on when very soon w" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95333 (owner: 10Matanya)
[16:28:51] cmjohnson1: There is still a twemproxy.yaml. Sam may have assumed that "normal" multiversion file loading was used for twemproxy. I don't know anything technical about it.
[16:29:39] If it was php using multiversion the code would ask for the realm specific version of the twemproxy.yaml file and that would look for twemproxy-<suffix>.yaml and fall back to twemproxy.yaml.
[16:30:27] Where <suffix> could be (production,labs) and/or (eqiad,pmtpa,…)
[16:32:57] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0
[16:41:37] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[16:48:16] cmjohnson1: bd808 https://noc.wikimedia.org/conf/highlight.php?file=twemproxy.yaml
[16:48:47] * bd808 nods
[16:48:56] If there isn't a datacentre suffixed version, it should fall back to using that
[16:48:59] "should"
[16:49:19] IIRC the eqiad and that one (and maybe even the tampa one) were identical
[16:49:44] Of course, that's in MediaWiki land
[16:49:51] in twemproxy land... it might be looking for the suffixed version
[16:49:58] I think that "should" is the bit that may be under question. Where's the twemproxy code?
Aha
[16:51:03] modules/generic/files/upstart/twemproxy.conf: [ -f "/usr/local/apache/common/wmf-config/twemproxy-$(cat /etc/wikimedia-site)".yaml ] || { stop; exit 0; }
[16:51:28] I'll move it back
[16:51:33] sweet
[16:52:51] Reedy: Or you could fix the upstart script to do a multiversion style fallback
[16:54:15] (03PS1) 10Reedy: twemproxy isn't mediawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118296
[16:54:19] Maybe/probably should do that as well
[16:54:39] (03PS4) 10coren: beta: skip l10nupdate user/group creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar)
[16:54:51] (03CR) 10Reedy: [C: 032] twemproxy isn't mediawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118296 (owner: 10Reedy)
[16:54:59] (03Merged) 10jenkins-bot: twemproxy isn't mediawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118296 (owner: 10Reedy)
[16:55:57] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2128.57690698
[16:56:00] !log reedy synchronized wmf-config/ 'I6f1f3f8af5b97aa0e537fbae308ce27b28071894'
[16:56:09] Logged the message, Master
[16:56:42] mw1201: @ERROR: access denied to common from mw1201.eqiad.wmnet (10.64.48.33)
[16:56:49] !log reedy synchronized docroot and w
[16:56:51] mw1021, mw1022, mw1023
[16:56:57] Logged the message, Master
[16:57:15] mw1201, mw1202, mw1203
[16:57:17] even
[17:02:09] !log reedy synchronized wmf-config/ 'noop'
[17:02:16] Logged the message, Master
[17:03:03] mw1201: @ERROR: access denied to common from mw1201.eqiad.wmnet (10.64.48.33)
[17:03:03] mw1201: rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
[17:03:03] mw1202: @ERROR: access denied to common from mw1202.eqiad.wmnet (10.64.48.34)
[17:03:03] mw1202: rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
[17:03:03] mw1203: @ERROR: access denied to common from mw1203.eqiad.wmnet (10.64.48.35)
[17:03:04] mw1203: rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
[17:05:47] (03PS1) 10Jgreen: relocating drush on aluminium from /srv/drush to /usr/local/drush [operations/puppet] - 10https://gerrit.wikimedia.org/r/118298
[17:07:25] (03CR) 10Jgreen: [C: 032 V: 031] relocating drush on aluminium from /srv/drush to /usr/local/drush [operations/puppet] - 10https://gerrit.wikimedia.org/r/118298 (owner: 10Jgreen)
[17:14:24] greg-g: we will deploy, just running a few min behind
[17:15:07] yurikR: :) k
[17:15:53] root@mw1202:~# /sbin/start twemproxy
[17:15:53] start: Job failed to start
[17:16:21] reedy ^
[17:17:06] cmjohnson1: I guess related to those rsync errors above
[17:17:19] /usr/local/apache/common/wmf-config/ isn't up to date
[17:18:09] (03PS1) 10Tim Landscheidt: ntp: Work around Labs network error [operations/puppet] - 10https://gerrit.wikimedia.org/r/118301
[17:18:10] sync-common is the usual fix for that, but that won't work on those machines
[17:18:21] Reedy: should yurikR hold off for now?
[17:18:57] Nope, should be no need. It's just 3 specific apaches with issues
[17:19:01] kk
[17:20:16] Let's see
[17:20:19] reedy: i moved them to a new row and they have different ip's now.. that shouldn't affect anything
[17:20:20] They've been moved racks
[17:23:24] Reedy: https://gerrit.wikimedia.org/r/#/c/101889/
[17:23:30] will this be merged or should I abandon it?
[17:23:31] hosts allow = 10.0.0.0/16 10.64.0.0/22 10.64.16.0/24 208.80.152.0/22 10.64.32.0/22
[17:24:14] (03Abandoned) 10Odder: Remove Flow from Meta-Wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115412 (owner: 10Odder)
[17:24:16] odder: dot in "./modules/planet/templates/feeds/"? By that i want to say "when you're in the operations/puppet repo root directory", because that's what you'd use with "cd"
[17:24:31] cmjohnson1: /etc/rsyncd.conf needs updating on tin
[17:24:53] doesn't allow 10.64.48.33
[17:26:03] I think..
[17:26:39] (03CR) 10Hashar: "Not sure /home/l10nupdate really need to be defined in puppet since I have created the l10nupdate user in wikitech :-]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118071 (owner: 10Hashar)
[17:27:42] !log yurik synchronized php-1.23wmf16/extensions/ZeroRatedMobileAccess/
[17:27:44] hashar: hi, i'll merge your retab change when you're ready, looks very no-op to me
[17:27:49] Logged the message, Master
[17:27:54] mutante: should be :]
[17:28:00] hashar: and i have another question after that:)
[17:28:27] Reedy: was syncing and got access denied on 3 servers, known?
[17:28:32] [17:25:35] That ACL list comes from misc::deployment::scap_primary in manifests/misc/deployment.pp
[17:28:36] yurikR: yep
[17:28:37] yurikR: Yup, what we're discussing
[17:28:46] sry, was worried i crashed prod :)
[17:29:01] in the rsyncd.conf file, that range for row D is not there
[17:29:11] as allowed
[17:29:52] (03CR) 10Dzahn: [C: 031] "looks very no-op to me, any time hashar is online:)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248 (owner: 10Hashar)
[17:29:57] mutante: what you wanted to ask?
[17:30:51] cmjohnson1: The CIDR range for row D needs to be put in the rsync ACL list from misc::deployment::scap_primary in manifests/misc/deployment.pp
[17:31:15] hashar: last night i looked closer at the "experimental" jenkins tests after i merged a "convert to module" change.. and it lists a couple of our modules as "are NOT passing tests"
[17:31:24] * bd808 is lost in a maze of twisty little passages
[17:31:24] hashar: and i want to find out what is wrong with them
[17:31:39] hosts_allow => ['10.0.0.0/16', '10.64.0.0/22', '10.64.16.0/24', '208.80.152.0/22', '10.64.32.0/22'];
[17:31:45] those are bacula,install-server,mysql,nrpe,openstack,postgresql and stdlib
[17:31:53] the others are apparently passing all tests
[17:32:03] mutante: yeah that is something we attempted to work on with alexandros and andrew B a while back
[17:32:14] that is rspec tests
[17:32:28] (03Abandoned) 10Odder: Enable web fonts by default on Hebrew Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115153 (owner: 10Odder)
[17:32:30] do we already know what the actual errors are?
[17:32:37] or, where do i get more detail for that
[17:32:39] the calling procedure I think
[17:32:46] well
[17:32:58] there are not any real errors (aka on my laptop the rake tests are OK)
[17:33:27] but we have not finalized what jenkins expects to find and how to call rake etc etc
[17:33:33] aha
[17:33:54] some differences between stdlib (which I suppose we need to update at some point)
[17:33:55] (03Abandoned) 10Odder: Add Malayalam aliases for NS_MODULE, NS_MODULE_TALK [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101889 (owner: 10Odder)
[17:33:58] and the rest
[17:34:00] well, i think in that case, you already convinced me to not worry right now:)
[17:34:20] (03PS1) 10Reedy: Add 10.64.48.0/22 to scap allowed hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/118309
[17:34:25] cmjohnson1: ^
[17:34:36] but please leave the experimental tests in, it was nice to see how they did not fail for my module
[17:34:37] mutante: it is a quite cool concept though, albeit a bit overly verbose
[17:34:42] yes
[17:34:48] reedy cool...was about to do it..so thx
[17:35:01] mutante: did you even have rspec tests in that module ?
[17:35:35] akosiaris: no :p i see
[17:35:45] :-)
[17:35:52] !log yurik synchronized php-1.23wmf17/extensions/ZeroRatedMobileAccess/
[17:35:54] nothing to fail then :-)
[17:35:58] akosiaris: but i got this screen https://integration.wikimedia.org/ci/job/operations-puppet-spec/4387/
[17:36:00] Logged the message, Master
[17:36:06] and it's called puppet-spec
[17:37:02] greg-g: done
[17:37:02] and it says FAILURE, wth, i was sure i saw one with just SUCCESS
[17:37:34] yeah i am gonna fix those at some point
[17:37:42] got to follow the stream i suppose
[17:38:37] hashar: oh, one more question about doc.wm.org. is there a feature to make it forget old stuff?
[17:38:46] yurikR: cool
[17:39:27] mutante: what do you mean? :-]
[17:39:30] hashar: here's what i mean, when i go to module "planet" there, it shows old things, that don't actually exist in git anymore
[17:39:37] but just existed in the past
[17:39:47] so i guess what is needed is some kind of refresh
[17:40:38] hashar: f.e. under "planet" there is class "planet::venus" and "planet::languages" those are not actual classes anymore, but the others are
[17:40:49] mutante: maybe the job is broken
[17:41:16] hashar: want Bugzilla?
[17:41:22] mutante: https://integration.wikimedia.org/ci/job/operations-puppet-doc/
[17:41:24] apparently it works
[17:41:58] hashar: https://doc.wikimedia.org/puppet/files/srv/org/wikimedia/doc/puppetsource/modules/planet/manifests/venus_pp.html
[17:42:09] last update: Wed Apr 10 12:13:03 +0000 2013
[17:42:46] hashar: i probably messed it up because it was a module in the past, it got reverted, and later it came back
[17:42:59] as a different thing but with the same name
[17:43:19] there were months in between though
[17:44:12] hashar: but i can totally put that on a bug and leave you alone with it for now and we merge the retab:)
[17:45:26] mutante: I have no clue honestly
[17:45:49] cmjohnson1: If you merge, deploy, run puppet on tin.. You should just need to run "sync-common -D master_rsync:tin.eqiad.wmnet" on the 3 apaches before trying to start twemproxy again
[17:45:51] https://doc.wikimedia.org/puppet/modules/fr_planet.html was last touched on April 10th
[17:46:28] almost a year :-)
[17:46:45] yeah
[17:46:47] yes, that was the "old" module
[17:46:51] that broke things
[17:46:52] I have no idea why it gets included though
[17:46:55] and now it's the new module
[17:46:59] but we're using the same name
[17:47:04] and that confused it somehow
[17:47:12] (03CR) 10Cmjohnson: [C: 032] Add 10.64.48.0/22 to scap allowed hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/118309 (owner: 10Reedy)
[17:47:31] maybe puppet rdoc is smart enough to include html files on the disk
[17:47:44] that's what made me say "make it forget old things"
[17:47:49] some kind of cache files?
[17:47:57] no idea
[17:48:02] ok, let's worry later?
[17:48:14] i dont want to look at puppet rdoc code
[17:48:25] i mean, the existing classes are in docs. that's what matters
[17:48:31] understood
[17:48:34] we could generate the doc to some staging directory then use rsync --delete to get rid of the old files
[17:49:38] find with 'mtime'?
[17:49:41] mutante: the job is named operations-puppet-doc and the shell script is defined in jenkins job builder at http://git.wikimedia.org/blob/integration%2Fjenkins-job-builder-config.git/df4e21412cb044cd53cc846f40ba675a4c40b77b/operations-puppet.yaml#L80
[17:50:26] hashar: ok!
[17:50:47] PROBLEM - SSH on carbon is CRITICAL: Connection refused
[17:50:57] the class is no longer in puppet but puppet rdoc seems to look at the HTML files existing on disk and include them
[17:51:24] so yeah, we need to generate doc to a temp dir then rsync --delete to the target dir ( /srv/org/wikimedia/doc/puppetsource )
[17:51:27] PROBLEM - NTP on carbon is CRITICAL: NTP CRITICAL: No response from NTP server
[17:51:29] file a bug :-]
[17:51:29] i can just delete the HTML then?
[17:51:36] will do, nod
[17:51:45] let's move on to next thing:)
[17:51:57] i will file it
[17:53:02] so yea, ready to look at puppet run on gallium? wanna get it done with?
[17:53:27] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[17:53:37] i can also do it, i don't expect any changes
[17:54:34] fleeing sorry
[17:54:45] mutante the file{} change is harmless go for it :]
[17:54:47] *wave*
[17:55:23] the file{} change? now which one did he mean.. ok...
[17:58:05] reedy: i see this on the puppet update on tin http://p.defau.lt/?JV2GWmCuFxfXqEn3OI6Vrg
[18:02:27] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC
[18:03:27] RECOVERY - Disk space on virt10 is OK: DISK OK
[18:07:47] cmjohnson1: Did you try running `/etc/init.d/rsync start` manually to see if it explains the error? Or look for a log file?
[18:13:33] bd808...thx that fixed it
[18:13:57] RECOVERY - twemproxy process on mw1201 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[18:17:39] (03PS5) 10Ottomata: Adding archiva module [operations/puppet] - 10https://gerrit.wikimedia.org/r/117024
[18:21:27] RECOVERY - twemproxy process on mw1203 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[18:22:27] RECOVERY - twemproxy process on mw1202 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[18:23:18] I am having massive speed issues on wiki today... any idea what it is? (Mostly on meta right now) in theory it could be the office internet but it's been happening for a couple days now (worse today than it has been) and not on any other sites.
[18:25:15] jamesofur: Meta's been slower for me too recently - but hey - always blame office internet :p
[18:25:49] It's really nasty :-/ pages are taking multiple minutes to load or save and sometimes just stopping and timing out
[18:26:15] It's mostly trouble saving with me. Browsing is not too bad.
[18:26:18] I know people on enWiki have been reporting issues the past week but they are all on the east coast so not sure same issue
[18:29:10] (03PS1) 10Cmjohnson: adding dbstore1001 and 1002 to dhcp file [operations/puppet] - 10https://gerrit.wikimedia.org/r/118324
[18:29:47] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100%
[18:29:50] (03CR) 10Cmjohnson: [C: 032] adding dbstore1001 and 1002 to dhcp file [operations/puppet] - 10https://gerrit.wikimedia.org/r/118324 (owner: 10Cmjohnson)
[18:31:31] (03CR) 10Brion VIBBER: [C: 04-1] Set up configuration for App indexing (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:36:30] (03PS3) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[18:37:40] (03CR) 10Denny Vrandecic: "Removed superfluous scheme." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:37:43] (03CR) 10Hoo man: [C: 04-1] Set up configuration for App indexing (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:39:52] (03PS4) 10Denny Vrandecic: Set up configuration for App indexing [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220
[18:40:10] (03CR) 10Denny Vrandecic: "Fixed whitespace." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118220 (owner: 10Denny Vrandecic)
[18:40:17] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[18:47:07] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds
[18:47:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[18:49:07] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[18:51:27] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:50:34 PM UTC
[18:54:27] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202036)
[18:55:27] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000
[18:57:17] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100%
[18:57:50] (03PS1) 10Jdlrobson: Get office wiki mobile image uploads enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118328
[18:59:27] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204622)
[19:03:20] greg-g: we found a minor bug in our cookie setting, can we lightning depl (would take about 10 min)
[19:03:22] https://gerrit.wikimedia.org/r/#/c/118326/
[19:03:25] (03CR) 10Dzahn: [C: 032] contint: retab to get rid of tabulations [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248 (owner: 10Hashar)
[19:03:37] PROBLEM - MySQL Idle Transactions on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds
[19:04:37] RECOVERY - MySQL Idle Transactions on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[19:04:43] yurikR: yeah, go for it
[19:04:55] next thing is parsoid in 55 minutes
[19:05:34] watches puppet run on gallium (contint)
[19:05:37] after retab change
[19:06:35] !log graceful apache on gallium
[19:06:39] looks all ok to me
[19:06:44] Logged the message, Master
[19:06:53] (03CR) 10Matanya: Adding archiva module (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/117024 (owner: 10Ottomata)
[19:07:47] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[19:07:56] (03CR) 10Dzahn: "ran successfully on gallium. zuul/doc/etc. Apache sites changed content. graceful'ed apache. looks still good" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118248 (owner: 10Hashar)
[19:08:05] mutante: can you advise on https://rt.wikimedia.org/Ticket/Display.html?id=6134 ?
[19:09:08] matanya: no, if you see me being the last one to comment that is usually because i don't have the answer, heh
[19:09:15] gotcha, here i go
[19:09:35] mutante: thanks too bad, who can have it?
[19:09:43] matanya: have what?
[19:09:55] matanya: which part do you want advice on? which host it's going to be on?
the answer :)
[19:10:01] yes
[19:10:06] if i had it it'd be on the ticket already
[19:10:11] suew
[19:10:17] sure,
[19:10:25] looking for the right person to ask
[19:10:30] yurikR: right
[19:11:29] it's just client though, not an LDAP server
[19:11:30] matanya:
[19:11:44] !log yurik synchronized php-1.23wmf17/extensions/ZeroRatedMobileAccess/
[19:11:53] Logged the message, Master
[19:12:02] !log yurik synchronized php-1.23wmf16/extensions/ZeroRatedMobileAccess/
[19:12:10] Logged the message, Master
[19:12:11] greg-g: thx
[19:12:13] done
[19:13:45] * matanya is looking for a server willing to host it
[19:24:37] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:03] i'd suggest ops list to find a home versus irc. we can assign a system, but it may need to live someplace special since its for doing ldap stuff
[19:27:22] matanya: plus you can always ask the RT triage duty person
[19:27:31] (in fact they should be the first person you can poke)
[19:27:48] * matanya pokes bblack
[19:28:39] Ideally then they can gauge the severity, and if they need to find someone immediately, turf to email list, or turf to weekly ops meeting
[19:28:53] thanks robh
[19:29:05] and if you are already here: https://rt.wikimedia.org/Ticket/Display.html?id=4839 :)
[19:29:14] can you please explain this ^
[19:29:45] those are terbium job scripts that development has to update to not be nfs mount dependent
[19:30:04] sorry, hume job scripts
[19:30:07] that cannot go on terbium for that reason
[19:30:15] is this dependency in puppet?
[19:30:25] it should link to another ticket, lets see
[19:30:37] oh, it is https://rt.wikimedia.org/Ticket/Display.html?id=4792
[19:31:15] wow, some really old cruft linked in there, heh
[19:31:47] yes, that is why i'm quite lost :)
[19:31:57] (03PS1) 10BryanDavis: Add scap-purge-l10n-cache to /usr/local/bin [operations/puppet] - 10https://gerrit.wikimedia.org/r/118338
[19:32:06] just trying to be useful in getting out of tampa stuff
[19:32:35] ty matanya
[19:32:47] i cleared up two of them
[19:32:51] :)
[19:32:55] we dont need to worry about the apache stuff or the ip stuff
[19:33:03] since its going away, they are no longer relevant
[19:33:15] apaches are off anyway now
[19:33:34] well, those jobs affected all apaches, not just tampa
[19:33:37] but terbium does it now
[19:34:39] looking on manifests/site.pp they are all disabled on hume
[19:34:49] enabled=>false
[19:35:07] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[19:35:57] huh
[19:36:00] and enabled on terbium
[19:36:06] soooo, someone did this already it seems =]
[19:37:39] yup
[19:37:41] (03PS2) 10MaxSem: Upload to non-CentralAuth wikis locally from mobile [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118328 (owner: 10Jdlrobson)
[19:37:52] this is my main source of confusion
[19:37:52] nice catch, resolved
[19:38:04] it was fixed by devs, prolly ^d
[19:38:11] whoever did it wasnt in RT is all
[19:38:22] since its dev script and all, they handled the fix
[19:38:48] i assume it was chad since he fixed the majority of the hume/terbium scripts before that ticket, heh
[19:40:26] <^d> Only thing missing on terbium is something we never puppetized.
[19:40:32] <^d> All the puppetized ones have been moved.
[19:40:49] <^d> https://gerrit.wikimedia.org/r/#/c/74591/
[19:40:57] so once this is done. we can say byebye
we can say byebye [19:42:27] <^d> It's the last thing I know of, but we should double check ;-) [19:43:30] i now realize it's that time of day i'm supposed to shove food into my face [19:43:37] <^d> indeed it is. [19:44:14] i didn't wanna just go afk to get food when i was moments ago actively participating in discussion without saying so ;] [19:44:27] * robh does just that [19:44:58] if it's not puppetized it will just disappear at some point [19:45:19] <^d> Hence the patch ;-) [19:45:36] oh, it's the fix for that? didn't realize, nice [19:46:01] got it, cool [19:46:22] matanya: what do you say about contint retab, hah [19:46:35] i know you've been waiting for it:) [19:46:57] sec, reviewing the former patch [19:47:05] ^d: i prefer z :) [19:47:30] (03CR) 10Matanya: Properly puppeti[sz]e purge-checkuser (038 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/74591 (owner: 10Reedy) [19:47:43] 8 :) [19:47:57] i know, i comment too much [19:48:16] didn't say that, you're just fast [19:48:49] jenkins bot vs. matanya bot?:) [19:48:49] oh, you merged the retab one [19:48:58] yes, bold, eh? [19:49:04] but it did nothing [19:49:17] well, replaced the apache site content, but all just tabs/spaces [19:49:58] mutante: i'm well known among the stewards for being the fastest steward around, go and ask in #wikimedia-stewards who is the fastest :P [19:50:36] matanya: ok:) btw, planet module :p [19:50:51] i rest my case :P [19:51:06] 12:50 < mutante> who's the fastest steward? [19:51:06] 12:50 < Jurgen|busy> matanya [19:51:08] haha:) [19:53:43] well, point made :D [19:54:35] matanya: https://gerrit.wikimedia.org/r/#/c/108674/ one less in misc, next is RT then [19:54:57] mutante: i loved the site --> apachesite there :) [19:55:01] and great work! [19:55:38] matanya: thanks, the only minor mistake was doing that in a separate change [19:55:38] we should get rid of such confusing variables [19:55:58] well, as long as nothing broke we are all happy, eh? [19:56:12] that was my thinking when you talked to akosiaris about the $site variable name [19:56:31] in this case it was just the file and would have worked fine, but i agree, it might be confusing to have site.pp [19:56:40] yes [19:57:11] i would like to see https://gerrit.wikimedia.org/r/#/c/111189/ merged one day [19:58:01] as for RT, i left unanswered comments :) [19:58:37] ah, thanks for pointing out there are new comments, on it:) [19:59:09] i understand "please add monitoring" [19:59:17] but i don't understand "please add monitoring for the role" [19:59:38] bblack, we'll need a root to restart the parsoids as part of our deployment [19:59:55] mutante: read it as "please add monitoring in here" [20:00:31] at the exact place the comment stands ... [20:01:52] gwicke: If you tell me which keys to press, I'll do it. :-) [20:02:28] !log deployed Parsoid 004c7acc with deploy f97820a2; restart todo [20:02:35] Logged the message, Master [20:02:37] Coren, moment.. [20:03:08] on the salt master: salt-run deploy.restart 'parsoid/deploy' '10%' [20:03:12] (from https://bugzilla.wikimedia.org/show_bug.cgi?id=61882) [20:03:32] Ah, that. Sure thing, just say when. [20:03:38] now is good [20:03:46] I'm done [20:03:47] dammit, outage @ $day_job, bbl [20:05:00] !log restarting parsoids as requested by gwicke [20:05:08] Logged the message, Master [20:05:20] gwicke: Restart completed [20:05:42] Coren, thanks! [20:05:49] (03PS1) 10Dan-nl: adding Amsterdam Museum to the wgCopyUploadsDomains array.
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118342 [20:06:56] bblack, do you need anything else from me for https://rt.wikimedia.org/Ticket/Display.html?id=6961 ? [20:07:48] (03CR) 10Matanya: WIP - turn RT from misc/* into puppet module (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116064 (owner: 10Dzahn) [20:08:06] 4 minutes, not bad [20:08:49] !change 111189 | chasemp [20:09:00] ooh, that used to work :p [20:09:04] gwicke: I don't think so [20:09:11] chasemp: https://gerrit.wikimedia.org/r/#/c/111189/4 [20:09:56] bblack, ok; would be awesome to get this done before our next deploy next Monday [20:11:26] matanya: ack @ monitoring, i'm gonna say checking HTTPS with certificate expiry from external is all we need [20:11:38] and thought we already had it [20:11:43] checking though [20:12:02] mutante: check http too maybe? [20:15:13] so it turns out that trebuchet is broken again [20:15:32] our deploy did not actually update the submodule with the code [20:15:45] which means that the parsoid deploy has been a no-op [20:15:55] Ryan_Lane1: ping [20:19:08] no-op on monday, the 10th, and no-op today according to http://parsoid-lb.eqiad.wikimedia.org/_version which shows a deployed version from March 3rd. [20:20:46] just sent a mail as well [20:21:22] is there now anybody else who can help to fix trebuchet issues, or is it still only Ryan? [20:22:55] Table 'mediawikiwiki.wikilove_log' doesn't exist (10.64.16.27) [20:27:04] opened https://github.com/trebuchet-deploy/trigger/issues/26 [20:27:55] gwicke: Do you know if the salt logs end up anywhere that I could see them? I'm not sure I can troubleshoot blind [20:28:39] bd808, I sadly don't know much about how salt is set up or works [20:29:00] I am around [20:29:05] but know very little about trebuchet [20:29:07] https://wikitech.wikimedia.org/wiki/Salt is rather sparse [20:29:51] paravoid: Can you look for salt logs related to the deploy that gwicke just tried and see if there are error messages? [20:30:16] I'm wondering if this is a "dies silently" problem or something that the logs can point us to the fix for [20:30:44] gwicke: Are these new submodules or just missing updates to ones that were already checked out? [20:30:56] bd808, they are existing submodules [20:31:05] nothing relevant in the salt logs [20:31:20] before the latest upgrade there was a similar bug that Ryan eventually fixed [20:31:26] now it seems to have come back [20:31:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:32:23] paravoid: Thanks for looking [20:33:05] gwicke: Is it the "src" submodule in parsoid/deploy that's failing to update on the hosts? [20:33:09] just found this link in my mail: https://gerrit.wikimedia.org/r/#/c/112605/ [20:33:15] bd808, yes [20:34:56] that code still seems to be in puppet [20:35:21] so maybe it's another issue this time [20:38:40] * bd808 can't log into the wpt* hosts to look at the state of /srv/deployment/parsoid/deploy [20:38:54] it's wtp* [20:39:08] I am logged into wtp1020 [20:39:13] my topo in irc, not ssh [20:39:22] *typo BLARGH [20:39:22] * YuviPanda topos bd808 [20:39:47] gwicke: What does git log look like there compared to tin? [20:39:51] bd808, the src checkout is basically still at the state before the last two deploys [20:40:11] commit 98936e7a78f4d63f39d524c35ee5ba46b6ac3177 [20:40:18] that was deployed last Wednesday [20:40:41] before Ryan's work on trebuchet [20:40:58] the deploy checkout is up to date though [20:41:05] Ok.
That helps [20:45:13] bd808, gwicke, s/last Wednesday/last Monday/ .. just being nitpicky for the record :). [20:45:42] hmm, yeah [20:46:15] https://www.mediawiki.org/wiki/Parsoid/Deployments [20:46:29] paravoid: In the salt logs/salt config can you see what command trebuchet-trigger is calling? [20:46:45] * bd808 is learning trebuchet in real-time [20:47:07] no, the master logs doesn't log the commands it runs [20:47:12] s/logs// [20:47:16] boo [20:47:21] ok. thanks again [20:47:54] How about sudoer logs from the target hosts? [20:48:09] http://p.defau.lt/?DkDrmqZBYebEi6m6bAgrvw [20:48:13] that's the entirety of the log [20:49:03] salt runs as root, doesn't use sudo [20:49:13] That looks promising actually. [20:49:22] root@wtp1001:/var/log/salt# cat minion [20:49:22] root@wtp1001:/var/log/salt# [20:49:51] how is non-root access to salt actually managed? setuid in front-end scripts? [20:49:56] "Failed matching available minions with glob pattern" [20:50:07] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [20:50:37] gwicke: Yeah there are sudoers entries that allow particular salt commands by non-root [20:51:02] aha [20:52:11] gwicke: See manifests/role/deployment.pp for a whole bunch of them [20:53:56] thanks [20:54:13] (03CR) 10Dzahn: [C: 031] Authorize rush for Icinga, add him to contact group sms and add CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush) [20:54:15] `/usr/bin/salt-call -l quiet publish.runner deploy.checkout *` would be the checkout phase of trebuchet [20:54:27] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [20:55:17] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [20:56:22] Those minion matching errors in the log paravoid pasted look suspicious, but I don't know salt guts enough to say for sure [20:56:50] The timestamps and hosts match the trebuchet call though [20:56:51] sadly working around it by rsyncing over the submodule won't work either as the git version we are running has a bug in its submodule handling (see https://bugzilla.wikimedia.org/show_bug.cgi?id=62519) [20:57:30] gwicke: Can you use dsh to force the right git submodule update call? [20:57:49] I don't have enough rights for that afaik [20:58:19] in theory git submodule update --init should do it [20:58:33] Surely someone does. [20:58:38] in /var/lib/parsoid/deploy [20:58:42] on all the wtp hosts [20:59:16] our deploy window is over in two minutes [20:59:22] so I guess that's it for now [20:59:38] I've got the next block… I'm willing to help this get fixed. [20:59:45] bleh, search backend is showing all sorts of warnings in fatal mon [20:59:46] I'm just going to be testing no-op scap [21:00:30] I have 30 more minutes before a meeting, but can't really contribute much [21:01:01] `dsh -g parsoid -M -F 4 -- 'cd /srv/deployment/parsoid/deploy; git submodule update --init --recursive'` [21:01:10] gwicke: Does that look right? [21:01:29] yes, that looks ok [21:01:41] That would run on 4 nodes at a time [21:01:49] let me check the owner of the files [21:01:58] all owned by root [21:02:05] * bd808 nods [21:02:27] so you need a root such as paravoid to run it (hint, hint) [21:02:47] or Coren [21:02:47] if you're sure that's a good idea... [21:03:06] it would should get us unstuck for now [21:03:08] ori, did you remove the send geoip cookie to everyone vmod?
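
What bd808 and gwicke are checking on wtp1020 boils down to a few git commands; a sketch using only standard git plus the paths, the commit sha1, and the salt invocation quoted in-channel:

    cd /srv/deployment/parsoid/deploy
    git log --oneline -1       # the deploy repo tip: up to date after the deploy
    git submodule status src   # the checked-out submodule: still 98936e7a, two deploys behind
    # trebuchet's checkout phase, per the quote above, runs through salt:
    #   /usr/bin/salt-call -l quiet publish.runner deploy.checkout *

A deploy repo that is current while the src submodule sha1 lags behind is exactly the stale-submodule symptom being debugged here.
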
[21:03:17] s/would// [21:03:27] PROBLEM - Puppet freshness on carbon is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:01:44 PM UTC [21:03:33] gwicke: Then you'd need the restart salt command run, correct? [21:03:47] yeah, which is also broken for us ;( [21:04:37] so as root, from the salt master, after updating the code: salt-run deploy.restart 'parsoid/deploy' '10%' [21:04:43] sigh [21:05:57] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 877.133362 [21:05:57] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1067.766724 [21:06:07] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 933.799988 [21:06:17] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 901.5 [21:06:47] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 655.133362 [21:07:07] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 806.366638 [21:07:17] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 948.599976 [21:07:17] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 780.56665 [21:07:29] ottomata: ^^ [21:08:05] whoaaaa [21:08:14] hmmk..... [21:09:31] paravoid, are you running those commands? [21:12:46] I am not [21:13:18] I'm about to join the same meeting you are and I'm not sure if I should just run commands I don't understand on the whole parsoid cluster right now [21:13:50] ok [21:14:07] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2147483647 seconds [21:14:41] !log welcome to root, Chase, added key to root-auth-keys [21:14:49] Logged the message, Master [21:15:07] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds [21:15:47] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:48] chasemp: want to review a change for me? [21:15:57] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:57] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:57] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:15:59] analytics1021 keeps losing its connection to zookeeper occasionally [21:16:07] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:16:17] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:16:17] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:16:17] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:17:03] I don't have time atm; if it's the sudo stuff, it hit my radar. sometime when I come out of my shell as a beautiful butterfly, if it's around I will be digging into it.
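
Assembled in one place, the manual recovery gwicke spelled out above is two commands, both run as root; a sketch that assumes the 'parsoid' dsh group and the salt master setup exactly as quoted in-channel:

    # update the stuck submodule across the cluster, 4 hosts at a time
    dsh -g parsoid -M -F 4 -- \
        'cd /srv/deployment/parsoid/deploy; git submodule update --init --recursive'
    # then the rolling restart, 10% of the minions per batch
    salt-run deploy.restart 'parsoid/deploy' '10%'
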
[21:17:20] matanya: i already forwarded the sudo change [21:17:27] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [21:17:56] matanya: give him a grace period:) [21:18:13] thanks mutante, thought about the other one, but that is cool too [21:18:18] he's not on RT duty yet! [21:18:41] matanya: it fit in, because he is interested in the sudo stuff [21:18:44] hey, guys, the best way to learn is to jump in the water, isn't it? :P [21:18:59] oh, great [21:19:21] bd808, see the security channel [21:22:47] mutante: you pointed me at https://rt.wikimedia.org/Ticket/Display.html?id=6583 which i don't have permission to view in https://rt.wikimedia.org/Ticket/Display.html?id=6144 [21:22:56] (03CR) 10Rush: [C: 032] Authorize rush for Icinga, add him to contact group sms and add CST time period [operations/puppet] - 10https://gerrit.wikimedia.org/r/118226 (owner: 10Rush) [21:24:16] and there's the first merge [21:24:28] chasemp: checking puppet run on neon because that is icinga [21:24:33] * matanya points out that changing user access and adding functionality shouldn't be on the same patch [21:24:37] and the same run should add your key as well [21:24:56] the timeperiod is just for him [21:25:02] still [21:25:04] but meh [21:25:07] wouldn't it also be for brandon? [21:25:13] there were 3 separate changes [21:25:16] they got squashed [21:25:43] oh well, just a nitpick, don't take it so seriously [21:25:58] due to my learning about gerrit [21:26:08] yea, an example for rebase actually [21:26:23] chasemp == rush, yes? [21:26:28] agreed [21:26:37] * matanya is lost in all the nicks [21:26:45] !cp [21:26:52] a bot is missing [21:27:02] doesn't work anymore [21:27:18] :( [21:27:48] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [21:27:54] anyway, mutante what about the ticket question above? [21:28:21] greg-g: Can I squeeze in my scap test now? [21:28:55] yeah [21:29:03] matanya: that's cause it's in the procurement queue, would be a question for RobH or cmjohnson1 [21:29:14] .join bd808|deploy [21:29:32] .ready or something [21:29:54] .good [21:29:56] We removed volunteer rights to view the queue, because they were responding to vendors [21:30:04] !log bd808 Started scap: no-diff scap to test script changes [21:30:12] Logged the message, Master [21:30:12] so if it's in procurement, that's why you cannot view it matanya [21:30:28] not so critical [21:30:34] just for follow up [21:30:46] it's being ordered, but the folks involved know what's up [21:30:57] thanks [21:31:08] so you can just disregard it yep [21:31:27] * matanya forgot that ticket for now [21:31:43] matanya: Does it happen to show you the linked ticket? [21:31:51] [] [21:31:53] robh: we need to try and rush those...i would like to get them before April 1 [21:32:03] ie: can you see on https://rt.wikimedia.org/Ticket/Display.html?id=6144 that it has a link to 6583 or not at all? [21:32:14] cmjohnson1: it's not waiting on me, it's in approvals. [21:32:29] robh: i see the number but it's not clickable [21:32:35] and no subject eh? [21:32:39] no [21:32:46] ahh, that would be nice. i wonder if i can tweak it [21:32:53] just 6583: (Mark Bergsma) [] [21:32:53] i update the subject of my procurement tickets with the status [21:33:10] ie: quote requested, quote in review, purchase approval pending, ordered, shipped, etc...
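
The zookeeper session flapping that jgage and ottomata dig into just below can be quantified straight from the broker log; a sketch assuming the /var/log/kafka/kafka.log path tailed later in the channel, where the grep pattern is an assumption about the zkclient message format rather than a confirmed string:

    # count disconnect/reconnect cycles recorded so far
    grep -ci 'zookeeper state changed' /var/log/kafka/kafka.log
    # watch them arrive live (reportedly every 5 to 15 seconds)
    tail -f /var/log/kafka/kafka.log | grep --line-buffered -i zookeeper
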
[21:33:20] i'll see if i can't tweak it to show the subject [21:33:22] that would be neat [21:33:39] mostly for logstash i wanted that [21:33:55] my english breaks, seems like time for bed [21:38:19] is OIT ever on this channel? [21:39:38] !seen jkrauska [21:39:49] matanya: ^ if the bot was still here [21:40:01] if [21:40:05] matanya: cndiv is here in some fashion [21:40:29] i remember i needed to ask them about sanger [21:40:51] but never did, apparently [21:41:47] one more point mutante [21:41:58] you said app servers in tampa are out [21:42:06] so ersch|tarin can now go as well [21:42:34] if i'm right, i'll create a patch for this [21:43:03] jgage, you around? [21:43:53] hi [21:43:57] chasemp: now puppet ran on neon, and it added your contact [21:44:11] hey, i'm seeing some really weird kafka zookeeper problems right now [21:44:16] oh really [21:44:17] do tell [21:44:20] i would love some other eyes on it [21:44:39] https://rt.wikimedia.org/Ticket/Display.html?id=6877 [21:44:47] chasemp: so, icinga config has reloaded, and the paging should be active [21:44:48] That was the open files limit thing, I changed the name of the ticket [21:44:50] and put the relevant logs there [21:44:50] enjoy :p [21:44:52] jgage^ [21:44:58] * jgage looks [21:45:04] so, analytics1021 is losing its connection to zookeeper every few seconds [21:45:15] could somebody break the site really quick to test paging for Chase? .. NOT :) [21:45:36] these are causing ISR flaps and replica lag [21:46:00] mutante: Don't jinx my scap :) [21:46:06] which can cause produce errors as high traffic producer queues fill up while kafka is figuring out where logs should go [21:46:10] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafka&hide-hf=false [21:46:20] see Replica lag and ISRShrinks at the bottom of that page [21:46:25] matanya: ersch/tarin = poolcounter = yes, correct [21:46:59] still trying to figure out what is going on, but kafka broker traffic-in levels are super low right now [21:47:01] any clue why I'm downloading from gerrit at 4 KiB/s from the office? [21:47:12] mwalker: wifi? [21:47:21] good call [21:47:33] ottomata: wow ok, i'll dive in. i assume those "lumps" in the top graph are camus? [21:47:51] camus only affects Out rates [21:47:55] hrm yeah [21:48:05] that lump in those top graphs I am not sure about [21:48:14] bd808|deploy: taking a bit, eh? [21:48:39] ottomata, any recent changes? the current situation looks like the first time this has happened. [21:48:40] greg-g: l10n changes and rsync seems slower than usual today [21:49:01] yes this is new to me, i have seen this zk error before [21:49:04] but not this frequently [21:49:29] also, i was alerted to this because eqiad bits varnishkafkas had errors [21:49:29] greg-g: Lots of 3 to 3.5 minute syncs instead of my expectation of 2-2.5m [21:49:46] and i have not seen that before [21:49:53] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [21:50:04] I'll be excited to see what happens when I kill a bunch of old branches tomorrow [21:51:06] jgage, hm today I did take down analytics1021 for a few minutes [21:51:15] to check some load levels on an22 if it was the leader of the broker [21:51:18] leader of the partitions* [21:51:19] sorry [21:51:24] manybubbles: ^d "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later."
on wikitech [21:51:27] ottomata: ^ [21:51:43] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Wed 12 Mar 2014 03:50:34 PM UTC [21:51:47] !log bd808 Finished scap: no-diff scap to test script changes (duration: 21m 42s) [21:51:50] ottomata hmm no recent reboots, same kernel on both, disk space and loadavg ok, dmesg ok.. [21:51:55] Logged the message, Master [21:52:16] jgage, notice how analytics1021 is losing its zk connection constantly [21:52:18] it can't hold on to it [21:52:22] that is the problem, we need to figure out why [21:52:24] hrmmmm [21:52:25] greg-g: Ok if I go again? [21:52:28] tail -f /var/log/kafka/kafka.log [21:52:38] it loses and reconnects about every 5 or 15 seconds [21:53:14] jgage, i'm going to promote an22 to leader by doing a controlled shutdown of an21 [21:53:15] objections? [21:53:28] no i was going to suggest the same [21:53:42] k [21:53:46] mutante, hmm; not related to wifi actually... it's still stalling out -- though both times it's been on the CentralAuth repo [21:54:06] !log initiated controlled shutdown of an21, promoting an22 to leader of all partitions [21:54:10] !log bd808 Started scap: another no-diff scap to test script changes [21:54:14] Logged the message, Master [21:54:19] bd808|deploy: yeah [21:54:22] Logged the message, Master [21:54:23] :) [21:55:03] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [21:56:11] hmm, jgage, lots of produce requests still going to an21…varnishkafka should have figured out that an22 is the leader by now [21:56:31] (03PS1) 10Odder: Account creation throttle for ptwiki outreach event [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 [21:56:53] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4470.166504 [21:57:03] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1641.599976 [21:57:03] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 6089.5 [21:57:03] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 5913.100098 [21:57:04] hey ops, is mediawiki-announce no longer forwarded to mediawiki-l? [21:57:13] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2191.300049 [21:57:20] those are bits [21:57:21] i'm pretty sure it used to be, and i'm pretty sure it is no longer. [21:57:23] both esams and eqiad [21:57:23] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 15.0 [21:57:23] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1046.366699 [21:57:33] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3849.266602 [21:57:36] !log bd808 Finished scap: another no-diff scap to test script changes (duration: 03m 25s) [21:57:43] Logged the message, Master [21:57:53] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:57:57] greg-g: This time most hosts finished rsync in < 10s except fenari.
It took 1:46 [21:58:03] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:58:03] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:58:06] jgage, now that an22 is the leader, i'm going to restart bits varnishkafkas, hopefully clearing out the queues and forcing them to get broker topic metadata [21:58:13] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:58:14] i don't want to see any produce requests on an21 [21:58:15] ottomata, ok [21:58:16] greg-g: Also {{done}} [21:58:23] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 4093.93335 [21:58:23] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:59:33] ok, that is looking saner [21:59:34] ottomata i searched for that error but the only mailing list discussion is wrt a bug allegedly fixed in 3.3.4, we have 3.3.5 [21:59:47] i found this one [21:59:47] http://zookeeper-user.578899.n2.nabble.com/Ping-and-client-session-timeouts-td5085129.html [21:59:52] yeah [21:59:55] which says maybe GC problems on zk JVMs [22:00:03] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:00:03] but i dunno [22:00:06] oh i was reading http://zookeeper-user.578899.n2.nabble.com/zk-keeps-disconnecting-and-reconnecting-td6717952.html [22:00:20] bd808: nice [22:00:23] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:00:33] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:01:00] ok, full messagesinpersec is back to normal [22:01:02] greg-g: The back-to-back test points pretty strongly to l10n being the cause of scap's slowness [22:01:04] all traffic going to an22 [22:01:38] bd808: yepper [22:02:10] The core sync was ~15m with l10n changes and ~3m without [22:02:19] jgage, it still looks like the zookeeper connection flapping has stopped too [22:02:22] or at least lessoned [22:02:25] lessened [22:03:23] so mysterious [22:04:33] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Wed Mar 12 22:04:26 UTC 2014 [22:05:05] yeah, that zk connection is a big problem, that should not happen [22:05:09] i don't know what is going on there [22:05:10] yikes, hm [22:05:18] ok, jgage, i am supposed to run to dinner now [22:05:24] and I am taking tomorrow and friday off... [22:05:26] HMMM [22:05:32] :D [22:05:54] i can check on this in the morning though [22:05:55] hm [22:05:57] bd808: btw, data from the l10nupdate commit: http://paste.debian.net/87382/ [22:06:48] greg-g: Why in the heck does that take so long? [22:06:50] (03CR) 10Jeremyb: [C: 04-1] "I know WHOIS says 186.193.0/20 but is that valid syntax for whatever will be parsing the range? I'd use 186.193.0.0/20 to be safe."
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 (owner: 10Odder) [22:07:02] ok ottomata, i will keep investigating and send you anything i find [22:09:01] jgage, the zk conn does still time out [22:09:10] it just happened 30 secs ago [22:09:22] ok jgage [22:09:28] gah [22:09:30] so, just in case you have to do a controlled-shutdown [22:09:33] i've added this doc: [22:09:33] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Kafka/Administration#Safe_Broker_Restarts [22:09:39] thank you [22:09:40] see the Note: at the bottom of that section [22:09:45] k [22:10:00] the command doesn't exist in the version of kafka we have installed (I added it to the .deb package recently) [22:10:07] so you can just do it manually in that hacky way there [22:10:22] ok [22:10:47] ok so [22:10:58] i'm not sure if we should promote an21 back to leader of its partitions now [22:11:12] i'd say, go ahead and do that if you want to experiment or feel comfortable doing so [22:11:20] (via preferred-replica-election) [22:11:32] you could restart the broker first if you wanted to [22:11:33] dunno [22:12:45] ok, i'm going to run to dinner [22:13:25] mutante: tarin is the ganglia gmetad source for misc in pmtpa. leave it around, or kill it? [22:13:52] (03CR) 10Jeremyb: "one data point on 3 octet CIDR ranges:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118364 (owner: 10Odder) [22:13:56] ok ottomata [22:17:07] (03PS2) 10Lcarr: Initial commit of pmacct module and role [operations/puppet] - 10https://gerrit.wikimedia.org/r/115345 (owner: 10Jkrauska) [22:18:57] matanya: i don't know [22:25:58] (03PS1) 10Matanya: ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 [22:27:39] (03CR) 10jenkins-bot: [V: 04-1] ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 (owner: 10Matanya) [22:27:53] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 946.847896141 [22:28:46] (03PS2) 10Matanya: ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 [22:29:05] way too late to succeed on the first push [22:30:58] (03CR) 10Lydia Pintscher: "Alright. Let's do that then and work on the proper solution after that."
[operations/apache-config] - 10https://gerrit.wikimedia.org/r/113972 (owner: 10Thiemo Mättig (WMDE)) [22:31:58] hm analytics1021 and 1022 have different versions of kafka, 0.8.0-1 vs 0.8.0-2 [22:32:05] (03PS3) 10Matanya: ersch: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/118375 [22:36:04] (03PS1) 10Matanya: ersch: decom [operations/dns] - 10https://gerrit.wikimedia.org/r/118380 [22:41:08] ok, akosiaris_away added that, i'll need to ask him tomorrow [22:49:23] PROBLEM - Kafka Broker Replica Lag on analytics1021 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 10134023.0 [22:50:15] (03PS2) 10Dzahn: backup: remove db10 from disklist [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241 (owner: 10Matanya) [22:51:53] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2481.96610372 [22:52:13] (03CR) 10Dzahn: [C: 031] "db10.pmtpa.wmnet - 100% packet loss - also remove from DNS" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241 (owner: 10Matanya) [22:52:33] (03CR) 10Dzahn: [C: 032] "db10.pmtpa.wmnet - 100% packet loss - also remove from DNS" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118241 (owner: 10Matanya) [22:52:41] removing from dns now [22:56:03] (03PS1) 10Matanya: db10: decom [operations/dns] - 10https://gerrit.wikimedia.org/r/118388 [23:06:23] RECOVERY - Kafka Broker Replica Lag on analytics1021 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 830375.0 [23:08:23] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [23:16:15] greg-g, am I free to deploy? [23:17:19] greg-g: Charge him to deploy, then he's not free :D [23:17:49] JohnLewis: don't you know greg-g gets paid per deploy? that is his salary base [23:18:08] matanya: He gets paid twice then. [23:19:13] (base_salary * mw_version * time_to_scap)?:) [23:19:56] !log mwalker synchronized php-1.23wmf16/extensions/CentralNotice [23:20:05] Logged the message, Master [23:20:22] mutante: does he get anything extra per wmf branch? [23:20:55] !log mwalker synchronized php-1.23wmf17/extensions/CentralNotice [23:21:03] Logged the message, Master [23:21:42] cndiv: are you around? [23:22:20] (03CR) 10Dzahn: [C: 031] "nice, as the bug reporter and per the "upstream bug" comments there i support this work around" [operations/puppet] - 10https://gerrit.wikimedia.org/r/118301 (owner: 10Tim Landscheidt) [23:33:53] mwalker: good :) [23:37:19] ya; it didn't work though :( [23:37:23] I have a follow-on patch [23:37:25] greg-g, ^ [23:37:45] https://gerrit.wikimedia.org/r/#/c/118393/ [23:39:08] hah [23:39:16] kk [23:45:01] greg-g, can I deploy my follow-on fix? [23:45:27] they're building in jenkins for core right now [23:46:00] mwalker: yeah [23:46:17] thank ya [23:46:32] matanya: I'm back, but I'm afraid I don't recognize your nick. Who's this? [23:46:50] cndiv: just a volunteer [23:47:11] cndiv: wanted to ask about https://rt.wikimedia.org/Ticket/Display.html?id=6163 [23:47:16] matanya: Hi there. :-) What's up? [23:47:26] nice to meet you, all cool [23:47:45] matanya: Let me see if I remember my username to RT...
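
Two loose ends from the kafka incident above lend themselves to one-liners; a sketch in which the 'analytics' dsh group name and the zookeeper address are assumptions, while kafka-preferred-replica-election.sh is the stock kafka 0.8 admin script ottomata alludes to with "(via preferred-replica-election)":

    # confirm the 0.8.0-1 vs 0.8.0-2 package skew jgage spotted, across all brokers
    dsh -g analytics -M -- 'dpkg-query -W kafka'
    # hand the partitions back to an21 once it holds its zk session again
    kafka-preferred-replica-election.sh --zookeeper ZK_HOST:2181
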
[23:49:38] !log mwalker synchronized php-1.23wmf16/extensions/CentralNotice/ [23:49:47] Logged the message, Master [23:49:54] !log mwalker synchronized php-1.23wmf17/extensions/CentralNotice/ [23:50:02] Logged the message, Master [23:54:43] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [23:55:32] greg-g, that time it worked [23:55:41] so I'm done
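
Each "!log mwalker synchronized ..." entry above corresponds to one sync invocation on tin; a hedged sketch of what that likely looked like, assuming the /a/common staging layout and sync-dir tooling of the era:

    # from the MediaWiki staging area on tin
    cd /a/common
    sync-dir php-1.23wmf17/extensions/CentralNotice 'CentralNotice follow-on cookie fix'

Per the entries above, the sync tool rsyncs the named directory out to the apaches and then reports to the server admin log, which is where the "Logged the message, Master" acknowledgements come from.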