[00:20:22] Reedy, only en.wiki? we might have experienced some issues on Meta and even testwiki [00:37:05] Nemo_bis, it's only logged for enwiki [00:37:15] If enwiki is having issues, it's likely the others are too [00:37:24] indeed [00:37:29] Fixing one will fix them all [00:37:41] well, after it's caught up ;) [00:38:15] uh, queue is rising at sight on Meta [00:40:04] I actually came by to say that wp.en seemed to suddenly get a lot slower, decided not to say anything in case it was just my perception, not sure if related [00:40:52] Job queue being backed up wouldn't slow it down [00:41:13] well if the server was suddenly having trouble, the job queue would begin backing up and pages would be slower :) [00:41:23] "the" [00:41:30] You realise we have more that one, right? :p [00:41:32] yes [00:41:34] semantics [00:41:36] I ignore them [00:42:04] job queue items that can be done asynchronously [00:42:12] I actually just got this now [00:42:15] Request: POST http://en.wikipedia.org/w/index.php?title=List_of_PlayStation_3_games&action=submit, from 71.95.101.232 via sq60.wikimedia.org (squid/2.7.STABLE9) to 208.80.152.72 (208.80.152.72) [00:42:15] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 03 Jan 2012 00:36:13 GMT [00:42:23] Reedy: Don't be silly, we all know the cluster is powered off a 486 that is powered by a hampster in a wheel >.> <.< [00:42:23] That's certainly not related [00:43:25] Job queue issue seems to have been going on a week or more [00:43:37] ok I hadn't heard of that [00:44:53] Indeed [00:45:05] Stupid irc client [02:05:22] !log LocalisationUpdate completed (1.18) at Tue Jan 3 02:05:21 UTC 2012 [02:05:23] Logged the message, Master [02:11:52] domas: you were responding about status.wm.o, right? petan was asking about stats.wm.o [02:34:13] (ssl error; it's not expired but it's both self-signed (by fred!) and also the wrong CN) [02:35:25] makes sense that it's CN is nagios because it's the same IP as nagios. but maybe spence is ok to have star? [03:11:54] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:11:55] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:41:24] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:25] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:56] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:56] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:57] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
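A quick way to reproduce the job-queue figure being discussed above without shell access is MediaWiki's public siteinfo statistics; the cluster-side equivalent is the showJobs.php maintenance script. The API parameters are standard MediaWiki, but treating mwscript/showJobs.php as how the number was actually checked here is an assumption.

    # job queue length for enwiki via the public API (the "jobs" field)
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' \
        | grep -o '"jobs":[0-9]*'
    # roughly the same number from a cluster host (wrapper name assumed)
    mwscript showJobs.php --wiki=enwiki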
[03:50:57] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:16] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [03:58:17] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:36] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:37] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:03:43] O.O Double bot [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [07:01:40] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:01:41] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
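The srv273 alerts above are NRPE socket timeouts rather than real disk or RAID findings, which usually means the agent on the host is wedged or overloaded; re-running the same check by hand from the monitoring host helps tell the two apart. A minimal sketch, with the plugin path and the remote command name as assumptions:

    # re-run one of the timed-out checks manually with a more generous timeout
    /usr/lib/nagios/plugins/check_nrpe -H srv273 -c check_disk_space -t 30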
[08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [09:07:55] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:07:56] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:25:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:25:36] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:27:15] RECOVERY - Disk space on srv222 is OK: DISK OK [09:27:16] RECOVERY - Disk space on srv222 is OK: DISK OK [09:49:30] RECOVERY - Disk space on srv221 is OK: DISK OK [09:49:31] RECOVERY - Disk space on srv221 is OK: DISK OK [09:53:50] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:53:51] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:55:40] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [09:55:41] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [10:02:30] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:02:31] RECOVERY - MySQL slave status on es1004 is OK: OK: [13:25:40] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:25:41] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:51:57] !log Restarting job runner on srv236, seems to be stuck [13:51:58] Logged the message, Mr. Obvious [13:59:44] RoanKattouw, think more than srv236 is stuck :p [14:00:28] I know [14:00:38] !log Restarting all job runners that are stuck [14:00:40] Logged the message, Mr. Obvious [14:04:10] !log Restarting job runners on srv242 and mw25, those are the last ones that are stuck [14:04:10] Logged the message, Mr. Obvious [14:06:20] funny how quickly that seems to have taken affect [14:06:55] Yeah [14:07:05] http://ganglia3.wikimedia.org/graph.php?r=20min&z=xlarge&c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=93084&m=enwiki%20JobQueue%20length&jr=&js= [14:08:45] Also begs the question, why wasn't this noticed before [14:10:15] Because the Nagios check is broken [14:10:18] I'm lookign at that now [14:10:33] lol, typical [14:17:05] * hexmode wakes back up, looks at the backscroll [14:20:10] RoanKattouw, Reedy: what sort of things were stuck? Just jobqueue? Anything in particular that would cause problems with (that I might notice)? [14:20:50] i noticed by someone complaining a user rename hadn't gone through [14:20:55] quick look suggested it had backed up over a week [14:21:06] heh [14:21:18] Anything that is on a deferred job could be noticeable [14:21:22] link updates etc [14:21:39] so stuff might have been sitting for a week... but I don't remember any new bugs like that [14:21:50] Reedy: tyvm :) [14:22:43] people only really sweem to notice when it gets even worse [14:24:25] hexmode: Hi, noticed Bugzilla upgrade and SiteMap extension yet? like can people google bugs better than before already? 
it did submit a sitemap, but it almost seemed too quick to make me believe all bugs where in there.. about to check again [14:25:25] mutante: I tried yesterday, but I didn't see any [14:25:53] New patchset: Catrope; "Fix Nagios job queue check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:26:03] Reedy: yea, your RT ticket was good, it made me notice [14:26:10] mutante: which search engines did you submit it to? Is it just google? [14:26:18] me [14:26:34] mutante: https://gerrit.wikimedia.org/r/1766 should fix the nagios check [14:26:44] mutante, heh, it was a case of when I noticed, I think only Tim was around, and it wasn't worth any sort of emergenchy [14:26:56] hexmode: Live: OK [14:26:56] Google: OK [14:26:56] Ask: OK [14:27:03] Yahoo: FAILED [14:27:26] Who cares about yahoo? [14:27:30] hexmode: the extension did that default, i didnt pick them [14:27:40] mutante: k, and the first disallow looks wonky.. but this is what mozilla uses, right? [14:27:56] hexmode: the robots.txt has also been created by the extension [14:28:04] k [14:28:21] hexmode: not what mozilla uses, or the manual suggestion, but after i saw the extension creates it, i left it that way [14:28:22] I think mozilla used a slightly different one [14:28:45] yea, you can see the diff in the ticket [14:30:00] mutante: google "bug site:bugzilla.wikimedia.org" and only 3 hits :( [14:30:44] others show up when you allow duplicates [14:30:56] but it google doesn't show any cached info [14:31:12] and only very basic info [14:31:54] mutante: what email did you use for the sitemap submission? Should I check webmaster@wikipedia email? [14:32:00] * hexmode goes to check anyway [14:32:11] hold on, back in 2 minutes [14:33:42] New review: Dzahn; "works on spence." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1766 [14:33:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:34:08] I love how people add bugs and say "Add this to LocalSettings" etc [14:36:52] hexmode: checking datadir of sitemap extension on kaulen.. [14:39:23] hrm... email to webmaster@ ... someone needs to be on top of this. Someone sent an email back in October complaining about "403 Requested target domain not allowed." when trying to get to enwiki and asked "Any particular reason why? I find your site quite useful. Do we have to pay now, or have an account or something? And if so, how do we go about doing what we need to get in?" [14:40:32] "The sitemap is generated dynamically every time a search engine asks for it." [14:40:57] If you want to see or download the sitemap yourself for some reason, go to the URL page.cgi?id=sitemap/sitemap.xml on your Bugzilla installation. [14:43:53] hrmm..doesnt look right yet.. i'll see [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:54:00] RobH: Did you have trouble with the digicert ssl certificate? Saw a still active email from last month about the cert to webmaster@... [14:54:32] huh? [14:54:40] i have no idea what yer talking about [14:54:47] no i guess not? [14:54:48] heh [14:55:21] just never approve anything for certs if you get it, folks can submit for certs to most root level emails like root, webmaster, etc... 
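Because the Bugzilla sitemap is generated on request (as quoted above), it can be inspected directly instead of waiting for the next crawl. A small sketch, assuming the extension answers at the path mentioned above on bugzilla.wikimedia.org and lists one <loc> entry per indexed page:

    # fetch the dynamically generated sitemap and count its URL entries
    curl -s 'https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml' \
        | grep -o '<loc>' | wc -l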
[14:55:32] even if they arent us, the only check and balance is not approving them [14:55:44] robh it was webmaster@wikimedia.org [14:55:57] maybe ctwoo got it and replied? [14:56:03] i wouldnt worry about it. [14:56:11] or its someone trying to get a cert and they arent us [14:58:25] RobH: do you even have a digicert account? [14:58:56] yes [14:59:00] we are using one for testing mobile [14:59:06] again you dont need to worry about it ;] [14:59:20] if you think its a major issue yuou can forward it to me [14:59:27] but i know they are doing mobile testin gwith it, so its prolly fine [14:59:55] though it should prolly be set to not email you [14:59:59] so forward to me and i handle it [15:00:48] it should have emailed more than webmaster [15:00:50] RobH: ctwoo and robla have access webmaster@wikimedia.org to email, too ... it isn't coming to me [15:00:52] (its also just a testing cert) [15:00:58] ? [15:01:07] i thought you just said you got the email? [15:01:36] hexmode: i am really not sure what you are asking then. You got an email from digicert, or you didnt? [15:02:24] I meant I'm going through the email to webmaster@ (something I haven't done and someone needs to do more regularly) and I saw this from last month [15:02:42] hexmode: here's another cosmetic issue with bugzilla, while trying to fix the other one: File does not exist: /srv/org/wikimedia/bugzilla/favicon.ico [15:02:44] You're right: I shouldn't be the one doing this [15:02:56] ok, so I am still not sure what you want from me. You can forward it to me and I can look at it, but otherwise i have no answer for you [15:03:01] hexmode: if you like a favicon ;) [15:03:01] they are testing the cert, thats all i know. [15:03:34] mutante: I see a favicon in firefox! but maybe there is something missing. [15:03:38] not trying to be difficult, im just not sure what you want ;] [15:04:18] robh: no, I know. I just thought this might be important. I'll talk to ctwoo and robla about the future of this [15:04:27] future of what? [15:04:31] there is some email in here that should be responded to [15:04:42] please just forward me the email. [15:04:46] and it isn't clear that it has been [15:04:47] k [15:04:49] i dont think it is important though [15:04:57] but will review it [15:05:14] and get it tied to anohter email [15:05:26] sent [15:06:14] RobH: other emails here should go to legal or community people.... but yeah, this one is for you :) [15:09:31] http://support.google.com/a/bin/answer.py?hl=en&answer=167430 [15:10:39] mutante: thanks! my own, personal lmgtfy ;) [15:11:21] :) [15:11:34] mutante: WTF, where's that Nagios message :) [15:11:48] Did you update the puppetmaster and run puppet on spence? [15:11:57] RoanKattouw: http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_job_queue [15:12:05] RoanKattouw: ask where the bot is, i think [15:14:25] Last State Change:01-03-2012 14:48:36 [15:14:26] hrmm [15:14:51] ah Roan, it's still in a SOFT state, that's why [15:14:57] SOFT? 
[15:15:09] one more failed check and it will turn into a HARD state [15:15:11] * RoanKattouw doesn't actually know anything about Nagios [15:15:24] Current attempt: 2/3 (SOFT state) [15:15:46] it tries it 3 times, and it only turns into a "real" (HARD) critical if it fails 3 times in a row [15:16:07] this is a feature to avoid alarms if a service is just flapping or down for 1 minute and then back again [15:16:38] http://nagios.sourceforge.net/docs/3_0/statetypes.html [15:17:24] Ah, OK [15:18:51] Next Scheduled Check: 01-03-2012 15:27:16 [15:19:04] mutante: Does it still run every 10 min? I guess that's why it's taking half an hour to go into HARD state? [15:19:19] (Well, 20 mins best case, 30 worst case, to be fair) [15:20:46] hexmode: ahhh, this is super old [15:20:56] i did this a long time ago, it just seems to email webmaster as well as dnsadmin [15:21:00] you can completely disregard this [15:21:14] :) [15:21:14] but thanks for checking on it =] [15:21:31] RoanKattouw: normal_check_interval => 15, [15:21:32] 540 retry_check_interval => 15, [15:21:48] omg [15:21:50] RoanKattouw: yeah, thats right, so it takes 45 minutes at worst [15:22:01] sorry if i was terse, doing expense reports =P [15:22:04] RoanKattouw: is that too long..? 45 minutes after the first queue is over 10.000 [15:22:12] RoanKattouw: we survived 100k as well ? :p [15:22:41] Well, it seems a bit ineffective to me that way is all [15:22:49] The check is no longer ridiculously slow [15:23:01] suggest a value [15:23:36] made the puppet monitor_service configurable for that, so no problem [15:24:35] nagios.pp lines 539,540 [15:24:56] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:24:56] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:34:37] New patchset: Dzahn; "job_queue: tweak retry check interval" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:35:37] New review: Dzahn; "keep the regular interval at 15 minutes, but if it fails once (SOFT), keep re-checking every 5 minut..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1767 [15:35:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:36:06] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:36:07] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:36:17] RoanKattouw: there you go [15:36:28] yay [15:36:29] hexmode: you've got mail :) [15:36:56] RoanKattouw: retry_check_interval is just that, "if it failed once already, then keep re-checking more often" [15:37:28] so the HARD state should be reached in 15 + 5 + 5 now [15:37:35] Thehelpfulone: you're like my own little biff ;) [15:37:51] OK, good [15:37:55] :D [15:45:45] hexmode: are there known issues with users voting on bugzilla? 
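A minimal sketch of the SOFT/HARD behaviour described above, using the intervals from the merged change (15-minute normal interval, 5-minute retries, three attempts); this only illustrates the scheduling logic, it is not how Nagios itself is implemented.

    # a persistent failure goes HARD (and notifies) after 15 + 5 + 5 minutes at
    # worst; a check that fails once and then recovers never leaves SOFT state
    run_check() { check_job_queue; }   # placeholder; substitute the real plugin
    normal_interval=900 retry_interval=300 max_attempts=3 failures=0
    while true; do
        if run_check; then
            failures=0
            sleep "$normal_interval"
        else
            failures=$((failures + 1))
            if [ "$failures" -ge "$max_attempts" ]; then
                echo "HARD CRITICAL (notify, then stay HARD until recovery)"
                sleep "$normal_interval"
            else
                echo "SOFT CRITICAL ($failures/$max_attempts), rechecking sooner"
                sleep "$retry_interval"
            fi
        fi
    done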
[15:46:04] "Can't locate data/template/extensions/Voting/template/en/default/pages/voting/user.html.tmpl" [15:47:44] this is just stuff i happen to notice in Apache log, didnt try in a browser yet [15:49:06] aha: http://www.bugzilla.org/releases/4.0/release-notes.html#v40_feat_vot_ext [15:49:20] "Voting" is an extension since 4.0 (but wasnt before) [15:50:42] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [15:50:56] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [15:53:21] mutante: I like the voting and would like to get it put in... BUT I want to change the wording [15:53:26] s/vote/follow/ [15:54:52] New review: Dzahn; "typo: ensure => lastest; != latest" [operations/puppet] (production); V: -1 C: -1; - https://gerrit.wikimedia.org/r/1768 [15:55:34] !log Created wikilove tables on siwiki [15:55:35] Logged the message, Master [15:56:53] hexmode: ok, you may want to keep that in a bug somewhere then (that it currently does not find that template, and now needs to be disabled as an extension) [15:56:54] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33485 - Enable WikiLove in si.wikipedia' [15:56:55] Logged the message, Master [15:57:06] hexmode: eh. s/disabled/enabled :) [15:59:25] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [15:59:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [16:19:29] apergos, is it past 10 for you? :) [16:23:02] no [16:23:22] give me a few minutes, I'm in the middle of something on another system [16:25:48] sure, no prob [16:28:42] ok well i's going to take a lot more than a few minutes but I don't have to watch it, it seems [16:28:45] so... go ahead/ [16:28:46] ? [16:28:57] bug 32404 [16:30:03] it seems to happen when pages are invalidated [16:30:26] in the same invalidation process, some pages are ok, while others are not [16:30:53] mutante: Is bz in git and puppet yet? [16:31:12] i noticed there were some recent issues about pages being rendered by old copies of MW [16:31:20] i wonder if this is not the case too [16:32:58] apergos? [16:33:12] I'm here [16:33:16] did you read? [16:33:19] I'm just looking at the current state of things [16:33:20] uh huh [16:33:21] ok [16:33:40] if you could check what server recreated some pages, maybe we can have a clue of what's going on [16:34:22] what server? we're not going to have that information [16:34:29] sad but true... [16:34:55] i recall RoanKattouw deploying some patch to do that..., no? [16:35:34] To do what now? [16:35:53] Oh, that [16:35:58] Tim investigated that bug [16:35:59] so we're back with the problem of bad entries in pagelinks [16:36:00] to include some debug info about what server created the cached page [16:36:07] what did he find, anything useful? 
[16:36:30] i have an example of 30 december if it's needed [16:38:24] http://pt.wiktionary.org/w/api.php?action=query&prop=info&format=xml&titles=kukka [16:38:29] http://pt.wiktionary.org/w/api.php?action=query&prop=links&format=xml&pllimit=10&titles=kukka [16:38:42] cached 4 days ago [16:41:08] Images in ns 0 [16:42:02] yeah they are, that's what's currently in the db all right [16:42:35] this is after Tim's investigation for bug 31576 [16:45:43] http://pt.wiktionary.org/w/index.php?title=Predefini%C3%A7%C3%A3o:-fo-&action=edit [16:46:02] it's all these templates that have the old namespace name in them [16:46:15] what a PITA [16:47:14] yes, that's what eventually causes problematic pages [16:47:31] but even in pages that are not transcluded, it happens [16:47:55] there are ns 0 entries for User_talks [16:48:53] The link table entries could be much older though [16:48:59] Because not all reparses fix them [16:49:42] news ones are being created each time [16:50:21] i noticed that after a user changed an image in a template, and suddenly there were hundreds of wrong links to that image in the next db dump [16:50:32] Aha, so it's still the job runners [16:50:47] that "kukka" page above was touched 30th december [16:50:56] and it has wrong pagelinks [16:51:25] The fact that it was touched then doesn't necessarily mean anything [16:51:47] you mean it could be wrong before too? [16:51:56] Yeah [16:52:04] It's probably still the job runners doing this [16:52:07] if needed, i can check the previous dump [16:52:16] while take some time though [16:52:52] malafaya: You should bring this problem to Tim's attention, though. We now have the job runners throw an exception if they encounter the magic word bug, but I don't think anything is guarding for the namespace bug [16:54:54] ok, meanwhile i just added him to the CC list [16:57:10] great [17:02:01] hello, [17:04:47] mutante: BTW, the job queue thing doesn't seem to be related to the new year: http://ganglia3.wikimedia.org/graph.php?r=week&z=xlarge&c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=93084&m=enwiki%20JobQueue%20length&jr=&js= [17:10:17] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:10:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:16:22] RoanKattouw, mind checking the job runners again? enwiki looks to be plateauing [17:16:39] Might just be it's hit big jobs [17:16:45] Looking [17:18:25] Well, there's a quirk in the job running system that's at least partially responsible [17:18:36] We fork off 5 runJobs.php threads, and tell them to stop after 300s [17:18:53] So after each job is complete, the thread checks how long it's been running for and shuts down if that's >300s [17:19:14] But if one of the threads decides to take on a job that takes 10 minutes ... [17:19:28] all the other ones die as they should but that one just keeps running [17:19:45] And jobs-loop doesn't move on to the next wiki because not all runners have terminated yet [17:21:41] hmm, mw5 seems to be stuck [17:21:57] But that's the only one [17:22:13] It's probably the crappy timeout phenomenon I described above that's causing this [17:22:28] Oh, no, mw5 is fine after all [17:22:38] 03 02:11:51 < jeremyb> domas: you were responding about status.wm.o, right? 
petan was asking about stats.wm.o [17:22:47] Its job runners were all at 0% CPU when I looked at it, but now their up again [17:23:16] can you check if there are decomissioned ones (the ones that could create the pagelinks problem)? [17:23:51] I'm not sure offhand how to get a list of decommissioned servers, but I'll try [17:24:13] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:25:22] jeremyb: yeh [17:25:31] RoanKattouw: "The logs show it was srv159 running jobs with an old copy of MediaWiki. It still had a job runner on it despite it being marked decommissioned. Roan took it out of the mediawiki-installation group a couple of weeks ago." [17:25:33] jeremyb: I guess if people complain about that, they can use http version [17:25:40] jeremyb: it doesn't have offending cert [17:25:47] Yeah [17:25:47] something like that ^^ [17:25:57] 03 02:35:25 < jeremyb> makes sense that it's CN is nagios because it's the same IP as nagios. but maybe spence is ok to have star? [17:26:05] domas: ^ :) [17:26:19] jeremyb: it isn't nagios [17:26:23] it is watchmouse.com [17:26:26] and it is hosted on AWS [17:26:30] domas: no... [17:26:36] no what? [17:26:43] domas: status!=stats [17:26:53] ah, another one [17:27:00] Well shuit [17:27:05] malafaya: You're right [17:27:16] again? [17:27:31] jeremyb: eh, who cares, it is private host :) [17:27:40] domas: petan :) [17:27:48] he shouldn't! [17:27:48] RoanKattouw: in what exactly? [17:28:12] dunno [17:28:16] There are a whole bunch of old job runners running jobs [17:28:16] what is wildcard cert policy [17:28:24] roankattouw: they need time bomb [17:28:28] :) [17:28:33] idk either but it seems fairly widespread [17:28:38] but thats same as old web servers [17:29:03] jeremyb: frankly, I'd use internal CA for anything internal-facing [17:29:31] otoh, I don't have fundraising team that creates blinking banners [17:29:37] heehehehehe [17:29:48] !log Stopping job runners on the following DECOMMISSIONED servers: srv151 srv152 srv153 srv158 srv160 srv164 srv165 srv166 srv167 srv168 srv170 srv176 srv177 srv178 srv181 srv184 srv185 [17:29:49] Logged the message, Mr. Obvious [17:30:14] Oops [17:30:24] heh, yea [17:30:53] OK that didn't work [17:30:55] they didnt look like they were actively processing jobs though..just sitting in the stuck state [17:31:04] srv158: start-stop-daemon: warning: failed to kill 1622: No such process [17:31:13] I'm killing the processes then [17:31:14] domas, that's where they are going wrong [17:31:18] You need blinking banners [17:31:19] srsly [17:31:27] RoanKattouw, if i'm right, i was just lucky :) i only had a very small feeling it could be related [17:31:34] Hmm, you're right [17:31:35] reedy: we need blinking banners so that we can support our 10% loaded cluster!!!!11 [17:31:36] domas: the thing is stats.wm.o *is* external [17:31:37] They're headless [17:31:54] eh [17:32:22] mutante: I gotta run, could you finish this? A list of servers with bad processes is in /home/catrope/badjobrunners [17:32:34] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts typofix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:32:39] RoanKattouw: ok [17:33:11] is it just `killall php`? 
:) [17:33:25] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1770 [17:33:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:33:40] as long as it's not on solaris [17:39:20] !log killing more runJobs.php / nextJobDB.php processes on a bunch of servers (/home/catrope/badjobrunners) [17:39:21] Logged the message, Master [17:44:14] jeremyb: yes, just killall php was enough [17:49:17] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:49:18] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:57:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [17:57:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:17:48] New patchset: Jgreen; "fundraising mail config for aluminium/grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:25:54] New patchset: Dzahn; "give sudo access to khorn on grosley/aluminium per RT 2196" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:26:27] New patchset: Jgreen; "fundraising mail config for aluminium/grosley (typofix)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:28:12] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1771 [18:28:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:32:07] New review: Dzahn; "approved by woosters" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1772 [18:32:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:51:18] why don't we have more recent photos? :( https://commons.wikimedia.org/wiki/Category:Wikimedia_servers [18:59:03] RobH, ^ [18:59:13] Did you ever get around to photoing eqiad? [19:00:19] https://plus.google.com/photos/114688368536436281597/albums/5596244252518518929 [19:00:23] just 9 of the new center [19:00:43] Do we have a g+ import bot? :D [19:00:46] the others didnt come out well, i need to redo. [19:00:57] Nemo_bis, ^ [19:01:05] oh wiat [19:01:06] damn it [19:01:12] those are super old, where are my eqiad shots... [19:01:17] Reedy: sorry about that [19:01:54] :o dataset1 [19:02:09] where is license? Google+ hides it :( [19:02:50] I'm fairly sure if it doesn't say it there, RobH will have cc-by-sa or similar [19:03:03] Nemo_bis: if its of the servers, and its on my plus account, they are open and you can put on commons. 
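For the record, the cleanup logged above boils down to walking the list of bad hosts and killing any leftover job-runner processes on each; per the follow-up, plain killall php was enough. A sketch of that loop, assuming passwordless SSH and the host list from the !log entry (an ops tool like dsh would do the same job):

    # show and kill stray runJobs.php / nextJobDB.php processes on hosts that
    # should no longer run jobs at all (list from /home/catrope/badjobrunners)
    while read -r host; do
        echo "== $host =="
        ssh -n "$host" 'pgrep -fl "runJobs|nextJobDB"; sudo killall php'
    done < /home/catrope/badjobrunners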
[19:03:16] i dunno where the hell google would let me put that [19:03:26] but since I took them, i am saying they are open content for commons =] [19:03:33] it's not hard on Picasa, not sure when you switch to g+ [19:03:43] yea, i dislike that they are joined now as well [19:03:45] =P [19:03:54] (though it did do away with space quotas) [19:04:07] I am exporting my new ashburn shots now, I will go ahead and throw them on commons [19:04:26] RobH: i think you need to specific a specific license [19:04:40] RobH: remember this is *commons* :) [19:05:10] yay [19:05:56] the shots i linked on g+ are already on commons [19:06:07] i neglected to upload the new shots is all (though I used them in my wikimania slides ;) [19:06:12] oh, well that solves that [19:06:35] yea, i just checked commons for them, some are there, some arent... rugh, i will take care of it though =] [19:07:11] hey, i am going to use the upload wizard for the first time ever (usually use commonist) [19:07:31] were the pics all reviewed by eqiad? [19:07:47] Yea, when I take them, I have to put in a work order and be escorted with my camera [19:08:01] i took them, then he reviewed them on the LCD on the camera to ensure it was just our stuff [19:08:07] oh, they just watch you take em. ok [19:08:15] technically their policy is they take them [19:08:21] i thought you send them the pics [19:08:23] but i know my camera better than them, so they are cool about it [19:08:31] =] [19:09:13] upload wizard go! (this is pretty nice actually) [19:09:15] OH NO, YOU'VE PHOTOED THE BACK OF ANOTHER COMPANYS SERVER. IT LOOKS EXACTLY THE SAME AS YOURS, WE CAN'T BE HAVING THAT [19:10:24] hehe, indeed [19:10:37] so its 39 shots, uploadin now =] will link when its all done [19:10:47] Nemo_bis: ^ [19:10:53] I can see their point, just seems a *little* extreme [19:11:09] Well, Eqiad has government acronym customers [19:11:15] yeah, i wasn't really sure if they cared or if it was just for the webcams [19:11:24] in their other DC one of the cages is also blacked out with fabic on the walls of it [19:11:36] Sweet [19:11:40] the webcams they technically watched me install, and reviewed the line of sight [19:11:48] and they arent allowed to pan, fixed view only [19:11:50] So you can't see all the archaic hardware they are still using [19:12:00] that sucks [19:12:08] the cia cyptography runs on vacuum tubes! [19:12:18] well, i said their name, now they are listening to this channel. [19:12:42] i've mentioned the cia a few times in #-glam [19:12:45] Reedy: mostly because the cages that arent ours arent nearly as pretty [19:12:55] heh [19:13:00] their datacenter techs obviously do not have the same degree of ocd for cabling that I do ;] [19:13:28] the upload wizard ETA seems a bit off [19:13:32] Yeah [19:13:38] Known issue [19:13:45] good enough =] [19:13:46] needs chunked upload [19:14:07] much of the https://www.google.com/search?q=scanning+league+iasl content is from the CIA [19:14:35] some is propaganda, some i guess is supposed to be internal (but e.g. about vietnam or even earlier) [19:33:38] Aloha :-) [19:39:26] !log reedy synchronized php-1.18/skins/common/images/ 'r107930' [19:39:27] Logged the message, Master [19:48:01] wow, uplaod wizard sucks for adding categories [19:48:08] why is there no 'apply this category to all uploads' [19:48:28] at least it's not adding the wrong cats [19:48:28] RobH: Complain to neilk_ , he wrote it ;) [19:48:45] hell, i want to cancel [19:48:58] but i dont wanna have to go clean images manually... 
are they in temp storage until i finish wizard [19:48:59] there might be an ETF [19:49:04] or will it not clean up after itself? [19:49:14] RobH: you can abandon at any time until the last step. They are in temp storage [19:49:27] RobH: I agree the process with categories and metadata kind of sucks. [19:49:30] good, i dont feel like clicking four times for all 40 images to add a category [19:49:46] mechanical turk [19:49:47] otherwise its slick, but not adding those bulk is painful [19:49:57] and its a hell of a lot better than what we had before [19:49:58] RobH: yeah, it's really the last feature on the agenda [19:50:24] RobH: but frankly, it's still kind of fundamentally dumb... should not have been designed this way [19:54:18] someone should write me a lightroom to commons plugin ;] [20:08:28] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [20:08:43] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [20:22:48] commist errored, uploaded all my shit and then messed up [20:22:49] sigh. [20:25:37] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:25:52] grr bad-prefix on every single commonist upload i just did. [20:27:34] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:28:00] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1773 [20:28:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:37:31] New patchset: Hashar; "integration: make homepage URLs relative" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1774 [20:43:06] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:43:07] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:49:36] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:49:37] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:16] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:17] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:55:31] API :( [20:55:41] Yup, the API is down [20:57:53] So what hapened? [21:06:08] hmm, no idea [21:06:28] Reedy: any updates on the API breakage? [21:06:41] its being actively worked by ops [21:06:44] Ops are looking into it [21:07:53] okay thanks [21:26:53] Caching for non-logged in users seems to have slipped on en.wp -- is that handled by the job runners or is it a separate issue? [21:27:30] Jarry1250: You mean they're not served by Squid, but by the apaches? [21:27:59] What do you mean by "caching has slipped"? [21:28:30] If you go to certain wiki pages when logged out, they appear several hours out of date. [21:29:12] Are there edits to the actual page that are missing, or just edits to templates or transcluded pages? 
[21:29:23] Because the job queue did get backed up over the past week [21:29:35] Basically, the job runners got completely stuck around Christmas, and we unstuck them todfay [21:29:39] http://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Aus_dem_M%C3%A4rchenschatz_der_Kaschubei.djvu/page38-465px-Aus_dem_M%C3%A4rchenschatz_der_Kaschubei.djvu.jpg is outtiming here [21:30:00] At the time there were ~100k jobs in the enwiki queue and ~11k in the commonswiki queue [21:30:13] * RobH adds to commonswiki with image uploads [21:30:22] must slow down stie more. [21:30:39] I think commons has probably caught up now [21:30:41] enwiki is down to ~30k [21:30:57] Roan: I thought it might be related but actually we're talking straight edits here not edits to transcluded pages. [21:31:00] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [21:31:25] hmm, then it should just work [21:31:48] OTOH lvs4 was borked during today's outage and I have no ideas what else was behind that [21:32:03] I know it killed the API, and I hear the PDF renderers and the image scalers were also affected [21:33:12] It could be it, I can't remember when I saw it before. The worrying thing is, it's not just missed one edit but still hasn't caught up even after 5 or 6. [21:33:30] Strange [21:33:35] When was the last edit madE? [21:33:57] About 10 minutes ago. [21:34:30] https://en.wikipedia.org/w/index.php?title=Wikipedia:Administrators'_noticeboard&action=history [21:34:43] It's stuck before my edit of 15:19 todaay [21:35:02] Not sure how much before, but it certainly hasn't got that one. [21:35:39] Well shit [21:35:41] [21:35:59] That's from wget "https://en.wikipedia.org/wiki/Wikipedia:Administrators'_noticeboard" [21:36:10] Let me repeat that, that timestamp is December 6th, 2011 [21:36:17] It's from a different YEAR [21:36:42] That's probably the alternate names with URLencodings issue [21:37:36] The URL with %27 gives me [21:37:49] That's 18 minutes ago [21:40:14] Come to think of it, I was actually on WP:AN, which I suppose is actually behind because of the jobrunners. [21:40:46] That uses transclusions, right? [21:40:53] Yeah then it's very likely to be behind [21:41:02] The job backlog is being eaten into fast, but it's big [21:41:40] No, it's a straight redirect I think, but I thought they were delayed updated as well? [21:41:56] It went from 100k to 30k in 7 hours, so that's about 10k/hr, measurements from the past hour are missing, so it should be another 2 hours before it's caught up [21:42:05] not supposed to be [21:47:55] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [22:03:25] neilk_: sent me bad juju for insulting upload wizard, causing me to typo the category name and thus have to manually correct all 40 images once uploaded ;] [22:03:49] all cuz i pointed out the category selection in upload wizard, damn you neilk_ ! [22:03:50] :o [22:03:51] ;] [22:04:00] * Nemo_bis praises RobH's courage in testing UW [22:04:04] Nemo_bis: so commons is slow on thumbnailing at the mooment, but they are now online [22:04:24] the wikimedia servers category now has eqiadwmf###.jpg [22:04:53] also, every server image i have uploaded is also in http://commons.wikimedia.org/wiki/User:RobH/gallery (and they are of course in the category listing) [22:05:47] thank you very much RobH ! 
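The stale-copy check Roan ran above can be repeated from a shell by fetching both spellings of the URL and pulling out the parser-cache comment near the end of the HTML, which carries the timestamp he quotes; the grep pattern below is an approximation of that comment.

    # compare what is served for the apostrophe and %27 spellings of the page;
    # one of these showed a parser cache timestamp from 2011-12-06
    for u in "https://en.wikipedia.org/wiki/Wikipedia:Administrators'_noticeboard" \
             "https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard"; do
        echo "== $u =="
        curl -s "$u" | grep -o 'Saved in parser cache.*'
    done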
[22:05:47] you can of course click to the full res of each, but its kinda a pain [22:05:58] thanks for asking, i had them done and honestly just forgot to upload them [22:06:05] just needed someone to remind me [22:06:40] if i ever find time to make vlc work properly as a video proxy, the cameras in the datacenters will also be public. [22:06:53] unfortunately, I can make vlc take in the info, but not output it properly so far =P [22:07:04] and all the camera's are internal IPs, so we need to gateway them [22:07:36] there are cameras in the datacenter? [22:07:57] yeah, they're currently all shut down though [22:08:11] well, the sdtpa ones are offline [22:08:17] the eqiad are online, but internal vlan only [22:08:23] so that means only ops folks can see them, sorry ;] [22:08:33] (but we dont want to keep it that way) [22:08:36] We used to have one aimed at the coffee machine so you could see if it was working ;-) [22:08:52] sounds right, have to monitor mission critical hardware =] [22:09:12] our SPOF [22:09:29] Altough we had a cold spare back than [22:09:35] RobH, what about the g+ album you linked before? [22:09:42] I didn't see them on Commons [22:09:51] and that wasn't eqiad, right? [22:10:29] those are sdtpa, i thought they were on commons =/ hrrmm [22:10:48] lemme see if I can pull the originals and upload them via commonist quickly (more quikcly then pulling form one to the other [22:10:57] knams is a short gallery :-) [22:11:15] don't break the API this time [22:11:44] no promises [22:12:27] !log reedy synchronizing Wikimedia installation... : Push r107938 r107948 [22:12:28] Logged the message, Master [22:13:00] your last image uploads before this are from 2009 [22:13:13] yep, heh [22:14:38] sync done. [22:16:04] found em, exporting and pushing to commons =] [22:16:27] i actually have 1`60 photos from that visit. [22:16:36] of that, i uploaded 9 cuz i liked them, the rest are just normal. [22:16:47] * RobH takes 1k photos, ends up liking 10 [22:17:26] I get more reports that thumbnail-generating on commons fails [22:17:42] its not making them for me either [22:17:52] apergos: ? (due to backup of queue ebing fixed?) [22:18:30] I don't know why thumbs would fail now [22:18:42] it's not a job queue sort of thing, they are done on demand [22:18:47] well, its not making any for the stuff i just uploaded [22:20:27] when yo ugo view one of them what does it do? [22:20:48] http://commons.wikimedia.org/wiki/File:Eqiadwmf_9043.jpg example [22:20:51] apergos: here it times out, and nginx reports a "502 bad gateway" [22:20:56] it just shows the blank placeholder and keeps 'loading' [22:21:06] waiting for upload.w.o [22:21:12] I guess you want to check the scalers and ms5 [22:21:33] oh if only ganglia wasnt slow. [22:21:36] =P [22:21:53] :-P [22:23:45] apergos: thats you runnin tcmpdump on ms5? [22:23:49] no [22:23:51] folks are on it [22:23:58] I am on it [22:24:01] I will get back off too [22:24:08] leave the jobs running, they are for ganglia [22:24:10] two connections are 24 nad 25 days old ;] [22:24:13] yep [22:24:19] those are ben's I think [22:24:29] running in a screen session, they can keep doing that [22:24:47] so what exactly should I check on here? learn me something =] [22:25:06] I alreadya looked at the syslog and saw nothing, I checked logs on one of the random scalers and saw nothing [22:25:13] how is lvs for those? 
[22:25:44] I was hoping for somthing like either a high load or some nfs timeouts on a scaler [22:26:36] lvs4 is active, should be doing that [22:26:42] i look on it, it isnt sending traffic to them [22:26:47] sounds familiar =P [22:26:51] does it say they are depooled? [22:26:57] yeah it does. [22:27:00] checking [22:27:47] every single rendering server is pooled [22:27:50] in lvs [22:27:57] they just arent being sent traffic [22:28:05] ok [22:28:07] atleast ipvsadm on lvs4 seems to show that [22:28:09] 0 connections [22:28:12] right [22:29:16] apergos: yea... same issue [22:29:22] scalers need 10.2.1.21 [22:29:31] and its not in the ip loopback on lvs4. [22:29:31] ok [22:29:45] i am going to forfe a puppet run to see if it fixes, but i doubt it will [22:29:50] since its run already [22:30:06] go ahead [22:31:22] nope, didnt fix, manually adding [22:31:56] ok [22:32:34] !log added in the lo addres to lvs4, now its working and generating thumbnails [22:32:35] Logged the message, RobH [22:32:38] fixed! [22:32:52] DaBPunkt: working now [22:33:11] slow, but yes [22:33:27] well, i just made it generate about 60 thumbs just a xsecond ago [22:33:28] heh [22:33:58] aww, look at the old cage, its twice that size now [22:34:00] http://commons.wikimedia.org/wiki/File:Sdtpa_wmf-8.jpg ;] [22:34:11] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [22:34:30] RobH: What category of hosts is cp10NN again? [22:39:58] RobH: seems to be very dangerous data, if they need such a big cage ;) [22:40:03] poweredge r610 [22:41:06] more specifically, r610, 32gb memory, single cpu intel E5640 4 core [22:41:11] and 4 SSD drives [22:41:35] DaBPunkt: it only looks big from the outside, its not roomy in the cage =] [22:41:56] well, i take that back, the new eqiad cage is huge, but i didnt link those photos (they are on commons now though) [22:42:15] http://commons.wikimedia.org/wiki/Category:Wikimedia_servers, all the eqiad have eqiad in file name [22:42:41] but thats cuz the cage is going to be expanded by another two rows, and we planned for that on initial buildout [22:42:44] mm, there are many fantasy books where things are bigger on the inside, than on the outside -. that's the first time I hear about something the way arround ;) [23:01:20] !log reedy synchronizing Wikimedia installation... : Pushing r107953, r107955, r107956, r107957 [23:01:21] Logged the message, Master [23:03:18] sync done. [23:03:34] !log on spence: restarting gmetad [23:03:35] Logged the message, Master [23:22:58] gn8 folks [23:23:25] Night DaBPunkt! [23:34:23] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:34:24] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
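The manual fix applied to lvs4 above, roughly: check whether the scalers' service IP is bound on the loopback interface and whether LVS is actually forwarding traffic for it, then add the address back by hand. The IP is the one named in the log; the /32 prefix and the exact ip/ipvsadm invocations are assumptions (puppet normally manages this, which is why a puppet run was tried first).

    # on lvs4: is the rendering service IP on lo, and is LVS forwarding for it?
    ip addr show dev lo | grep 10.2.1.21 || echo "service IP missing from lo"
    ipvsadm -L -n | grep -A 5 '10.2.1.21'   # realservers and connection counts
    # the manual fix from the !log entry, if the address is missing:
    ip addr add 10.2.1.21/32 dev lo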