[00:20:22] Reedy, only en.wiki? we might have experienced some issues on Meta and even testwiki [00:37:05] Nemo_bis, it's only logged for enwiki [00:37:15] If enwiki is having issues, it's likely the others are too [00:37:24] indeed [00:37:29] Fixing one will fix them all [00:37:41] well, after it's caught up ;) [00:38:15] uh, queue is rising at sight on Meta [00:40:04] I actually came by to say that wp.en seemed to suddenly get a lot slower, decided not to say anything in case it was just my perception, not sure if related [00:40:52] Job queue being backed up wouldn't slow it down [00:41:13] well if the server was suddenly having trouble, the job queue would begin backing up and pages would be slower :) [00:41:23] "the" [00:41:30] You realise we have more that one, right? :p [00:41:32] yes [00:41:34] semantics [00:41:36] I ignore them [00:42:04] job queue items that can be done asynchronously [00:42:12] I actually just got this now [00:42:15] Request: POST http://en.wikipedia.org/w/index.php?title=List_of_PlayStation_3_games&action=submit, from 71.95.101.232 via sq60.wikimedia.org (squid/2.7.STABLE9) to 208.80.152.72 (208.80.152.72) [00:42:15] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 03 Jan 2012 00:36:13 GMT [00:42:23] Reedy: Don't be silly, we all know the cluster is powered off a 486 that is powered by a hampster in a wheel >.> <.< [00:42:23] That's certainly not related [00:43:25] Job queue issue seems to have been going on a week or more [00:43:37] ok I hadn't heard of that [00:44:53] Indeed [00:45:05] Stupid irc client [02:05:22] !log LocalisationUpdate completed (1.18) at Tue Jan 3 02:05:21 UTC 2012 [02:05:23] Logged the message, Master [02:11:52] domas: you were responding about status.wm.o, right? petan was asking about stats.wm.o [02:34:13] (ssl error; it's not expired but it's both self-signed (by fred!) and also the wrong CN) [02:35:25] makes sense that it's CN is nagios because it's the same IP as nagios. but maybe spence is ok to have star? [03:11:54] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:11:55] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:41:24] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:25] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:56] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:56] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:57] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
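A quick way to reproduce the job-queue figure being discussed above without shell access is MediaWiki's public siteinfo statistics; the cluster-side equivalent is the showJobs.php maintenance script. The API parameters are standard MediaWiki, but treating mwscript/showJobs.php as how the number was actually checked here is an assumption.

    # job queue length for enwiki via the public API (the "jobs" field)
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' \
        | grep -o '"jobs":[0-9]*'
    # roughly the same number from a cluster host (wrapper name assumed)
    mwscript showJobs.php --wiki=enwiki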
[03:50:57] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:16] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [03:58:17] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:36] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:37] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:03:43] O.O Double bot [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [07:01:40] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:01:41] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
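The srv273 alerts above are NRPE socket timeouts rather than real disk or RAID findings, which usually means the agent on the host is wedged or overloaded; re-running the same check by hand from the monitoring host helps tell the two apart. A minimal sketch, with the plugin path and the remote command name as assumptions:

    # re-run one of the timed-out checks manually with a more generous timeout
    /usr/lib/nagios/plugins/check_nrpe -H srv273 -c check_disk_space -t 30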
[08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [09:07:55] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:07:56] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:25:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:25:36] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:27:15] RECOVERY - Disk space on srv222 is OK: DISK OK [09:27:16] RECOVERY - Disk space on srv222 is OK: DISK OK [09:49:30] RECOVERY - Disk space on srv221 is OK: DISK OK [09:49:31] RECOVERY - Disk space on srv221 is OK: DISK OK [09:53:50] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:53:51] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:55:40] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [09:55:41] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [10:02:30] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:02:31] RECOVERY - MySQL slave status on es1004 is OK: OK: [13:25:40] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:25:41] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:51:57] !log Restarting job runner on srv236, seems to be stuck [13:51:58] Logged the message, Mr. Obvious [13:59:44] RoanKattouw, think more than srv236 is stuck :p [14:00:28] I know [14:00:38] !log Restarting all job runners that are stuck [14:00:40] Logged the message, Mr. Obvious [14:04:10] !log Restarting job runners on srv242 and mw25, those are the last ones that are stuck [14:04:10] Logged the message, Mr. Obvious [14:06:20] funny how quickly that seems to have taken affect [14:06:55] Yeah [14:07:05] http://ganglia3.wikimedia.org/graph.php?r=20min&z=xlarge&c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=93084&m=enwiki%20JobQueue%20length&jr=&js= [14:08:45] Also begs the question, why wasn't this noticed before [14:10:15] Because the Nagios check is broken [14:10:18] I'm lookign at that now [14:10:33] lol, typical [14:17:05] * hexmode wakes back up, looks at the backscroll [14:20:10] RoanKattouw, Reedy: what sort of things were stuck? Just jobqueue? Anything in particular that would cause problems with (that I might notice)? [14:20:50] i noticed by someone complaining a user rename hadn't gone through [14:20:55] quick look suggested it had backed up over a week [14:21:06] heh [14:21:18] Anything that is on a deferred job could be noticeable [14:21:22] link updates etc [14:21:39] so stuff might have been sitting for a week... but I don't remember any new bugs like that [14:21:50] Reedy: tyvm :) [14:22:43] people only really sweem to notice when it gets even worse [14:24:25] hexmode: Hi, noticed Bugzilla upgrade and SiteMap extension yet? like can people google bugs better than before already? 
it did submit a sitemap, but it almost seemed too quick to make me believe all bugs where in there.. about to check again [14:25:25] mutante: I tried yesterday, but I didn't see any [14:25:53] New patchset: Catrope; "Fix Nagios job queue check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:26:03] Reedy: yea, your RT ticket was good, it made me notice [14:26:10] mutante: which search engines did you submit it to? Is it just google? [14:26:18] me [14:26:34] mutante: https://gerrit.wikimedia.org/r/1766 should fix the nagios check [14:26:44] mutante, heh, it was a case of when I noticed, I think only Tim was around, and it wasn't worth any sort of emergenchy [14:26:56] hexmode: Live: OK [14:26:56] Google: OK [14:26:56] Ask: OK [14:27:03] Yahoo: FAILED [14:27:26] Who cares about yahoo? [14:27:30] hexmode: the extension did that default, i didnt pick them [14:27:40] mutante: k, and the first disallow looks wonky.. but this is what mozilla uses, right? [14:27:56] hexmode: the robots.txt has also been created by the extension [14:28:04] k [14:28:21] hexmode: not what mozilla uses, or the manual suggestion, but after i saw the extension creates it, i left it that way [14:28:22] I think mozilla used a slightly different one [14:28:45] yea, you can see the diff in the ticket [14:30:00] mutante: google "bug site:bugzilla.wikimedia.org" and only 3 hits :( [14:30:44] others show up when you allow duplicates [14:30:56] but it google doesn't show any cached info [14:31:12] and only very basic info [14:31:54] mutante: what email did you use for the sitemap submission? Should I check webmaster@wikipedia email? [14:32:00] * hexmode goes to check anyway [14:32:11] hold on, back in 2 minutes [14:33:42] New review: Dzahn; "works on spence." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1766 [14:33:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:34:08] I love how people add bugs and say "Add this to LocalSettings" etc [14:36:52] hexmode: checking datadir of sitemap extension on kaulen.. [14:39:23] hrm... email to webmaster@ ... someone needs to be on top of this. Someone sent an email back in October complaining about "403 Requested target domain not allowed." when trying to get to enwiki and asked "Any particular reason why? I find your site quite useful. Do we have to pay now, or have an account or something? And if so, how do we go about doing what we need to get in?" [14:40:32] "The sitemap is generated dynamically every time a search engine asks for it." [14:40:57] If you want to see or download the sitemap yourself for some reason, go to the URL page.cgi?id=sitemap/sitemap.xml on your Bugzilla installation. [14:43:53] hrmm..doesnt look right yet.. i'll see [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:54:00] RobH: Did you have trouble with the digicert ssl certificate? Saw a still active email from last month about the cert to webmaster@... [14:54:32] huh? [14:54:40] i have no idea what yer talking about [14:54:47] no i guess not? [14:54:48] heh [14:55:21] just never approve anything for certs if you get it, folks can submit for certs to most root level emails like root, webmaster, etc... 
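Because the Bugzilla sitemap is generated on request (as quoted above), it can be inspected directly instead of waiting for the next crawl. A small sketch, assuming the extension answers at the path mentioned above on bugzilla.wikimedia.org and lists one <loc> entry per indexed page:

    # fetch the dynamically generated sitemap and count its URL entries
    curl -s 'https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml' \
        | grep -o '<loc>' | wc -l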
[14:55:32] even if they arent us, the only check and balance is not approving them [14:55:44] robh it was webmaster@wikimedia.org [14:55:57] maybe ctwoo got it and replied? [14:56:03] i wouldnt worry about it. [14:56:11] or its someone trying to get a cert and they arent us [14:58:25] RobH: do you even have a digicert account? [14:58:56] yes [14:59:00] we are using one for testing mobile [14:59:06] again you dont need to worry about it ;] [14:59:20] if you think its a major issue yuou can forward it to me [14:59:27] but i know they are doing mobile testin gwith it, so its prolly fine [14:59:55] though it should prolly be set to not email you [14:59:59] so forward to me and i handle it [15:00:48] it should have emailed more than webmaster [15:00:50] RobH: ctwoo and robla have access webmaster@wikimedia.org to email, too ... it isn't coming to me [15:00:52] (its also just a testing cert) [15:00:58] ? [15:01:07] i thought you just said you got the email? [15:01:36] hexmode: i am really not sure what you are asking then. You got an email from digicert, or you didnt? [15:02:24] I meant I'm going through the email to webmaster@ (something I haven't done and someone needs to do more regularly) and I saw this from last month [15:02:42] hexmode: here's another cosmetic issue with bugzilla, while trying to fix the other one: File does not exist: /srv/org/wikimedia/bugzilla/favicon.ico [15:02:44] You're right: I shouldn't be the one doing this [15:02:56] ok, so I am still not sure what you want from me. You can forward it to me and I can look at it, but otherwise i have no answer for you [15:03:01] hexmode: if you like a favicon ;) [15:03:01] they are testing the cert, thats all i know. [15:03:34] mutante: I see a favicon in firefox! but maybe there is something missing. [15:03:38] not trying to be difficult, im just not sure what you want ;] [15:04:18] robh: no, I know. I just thought this might be important. I'll talk to ctwoo and robla about the future of this [15:04:27] future of what? [15:04:31] there is some email in here that should be responded to [15:04:42] please just forward me the email. [15:04:46] and it isn't clear that it has been [15:04:47] k [15:04:49] i dont think it is important though [15:04:57] but will review it [15:05:14] and get it tied to anohter email [15:05:26] sent [15:06:14] RobH: other emails here should go to legal or community people.... but yeah, this one is for you :) [15:09:31] http://support.google.com/a/bin/answer.py?hl=en&answer=167430 [15:10:39] mutante: thanks! my own, personal lmgtfy ;) [15:11:21] :) [15:11:34] mutante: WTF, where's that Nagios message :) [15:11:48] Did you update the puppetmaster and run puppet on spence? [15:11:57] RoanKattouw: http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_job_queue [15:12:05] RoanKattouw: ask where the bot is, i think [15:14:25] Last State Change:01-03-2012 14:48:36 [15:14:26] hrmm [15:14:51] ah Roan, it's still in a SOFT state, that's why [15:14:57] SOFT? 
[15:15:09] one more failed check and it will turn into a HARD state [15:15:11] * RoanKattouw doesn't actually know anything about Nagios [15:15:24] Current attempt: 2/3 (SOFT state) [15:15:46] it tries it 3 times, and it only turns into a "real" (HARD) critical if it fails 3 times in a row [15:16:07] this is a feature to avoid alarms if a service is just flapping or down for 1 minute and then back again [15:16:38] http://nagios.sourceforge.net/docs/3_0/statetypes.html [15:17:24] Ah, OK [15:18:51] Next Scheduled Check: 01-03-2012 15:27:16 [15:19:04] mutante: Does it still run every 10 min? I guess that's why it's taking half an hour to go into HARD state? [15:19:19] (Well, 20 mins best case, 30 worst case, to be fair) [15:20:46] hexmode: ahhh, this is super old [15:20:56] i did this a long time ago, it just seems to email webmaster as well as dnsadmin [15:21:00] you can completely disregard this [15:21:14] :) [15:21:14] but thanks for checking on it =] [15:21:31] RoanKattouw: normal_check_interval => 15, [15:21:32] 540 retry_check_interval => 15, [15:21:48] omg [15:21:50] RoanKattouw: yeah, thats right, so it takes 45 minutes at worst [15:22:01] sorry if i was terse, doing expense reports =P [15:22:04] RoanKattouw: is that too long..? 45 minutes after the first queue is over 10.000 [15:22:12] RoanKattouw: we survived 100k as well ? :p [15:22:41] Well, it seems a bit ineffective to me that way is all [15:22:49] The check is no longer ridiculously slow [15:23:01] suggest a value [15:23:36] made the puppet monitor_service configurable for that, so no problem [15:24:35] nagios.pp lines 539,540 [15:24:56] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:24:56] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:34:37] New patchset: Dzahn; "job_queue: tweak retry check interval" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:35:37] New review: Dzahn; "keep the regular interval at 15 minutes, but if it fails once (SOFT), keep re-checking every 5 minut..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1767 [15:35:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:36:06] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:36:07] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:36:17] RoanKattouw: there you go [15:36:28] yay [15:36:29] hexmode: you've got mail :) [15:36:56] RoanKattouw: retry_check_interval is just that, "if it failed once already, then keep re-checking more often" [15:37:28] so the HARD state should be reached in 15 + 5 + 5 now [15:37:35] Thehelpfulone: you're like my own little biff ;) [15:37:51] OK, good [15:37:55] :D [15:45:45] hexmode: are there known issues with users voting on bugzilla? 
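A minimal sketch of the SOFT/HARD behaviour described above, using the intervals from the merged change (15-minute normal interval, 5-minute retries, three attempts); this only illustrates the scheduling logic, it is not how Nagios itself is implemented.

    # a persistent failure goes HARD (and notifies) after 15 + 5 + 5 minutes at
    # worst; a check that fails once and then recovers never leaves SOFT state
    run_check() { check_job_queue; }   # placeholder; substitute the real plugin
    normal_interval=900 retry_interval=300 max_attempts=3 failures=0
    while true; do
        if run_check; then
            failures=0
            sleep "$normal_interval"
        else
            failures=$((failures + 1))
            if [ "$failures" -ge "$max_attempts" ]; then
                echo "HARD CRITICAL (notify, then stay HARD until recovery)"
                sleep "$normal_interval"
            else
                echo "SOFT CRITICAL ($failures/$max_attempts), rechecking sooner"
                sleep "$retry_interval"
            fi
        fi
    done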
[15:46:04] "Can't locate data/template/extensions/Voting/template/en/default/pages/voting/user.html.tmpl" [15:47:44] this is just stuff i happen to notice in Apache log, didnt try in a browser yet [15:49:06] aha: http://www.bugzilla.org/releases/4.0/release-notes.html#v40_feat_vot_ext [15:49:20] "Voting" is an extension since 4.0 (but wasnt before) [15:50:42] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [15:50:56] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [15:53:21] mutante: I like the voting and would like to get it put in... BUT I want to change the wording [15:53:26] s/vote/follow/ [15:54:52] New review: Dzahn; "typo: ensure => lastest; != latest" [operations/puppet] (production); V: -1 C: -1; - https://gerrit.wikimedia.org/r/1768 [15:55:34] !log Created wikilove tables on siwiki [15:55:35] Logged the message, Master [15:56:53] hexmode: ok, you may want to keep that in a bug somewhere then (that it currently does not find that template, and now needs to be disabled as an extension) [15:56:54] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33485 - Enable WikiLove in si.wikipedia' [15:56:55] Logged the message, Master [15:57:06] hexmode: eh. s/disabled/enabled :) [15:59:25] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [15:59:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [16:19:29] apergos, is it past 10 for you? :) [16:23:02] no [16:23:22] give me a few minutes, I'm in the middle of something on another system [16:25:48] sure, no prob [16:28:42] ok well i's going to take a lot more than a few minutes but I don't have to watch it, it seems [16:28:45] so... go ahead/ [16:28:46] ? [16:28:57] bug 32404 [16:30:03] it seems to happen when pages are invalidated [16:30:26] in the same invalidation process, some pages are ok, while others are not [16:30:53] mutante: Is bz in git and puppet yet? [16:31:12] i noticed there were some recent issues about pages being rendered by old copies of MW [16:31:20] i wonder if this is not the case too [16:32:58] apergos? [16:33:12] I'm here [16:33:16] did you read? [16:33:19] I'm just looking at the current state of things [16:33:20] uh huh [16:33:21] ok [16:33:40] if you could check what server recreated some pages, maybe we can have a clue of what's going on [16:34:22] what server? we're not going to have that information [16:34:29] sad but true... [16:34:55] i recall RoanKattouw deploying some patch to do that..., no? [16:35:34] To do what now? [16:35:53] Oh, that [16:35:58] Tim investigated that bug [16:35:59] so we're back with the problem of bad entries in pagelinks [16:36:00] to include some debug info about what server created the cached page [16:36:07] what did he find, anything useful? 
[16:36:30] i have an example of 30 december if it's needed [16:38:24] http://pt.wiktionary.org/w/api.php?action=query&prop=info&format=xml&titles=kukka [16:38:29] http://pt.wiktionary.org/w/api.php?action=query&prop=links&format=xml&pllimit=10&titles=kukka [16:38:42] cached 4 days ago [16:41:08] Images in ns 0 [16:42:02] yeah they are, that's what's currently in the db all right [16:42:35] this is after Tim's investigation for bug 31576 [16:45:43] http://pt.wiktionary.org/w/index.php?title=Predefini%C3%A7%C3%A3o:-fo-&action=edit [16:46:02] it's all these templates that have the old namespace name in them [16:46:15] what a PITA [16:47:14] yes, that's what eventually causes problematic pages [16:47:31] but even in pages that are not transcluded, it happens [16:47:55] there are ns 0 entries for User_talks [16:48:53] The link table entries could be much older though [16:48:59] Because not all reparses fix them [16:49:42] news ones are being created each time [16:50:21] i noticed that after a user changed an image in a template, and suddenly there were hundreds of wrong links to that image in the next db dump [16:50:32] Aha, so it's still the job runners [16:50:47] that "kukka" page above was touched 30th december [16:50:56] and it has wrong pagelinks [16:51:25] The fact that it was touched then doesn't necessarily mean anything [16:51:47] you mean it could be wrong before too? [16:51:56] Yeah [16:52:04] It's probably still the job runners doing this [16:52:07] if needed, i can check the previous dump [16:52:16] while take some time though [16:52:52] malafaya: You should bring this problem to Tim's attention, though. We now have the job runners throw an exception if they encounter the magic word bug, but I don't think anything is guarding for the namespace bug [16:54:54] ok, meanwhile i just added him to the CC list [16:57:10] great [17:02:01] hello, [17:04:47] mutante: BTW, the job queue thing doesn't seem to be related to the new year: http://ganglia3.wikimedia.org/graph.php?r=week&z=xlarge&c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=93084&m=enwiki%20JobQueue%20length&jr=&js= [17:10:17] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:10:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:16:22] RoanKattouw, mind checking the job runners again? enwiki looks to be plateauing [17:16:39] Might just be it's hit big jobs [17:16:45] Looking [17:18:25] Well, there's a quirk in the job running system that's at least partially responsible [17:18:36] We fork off 5 runJobs.php threads, and tell them to stop after 300s [17:18:53] So after each job is complete, the thread checks how long it's been running for and shuts down if that's >300s [17:19:14] But if one of the threads decides to take on a job that takes 10 minutes ... [17:19:28] all the other ones die as they should but that one just keeps running [17:19:45] And jobs-loop doesn't move on to the next wiki because not all runners have terminated yet [17:21:41] hmm, mw5 seems to be stuck [17:21:57] But that's the only one [17:22:13] It's probably the crappy timeout phenomenon I described above that's causing this [17:22:28] Oh, no, mw5 is fine after all [17:22:38] 03 02:11:51 < jeremyb> domas: you were responding about status.wm.o, right? 
petan was asking about stats.wm.o [17:22:47] Its job runners were all at 0% CPU when I looked at it, but now their up again [17:23:16] can you check if there are decomissioned ones (the ones that could create the pagelinks problem)? [17:23:51] I'm not sure offhand how to get a list of decommissioned servers, but I'll try [17:24:13] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:25:22] jeremyb: yeh [17:25:31] RoanKattouw: "The logs show it was srv159 running jobs with an old copy of MediaWiki. It still had a job runner on it despite it being marked decommissioned. Roan took it out of the mediawiki-installation group a couple of weeks ago." [17:25:33] jeremyb: I guess if people complain about that, they can use http version [17:25:40] jeremyb: it doesn't have offending cert [17:25:47] Yeah [17:25:47] something like that ^^ [17:25:57] 03 02:35:25 < jeremyb> makes sense that it's CN is nagios because it's the same IP as nagios. but maybe spence is ok to have star? [17:26:05] domas: ^ :) [17:26:19] jeremyb: it isn't nagios [17:26:23] it is watchmouse.com [17:26:26] and it is hosted on AWS [17:26:30] domas: no... [17:26:36] no what? [17:26:43] domas: status!=stats [17:26:53] ah, another one [17:27:00] Well shuit [17:27:05] malafaya: You're right [17:27:16] again? [17:27:31] jeremyb: eh, who cares, it is private host :) [17:27:40] domas: petan :) [17:27:48] he shouldn't! [17:27:48] RoanKattouw: in what exactly? [17:28:12] dunno [17:28:16] There are a whole bunch of old job runners running jobs [17:28:16] what is wildcard cert policy [17:28:24] roankattouw: they need time bomb [17:28:28] :) [17:28:33] idk either but it seems fairly widespread [17:28:38] but thats same as old web servers [17:29:03] jeremyb: frankly, I'd use internal CA for anything internal-facing [17:29:31] otoh, I don't have fundraising team that creates blinking banners [17:29:37] heehehehehe [17:29:48] !log Stopping job runners on the following DECOMMISSIONED servers: srv151 srv152 srv153 srv158 srv160 srv164 srv165 srv166 srv167 srv168 srv170 srv176 srv177 srv178 srv181 srv184 srv185 [17:29:49] Logged the message, Mr. Obvious [17:30:14] Oops [17:30:24] heh, yea [17:30:53] OK that didn't work [17:30:55] they didnt look like they were actively processing jobs though..just sitting in the stuck state [17:31:04] srv158: start-stop-daemon: warning: failed to kill 1622: No such process [17:31:13] I'm killing the processes then [17:31:14] domas, that's where they are going wrong [17:31:18] You need blinking banners [17:31:19] srsly [17:31:27] RoanKattouw, if i'm right, i was just lucky :) i only had a very small feeling it could be related [17:31:34] Hmm, you're right [17:31:35] reedy: we need blinking banners so that we can support our 10% loaded cluster!!!!11 [17:31:36] domas: the thing is stats.wm.o *is* external [17:31:37] They're headless [17:31:54] eh [17:32:22] mutante: I gotta run, could you finish this? A list of servers with bad processes is in /home/catrope/badjobrunners [17:32:34] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts typofix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:32:39] RoanKattouw: ok [17:33:11] is it just `killall php`? 
:) [17:33:25] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1770 [17:33:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:33:40] as long as it's not on solaris [17:39:20] !log killing more runJobs.php / nextJobDB.php processes on a bunch of servers (/home/catrope/badjobrunners) [17:39:21] Logged the message, Master [17:44:14] jeremyb: yes, just killall php was enough [17:49:17] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:49:18] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:57:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [17:57:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:17:48] New patchset: Jgreen; "fundraising mail config for aluminium/grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:25:54] New patchset: Dzahn; "give sudo access to khorn on grosley/aluminium per RT 2196" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:26:27] New patchset: Jgreen; "fundraising mail config for aluminium/grosley (typofix)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:28:12] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1771 [18:28:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:32:07] New review: Dzahn; "approved by woosters" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1772 [18:32:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:51:18] why don't we have more recent photos? :( https://commons.wikimedia.org/wiki/Category:Wikimedia_servers [18:59:03] RobH, ^ [18:59:13] Did you ever get around to photoing eqiad? [19:00:19] https://plus.google.com/photos/114688368536436281597/albums/5596244252518518929 [19:00:23] just 9 of the new center [19:00:43] Do we have a g+ import bot? :D [19:00:46] the others didnt come out well, i need to redo. [19:00:57] Nemo_bis, ^ [19:01:05] oh wiat [19:01:06] damn it [19:01:12] those are super old, where are my eqiad shots... [19:01:17] Reedy: sorry about that [19:01:54] :o dataset1 [19:02:09] where is license? Google+ hides it :( [19:02:50] I'm fairly sure if it doesn't say it there, RobH will have cc-by-sa or similar [19:03:03] Nemo_bis: if its of the servers, and its on my plus account, they are open and you can put on commons. 
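For the record, the cleanup logged above boils down to walking the list of bad hosts and killing any leftover job-runner processes on each; per the follow-up, plain killall php was enough. A sketch of that loop, assuming passwordless SSH and the host list from the !log entry (an ops tool like dsh would do the same job):

    # show and kill stray runJobs.php / nextJobDB.php processes on hosts that
    # should no longer run jobs at all (list from /home/catrope/badjobrunners)
    while read -r host; do
        echo "== $host =="
        ssh -n "$host" 'pgrep -fl "runJobs|nextJobDB"; sudo killall php'
    done < /home/catrope/badjobrunners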
[19:03:16] i dunno where the hell google would let me put that [19:03:26] but since I took them, i am saying they are open content for commons =] [19:03:33] it's not hard on Picasa, not sure when you switch to g+ [19:03:43] yea, i dislike that they are joined now as well [19:03:45] =P [19:03:54] (though it did do away with space quotas) [19:04:07] I am exporting my new ashburn shots now, I will go ahead and throw them on commons [19:04:26] RobH: i think you need to specific a specific license [19:04:40] RobH: remember this is *commons* :) [19:05:10] yay [19:05:56] the shots i linked on g+ are already on commons [19:06:07] i neglected to upload the new shots is all (though I used them in my wikimania slides ;) [19:06:12] oh, well that solves that [19:06:35] yea, i just checked commons for them, some are there, some arent... rugh, i will take care of it though =] [19:07:11] hey, i am going to use the upload wizard for the first time ever (usually use commonist) [19:07:31] were the pics all reviewed by eqiad? [19:07:47] Yea, when I take them, I have to put in a work order and be escorted with my camera [19:08:01] i took them, then he reviewed them on the LCD on the camera to ensure it was just our stuff [19:08:07] oh, they just watch you take em. ok [19:08:15] technically their policy is they take them [19:08:21] i thought you send them the pics [19:08:23] but i know my camera better than them, so they are cool about it [19:08:31] =] [19:09:13] upload wizard go! (this is pretty nice actually) [19:09:15] OH NO, YOU'VE PHOTOED THE BACK OF ANOTHER COMPANYS SERVER. IT LOOKS EXACTLY THE SAME AS YOURS, WE CAN'T BE HAVING THAT [19:10:24] hehe, indeed [19:10:37] so its 39 shots, uploadin now =] will link when its all done [19:10:47] Nemo_bis: ^ [19:10:53] I can see their point, just seems a *little* extreme [19:11:09] Well, Eqiad has government acronym customers [19:11:15] yeah, i wasn't really sure if they cared or if it was just for the webcams [19:11:24] in their other DC one of the cages is also blacked out with fabic on the walls of it [19:11:36] Sweet [19:11:40] the webcams they technically watched me install, and reviewed the line of sight [19:11:48] and they arent allowed to pan, fixed view only [19:11:50] So you can't see all the archaic hardware they are still using [19:12:00] that sucks [19:12:08] the cia cyptography runs on vacuum tubes! [19:12:18] well, i said their name, now they are listening to this channel. [19:12:42] i've mentioned the cia a few times in #-glam [19:12:45] Reedy: mostly because the cages that arent ours arent nearly as pretty [19:12:55] heh [19:13:00] their datacenter techs obviously do not have the same degree of ocd for cabling that I do ;] [19:13:28] the upload wizard ETA seems a bit off [19:13:32] Yeah [19:13:38] Known issue [19:13:45] good enough =] [19:13:46] needs chunked upload [19:14:07] much of the https://www.google.com/search?q=scanning+league+iasl content is from the CIA [19:14:35] some is propaganda, some i guess is supposed to be internal (but e.g. about vietnam or even earlier) [19:33:38] Aloha :-) [19:39:26] !log reedy synchronized php-1.18/skins/common/images/ 'r107930' [19:39:27] Logged the message, Master [19:48:01] wow, uplaod wizard sucks for adding categories [19:48:08] why is there no 'apply this category to all uploads' [19:48:28] at least it's not adding the wrong cats [19:48:28] RobH: Complain to neilk_ , he wrote it ;) [19:48:45] hell, i want to cancel [19:48:58] but i dont wanna have to go clean images manually... 
are they in temp storage until i finish wizard [19:48:59] there might be an ETF [19:49:04] or will it not clean up after itself? [19:49:14] RobH: you can abandon at any time until the last step. They are in temp storage [19:49:27] RobH: I agree the process with categories and metadata kind of sucks. [19:49:30] good, i dont feel like clicking four times for all 40 images to add a category [19:49:46] mechanical turk [19:49:47] otherwise its slick, but not adding those bulk is painful [19:49:57] and its a hell of a lot better than what we had before [19:49:58] RobH: yeah, it's really the last feature on the agenda [19:50:24] RobH: but frankly, it's still kind of fundamentally dumb... should not have been designed this way [19:54:18] someone should write me a lightroom to commons plugin ;] [20:08:28] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [20:08:43] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [20:22:48] commist errored, uploaded all my shit and then messed up [20:22:49] sigh. [20:25:37] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:25:52] grr bad-prefix on every single commonist upload i just did. [20:27:34] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:28:00] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1773 [20:28:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:37:31] New patchset: Hashar; "integration: make homepage URLs relative" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1774 [20:43:06] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:43:07] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:49:36] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:49:37] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:16] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:17] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:55:31] API :( [20:55:41] Yup, the API is down [20:57:53] So what hapened? [21:06:08] hmm, no idea [21:06:28] Reedy: any updates on the API breakage? [21:06:41] its being actively worked by ops [21:06:44] Ops are looking into it [21:07:53] okay thanks [21:26:53] Caching for non-logged in users seems to have slipped on en.wp -- is that handled by the job runners or is it a separate issue? [21:27:30] Jarry1250: You mean they're not served by Squid, but by the apaches? [21:27:59] What do you mean by "caching has slipped"? [21:28:30] If you go to certain wiki pages when logged out, they appear several hours out of date. [21:29:12] Are there edits to the actual page that are missing, or just edits to templates or transcluded pages? 
[21:29:23] Because the job queue did get backed up over the past week [21:29:35] Basically, the job runners got completely stuck around Christmas, and we unstuck them todfay [21:29:39] http://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Aus_dem_M%C3%A4rchenschatz_der_Kaschubei.djvu/page38-465px-Aus_dem_M%C3%A4rchenschatz_der_Kaschubei.djvu.jpg is outtiming here [21:30:00] At the time there were ~100k jobs in the enwiki queue and ~11k in the commonswiki queue [21:30:13] * RobH adds to commonswiki with image uploads [21:30:22] must slow down stie more. [21:30:39] I think commons has probably caught up now [21:30:41] enwiki is down to ~30k [21:30:57] Roan: I thought it might be related but actually we're talking straight edits here not edits to transcluded pages. [21:31:00] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [21:31:25] hmm, then it should just work [21:31:48] OTOH lvs4 was borked during today's outage and I have no ideas what else was behind that [21:32:03] I know it killed the API, and I hear the PDF renderers and the image scalers were also affected [21:33:12] It could be it, I can't remember when I saw it before. The worrying thing is, it's not just missed one edit but still hasn't caught up even after 5 or 6. [21:33:30] Strange [21:33:35] When was the last edit madE? [21:33:57] About 10 minutes ago. [21:34:30] https://en.wikipedia.org/w/index.php?title=Wikipedia:Administrators'_noticeboard&action=history [21:34:43] It's stuck before my edit of 15:19 todaay [21:35:02] Not sure how much before, but it certainly hasn't got that one. [21:35:39] Well shit [21:35:41] [21:35:59] That's from wget "https://en.wikipedia.org/wiki/Wikipedia:Administrators'_noticeboard" [21:36:10] Let me repeat that, that timestamp is December 6th, 2011 [21:36:17] It's from a different YEAR [21:36:42] That's probably the alternate names with URLencodings issue [21:37:36] The URL with %27 gives me [21:37:49] That's 18 minutes ago [21:40:14] Come to think of it, I was actually on WP:AN, which I suppose is actually behind because of the jobrunners. [21:40:46] That uses transclusions, right? [21:40:53] Yeah then it's very likely to be behind [21:41:02] The job backlog is being eaten into fast, but it's big [21:41:40] No, it's a straight redirect I think, but I thought they were delayed updated as well? [21:41:56] It went from 100k to 30k in 7 hours, so that's about 10k/hr, measurements from the past hour are missing, so it should be another 2 hours before it's caught up [21:42:05] not supposed to be [21:47:55] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [22:03:25] neilk_: sent me bad juju for insulting upload wizard, causing me to typo the category name and thus have to manually correct all 40 images once uploaded ;] [22:03:49] all cuz i pointed out the category selection in upload wizard, damn you neilk_ ! [22:03:50] :o [22:03:51] ;] [22:04:00] * Nemo_bis praises RobH's courage in testing UW [22:04:04] Nemo_bis: so commons is slow on thumbnailing at the mooment, but they are now online [22:04:24] the wikimedia servers category now has eqiadwmf###.jpg [22:04:53] also, every server image i have uploaded is also in http://commons.wikimedia.org/wiki/User:RobH/gallery (and they are of course in the category listing) [22:05:47] thank you very much RobH ! 
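The stale-copy check Roan ran above can be repeated from a shell by fetching both spellings of the URL and pulling out the parser-cache comment near the end of the HTML, which carries the timestamp he quotes; the grep pattern below is an approximation of that comment.

    # compare what is served for the apostrophe and %27 spellings of the page;
    # one of these showed a parser cache timestamp from 2011-12-06
    for u in "https://en.wikipedia.org/wiki/Wikipedia:Administrators'_noticeboard" \
             "https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard"; do
        echo "== $u =="
        curl -s "$u" | grep -o 'Saved in parser cache.*'
    done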
[22:05:47] you can of course click to the full res of each, but its kinda a pain [22:05:58] thanks for asking, i had them done and honestly just forgot to upload them [22:06:05] just needed someone to remind me [22:06:40] if i ever find time to make vlc work properly as a video proxy, the cameras in the datacenters will also be public. [22:06:53] unfortunately, I can make vlc take in the info, but not output it properly so far =P [22:07:04] and all the camera's are internal IPs, so we need to gateway them [22:07:36] there are cameras in the datacenter? [22:07:57] yeah, they're currently all shut down though [22:08:11] well, the sdtpa ones are offline [22:08:17] the eqiad are online, but internal vlan only [22:08:23] so that means only ops folks can see them, sorry ;] [22:08:33] (but we dont want to keep it that way) [22:08:36] We used to have one aimed at the coffee machine so you could see if it was working ;-) [22:08:52] sounds right, have to monitor mission critical hardware =] [22:09:12] our SPOF [22:09:29] Altough we had a cold spare back than [22:09:35] RobH, what about the g+ album you linked before? [22:09:42] I didn't see them on Commons [22:09:51] and that wasn't eqiad, right? [22:10:29] those are sdtpa, i thought they were on commons =/ hrrmm [22:10:48] lemme see if I can pull the originals and upload them via commonist quickly (more quikcly then pulling form one to the other [22:10:57] knams is a short gallery :-) [22:11:15] don't break the API this time [22:11:44] no promises [22:12:27] !log reedy synchronizing Wikimedia installation... : Push r107938 r107948 [22:12:28] Logged the message, Master [22:13:00] your last image uploads before this are from 2009 [22:13:13] yep, heh [22:14:38] sync done. [22:16:04] found em, exporting and pushing to commons =] [22:16:27] i actually have 1`60 photos from that visit. [22:16:36] of that, i uploaded 9 cuz i liked them, the rest are just normal. [22:16:47] * RobH takes 1k photos, ends up liking 10 [22:17:26] I get more reports that thumbnail-generating on commons fails [22:17:42] its not making them for me either [22:17:52] apergos: ? (due to backup of queue ebing fixed?) [22:18:30] I don't know why thumbs would fail now [22:18:42] it's not a job queue sort of thing, they are done on demand [22:18:47] well, its not making any for the stuff i just uploaded [22:20:27] when yo ugo view one of them what does it do? [22:20:48] http://commons.wikimedia.org/wiki/File:Eqiadwmf_9043.jpg example [22:20:51] apergos: here it times out, and nginx reports a "502 bad gateway" [22:20:56] it just shows the blank placeholder and keeps 'loading' [22:21:06] waiting for upload.w.o [22:21:12] I guess you want to check the scalers and ms5 [22:21:33] oh if only ganglia wasnt slow. [22:21:36] =P [22:21:53] :-P [22:23:45] apergos: thats you runnin tcmpdump on ms5? [22:23:49] no [22:23:51] folks are on it [22:23:58] I am on it [22:24:01] I will get back off too [22:24:08] leave the jobs running, they are for ganglia [22:24:10] two connections are 24 nad 25 days old ;] [22:24:13] yep [22:24:19] those are ben's I think [22:24:29] running in a screen session, they can keep doing that [22:24:47] so what exactly should I check on here? learn me something =] [22:25:06] I alreadya looked at the syslog and saw nothing, I checked logs on one of the random scalers and saw nothing [22:25:13] how is lvs for those? 
[22:25:44] I was hoping for somthing like either a high load or some nfs timeouts on a scaler [22:26:36] lvs4 is active, should be doing that [22:26:42] i look on it, it isnt sending traffic to them [22:26:47] sounds familiar =P [22:26:51] does it say they are depooled? [22:26:57] yeah it does. [22:27:00] checking [22:27:47] every single rendering server is pooled [22:27:50] in lvs [22:27:57] they just arent being sent traffic [22:28:05] ok [22:28:07] atleast ipvsadm on lvs4 seems to show that [22:28:09] 0 connections [22:28:12] right [22:29:16] apergos: yea... same issue [22:29:22] scalers need 10.2.1.21 [22:29:31] and its not in the ip loopback on lvs4. [22:29:31] ok [22:29:45] i am going to forfe a puppet run to see if it fixes, but i doubt it will [22:29:50] since its run already [22:30:06] go ahead [22:31:22] nope, didnt fix, manually adding [22:31:56] ok [22:32:34] !log added in the lo addres to lvs4, now its working and generating thumbnails [22:32:35] Logged the message, RobH [22:32:38] fixed! [22:32:52] DaBPunkt: working now [22:33:11] slow, but yes [22:33:27] well, i just made it generate about 60 thumbs just a xsecond ago [22:33:28] heh [22:33:58] aww, look at the old cage, its twice that size now [22:34:00] http://commons.wikimedia.org/wiki/File:Sdtpa_wmf-8.jpg ;] [22:34:11] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [22:34:30] RobH: What category of hosts is cp10NN again? [22:39:58] RobH: seems to be very dangerous data, if they need such a big cage ;) [22:40:03] poweredge r610 [22:41:06] more specifically, r610, 32gb memory, single cpu intel E5640 4 core [22:41:11] and 4 SSD drives [22:41:35] DaBPunkt: it only looks big from the outside, its not roomy in the cage =] [22:41:56] well, i take that back, the new eqiad cage is huge, but i didnt link those photos (they are on commons now though) [22:42:15] http://commons.wikimedia.org/wiki/Category:Wikimedia_servers, all the eqiad have eqiad in file name [22:42:41] but thats cuz the cage is going to be expanded by another two rows, and we planned for that on initial buildout [22:42:44] mm, there are many fantasy books where things are bigger on the inside, than on the outside -. that's the first time I hear about something the way arround ;) [23:01:20] !log reedy synchronizing Wikimedia installation... : Pushing r107953, r107955, r107956, r107957 [23:01:21] Logged the message, Master [23:03:18] sync done. [23:03:34] !log on spence: restarting gmetad [23:03:35] Logged the message, Master [23:22:58] gn8 folks [23:23:25] Night DaBPunkt! [23:34:23] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:34:24] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
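The manual fix applied to lvs4 above, roughly: check whether the scalers' service IP is bound on the loopback interface and whether LVS is actually forwarding traffic for it, then add the address back by hand. The IP is the one named in the log; the /32 prefix and the exact ip/ipvsadm invocations are assumptions (puppet normally manages this, which is why a puppet run was tried first).

    # on lvs4: is the rendering service IP on lo, and is LVS forwarding for it?
    ip addr show dev lo | grep 10.2.1.21 || echo "service IP missing from lo"
    ipvsadm -L -n | grep -A 5 '10.2.1.21'   # realservers and connection counts
    # the manual fix from the !log entry, if the address is missing:
    ip addr add 10.2.1.21/32 dev lo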