[00:46:06] New patchset: Catrope; "Fix the target of the node_modules symlink" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49188 [00:46:16] Ryan_Lane: Quick typofix ---^^ [00:46:38] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49188 [00:46:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49188 [00:50:19] RoanKattouw: ok. I've sync'd the module out [00:50:23] RoanKattouw: it should work now [00:50:38] Thanks [00:50:42] yw [00:50:50] I synced the wrong module anyways, which brought down the service, so I've already fixed it with dsh [00:50:58] heh [00:51:05] (It's associated with Parsoid, not config as I thought) [00:51:09] * Ryan_Lane nods [00:53:01] I'll sabotage a Tampa machine and test it [00:53:07] cool [00:56:21] drdee: hey [00:59:52] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:04:44] New review: Faidon; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49074 [01:06:57] New patchset: Ori.livneh; "Explicitly add Schema NS to test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49191 [01:07:48] New review: Ori.livneh; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49191 [01:07:55] New review: Ori.livneh; "Patch Set 1: Verified+2" [operations/mediawiki-config] (master); V: 2 - https://gerrit.wikimedia.org/r/49191 [01:07:56] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49191 [01:09:58] !log olivneh synchronized wmf-config/CommonSettings.php 'Enabling Schema NS on test2wiki' [01:09:59] Logged the message, Master [01:12:12] New review: MZMcBride; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49191 [01:19:24] !log deploying squid changes for s/knams/esams/ for 
knsq* squids (see git log for rationale) [01:19:28] Logged the message, Master [01:21:01] RECOVERY - Backend Squid HTTP on knsq24 is OK: HTTP OK HTTP/1.0 200 OK - 1420 bytes in 0.239 seconds [01:21:33] up after 159d [01:21:58] hm [01:22:01] and others are down dammit [01:30:55] RECOVERY - MySQL disk space on neon is OK: DISK OK [01:31:01] LeslieCarr: what's the deal with the neon nagios alerts? seems to have been happening for 18+ hours [01:31:06] ^^ [01:31:18] couldn't have timed that better [01:36:44] New patchset: Catrope; "Puppetize the /var/lib/parsoid/Parsoid symlink" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49193 [01:37:07] * RoanKattouw looks around for this week's duty monkey [01:37:20] Boo, notpeter isn't on IRC [01:47:39] New patchset: Jeremyb; "redirect wikimediafoundation.info -> wmfwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49194 [01:48:54] New patchset: Jeremyb; "redirect wikimediafoundation.info -> wmfwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49194 [01:53:03] New patchset: Jeremyb; "redirect wikimediafoundation.info -> wmfwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49194 [02:07:25] jeremyb_, if you'd like to do https://rt.wikimedia.org/Ticket/Display.html?id=4240 too? [02:07:46] Thehelpfulone: i'm in the middle of a rebase... [02:07:52] ah, finished [02:07:57] heh ok [02:08:58] New patchset: Jeremyb; "Linkify RT a little more liberally." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49196 [02:09:00] idk if that will actually work... 
[02:09:22] i guess i should also log in to gerrit [02:30:05] !log LocalisationUpdate completed (1.21wmf9) at Fri Feb 15 02:30:05 UTC 2013 [02:30:09] Logged the message, Master [02:35:40] New patchset: Jeremyb; "redirects for wikiartpedia.{biz,co,info,me,mobi,net}" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49197 [02:35:46] Thehelpfulone [02:36:53] i guess it's pretty late there actually :) [02:46:17] New review: Jeremyb; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49197 [03:07:24] New patchset: Jeremyb; "redirect wikimediafoundation.info -> wmfwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49194 [03:09:36] New review: Jeremyb; "Patch Set 4:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49194 [03:54:49] New patchset: Jeremyb; "fix https://bug-attachment.wikimedia.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [04:18:26] New patchset: Jeremyb; "fix https://bug-attachment.wikimedia.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [04:21:29] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 200 seconds [04:30:20] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds [04:31:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds [04:38:40] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [05:28:46] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [05:59:49] RECOVERY - MySQL disk space on neon is OK: DISK OK [06:05:40] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [06:30:43] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [06:42:47] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [06:42:56] 
PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds [06:48:02] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:48:11] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:55:51] New review: Ori.livneh; "Patch Set 2: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49200 [07:26:13] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:26:31] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:30:43] Aaron|home: around? [07:33:10] barely [07:33:28] man those headache pills seemed to work [07:34:27] Echo is throwing exceptions; https://gerrit.wikimedia.org/r/#/c/49088/ would fix it [07:34:56] it was merged, the submodule ref updated, and a pull on fenari, but no sync, it seems [07:36:57] trace: http://dpaste.de/zJEbr/raw/ [07:39:21] I'd be comfortable syncing it -- it's on testwiki at the moment. would there be anyone around? apergos? [07:39:43] I'm here [07:39:49] what are we syncing? [07:40:04] https://gerrit.wikimedia.org/r/#/c/49088/, fixes http://dpaste.de/zJEbr/raw/ [07:40:28] merged and on testwiki already but someone forgot to sync, I think. [07:41:28] how widely is the notification stuff used? [07:41:53] Nemo_bis got it just trying to add a message to a talk page, I think. [07:41:55] I see that code on srv230 and fenari [07:42:10] what's srv230? [07:43:33] some random appserver [07:43:49] why is it in decommissioning.pp? [07:43:54] anyways... [07:44:22] go ahead I guess, like you say it's already merged [07:46:34] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [07:46:42] Clearly we need an auto-scap bot to run every few hours. Keep everyone on their toes. [07:47:43] Susan: like chaos monkey? [07:47:59] Something like that. 
[07:50:28] i think auto-scap isn't the answer [07:51:03] some combination of something like chaos monkey and also git-deploy instead of scap is better [07:51:49] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [07:53:34] apergos / Aaron|home - sorry, not going to after all. it appears to have occurred only once and i can't determine the state of the code on mw1181 to my satisfaction [07:53:52] so going to report it and leave it for now [07:54:03] as long as the next scapper doesn't push it unknowingly [07:54:06] all righty [07:57:07] * apergos autoscaps Susan [07:57:20] :D [07:58:08] apergos: well, s/he would, and a revert could make things worse if the code was half-deployed somehow (confused by aaron seeing it on srv230) [07:58:33] so if you're here, I will sync, if that's ok. [07:58:39] :-D :-D [07:58:43] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [07:58:59] i take that to be "yes" [07:59:13] can you stick around for a bit in case it needs to be undone? 
[07:59:19] if so, then it's a "yes" [07:59:28] yes, of course [07:59:33] then go ahead [07:59:46] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours [08:02:01] !log olivneh synchronized php-1.21wmf9/extensions/Echo 'Syncing Iab077a77aa75a6ff0cf03acecca38403595dda93 to fix exception 640c60c4' [08:02:04] Logged the message, Master [08:02:23] Now there you have a nice, human-readable message [08:02:31] I should have removed the words and kept just the hashes [08:02:32] :) [08:02:51] At least one can grep the IRC logs to have context ^^ [08:17:10] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [08:17:37] RECOVERY - MySQL disk space on neon is OK: DISK OK [08:18:49] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 223 seconds [08:20:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 5 seconds [08:20:37] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 2 seconds [08:38:39] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [08:39:24] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [08:41:12] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 5 seconds [08:42:15] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [08:49:43] bock [09:07:54] PROBLEM - Puppet freshness on mw1003 is CRITICAL: Puppet has not run in the last 10 hours [09:25:57] New review: Hashar; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49133 [09:26:44] Susan: I might help you to edit the foundation wiki :-] [09:27:28] Susan: http://wikimediafoundation.org/wiki/Staff_and_contractors list me wearing an invisible cloak . I have uploaded yesterday a pic of me taken by Victor last year when I was in SF. 
I have uploaded it on commons : https://commons.wikimedia.org/wiki/File:Antoine_Musso-3500.jpg [09:32:02] hashar, done: https://wikimediafoundation.org/w/index.php?title=Template:Staff_and_contractors&diff=87693&oldid=87675 [09:32:48] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [09:32:57] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [09:33:13] MaxSem: thanks! [09:33:29] Susan: zoned by MaxSem :-] [09:43:36] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 2 seconds [09:43:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 214 seconds [09:44:03] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 227 seconds [09:45:15] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [09:46:10] hashar: Do you have an account there? [09:46:16] If not, I'd be happy to make one for you. [09:47:30] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 10 seconds [09:48:21] Susan: I don't have any account on the WMF wiki [09:48:25] Susan: not sure I need any [09:48:32] Fair enough. [09:48:53] there is already tons of helpful people on there ;-] That is the first time I actually have to edit something [09:49:00] the quality of that wiki is just too high [09:49:00] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 2 seconds [09:49:42] Hah, it's a mess. [09:49:52] Everything is quickly outdated due to the fishbowl-ness of the wiki. [09:50:07] We're working on opening it up, but it's a very slow process. 
[09:52:57] New patchset: Jeremyb; "make logo protorel instead hardcoding HTTP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49215 [10:02:19] New patchset: Ori.livneh; "(RT 2490) HTTPS on bug-attachment.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [10:04:09] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [10:04:13] New review: Ori.livneh; "Patch Set 3: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49200 [10:15:28] New review: Ori.livneh; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49191 [10:29:26] hashar, Victor usually likes to add photos he's taken to his user page too: https://commons.wikimedia.org/wiki/User:Vgrigas [10:34:42] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [10:36:06] New review: Hashar; "Patch Set 6:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [10:36:21] RECOVERY - MySQL disk space on neon is OK: DISK OK [10:36:21] I hate you puppet [10:39:50] New review: Hashar; "Patch Set 6:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [10:55:29] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [10:58:09] hashar: speaking of which, I noticed the nice gallery but I'm prety sure one or two of the team members without a photo do have a good one on Commons [11:02:33] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [11:03:06] New review: Hashar; "Patch Set 8:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [11:03:47] New review: Silke Meyer; "Patch Set 3: Code-Review-1" [operations/puppet] (production) C: -1; - 
https://gerrit.wikimedia.org/r/48979 [11:10:37] Nemo_bis: I have updated the platform engineering team section https://www.mediawiki.org/wiki/Wikimedia_Platform_Engineering#Team [11:10:49] Nemo_bis: feel free to add pics for people missing. [11:13:17] New patchset: Jalexander; "planet config updates . add 2 update 2 de,en,it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49220 [11:16:07] out for lunch [11:16:25] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [11:21:59] New review: Silke Meyer; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48979 [11:25:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.860 seconds [11:48:28] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [11:55:38] is apache-graceful-all in puppet somewhere? looks like not.. [11:55:52] we need to fix it to also work on eqiad Apaches [11:58:05] more precise, it is /usr/bin/apache-sanity-check which is used by graceful-all [12:00:12] ah.. 
"it's in the wikimedia-task-appserver Debian package" [12:07:10] New patchset: Dzahn; "apache-sanity-check needs to use new regex to also work on eqiad Apaches see details in RT-4449" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [12:15:30] New patchset: Silke Meyer; "Inline documentation for Wikidata's puppet files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49232 [12:16:49] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [12:18:29] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [12:19:58] RECOVERY - MySQL disk space on neon is OK: DISK OK [12:27:19] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [12:28:32] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 215 seconds [12:58:49] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [12:59:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [13:04:02] re [13:08:40] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [13:09:21] New review: Hashar; "Patch Set 10:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [13:20:28] New review: Hashar; "Patch Set 10: Verified+1" [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/47664 [13:20:36] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [13:36:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [13:37:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 233 seconds [13:46:47] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 
[13:46:47] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [13:50:07] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [13:50:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [13:51:41] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:02:42] mutante: wanna review yet another beta hack change ? :-D [14:03:13] New review: Dzahn; "Patch Set 4: Code-Review+2" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49194 [14:04:57] hashar: i can take a look in a little while.. when i [14:05:04] ok :-] [14:05:06] m done with another thing [14:05:18] btw, apache-config repo does not have Jenkins, right [14:05:43] i mean, need to manually hit Verified.. [14:06:14] New review: Dzahn; "Patch Set 4: Verified+2" [operations/apache-config] (master); V: 2 - https://gerrit.wikimedia.org/r/49194 [14:06:15] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49194 [14:08:57] mw1085: rsync: mkdir "/usr/local/apache/conf" failed: No such file or directory (2) [14:09:16] {{somakeit}} [14:09:22] xD [14:09:26] mutante: yeah there are not linting test yet for our apache-config [14:11:30] sync-apache does not sync properly [14:15:44] <^demon> !log [gerrit] updated replication plugin to 20c6fc7e (1.1-SNAPSHOT) which is back in line with upstream [14:15:48] Logged the message, Master [14:23:43] ah, srv233/234 are in decom and not in Nagios, but still running ..that explains [14:24:50] dzahn is doing a graceful restart of all apaches [14:25:29] !log dzahn gracefulled all apaches [14:25:30] Logged the message, Master [14:28:41] !log gracefulling Apaches in eqiad using dsh [14:28:42] Logged the message, Master [14:31:34] !log wikimediafoundation.info activated to redirect .org 
[14:31:36] Logged the message, Master [14:32:44] how many of these alias domains do we have?:) [14:32:58] too many ?:) [14:33:07] waiting for .museum for GLAM :) [14:33:31] hehe, not for real, but it would match [14:35:21] mutante, know anything about Nagios plugins? [14:35:57] yep, which one? [14:36:14] there are several different packages, plugins-basic, -standard, -extra, -bla [14:36:38] wikimediafoundation.info works for you too, right? [14:36:55] I was wondering if you can review https://gerrit.wikimedia.org/r/47111 [14:37:14] "This domain points to a Wikimedia Foundation server, but is not configured on this server." [14:38:03] hmm sure its not browser cache? [14:38:08] it did that before the change [14:38:21] I didn't visit it before [14:38:27] hrmm [14:38:47] hrm, force-reload did it:) [14:38:52] weird [14:38:53] looks like that is more about Solr [14:38:58] ah, cool [14:40:04] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [14:40:06] i am probably not a good reviewer for Python, but the return values look good, 0/1/2 for OK/WARN/CRIT [14:40:18] but the Solr part.. ehmm.. [14:41:16] since no op knows much about Solr, you all will have to believe me:P [14:42:16] i think in this case i can merge it to just try if it works, it doesnt break anything, worst case is a broken check we don't have yet anyways [14:42:59] I'm not sure how to actually register it with Nagios though [14:43:03] it won't show up in Nagios [14:43:11] by just adding this one file, yea [14:43:53] i have done that part before.. that shouldn't be a big issue as long as the check works [14:44:37] New review: Dzahn; "Patch Set 4: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47111 [14:44:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47111 [14:46:00] MaxSem: have a host:port this should work on? 
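Editor's aside on the return values mentioned above (0/1/2 for OK/WARN/CRIT): what follows is a minimal sketch of the Nagios plugin exit-code convention, not the check_solr script under review. The function names, the example metric, and the thresholds are hypothetical; only the numeric codes follow the standard plugin convention, where 3 means UNKNOWN.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value onto a Nagios status code.

    Hypothetical rule: higher is worse, and crit takes priority over warn.
    """
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, status_line) in the usual 'LABEL - detail' shape."""
    code = classify(value, warn, crit)
    return code, "%s %s - value=%s" % (name, LABELS[code], value)


if __name__ == "__main__":
    # Hypothetical metric and thresholds, for illustration only.
    code, line = report("EXAMPLE", 450, warn=400, crit=600)
    print(line)
    sys.exit(code)
```

The monitoring daemon reads only the exit status and the first line of output, so a check that crashes with a traceback will not report a clean status; catching unexpected errors and exiting with 3 is one common way to handle that.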
[14:46:26] eh, i see 8983 as port [14:48:05] check_solr -r -a 400:600 -t 5 [14:48:24] it uses the right port bydefault [14:49:32] ValueError: need more than 0 values to unpack [14:50:15] line 190 [14:50:49] on which host? [14:51:30] search1001 [14:52:09] ugh, I don't have access to it [14:52:31] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [14:52:49] trying on fenari [14:53:08] ah, solr1001 ! [14:53:13] not search1001 [14:54:34] unable to open database file [14:55:03] where did you expect the sqlite db? [14:55:18] let me commit a manifest for it [14:55:29] db_conn = sqlite3.connect('/var/lib/check_solr/checks.sqlite3') [14:56:24] we don't have that on neon or spence [14:56:40] where neon is Icinga and spence is Nagios [14:58:17] ok, committing a followup [14:58:49] spence has sqlite the package installed [14:58:57] neon does not, just libsqlite [15:00:25] mutante: if you have finished with the apache config, would you mind reviewing my beta hack please? https://gerrit.wikimedia.org/r/#/c/47398/2 [15:00:41] hashar: yes, it's your turn [15:03:13] hashar: yea, that makes sense, Apache waits for the sync to finish before it starts.. 
ack [15:03:26] \O/ [15:03:31] that is a bit of a hack [15:04:09] New review: Dzahn; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47398 [15:04:17] New review: Mark Bergsma; "Patch Set 5: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/47067 [15:04:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47398 [15:04:23] using /bin/true , yea [15:04:55] credits go to platinides :-] [15:05:20] going to test out an apache linter [15:05:30] cool [15:05:33] gsed -i s%/etc/apache2/wmf/%% * [15:05:34] httpd -t -d "$WORKSPACE" -f all.conf [15:05:36] oh man [15:05:38] apache2 [15:07:46] hashar: remnant.conf :p [15:08:33] mutante, I'm lost: Synaptic shows that I have no SQLite library for Python, yet the script works [15:10:42] MaxSem: on spence, where sqlite is installed i get this: [15:10:47] AttributeError: _ElementInterface instance has no attribute 'iter' [15:11:04] ha [15:11:05] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:11:27] looks like a non-compatible ElementTree [15:11:33] MaxSem: and there is just no /var/lib/check_solr/ where it looks for the db [15:11:38] that doesn't work as expected sniff [15:11:44] you would have to create that via puppet [15:11:50] 15:11:08 Syntax error on line 3 of /var/lib/jenkins/jobs/operations-apache-config-lint/workspace/www.wikipedia.conf: [15:11:50] 15:11:08 Invalid command 'Redirect', perhaps misspelled or defined by a module not included in the server configuration [15:11:52] yeah, doing it now [15:11:56] ok, cool [15:12:09] but not sure what packages are needed [15:12:34] spence has both, sqlite and sqlite3 [15:12:42] i suppose just sqlite3 is good enough [15:12:47] sqlite is 2.x [15:13:18] mutante, the error you posted is not related to sqlite [15:13:20] icinga doesnt have either yet, so that means it 
probably wasnt puppetized [15:13:42] no, it's not [15:13:51] but that's what i get where sqlite is installed [15:14:27] and where it's not installed i get the "unable to open database file" [15:15:57] heh, I tried to uninstall python-libxml2 and apt told me that it would involve uninstalling ubuntu-desktop:P [15:16:52] grmblb [15:16:58] WE do not have ack-grep in production [15:17:00] seriously [15:17:02] :D [15:17:20] hashar, and not even MC:] [15:19:16] hashar: does it not get that mod_alias is installed? [15:19:24] yeah that is the issue [15:19:30] heh [15:19:31] /usr/sbin/apache2 -t -d "$PWD" -C 'Include /etc/apache2/mods-enabled/*.load' -C 'Include /etc/apache2/mods-enabled/*.conf' -f all.conf [15:19:41] that would hack it to load the global conf [15:19:41] yes, alias.conf [15:19:59] aha [15:20:10] ok, brb [15:20:20] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:21:54] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:22:13] RECOVERY - MySQL disk space on neon is OK: DISK OK [15:23:04] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:23:18] bah [15:23:19] zuul ignores me [15:25:23] mutante, what python version does neon have? 
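Editor's aside on the two errors quoted above: a minimal sketch, not the actual check_solr code. Element.iter() first appeared in Python 2.7, so an older interpreter raises the AttributeError shown in the log, and sqlite3.connect() reports "unable to open database file" when the parent directory does not exist. The helper names are hypothetical, and the demo uses a temporary directory as a stand-in for /var/lib/check_solr, which the log says would be created via puppet.

```python
import os
import sqlite3
import tempfile
import xml.etree.ElementTree as ET


def iter_elements(root, tag=None):
    """Iterate over root and its descendants on old and new ElementTree.

    Element.iter() exists from Python 2.7 on; older versions only
    offer getiterator(), so fall back to that when iter is missing.
    """
    if hasattr(root, "iter"):
        return root.iter(tag)
    return root.getiterator(tag)


def open_db(path):
    """Open an SQLite database, creating its parent directory first.

    Without the directory, sqlite3.connect() fails with
    'unable to open database file'.
    """
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    return sqlite3.connect(path)


if __name__ == "__main__":
    root = ET.fromstring("<response><doc/><doc/></response>")
    print([element.tag for element in iter_elements(root)])

    # Temporary directory used purely for illustration.
    base = tempfile.mkdtemp()
    conn = open_db(os.path.join(base, "check_solr", "checks.sqlite3"))
    conn.execute("CREATE TABLE IF NOT EXISTS checks (name TEXT)")
    conn.commit()
    conn.close()
```

Creating the directory from the script is only one option; managing it in the host's configuration, as suggested in the channel, keeps ownership and permissions in one place.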
[15:25:27] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:27:53] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:29:45] * MaxSem wonders if the script can be fixed by using specifically python2.7 [15:30:16] MaxSem: 2.7.3 [15:30:22] weird [15:30:29] meh [15:31:32] MaxSem: once it has that directory, /var/lib/check_solr it creates the db [15:31:43] but then TypeError: object of type 'NoneType' has no len() [15:35:10] be back after food [15:35:31] mutante: notice: /Stage[main]/Applicationserver::Config::Apache/Exec[Fake sync apache wmf config on beta]/returns: executed successfully [15:35:33] \O/ [15:35:33] thx [15:36:04] New review: Hashar; "Patch Set 2:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47398 [15:37:04] heh, that error is indeed caused by python2.6 [15:37:05] mutante: that was bug https://bugzilla.wikimedia.org/show_bug.cgi?id=38996 [15:39:50] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:39:56] :-((( [15:43:09] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:43:20] 2013-02-15 15:43:09,219 INFO zuul.Jenkins: Launch job operations-apache-config-lint for change with dependent changes [] [15:43:21] \O/ [15:43:49] 15:43:13 Invalid command 'ExpiresActive', perhaps misspelled or defined by a module not included in the server configuration [15:43:51] good luck [15:47:50] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [15:48:08] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [15:52:06] New patchset: MaxSem; "Solr monitoring fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49250 [15:52:17] mutante, ^^^ [15:53:15] New review: Hashar; "Patch Set 1:" [operations/apache-config] 
(master) - https://gerrit.wikimedia.org/r/49243 [15:54:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49074 [15:55:04] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [15:59:26] apache linter is on [16:00:46] !log Enabled a linter for operations/apache-config . Any apache typo in operations/apache-config will publicly write your name on the typo wall of shame. See {{gerrit|49251}} for implementation details and https://integration.mediawiki.org/ci/job/operations-apache-config-lint/ for some build examples. [16:00:48] Logged the message, Master [16:03:38] * hashar dance [16:03:55] Change abandoned: Hashar; "(no reason)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49243 [16:04:38] New review: Hashar; "Patch Set 2:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24407 [16:05:11] New review: Hashar; "Patch Set 4:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/47088 [16:05:53] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49197 [16:06:00] New review: Hashar; "Patch Set 2:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [16:06:08] New review: Hashar; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [16:06:38] https://gerrit.wikimedia.org/r/#/q/project:operations/apache-config+is:open,n,z [16:06:40] all +1 ed [16:06:42] \|O/ [16:07:11] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [16:10:43] New patchset: Jgreen; "building db29 for RT #4518" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49254 [16:11:51] * Jeff_Green is confused by apparent gerrit UI changes--where did the [review side-by-side] etc go? 
[16:15:23] Jeff_Green: they are no more buttons but links [16:15:33] Jeff_Green: just below the list of files for the patchset [16:15:39] ahh i see them now. thx [16:15:43] :D [16:15:45] congrats [16:15:48] lol [16:15:49] New patchset: Ottomata; "Fixing hdfs::sync rsync command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49255 [16:16:02] New review: Jgreen; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49254 [16:16:03] New review: Ottomata; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49255 [16:16:08] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49254 [16:16:15] mark: so I guess no varnish is going to be deployed on a friday evening ? :-] [16:16:15] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49255 [16:16:44] ah, Jeff_Green [16:16:53] did we just collide? [16:16:59] we merged at the same time I think [16:17:01] ja [16:17:04] db29 ok to merge? [16:17:07] yes [16:17:10] k doing [16:17:11] i just merged yours actually [16:17:13] oh ok [16:17:14] :) [16:17:22] fire in the hole! [16:17:22] perfect thank you! [16:17:36] np [16:19:14] <^demon> Someone got a second to merge a one-line puppet change? Wanting to enable diffs in gerrit e-mails [16:19:50] ^demon: looking [16:20:46] New review: Jgreen; "Patch Set 1: Verified+1 Code-Review+2" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/49133 [16:20:54] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49133 [16:21:11] ^demon: merged [16:21:31] <^demon> Thank you. I can run puppet as soon as it's ready on sockpuppet. [16:21:35] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [16:21:57] MY MAILBOX IS FULL!!!!!!!!!!!!! 
[16:22:05] the gerrit emails are too large :D [16:22:47] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [16:22:49] <^demon> They're limited in size so diffs won't make your machine go into swap. [16:22:52] <^demon> ;-) [16:23:22] * hashar updates his mail proxy rewriting rules. [16:26:13] New patchset: Ottomata; "Adding --delete flag to hdfs::sync rsync command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49257 [16:26:35] New review: Ottomata; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49257 [16:26:44] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49257 [16:27:55] hashar: apache-linter+! nice! [16:28:06] thanks [16:29:01] <^demon> Jeff_Green: Thanks so much. Diffs look good :) [16:29:10] <^demon> (Very long standing request for gerrit) [16:32:14] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [16:33:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 204 seconds [16:33:36] mutante: next step would be to run an apache instance on 127.0.0.2:80 and run Jeff integration tests on it [16:33:44] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 223 seconds [16:34:38] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 207 seconds [16:34:47] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 200 seconds [16:37:04] mutante, can you take a look at https://gerrit.wikimedia.org/r/49250 please? 
[16:38:05] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds [16:38:23] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 1 seconds [16:46:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [16:46:11] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [16:47:14] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 197 seconds [16:47:59] MaxSem: looks reasonable so far.. but .. i think we want to add it to Icinga and not to old spence [16:48:28] oh, I thought they both use the same class [16:48:37] how do I do this? [16:48:38] LeslieCarr: new monitoring checks.. we want them on neon now, right [16:48:55] ./manifests/misc/icinga.pp [16:48:58] instead of nagios.pp [16:53:54] Change abandoned: Hashar; "does not really work as is, will have to invest sometime later on." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28627 [16:55:54] !log removing mw1085 from dsh groups - broken hardware (RT-4453) [16:55:56] Logged the message, Master [17:11:17] New patchset: Hashar; "beta: memcached on the two apaches boxes" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49261 [17:11:35] New review: Hashar; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49261 [17:11:37] week end [17:11:41] have a good time! 
[17:11:42] * hashar waves [17:12:07] bye hashar [17:13:12] New patchset: MaxSem; "Solr monitoring fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49250 [17:32:41] PROBLEM - Host db1030 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:50] !log reimaging db1029-1031 [17:33:52] Logged the message, Master [17:36:21] binasher: for a break in between :) http://mysqlgame.com/ [17:36:55] mutante: hahah aweseome [17:37:38] PROBLEM - SSH on db1031 is CRITICAL: Connection refused [17:38:32] RECOVERY - Host db1030 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [17:41:32] PROBLEM - SSH on db1029 is CRITICAL: Connection refused [17:42:04] I declare mysqlgame to be the second best game after progressquest. [17:42:21] "attack rows where..." [17:42:35] PROBLEM - SSH on db1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:35] somewhere in here there's a recursive joke about little johnny drop tables [17:42:45] they use mediawiki too [17:43:04] http://progressquest.com/ [17:43:09] http://mysqlgame.com/wiki/index.php/The_Game%27s_Goal [17:44:59] RECOVERY - SSH on db1031 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:45:25] mutante, does it look better now? [17:46:02] RECOVERY - SSH on db1030 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:50:03] MaxSem: yes it does, checkcommands.cfg.erb was missing [17:50:46] MaxSem: i would like to ask Leslie when we start just adding stuff to Icinga and keep Nagios/Icinga in sync. but i can merge that now [17:51:33] New review: Dzahn; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49250 [17:51:40] mutante, wouldn't Nagios fail because it sees the monitor_service() but not the other changes? 
[17:51:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49250 [17:52:40] MaxSem: yea, but not break all of Nagios, just a new broken check for Solr that didnt exist before and we can acknowledge so it doesn't spam [17:52:47] but let me check [17:53:14] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [17:53:17] arg, problem will be that it takes a long time until we can see it [17:57:17] RECOVERY - SSH on db1029 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:00:17] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:01:11] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours [18:07:11] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [18:08:09] MaxSem: ugh, it broke on Icinga.. did not find the checkcommand.. trying one more run [18:11:32] PROBLEM - NTP on db1031 is CRITICAL: NTP CRITICAL: No response from NTP server [18:12:09] New review: Faidon; "Patch Set 1: Code-Review-1" [operations/debs/wikimedia-task-appserver] (master) C: -1; - https://gerrit.wikimedia.org/r/49231 [18:12:53] PROBLEM - NTP on db1030 is CRITICAL: NTP CRITICAL: No response from NTP server [18:16:08] New review: Faidon; "Patch Set 6: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47742 [18:17:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 208 seconds [18:18:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 224 seconds [18:19:15] New review: Faidon; "Patch Set 6:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [18:21:08] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:21:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:23:08] !log mw87-mw125 installing [18:23:09] Logged 
the message, RobH [18:23:18] no one mess with brewster ;] [18:23:50] PROBLEM - NTP on db1029 is CRITICAL: NTP CRITICAL: No response from NTP server [18:24:12] New patchset: Jgreen; "removing deprecated file from civi config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49265 [18:24:52] Jeff_Green: deleting the "typo" file? that is something for jenkins [18:25:10] garg that made it in there? i totally missed that. [18:25:18] thanks for catching it [18:25:21] New review: CSteipp; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [18:25:37] yw, i wanted to see it you mean "that" civi,, but you mean fundraising of course:) [18:25:37] Thehelpfulone: hah, this one looks like a bargain. yeah, right [18:26:05] :D [18:26:10] Change abandoned: Jgreen; "oops, erroneously deleted 'typo' file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49265 [18:26:25] garg. now I have to remember how to unscrew my repo [18:28:37] MaxSem: ouch, totally not your fault, but Icinga does not use template icinga/checkcommands.cfg.erb, it uses nagios/checkcommands.cfg.erb :p [18:28:42] git reset --ohdamnimscrewed [18:28:45] let me see how different they are [18:28:52] ahahaha [18:29:18] RobH: perfect! [18:30:14] git reset --endmymisery causes git to fork a hitman to kill you. [18:30:22] git is powerful. [18:31:40] * Jeff_Green is sooo tempted to try it [18:32:00] you have a family man, dont do it [18:32:37] is death-by-git in the life insurance policy? [18:32:52] counts as suicide, you should have known better than to learn git. [18:35:09] New patchset: Dzahn; "add solr monitoring checkcommands to nagios checkcommands template, because Icinga uses this one and not icinga/checkcommands template.." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/49266 [18:36:01] New review: Dzahn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49266 [18:36:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49266 [18:39:01] mutante: so, are we waiting to deploy wikiartpedia for some reason? (i.e. it should not be waiting for DNS, right?) [18:39:16] jeremyb_: we are waiting for markmonitor now [18:39:35] mutante: but we can do apache first [18:39:43] it doesnt matter [18:39:49] and even our zones for that matter. [18:39:50] it wont work without DNS [18:40:04] sure, just saying it's not blocked on markmonitor right now. we could prepare more [18:40:21] yea, thats what i said earlier.. we still need to add to DNS but the order doesnt matter [18:40:28] yea [18:40:41] k [18:40:54] * jeremyb_ goes to reply to bug-attachment [18:41:59] RECOVERY - NTP on db1030 is OK: NTP OK: Offset 0.0422295332 secs [18:46:00] MaxSem: http://neon.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=solr1&service=Solr [18:46:25] Icinga is unbroken now, i added the checkcommands to the nagios template as well.. 
just copied because the files were pretty different [18:46:31] but the check is status Unknown [18:46:38] and i need to go soon [18:47:56] UNKNOWN is a good status though, tells us to fix it but doesnt cause CRITs that would page/mail [18:48:57] and unrelated but sigh: [18:48:59] Error 400 on SERVER: Exported resource Nagios_host[srv229] cannot override local resource on node spence.wikimedia.org [18:49:47] mutante, thanks [18:50:04] looks like the problem is in error handling [18:53:32] RECOVERY - NTP on db1029 is OK: NTP OK: Offset -0.009438157082 secs [18:55:14] New patchset: Asher; "eqiad half of new x1 (extension) coredb shard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49268 [18:57:20] New review: Jeremyb; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [18:59:31] ori-l: csteipp ^ [19:02:14] RECOVERY - NTP on db1031 is OK: NTP OK: Offset -0.006007432938 secs [19:03:11] RobH: If you wanna do the Parsoid box swap-out I'm available for that any time today [19:05:09] RoanKattouw: I didnt get to installing them yet, i took priority to get apaches in tampa back to snuff [19:05:19] cuz not having possible fallback DC is buggin me [19:05:27] sorry about that, i will try to get them today [19:05:34] but if not, i get otday/monday for us on tuesday? [19:05:40] (You are in office on Tuesday right?) 
[19:06:42] New review: Asher; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49268 [19:06:51] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49268 [19:07:22] !log no one sign mw86-mw125 puppet certs, im still working on them [19:07:23] Logged the message, RobH [19:07:38] RECOVERY - Puppet freshness on mw1003 is OK: puppet ran at Fri Feb 15 19:07:21 UTC 2013 [19:10:35] RobH: No worries, I'm not in a rush [19:10:44] RobH: Yeah I'm out every Mon&Wed but Tue is good [19:10:52] RoanKattouw: Awesome, thx dude [19:11:00] Oh and I guess Monday is a holiday :) [19:11:14] RECOVERY - Puppet freshness on search14 is OK: puppet ran at Fri Feb 15 19:10:54 UTC 2013 [19:11:39] notpeter: So I am about to add 40 more apaches into tampa. Keeping in mind that I have a pool of 6 new image apcahes, and jobrunner additions are not desired [19:11:42] i figured of the 40 new [19:11:45] 1/3rd to api [19:11:49] rest to general apache [19:11:57] Do we generally prefer to stick to Ubuntu-provided packages when they are avaliable, even if they're a little on the older side? [19:12:12] RoanKattouw: Oh, Monday is holiday eh? [19:12:18] Tuesday prolly still works [19:12:49] Yeah [19:12:58] So don't come into the office, do fun things instead ;) [19:13:40] sleep in and play deadspace3 weeeeee [19:13:48] RobH: yeah. 6 imagescalers, then keep roughly whatever ratio there already is between api and regular apache [19:13:48] Also, if it gets pushed to Thursday and you don't see me around, that's because I'm at the Wikia office (just down the street, around the corner from the old office) on Thursdays. But still working and on IRC [19:13:53] RobH: so sounds good to me! [19:13:56] notpeter: coolness, thx dude [19:14:17] RoanKattouw: just out of curiosity, why are you working out of wikia? [19:15:20] RobH: We have 2 Wikia guys embedded in our team. 
They work from our office 3 days/wk, so we work from theirs one day/wk (the fifth day they work on Wikia stuff) [19:16:18] RoanKattouw: nagios says titanium parsoid is down, is that expected? [19:16:31] Hmm [19:16:33] I thought I'd fixed that [19:16:35] Checking [19:17:00] Oh I see [19:17:26] There's this unpuppetized symlink that I happen to have created manually on all boxes except titanium [19:17:38] I realized this yesterday and submitted https://gerrit.wikimedia.org/r/#/c/49193/ to fix it [19:17:47] Merging that and running puppet on titanium should bring it up [19:18:22] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49193 [19:18:32] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49193 [19:18:42] RoanKattouw: as long as they dont stick you in the corner heatbox like they used to do to Trevor ;] [19:19:00] hehe [19:19:09] I was in that heatbox with him at the time [19:19:26] It was pretty weird coming back to the Wikia building last year and seeing that room again [19:21:16] i forgot you were in there too [19:21:25] i just recall 3 or 4 of you crammed in there. [19:21:32] and it was hot and not awesome [19:22:00] New patchset: RobH; "adding in mw86-125 & updating decom for servers that are being reclaimed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49270 [19:25:32] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49270 [19:25:42] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49270 [19:26:31] Coren: yeah but that's just a preference. 
and the flip side of that is that it sometimes because part of the push driving upgrading to a new ubuntu release [19:28:49] jeremyb_: The case in point is the Ubuntu-provided gridengine (in Precise, anyways) is the pre-fork sunsource's 6.2, it's not /horribly/ old but is already EOL [19:29:13] Coren: so? [19:29:47] what would you use it for? [19:30:12] jeremyb_: Tools lab. Anything I meantion is pretty much guaranteed to be for that purpose. :-) [19:31:40] Coren: is that just because people are most familiar with SGE? there might be other things that work better on ubuntu (or at least haven't been EOL'd) [19:32:30] Coren: idk the status of the postfork Oracle project. is that the same license as opensolaris? or it's closed completely? [19:32:48] Coren: we can certainly make our own packages if need to [19:32:52] if we need to* [19:33:00] jeremyb_: Familiarity is one point, so is functionallity (it does exactly the right thing for the task at hand). The problem isn't that it's been EOL'ed because there is now an open source fork (Open Grid Scheduler) [19:33:28] jeremyb_: Hence my question about preference for directly Ubuntu'ed packages. :-) [19:33:56] well presumably it's just a fork not e.g. a rewrite [19:34:10] so we should be able to adapt existing packages to the new source [19:34:11] It is, from that exact code base. [19:34:31] we would build our own though rather than use a 3rd party repo [19:34:50] (see apt.wikimedia.org) [19:36:54] That'd work. I honestly also expect that Raring will have switched to OGS by release time, but they don't seem to have done so yet. [19:37:41] I've deployed the Ubuntu package on the POC/testing instances for the moment anyways. [19:38:26] ok, so you can work with that for now and we can help you roll packages for a later stage [19:38:49] what are the package names? what you're using now and what's the EOL'd stuff [19:39:54] gridengine-{common,client,exec,master,qmon}. 
They are all 6.2u5 from http://gridengine.sunsource.net [19:40:10] Further work from Oracle is now proprietary [19:40:44] The open source fork is at http://gridscheduler.sourceforge.net/ [19:41:53] ok, great [19:42:05] Coren: do you want to file a bug for this? [19:42:13] there's a labs product or something [19:42:25] (CC me please) [19:42:53] jeremyb_: Probably not yet, /whether/ we'll be using it is still in question (hence the POC), but I wanted to plan ahead for the possibility. [19:43:11] ok [19:43:52] jeremyb_: I'm still not familiar with the internal culture yet, so what is the "normal" way of doing things is mostly guesswork and bugging people like you. :-P [19:44:41] I forsee a great deal of brain picking in the weeks to come. :-) [19:45:03] Coren: well i'm in #-labs too [19:45:55] kk, I'll bug you there instead. :-) [20:03:35] PROBLEM - LDAPS on nfs2 is CRITICAL: Connection refused [20:03:35] PROBLEM - LDAP on nfs2 is CRITICAL: Connection refused [20:04:39] New review: Ryan Lane; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48996 [20:04:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48996 [20:05:05] PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused [20:05:14] PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused [20:06:11] New patchset: Ryan Lane; "Remove opendj and ldap client from nfs1/2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49274 [20:06:50] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49274 [20:07:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49274 [20:09:10] New patchset: Faidon; "LDAP: configure CA in the Nagios check properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49275 [20:09:19] Ryan_Lane: ^^^ [20:09:49] New review: Ryan Lane; "Patch Set 1: Code-Review+2" 
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49275 [20:11:15] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 193 seconds [20:11:50] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 199 seconds [20:12:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49275 [20:15:04] !log removed opendj from nfs1/2 [20:15:06] Logged the message, Master [20:15:19] !log restarting opendj and pdns on virt0 and virt1000 [20:15:20] Logged the message, Master [20:16:10] New patchset: Faidon; "Fix missing $ca_name in ldap::server definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49276 [20:16:44] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49276 [20:16:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49276 [20:17:05] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 26 seconds [20:18:08] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 10 seconds [20:28:18] New patchset: MaxSem; "Fix check_solr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49279 [20:28:23] notpeter, ^^ :) [20:31:48] New patchset: Jgreen; "add pgehres to db29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49288 [20:32:11] New review: Jgreen; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49288 [20:32:24] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49288 [20:41:19] New patchset: Jgreen; "more classes for db29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49341 [20:41:35] New review: Jgreen; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49341 [20:41:42] Change 
merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49341 [20:44:42] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [20:45:42] New patchset: Jgreen; "add gid=500 to db29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49342 [20:45:54] New review: Jgreen; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49342 [20:46:02] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49342 [20:50:04] New patchset: Jgreen; "add roots to db29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49343 [20:51:25] New review: Jgreen; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49343 [20:51:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49343 [21:00:56] New patchset: Jgreen; "php packages for db29" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49345 [21:03:15] New review: Jgreen; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49345 [21:03:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49345 [21:10:20] RECOVERY - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is OK: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z OK - 1063 [21:10:38] RECOVERY - ps1-d2-sdtpa-infeed-load-tower-A-phase-X on ps1-d2-sdtpa is OK: ps1-d2-sdtpa-infeed-load-tower-A-phase-X OK - 863 [21:16:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [21:41:56] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49279 [21:42:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49279 [21:46:15] eh, is it known that ishmael is down? 
[21:46:47] same stuff for graphite... [21:47:35] IIRC that stuff all ran on fenari ... [21:48:30] so, hooray to migration? [21:50:29] New review: Catrope; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [21:52:31] MaxSem: merged on sockpuppet [21:52:40] notpeter, awesome, thanks [21:54:03] RoanKattouw: on fenari?? i don't think so [21:54:27] (proxied through fenari it was. but it had it's own box) [21:54:48] * jeremyb_ will be back later [21:55:18] Hmm well the Apache configs for it were no fenari [21:55:30] So maybe it was a proxy, but I just remember having to deal with the SSL certs [21:55:33] that's only for the proxying [22:02:09] PROBLEM - Puppet freshness on db1002 is CRITICAL: Puppet has not run in the last 10 hours [22:03:57] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours [22:09:52] PROBLEM - mysqld processes on db1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:09:52] PROBLEM - MySQL Replication Heartbeat on db1030 is CRITICAL: NRPE: Unable to read output [22:10:10] PROBLEM - Solr on solr1001 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:10:10] PROBLEM - Solr on solr3 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:10:37] PROBLEM - Solr on solr1 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:10:37] PROBLEM - Solr on solr2 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:10:46] PROBLEM - mysqld processes on db1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:10:46] PROBLEM - MySQL Replication Heartbeat on db1029 is CRITICAL: NRPE: Unable to read output [22:10:46] PROBLEM - MySQL Replication Heartbeat on db1031 is CRITICAL: NRPE: Unable to read output [22:11:04] PROBLEM - Solr on solr1003 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:11:31] PROBLEM - Solr on solr1002 is 
CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:11:40] PROBLEM - Solr on vanadium is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [22:11:50] notpeter, ^^^ [22:14:09] MaxSem: I'm running puppet on spence now [22:14:11] that might clear it up [22:21:39] what's up with mw1161 [22:22:05] RoanKattouw: the titanium alert didn't go away [22:22:40] Looking [22:22:41] /e/init.d/parsoid restart runs but I see nothing parsoid-related in ps aux [22:22:58] New patchset: Catrope; "Give the admins::parsoid group shell access to the Parsoid machine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [22:23:03] haha [22:23:37] Yeah it's missing another symlink [22:23:43] The deployment system should put that in [22:23:47] I'll just put it in manuall [22:23:49] y [22:24:33] OK it's running now [22:24:36] That symlink is all it neede [22:24:38] d [22:24:56] RoanKattouw: I think, I *think* that your commit above constitutes an access request and should go through RT [22:25:13] (I'm fine with it though) [22:25:19] Yeah it probably does [22:25:23] That's reasonable [22:25:37] RECOVERY - Parsoid on titanium is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.057 seconds [22:25:37] I will point out that the user in question is already in mortals [22:25:40] But I'll file a ticket [22:25:48] thanks! [22:28:10] PROBLEM - Apache HTTP on mw1161 is CRITICAL: Connection refused [22:29:31] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:31:30] New review: Catrope; "Patch Set 2:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [22:33:17] paravoid: I filed an RT ticket, see Gerrit for the link [22:33:41] thanks! 
[22:48:44] New patchset: Faidon; "Add check_solr to Nagios too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49355 [22:49:07] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49355 [22:49:16] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49355 [22:56:04] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:58:15] RobH: hey [22:58:34] RobH: wanna figure out why mw1161 isn't set up properly/ [22:59:37] yep, i have it and one other eqiad one, and two tampa apaches that didnt properly deploy [22:59:43] working on them today =] [22:59:56] k [22:59:59] well, working on them now, but not to 1161 yet [23:00:02] (also 1165 iirc) [23:01:11] paravoid, https://gerrit.wikimedia.org/r/#/c/49355/ is not enough, you also neeed to create /var/lib/check_solr [23:02:52] /var/lib/check_solr? wtf is that? [23:03:32] a directory where that plugin stores its database:) see icinga.pp [23:03:55] wikidev? [23:03:56] wtf! [23:04:27] !log disregard mw1161 flaps, im working on it [23:04:30] Logged the message, RobH [23:04:57] paravoid, I thought it could be useful to test this script manually, feel free to reset to root or whatever [23:05:08] why does it need a database? [23:05:33] to store the previous error counter values [23:05:49] PROBLEM - Apache HTTP on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:49] why? [23:07:47] to be able to detect new errors:) [23:08:06] I don't understand, could you explain in more words? [23:08:14] rather than short answers to my questions? :) [23:08:26] check N: 3 errors [23:08:34] check N+1: 3 errors, OK [23:08:49] check N+2: 4 errrors, WARNING [23:09:06] check N+3: 5 errors, ERROR [23:09:10] is that supposed to tell me something? [23:09:32] I don't understand. 
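(Editor's sketch of the idea MaxSem describes here: keep the previous Solr error counter on disk and alert only when the counter grows between two runs. This is not the actual check_solr plugin. The state file name, the `crit_delta` threshold, and the mapping of deltas to WARNING/CRITICAL are assumptions made for illustration; the log only says the plugin stores its data under /var/lib/check_solr.)

```python
import os
import tempfile

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_error_delta(state_file, current_errors, crit_delta=2):
    """Compare the current error counter with the stored one, then store it.

    crit_delta is a hypothetical threshold: a jump of at least this many
    new errors since the previous run is reported as CRITICAL.
    """
    try:
        with open(state_file) as fh:
            previous = int(fh.read().strip())
    except (OSError, ValueError):
        # First run or unreadable state: treat the current count as baseline.
        previous = current_errors

    # Persist the counter so the next run can compute its own delta.
    with open(state_file, "w") as fh:
        fh.write(str(current_errors))

    delta = current_errors - previous
    if delta >= crit_delta:
        return CRITICAL, delta
    if delta > 0:
        return WARNING, delta
    return OK, delta


if __name__ == "__main__":
    # Demo with a throwaway state file instead of /var/lib/check_solr.
    state = os.path.join(tempfile.mkdtemp(), "errors.state")
    for count in (3, 3, 4, 6):
        print(count, check_error_delta(state, count))
```

(Old errors never alert in this sketch: a counter that stays at 3 reports OK on every run, which matches the "these 3 errors might have happened a week ago" reasoning. The trade-off raised in the discussion still applies, since the check needs writable state on the monitoring host.)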
[23:09:44] to see if error counter has changed, I need to store its previous state [23:10:38] (error counter in Solr statistics, not Nagios/Icinga error counter) [23:11:40] PROBLEM - SSH on mw1161 is CRITICAL: Connection refused [23:16:42] MaxSem: 3 errors sounds like an error on itself, doesn't it? [23:16:52] why would you want to get the difference in errors? [23:17:04] these 3 errors might have happened a week ago [23:17:20] I'm only interested in moments where new errors occur [23:17:46] sounds to me like you need a ganglia graph, not a nagios alert [23:18:07] and then maybe an alert for trends of that [23:22:19] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:24:58] New patchset: Faidon; "Kill check_solr (at least in this form)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49358 [23:25:56] paravoid, "only based on the errors diff"? [23:26:05] it's just one of the checks [23:28:16] !log modified project objects to indicate whether or not they are using gluster volumes for project and/or home storage [23:28:17] Logged the message, Master [23:28:24] !log *modified them in LDAP [23:28:25] Logged the message, Master [23:29:26] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [23:29:53] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [23:30:56] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK HTTP/1.1 200 OK - 455 bytes in 0.053 seconds [23:34:41] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:35:17] PROBLEM - NTP on mw1161 is CRITICAL: NTP CRITICAL: Offset unknown [23:47:11] New review: MaxSem; "Patch Set 1: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49358 [23:50:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [23:51:11] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 212 seconds [23:53:44] 
RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:54:47] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [23:56:26] PROBLEM - Apache HTTP on mw1161 is CRITICAL: Connection refused [23:56:54] !log authdns-update to correct mw110 reverse entry [23:56:55] Logged the message, RobH [23:57:30]