[01:03:03] PROBLEM - RAID on analytics1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:58] RECOVERY - RAID on analytics1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:09:08] PROBLEM - RAID on analytics1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:58] RECOVERY - RAID on analytics1010 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:43:41] YuviPanda|away: if you're going to ignore it then you should add a .pep8 file
[01:44:01] jeremyb: to what? operations/puppet.git?
[01:44:29] and so get jenkins to ignore it too
[01:44:42] YuviPanda|away: i guess to the dir that has the python in it?
[01:44:52] hmm, maybe
[01:44:56] I'll check it out when I wake up
[01:44:59] need to sleep in a bit
[01:45:12] right after I finish checking out this pupept alternative ;D
[01:46:56] YuviPanda|away: see puppet:modules/ldap/files/scripts/.pep8 as an example
[01:47:16] puppet alternative?
[01:47:20] ansible
[01:47:25] for my personal VPS
[01:49:49] ahh. i once sat next to that guy at the bar after he gave a talk at a LUG. but we only covered cobbler/puppet because ansible didn't exist yet
[01:50:17] jeremyb: ah!
[02:07:54] !log LocalisationUpdate completed (1.23wmf4) at Sun Dec 1 02:07:54 UTC 2013
[02:08:12] Logged the message, Master
[02:13:36] !log LocalisationUpdate completed (1.23wmf5) at Sun Dec 1 02:13:36 UTC 2013
[02:13:50] Logged the message, Master
[02:36:06] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Dec 1 02:36:06 UTC 2013
[02:36:20] Logged the message, Master
[03:09:01] PROBLEM - SSH on amslvs3 is CRITICAL: Server answer:
[03:11:00] RECOVERY - SSH on amslvs3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[03:32:01] PROBLEM - Backend Squid HTTP on sq37 is CRITICAL: Connection refused
[05:55:49] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 8d 20h 37m 22s
[06:02:09] RECOVERY - Puppet freshness on sq37 is OK: puppet ran at Sun Dec 1 06:02:06 UTC 2013
[06:04:01] !log pwercycled sq37, was giving 'i/o error' to all commands
[06:04:17] Logged the message, Master
[06:06:10] PROBLEM - SSH on sq37 is CRITICAL: Server answer:
[06:08:00] PROBLEM - Backend Squid HTTP on sq37 is CRITICAL: Connection refused
[06:16:01] PROBLEM - NTP on sq37 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:20:10] RECOVERY - SSH on sq37 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:22:20] PROBLEM - Frontend Squid HTTP on sq37 is CRITICAL: Connection refused
[06:23:12] !log sq37 hardware errors, probably controller, rt #6418
[06:23:29] Logged the message, Master
[06:28:10] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[06:30:10] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active
[06:30:50] RT 4802 is a new spin on cronspam :P
[08:16:47] PROBLEM - DPKG on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:16:57] PROBLEM - Disk space on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:17:07] PROBLEM - puppet disabled on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:17:27] PROBLEM - RAID on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
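[Editor's note: the 01:43-01:46 exchange above is about silencing pep8 warnings for Python scripts kept in operations/puppet.git by committing a .pep8 config file next to them (as in modules/ldap/files/scripts/.pep8), so that the Jenkins lint job skips the same checks. The contents of that file are not shown in the log; the snippet below is only a rough sketch of what a pep8 per-directory config of that era looked like, with the ignored code (E501, "line too long") picked purely for illustration.]

    # .pep8 -- illustrative sketch of a per-directory pep8 config;
    # the real file in the repo may ignore different codes.
    [pep8]
    ignore = E501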
[08:20:07] RECOVERY - puppet disabled on virt10 is OK: OK
[08:20:37] RECOVERY - DPKG on virt10 is OK: All packages OK
[08:20:47] RECOVERY - Disk space on virt10 is OK: DISK OK
[08:21:17] RECOVERY - RAID on virt10 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0
[08:25:27] PROBLEM - RAID on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:25:47] PROBLEM - DPKG on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:25:57] PROBLEM - Disk space on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:07] PROBLEM - puppet disabled on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:35:53] PROBLEM - NTP on virt10 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:50:03] RECOVERY - puppet disabled on virt10 is OK: OK
[08:50:03] RECOVERY - DPKG on virt10 is OK: All packages OK
[08:50:13] RECOVERY - Disk space on virt10 is OK: DISK OK
[08:50:53] RECOVERY - NTP on virt10 is OK: NTP OK: Offset 0.0007612705231 secs
[08:59:13] RECOVERY - RAID on virt10 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0
[09:02:53] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 3h 0m 37s
[09:05:13] PROBLEM - puppet disabled on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:05:13] PROBLEM - DPKG on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:05:28] PROBLEM - Disk space on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:05:28] PROBLEM - RAID on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:07:58] PROBLEM - NTP on virt10 is CRITICAL: NTP CRITICAL: No response from NTP server
[09:47:39] (03PS2) 10Nemo bis: Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98073
[10:13:01] RECOVERY - puppet disabled on virt10 is OK: OK
[10:13:01] RECOVERY - DPKG on virt10 is OK: All packages OK
[10:13:11] RECOVERY - Disk space on virt10 is OK: DISK OK
[10:13:11] RECOVERY - RAID on virt10 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0
[10:13:57] Ryan_Lane: hey
[10:15:00] or not :)
[10:17:21] PROBLEM - RAID on virt10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:31] PROBLEM - SSH on virt10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:19:31] PROBLEM - Host virt10 is DOWN: PING CRITICAL - Packet loss = 100%
[10:24:41] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[10:25:41] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms
[11:17:17] Is anyone awake? Tool Labs is down, and iirc Coren requested to be called when this happens.
[11:36:26] (03CR) 10Steinsplitter: [C: 031] Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98073 (owner: 10Nemo bis)
[12:03:28] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 3h 0m 35s
[12:44:17] valhallasw: it is not down here
[12:44:23] i'm connected
[12:45:12] paravoid: do you know anything about toollabs being down?
[12:47:46] Hey, it's up again. Great.
[12:48:43] but tool accounts are still down: http://tools.wmflabs.org/gerrit-reviewer-bot/
[12:49:31] apergos: ^
[12:49:59] mutante: ^^
[13:00:53] matanya: see http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&s=by+name&c=tools&tab=m&vn=
[13:01:06] 5 hosts are down. I rebooted 3 of them, and they are all stuck at 'rebooting'
[13:02:08] yeah, i see YuviPanda thanks
[13:02:26] 'get console output' also doesn't seem to work
[13:02:32] :/
[13:03:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[13:11:31] (03PS9) 10Matanya: varnish: whitespace & lint cleanups [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910
[13:51:05] Gah! Virt10 is down.
[13:58:36] !log Hard powercycle to virt10: box was wedged hard.
[14:01:31] please come back up please come back up please come back up
[14:01:43] * matanya is praying too
[14:02:00] RECOVERY - SSH on virt10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[14:02:10] RECOVERY - Host virt10 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms
[14:02:20] RECOVERY - RAID on virt10 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0
[14:02:21] whoo
[14:02:30] Why is it things like this unfailingly happen over long weekends?
[14:03:24] The kvms are going back up.
[14:04:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[14:46:40] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:50:26] (03PS1) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[14:50:30] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[14:51:18] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377 (owner: 10Matanya)
[15:02:14] (03PS2) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[15:03:05] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377 (owner: 10Matanya)
[15:04:00] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 6h 1m 7s
[15:04:21] (03PS3) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[15:05:12] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377 (owner: 10Matanya)
[15:07:15] (03PS4) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[15:08:07] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377 (owner: 10Matanya)
[15:11:07] (03PS5) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[15:12:00] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377 (owner: 10Matanya)
[15:13:13] (03PS6) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[15:13:47] matanya: "puppet parser validate foo.pp"
[15:14:04] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377 (owner: 10Matanya)
[15:14:25] paravoid: labs was down until now, and i don't have puppet installed
[15:14:34] but yes, agreed :)
[15:20:10] (03PS7) 10Matanya: toollabs: lint cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/98377
[15:21:22] finally
[15:27:36] (03PS1) 10Matanya: toollabs: remove old tips absent declartions [operations/puppet] - 10https://gerrit.wikimedia.org/r/98379
[16:04:32] Hey guys, I don't know who owns this http://www.google-melange.com/gci/task/view/google/gci2013/5280487538950144 but it's sending people to #wikimedia-ops instead of #wikimedia-operations, which is wrong
[16:06:21] Hello everyone, I am a GCI 2013 student working on the "CREATE A PAGE WITH INSTRUCTIONS FOR NEW OPS VOLUNTEERS" task. I have finished a draft of my page at https://wikitech.wikimedia.org/wiki/Get_involved#How_to_Propose_Changes . Would anyone be willing to provide feedback?
[16:07:58] iShirik: aw, poop
[16:08:26] iShirik: i don't think if it's still possible to edit descriptions of published tasks
[16:08:41] iShirik: andre klapper (andre__) or quim gil (qgil) are the people to contact
[16:08:59] worst case scenario we just keep redirecting them to the right place :P
[16:09:01] thanks
[16:09:42] SanjayR_: have you already asked guillom or qgil? :)
[16:09:52] SanjayR_: (it's weekend, though, so they might be unavailable)
[16:11:05] yes, I asked qgil, but he said to also ask at #wikimedia-ops, and they told me to go to #wikimedia-operations
[16:13:29] SanjayR_: yeah, this is the right channel, but most other people who could answer are just not here on weekends :(
[16:14:47] * valhallasw also always gets confused by -ops vs -operations
[16:16:20] ok, thanks anyway
[16:50:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[18:04:58] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 9h 2m 5s
[18:12:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[18:42:54] PROBLEM - MySQL Slave Delay on db73 is CRITICAL: CRIT replication delay 324 seconds
[18:43:04] PROBLEM - MySQL Replication Heartbeat on db73 is CRITICAL: CRIT replication delay 327 seconds
[18:45:54] RECOVERY - MySQL Slave Delay on db73 is OK: OK replication delay 123 seconds
[18:46:04] RECOVERY - MySQL Replication Heartbeat on db73 is OK: OK replication delay 122 seconds
[19:51:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[20:18:37] matanya: who can comment on GCI tasks?
[20:18:54] jeremyb: not me :)
[20:19:02] matanya: :(
[20:19:13] I don't know, sorry
[20:19:24] jeremyb: can you give an example?
[20:19:24] matanya: did you see above about people being sent to #-ops?
[20:19:55] no, where?
[20:20:36] oh, you were gone
[20:20:43] see from ~16 UTC
[20:21:08] * matanya is looking
[20:21:11] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20131201.txt
[20:21:36] wm-bot don't handle the colors from grrrit-wm so well
[20:21:52] GCI likes to say """"""
[20:21:52] GCI likes to say """"""
[20:21:52] This page is inaccessible because you do not have a profile in the program at this time.
[20:22:15] gah, how did that get doubled?
[20:22:32] aha! the answer here would be qgil jeremyb
[20:23:20] matanya: well you're mentioned there
[20:23:21] jeremyb: i can comment of them, if you want just that
[20:23:30] (or any other mentor, see [[mw:GCI#Mentors]])
[20:23:44] andre and quim are "org admins" and they can do some other things too
[20:24:19] MatmaRex: well part of what i want to reply to is something that came from you. i.e. the part about it's a weekend
[20:25:05] we should encourage people to leave a client idling here and teach them how to notice when they've been hilighted
[20:25:41] and matanya *is* here now and was active earlier. i don't want to commit him to staying all night but he will be here sometimes
[20:26:05] MatmaRex: ^
[20:27:02] I had an irssi session active 24/7 in the past, but got way too many pings when was away, means many tasks to do when i'm back
[20:27:09] so i stoped that habit
[20:27:13] jeremyb: maybe ops are different (i don't get to interact with y'all too much :) ), but in general catching a WMF person during weekend is practically impossible
[20:27:30] MatmaRex: you don't need a WMF person for this task
[20:27:45] non-WMF people are also somewhat rarer than during weekdays
[20:27:46] anyway
[20:27:49] matanya: but you at least don't close the client just because you went to do the dishes, right? :)
[20:28:16] i can forward anything you want to the GCI task until you register at melange and andre/quim make you a mentor (which will let you comment on things) :)
[20:28:26] ah, you cought me jeremyb :)
[20:28:35] https://www.mediawiki.org/wiki/GCI#Become_a_Wikimedia_GCI_mentor
[20:28:38] MatmaRex: caught* :)
[20:29:06] jeremyb: vishna's law in action.
[20:29:07] jeblair: we have definitely too many people whose nicks start with the same letters, don't we? :>
[20:29:18] + too tired to type correctly
[20:29:32] LOL at MatmaRex
[20:29:56] wow, had to reread a bit before i caught what i'd done
[20:30:14] i'm particularly unlucky
[20:30:19] 'ma' seems to be quite popular :D
[20:30:25] je is popular!
[20:31:05] (less so here though)
[20:31:39] jeremyb: do you do puppet work?
[20:41:32] matanya: sometimes
[20:41:46] aha
[20:41:59] (03PS1) 10Matanya: tcpircbot: tabs to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/98455
[20:49:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[20:51:37] MatmaRex: https://code.google.com/p/soc/issues/detail?id=1974
[20:52:46] (03PS1) 10Matanya: role.pp: minor lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/98456
[20:59:03] matanya: how do I get the catalog (?) for a specific host, e.g. stat1 ?
[20:59:16] you don't?
[20:59:23] Nemo_bis: puppetwise?
[20:59:27] if you mean puppet catalog
[20:59:34] matanya: yes
[20:59:44] catalogs contain all kinds of secrets
[20:59:55] well, the closest approximation
[21:00:06] you can't get the catalog unless you can prove that *you* are that host
[21:00:25] Nemo_bis: you can see in that host's log
[21:00:49] matanya: not the "you" behind the keyword in front of me though
[21:00:55] oh, yeah, the other way is you could have puppet keep copies of catalogs and you could look at it on the puppetmaster
[21:01:04] Nemo_bis: keyboard*
[21:01:26] pointless corrections :P
[21:01:34] !xy
[21:01:38] grrr
[21:01:47] yes I already read that
[21:01:50] !xy is http://meta.stackoverflow.com/a/66378
[21:01:51] Key was added
[21:01:56] Nemo_bis: if you have root access can run puppet agent --test --noop on the node and see :)
[21:02:06] sigh
[21:02:25] I guess my attempt at asking questions without first learning all the terminology failed
[21:02:43] what do you want to do?
[21:02:57] maybe you mean manifest instead of catalog?
[21:03:23] dunno
[21:03:37] well look at the manifests
[21:03:38] Nemo_bis: I spent two weeks in learning the puppet pro before even coming along and it didn't help much :)
[21:03:58] nope
[21:04:06] start with site.pp and search for a host name (e.g. stat1) and then go from there
[21:04:10] Nemo_bis: the node is descibed in the manifest, the catalog is what the node applies
[21:04:28] yep, just read the definition again
[21:04:34] * matanya hopes he is clear
[21:04:39] * jeremyb runs away to catch some remaining light before sun disappears
[21:05:06] !xy del
[21:05:06] Successfully removed xy
[21:05:09] !xy is The XY problem is asking about your attempted *solution* rather than your *actual problem*. http://meta.stackoverflow.com/a/66378
[21:05:09] Key was added
[21:06:17] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 12h 3m 24s
[21:17:24] matanya: ok, so it seems it just installs some special packages there; so I guess puppet resource package --host would be enough to save me manually parsing that code, is there any way to do it?
[21:18:14] Nemo_bis: what packages? something in ubuntu repo?
[21:19:08] presumably; I don't see special sources defined
[21:21:33] so you can use : package { 'nameofpackage' :
[21:21:43] ensure => installed,
[21:21:46] {
[21:21:54] {
[21:21:57] arr
[21:21:59] {
[21:22:01] }
[21:22:29] hm? I want a list of what's there, not to change anything
[21:22:58] on stat1?
[21:23:43] yes
[21:24:42] Nemo_bis: look at manifests/site.pp
[21:25:50] see in line 2549 it has the role::statistics::cruncher line?
[21:25:59] that is what this server is doing.
[21:27:04] sure, I've already read that
[21:27:16] got it, must do the list manually
[21:27:30] grep -r is your friend
[21:27:37] you can see in manifests/role/statistics.pp
[21:27:46] what exactly is done there
[21:28:31] I already found all this
[21:28:45] though I used ack -a :P
[21:29:04] there are tools called ENC (External Node Classifiers) that manage puppet in an easier way, but the wmf doesn't use any
[21:29:25] known examples are hiera and the-foreman
[21:53:17] matanya: you mean `git grep`?? :)
[21:55:01] matanya: OSM is an ENC
[21:55:12] that too
[21:58:57] Nemo_bis: anyway, what's the point? what do you want to answer?
[21:59:31] intellectual curiosities
[21:59:44] then grep away :)
[23:37:32] grrrrrrrr. not liking the formatting for GCI. can't we just use wikitext??? :/
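[Editor's note: in the 21:21 exchange above, the package resource matanya starts typing comes out with mismatched braces, and the 21:04-21:27 advice is to trace a host from manifests/site.pp to its role class rather than trying to fetch the host's catalog. The Puppet sketch below only illustrates that shape; the hostname form, the class body, and the package name are placeholders invented for the example and are not the actual contents of site.pp or role::statistics::cruncher. As quoted at 15:13, a manifest like this can be syntax-checked locally with "puppet parser validate foo.pp".]

    # Illustrative sketch only -- not the real operations/puppet code.
    # site.pp maps a host to a role class (cf. the role::statistics::cruncher
    # line mentioned at 21:25); the hostname here is a placeholder.
    node 'stat1.wikimedia.org' {
        include role::statistics::cruncher
    }

    # The role class is where resources such as packages are declared,
    # using the form being typed at 21:21:
    class role::statistics::cruncher {
        package { 'examplepackage':   # hypothetical package name
            ensure => installed,
        }
    }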