[00:01:17] the problem with overlapping nagios users. how's your s1 slave btw, were you able to find heartbeat lines in your relay log?
[00:02:05] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.223 seconds
[00:04:22] binasher: didn't look for them until now. If there is any damage possible, it already happened; so no need to change the start-position again in my eyes
[00:05:55] PROBLEM - Disk space on db1042 is CRITICAL: Connection refused by host
[00:05:55] RECOVERY - MySQL disk space on db30 is OK: DISK OK
[00:06:01] DaBPunkt: do you still want a fresh copy of enwiki?
[00:07:31] binasher: I will wait until our slave is live again and then run some tests. If I find problems, I will ask for a dump; if I find nothing, then not.
[00:07:35] PROBLEM - DPKG on srv300 is CRITICAL: Connection refused by host
[00:07:35] PROBLEM - DPKG on srv225 is CRITICAL: Connection refused by host
[00:09:15] PROBLEM - RAID on mw1129 is CRITICAL: Connection refused by host
[00:09:25] PROBLEM - DPKG on srv263 is CRITICAL: Connection refused by host
[00:09:25] PROBLEM - Disk space on srv300 is CRITICAL: Connection refused by host
[00:10:15] PROBLEM - DPKG on mw1129 is CRITICAL: Connection refused by host
[00:10:45] PROBLEM - RAID on mw69 is CRITICAL: Connection refused by host
[00:11:25] PROBLEM - Disk space on srv263 is CRITICAL: Connection refused by host
[00:12:15] PROBLEM - Disk space on mw1129 is CRITICAL: Connection refused by host
[00:13:05] RECOVERY - mysqld processes on db36 is OK: PROCS OK: 1 process with command name mysqld
[00:13:47] PROBLEM - DPKG on mw1147 is CRITICAL: Connection refused by host
[00:14:15] PROBLEM - DPKG on srv199 is CRITICAL: Connection refused by host
[00:14:25] binasher: do you plan to change the nagios-access on the other clusters too?
[00:14:35] PROBLEM - DPKG on db1042 is CRITICAL: Connection refused by host
[00:14:45] PROBLEM - RAID on srv300 is CRITICAL: Connection refused by host
[00:15:05] PROBLEM - RAID on srv225 is CRITICAL: Connection refused by host
[00:15:45] DaBPunkt: don't think so, s7 was the last of them
[00:16:23] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host
[00:16:33] PROBLEM - RAID on srv193 is CRITICAL: Connection refused by host
[00:16:33] PROBLEM - DPKG on srv216 is CRITICAL: Connection refused by host
[00:16:43] PROBLEM - RAID on srv263 is CRITICAL: Connection refused by host
[00:16:43] PROBLEM - RAID on srv269 is CRITICAL: Connection refused by host
[00:17:03] PROBLEM - MySQL disk space on db1042 is CRITICAL: Connection refused by host
[00:17:13] PROBLEM - Disk space on ganglia1001 is CRITICAL: Connection refused by host
[00:17:33] RECOVERY - DPKG on srv300 is OK: All packages OK
[00:17:40] New patchset: Asher; "monitoring for all core dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2135
[00:17:42] !log py synchronized wmf-config/CommonSettings.php 'changing eqiad cp1001-cp1020 IPs to their new, private IPs'
[00:17:43] Logged the message, Master
[00:17:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2135
[00:18:03] PROBLEM - Disk space on mw69 is CRITICAL: Connection refused by host
[00:18:13] PROBLEM - DPKG on ganglia1001 is CRITICAL: Connection refused by host
[00:18:13] PROBLEM - DPKG on srv193 is CRITICAL: Connection refused by host
[00:18:23] PROBLEM - Disk space on srv225 is CRITICAL: Connection refused by host
[00:18:33] PROBLEM - DPKG on srv269 is CRITICAL: Connection refused by host
[00:18:33] RECOVERY - Disk space on srv300 is OK: DISK OK
[00:18:38] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2135
[00:18:38] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2135
[00:19:33] RECOVERY - DPKG on mw1129 is OK: All packages OK
[00:19:33] RECOVERY - DPKG on mw1147 is OK: All packages OK
[00:20:13] PROBLEM - Disk space on srv193 is CRITICAL: Connection refused by host
[00:20:23] PROBLEM - DPKG on srv261 is CRITICAL: Connection refused by host
[00:20:23] PROBLEM - Disk space on srv216 is CRITICAL: Connection refused by host
[00:20:53] Hello! I had a question about templates. Is there a way to make dynamic templates, say for example I want to include the parameters rownum and colnum to dynamically change the number of rows and columns in a table?
[00:21:23] PROBLEM - Disk space on srv264 is CRITICAL: Connection refused by host
[00:21:33] RECOVERY - DPKG on srv263 is OK: All packages OK
[00:21:33] RECOVERY - Disk space on mw1129 is OK: DISK OK
[00:22:03] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host
[00:22:33] PROBLEM - Disk space on srv261 is CRITICAL: Connection refused by host
[00:22:33] PROBLEM - DPKG on srv264 is CRITICAL: Connection refused by host
[00:22:43] PROBLEM - RAID on srv261 is CRITICAL: Connection refused by host
[00:22:53] RECOVERY - DPKG on db1042 is OK: All packages OK
[00:23:13] PROBLEM - RAID on ganglia1001 is CRITICAL: Connection refused by host
[00:23:23] PROBLEM - Disk space on srv269 is CRITICAL: Connection refused by host
[00:23:33] RECOVERY - DPKG on srv199 is OK: All packages OK
[00:24:13] PROBLEM - RAID on srv264 is CRITICAL: Connection refused by host
[00:24:13] RECOVERY - Disk space on srv263 is OK: DISK OK
[00:24:23] PROBLEM - RAID on srv216 is CRITICAL: Connection refused by host
[00:24:23] PROBLEM - Disk space on srv223 is CRITICAL: Connection refused by host
[00:24:23] RECOVERY - RAID on srv225 is OK: OK: no RAID installed
[00:25:53] RECOVERY - Disk space on db1042 is OK: DISK OK
[00:26:03] RECOVERY - RAID on mw1129 is OK: OK: no RAID installed
[00:26:23] RECOVERY - RAID on srv300 is OK: OK: no RAID installed
[00:26:33] RECOVERY - DPKG on snapshot1001 is OK: All packages OK
[00:26:53] PROBLEM - RAID on srv190 is CRITICAL: Connection refused by host
[00:26:53] RECOVERY - RAID on srv193 is OK: OK: no RAID installed
[00:26:53] RECOVERY - RAID on srv263 is OK: OK: no RAID installed
[00:27:03] RECOVERY - RAID on srv269 is OK: OK: no RAID installed
[00:27:23] PROBLEM - RAID on mw1142 is CRITICAL: Connection refused by host
[00:27:23] RECOVERY - MySQL disk space on db1042 is OK: DISK OK
[00:27:23] RECOVERY - Disk space on ganglia1001 is OK: DISK OK
[00:27:53] RECOVERY - DPKG on srv225 is OK: All packages OK
[00:28:03] RECOVERY - Disk space on mw69 is OK: DISK OK
[00:28:13] PROBLEM - DPKG on srv235 is CRITICAL: Connection refused by host
[00:28:23] PROBLEM - DPKG on srv190 is CRITICAL: Connection refused by host
[00:28:23] RECOVERY - DPKG on srv193 is OK: All packages OK
[00:28:23] RECOVERY - DPKG on ganglia1001 is OK: All packages OK
[00:28:33] RECOVERY - Disk space on srv225 is OK: DISK OK
[00:28:33] RECOVERY - DPKG on srv269 is OK: All packages OK
[00:28:43] PROBLEM - RAID on srv223 is CRITICAL: Connection refused by host
[00:28:43] PROBLEM - Disk space on srv235 is CRITICAL: Connection refused by host
[00:29:33] New patchset: Asher; "experiment at avoiding endlessly appending to nagios svc check files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2136
[00:29:43] PROBLEM - DPKG on mw1142 is CRITICAL: Connection refused by host
[00:29:46] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2136
[00:30:03] RECOVERY - RAID on mw69 is OK: OK: no RAID installed
[00:30:23] PROBLEM - Disk space on srv190 is CRITICAL: Connection refused by host
[00:30:23] RECOVERY - Disk space on srv193 is OK: DISK OK
[00:30:23] PROBLEM - RAID on srv204 is CRITICAL: Connection refused by host
[00:30:43] PROBLEM - RAID on srv210 is CRITICAL: Connection refused by host
[00:30:43] RECOVERY - Disk space on srv216 is OK: DISK OK
[00:30:53] RECOVERY - DPKG on srv261 is OK: All packages OK
[00:31:33] RECOVERY - Disk space on srv264 is OK: DISK OK
[00:32:03] PROBLEM - RAID on mw25 is CRITICAL: Connection refused by host
[00:32:23] RECOVERY - RAID on snapshot1001 is OK: OK: no RAID installed
[00:32:33] PROBLEM - DPKG on srv223 is CRITICAL: Connection refused by host
[00:32:43] RECOVERY - Disk space on srv261 is OK: DISK OK
[00:32:43] RECOVERY - DPKG on srv264 is OK: All packages OK
[00:32:53] RECOVERY - RAID on srv261 is OK: OK: no RAID installed
[00:33:03] PROBLEM - Disk space on mw1142 is CRITICAL: Connection refused by host
[00:33:23] RECOVERY - RAID on ganglia1001 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[00:33:53] RECOVERY - Disk space on srv269 is OK: DISK OK
[00:34:03] PROBLEM - Disk space on srv204 is CRITICAL: Connection refused by host
[00:34:03] PROBLEM - DPKG on srv204 is CRITICAL: Connection refused by host
[00:34:23] RECOVERY - RAID on srv264 is OK: OK: no RAID installed
[00:34:33] PROBLEM - Disk space on srv210 is CRITICAL: Connection refused by host
[00:34:33] RECOVERY - RAID on srv216 is OK: OK: no RAID installed
[00:34:36] binasher: ok. good night
[00:34:43] RECOVERY - Disk space on srv223 is OK: DISK OK
[00:34:47] night!
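The wave of "Connection refused by host" PROBLEMs above, each cleared a few minutes later by a matching RECOVERY, is the signature of the NRPE agent on each host being stopped or restarted (for instance while puppet rewrites its check configuration), not of actual disk, RAID, or package failures. The same error can be reproduced from the monitoring host; a minimal sketch, assuming the stock Debian plugin path and a hypothetical check name:

    # Ask the NRPE daemon on srv300 to run its disk check. While the daemon
    # is down or restarting, this fails with the same "Connection refused"
    # that the alerts above report.
    /usr/lib/nagios/plugins/check_nrpe -H srv300 -c check_disk_space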
[00:35:03] PROBLEM - DPKG on mw25 is CRITICAL: Connection refused by host
[00:35:03] gn8 folks
[00:36:13] PROBLEM - RAID on mw1115 is CRITICAL: Connection refused by host
[00:36:33] PROBLEM - Disk space on mw25 is CRITICAL: Connection refused by host
[00:36:53] RECOVERY - DPKG on srv216 is OK: All packages OK
[00:37:03] RECOVERY - RAID on srv190 is OK: OK: no RAID installed
[00:38:13] PROBLEM - DPKG on srv288 is CRITICAL: Connection refused by host
[00:38:33] RECOVERY - DPKG on srv190 is OK: All packages OK
[00:38:43] RECOVERY - DPKG on srv235 is OK: All packages OK
[00:38:53] RECOVERY - RAID on srv223 is OK: OK: no RAID installed
[00:38:53] PROBLEM - Disk space on srv288 is CRITICAL: Connection refused by host
[00:38:53] RECOVERY - Disk space on srv235 is OK: DISK OK
[00:39:53] RECOVERY - DPKG on mw1142 is OK: All packages OK
[00:40:03] PROBLEM - DPKG on mw1115 is CRITICAL: Connection refused by host
[00:40:33] PROBLEM - RAID on srv191 is CRITICAL: Connection refused by host
[00:40:43] RECOVERY - Disk space on srv190 is OK: DISK OK
[00:40:43] RECOVERY - RAID on srv210 is OK: OK: no RAID installed
[00:41:53] PROBLEM - Disk space on mw1105 is CRITICAL: Connection refused by host
[00:41:53] PROBLEM - Disk space on mw1115 is CRITICAL: Connection refused by host
[00:42:13] RECOVERY - RAID on mw25 is OK: OK: no RAID installed
[00:42:33] PROBLEM - DPKG on srv191 is CRITICAL: Connection refused by host
[00:42:53] RECOVERY - DPKG on srv223 is OK: All packages OK
[00:43:23] RECOVERY - Disk space on mw1142 is OK: DISK OK
[00:44:13] RECOVERY - DPKG on srv204 is OK: All packages OK
[00:44:23] RECOVERY - Disk space on srv204 is OK: DISK OK
[00:44:44] RECOVERY - Disk space on srv210 is OK: DISK OK
[00:45:13] RECOVERY - DPKG on mw25 is OK: All packages OK
[00:46:23] PROBLEM - DPKG on mw1105 is CRITICAL: Connection refused by host
[00:46:23] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed
[00:46:33] PROBLEM - RAID on mw1105 is CRITICAL: Connection refused by host
[00:46:53] RECOVERY - Disk space on mw25 is OK: DISK OK
[00:47:43] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[00:48:33] RECOVERY - DPKG on srv288 is OK: All packages OK
[00:49:03] RECOVERY - Disk space on srv288 is OK: DISK OK
[00:50:13] RECOVERY - DPKG on mw1115 is OK: All packages OK
[00:50:43] RECOVERY - RAID on srv204 is OK: OK: no RAID installed
[00:51:03] RECOVERY - RAID on srv191 is OK: OK: no RAID installed
[00:52:03] RECOVERY - Disk space on mw1115 is OK: DISK OK
[00:52:03] RECOVERY - Disk space on mw1105 is OK: DISK OK
[00:52:53] RECOVERY - DPKG on srv191 is OK: All packages OK
[00:56:33] RECOVERY - DPKG on mw1105 is OK: All packages OK
[00:56:44] RECOVERY - RAID on mw1105 is OK: OK: no RAID installed
[01:00:18] Change abandoned: Asher; "fail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2136
[01:03:57] New patchset: Ottomata; "Tab -> spaces, formatting changes for PEP8." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2137
[01:03:58] New patchset: Ottomata; "observation.py - fixed __str__ method" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2138
[01:09:17] New patchset: Ottomata; "Changes to Pipeline classes + unit tests. Need to talk about this with Diederik (which is why it is in a new branch!)" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2139
[01:13:16] ottomata: I don't know about the new branch thing but you did just submit to master
[01:14:07] eh?
[01:14:08] really?
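The push output just below explains the surprise: a push-for-review alias that hard-codes refs/for/master submits whatever is checked out for review against master, whichever branch it was meant for. A branch-aware alias avoids that; a sketch only, assuming the gerrit remote is named origin (not necessarily the alias ottomata actually used):

    # "git review": push HEAD for review against whichever branch is checked out
    git config alias.review '!git push origin HEAD:refs/for/$(git rev-parse --abbrev-ref HEAD)'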
[01:14:19] To ssh://gerrit.wikimedia.org:29418/analytics/reportcard.git
[01:14:19] * [new branch] HEAD -> refs/for/master
[01:14:24] OHHH
[01:14:34] because my push for review alias does that
[01:14:34] grrr
[01:27:51] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 172315 seconds
[01:29:11] PROBLEM - MySQL Replication Heartbeat on db48 is CRITICAL: NRPE: Unable to read output
[01:29:16] New patchset: Pyoungmeister; "adding cp1001 and 1002 as ganglia agregators for eqiad text squids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2140
[01:29:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2140
[01:30:32] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2140
[01:30:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2140
[01:33:01] PROBLEM - MySQL Replication Heartbeat on db49 is CRITICAL: NRPE: Unable to read output
[01:39:13] New patchset: Catrope; "WIP puppetization of fatal error log (RT 623). DO NOT MERGE" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2141
[01:58:36] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: NRPE: Unable to read output
[02:03:46] PROBLEM - MySQL Replication Heartbeat on db1048 is CRITICAL: NRPE: Unable to read output
[02:05:25] !log LocalisationUpdate completed (1.18) at Fri Jan 27 02:05:24 UTC 2012
[02:05:27] Logged the message, Master
[02:28:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[02:40:43] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:38] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:26:18] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:33:48] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[04:41:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:49:13] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2139
[04:50:01] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2138
[04:51:30] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2137
[04:51:31] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2139
[04:51:31] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2138
[04:51:31] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2137
[05:50:24] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:07:16] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[09:07:16] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[09:41:05] Is it a known issue that the blacklist can be circumvented by having protocol relative links?
[09:55:08] not widely known I think
[09:56:10] as protocol relative links were developed more locally, and it is not something that I wish to overtly yell about, what is the process to notate?
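For context on the blacklist question above: a protocol-relative link omits the scheme entirely, so a blacklist pattern that only matches from an explicit "http://" onward never sees it. Illustratively, with a hypothetical domain:

    http://spam.example.com/page   <- caught by a pattern anchored on the scheme
    //spam.example.com/page        <- same destination once the browser fills in
                                      the scheme, but nothing for "http://" to match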
bugzilla should be fine [09:57:31] okay
[09:57:36] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 402461 MB (3% inode=99%):
[09:57:49] I might go and just do some more double checking
[09:58:02] maybe adding some relevant devs to CC of that bug... and check that there isn't a bug already
[09:58:04] who stole all the inodes!!!
[09:58:13] yessir
[09:58:30] sDrewth: only 1% of inodes are in use
[09:58:46] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 395279 MB (3% inode=99%):
[09:59:43] ah, ok, well I got that back to front :-) (-:
[10:00:16] Nikerabbit: any dev who you would consider more pertinent?
[10:07:13] roan did most of the work to get protocol relative links working
[10:12:46] k thx
[10:18:21] Reedy, did you see the last comment on https://bugzilla.wikimedia.org/show_bug.cgi?id=16112 ?
[10:18:53] Yeah
[10:18:54] And? :P
[10:18:59] How can we get some maintenance scripts run; do I have to open a thread on wikitech? [Never got any result that way.]
[10:21:40] It's not acceptable that such scripts are not run, let's say, once in 3-4 years if there's a request.
[11:06:45] RECOVERY - MySQL slave status on es1004 is OK: OK:
[11:07:46] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33864 - Flood flag on sr.wiki'
[11:07:48] Logged the message, Master
[11:11:44] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33960 - Import sources for etwiki, etwikisource and etwiktionary'
[11:11:45] Logged the message, Master
[11:18:33] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33862 - Request for logo change in Tamil Wikiquote'
[11:18:34] Logged the message, Master
[11:29:26] !log reedy synchronized wmf-config/InitialiseSettings.php 'th wikilogos'
[11:29:27] Logged the message, Master
[11:34:21] !log reedy synchronized wmf-config/InitialiseSettings.php 'th wikilogos'
[11:34:23] Logged the message, Master
[12:39:37] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[12:51:37] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:17] <^demon> !log gallium: cleaning up /tmp again, tests really need to clean up after themselves.
[13:36:20] Logged the message, Master
[14:44:54] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[17:32:15] New patchset: Diederik; "Initial commit" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142
[17:44:47] hmm, trying to figure out some git stuff, anyone around?
[17:45:27] trying to write a generic push for review alias where it pushes to the correct place regardless of what repo or branch I am currently in
[18:25:04] Evening guys, just a quick ask - does anyone here know if we're blocked completely throughout India? I'm speaking with yannf, who tells us that normal routes to Wikipedia are blocked through his ISP in India.
[18:25:25] He's with a state-owned company called BSNL
[18:26:05] He can connect via secure.wikimedia.org, but not via https://en.wikipedia.org for example
[18:26:58] ls
[18:37:45] hello
[18:38:03] I have just found that BSNL is blocking Wikipedia in India
[18:38:20] usually I just get an error message
[18:38:39] but now I get a page of advertising about web hosting
[18:39:10] *.wikimedia.org works
[18:39:17] i.e. secure, commons, etc.
[18:39:28] I get ads for any *.wikipedia.org
[18:44:54] yannf: nice
[18:44:57] blackout!
[18:45:22] maybe someone hijacked routes!
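A hijacked route and a hijacked resolver look much the same from a browser, but they are easy to tell apart: ask the locally configured resolver and a known-good public one for the same name and compare the answers. A quick sketch (8.8.8.8 is Google's public resolver, suggested a few lines below):

    # answer from whatever resolver /etc/resolv.conf points at
    dig +short en.wikipedia.org
    # answer from a known-good public resolver, for comparison
    dig +short en.wikipedia.org @8.8.8.8

If the two answers differ, the problem is the resolver (or the path to it), not the route to the site.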
[18:45:33] yannf: can you show traceroute?
[18:46:19] domas: if they can reach wikimedia it's _probably_ due to interception by a dns recursive resolver.
[18:46:33] yannf: can you try switching to some open access dns server like 8.8.8.8?
[18:47:19] gmaxwell: that is an interesting thing for me, which IP it resolves to and where it is being sent
[18:47:25] traceroute gives both!
[18:47:25] :)
[18:47:40] ah, yannf gave an ip but in another channel.
[18:47:51] also can be browser toolbar! :)
[18:47:55] en.wikipedia.org gives 212.113.36.83
[18:48:17] Ukraine? :)
[18:48:31] are you sure it is not a single-computer problem?
[18:48:44] smells kinda hackerish rather than censorish.
[18:48:48] yeh
[18:48:55] hackerish obviously
[18:49:03] I never trusted those Ukrainians
[18:50:18] domas, http://pastebin.com/w0Xj4Uvg
[18:51:14] yannf: yeh, IP alone already reveals that it is something extremely stupid, as in hacked resolver
[18:51:30] yannf: that's not any of our ip addresses
[18:51:42] (missed the conversation, but saw the pastebin and yelped)
[18:54:12] http://212.113.36.83/ is the ad I see
[18:54:52] yannf: that's a stupid ad to show once you hack some resolver to point to Wikipedia :(
[18:55:09] yannf: what's your isp/resolver?
[18:55:27] how is it that other websites work?
[18:55:41] 109.74.196.50
[18:56:12] your dns resolver is a linode node?
[18:56:22] no idea
[18:56:35] heh
[18:56:41] that's what I have in /etc/resolv.conf
[18:58:20] that seems like the problem there…
[18:58:49] I think that is a bit smaller impact than whole India then
[18:58:57] \o/
[18:59:02] * domas continues doing whatever I was doing
[19:00:42] yannf: i'd call your isp and find out what resolver you are supposed to use, then what OS are you running? no matter what it is i would change all the passwords on it and revoke anyone's account who is shady
[19:01:48] /etc/resolv.conf can only be set via root or using information from a dhcp server.. so either your system is compromised, the dhcp server is compromised, or there is a rogue dhcp server on your network.
[19:03:08] that IP doesn't seem to be owned by the ISP
[19:15:53] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out
[19:15:53] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out
[19:15:53] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out
[19:15:53] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Cannot make SSL connection
[19:18:13] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[19:18:13] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[19:20:23] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.175 second response time
[19:20:23] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.161 second response time
[19:20:23] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.171 second response time
[19:20:23] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.157 second response time
[19:30:14] New review: Catrope; "Comments inline. I apologize for my pedantry :)" [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142
[19:32:23] RECOVERY - RAID on ms-fe2 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[19:32:23] RECOVERY - RAID on ms-fe1 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[19:35:33] RECOVERY - DPKG on ms-fe1 is OK: All packages OK
[19:35:43] RECOVERY - DPKG on ms-fe2 is OK: All packages OK
[19:39:43] RECOVERY - Disk space on ms-fe1 is OK: DISK OK
[19:40:03] RECOVERY - Disk space on ms-fe2 is OK: DISK OK
[19:41:23] RECOVERY - Memcached on ms-fe1 is OK: TCP OK - 0.003 second response time on port 11211
[19:46:02] New patchset: Asher; "db54 -> s2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2143
[19:46:29] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2143
[19:46:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2143
[19:50:51] * Reedy eyes ialex
[19:52:09] Reedy: yes?
[19:53:11] Are you here to commit more code? :D
[19:54:04] bleh
[19:54:05] wrong channel
[19:54:06] meh
[20:00:25] New patchset: Bhartshorne; "inserting real AUTH key for pmtpa prod swift cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2144
[20:00:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2144
[20:00:43] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2144
[20:00:43] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2144
[20:04:47] New patchset: Bhartshorne; "adding in accept all traffic from localhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2145
[20:05:04] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2145
[20:05:04] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2145
[20:05:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2145
[20:17:52] PROBLEM - RAID on ms-fe1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:02] PROBLEM - RAID on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:32:12] PROBLEM - DPKG on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:32:22] PROBLEM - DPKG on ms-fe1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:34:02] PROBLEM - Disk space on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:34:03] PROBLEM - Disk space on ms-fe1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:49:02] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:53:21] <^demon> !log gallium: clearing /tmp yet again. Aaron claims he's fixing it now
[20:53:23] Logged the message, Master
[21:18:20] PROBLEM - mysqld processes on db54 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:42:32] !log pausing payments queue consumption in jenkins to backup and then run some db updates
[21:42:33] Logged the message, Master
[21:57:28] !log updates complete, re-enabling queue consumption on jenkins on aluminium
[21:57:29] Logged the message, Master
[22:23:19] RECOVERY - Disk space on ms-fe2 is OK: DISK OK
[22:25:10] New patchset: Bhartshorne; "adding iptables rules to allow connections to port 80 for anybody that wants to talk to swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2146
[22:25:27] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2146
[22:25:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2146
[22:25:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2146
[22:38:44] zzz =_=
[22:48:29] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:49:49] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[22:57:29] PROBLEM - Disk space on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:02:59] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:05] so, for random technical folks here, who are not wmf employees… would you be interested in a panel of operations folks talking about operations?
[23:19:27] LeslieCarr: a panel? like a notification tray widget? Random wikidown excuse generator? ( :) )
[23:20:12] haha
[23:20:18] oh i was thinking wikimania conference :)
[23:20:24] or does that just sound boring?
[23:20:32] i do like the idea of a random wikidown generator though :)
[23:20:59] LeslieCarr: No, it's not boring. I'd expect it to be one of the better attended things.
[23:21:47] (Esp if you pitch it partially as "hilarious ways we've ended the universe, or at least made wikipedia unusable")
[23:22:27] oh
[23:22:37] that's a good idea :) though most of them have been accidental ;)
[23:22:55] (though it might not be too honest to do that since operations hasn't really ended the world _that much_)
[23:23:19] how we prevented the developers from ending the universe?
[23:24:36] LeslieCarr: "...or at least undid the destruction after the fact"
[23:52:20] New patchset: Lcarr; "Changing the ganglia redirect to /latest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2147
[23:52:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2147
[23:52:45] RECOVERY - mysqld processes on db54 is OK: PROCS OK: 1 process with command name mysqld
[23:52:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2147
[23:52:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2147
[23:56:15] New patchset: Bhartshorne; "moving pmtpa swift cluster to treat upload as its backend while populating the cache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2148
[23:56:31] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2148
[23:56:32] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2148
[23:56:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2148
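The two iptables changes merged above (r2145 and r2146) appear in the log only as their commit messages; the actual puppet diffs are not shown. For illustration, rules of the kind those messages describe would look roughly like this; a hypothetical sketch, not the merged changes:

    # r2145, "accept all traffic from localhost": allow everything on loopback
    iptables -A INPUT -i lo -j ACCEPT
    # r2146, "allow connections to port 80": accept inbound HTTP from any source
    iptables -A INPUT -p tcp --dport 80 -j ACCEPT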