[00:09:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:10:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time
[00:16:08] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host
[00:16:32] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host
[00:17:04] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host
[00:17:14] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host
[00:19:56] binasher: How much of a pain would it be to get the labs dbs shown on dbtree for things like replication? Can we just add them to db-eqiad.php with 0 load so MW doesn't start using them as slaves?
[00:20:03] addshore: ^
[00:20:21] It could be hacked into dbtree as extra hosts to search...
[00:20:23] * Reedy barfs
[00:22:04] Reedy: :<
[00:22:06] # Conversely, all servers which are down or do not replicate should be
[00:22:06] # removed, not set to load zero, because there are certain situations
[00:22:06] # when load zero servers will be used, such as if the others are lagged.
[00:22:46] hackhackhack
[00:23:17] nooooo
[00:24:02] the labsdb hosts don't replicate from the prod masters anyway, so their reported lag isn't necessarily useful
[00:24:33] where is the code for dbtree? :>
[00:24:48] they do get the heartbeat table replicated though, so it can always be determined how far they are behind the real masters
[00:25:12] but that data isn't in ganglia currently
[00:25:41] and dbtree parses gmetad's xml
[00:25:57] addshore: operations/software
[00:26:36] open a ticket to provide a web service displaying the true replag of labsdb instances :)
[00:26:44] will do :)
[00:26:55] I've written it down but I should probably go to sleep first ;p
[01:31:34] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.006409168243 secs
[01:33:24] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.004976272583 secs
[01:53:56] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:56] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:56] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:56] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:57] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:57] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:58] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[01:53:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[02:07:30] !log LocalisationUpdate completed (1.22wmf7) at Sat Jun 22 02:07:29 UTC 2013
[02:07:42] Logged the message, Master
[02:13:00] !log LocalisationUpdate completed (1.22wmf8) at Sat Jun 22 02:13:00 UTC 2013
[02:13:08] Logged the message, Master
[02:19:27] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 22 02:19:26 UTC 2013
[02:19:35] Logged the message, Master
[02:21:48] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
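[Editor's note] For context on the 00:24-00:26 exchange above: a minimal sketch of the kind of "true replag" web service binasher suggests ticketing, assuming a pt-heartbeat-style heartbeat table whose master-written `ts` column (in UTC) is replicated through to each labsdb instance. Host names, credentials and the exact table layout are illustrative, not the actual production setup.

```php
<?php
// Hypothetical sketch only: report labsdb lag behind the real masters by
// comparing the newest replicated heartbeat timestamp against current UTC time.
$instances = array(
	'labsdb-a' => 'labsdb-a.example.wmnet',  // placeholder hosts
	'labsdb-b' => 'labsdb-b.example.wmnet',
);

$lags = array();
foreach ( $instances as $name => $host ) {
	$db = @new mysqli( $host, 'replag_ro', 'not-a-real-password', 'heartbeat' );
	if ( $db->connect_errno ) {
		// Unreachable instance: report unknown rather than a misleading zero.
		$lags[$name] = null;
		continue;
	}
	// Seconds between the newest master-written timestamp and "now" (UTC).
	$res = $db->query(
		'SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag FROM heartbeat'
	);
	$row = $res ? $res->fetch_assoc() : null;
	$lags[$name] = ( $row && $row['lag'] !== null ) ? (int)$row['lag'] : null;
	$db->close();
}

header( 'Content-Type: application/json' );
echo json_encode( $lags );
```

Exposing something like this over HTTP would give dbtree (or anything else) the labsdb lag numbers directly, without first pushing them through ganglia/gmetad as the discussion notes would otherwise be required.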
[02:25:48] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours
[04:15:43] PROBLEM - Disk space on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:53] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:16:03] PROBLEM - SSH on tin is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:16:13] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host
[04:16:54] RECOVERY - RAID on tin is OK: OK: State is Optimal, checked 2 logical device(s)
[04:16:54] RECOVERY - SSH on tin is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:17:03] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host
[04:17:33] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host
[04:17:36] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host
[04:17:43] PROBLEM - DPKG on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:18:33] RECOVERY - DPKG on tin is OK: All packages OK
[04:18:33] RECOVERY - Disk space on tin is OK: DISK OK
[04:20:04] ACKNOWLEDGEMENT - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host asher Planned maintenance
[04:20:07] ACKNOWLEDGEMENT - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host asher Planned maintenance
[04:20:13] ACKNOWLEDGEMENT - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host asher Planned maintenance
[04:20:15] ACKNOWLEDGEMENT - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host asher Planned maintenance
[04:39:31] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[04:40:41] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms
[04:43:21] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused
[04:47:21] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.123 second response time
[05:40:15] PROBLEM - DPKG on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:40:15] PROBLEM - Disk space on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:40:15] PROBLEM - RAID on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:40:15] PROBLEM - SSH on tin is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:45:15] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:50:15] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:52:15] PROBLEM - NTP on tin is CRITICAL: NTP CRITICAL: No response from NTP server
[07:00:21] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[07:27:39] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.110 second response time
[08:20:01] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours
[08:29:41] RECOVERY - Puppet freshness on ms-be2 is OK: puppet ran at Sat Jun 22 08:29:33 UTC 2013
[09:02:28] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.00237929821 secs
[09:34:17] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.004452943802 secs
[10:53:07] PROBLEM - NTP on nescio is CRITICAL: NTP CRITICAL: Offset unknown
[10:53:07] PROBLEM - carbon-cache.py on professor is CRITICAL: PROCS CRITICAL: 2 processes with args carbon-cache.py
[10:53:08] PROBLEM - search indices - check lucene status page on search1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 53403 bytes in 0.010 second response time
[10:53:08] PROBLEM - search indices - check lucene status page on search16 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58689 bytes in 0.109 second response time
[10:53:08] PROBLEM - RAID on db45 is CRITICAL: CRITICAL: Degraded
[10:53:08] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60534 bytes in 0.010 second response time
[10:53:08] PROBLEM - twemproxy process on tmh1001 is CRITICAL: NRPE: Command check_twemproxy not defined
[10:53:17] PROBLEM - twemproxy process on terbium is CRITICAL: NRPE: Command check_twemproxy not defined
[10:55:07] PROBLEM - SSH on virt1 is CRITICAL: Connection refused
[10:55:17] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1093.5232 (gt 1000)
[10:56:07] PROBLEM - SSH on virt3 is CRITICAL: Connection refused
[10:56:07] PROBLEM - DPKG on virt6 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:08:29] PROBLEM - search indices - check lucene status page on search1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 56311 bytes in 0.011 second response time
[11:08:29] PROBLEM - search indices - check lucene status page on search1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60534 bytes in 0.008 second response time
[11:08:29] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60003 bytes in 0.110 second response time
[11:08:29] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 53403 bytes in 0.011 second response time
[11:08:29] PROBLEM - twemproxy process on tmh2 is CRITICAL: NRPE: Command check_twemproxy not defined
[11:08:39] PROBLEM - NTP on tin is CRITICAL: NTP CRITICAL: No response from NTP server
[11:23:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:24:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time
[11:53:58] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:58] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:58] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:58] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:58] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:58] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:59] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[11:53:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:54:00] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[11:54:00] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[12:07:38] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:08:28] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:21:58] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[12:25:58] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours
[13:52:43] PROBLEM - search indices - check lucene status page on search18 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55856 bytes in 0.117 second response time
[15:45:37] !log tin is down, dead to SSH
[15:45:45] Logged the message, Master
[16:05:57] PROBLEM - Host tin is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:27] RECOVERY - SSH on tin is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[16:06:27] RECOVERY - Disk space on tin is OK: DISK OK
[16:06:37] RECOVERY - DPKG on tin is OK: All packages OK
[16:06:37] RECOVERY - Host tin is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[16:07:07] RECOVERY - RAID on tin is OK: OK: State is Optimal, checked 2 logical device(s)
[17:00:51] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[17:19:40] anyone know which redis module we use?
[17:23:35] https://github.com/nicolasff/phpredis
[17:26:15] ori-l: I mean puppet module :)
[17:26:29] hmm, should've clarified, even though i'm on the ops channel
[17:26:34] ori-l: also why aren't you decompressing?
[17:26:50] * ConfusedPanda is cloning operations/puppet
[18:20:09] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours
[18:43:07] New patchset: Alex Monk; "Re-enable captcha on ptwiki per overwhelming community consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[18:46:46] ?
[18:47:09] New review: Odder; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/69982
[18:48:52] odder: No it's not. Remove your -1 please.
[18:49:27] no.
[18:49:28] The objections on bugzilla are not from ptwiki community members, and are not technical concerns.
[18:49:40] New review: Alex Monk; "No it's not." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[18:49:50] Reedy, ^
[18:50:50] odder, what do you mean 'no'? There is valid community consensus in favour of this, it is NOT shellpolicy
[18:52:11] Krenair: last time I checked, shellpolicy also meant 'there is consensus for this change, but it is against movement principles'
[18:52:24] No it doesn't
[18:52:45] if people are having problems with the fact that there is larger involvement of anonymous editors in a Wikipedia, then I'm speechless.
[18:52:55] "Enable CAPTCHA for all edits of non-confirmed users on pt.wikipedia in order to reduce editing activity"
[18:53:04] Whether or not it is against 'movement principles' is 100% opinion, it cannot be made into a keyword
[18:53:53] And is certainly not a valid reason to -1 in gerrit
[18:54:37] Krenair: of course, principles are such are 100% opinions
[18:54:41] as such*
[18:55:26] If the global community on meta disagreed, then you might have a case
[18:55:50] disagree with what?
[18:55:57] This change
[18:56:18] What change?
[18:56:26] The one you just -1'd
[18:56:52] How can a global community disagree with something that's related just to the pt.wiki and has been posted 10 minutes ago?
[18:57:14] But at the moment you are just interfering with the autonomy of a wiki community
[18:57:37] They could've disagreed with it before it was even put into gerrit, when the original concerns were raised on bugzilla
[18:57:52] no such thing has happened, so this change is valid
[18:57:59] that's your opinion.
[18:58:11] No, it's fact.
[18:58:33] this is also your opinion.
[19:09:57] Krenair: the problem with this request is that there was no significant explanation of what the other ways of dealing with vandalism are
[19:10:13] Krenair: and the fact that it works /against/ having wider participation in the project
[19:10:39] Krenair: WMF-made statistics show a 58% increase of valid (positive) IP edits during the month after CAPTCHA was disabled
[19:11:15] of which how many are legitimate?
[19:11:40] by valid (positive) you mean non-vandalism etc.?
[19:12:11] If this is true, you likely have a good reason to go to the ptwiki vote page and vote 'no'.
[19:12:23] You do not have a good reason to vote -1 on the mediawiki-config gerrit change.
[19:12:53] Krenair: they are all legitimate, i.e. non-vandalism
[19:13:00] Krenair: the vote is closed.
[19:13:08] guess you missed your chance then
[19:13:24] I didn't, I'm not a contributor to that wiki
[19:13:35] (and I'm also affected by their CAPTCHA, by the way :)
[19:14:38] You only need to make another edit to become autoconfirmed I think
[19:14:48] autoconfirmed can skipcaptcha
[19:15:03] Yes.
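[Editor's note] On the "autoconfirmed can skipcaptcha" point above: with the ConfirmEdit extension, 'skipcaptcha' is the user right that lets a group bypass the captcha, so enabling captcha on edits only affects non-confirmed users. A minimal sketch in plain LocalSettings.php terms; the actual change under discussion (https://gerrit.wikimedia.org/r/69982) goes through the operations/mediawiki-config wrapper settings, which look different.

```php
<?php
// Illustrative ConfirmEdit settings only, not the ptwiki wmf-config change.
require_once "$IP/extensions/ConfirmEdit/ConfirmEdit.php";

// Ask for a captcha on edits and page creations...
$wgCaptchaTriggers['edit']   = true;
$wgCaptchaTriggers['create'] = true;

// ...but let established accounts through: 'skipcaptcha' is what
// "autoconfirmed can skipcaptcha" refers to.
$wgGroupPermissions['autoconfirmed']['skipcaptcha'] = true;
$wgGroupPermissions['bot']['skipcaptcha'] = true;
$wgGroupPermissions['sysop']['skipcaptcha'] = true;
```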
[21:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[21:32:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:33:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[21:54:07] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:07] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:07] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:07] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:07] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:07] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:08] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:08] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[21:54:09] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[21:56:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[22:10:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:11:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.112 second response time
[22:20:30] RECOVERY - search indices - check lucene status page on search20 is OK: HTTP OK: HTTP/1.1 200 OK - 60075 bytes in 0.117 second response time
[22:22:50] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[22:26:50] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours
[22:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[22:38:48] New review: MZMcBride; "Local community consensus is certainly clear in this case. The question has become whether a Wikimed..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[22:45:03] New review: MZMcBride; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[22:59:29] New patchset: Alex Monk; "Re-enable captcha on ptwiki per overwhelming community consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[23:01:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:02:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[23:02:42] New review: Ori.livneh; "The Code-Review score on Gerrit should be limited to reviewing code. Please take policy questions ba..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69982
[23:07:30] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:07:40] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:30:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time