[00:00:29] RECOVERY - NTP on mw1161 is OK: NTP OK: Offset -0.002508997917 secs
[00:05:17] RECOVERY - MySQL disk space on neon is OK: DISK OK
[00:25:23] RECOVERY - Puppet freshness on stafford is OK: puppet ran at Sat Feb 16 00:25:12 UTC 2013
[00:28:50] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.117 second response time
[00:34:33] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49358
[00:34:42] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49358
[00:39:02] paravoid, could you explain why you don't like saving the previous state?
[00:41:47] how many checks have you seen that do what you did?
[00:43:32] PROBLEM - Apache HTTP on mw98 is CRITICAL: Connection refused
[00:44:08] PROBLEM - Apache HTTP on mw93 is CRITICAL: Connection refused
[00:44:53] PROBLEM - Apache HTTP on mw95 is CRITICAL: Connection refused
[00:45:12] RobH: that you?
[00:45:39] paravoid: those aren't in prod yet
[00:45:40] I don't think
[00:46:06] yeah
[00:46:21] and also in pmtpa....
[00:46:24] so no traffic anyway
[00:46:32] PROBLEM - Apache HTTP on mw124 is CRITICAL: Connection refused
[00:48:05] replying to a question with another question? okay, I haven't seen many Nagios plugins in my life. maybe I would've preferred to see as few of them as possible, but then we would still have to guess whether a service is alive or not
[00:51:11] PROBLEM - Apache HTTP on mw97 is CRITICAL: Connection refused
[00:51:29] PROBLEM - Apache HTTP on mw121 is CRITICAL: Connection refused
[00:52:23] PROBLEM - Apache HTTP on mw94 is CRITICAL: Connection refused
[00:53:17] PROBLEM - Apache HTTP on mw125 is CRITICAL: Connection refused
[00:53:35] PROBLEM - Apache HTTP on mw117 is CRITICAL: Connection refused
[00:53:51] what's going on?
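The argument above (why not save previous state?) is about the usual Nagios plugin contract: a plugin prints one status line and exits 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), and most plugins judge each run in isolation rather than comparing against a saved sample. A minimal stateless sketch, with a hypothetical service and thresholds (not any actual WMF check):

```python
# Stateless Nagios-style plugin sketch: every invocation is judged on its
# own measurement, so no state file survives between runs.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(value, warn, crit):
    """Map one freshly measured value onto a Nagios exit code; no history needed."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

def status_line(service, value, status):
    """Render the single line Nagios displays, e.g. 'HTTP OK - ...'."""
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}[status]
    return "%s %s - %.3f second response time" % (service, label, value)
```

A real plugin would take `warn`/`crit` from the command line, probe the service, print `status_line(...)` and `sys.exit()` with the classification.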
[00:54:02] New patchset: Faidon; "Restore previous version of solr monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49366
[00:54:02] PROBLEM - Apache HTTP on mw91 is CRITICAL: Connection refused
[00:54:11] PROBLEM - Apache HTTP on mw119 is CRITICAL: Connection refused
[00:54:11] PROBLEM - Apache HTTP on mw122 is CRITICAL: Connection refused
[00:54:11] PROBLEM - Apache HTTP on mw90 is CRITICAL: Connection refused
[00:54:19] RobH: hey.
[00:54:36] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49366
[00:54:46] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49366
[00:54:56] PROBLEM - Apache HTTP on mw92 is CRITICAL: Connection refused
[00:55:32] PROBLEM - Apache HTTP on mw89 is CRITICAL: Connection refused
[00:56:25] PROBLEM - Apache HTTP on mw114 is CRITICAL: Connection refused
[00:56:25] PROBLEM - Apache HTTP on mw120 is CRITICAL: Connection refused
[00:56:33] New review: Catrope; "Patch Set 1: Code-Review+1" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/47519
[00:58:43] nagios@spence:~$ /usr/local/nagios/libexec/check_job_queue
[00:58:43] PHP Notice: Undefined variable: urlprotocol in /home/wikipedia/common/wmf-config/filebackend.php on line 190
[00:58:46] PHP Notice: Undefined variable: urlprotocol in /home/wikipedia/common/wmf-config/filebackend.php on line 195
[00:58:49] PHP Notice: Undefined variable: urlprotocol in /home/wikipedia/common/wmf-config/filebackend.php on line 196
[00:58:52] JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors -
[00:58:52] anyone know why that is and how to fix it?
[00:58:54] PHP Fatal error: Class 'Memcached' not found in /home/wikipedia/common/php-1.21wmf9/includes/objectcache/MemcachedPeclBagOStuff.php on line 57
[00:58:58] PROBLEM - Apache HTTP on mw116 is CRITICAL: Connection refused
[00:59:04] I'll dig in further, but thought of asking the lazyirc first
[00:59:36] Ah, WTF?
[00:59:41] paravoid: On which machine? The Nagios box?
[00:59:49] oic, spence
[00:59:56] Make sure it's got an up-to-date copy of MediaWiki
[01:00:11] I don't remember whether spence is on the dsh list for scap now but for a time it wasn't
[01:00:19] PROBLEM - Apache HTTP on mw88 is CRITICAL: Connection refused
[01:01:22] PROBLEM - Apache HTTP on mw115 is CRITICAL: Connection refused
[01:01:31] PROBLEM - Apache HTTP on mw113 is CRITICAL: Connection refused
[01:01:37] spence is on the mediawiki-installation dsh group, is that what you're referring to?
[01:01:40] PROBLEM - Apache HTTP on mw123 is CRITICAL: Connection refused
[01:02:39] Yeah
[01:02:40] Hmm that's odd
[01:02:53] it says 1.21wmf9
[01:03:18] paravoid, the checks you've restored are all broken
[01:03:37] I can remove those as well :-)
[01:03:48] or do you want to?
[01:03:53] would it be OK with you if I disabled the error counter checks for now?
[01:04:03] what do you mean?
[01:06:04] it checks for several things, counters are just one of them (yes I know these should be separate metrics, but Solr statistics are pretty slow to generate and they would have to be queried several times if split into separate checks)
[01:06:17] Isn't Memcached from pecl?
[01:06:26] hence the name of the module
[01:06:51] there's more than one memcached lib iirc. i think we've tried at least 2 of them
[01:06:55] PROBLEM - Apache HTTP on mw96 is CRITICAL: Connection refused
[01:06:55] PROBLEM - Apache HTTP on mw118 is CRITICAL: Connection refused
[01:06:58] paravoid: php5-memcached should be installed
[01:07:52] huh, how would the library have been removed from spence?
[01:08:47] Reedy: thanks. what about the urlprotocol thing?
[01:09:29] That's probably because there is no request URL
[01:09:37] So there is no protocol it can pull from there
[01:09:46] RoanKattouw: No
[01:09:50] Oh?
[01:10:02] It's because I removed the $urlprotocol from commonsettings as it was only set to ''
[01:10:20] Oh, right
[01:10:21] 'url' => "$urlprotocol//upload.wikimedia.org/wikipedia/commons",
[01:10:22] OK
[01:10:30] that's what /home/wikipedia/common/wmf-config/filebackend.php:190 says
[01:10:31] PROBLEM - Apache HTTP on mw99 is CRITICAL: Connection refused
[01:10:35] New patchset: Reedy; "Remove urlprotocol from filebakcend" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49368
[01:10:45] New patchset: Reedy; "Remove urlprotocol from filebackend" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49368
[01:10:53] yay for gerrit summary editing
[01:10:57] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49368
[01:10:58] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49368
[01:11:00] oh heh
[01:11:30] !log reedy synchronized wmf-config/filebackend.php
[01:11:36] mw1165: rsync: change_dir#3 "/apache/common-local" failed: No such file or directory (2)
[01:11:40] for how long was it broken?:P
[01:11:54] these are broken
[01:11:58] new boxes
[01:12:15] lols
[01:12:16] oh heh, the check actually works now
[01:12:17] JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , hewiki (24094), huwiki (24463), itwiki (24574), svwiki (18738), test2wiki (51495), Total (163348)
[01:12:22] wheee
[01:12:31] Logged the message, Master
[01:12:34] look at that
[01:12:35] eek test2wiki
[01:12:37] actual problems!
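The notices above came from interpolating an undefined `$urlprotocol` into strings like `"$urlprotocol//upload.wikimedia.org/..."`; with the variable gone (r/49368), the configured URLs are simply protocol-relative (`//host/path`), which a consumer resolves against whatever scheme the current page uses. A sketch of that resolution (an illustrative helper, not wmf-config code):

```python
def expand(url, scheme="https"):
    """Expand a protocol-relative URL like //upload.wikimedia.org/... by
    borrowing the scheme of the enclosing context; other URLs pass through."""
    if url.startswith("//"):
        return "%s:%s" % (scheme, url)
    return url
```

Setting `$urlprotocol = ''` and removing the variable entirely produce the same `//host/path` strings, which is why dropping it was safe once the stray notices were cleaned up.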
[01:12:43] MaxSem: It's been broken most of this week
[01:12:50] on test2wiki
[01:12:56] I've run runJobs manually on fenari at least twice
[01:13:09] MaxSem: so, I'd be happy to +2 more checks
[01:13:12] if I like them :)
[01:13:42] reedy@fenari:/home/wikipedia/common$ mwscript showJobs.php test2wiki
[01:13:43] 3
[01:13:56] Why is there a , after nothing? :/
[01:13:59] MaxSem: push what you're thinking and I'll comment in gerrit
[01:14:18] paravoid, they are all in the same check_solr
[01:14:35] strip down check_solr to something that doesn't have state and push it
[01:14:35] paravoid: That job queue check is giving incorrect numbers (from somewhere)..
[01:14:39] and I'll see about merging it
[01:14:42] all of them are < 500
[01:14:55] php /home/wikipedia/common/multiversion/MWScript.php extensions/WikimediaMaintenance/getJobQueueLengths.php
[01:14:58] that's what it runs
[01:15:20] hawwiki 2
[01:15:20] hewiki 24125
[01:15:20] hewikiquote 39
[01:15:20] hifwiki 1
[01:15:20] hrwiki 1
[01:15:22] etc.
[01:15:38] reedy@fenari:/home/wikipedia/common$ mwscript showJobs.php hewiki
[01:15:38] 47
[01:15:51] try --group
[01:16:01] wait, yeah
[01:16:08] 24134 in the db
[01:16:11] AaronSchulz: you broke it
[01:16:16] probably..
[01:17:25] reedy@fenari:/home/wikipedia/common$ mwscript showJobs.php hewiki --group
[01:17:25] ChangeNotification: 59 queued; 24083 acquired
[01:17:32] * Reedy looks blankly
[01:20:49] nothing on my side, right?
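The shape of the check being debugged above is: run `getJobQueueLengths.php`, which prints `wiki count` lines, and go CRITICAL if any wiki is over a threshold. As the `--group` output shows, the counts included "acquired" jobs (claimed but not completed), which is why they disagreed with the plain `showJobs.php` numbers. A hedged sketch of that parse-and-threshold step, with assumed parsing and threshold rather than the actual plugin code:

```python
# Sketch of a check_job_queue-style threshold report over
# "wiki<space>count" lines like the getJobQueueLengths.php output above.
THRESHOLD = 9999

def parse_lengths(text):
    """Parse lines like 'hewiki 24125' into {wiki: count}."""
    lengths = {}
    for line in text.strip().splitlines():
        wiki, count = line.split()
        lengths[wiki] = int(count)
    return lengths

def report(lengths, threshold=THRESHOLD):
    """Return (exit_code, status_line) in Nagios plugin style."""
    over = {w: c for w, c in lengths.items() if c > threshold}
    if not over:
        return 0, "JOBQUEUE OK - no wiki over %d jobs" % threshold
    detail = ", ".join("%s (%d)" % (w, c) for w, c in sorted(over.items()))
    detail += ", Total (%d)" % sum(over.values())
    return 2, ("JOBQUEUE CRITICAL - the following wikis have more than "
               "%d jobs: %s" % (threshold, detail))
```

Incidentally, the stray leading comma Reedy wonders about ("Why is there a , after nothing?") is the classic symptom of joining a list whose first element is an empty string; building the detail string only from the over-threshold entries, as above, avoids it.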
[01:20:50] New patchset: Faidon; "Add all applicationserver packages to nagios boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49370
[01:21:08] I'd say not, that script is showing what's in the database
[01:21:18] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49370
[01:21:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49370
[01:21:28] RECOVERY - Apache HTTP on mw95 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.457 second response time
[01:21:37] RECOVERY - Apache HTTP on mw98 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.278 second response time
[01:22:14] Apart from svwiki... Those are Wikidata related ChangeNotification jobs
[01:22:31] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.098 second response time
[01:22:31] RECOVERY - Apache HTTP on mw93 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.101 second response time
[01:25:04] New patchset: MaxSem; "check_solr, attempt 2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49372
[01:25:26] paravoid, ^^
[01:27:10] MaxSem: add the puppet manifests that use it in the same commit
[01:27:11] what is solr used for so far?
[01:27:44] jeremyb_, https://wikitech.wikimedia.org/view/Solr#Current_uses_on_WMF
[01:27:58] how convenient
[01:28:05] ;)
[01:28:17] hahaha, slapped with the fine manual :)
[01:28:30] ooooh, geodata
[01:32:44] Reedy: that stopped on the 13th
[01:32:53] that was fixed already
[01:33:12] the dead jobs will sit there for a week before they get nuked
[01:33:25] which is always useful for investigating
[01:33:42] lol
[01:33:49] Which means our job queue checks are useless
[01:34:16] well, you want to know if a bunch of dead jobs build up
[01:34:27] that only happens when something wonky happens
[01:34:42] though after the fact it's probably useless for a few days yeah
[01:35:24] the graphs are probably still useful
[01:35:31] (e.g. gdash)
[01:36:50] RECOVERY - Apache HTTP on mw91 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.084 second response time
[01:37:08] RECOVERY - Apache HTTP on mw119 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.100 second response time
[01:37:08] RECOVERY - Apache HTTP on mw122 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.249 second response time
[01:37:08] RECOVERY - Apache HTTP on mw94 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.085 second response time
[01:37:08] RECOVERY - Apache HTTP on mw90 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.242 second response time
[01:37:15]
[01:37:35] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.092 second response time
[01:37:39] New patchset: MaxSem; "check_solr, attempt 2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49372
[01:37:44] RECOVERY - Apache HTTP on mw121 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.098 second response time
[01:37:53] RECOVERY - Apache HTTP on mw125 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.091 second response time
[01:37:53] RECOVERY - Apache HTTP on mw97 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.111 second response time
[01:38:02] RECOVERY - Apache HTTP on mw117 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.080 second response time
[01:38:17] MaxSem: you didn't bump your attempt # :-P
[01:38:20] RECOVERY - Apache HTTP on mw92 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.089 second response time
[01:38:29] RECOVERY - Apache HTTP on mw114 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.083 second response time
[01:38:29] RECOVERY - Apache HTTP on mw89 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.088 second response time
[01:38:33] paravoid, https://gerrit.wikimedia.org/r/49372 plz
[01:38:50] New patchset: Faidon; "Sync icinga's nrpe_local.cfg with nagios'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49373
[01:39:24] MaxSem: icinga has different paths, look at the entries above and below yours
[01:39:29] it's confusing I know
[01:39:33] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[01:39:34] I'm annoyed at the nagios/icinga split myself.
[01:39:44] fuuuuuu...
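paravoid's request was to "strip down check_solr to something that doesn't have state": fetch Solr's statistics once per run and judge only the values in hand, never comparing against a saved previous sample. A hedged sketch of that shape; the URL, port, and `avgTimePerRequest` field are illustrative assumptions, not the real plugin's interface:

```python
# Stateless check_solr sketch: one fetch, one purely local verdict.
import json
from urllib.request import urlopen

def evaluate(stats, max_avg_ms=1000.0):
    """Classify a single snapshot of Solr stats; returns (exit_code, message)."""
    if "avgTimePerRequest" not in stats:
        return 3, "SOLR UNKNOWN - stats have no avgTimePerRequest"
    avg = float(stats["avgTimePerRequest"])
    if avg > max_avg_ms:
        return 2, "SOLR CRITICAL - avg request time %.1f ms" % avg
    return 0, "SOLR OK - avg request time %.1f ms" % avg

def check(url="http://localhost:8983/solr/admin/stats"):
    """Network round-trip separated from the verdict so the latter is testable."""
    try:
        stats = json.load(urlopen(url, timeout=10))
    except Exception as e:  # an unreachable Solr is itself CRITICAL
        return 2, "SOLR CRITICAL - cannot fetch stats: %s" % e
    return evaluate(stats)
```

The trade-off MaxSem mentions still applies: querying one slow statistics endpoint once and judging several values from that single snapshot avoids hitting Solr repeatedly, at the cost of bundling several concerns into one check. Rate-style counters, which genuinely need two samples, are exactly what gets lost by going stateless.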
[01:39:49] I'll prod leslie next week and try to finish it off
[01:39:55] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49373
[01:40:05] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49373
[01:40:32] New patchset: MaxSem; "check_solr, attempt 2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49372
[01:40:36] New patchset: Reedy; "Expose filebackend.php in noc conf" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49374
[01:41:45] New review: Faidon; "Patch Set 3: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49372
[01:41:51] New review: Reedy; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49374
[01:41:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49374
[01:41:57] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49372
[01:42:12] yawn
[01:42:16] thanks :)
[01:42:32] I'm not going to wait for puppet to push this
[01:42:48] * MaxSem falls asleep
[01:43:14] yeah, I'll watch it
[01:43:44] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.087 second response time
[01:43:53] RECOVERY - Apache HTTP on mw120 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.082 second response time
[01:43:53] RECOVERY - Apache HTTP on mw123 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.083 second response time
[01:44:20] RECOVERY - Apache HTTP on mw88 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.072 second response time
[01:44:38] RECOVERY - Apache HTTP on mw115 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.084 second response time
[01:44:55] 37 Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf9/includes/WebStart.php' (include_path='.:/usr/share/php:/usr/local/apache/common/php') in /usr/local/apache/common-local/php-1.21wmf9/index.php on line 55
[01:44:58] Rarrghhh
[01:45:06] Are they in rotation, or is it just nagios?
[01:45:15] what is?
[01:45:29] apaches
[01:45:36] which ones?
[01:45:42] Loads of fatals from missing files reporting in the apache syslogs
[01:45:49] do you have hostnames?
[01:45:56] http://p.defau.lt/?rej8eQ1RHh7d4sEsBRqyKg
[01:45:59] Take your pick ;)
[01:46:26] I see tampa IPs
[01:46:42] Some before have just had screwed permissions on the php-1.21wmf9 folder somehow
[01:47:01] someone brought up a bunch of pmtpa apaches
[01:47:08] either RobH or Steven
[01:47:15] but neither are on irc
[01:47:20] RECOVERY - Apache HTTP on mw96 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.088 second response time
[01:47:20] RECOVERY - Apache HTTP on mw99 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.087 second response time
[01:47:22] and I'm annoyed :)
[01:47:26] !log reedy synchronized php-1.21wmf9/
[01:47:27] Logged the message, Master
[01:47:29] RECOVERY - Apache HTTP on mw118 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.091 second response time
[01:47:44] Numerous timeouts, from what I guess are decommissioned apaches in tampa
[01:48:07] it might be new apaches
[01:48:14] srv226: ssh: connect to host srv226 port 22: Connection timed out
[01:48:15] srv227: ssh: connect to host srv227 port 22: Connection timed out
[01:48:15] srv228: ssh: connect to host srv228 port 22: Connection timed out
[01:48:15] etc
[01:48:21] oh
[01:48:22] maybe
[01:48:25] no idea
[01:48:29] then also
[01:48:29] mw1041: ssh: connect to host mw1041 port 22: Connection timed out
[01:48:30] mw1045: ssh: connect to host mw1045 port 22: Connection timed out
[01:48:30] mw1165: rsync: mkdir "/apache/common-local/php-1.21wmf9" failed: No such file or directory (2)
[01:48:35] I know Rob is replacing a bunch of apaches in tampa
[01:48:41] grrr
[01:48:43] signal:noise is bad :(
[01:50:46] I'm trying to get ahold of him and failing
[01:50:53] do you mind filing an RT?
[01:54:14] RECOVERY - ircecho_service_running on spence is OK: PROCS OK: 4 processes with args ircecho
[01:54:51] nagios is looking much better :D
[01:54:53] New patchset: Faidon; "Fix check_ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49376
[01:55:38] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49376
[01:55:48] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49376
[02:08:38] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:47] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 207 seconds
[02:18:23] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 1 seconds
[02:28:27] !log LocalisationUpdate completed (1.21wmf9) at Sat Feb 16 02:28:26 UTC 2013
[02:28:30] Logged the message, Master
[02:29:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:30:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.869 seconds
[02:33:41] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours
[02:47:38] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:06:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:15:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.389 seconds
[03:49:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:54:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours
[03:58:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.465 seconds
[04:01:44] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:38] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours
[04:32:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:37:08] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 186 seconds
[04:37:53] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 197 seconds
[04:39:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
[04:42:23] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[04:43:08] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[04:55:15] New review: Jeremyb; "Patch Set 1: Code-Review-1" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/48868
[05:03:08] New review: Jeremyb; "Patch Set 1:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868
[05:26:38] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:31:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.024 second response time on port 8123
[05:34:13] PROBLEM - Apache HTTP on mw108 is CRITICAL: Connection refused
[05:34:40] PROBLEM - Apache HTTP on mw109 is CRITICAL: Connection refused
[05:36:28] RECOVERY - Apache HTTP on mw109 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.100 second response time
[05:38:52] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 210 seconds
[05:39:19] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 228 seconds
[05:40:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 6 seconds
[05:40:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:41:07] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[05:44:16] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123
[05:46:22] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[05:46:58] RECOVERY - Apache HTTP on mw108 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.103 second response time
[05:53:25] RECOVERY - Lucene on search1015 is OK: TCP OK - 9.019 second response time on port 8123
[05:58:59] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:04:22] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[06:09:28] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[06:09:29] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123
[06:10:49] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[06:11:16] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[06:24:10] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:41:50] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[06:43:38] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 191 seconds
[06:44:50] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:44:50] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:44:50] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:45:26] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[06:46:29] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.112 second response time
[06:46:30] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.379 second response time
[06:48:17] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.816 second response time
[06:52:20] RECOVERY - Lucene on search1015 is OK: TCP OK - 3.022 second response time on port 8123
[07:03:17] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[07:15:26] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[07:15:44] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[07:15:53] RECOVERY - MySQL disk space on neon is OK: DISK OK
[07:21:26] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:40:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.032 second response time on port 8123
[07:47:39] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[08:02:48] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123
[08:03:42] PROBLEM - Puppet freshness on db1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:04:45] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours
[08:26:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[08:32:57] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[08:33:51] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection timed out
[08:34:09] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 197 seconds
[08:35:57] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[08:38:12] PROBLEM - Lucene on search1011 is CRITICAL: Connection timed out
[08:38:21] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[08:40:45] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[08:41:30] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.027 second response time on port 8123
[08:51:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 195 seconds
[08:52:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds
[09:01:27] RECOVERY - Lucene on search1015 is OK: TCP OK - 3.026 second response time on port 8123
[09:12:33] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[09:21:34] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.025 second response time on port 8123
[09:26:39] RECOVERY - Lucene on search1015 is OK: TCP OK - 3.019 second response time on port 8123
[09:27:06] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[09:33:02] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 21 seconds
[09:33:11] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[09:33:38] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[09:34:32] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 7 seconds
[09:39:29] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[09:56:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 183 seconds
[09:56:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds
[09:57:20] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[10:03:02] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[10:10:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[10:11:35] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[10:25:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 214 seconds
[10:25:59] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 222 seconds
[10:27:29] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 12 seconds
[10:27:47] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[10:40:05] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 203 seconds
[10:40:32] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 218 seconds
[10:41:53] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[10:42:20] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[11:12:29] RECOVERY - MySQL disk space on neon is OK: DISK OK
[11:13:14] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[11:16:00] New patchset: Alex Monk; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868
[11:27:11] RECOVERY - Lucene on search1015 is OK: TCP OK - 9.030 second response time on port 8123
[11:36:38] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds
[11:38:08] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[11:38:20] is there an issue with search today?
[11:38:24] https://bugzilla.wikimedia.org/45073
[11:38:27] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[11:41:44] aude, ^^^:P
[11:41:55] yes
[11:41:59] seems not just wikidata
[11:42:24] http://www.mediawiki.org/w/index.php?title=Special:Search&search=mapnik&fulltext=Search&profile=all&redirs=1
[11:42:30] for example sure has results, per google
[11:42:39] * aude sighs
[11:49:35] ugh, most wikis are on that pool
[11:50:01] Reedy, I guess it's time for me to wake notpeter up?
[11:55:17] New patchset: Dereckson; "(bug 44604) Enable PostEdit on ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49378
[11:58:01] text sent
[11:59:08] eww, and didn't go through
[11:59:31] grr grr
[12:00:03] mutante, yt?
[12:03:56] :(
[12:04:11] * aude at least glad it's not a problem specific to wikidata
[12:04:23] no, it's far worse
[12:04:24] e.g. i don't have to figure it out
[12:04:33] yeah, but hope we can fix it soon
[12:04:38] cuz most wikis are affected
[12:04:43] true
[12:04:49] wikidata has enough problems with search
[12:05:05] what problems?
[12:05:10] all sorts ^^
[12:05:11] the results suck
[12:05:21] aude: is your special page broken by this too?
[12:05:29] not WD-specific;)
[12:05:34] Nemo_bis: i don't think so
[12:05:39] MaxSem: extra bad for wikidata
[12:05:55] because wikipedia counts incoming links or something and the sort order on search is a little more sane
[12:06:03] aude: why don't you make it the default? the usual search looks completely useless
[12:06:13] on wikidata, you get Egypt, Arkansas and a lot of other stuff before you see Egypt the country :D
[12:06:28] Nemo_bis: working on it, maybe as soon as monday :D
[12:06:38] we can deploy it or soon thereafter
[12:07:00] and then we're looking at solr to allow hopefully some smarter sorting and searching
[12:07:05] aude: so the JS will be killed?
[12:07:16] which JS?
[12:07:21] the "enhanced search" or whatever
[12:07:29] don't know
[12:07:44] isn't it the same?
[12:08:12] * aude finding the bug for this
[12:08:46] https://www.wikidata.org/wiki/MediaWiki:Gadget-Search.js
[12:09:35] aude: in other words, are you able to answer https://bugzilla.wikimedia.org/show_bug.cgi?id=43020#c2 ? :)
[12:09:53] Nemo_bis: https://gerrit.wikimedia.org/r/#/c/49263/ is the patch
[12:09:55] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[12:10:30] fuck
[12:10:41] * MaxSem hates his phone
[12:11:55] why do these things *always* have to happen on a weekend, especially middle of the night for SF?
[12:12:37] dammit, the glorious Galaxy II quietly loses the network and still happily shows all bars
[12:12:42] maybe because during the weekend people are more bored and do more stuff like obsessive searches which kill the servers?
[12:12:44] Nemo_bis: yes i think what we have will be similar to the gadget
[12:12:49] heh
[12:12:54] now the text to mutante was sent for realz
[12:13:05] * aude will have to review the patch today or tomorrow to see if we can get it in for monday
[12:13:23] no, it's because the replication cronjob runs at this time
[12:14:48] daily cronjob?
[12:15:25] see wikitech-l archives
[12:19:25] ah
[12:24:46] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123
[12:26:52] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[12:26:54] nagios-wm, lies!
[12:32:34] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[12:33:19] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 206 seconds
[12:33:22] mmm, from reading the engineering@ archives, this might unbreak itself after some time
[12:34:58] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours
[12:35:07] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[12:36:01] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[12:48:55] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:05:07] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.028 second response time on port 8123
[13:10:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[13:12:46] RECOVERY - Lucene on search1015 is OK: TCP OK - 9.035 second response time on port 8123
[13:16:04] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[13:37:56] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 194 seconds
[13:38:42] !log restarted lucene-search on search1015 about 20 minutes ago
[13:38:45] Logged the message, Master
[13:39:44] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 20 seconds
[13:48:35] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours
[13:55:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours
[13:57:35] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours
[13:57:35] PROBLEM - Puppet freshness on sq85 is CRITICAL: Puppet has not run in the last 10 hours
[13:58:38] PROBLEM - Puppet freshness on db1026 is CRITICAL: Puppet has not run in the last 10 hours
[13:58:38] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[14:02:32] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:35] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours
[14:34:28] PROBLEM - Puppet freshness on mc1003 is CRITICAL: Puppet has not run in the last 10 hours
[14:47:50] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[14:48:52] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[14:50:31] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection timed out
[14:55:01] PROBLEM - Lucene on search1012 is CRITICAL: Connection timed out
[15:19:37] RECOVERY - MySQL disk space on neon is OK: DISK OK
[15:20:31] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[15:29:31] RECOVERY - Lucene on search1012 is OK: TCP OK - 9.032 second response time on port 8123
[15:31:01] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.029 second response time on port 8123
[17:28:03] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds
[17:28:57] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 218 seconds
[17:39:54] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:33] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms
[17:51:36] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[17:52:30] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[18:05:19] PROBLEM - Puppet freshness on db1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:06:22] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours
[19:43:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:49:00] RECOVERY - Puppetmaster HTTPS on stafford is
OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.064 seconds [20:17:12] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds [20:18:06] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 204 seconds [20:19:54] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:20:48] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:21:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.067 seconds [20:38:39] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds [20:38:57] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [21:09:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 25 seconds [21:11:13] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:25:27] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 209 seconds [21:25:45] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 220 seconds [21:36:42] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Sat Feb 16 21:36:14 UTC 2013 [21:45:07] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 29 seconds [21:46:19] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [21:49:37] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [21:50:04] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [22:11:40] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:20:40] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [22:22:01] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:35:40] 
PROBLEM - Puppet freshness on mc1006 is CRITICAL: Puppet has not run in the last 10 hours [22:37:37] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [22:42:34] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [22:45:42] [22:50:22] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:08:04] RECOVERY - Puppet freshness on labstore3 is OK: puppet ran at Sat Feb 16 23:07:42 UTC 2013 [23:25:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Sat Feb 16 23:24:56 UTC 2013 [23:49:19] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours [23:58:10] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [23:58:11] PROBLEM - Puppet freshness on sq85 is CRITICAL: Puppet has not run in the last 10 hours [23:59:22] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [23:59:22] PROBLEM - Puppet freshness on db1026 is CRITICAL: Puppet has not run in the last 10 hours