[00:01:03] dzahn is doing a graceful restart of all apaches [00:01:26] !log dzahn gracefulled all apaches [00:01:36] Logged the message, Master [00:01:53] !log reedy synchronized wmf-config/InitialiseSettings.php 'fix whitespace' [00:02:03] Logged the message, Master [00:03:59] New patchset: Tim Starling; "Fix failure to write to ExtensionMessages-*.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44548 [00:03:59] New patchset: Tim Starling; "Network-aware scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44526 [00:04:00] New patchset: Tim Starling; "Moved scap scripts to their own directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44525 [00:06:03] New patchset: Reedy; "Fix whitespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44549 [00:06:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44549 [00:06:37] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [00:06:48] New patchset: Dzahn; "commenting wikidata redirect , redirect loop" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44550 [00:07:14] j^: are you around? [00:07:16] New review: Dzahn; "revert" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44550 [00:07:16] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44550 [00:07:38] RobH: cool, I can take it from here [00:07:52] j^: removing the 'async: true' in mw.UploadWizardDetails.js won't break anything right?
[00:08:10] it should handle either case [00:08:32] dzahn is doing a graceful restart of all apaches [00:08:56] !log dzahn gracefulled all apaches [00:09:00] yeah, looks fine [00:09:05] Logged the message, Master [00:10:06] !log aaron synchronized php-1.21wmf8/extensions/UploadWizard/resources/mw.UploadWizardDetails.js [00:10:16] Logged the message, Master [00:10:40] PROBLEM - SSH on cadmium is CRITICAL: Connection refused [00:11:44] New patchset: Dzahn; "fix wikidata redirect" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44552 [00:13:31] New patchset: Dzahn; "fix wikidata redirect" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44552 [00:13:48] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44552 [00:17:26] RECOVERY - SSH on yttrium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:17:34] New patchset: Pyoungmeister; "finishing up spinup of tmh boxes for eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44553 [00:18:18] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44541 [00:18:34] New patchset: Dzahn; "fix wikidata redirect, for real, sry" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44554 [00:18:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44553 [00:19:03] New review: Dzahn; "still bug 41847" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44554 [00:19:03] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44554 [00:19:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 293 seconds [00:20:08] dzahn is doing a graceful restart of all apaches [00:20:34] !log dzahn gracefulled all apaches [00:20:46] Logged the message, Master [00:21:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK
replication delay 8 seconds [00:21:34] yay RobH made asw-a-eqiad changes :) RobH++ [00:21:46] PROBLEM - NTP on yttrium is CRITICAL: NTP CRITICAL: No response from NTP server [00:26:33] TimStarling: tmh1001 and tmh1002 are eqiad and also will need code deploys, for the record [00:27:10] added, thanks [00:27:14] cool! [00:28:31] PROBLEM - Host cadmium is DOWN: PING CRITICAL - Packet loss = 100% [00:33:32] AaronSchulz: no [00:34:05] * AaronSchulz had to think for second to remember the question :) [00:34:14] I already sent an email :) [00:34:22] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [00:34:52] New patchset: Dzahn; "revert the whole www dropping for wikidata, cant have it due to the way bits works" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44558 [00:34:56] paravoid: can you just log into fenari and add that info? [00:35:41] add it where? [00:35:55] privatesettings [00:36:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:36:56] which is where exactly? :) [00:37:26] /h/w/c/wmf-config/PrivateSettings.php [00:37:47] New patchset: Dzahn; "revert the whole www dropping for wikidata, cant use it due to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44559 [00:37:53] thanks. [00:38:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.060 seconds [00:38:34] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/44558 [00:38:51] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44559 [00:40:13] AaronSchulz: I'm not sure how to create the array so that both are used [00:40:20] or am I missing something? 
[00:40:47] it won't be used yet [00:40:53] I just want the info to be there [00:40:58] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [00:41:04] okay, do you mind if I put it as a comment there so you can fix it up? [00:41:08] Logged the message, Master [00:41:19] paravoid is paranoid? [00:41:20] sure [00:41:29] hm? [00:41:52] haha :) [00:42:04] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [00:42:12] sorry, I'm really sleepy :) [00:42:13] Logged the message, Master [00:42:33] !log dzahn synchronized ./wmf-config/CommonSettings.php [00:42:34] paravoid: you could sleep :) [00:42:42] Logged the message, Master [00:42:53] NOT UNTIL THIS ONE THING IS DONE [00:42:56] ;) [00:42:58] heh [00:43:09] AaronSchulz: have a look [00:43:24] and please test before I hit the bed :) [00:43:27] oh, and guess what [00:43:28] dzahn is doing a graceful restart of all apaches [00:43:30] no mw:thumb anymore! [00:43:34] everyone done with scap for now? [00:43:51] !log dzahn gracefulled all apaches [00:44:00] Logged the message, Master [00:44:07] paravoid: can you set rgwS3AccessKey and rgwS3SecretKey too ? [00:44:15] TimStarling: scap is still running for me [00:44:31] AaronSchulz: you wrote that part already? [00:44:34] the info should already be in radosgw-admin or whatever [00:44:35] seeing some interesting failures - https://gist.github.com/d363d11e1217f65d65e3 [00:44:36] paravoid: :) [00:44:47] hehe [00:44:50] actually, i dunno if they really are interesting, but figured i'd post in case they are. [00:45:26] notpeter: is collector on professor?
[00:45:30] yeah [00:45:32] it won't start [00:45:39] haven't gone too far into troubleshooting yet [00:45:52] AaronSchulz: username is different too [00:45:56] unless it's the wrong init script [00:45:57] blarg [00:46:09] good, it should be different :) [00:46:11] awjr: those errors are not dangerous [00:46:17] we should probably exclude vim swapfiles [00:46:52] fenari is unhappy, I guess someone's doing a scap [00:46:59] paravoid: guilty [00:47:05] heh [00:47:07] were are the gallows? [00:47:14] *where [00:47:21] i dunno, but i hope we're using them for scap rather than me [00:47:30] heh [00:47:34] scap will be faster after my changes [00:47:59] this is the longest it's taken for me - 74 minutes so far [00:48:00] *should be faster [00:48:20] it's the netapp, by the looks of it [00:48:28] http://ganglia.wikimedia.org/latest/graph.php?h=nfs1.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358470071&g=cpu_report&z=medium&c=Miscellaneous%20pmtpa [00:48:31] lots of orange [00:48:39] zesty [00:49:07] I wouldn't mind moving the source off /home, even if we do stay with scap [00:49:16] AaronSchulz: done [00:49:34] even with the new scap, I still have to wait for the source to be copied off the netapp a few times [00:49:37] do you still have vim open? [00:49:40] not anymore [00:49:52] ah, there we go, I was getting swap [00:50:28] paravoid: that should be good, I'll clean it up from there [00:50:39] do you have an easy way to test it? [00:50:49] make sure it works before I go? [00:51:16] well it's not urgent, and I can easily use eval.php to play around with the backend object I add while nothing uses it [00:51:35] okay [00:53:43] some of the load on nfs1 may be due to syslog [00:53:57] why is swift writing access logs to NFS? 
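The scap errors in the gist above were judged harmless, with the suggested fix being to exclude vim swapfiles from the sync. A minimal sketch of that filtering step; the helper name and suffix list are assumptions, not the actual scap change:

```python
def filter_sync_files(paths):
    """Drop vim swap/backup files from a sync file list, as suggested above."""
    junk_suffixes = ('.swp', '.swo', '~')
    return [p for p in paths if not p.endswith(junk_suffixes)]

files = ['wmf-config/CommonSettings.php', 'wmf-config/.CommonSettings.php.swp']
print(filter_sync_files(files))
```

In a real deployment tool this would more likely live in the rsync exclude list than in post-filtering, but the effect is the same: stray editor files never reach the apaches.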
[00:54:17] because ben configured it so and noone changed it [00:54:35] what's especially funny is that it gets copied to nfs1001 too [00:54:39] er, nas1001 [00:55:36] I was going to move syslog to fluorine myself, but I think I was put off by some detail of the puppet configuration [00:57:19] RECOVERY - SSH on cadmium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:58:25] die bot die [00:58:27] finally ! [00:58:36] ok, now to figure out why icinga isn't writing to this [01:00:36] is there a manual for the netapps? [01:01:14] lots of them but I don't think it's such a good idea :) [01:01:14] what do you need? [01:01:23] I want to know why it is especially slow right now [01:01:31] PROBLEM - NTP on cadmium is CRITICAL: NTP CRITICAL: No response from NTP server [01:02:05] haha [01:02:06] good one [01:02:32] sure, you can get the access credentials, then download a multi-megabyte windows app called "NetApp Operations Manager" or something [01:02:44] or you can figure out some perf data out of SNMP [01:02:45] !log aaron synchronized wmf-config/PrivateSettings.php [01:02:54] Logged the message, Master [01:03:27] but let's see if it complains about anything on the logs [01:04:02] should we be particularly concerned about the fact that scap has been running for 90 minutes now? this is by far the longest i've ever seen it take [01:04:12] TimStarling: http://torrus.wikimedia.org/torrus/Storage?path=/Storage/nas1-a.pmtpa.wmnet/NetApp_General/DiskWriteBytes [01:04:17] lots of fun graphs [01:05:24] awj's scap still has a long way to go, so there's no hurry [01:05:54] gah!!! i hate you so much ircecho [01:05:56] er TimStarling any idea how much more? 
[01:06:10] nothing strange on the logs [01:06:27] http://torrus.wikimedia.org/torrus/Storage?path=/Storage/nas1-a.pmtpa.wmnet/NetApp_General/NfsOps [01:06:44] that's no small time [01:06:49] awjr: well, it's doing srv231-238, and it goes in the order of the hosts in /etc/dsh/group/mediawiki-installation [01:07:01] * AaronSchulz wonders if there are hourly graphs, recalls the answer being no [01:07:05] LeslieCarr: :-( [01:07:19] ha [01:07:23] test [01:07:27] nas1001-b has taken over nas1001-a [01:07:37] so it's maybe 64% done [01:07:45] how long has it been going? [01:07:52] i run it manually - it gets the data from irc.log -- i run it automatically, it doesn't read irc.log --- but irc.log is world readable! [01:08:00] * AaronSchulz looks at the wrong nas [01:08:01] TimStarling: 94 minutes now [01:08:12] Tue Jan 15 14:07:55 GMT [nas1001-b:cf.fm.takeoverDuration:info]: Failover monitor: takeover duration time is 6 seconds. [01:08:12] TimStarling: this is becoming a bit problematic as i need to leave soon :| [01:08:15] Tue Jan 15 14:07:57 GMT [nas1001-a:mgr.partner.stack.saved:notice]: Cluster takeover has saved partner panic stack trace information for logging. [01:08:18] Tue Jan 15 14:07:57 GMT [nas1001-a:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check Error at CPU2. 
MC5 Error: STATUS<0xb200000084200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on Contro [01:08:22] Tue Jan 15 14:07:57 GMT [nas1001-a:mgr.stack.at:notice]: Panic occurred at: Tue Jan 15 14:07:48 2013 [01:08:25] Tue Jan 15 14:07:57 GMT [nas1001-a:mgr.stack.proc:notice]: Panic in process: idle_thread2 [01:08:28] Tue Jan 15 14:07:58 GMT [nas1001-b:callhome.sfo.takeover.panic:info]: Call home for CONTROLLER TAKEOVER COMPLETE PANIC [01:08:31] COMPLETE PANIC I TELL YOU [01:08:43] though actually I don't see why nas1 would be doing much [01:08:45] so you'd expect it to take another 54 minutes [01:09:13] well that includes the time it took for localisation cache updating and stuff [01:09:27] so hopefully less! [01:09:59] blah blah [01:10:00] blah blah [01:10:54] i'm a stupid bot which needs kill -9 because i won't work from my init script, though will work from the command line when using the same arguments as the commit script [01:10:54] i'm a stupid bot which needs kill -9 because i won't work from my init script, though will work from the command line when using the same arguments as the commit script [01:11:10] paravoid: that doesn't explain it [01:12:06] icinga-bot: try from the command line with an empty environment [01:12:28] RECOVERY - NTP on yttrium is OK: NTP OK: Offset -0.04104089737 secs [01:13:48] mutante: here's a bad url: http://wikidata.org/w/load.php?debug=false&lang=en&modules=site&only=styles&skin=vector&* [01:14:18] http://bits.wikimedia.org/www.wikidata.org/load.php?debug=false&lang=en&modules=site&only=styles&skin=vector&* [01:14:57] http://www.youtube.com/watch?v=FE0XcdM22Yo awjr is the scapman [01:15:09] i am dying on the inside [01:15:32] http://wikidata.org/wiki/Wikidata:Project_chat is broken :( [01:16:05] http://www.wikidata.org/wiki/Wikidata:Main_Page has no css :( [01:16:14] Ryan_Lane: purging 1 urls. done [01:16:41] New patchset: Aaron Schulz; "Added ceph file backend configuration." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44562 [01:19:56] AaronSchulz: what happened with the cleanup script? [01:20:25] the patch is in gerrit [01:20:29] never merged [01:20:46] https://gerrit.wikimedia.org/r/#/c/44463/1 [01:21:22] okay [01:21:25] I'll ping Reedy tomorrow [01:23:47] awjr: if you like, I can kill your scap and run the new one instead [01:24:32] it might be faster [01:25:17] i.e. faster to run a whole scap with the new one than to let the old one finish [01:25:36] TimStarling: well, i expect this will finish before a new scap would finish localisation update… [01:25:40] then again, at this point i could just sync-dir [01:25:54] !log authdns update adding db1051-1060 to eqiad mgmt zone file [01:26:06] Logged the message, Master [01:26:59] yes, do a sync-dir [01:27:07] TimStarling although… are the l10n cache updates copied out during this portion of scap as well? [01:27:17] do a sync-dir for that too [01:27:34] sync-dir php-1.21wmf8/cache [01:27:36] interesting - where do i sync-dir to pick up the l10n changes? [01:27:39] or whatever version you use [01:27:45] ok cool, we have changes for both [01:27:52] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44456 [01:28:19] TimStarling: it feels so close though… srv279 [01:29:31] maybe another 10 minutes? [01:29:36] AaronSchulz: wmf8 killed the udp profiling collector, not sure yet why [01:30:04] TimStarling: that's what im thinking, or less unless there's some hangup. im inclined to wait it out [01:30:35] i've come this far! [01:30:41] !log reedy synchronized wmf-config/ [01:30:51] Logged the message, Master [01:30:55] AaronSchulz: maybe not wmf8, but something in the cluster of deploys around then [01:34:57] dzahn is doing a graceful restart of all apaches [01:35:21] !log dzahn gracefulled all apaches [01:35:31] Logged the message, Master [01:35:49] 1.20wmf11 is deployed somewhere? 
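The progress estimate above (position in /etc/dsh/group/mediawiki-installation as a rough fraction of hosts done) can be turned into an ETA. A back-of-envelope sketch, assuming progress is linear across the host list, which the log itself notes is optimistic since localisation cache updating is front-loaded:

```python
def scap_eta(minutes_elapsed, fraction_done):
    """Rough remaining-time estimate for a linear sync over a host list."""
    if not 0 < fraction_done <= 1:
        raise ValueError('fraction_done must be in (0, 1]')
    projected_total = minutes_elapsed / fraction_done
    return projected_total - minutes_elapsed  # minutes still to go

# e.g. 94 minutes in, roughly 64% of the way through the host list
print(scap_eta(94, 0.64))
```

With those inputs the projection is roughly 53 more minutes, close to the "another 54 minutes" guessed in the channel.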
[01:36:41] ok, well here's one problem: [01:36:49] rsync is running with --exclude=**/.git/objects [01:37:15] but rsync is spending all its time reading from files like /home/wikipedia/common/php-1.21wmf8/.git/modules/extensions/WebFonts/objects/pack/pack-cac49193e6b78728682d3e915c5a499150fc8c5a.pack [01:37:30] binasher: shouldn't be.. [01:38:42] there are live eqiad apaches sending profiling packets identifying as wmf11 [01:39:03] that's exciting [01:40:07] are we skipping wmf9 and 10 and hoping to catch up with firefox in a few more weeks? [01:42:01] !log awjrichards synchronized php-1.21wmf7/extensions/MobileFrontend/ 'touch files' [01:42:05] !log awjrichards Finished syncing Wikimedia installation... : Updating MobileFrontend per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-01-17 [01:42:10] Logged the message, Master [01:42:19] Logged the message, Master [01:43:13] holy lord finally [01:43:25] !log awjrichards synchronized php-1.21wmf8/extensions/MobileFrontend 'touch files' [01:43:34] Logged the message, Master [01:43:54] binasher, LeslieCarr, mutante, or anyone else available to help with a mobile varnish cache flush? [01:44:03] awjr: i'll get yours [01:44:05] 130 mins of scap [01:44:07] thanks LeslieCarr [01:44:14] hopefully that should be the end of this deployment [01:44:16] others are busy with sitebroken [01:44:18] You're supposed to hold your breath. [01:44:19] o [01:44:46] done [01:44:54] !log purged mobile varnish cache for awjr [01:45:05] Logged the message, Mistress of the network gear. 
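The rsync symptom above has a plausible explanation: `--exclude=**/.git/objects` only matches an `objects` directory sitting directly under a `.git` path component, while git submodule data lives under `.git/modules/<name>/objects`, so those pack files are still read. A crude model of rsync-style pattern matching (real rsync filter rules are considerably more involved) illustrates the mismatch:

```python
import re

def rsync_excluded(path, pattern):
    """Crude model of an rsync exclude: '**' crosses '/', '*' does not,
    and a pattern with no leading '/' may match at any directory level."""
    rx = re.escape(pattern).replace(r'\*\*', '.*').replace(r'\*', '[^/]*')
    parts = path.split('/')
    # rsync skips a file if the file itself or any parent directory matches
    return any(re.fullmatch('(.*/)?' + rx, '/'.join(parts[:i]))
               for i in range(1, len(parts) + 1))

pack = 'php-1.21wmf8/.git/modules/extensions/WebFonts/objects/pack/p.pack'
print(rsync_excluded(pack, '**/.git/objects'))  # the submodule path slips through
print(rsync_excluded(pack, '**/.git'))          # excluding .git wholesale catches it
```

Under this model, excluding `**/.git` (or adding `**/modules/*/objects`) would have covered the submodule object stores as well.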
[01:45:08] thanks so much LeslieCarr [01:45:35] New patchset: Tim Starling; "Moved scap scripts to their own directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44525 [01:45:41] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44525 [01:45:47] New patchset: Tim Starling; "Network-aware scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44526 [01:45:53] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44526 [01:46:02] New patchset: Tim Starling; "Fix failure to write to ExtensionMessages-*.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44548 [01:46:08] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44548 [01:50:07] PROBLEM - profiler-to-carbon on professor is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [01:57:29] RECOVERY - NTP on cadmium is OK: NTP OK: Offset -0.04393815994 secs [02:04:30] AaronSchulz: stats/1.21wmf8 job-insert-duplicate is what's breaking things [02:04:32] mutante: I'm assuming you're currently working on getting back Wikidata's AAAA record [02:05:25] Jasper_Deng: i never touched DNS for wikidata [02:05:36] just apache/mw config [02:05:43] are you having issues with dns Jasper_Deng ? [02:05:47] mutante: hhm... if you're dropping www, the base domain seems to be missing AAAA [02:05:55] the www.* still has it [02:06:25] AaronSchulz: wfIncrStats( 'job-insert-duplicate', [02:06:26] count( $rowSet ) + count( $rowList ) - count( $rows ) ); [02:06:44] only do that if count( $rowSet ) + count( $rowList ) - count( $rows ) ) > 0 [02:07:16] it's 0 on every call and incrstats(0) isn't currently allowed [02:07:29] let's not focus on other broken things (tickets for that!) - let's just focus on getting the site to its previous stable state [02:07:51] hrm.. 
maybe i should fix that [02:07:59] Jasper_Deng: wikidata.org zonefile did not change, it has 2 edits, last one on Oct 30th [02:08:05] that must be an existing issue [02:08:09] it is [02:08:58] wikidata.org should be back to prior state now [02:10:49] yep, reverted https://gerrit.wikimedia.org/r/#/c/44558/ and the mw-config too and we restarted bits [02:10:52] bbl [02:11:01] thanks all who helped me [02:11:34] RECOVERY - profiler-to-carbon on professor is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [02:13:44] AaronSchulz: i'm making wfIncrStats(0) not break everything.. but still, wfIncrStats sends a packet on every request instead of just the 2% that get profiling so best not to call it unless its actually incrementing a stat by a positive integer [02:21:10] RECOVERY - carbon-cache.py on professor is OK: PROCS OK: 1 process with command name carbon-cache.py [02:24:27] New patchset: Asher; "fix divide by zero exception when wfIncrStats(0) is called" [operations/software] (master) - https://gerrit.wikimedia.org/r/44564 [02:24:27] New patchset: Asher; "moving carbon-collector pidfile somewhere ephemeral to prevent a stale file post-server crash" [operations/software] (master) - https://gerrit.wikimedia.org/r/23099 [02:25:46] ok, mw prof data is flowing into graphite again [02:26:59] !log LocalisationUpdate completed (1.21wmf7) at Fri Jan 18 02:26:57 UTC 2013 [02:27:09] Logged the message, Master [02:31:32] Bits seems unhappy [02:31:38] Giving 503s [02:31:49] PROBLEM - Apache HTTP on srv248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:32:48] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:34:44] did mutante change the config? 
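The wfIncrStats(0) problem above has two sides: the collector can divide by a reported event count of zero, and the client sends a UDP packet on every call even when incrementing by zero. A sketch of both guards, using a simplified counter/average model (the function names are hypothetical, not the actual MediaWiki or collector code):

```python
def incr_stat(sent_packets, key, count):
    """Client side: only emit a stats packet for a positive increment,
    as suggested above for wfIncrStats()."""
    if count > 0:
        sent_packets.append((key, count))

def mean_sample(total, event_count):
    """Collector side: guard the average so a zero count cannot divide by zero."""
    return total / event_count if event_count else 0.0

packets = []
incr_stat(packets, 'job-insert-duplicate', 0)  # silently dropped
incr_stat(packets, 'job-insert-duplicate', 3)
print(packets)
```

Both guards are worth having: the collector-side one for robustness against any client, the client-side one to avoid paying packet overhead on every request for a no-op.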
[02:36:26] Not for bits I don't think [02:37:38] I get a 404 trying http://bits.wikimedia.org/en.wikipedia.org/load.php from the backend [02:38:26] I guess it is rewritten by varnish [02:40:16] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.652 second response time [02:40:52] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.977 second response time [02:44:38] those two weren't in ganglia apparently [02:47:47] it's an overload on the backends [02:48:57] !log restarted bits apaches due to overload, were totally down [02:49:08] Logged the message, Master [02:50:35] !log LocalisationUpdate completed (1.21wmf8) at Fri Jan 18 02:50:35 UTC 2013 [02:50:46] Logged the message, Master [02:53:03] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:28] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:37] PROBLEM - Apache HTTP on srv248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:55] PROBLEM - Apache HTTP on mw60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:16] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.892 second response time [02:56:28] Is there no pmtpa bits app server group on ganglia? 
[02:56:37] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.341 second response time [02:57:15] no [02:57:24] none of them are in ganglia, but they are overloaded [02:57:31] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.924 second response time [02:58:04] looks like mostly mobile requests [02:58:27] probably a batch of mobile requests, then it's overloaded and varnish depools it [02:58:42] GET /w/load.php?debug=false&lang=en&modules=mobile.startup%2Csite%2Cproduction-only%7Cmobile.device.operamini&only=styles&skin=mobile&version=1358476092&* [02:58:48] hard to tell if it's normal [03:00:07] I wonder if we've got the point we can't actually clear the mobile varnish cache on demand [03:00:21] they have an expires header and public caching header [03:00:31] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.356 second response time [03:03:49] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:05:03] there's no bans on any of the varnishes [03:06:13] PROBLEM - Apache HTTP on srv248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:56] ok, we have to fix this now [03:07:56] we need someone who knows about varnish [03:08:03] and/or mobilefrontend [03:08:30] Asher might be best. Want me to text him? [03:08:50] ok [03:10:07] PROBLEM - Apache HTTP on mw60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:28] PROBLEM - Apache HTTP on mw61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:46] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.296 second response time [03:13:16] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [03:13:42] notpeter: Ryan_Lane Who likes varnish? 
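The failure mode guessed at above, where a burst of misses overloads a backend and varnish then depools it, can be modeled with a toy health probe. This is a sketch of the general idea, not Varnish's actual probe logic: a backend goes sick after `threshold` consecutive failed probes and is pooled again once a probe succeeds:

```python
class Backend:
    """Toy health model: sick after `threshold` consecutive failed probes."""

    def __init__(self, name, threshold=3):
        self.name = name
        self.threshold = threshold
        self.recent = []  # most recent probe results, True = OK

    def probe(self, ok):
        self.recent.append(ok)
        self.recent = self.recent[-self.threshold:]  # sliding window

    @property
    def healthy(self):
        # stay pooled until the whole window is failures
        return self.recent.count(False) < self.threshold

srv248 = Backend('srv248')
for _ in range(3):
    srv248.probe(False)   # socket timeouts under overload
print(srv248.healthy)     # depooled
srv248.probe(True)        # apache recovers
print(srv248.healthy)     # repooled
```

The oscillation this produces (depool, load shifts, recovery, repool, overload again) matches the alternating PROBLEM/RECOVERY alerts in the log.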
[03:14:36] the bits backend is mostly down, apparently because MF is overloading it [03:14:44] I would like to just block all MF bits requests [03:16:23] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cache_miss&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [03:16:26] cache miss rate [03:17:19] PROBLEM - Apache HTTP on mw60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:31] Just tried Tomasz and Arthur too [03:18:13] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.332 second response time [03:18:49] PROBLEM - Apache HTTP on srv248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:20:31] TimStarling: MF=mobile frontend? [03:20:37] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.927 second response time [03:20:38] yes [03:20:38] yup [03:20:48] Reedy: yo [03:20:49] what's going on? [03:20:50] can they roll back whatever the fuck they rolled out? [03:20:58] and yes, blocking mobile to save the site is reasonable [03:21:04] what is happening? [03:21:15] referer match *.m.wikipedia.org [03:21:25] or URL match mobile.startup [03:21:48] let's get the site up first and then worry about fixing MF [03:22:05] 503 (Service Unavailable) on load.php [03:22:07] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.937 second response time [03:22:11] Krinkle: We know. [03:22:28] notpeter: I will do it another way if you don't do it in varnish [03:22:29] Interesting, I worked upto this very second I posted the error [03:22:38] TimStarling: go for it [03:22:42] I just walked in on this [03:22:44] no context [03:22:53] and varnish isn't my strong suit [03:24:28] !log attempting to block MF in bits apache configuration [03:24:39] Logged the message, Master [03:29:31] getting some random successes now, most of requests still 503'ing. 
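The emergency block proposed above matched mobile traffic either by Referer (*.m.wikipedia.org) or by the mobile.startup module in the request URL; the actual patch went into live-1.5/load.php in PHP. A stand-in sketch of just the matching logic, not the real patch:

```python
from urllib.parse import urlparse, parse_qs

def is_mobile_bits_request(url, referer=None):
    """Stand-in for the emergency block: match by Referer host suffix
    or by a mobile.startup entry in the modules= parameter."""
    host = urlparse(referer).hostname if referer else None
    if host and (host == 'm.wikipedia.org' or host.endswith('.m.wikipedia.org')):
        return True
    modules = parse_qs(urlparse(url).query).get('modules', [''])[0]
    return 'mobile.startup' in modules.split(',')

url = ('http://bits.wikimedia.org/de.wikipedia.org/load.php'
       '?lang=de&modules=mobile.startup%2Csite&only=styles&skin=mobile')
print(is_mobile_bits_request(url))
```

Matching on a suffix of the parsed hostname rather than a substring of the raw header avoids false positives, and checking the `modules` parameter catches requests whose Referer was stripped.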
[03:29:37] PROBLEM - Apache HTTP on srv248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:46] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.121 second response time [03:30:10] !log tstarling synchronized live-1.5/load.php [03:30:13] i am preparing to roll back MF to where it was before today's deployment [03:30:25] Logged the message, Master [03:31:13] awjr: where are we at? [03:31:16] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time [03:31:45] tfinc_: i still have no idea what happened, sounds like TimStarling is attempting to shut down the mobile site to bring things back up; in the meantime i am preparing to roll back what was deployed today [03:31:52] i just got on a couple of minutes ago [03:32:01] awjr: do you need anyone else online to help? [03:32:13] tfinc_: i shouldn't [03:32:18] rolling back should be no problem [03:32:25] awjr: k. lets chat after you roll back [03:32:30] yup [03:32:48] Reedy: thanks for the txt message about the issue. I'm glad our Reedy monitoring service is working [03:33:01] srsly [03:33:09] here is the problem: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cache_miss&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [03:33:15] you see the spike in cache misses [03:33:22] that overloaded the bits apaches (which are not in ganglia) [03:33:35] * tfinc_ looks up the auth info for ganglia [03:33:44] the spike on the ones other than arsenic [03:33:54] the spike on arsenic is probably a measurement error [03:34:47] ok, I think it is fixed now [03:35:49] varnishadm reports that all backends are healthy [03:35:54] thanks TimStarling [03:36:20] this is presumably because I blocked MF, so you still have a bit of work to do tfinc ;) [03:36:40] TimStarling i've updated the prod branches to point to MobileFrontend pre-today's deployment; shall i sync it out?
[03:37:03] ok [03:38:33] !log awjrichards synchronized php-1.21wmf7/extensions/MobileFrontend 'Roll back today's MobileFrontend deployment to 1ded0894cdbc089370a58ba41aeb4db53e5c7345' [03:38:45] Logged the message, Master [03:38:57] * tfinc_ pulls up tweetdeck to see what people are saying [03:39:50] !log awjrichards synchronized php-1.21wmf8/extensions/MobileFrontend 'Roll back today's MobileFrontend deployment to 1ded0894cdbc089370a58ba41aeb4db53e5c7345' [03:39:52] ok, now we'll need to flush the mobile varnish cache [03:40:00] Logged the message, Master [03:40:05] yeah, I have a feeling that is how this started [03:40:14] how did you flush the mobile varnish cache exactly? [03:40:21] that doesn't seem likely if the problem was with bits [03:40:26] one sec - there are docs [03:40:48] http://wikitech.wikimedia.org/view/MobileFrontend#Flushing_the_cache [03:41:48] it has also been a regular part of our deployment process for pretty much the last year [03:42:46] ok [03:43:30] I'll run it then [03:44:35] seems to be the same [03:44:47] can you start tracing the bits requests? [03:45:05] i'm surprised that https://gdash.wikimedia.org/dashboards/reqmobile/ seems unaffected [03:45:13] woosters: yo [03:45:19] hi [03:45:28] here is a URL: http://de.wikipedia.org/w/load.php?debug=false&lang=de&modules=mobile.startup%2Csite%2Cproduction-jquery%2Cproduction-only%2CfilePage%7Cjquery.hidpi%7Cmediawiki.hidpi%7Cmobile.device.iphone&only=styles&skin=mobile&version=1358480271&* [03:45:46] Referer: http://de.m.wikipedia.org/w/index.php?title=Datei:Florian_silbereisen_mannheim.jpg&filetimestamp=20080713170031 [03:45:46] woosters: awjr is rolling back todays deployed after we got reports of an increase in cache misses [03:45:48] is this a normal bits URL for MF? [03:45:49] deployment* [03:46:24] hmm [03:46:33] hmm .. 
we have not made much infrastructure changes lately [03:46:38] that looks right [03:46:55] it hink [03:46:58] *i think [03:47:07] FYI Asher is out for dinner without his laptop [03:47:07] that's the backend URL [03:47:18] the frontend URL would have bits as its host [03:47:23] right [03:47:37] that looks right to me, but i also don't typically look very closely at those [03:48:57] TimStarling: was that url pulled from a request pre roll back? [03:49:09] no, after [03:49:47] before and after, the responses had appropriate caching headers [03:50:50] I can try disabling the error message temporarily to see if the requests start getting cached [03:50:59] the error message will be disabling caching at the moment [03:54:00] so do i understand correctly that something changed that cause bits requests coming from MobileFrontend to skip the cache? [03:54:47] probably [03:55:19] do bits resources still get cached for logged in users? [03:55:19] the number of misses went up, I haven't completely confirmed that that was due to a reduction in cache hit rate, but it's the most likely explanation [03:56:52] and is there any reason bits would cache variably by protocol? [03:57:12] it appears likely that they do get cached [03:57:37] sending a Cookie header does not suppress caching [03:57:45] and I see cached responses in my browser [03:57:46] hmm [03:58:00] the backends are sending responses with appropriate headers [03:58:07] and varnish appears to be respecting them [03:58:25] I confirmed that the backends were sending appropriate cache headers before the rollback [03:58:37] the biggest user facing change that went out today was a watchlist star on articles that, when clicked, would prompt the user to log in. when going to the login form, the user would get automatically directed to https (from http) [03:59:13] so there was probably an increase in https usage as well users logging in via the mobile site [03:59:40] but if protocol and cookies dont have bearing on bits cache... 
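The checks being talked through above (do the backend's response headers permit a shared cache to store the response, noting that a request Cookie alone does not forbid it in this setup) can be sketched as a simplified predicate. Real HTTP caching semantics (RFC 7234) have many more cases, so treat this as an illustration only:

```python
def shared_cache_may_store(response_headers):
    """Simplified check: private/no-store/no-cache forbid storage; otherwise
    an explicit freshness lifetime (s-maxage, max-age, or Expires) allows it."""
    cc = response_headers.get('Cache-Control', '').lower()
    directives = {part.strip().split('=')[0] for part in cc.split(',') if part.strip()}
    if directives & {'private', 'no-store', 'no-cache'}:
        return False
    return bool(directives & {'s-maxage', 'max-age'}) or 'Expires' in response_headers

print(shared_cache_may_store({'Cache-Control': 'public, max-age=300'}))
print(shared_cache_may_store({'Cache-Control': 'private'}))
```

This also makes the diagnostic point from the log concrete: cacheability here is a property of the response headers, so neither the request protocol nor a Cookie header would by itself explain a miss-rate spike.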
[04:00:49] you guys caught me right as i was walking in the door from a bike ride - im sweating and freezing - brb while i dry off and put some warm clothes on [04:03:22] I can confirm that the miss rate increased [04:03:45] back [04:04:19] and the problem started at 02:25 [04:05:26] so quite a long time after the MF deployment [04:05:33] that was at least an hour after [04:06:44] actually, more like ~45 minutes [04:15:13] how many different bits URLs can MF cause requests for? [04:15:25] a lot [04:15:44] a million? a billion? [04:16:06] ok probably not that many [04:18:00] the text caches show nothing special at 02:25: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=Mobile+caches+eqiad&h=cp1041.eqiad.wmnet&v=767635739&m=cache_miss&jr=&js=&vl=N&ti=Cache+misses [04:18:02] there are unique permutations that come from different device types (although this is constrained to probably less than 10), then permutations based on stable/beta/alpha version of the site, then permutations based on functionality around special pages; probably somewhere in the three-digit range [04:18:15] maybe a small spike [04:18:45] yeah; the spike prior would've been the cache flush that leslie did [04:19:08] oh, although there is another funny blip just after [04:19:17] so if we were getting 1800 misses per second then you'd expect the cache to be full pretty quickly if there were only 1000 URLs to cache [04:19:53] i guess unless there was something generating an unusual number of unique URLs [04:20:28] is anyone in here familiar with how event logging works? [04:21:14] i think event logging might actually make a request of bits… and i believe event logging was enabled around the watchlist star clicks [04:21:28] hopefully there are logs on locke [04:23:10] ori-l: ping [04:23:32] I guess the Ganglia username/password isn't "ganglia"/"password".
[04:24:12] sorry, the password is random [04:24:25] ok, no such luck, no logs on locke [04:26:15] wmf-config/CommonSettings.php: $wgEventLoggingBaseUri = '//bits.wikimedia.org/event.gif'; [04:26:28] nope [04:26:49] I saw none of that [04:26:53] ok so i think EventLogging loads a 1px gif or something from bits, but im not sure if the request URL has other business appended to it [04:26:54] oh [04:26:54] ok [04:27:06] just a lot of MF-looking requests? [04:27:59] I don't suppose there's an easy diff between yesterday's MF code and today's? [04:28:10] sure, git diff : [04:28:10] Would that be viewable in a branch? [04:28:14] I'm looking through my scrollback to find you some tcpdump output that was definitely pre-rollback [04:28:35] Susan I have been slowly looking through it [04:28:42] Okay. [04:31:04] sorry, all gone [04:31:28] but I can take out the error message temporarily, that will reproduce it pretty quick [04:31:52] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [04:32:25] maybe I can enable some access logs first [04:32:46] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [04:33:33] time to read the varnish manual [04:33:47] always a great time to read manuals, when the site is half down [04:37:07] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [04:37:33] well turning back on bits for MF might not reproduce the problem if MF is to blame since I rolled back [04:39:31] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [04:40:20] awjr: how stable are we? [04:42:18] TimStarling did you flush the mobile varnish cache?
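awjr's guess above is that EventLogging requests a 1px gif from bits with event data appended to the request URL. A sketch of how such a beacon URL might be built; only the base URI comes from the wmf-config line quoted above, and the serialization details here are assumptions for illustration:

```python
import json
from urllib.parse import quote

# From wmf-config above: $wgEventLoggingBaseUri = '//bits.wikimedia.org/event.gif'
BASE_URI = '//bits.wikimedia.org/event.gif'

def beacon_url(event):
    # The exact serialization is an assumption; the relevant point is
    # that every distinct payload yields a distinct URL, and therefore
    # a distinct cache key on bits.
    return BASE_URI + '?' + quote(json.dumps(event, separators=(',', ':')))

click = beacon_url({'schema': 'MobileWatchlist', 'action': 'click'})
view = beacon_url({'schema': 'MobileWatchlist', 'action': 'view'})
```

If every click produces a unique URL like this, the beacon traffic shows up as cache misses, which is why it came up as a suspect for the miss-rate spike.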
i appear to still be getting cached content (which is probably fine right now, but just curious) [04:42:46] tfinc_: as far as i can tell we're still serving cached content from mobile varnish cache, but if a user misses the cache they'll get no styles or js atm [04:42:55] k [04:42:59] root cause is still being diagnosed [04:43:54] tfinc_: but in spite of no js/css, still usable [04:45:10] TimStarling: also is there somewhere i can see a big blob of logged requests to bits? [04:47:06] i finished a quick review of the diffs from pre/post deploy today for MobileFrontend, and nothing is jumping out at me as scary [04:47:40] with nothing else to look at though, i'll go back through and look closer [04:48:24] sorry the power just went off here [04:48:34] oh great [04:48:40] I had to move the UPS to get back online [04:48:56] jeez [04:49:03] yes I flushed the mobile varnish cache, I think [04:49:15] I followed the instructions on that wikitech page [04:49:17] hmm [04:49:59] in case I disappear: I blocked MF by editing /home/wikipedia/common/live-1.5/load.php [04:50:07] if you look in that file you will see my patch there [04:50:20] if you take it out and do sync-file live-1.5/load.php, it will be back to normal [04:50:26] oh interesting, it seems i am still getting stuff back from bits [04:50:32] (from MF requests) [04:50:40] ok [04:50:49] I was just about to try taking it out temporarily, when the power went out [04:50:59] I figured out how to do varnish access logging [04:51:00] great timing [04:51:04] ok [04:51:41] it's 43C outside today, I hope the power's not off for long [04:51:51] otherwise I might relocate this ops session to the beach [04:51:57] heh [04:52:30] so your change did not appear to block all bits requests from MobileFrontend [04:53:40] just those with a referer of *.m.wikipedia.org [04:54:34] or maybe en.m.wikipedia.org/* because it looks like referers of http://en.m.wikipedia.org are still getting through [04:55:55] no, your patch looks like it should
match.. [04:56:35] anyway, if you take out that patch, you should flush the mobile varnish cache again [04:57:07] it shouldn't need it [04:57:07] actually, i take that back, it should be fine [04:57:07] yeah [04:57:14] a 503 response shouldn't be cached [04:57:23] yeah [04:57:34] X was a bit confused about my sudden change of screen resolution, brb [04:59:55] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.945 second response time [04:59:58] back [05:01:37] ok, let's try this... [05:04:09] !log tstarling synchronized live-1.5/load.php 're-enabling bits for MF' [05:04:19] Logged the message, Master [05:05:16] cool [05:05:32] they seem to be still up [05:05:59] if whatever we deployed today is to blame, we shouldn't see anything spiraling out of control [05:06:42] I'm not seeing any spiralling [05:07:10] there's no ganglia on the bits backends but I have a vmstat 5 running, it showed a small spike when I did the deployment, from 19% to 25%, then back down [05:07:54] TimStarling: I'm intermittently getting an infinite loop of 301's when trying to load my watchlist on wikidata even though the change was supposed to be reverted now. [05:08:40] with host www.wikidata.org? 
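Tim's point above, that a 503 response shouldn't be cached, is why re-enabling bits didn't require another varnish flush. A toy cache illustrating that rule (a deliberate simplification; real Varnish behaviour is governed by VCL and response headers):

```python
# Toy cache illustrating the rule discussed above: successful responses
# are stored, but a 503 passes through without being saved, so once the
# backend recovers, the very next request repopulates the cache.
class TinyCache:
    def __init__(self):
        self.store = {}

    def fetch(self, url, backend):
        if url in self.store:
            return self.store[url], 'hit'
        status, body = backend(url)
        if 200 <= status < 300:      # only cache successes
            self.store[url] = (status, body)
        return (status, body), 'miss'

cache = TinyCache()
down = lambda url: (503, 'Service Unavailable')
up = lambda url: (200, 'window.mw = ...')

first, how1 = cache.fetch('/load.php', down)   # 503 served, not stored
second, how2 = cache.fetch('/load.php', up)    # backend recovered
third, how3 = cache.fetch('/load.php', up)     # now served from cache
```

Because the error responses never enter the cache, removing the blocking patch is enough; no stale 503s linger to be flushed.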
[05:08:47] no, w/ no www [05:09:14] (it seems to have changed to that again, though before the switch I often got that looping) [05:09:41] I'm getting it again [05:09:55] looks like it's only limited to specific servers [05:27:31] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [05:28:34] !log running sync-apache and gracefulling apaches to hopefully fix reports of occasional wikidata redirect issues [05:28:44] Logged the message, Master [05:29:21] http://www.bom.gov.au/products/IDN60901/IDN60901.94777.shtml [05:29:39] 44C outside [05:29:48] not quite time to open a window yet [05:30:15] the thermal mass in the house still has a bit left to give [05:30:50] that's 111F for the americans [05:31:11] TimStarling: that sounds like tucson summer [05:32:02] i drove my wife to the airport in phoenix back in july - we got there ~10pm, and i looked at the thermometer in the car - it read 109F outside; i couldn't believe it was still that hot that late [05:32:57] TimStarling: I think your fix is working [05:33:01] thanks a lot [05:33:04] TimStarling: the only other things that i can tell that happened around 02:25 was: [05:33:05] 8:20 [05:33:05] binasher [05:33:06] 8:20 [05:33:07] [02:25:46] ok, mw prof data is flowing into graphite again [05:33:08] awjr: remind me not to go there in july ;) [05:33:11] 8:20 [05:33:13] logmsgbot [05:33:13] 8:20 [05:33:14] [02:26:57] !log LocalisationUpdate completed (1.21wmf7) at Fri Jan 18 02:26:57 UTC 2013 [05:33:15] heh [05:33:31] is there any reason data flowing to graphite would have an impact? (i know nothing about it) [05:33:44] unlikely [05:34:12] the timing implies LU but it's hard to see how [05:34:22] LU?
[05:34:37] LocalisationUpdate, it ran very close to when the spike started [05:34:41] ah yeah [05:35:13] maybe it caused an update in the versions in the startup module [05:35:23] and then that caused the spike [05:35:31] I guess it's not entirely impossible [05:36:57] * TimStarling reconfigures the wifi so he can move downstairs [05:37:01] TimStarling: I think it's the timestamp query string on asset URLs that is busting the cache [05:37:05] see #-tech. [05:37:31] that would've caused the spike earlier though [05:37:37] (what ori-l is referring to) [05:37:57] and we probably would've seen this problem before, since that's nothing new for us [05:38:41] ok, well if we're fine for now, im going to go eat. i'll keep lurking on irc for another hour or so [05:38:53] and i'll have my mobile on in case something explodes again [05:39:36] oh ok [05:39:46] TimStarling: im otherwise happy to help keep digging tomorrow [05:39:47] enjoy dinner, ttyl [05:40:18] see ya, thanks for your help ori-l and for your heroics TimStarling [05:45:46] awjr: send a quick update to mobile-tech before you sign off [05:46:03] tfinc_: yep, im already in the middle of writing it [05:48:19] TimStarling: Reedy: hey, just got home. any issues with mobile still? [05:48:48] binasher: no; tim noticed that load was tapering off even before you blocked requests [05:49:02] i noticed that image requests have a timestamp appended to them in a query string [05:49:06] i didn't block anything [05:49:14] whomever did, sorry [05:49:26] everything seems fine now [05:49:37] i actually have no idea about anything that was going on :) i was out til just now [05:49:39] I might go offline to conserve power in the UPS [05:50:00] good good, sorry i wasn't around to help when paged earlier [05:50:01] TimStarling: ok. I think I can fix this. 
[05:50:13] you can send me an SMS if there's a problem [05:50:28] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [05:50:35] timestamp appears to be fixed, generated by server: see '2013-01-18T01:38:20Z' in http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery.client%2Ccookie%2CmwExtension%7Cmediawiki.cldr%2CjqueryMsg%2Clanguage%2Cnotify%2Cutil%7Cmediawiki.language.data%2Cinit%7Cmediawiki.libs.pluralruleparser%7Cmobile.alpha%2Cproduction-jquery%2Cstartup%7Cmobile.beta.jquery&skin=vector&version=20130118T054146Z&* [05:50:46] binasher: the power is off here, presumably because it's very hot and everyone wants to use their A/C [05:50:52] i think that busted browser caching [05:51:02] that would explain why requests spiked but gradually tapered off [05:51:25] Tim-away: hope it doesn't last too long, ttyl [05:52:10] i think the thinking was: we purge the varnish cache for static assets on every deploy anyway, the fact that we're adding a timestamp to the querystring won't matter [05:52:31] but that doesn't take into consideration browser cache [05:52:41] and bits is setting far future expires on those images [05:52:51] they purge the varnish cache for everything on every deploy :/ [05:53:15] right, but browsers don't know that; they have the image cached locally with the instruction not to expire it for another month [05:53:31] unless of course the URL changes [05:53:45] which it did, for everything [05:53:58] well, specifically images in CSS [05:54:07] * referenced in [05:55:05] or do you mean that they've always appended a timestamp and that they renew it on every deploy? [05:58:34] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [06:23:31] ori-l: i think they've been appending a timestamp for certain assets lately, but i wasn't aware it was on images as well. 
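ori-l and binasher's theory above can be made concrete: browsers key their cache on the full URL, query string included, and bits serves those images with far-future expiry, so rewriting the ?version= timestamp on a deploy invalidates every client's local copy at once. That produces a spike of origin fetches that then tapers off as browser caches refill, matching what was observed. A minimal sketch:

```python
# Sketch of the failure mode described above: a browser cache keys
# entries on the full URL, query string included. A far-future Expires
# header means entries never age out on their own, but a fresh
# ?version= timestamp changes every key, so after a deploy every
# client hits the origin again, then traffic tapers as caches refill.
browser_cache = set()

def request(url):
    if url in browser_cache:
        return 'browser-hit'
    browser_cache.add(url)
    return 'origin-fetch'

asset = 'http://bits.wikimedia.org/skins/images/star.png'  # illustrative path
r1 = request(asset + '?version=20130117T000000Z')  # cold: fetch once
r2 = request(asset + '?version=20130117T000000Z')  # warm: served locally
r3 = request(asset + '?version=20130118T013820Z')  # redeploy: new key, refetch
```

A content-hash version (which only changes when the asset actually changes) avoids this, whereas a deploy-time timestamp changes for everything whether or not it changed.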
they purge the entire mobile varnish cache every time though (not via url change, by "ban .") [06:24:02] it's the equivalent of destroying the squids every iteration [06:24:40] that timestamp is hardcoded in the article html [06:24:54] yeah [06:25:06] '2013-01-18T01:38:20Z' [06:25:09] the day the bits stood still [06:25:46] i'm not sure if their goal is to bust browser cache, or if it's how they're trying to manage bits [06:26:35] i wish i was on then! i would have liked to observe the drowning of the bits [06:27:32] heh [06:28:02] i'm trying to figure out if the timestamp is added by mobilefrontend or if it's added by resourceloader erroneously because MF is doing something it isn't expecting [06:37:34] PROBLEM - Puppet freshness on db10 is CRITICAL: Puppet has not run in the last 10 hours [06:40:27] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [07:19:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:19:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:44:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:46:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.277 seconds [08:22:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds [08:40:09] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 190 seconds [08:40:09] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 190 seconds [08:41:39] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [08:41:39] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [08:41:39] PROBLEM
- Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [08:41:39] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [08:50:48] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds [08:50:57] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 187 seconds [09:47:30] New review: preilly; "@Tim Starling ? Have you tested this in production yet?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44526 [10:03:05] wow, this was quite the night [10:04:40] Is it possible for anyone to clear the cache on bits for /www.wikidata.org ? [10:04:59] It's serving 301s to wikidata.org/load.php [10:05:26] preilly: I'm going to do it now [10:05:39] eg. https://bits.wikimedia.org/www.wikidata.org/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector&* [10:05:47] I was just about to do it when I was distracted by bits being down [10:05:49] TimStarling: Okay that's great [10:06:01] TimStarling: That change set looks really promising [10:06:36] TimStarling: Also I wanted to let you know that Facebook is coming to the office of the 27th of February to talk about hhvm [10:22:23] !log tstarling Started syncing Wikimedia installation... : [10:22:32] Logged the message, Master [10:26:03] TimStarling: Are you running a timer? [10:26:51] it didn't work anyway [10:28:00] TimStarling: Oh no what happened? [10:28:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:31] bugs [10:31:08] TimStarling: hey Tim :-] Remember the 'exit code 139' under PHPUnit? I am pretty sure it turns out to be a bug in PHP. 
I found an upstream report: https://bugs.php.net/bug.php?id=63055 [10:31:21] TimStarling: and reapplied your nice patch which disappeared when I have upgraded PHPUnit [10:31:39] nice to know [10:31:45] and [10:32:03] I manage to found out all the command you used to get the nice backtrace :-] [10:32:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.195 seconds [10:32:05] managed [10:32:08] \O/ [10:34:45] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [10:39:38] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds [10:40:04] someone broke sudoers again [10:40:05] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 205 seconds [10:44:08] PROBLEM - Puppet freshness on mw1004 is CRITICAL: Puppet has not run in the last 10 hours [10:45:02] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [10:45:02] PROBLEM - Puppet freshness on mw1014 is CRITICAL: Puppet has not run in the last 10 hours [10:45:02] PROBLEM - Puppet freshness on mw1007 is CRITICAL: Puppet has not run in the last 10 hours [10:45:29] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [10:46:05] PROBLEM - Puppet freshness on mw1015 is CRITICAL: Puppet has not run in the last 10 hours [10:47:08] PROBLEM - Puppet freshness on mw1009 is CRITICAL: Puppet has not run in the last 10 hours [10:48:02] PROBLEM - Puppet freshness on mw1005 is CRITICAL: Puppet has not run in the last 10 hours [10:48:02] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [10:51:05] New patchset: Tim Starling; "Fix new scap sudoers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44588 [10:52:06] PROBLEM - Puppet freshness on mw1013 is CRITICAL: Puppet has not run in the last 10 hours [10:52:06] PROBLEM - Puppet freshness on mw1006 is CRITICAL: Puppet has not 
run in the last 10 hours [10:52:06] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours [10:52:13] mark: Our www-less domains are A records pointing to pmtpa load balancers, while domains with www (and other subdomains) are CNAMEs to eqiad load balancers. For instance, wikipedia.org resolves to 208.80.152.201 (pmtpa), but www.wikipedia.org resolves to 208.80.154.225 (eqiad). Is this known / intended / a problem for the eqiad migration? [10:53:12] * preilly thinks RoanKattouw should be sleeping [10:53:27] Yes, I should be [10:53:30] And so should you ;) [10:53:40] RoanKattouw: heh heh [10:54:28] New patchset: Tim Starling; "Fix new scap sudoers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44588 [10:55:02] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44588 [10:56:55] RoanKattouw: yes known, not a problem [10:57:08] OK [10:57:13] I ran into it when commenting on https://bugzilla.wikimedia.org/44097 [10:57:59] our new DNS system should make that easier to fix [10:58:05] PROBLEM - Puppet freshness on mw1012 is CRITICAL: Puppet has not run in the last 10 hours [10:58:05] PROBLEM - Puppet freshness on mw1008 is CRITICAL: Puppet has not run in the last 10 hours [10:58:11] TimStarling: ha ha ha — sudo-seeking missile will find its target... [10:58:29] New review: preilly; "@TimStarling: ha ha ha ? sudo-seeking missile will find its target... priceless." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44588 [10:59:08] PROBLEM - Puppet freshness on mw1011 is CRITICAL: Puppet has not run in the last 10 hours [11:03:00] TimStarling: so what's the state of things? [11:04:45] re bits or scap? [11:05:07] i don't know what would be up with bits [11:05:13] so probably both ;) [11:05:23] the backend was overloaded for about 1.5 hours [11:05:41] due to mobile deploy? 
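Roan's observation above (bare domains are pmtpa A records while www names are CNAMEs into eqiad) stems from a DNS constraint: a CNAME is not permitted at a zone apex alongside the SOA and NS records, so the bare name must carry a literal A record and has to be repointed separately when traffic moves. An illustrative zone fragment, where the CNAME target name is hypothetical and the IPs are the ones quoted in the discussion:

```
; Illustrative zone fragment; the CNAME target name is hypothetical,
; the IP addresses are the ones quoted in the discussion above.
wikipedia.org.      IN  A      208.80.152.201                     ; pmtpa LVS
www.wikipedia.org.  IN  CNAME  wikipedia-lb.eqiad.wikimedia.org.  ; -> 208.80.154.225
```

This is also why paravoid later mentions gdnsd's DYNA records: they let the apex answer with a dynamically chosen A/AAAA record instead of a hardcoded one.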
[11:06:01] I didn't think it was a great time to read the varnish manual so I ended up hacking MediaWiki to block requests with a MobileFrontend referer, early in the startup process [11:06:16] it was fast enough to get the rest of the site back up [11:06:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:33] hehe [11:06:47] it happened about 45 minutes after the mobile deploy, last I checked, awjr and tomasz were discussing why it happened [11:06:58] probably something to do with RL version numbers changing [11:07:05] PROBLEM - Puppet freshness on mw1016 is CRITICAL: Puppet has not run in the last 10 hours [11:07:05] PROBLEM - Puppet freshness on mw1003 is CRITICAL: Puppet has not run in the last 10 hours [11:07:15] it's all back to normal now [11:07:40] so working on bits set me back on scap a bit, that's why I'm working on it now [11:08:12] ok [11:08:39] damn [11:08:57] mobile really needs to get a handle on those cache purges [11:09:02] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [11:09:48] mark: yeah, totally — it's getting ridiculous [11:11:01] New patchset: Tim Starling; "In scap: exclude git submodule objects" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/44592 [11:12:33] we should actually trade bits apaches for mobile apaches [11:12:39] although [11:12:48] i guess mobile uses bits too so it wouldn't have helped here [11:12:51] meh [11:13:07] * mark checks ganglia [11:14:03] bits app servers pmtpa is missing in ganglia [11:14:14] we noticed [11:14:22] I was using vmstat, it was old school [11:15:11] also Reedy had to SMS people, there was no monitoring alert [11:15:32] for what? 
[11:15:56] bits overload [11:16:54] varnish was serving 503s for 1.5 hours [11:17:08] we had loads of people complaining on #wikimedia-tech [11:17:47] * mark fixes ganglia [11:19:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.535 seconds [11:19:50] New patchset: Mark Bergsma; "Fix Bits application servers pmtpa group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44595 [11:20:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44595 [11:21:45] spence is a very special server [11:22:30] TimStarling: How could this have been avoided? [11:23:36] *shrug* maybe a bit more overprovisioning wouldn't hurt [11:24:09] TimStarling: with bits? [11:24:12] but you need the servers in ganglia before you know they are not overprovisioned [11:24:18] yes [11:24:29] they are now [11:24:48] the ganglia manifest is pretty stupid right now [11:24:48] TimStarling: Were they ever in ganglia? [11:24:50] they were [11:25:00] mark: When did they get removed? [11:25:05] * mark runs git blame [11:25:40] dec 27 [11:25:48] mark: who removed them [11:26:10] * preilly realizes that I could have just git blamed as well and now feels really lazy  [11:26:25] peter I think, when preparing the eqiad bits app servers [11:26:31] TimStarling: Did you ever hear back from Rob Richards regarding libxml? 
[11:26:38] mark: argh [11:26:44] he replaced the tampa servers with the eqiad ones [11:26:53] but to his defense, it was already wrong [11:27:12] it said "Bits application servers" instead of the correct "Bits application servers pmtpa" [11:27:24] really that list in gmetad.conf should be autogenerated by the list at the top of ganglia.pp [11:27:25] preilly: yes, it's probably in my court now [11:27:53] he may possibly accept it if it's a configure option, off by default [11:28:23] TimStarling: oh, that's great [11:28:50] TimStarling: Did he comment on: https://bugs.php.net/bug.php?id=63380&edit=1 ? [11:29:32] yes [11:29:42] mark: Are you going to fix ganglia.pp a bit more [11:31:04] TimStarling: Are you going to write the patch to address his concern? [11:31:17] I guess [11:31:19] I've been busy [11:31:24] * preilly realizes I should have read the damn link that I posted before asking a stupid question  [11:32:11] the thought did occur to me [11:32:15] ;) [11:32:18] TimStarling: yeah, I can see that https://gerrit.wikimedia.org/r/#/q/owner:+%2522Tim+Starling%2522,n,z [11:32:53] * preilly feels stupid now but at least I can blame the fact that it's 3:32:48 right now [11:34:12] matthiasmullie: how's it going? [11:34:24] matthiasmullie: Are you feeling any better? [11:34:54] I am, yes :) [11:35:03] matthiasmullie: that's good [11:35:29] matthiasmullie: Have you given anymore thought to nuking DataModel? [11:36:25] TimStarling: Have you looked at: https://gerrit.wikimedia.org/r/#/c/42953/ at all? [11:37:07] not really [11:37:27] just heard rumours of a secret grand jury being assembled to deal with the problem [11:37:29] TimStarling: it's not like you're busy or anything [11:37:40] TimStarling: ha ha ha [11:37:54] TimStarling: Asher was really pissed about Terry's email [11:38:40] preilly: Erik had mailed to discuss it Jan 25th [11:39:03] matthiasmullie: So is your plan NOT to work on it until that time? [11:39:29] preilly: as in add more missing(?) 
server groups, or how the manifest is working [11:39:39] the latter, yes, eventually, it annoys me every time I edit that file [11:39:44] mark: yeah [11:40:02] also the way we setup ganglia aggregators is annoying [11:40:10] preilly: yes, I'll be working on something unrelated until then [11:40:28] mark: gmetad.conf auto-generation would be wonderful too [11:40:37] that's the case already [11:40:39] matthiasmullie: argh [11:40:41] that's what I'm complaining about sort of [11:40:50] it's autogenerated, but from a list in ganglia.pp [11:40:50] mark: ah [11:40:56] it's almost the same as editing the file [11:41:03] I want it autogenerated from the list of clusters in the top of ganglia.pp instead [11:41:04] mark: oh, I see [11:41:13] so it automatically makes them for multiple data centers, automatically chooses the right aggregators etc [11:41:14] mark: yeah, that makes total sense [11:44:10] wow, nfs1 is so fast [11:44:15] it's pumping out like 4MB/s [11:44:24] it puts my 5.25" disk drive to shame [11:44:36] don't mind me, it's late [11:44:57] things always seem funnier when it's late [11:45:30] paravoid: what did you find about the netapps? [11:46:21] i'm having a discussion with ryan about using git vs bittorrent vs multicast [11:46:45] I'd like either bittorrent or multicast for things like l10n, binary blobs, but would prefer git fetch for actual source code distribution [11:46:53] he complains that git fetch is rather slow [11:47:00] TimStarling: ha ha [11:47:02] or, well, not as fast as bittorrent [11:47:09] it's late for preilly too [11:47:18] but my argument is that if the git fetch and checkout (actual deployment) steps would be separate, that wouldn't be a problem [11:47:28] TimStarling: 10:47 PM right? 
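paravoid's argument in this exchange is to separate the slow, network-bound git fetch (run well in advance) from the fast, local checkout (the actual deployment). A sketch of the two phases; the repo path, branch name, and command wrappers are illustrative, not an actual deployment tool:

```python
# Sketch of the two-phase deploy argued for above: `git fetch` pulls
# objects into the local repo without touching the working tree, so it
# can run any time before the deploy; the deploy itself is then just a
# fast local checkout. Paths and refs here are illustrative.
def prefetch_cmd(repo='/srv/deployment/mediawiki'):
    # Phase 1 (hours in advance): network-bound, safe to run early.
    return ['git', '-C', repo, 'fetch', 'origin']

def deploy_cmd(repo='/srv/deployment/mediawiki', rev='origin/master'):
    # Phase 2 (the deploy): local-only, fast and predictable.
    return ['git', '-C', repo, 'checkout', '--force', rev]
```

With the fetch done ahead of time, git's fetch speed relative to bittorrent stops mattering for the deploy window itself, which is the crux of the argument.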
[11:47:42] you'd do the fetch step for large code deployments well in advance to the actual checkout/deployment [11:48:18] yes, but I was up at 7:10 and went to bed quite late the night before [11:48:29] mark, TimStarling, paravoid: Have you guys looked at: https://github.com/lg/murder ? [11:48:46] I think paravoid came up with that too if i'm not mistaken [11:48:49] and ryan has tested it [11:48:51] WMF people sleep at all sorts of funny times, I figure late = tired is the best definition [11:49:02] TimStarling: Please don't burn yourself out [11:49:17] anything I can help with tim? [11:50:05] don't think so, I'm mostly just waiting for rsync [11:50:28] so the netapps are unusually slow [11:50:32] as for git fetch, ryan tells me there are two problems with it [11:50:43] one is the corrupted blobs issue, which is fixable [11:50:48] yeah [11:50:58] the other is that the localisation cache CDBs will bloat the repo if you put them in it [11:51:06] so I don't want to do that [11:51:11] for binary blobs we should use something else [11:51:16] bittorrent, multicast, whatever [11:51:20] not in git [11:51:30] Aaron suggested a scheme based on putting a JSON representation of the CDBs in git and having MW generate the CDBs on demand [11:51:48] isn't the php source much like that? [11:52:22] well, you need both PHP source and LU output which is also in CDB form [11:52:36] knowing Aaron it'll be finished by Monday unless you tell him not to do it [11:52:44] then we'll be able to test to see if it's faster [11:53:00] hehe [11:53:06] or a C program instead of mediawiki [11:53:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:58:23] !log root Started syncing Wikimedia installation... 
: [11:58:32] !log Mounted nfs1:/home noatime [11:58:33] Logged the message, Master [11:58:42] Logged the message, Master [12:02:58] noatime probably doesn't do anything over nfs, but it doesn't hurt to try [12:03:50] !log root Started syncing Wikimedia installation... : [12:03:59] Logged the message, Master [12:04:19] I'm actually running scap with a few bits commented out to make it run faster [12:04:23] since I've tested those bits already [12:07:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [12:08:01] preilly: indeed, I was the one that pointed Ryan to murder (and Herd, the rewrite in python) [12:08:08] never used it though [12:08:17] just read through the code [12:08:29] you don't need to use it if you read the code [12:08:39] you can execute it in your head, right [12:08:47] ;) [12:09:00] :) [12:09:10] TimStarling: are you aware that bits cached were cleared yesterday too? [12:09:16] some wikidata hiccup or something [12:09:33] that happened around the same time of the mobile deployment [12:09:44] I saw an entry in the .bash_history, I wasn't sure when it was [12:10:05] yeah, it was yesterday, soon after I left to go to bed [12:10:16] no acct installed [12:10:41] !log root Started syncing Wikimedia installation... 
: [12:10:43] 04:10 < mutante> yep, reverted https://gerrit.wikimedia.org/r/#/c/44558/ and the mw-config too and we restarted bits [12:10:51] Logged the message, Master [12:10:52] 2am UTC [12:11:03] right, pretty close to the start then [12:11:05] yeah [12:11:14] from ganglia graphs we figured the start was around 2:25 [12:11:33] mutante was talking to me about it before I left [12:12:02] could have been related [12:12:21] I would say it was most likely related [12:12:21] so MF might have been unrelated, or a combination of the two [12:12:50] the mobile people were scratching their heads to work out what could have been different this time around [12:12:51] I declare anyone clearing caches just like that guilty. [12:12:56] a bits cache clear could have been it [12:13:52] mark: nothing on the netapp, no [12:14:04] I asked Ryan for contacts, he said he only had sales people [12:14:12] er [12:14:14] for what? [12:14:28] i just know you logged in [12:15:10] https://rt.wikimedia.org/Ticket/Display.html?id=4371 [12:16:30] new scap: http://ganglia.wikimedia.org/latest/graph.php?c=Application%20servers%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358511351&g=network_report&z=medium&r=hour [12:16:47] haha [12:16:55] that's a lot of bytes, eh? [12:17:22] amazing [12:17:24] nice job [12:17:43] thanks [12:18:05] paravoid: so basically we need to open a case with netapp [12:18:09] yes [12:18:22] the netapp itself did that [12:18:29] it did call home [12:18:32] but we heard nothing from them [12:18:34] or at least I didn't [12:18:41] case closed [12:18:42] not sure if they sent private mails to someone else [12:19:01] to me at least [12:19:06] oh? [12:19:38] !log root Finished syncing Wikimedia installation... : [12:19:48] Logged the message, Master [12:20:27] reopened the case [12:20:44] what's the subject [12:21:00] NetApp Log # 2003877025 , Case Create Notification [12:21:11] to... 
me [12:21:57] certainly not in my mailbox [12:22:00] Change merged: Tim Starling; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/44592 [12:22:05] can we change contacts to noc@ or something? [12:22:11] New patchset: Tim Starling; "More tweaks to new scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44605 [12:22:30] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44605 [12:25:08] mark: on completely unrelated note, gdnsd has "DYNA" (and "DYNC" for CNAMEs) so A/AAAA georecords would be possible [12:25:31] (re: A record being pointed to pmtpa) [12:25:53] running real scap now [12:26:31] I think I will just leave it running and come back in 30 mins to see if it worked [12:26:49] it's unlikely to take the site down [12:26:56] registered noc@wikimedia [12:27:06] oh! [12:27:07] cool :) [12:27:36] I'm so glad I didn't try to figure it out yesterday night [12:29:00] TimStarling: do we need a post-mortem for yesterday's outage? I saw some things in the backlog that needed fixing, like missing ganglia graphs? [12:29:02] the netapp support site is so annoying and confusing [12:29:13] (plus the "be careful when clearing caches") [12:29:13] ganglia I just fixed [12:29:26] ok [12:29:30] I feel like a manager [12:29:34] it would be nice to know the root cause [12:29:34] saying things and doing nothing [12:29:34] !log tstarling Started syncing Wikimedia installation... : [12:29:35] :P [12:29:44] Logged the message, Master [12:29:50] it probably was root cause [12:29:55] and ideally nagios would complain if bits served 503s for all backend requests [12:30:09] (kidding) [12:31:24] i wonder why all of bits was cleared for a wikidata specific issue anyway [12:31:44] !log tstarling Finished syncing Wikimedia installation... : [12:31:53] Logged the message, Master [12:32:11] judging from my e-mail box, a guy called Tim Starling tried to hack all our servers! 
[12:32:20] isn't he always? [12:32:47] it seems he didn't succeed [12:51:30] New review: Hashar; "here we go!" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/44278 [12:52:05] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44278 [12:52:51] what email? [12:53:03] all the sudo emails [12:53:36] oh, so when sudo says "this incident will be reported", it's actually reporting to you? [12:53:47] TimStarling: so do we have mediawiki deployed on any eqiad boxes yet? [12:54:44] the source is on mw1010 and mw1070 [12:55:00] I'll add it to the rest of them if you like [12:56:44] I'll leave it running in a root screen since it'll probably take ages [12:57:57] !log added eqiad apaches to mediawiki-installation and started scap, it will take a while [12:58:07] Logged the message, Master [12:59:07] paravoid: up to write the puppet wikimedia module README file ? :-D [12:59:08] !log root Started syncing Wikimedia installation... : [12:59:18] Logged the message, Master [12:59:20] there's still a few boxes where sudo is not working, but I was going to leave them since the script recovers well enough [12:59:35] but it will probably spam mark every time someone runs scap [12:59:59] not just mark [13:00:07] if the sudo configuration is under puppet, isn't puppet going to fix on all box ? 
[13:00:10] everyone on the ops team gets them :-D [13:00:28] the sudo configuration is in two places in puppet [13:00:36] but some servers don't have either class [13:00:51] killed scap for now [13:01:31] someone just needs to run sync-common on all those boxes [13:01:39] I put mediawiki-installation back how it was [13:01:53] good night [13:02:04] * hashar waves at Tim [13:02:21] thanks tim [13:02:44] oh, also some of the sudo errors are due to /usr/bin/find-nearest-rsync not existing [13:02:55] sudo considers it a security violation to attempt to run a nonexistent script [13:06:45] * MaxSem reads last night's logs [13:08:09] looks like scap finished? [13:08:11] if I attached the right screen session [13:08:51] oh you killed it, right [13:08:58] New patchset: Hashar; "wikimedia module placeholder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43420 [13:09:00] ok, will work on that [13:09:13] Reedy: ping me when you're around? [13:09:23] New review: Hashar; "removed the useless init.pp" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43420 [13:10:13] Reedy: I'd like to have https://gerrit.wikimedia.org/r/#/c/44463/1 merged today by the end of US day so that I can run it over the (extended) weekend [13:10:24] Reedy: if at all possible :) [13:11:01] <^demon> paravoid: I'll take care of it. [13:11:06] <^demon> I'm about to merge Roan's stuff too. [13:11:34] <^demon> Oh, it's not in master. Won't cherry pick it yet then.
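Tim's point above is that sudo treats an attempt to run a nonexistent command path as a security violation, mailing the configured admins each time — hence the "hack all our servers" inbox flood. A hedged sketch of the kind of pre-flight check that avoids it (`check_helper` is an illustrative helper, not part of the real scap tooling):

```shell
# Hedged sketch: check that a helper script exists and is executable
# before anything tries to sudo it; a missing path would otherwise be
# reported by sudo as a security violation and mailed to the ops team.
check_helper() {
    # prints "ok" if the path is an executable file, "missing" otherwise
    if [ -x "$1" ]; then echo "ok"; else echo "missing"; fi
}

check_helper /bin/sh                      # present on any sane host
check_helper /usr/bin/find-nearest-rsync  # "missing" on the affected boxes
```

Running such a check across the `mediawiki-installation` host list before scap would surface the boxes that still lack the script.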
[13:12:04] oh hi :) [13:12:09] yeah, it looks like it needs a review [13:13:15] ops easy merge https://gerrit.wikimedia.org/r/#/c/43999/ (already applied on gallium) ;-D [13:13:59] New patchset: Hashar; "move PHP linter under `wikimedia` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [13:14:01] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43999 [13:14:20] thx :-) [13:15:43] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:10] !log demon synchronized php-1.21wmf7/includes/MessageBlobStore.php 'Deploying Idc83a0fe' [13:25:21] Logged the message, Master [13:25:36] !log demon synchronized php-1.21wmf7/includes/resourceloader/ResourceLoader.php 'Deploying Idc83a0fe' [13:25:47] Logged the message, Master [13:26:01] !log demon synchronized php-1.21wmf7/includes/resourceloader/ResourceLoaderFileModule.php 'Deploying Idc83a0fe' [13:26:11] Logged the message, Master [13:26:13] <^demon> mark, paravoid: Ok, Roan's read-only fixes for resourceloader are out for 1.21wmf7. 
[13:34:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.885 seconds [13:48:34] PROBLEM - Host lvs6 is DOWN: PING CRITICAL - Packet loss = 100% [13:49:01] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:49:28] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org is UP: PING WARNING - Packet loss = 50%, RTA = 52.69 ms [13:49:52] hm [13:50:04] PROBLEM - Host wikinews-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:50:13] PROBLEM - Host foundation-lb.pmtpa.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [13:50:32] !log Powercycled lvs6 [13:50:42] Logged the message, Master [13:50:58] PROBLEM - Host wikisource-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:50:59] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [13:51:16] PROBLEM - Host wikidata-lb.pmtpa.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.152.218) [13:51:52] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:01] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:02] RECOVERY - Host wikisource-lb.pmtpa.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 128.22 ms [13:52:10] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:11] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:11] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:12] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:12] PROBLEM - LVS HTTPS IPv6 on 
foundation-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:12] PROBLEM - Host wikipedia-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:52:37] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:37] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:37] PROBLEM - LVS HTTPS IPv4 on bits.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:55] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:53:04] RECOVERY - Host lvs6 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:53:22] RECOVERY - Host wikinews-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 16%, RTA = 23.82 ms [13:53:31] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.401 seconds [13:53:31] RECOVERY - Host wikipedia-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 114.57 ms [13:53:40] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.131 seconds [13:54:07] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:25] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 67107 bytes in 9.603 seconds [13:55:37] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.006 seconds [13:55:37] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.022 seconds [13:55:37] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.008 seconds [13:55:37] RECOVERY - LVS HTTPS IPv6 on 
foundation-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.014 seconds [13:55:37] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.016 seconds [13:55:46] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64414 bytes in 0.018 seconds [13:55:46] RECOVERY - Host wikidata-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:56:04] RECOVERY - LVS HTTPS IPv4 on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3846 bytes in 0.007 seconds [13:56:04] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 67113 bytes in 0.028 seconds [13:56:04] RECOVERY - Host foundation-lb.pmtpa.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:56:31] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [13:56:34] !log Set rt_cache_rebuild_count to -1 on lvs2 and lvs6 [13:56:44] Logged the message, Master [13:56:49] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:02:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43478 [14:03:05] I think your nagios installation could use service / host dependencies :-] [14:03:14] there is no point in warning for all those services when the host is dead I guess [14:03:23] (I should ping Leslie about it) [14:03:25] yes there is [14:03:45] those hosts are redundant [14:04:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:06:05] so you get service checks to make sure they have switched from an host to another ? 
[14:06:19] those servers should always be up [14:06:25] if they're not, I want to know about it, unconditionally [14:06:30] services I mean [14:07:25] so Host lvs6 is DOWN is probably enough though receiving pages for all the other services give you an idea about the impact [14:07:48] lvs6 down should not affect anything [14:10:55] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:12:16] PROBLEM - Varnish HTCP daemon on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:14:26] argh [14:14:37] had my phone in the other room [14:14:37] and didn't hear it [14:14:38] dammit [14:15:32] be happy [14:15:39] I should :) [14:15:46] meh [14:15:50] netapp also tried to call me [14:15:57] now I need to call back to work on nas1001-a [14:15:58] hate that [14:16:20] are you asking? [14:16:26] I can do it [14:16:30] no i'm not asking [14:16:35] i'm complaining ;) [14:16:40] ok [14:16:40] why can't they work via email hehe [14:16:45] still, I'm offering [14:16:52] oh you can have it if you want [14:17:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.825 seconds [14:17:24] Loads of companies seem to do that. Email them and get phoned back. I don't remember saying I wanted you to ring me.. [14:17:37] it might make some sense in this case I guess [14:17:46] actually figuring out what the problem is could be faster [14:18:25] true. Still could ask via email if you wanted to do it.. [14:18:35] Reedy: saw my message above? [14:18:43] somewhere in the backscroll [14:18:51] I've a few pings overnight [14:18:59] Will get round to it in a bit :) [14:19:01] mark: I'm no masochist, but I can do it [14:19:28] RECOVERY - Varnish HTCP daemon on cp1043 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:19:43] do you have a now account? 
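hashar's suggestion above — muting per-service alerts while the host carrying them is down — would look roughly like this in Nagios object configuration (host names taken from the log; a sketch only, and note mark's counterpoint that the LVS service IPs are deliberately checked unconditionally because they are redundant):

```
# Hedged sketch of a Nagios host dependency: suppress notifications for
# the dependent host while lvs6 itself is DOWN or UNREACHABLE.
define hostdependency {
    host_name                     lvs6
    dependent_host_name           mediawiki-lb.pmtpa.wikimedia.org
    notification_failure_criteria d,u    ; d = DOWN, u = UNREACHABLE
}
```

Equivalent `servicedependency` objects exist for muting service checks; the trade-off, as argued in the log, is that dependencies hide exactly the "is the redundancy actually working" signal mark wants to keep.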
[14:20:13] no [14:20:23] New patchset: Reedy; "Enable GoogleNewsSitemap on all wikinews projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44621 [14:20:26] <^demon> Reedy: I've reviewed + merged that change to master. So all it needs is cherry picking to the wmf branches. [14:20:35] oh, great [14:20:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44621 [14:20:55] ^demon: thanks! [14:20:59] <^demon> yw :) [14:21:14] !log reedy synchronized wmf-config/InitialiseSettings.php [14:21:23] Logged the message, Master [14:21:36] mark: registering now [14:25:19] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:25:53] mark: done but [14:25:56] Note: Complete access to http://support.netapp.com will be available within one (1) business day of this request. [14:26:08] yeah [14:26:44] getting Unauthorized Access on e.g. 'My Installed Systems' [14:27:31] there's a contact preference on my profile though that has "1) Primary Phone, 2) Alternate Phone 3) Email" [14:27:40] maybe you should go there and pick email :-) [14:28:06] set that, tnx [14:29:24] Can someone kick srv191? There's loads of warnings when writing cache stuff to /tmp. I think Tim did something, but seems it didn't fix it. [14:29:25] drwxr-xr-x 4 root root 4096 Jan 18 14:26 . [14:29:36] i'm guessing that the errors are because only root can write there.. [14:30:12] looking [14:30:39] should be better now [14:30:47] but why did that happen... [14:31:13] !log Corrected srv191:/tmp permissions [14:31:23] Logged the message, Master [14:31:42] paravoid: i suspect netapp will upgrade quickly - they also already activated the noc account [14:32:29] thanks [14:35:04] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:30] Thank you for registering for a NetApp Support Account.
Your access to the [14:35:34] NetApp Support Site has been upgraded.  Please allow 2 hours for the site to [14:35:38] reflect the changes. [14:35:47] That's long replication lag [14:37:15] still getting 401 on most pages [14:38:14] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [14:40:19] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [14:49:08] !log reedy synchronized php-1.21wmf7/maintenance/cleanupUploadStash.php [14:49:17] Logged the message, Master [14:49:32] !log reedy synchronized php-1.21wmf8/maintenance/cleanupUploadStash.php [14:49:42] Logged the message, Master [14:49:48] paravoid: ^ done [14:53:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:27] perfect [14:56:57] * mark is running scap [14:57:24] !log root Started syncing Wikimedia installation... : [14:57:34] Logged the message, Master [14:57:36] !log running /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php on hume [14:57:44] arwiki: Skipped non-stash 0/00/120px-111x8qaq58ro.prgs12.538697.jpg [14:57:46] Logged the message, Master [14:57:58] any idea what is that and why is it skipped? [14:58:06] (lots of those) [15:00:13] !log root Finished syncing Wikimedia installation... : [15:00:23] Logged the message, Master [15:03:00] sigh, "Ran out of captcha images" again [15:03:11] there's no NFS anymore [15:03:14] to fallback to [15:03:19] wee [15:03:21] well, technically there is [15:03:33] but you'll have to revert yesterday's commit that disabled it [15:04:16] so far it doesn't seem to be constantly failing [15:05:17] at least if I log out I see a captcha in Special:CreateAccount [15:05:36] but the frequency of captcha problems is scaring me;] [15:05:54] !log root Started syncing Wikimedia installation...
: [15:06:04] Logged the message, Master [15:07:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.236 seconds [15:11:43] !log root Finished syncing Wikimedia installation... : [15:11:53] Logged the message, Master [15:12:21] 6 minutes, wow ;) [15:14:26] !log Deployed MediaWiki on eqiad image scalers [15:14:36] Logged the message, Master [15:15:57] is it the genuine version committed by Tim or some special simplified sync script? [15:16:14] it's just scap as deployed on the cluster now... including rack awareness [15:16:23] sehr gut [15:16:35] olololo: Exception from line 579 of /usr/local/apache/common-local/php-1.21wmf8/extensions/Wikibase/repo/includes/EditEntity.php: Additions and updates to the database are currently blocked, probably to allow database maintenance, after which everything will return to normal. [15:17:18] and what if it breaks on zh:? [15:18:49] !log recentchanges.rc_params is now a blob on all wikis [15:18:59] Logged the message, Master [15:27:26] sync-common is slow [15:28:19] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:41:58] uh [15:42:07] how is the appserver apache config supposed to end up on app servers these days [15:42:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:14] New review: Faidon; "Looks good to me, although I don't feel qualified enough for the MW parts :)" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44562 [15:45:04] why doesn't the applicationserver puppet module do that [15:51:16] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [15:54:25] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:56:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request -
336 bytes in 0.025 seconds [15:59:04] New patchset: Mark Bergsma; "Pull in the WMF apache configuration if it doesn't exist yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44645 [15:59:49] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:59:58] !log gallium jenkins: updated the tools/fetch-mw-ext script which was not properly fetching the extension. {{gerrit|44644}} updated: cfa82ff..09bbb44 [16:00:11] Logged the message, Master [16:00:16] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [16:09:36] New review: Krinkle; "Prefer not to use port 80." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44164 [16:10:06] New patchset: Mark Bergsma; "Ensure Apache configuration is complete before attempting to start service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44649 [16:11:28] !log gallium jenkins: updated tools/fetch-mw-ext script to wipe the extension destination directory {{gerrit|44648}} updated: 09bbb44..e41843b [16:11:37] Logged the message, Master [16:12:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44649 [16:12:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44645 [16:19:02] New patchset: Mark Bergsma; "Reload apache on initial config pull" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44650 [16:19:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44650 [16:25:01] PROBLEM - Apache HTTP on mw1150 is CRITICAL: Connection refused [16:28:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.544 seconds [16:39:16] PROBLEM - Puppet
freshness on db10 is CRITICAL: Puppet has not run in the last 10 hours [16:40:37] PROBLEM - Apache HTTP on mw1023 is CRITICAL: Connection refused [16:40:46] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection refused [16:40:55] PROBLEM - Apache HTTP on mw1022 is CRITICAL: Connection refused [16:41:13] PROBLEM - Apache HTTP on mw1033 is CRITICAL: Connection refused [16:41:13] PROBLEM - Apache HTTP on mw1029 is CRITICAL: Connection refused [16:41:22] PROBLEM - Apache HTTP on mw1042 is CRITICAL: Connection refused [16:41:22] PROBLEM - Apache HTTP on mw1021 is CRITICAL: Connection refused [16:41:22] PROBLEM - Apache HTTP on mw1040 is CRITICAL: Connection refused [16:41:22] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [16:41:31] PROBLEM - Apache HTTP on mw1044 is CRITICAL: Connection refused [16:41:40] PROBLEM - Apache HTTP on mw1039 is CRITICAL: Connection refused [16:41:49] PROBLEM - Apache HTTP on mw1026 is CRITICAL: Connection refused [16:41:50] PROBLEM - Apache HTTP on mw1038 is CRITICAL: Connection refused [16:41:50] PROBLEM - Apache HTTP on mw1045 is CRITICAL: Connection refused [16:41:50] PROBLEM - Apache HTTP on mw1041 is CRITICAL: Connection refused [16:41:50] PROBLEM - Apache HTTP on mw1050 is CRITICAL: Connection refused [16:41:51] PROBLEM - Apache HTTP on mw1030 is CRITICAL: Connection refused [16:41:51] PROBLEM - Apache HTTP on mw1051 is CRITICAL: Connection refused [16:41:59] PROBLEM - Apache HTTP on mw1032 is CRITICAL: Connection refused [16:41:59] PROBLEM - Apache HTTP on mw1061 is CRITICAL: Connection refused [16:41:59] PROBLEM - Apache HTTP on mw1043 is CRITICAL: Connection refused [16:41:59] PROBLEM - Apache HTTP on mw1049 is CRITICAL: Connection refused [16:41:59] PROBLEM - Apache HTTP on mw1028 is CRITICAL: Connection refused [16:41:59] PROBLEM - Apache HTTP on mw1058 is CRITICAL: Connection refused [16:42:07] PROBLEM - Apache HTTP on mw1034 is CRITICAL: Connection refused [16:42:07] PROBLEM - Apache 
HTTP on mw1054 is CRITICAL: Connection refused [16:42:08] PROBLEM - Apache HTTP on mw1036 is CRITICAL: Connection refused [16:42:08] PROBLEM - Apache HTTP on mw1067 is CRITICAL: Connection refused [16:42:16] PROBLEM - Apache HTTP on mw1060 is CRITICAL: Connection refused [16:42:16] PROBLEM - Apache HTTP on mw1079 is CRITICAL: Connection refused [16:42:17] PROBLEM - Apache HTTP on mw1065 is CRITICAL: Connection refused [16:42:25] PROBLEM - Apache HTTP on mw1035 is CRITICAL: Connection refused [16:42:25] PROBLEM - Apache HTTP on mw1025 is CRITICAL: Connection refused [16:42:26] PROBLEM - Apache HTTP on mw1048 is CRITICAL: Connection refused [16:42:34] PROBLEM - Apache HTTP on mw1024 is CRITICAL: Connection refused [16:42:34] PROBLEM - Apache HTTP on mw1062 is CRITICAL: Connection refused [16:42:34] PROBLEM - Apache HTTP on mw1074 is CRITICAL: Connection refused [16:42:35] PROBLEM - Apache HTTP on mw1059 is CRITICAL: Connection refused [16:42:35] PROBLEM - Apache HTTP on mw1056 is CRITICAL: Connection refused [16:42:35] PROBLEM - Apache HTTP on mw1031 is CRITICAL: Connection refused [16:42:35] PROBLEM - Apache HTTP on mw1084 is CRITICAL: Connection refused [16:42:47] PROBLEM - Apache HTTP on mw1070 is CRITICAL: Connection refused [16:42:47] PROBLEM - Apache HTTP on mw1069 is CRITICAL: Connection refused [16:42:47] PROBLEM - Apache HTTP on mw1047 is CRITICAL: Connection refused [16:42:47] PROBLEM - Apache HTTP on mw1081 is CRITICAL: Connection refused [16:42:47] PROBLEM - Apache HTTP on mw1037 is CRITICAL: Connection refused [16:42:47] PROBLEM - Apache HTTP on mw1027 is CRITICAL: Connection refused [16:42:47] WHO TURNED THE INTERNET OFF?! 
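The flood of refusals above is the expected side effect of the puppet changes merged at 16:12: Apache on the eqiad application servers is now held down until the MediaWiki configuration has been pulled in. The intended ordering, sketched in Puppet (resource names and paths here are illustrative, not the real manifests):

```puppet
# Hedged sketch: pull the WMF apache configuration once on first
# install, and make the service depend on it, so Apache never starts
# half-configured. Names/paths below are assumptions for illustration.
exec { 'apache-config-pull':
    command => '/usr/bin/mw-sync',
    creates => '/usr/local/apache/conf/main.conf',  # only runs if absent
}

service { 'apache2':
    ensure  => running,
    require => Exec['apache-config-pull'],  # sync first, then start
}
```

With `require`, a failed or still-running sync keeps the service stopped, which is exactly why Nagios reports "Connection refused" until puppet completes each box.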
[16:42:52] PROBLEM - Apache HTTP on mw1053 is CRITICAL: Connection refused [16:42:52] PROBLEM - Apache HTTP on mw1052 is CRITICAL: Connection refused [16:42:53] PROBLEM - Apache HTTP on mw1083 is CRITICAL: Connection refused [16:42:53] PROBLEM - Apache HTTP on mw1077 is CRITICAL: Connection refused [16:43:01] PROBLEM - Apache HTTP on mw1055 is CRITICAL: Connection refused [16:43:01] PROBLEM - Apache HTTP on mw1057 is CRITICAL: Connection refused [16:43:01] PROBLEM - Apache HTTP on mw1046 is CRITICAL: Connection refused [16:43:10] PROBLEM - Apache HTTP on mw1082 is CRITICAL: Connection refused [16:43:10] PROBLEM - Apache HTTP on mw1064 is CRITICAL: Connection refused [16:43:19] PROBLEM - Apache HTTP on mw1063 is CRITICAL: Connection refused [16:43:19] PROBLEM - Apache HTTP on mw1073 is CRITICAL: Connection refused [16:43:20] PROBLEM - Apache HTTP on mw1078 is CRITICAL: Connection refused [16:43:28] PROBLEM - Apache HTTP on mw1076 is CRITICAL: Connection refused [16:43:28] PROBLEM - Apache HTTP on mw1068 is CRITICAL: Connection refused [16:43:28] PROBLEM - Apache HTTP on mw1071 is CRITICAL: Connection refused [16:43:37] PROBLEM - Apache HTTP on mw1066 is CRITICAL: Connection refused [16:43:38] PROBLEM - Apache HTTP on mw1075 is CRITICAL: Connection refused [16:43:38] PROBLEM - Apache HTTP on mw1091 is CRITICAL: Connection refused [16:43:46] PROBLEM - Apache HTTP on mw1080 is CRITICAL: Connection refused [16:43:48] * mark loves the smell of dying apaches in the morning [16:43:59] hahaha [16:44:04] PROBLEM - Apache HTTP on mw1088 is CRITICAL: Connection refused [16:44:13] PROBLEM - Apache HTTP on mw1105 is CRITICAL: Connection refused [16:44:13] PROBLEM - Apache HTTP on mw1111 is CRITICAL: Connection refused [16:44:14] PROBLEM - Apache HTTP on mw1086 is CRITICAL: Connection refused [16:44:19] now let's wait until puppet fixes them [16:44:23] PROBLEM - Apache HTTP on mw1101 is CRITICAL: Connection refused [16:44:23] PROBLEM - Apache HTTP on mw1100 is CRITICAL: 
Connection refused [16:44:23] PROBLEM - Apache HTTP on mw1114 is CRITICAL: Connection refused [16:44:23] PROBLEM - Apache HTTP on mw1089 is CRITICAL: Connection refused [16:44:23] PROBLEM - Apache HTTP on mw1108 is CRITICAL: Connection refused [16:44:23] PROBLEM - Apache HTTP on mw1102 is CRITICAL: Connection refused [16:44:24] PROBLEM - Apache HTTP on mw1097 is CRITICAL: Connection refused [16:44:31] PROBLEM - Apache HTTP on mw1094 is CRITICAL: Connection refused [16:44:31] PROBLEM - Apache HTTP on mw1099 is CRITICAL: Connection refused [16:44:31] PROBLEM - Apache HTTP on mw1096 is CRITICAL: Connection refused [16:44:31] PROBLEM - Apache HTTP on mw1121 is CRITICAL: Connection refused [16:44:31] PROBLEM - Apache HTTP on mw1093 is CRITICAL: Connection refused [16:44:40] PROBLEM - Apache HTTP on mw1124 is CRITICAL: Connection refused [16:44:41] PROBLEM - Apache HTTP on mw1092 is CRITICAL: Connection refused [16:44:41] PROBLEM - Apache HTTP on mw1109 is CRITICAL: Connection refused [16:44:41] PROBLEM - Apache HTTP on mw1106 is CRITICAL: Connection refused [16:44:41] PROBLEM - Apache HTTP on mw1107 is CRITICAL: Connection refused [16:44:41] PROBLEM - Apache HTTP on mw1087 is CRITICAL: Connection refused [16:44:49] PROBLEM - Apache HTTP on mw1104 is CRITICAL: Connection refused [16:44:49] PROBLEM - Apache HTTP on mw1142 is CRITICAL: Connection refused [16:44:49] PROBLEM - Apache HTTP on mw1123 is CRITICAL: Connection refused [16:44:56] too bad sync-common is not using Tim's fancy new rack awareness [16:44:58] PROBLEM - Apache HTTP on mw1120 is CRITICAL: Connection refused [16:45:00] but I can't be bothered to fix that right now [16:45:07] PROBLEM - Apache HTTP on mw1132 is CRITICAL: Connection refused [16:45:07] PROBLEM - Apache HTTP on mw1115 is CRITICAL: Connection refused [16:45:07] PROBLEM - Apache HTTP on mw1133 is CRITICAL: Connection refused [16:45:07] PROBLEM - Apache HTTP on mw1095 is CRITICAL: Connection refused [16:45:08] PROBLEM - Apache HTTP on mw1130 is 
CRITICAL: Connection refused [16:45:08] PROBLEM - Apache HTTP on mw1140 is CRITICAL: Connection refused [16:45:16] PROBLEM - Apache HTTP on mw1103 is CRITICAL: Connection refused [16:45:16] PROBLEM - Apache HTTP on mw1090 is CRITICAL: Connection refused [16:45:17] PROBLEM - Apache HTTP on mw1152 is CRITICAL: Connection refused [16:45:17] PROBLEM - Apache HTTP on mw1129 is CRITICAL: Connection refused [16:45:17] PROBLEM - Apache HTTP on mw1147 is CRITICAL: Connection refused [16:45:25] PROBLEM - Apache HTTP on mw1110 is CRITICAL: Connection refused [16:45:26] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [16:45:26] PROBLEM - Apache HTTP on mw1134 is CRITICAL: Connection refused [16:45:26] PROBLEM - Apache HTTP on mw1145 is CRITICAL: Connection refused [16:45:26] PROBLEM - Apache HTTP on mw1098 is CRITICAL: Connection refused [16:45:26] PROBLEM - Apache HTTP on mw1144 is CRITICAL: Connection refused [16:45:26] PROBLEM - Apache HTTP on mw1139 is CRITICAL: Connection refused [16:45:34] PROBLEM - Apache HTTP on mw1141 is CRITICAL: Connection refused [16:45:34] PROBLEM - Apache HTTP on mw1118 is CRITICAL: Connection refused [16:45:34] PROBLEM - Apache HTTP on mw1135 is CRITICAL: Connection refused [16:45:43] PROBLEM - Apache HTTP on mw1113 is CRITICAL: Connection refused [16:45:52] PROBLEM - Apache HTTP on mw1127 is CRITICAL: Connection refused [16:45:53] PROBLEM - Apache HTTP on mw1116 is CRITICAL: Connection refused [16:45:53] PROBLEM - Apache HTTP on mw1151 is CRITICAL: Connection refused [16:45:53] PROBLEM - Apache HTTP on mw1125 is CRITICAL: Connection refused [16:45:53] PROBLEM - Apache HTTP on mw1112 is CRITICAL: Connection refused [16:45:53] PROBLEM - Apache HTTP on mw1131 is CRITICAL: Connection refused [16:45:53] PROBLEM - Apache HTTP on mw1117 is CRITICAL: Connection refused [16:46:02] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection refused [16:46:02] PROBLEM - Apache HTTP on mw1137 is CRITICAL: Connection refused [16:46:02] PROBLEM - 
Apache HTTP on mw1128 is CRITICAL: Connection refused [16:46:02] PROBLEM - Apache HTTP on mw1126 is CRITICAL: Connection refused [16:46:02] PROBLEM - Apache HTTP on mw1119 is CRITICAL: Connection refused [16:46:10] PROBLEM - Apache HTTP on mw1148 is CRITICAL: Connection refused [16:46:10] PROBLEM - Apache HTTP on mw1138 is CRITICAL: Connection refused [16:46:10] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [16:46:10] PROBLEM - Apache HTTP on mw1122 is CRITICAL: Connection refused [16:46:10] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection refused [16:46:19] PROBLEM - Apache HTTP on mw1136 is CRITICAL: Connection refused [16:46:37] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection refused [16:46:46] PROBLEM - Apache HTTP on mw1149 is CRITICAL: Connection refused [16:46:46] PROBLEM - Apache HTTP on mw1146 is CRITICAL: Connection refused [16:46:46] PROBLEM - Apache HTTP on mw1143 is CRITICAL: Connection refused [16:46:46] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [16:46:55] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [16:47:58] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.313 second response time [16:47:59] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.291 second response time [16:57:44] New patchset: Mark Bergsma; "Require mw-sync to finish successfully, if it runs at all, before starting Apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44659 [16:58:19] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [16:58:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44659 [16:59:36] wait for the napalm charge of "Apache HTTP is OK … HTTP 403 / REFUSED" [16:59:40] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 202 seconds [17:03:44] RECOVERY - MySQL Replication Heartbeat on db1035 
is OK: OK replication delay 8 seconds [17:05:04] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 23 seconds [17:08:13] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.786 second response time [17:12:07] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.342 second response time [17:12:07] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.311 second response time [17:12:52] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.315 second response time [17:12:52] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.311 second response time [17:15:30] poor nfs1 is not having a good day [17:18:52] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.315 second response time [17:18:52] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.315 second response time [17:21:18] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:21:19] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:22:57] PROBLEM - Apache HTTP on mw1070 is CRITICAL: Connection refused [17:24:54] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection refused [17:25:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [17:25:39] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [17:25:40] PROBLEM - Apache HTTP on mw1150 is CRITICAL: Connection refused [17:25:48] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [17:26:15] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [17:26:19] running scap [17:26:42] PROBLEM - Apache HTTP on mw1149 is CRITICAL: Connection refused [17:26:42] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [17:26:57] !log root Started 
syncing Wikimedia installation... : [17:27:09] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection refused [17:27:09] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection refused [17:27:12] Logged the message, Master [17:30:27] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 212 seconds [17:30:54] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 219 seconds [17:35:51] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [17:36:18] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [17:43:39] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.310 second response time [17:44:24] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.512 second response time [17:47:19] sigh, my sarcastic email at hetzner online in regards to their "netscan" email didn't get any response [17:47:23] other than another netscan email [17:47:29] yup, tried that before [17:47:33] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.313 second response time [17:48:00] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.296 second response time [17:48:08] but i told them they discovered the insidious "HTTP" attack we were trying to pull!
[17:48:09] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [17:51:00] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.343 second response time [17:51:00] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.499 second response time [17:51:18] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.312 second response time [17:52:12] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.312 second response time [17:52:27] New review: Hashar; "Yes this can be merged. Please restart Apache on gallium when it is merged / applied to make sure th..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44157 [17:54:34] !log root Started syncing Wikimedia installation... : [17:54:44] Logged the message, Master [17:54:54] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.290 second response time [17:55:03] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [18:00:38] !log Removed mw1085 from mediawiki_installation (acting up) [18:00:51] Logged the message, Master [18:02:32] !log root Finished syncing Wikimedia installation... : [18:02:42] Logged the message, Master [18:04:11] !log root Started syncing Wikimedia installation... : [18:04:21] Logged the message, Master [18:09:27] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:04] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.667 second response time [18:17:17] !log root Finished syncing Wikimedia installation... : [18:17:27] Logged the message, Master [18:21:52] can someone please restart labsconsole's memcache? 
[18:23:51] PROBLEM - Apache HTTP on mw1070 is CRITICAL: Connection refused [18:24:02] andrewbogott: paravoid ^ [18:25:32] yep, just a second [18:25:57] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.016 second response time on port 11000 [18:27:00] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection refused [18:27:00] giftpflanze: better? It hadn't crashed but I restarted it anyway. [18:27:14] oh, but i had all symptoms of it [18:27:18] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [18:27:27] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [18:27:27] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection refused [18:27:27] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [18:27:28] PROBLEM - Apache HTTP on mw1150 is CRITICAL: Connection refused [18:27:54] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [18:28:03] PROBLEM - Apache HTTP on mw1149 is CRITICAL: Connection refused [18:28:12] who killed eqiad ? ;) [18:28:18] * Reedy blames mark [18:28:23] and nagios had its problems as it seems [18:28:25] Wait, no, scap finished [18:28:50] andrewbogott: thx, works smoothly now :) [18:29:06] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.311 second response time [18:29:07] hrm, a few of those seem happy to me at the moment [18:29:15] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [18:29:15] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.312 second response time [18:29:15] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.312 second response time [18:29:15] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.296 second response time [18:29:15] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.500 second response time [18:30:09] PROBLEM - Puppet freshness on knsq24 is
CRITICAL: Puppet has not run in the last 10 hours [18:30:26] New patchset: RobH; "adding colby to dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44678 [18:31:12] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [18:32:37] New review: RobH; "end of week, out of witty self review comments" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44678 [18:32:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44678 [18:33:09] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [18:38:31] i did [18:38:39] on purpose [18:40:21] RECOVERY - Puppet freshness on srv299 is OK: puppet ran at Fri Jan 18 18:40:14 UTC 2013 [18:43:12] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [18:43:12] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [18:43:12] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:43:12] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [18:44:24] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Fri Jan 18 18:44:05 UTC 2013 [18:44:24] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Fri Jan 18 18:44:14 UTC 2013 [18:45:00] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.316 second response time [18:47:15] RECOVERY - Puppet freshness on ms-fe1004 is OK: puppet ran at Fri Jan 18 18:46:52 UTC 2013 [18:47:51] RECOVERY - Puppet freshness on ms-fe1003 is OK: puppet ran at Fri Jan 18 18:47:48 UTC 2013 [18:48:01] !log authdns-update [18:48:12] Logged the message, RobH [18:49:39] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.498 second response time [18:51:18] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 
1.288 second response time [18:54:02] !log restarted pdns on linne, it borked on authdns-update [18:54:11] Logged the message, RobH [18:59:59] !log rm -rf /usr/local/apache/common-local on eqiad apaches [19:00:11] Logged the message, Master [19:01:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:35] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44562 [19:08:35] !log root Started syncing Wikimedia installation... : [19:08:45] Logged the message, Master [19:08:46] !log aaron synchronized wmf-config/filebackend.php 'Added ceph file backend configuration' [19:08:55] Logged the message, Master [19:11:00] mw1033: rsync error: errors selecting input/output files, dirs (code 3) at main.c(643) [Receiver=3.0.9] [19:11:06] PROBLEM - Apache HTTP on mw1070 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:11:06] awjr: are all those normal now? [19:11:20] mediawiki doesn't exist on eqiad apaches at the moment [19:11:24] i'm running a scap right now [19:12:01] AaronSchulz: eh? those rsync errors? no idea [19:13:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.549 seconds [19:14:17] mark: is scap including the eqiad apaches? [19:14:21] yes [19:14:24] yay! [19:14:30] not yay [19:14:32] scap sucks [19:14:40] i'm having a hard time getting stuff synced out reliably [19:14:45] you're running it and not me. yay! 
[19:15:04] well guess who's gonna be the sucker testing mediawiki in eqiad when i'm done ;-) [19:15:22] i'm perversely looking forward to it [19:16:00] alright then ;) [19:17:04] New patchset: Silke Meyer; "Variables for the client config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44690 [19:17:17] New patchset: RobH; "colby partman change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44691 [19:18:06] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44691 [19:21:45] New patchset: Silke Meyer; "Variables for the client config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44690 [19:22:15] New review: Silke Meyer; "Removed some comments." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44690 [19:29:52] RECOVERY - Puppet freshness on mw1001 is OK: puppet ran at Fri Jan 18 19:29:26 UTC 2013 [19:33:18] PROBLEM - Apache HTTP on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:33:27] PROBLEM - Apache HTTP on mw1105 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:33:27] PROBLEM - Apache HTTP on mw1088 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:33:28] PROBLEM - Apache HTTP on mw1111 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:33:28] PROBLEM - Apache HTTP on mw1120 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:33:54] PROBLEM - Apache HTTP on mw1094 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:34:03] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:34:03] PROBLEM - Apache HTTP on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:34:21] PROBLEM - Apache HTTP on mw1045 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:34:30] PROBLEM - Apache HTTP on mw1059 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:34:31] 
PROBLEM - Apache HTTP on mw1068 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:34:31] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.014 second response time [19:34:39] PROBLEM - Apache HTTP on mw1046 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:34:40] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.307 second response time [19:34:48] PROBLEM - Apache HTTP on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:34:48] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.299 second response time [19:34:58] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:34:58] PROBLEM - Apache HTTP on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:06] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.736 second response time [19:35:15] PROBLEM - Apache HTTP on mw1048 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:35:16] PROBLEM - Apache HTTP on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:16] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.312 second response time [19:35:16] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.944 second response time [19:35:24] PROBLEM - Apache HTTP on mw1051 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:24] PROBLEM - Apache HTTP on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:24] PROBLEM - Apache HTTP on mw1153 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:24] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:35:25] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second 
response time [19:35:33] PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:42] PROBLEM - Apache HTTP on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:42] the joy of half synced mediawiki installations [19:35:43] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:43] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:35:43] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.941 second response time [19:35:52] PROBLEM - Apache HTTP on mw1130 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:35:52] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [19:36:00] PROBLEM - Apache HTTP on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:00] PROBLEM - Apache HTTP on mw1154 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:09] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:36:18] PROBLEM - Apache HTTP on mw1054 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:18] PROBLEM - Apache HTTP on mw1073 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:19] PROBLEM - Apache HTTP on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:19] PROBLEM - Apache HTTP on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:36:27] PROBLEM - Apache HTTP on mw1065 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:36:28] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:36:28] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.512 second response time [19:36:36] RECOVERY - Apache HTTP on 
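[Editor's note: the flood of CRITICAL/RECOVERY lines above is "the joy of half synced mediawiki installations" — while scap is mid-sync, an apache may answer with a 500 instead of its usual 301 redirect, and the next check sees it healthy again. A minimal sketch of the decision the check_http lines encode, following standard Nagios plugin conventions (the thresholds here are an assumption, not the exact plugin config in use):]

```python
# Sketch: map an HTTP status code to a Nagios plugin state, as the
# "Apache HTTP on mwNNNN" checks above do. A healthy eqiad apache answers
# a bare GET with 301 Moved Permanently; a half-synced one answers 500.
def nagios_state(status_code: int) -> str:
    if 200 <= status_code < 400:   # 2xx/3xx, e.g. the expected 301
        return "OK"
    if status_code >= 500:         # 500 Internal Server Error / MediaWiki exception
        return "CRITICAL"
    return "WARNING"               # 4xx client errors warn by convention

print(nagios_state(301), nagios_state(500))  # OK CRITICAL
```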
mw1144 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.552 second response time [19:36:45] PROBLEM - Apache HTTP on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:45] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:45] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:46] PROBLEM - Apache HTTP on mw1086 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:46] PROBLEM - Apache HTTP on mw1057 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:36:46] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.291 second response time [19:37:04] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.303 second response time [19:37:04] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.296 second response time [19:37:12] PROBLEM - Apache HTTP on mw1077 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:37:12] PROBLEM - Apache HTTP on mw1058 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:37:21] PROBLEM - Apache HTTP on mw1040 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:37:30] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.305 second response time [19:37:30] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.036 second response time [19:37:39] PROBLEM - Apache HTTP on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:37:40] PROBLEM - Apache HTTP on mw1049 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:37:48] PROBLEM - Apache HTTP on mw1101 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:37:48] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.377 second response time [19:37:49] RECOVERY - 
Apache HTTP on mw1019 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.307 second response time [19:37:57] PROBLEM - Apache HTTP on mw1037 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:37:58] PROBLEM - Apache HTTP on mw1106 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:38:07] PROBLEM - Apache HTTP on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:38:07] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:38:15] PROBLEM - Apache HTTP on mw1056 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:38:15] PROBLEM - Apache HTTP on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:38:15] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:38:24] PROBLEM - Apache HTTP on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:38:24] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.181 second response time [19:38:33] PROBLEM - Apache HTTP on mw1028 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:38:34] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:38:52] PROBLEM - Apache HTTP on mw1074 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:38:52] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.516 second response time [19:39:09] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.298 second response time [19:39:09] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [19:39:27] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [19:39:28] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK - HTTP/1.1 301 Moved 
Permanently - 1.302 second response time [19:39:36] PROBLEM - Apache HTTP on mw1096 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:39:36] PROBLEM - Apache HTTP on mw1142 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:39:37] PROBLEM - Apache HTTP on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:39:54] PROBLEM - Apache HTTP on mw1121 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:39:54] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:40:03] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.299 second response time [19:40:03] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.993 second response time [19:40:21] PROBLEM - Apache HTTP on mw1129 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:40:21] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.297 second response time [19:40:25] mark: it's about the same as the joy of fully synced mediawiki installations [19:40:30] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.013 second response time [19:40:36] hehe [19:40:49] PROBLEM - Apache HTTP on mw1063 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:40:57] PROBLEM - Apache HTTP on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:40:57] PROBLEM - Apache HTTP on mw1055 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:41:06] PROBLEM - Apache HTTP on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:41:06] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [19:41:15] PROBLEM - Apache HTTP on mw1131 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:41:24] PROBLEM - Apache HTTP on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 
Internal Server Error [19:41:24] PROBLEM - Apache HTTP on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:41:24] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:41:24] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:41:25] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.799 second response time [19:41:34] hey guys, important question [19:41:42] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:42:00] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.304 second response time [19:42:09] PROBLEM - Apache HTTP on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:42:09] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.297 second response time [19:42:22] New patchset: Pyoungmeister; "temp measure to keep all jobrunner in eqiad stopped" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44695 [19:42:27] PROBLEM - Apache HTTP on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:42:27] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.302 second response time [19:42:35] there are a bunch of udp2log-related monitoring scripts in /usr/share/ganglia-logtailer -- where do they live in gerrit? 
[19:42:45] PROBLEM - Apache HTTP on mw1093 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:42:45] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.298 second response time [19:42:49] no idea [19:42:54] PROBLEM - Apache HTTP on mw1064 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:43:03] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [19:43:12] PROBLEM - Apache HTTP on mw1075 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:43:13] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.299 second response time [19:43:13] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.316 second response time [19:43:43] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error [19:43:43] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:43] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [19:43:48] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.302 second response time [19:44:06] PROBLEM - Apache HTTP on mw1112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:44:24] PROBLEM - Apache HTTP on mw1067 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:44:25] PROBLEM - Apache HTTP on mw1043 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:44:25] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.630 second response time [19:44:33] PROBLEM - Apache HTTP on mw1038 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:44:42] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.296 second response time [19:44:51] PROBLEM - Apache HTTP on mw1061 is CRITICAL: HTTP 
CRITICAL: HTTP/1.0 500 Internal Server Error [19:45:00] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.303 second response time [19:45:18] PROBLEM - Apache HTTP on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:45:18] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.297 second response time [19:45:37] PROBLEM - Apache HTTP on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:45:37] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.009 second response time [19:45:54] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.291 second response time [19:46:03] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:46:05] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.307 second response time [19:46:12] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:46:21] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:46:21] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:46:21] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:46:30] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [19:46:39] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.291 second response time [19:46:39] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.307 second response time [19:46:57] PROBLEM - Apache HTTP on mw1156 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:46:57] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK - HTTP/1.1 301 
Moved Permanently - 1.569 second response time [19:47:06] PROBLEM - Apache HTTP on mw1159 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:47:06] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [19:47:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44695 [19:47:15] PROBLEM - Apache HTTP on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error [19:47:24] PROBLEM - Apache HTTP on mw1082 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:47:24] PROBLEM - Apache HTTP on mw1157 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:47:33] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [19:47:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:09] PROBLEM - Apache HTTP on mw1160 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:48:18] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:48:19] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.139 second response time [19:48:27] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:48:36] PROBLEM - Apache HTTP on mw1155 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:48:37] PROBLEM - Apache HTTP on mw1158 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:48:37] PROBLEM - Apache HTTP on mw1084 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:48:45] PROBLEM - Apache HTTP on mw1079 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:48:54] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [19:48:54] RECOVERY - Backend Squid HTTP on amssq58 is OK: 
HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.189 seconds [19:49:11] !log authdns-update colby back into sandbox [19:49:12] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:49:18] Logged the message, RobH [19:49:22] PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:49:22] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:49:37] hey paravoid notpeter -- quick/important q: there are a bunch of udp2log-related monitoring scripts in /usr/share/ganglia-logtailer -- where do they live in gerrit? [19:49:48] PROBLEM - Apache HTTP on mw1126 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:49:57] PROBLEM - Apache HTTP on mw1108 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:49:57] PROBLEM - Apache HTTP on mw1115 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:49:57] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.298 second response time [19:49:57] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.306 second response time [19:50:24] PROBLEM - Apache HTTP on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:50:24] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.299 second response time [19:50:25] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:50:33] PROBLEM - Apache HTTP on mw1149 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:50:33] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [19:50:33] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [19:50:51] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK - HTTP/1.1 301 Moved 
Permanently - 1.301 second response time [19:51:00] PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:51:09] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:51:18] PROBLEM - Apache HTTP on mw1124 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:51:18] PROBLEM - Apache HTTP on mw1128 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:51:34] Ohai! [19:51:36] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [19:51:36] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.296 second response time [19:51:45] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:51:45] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:51:54] PROBLEM - Apache HTTP on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:52:01] dschoon: with the power of grep I have divined that it lives in ganglia.pp [19:52:02] class ganglia::logtailer { [19:52:03] # this class pulls in everything necessary to get a ganglia-logtailer instance on a machine [19:52:03] package { "ganglia-logtailer": [19:52:03] ensure => latest; [19:52:03] PROBLEM - Apache HTTP on mw1076 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:52:12] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.285 second response time [19:52:21] in the operations/puppet repo [19:52:22] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.573 second response time [19:52:24] ty, notpeter [19:52:38] mutante replied in mw_sec [19:52:47] ah, ok [19:52:48] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:52:48] RECOVERY - 
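[Editor's note: the `ganglia::logtailer` class notpeter pasted above got chopped into separate chat lines and cut off before its closing braces; reassembled, the fragment from `ganglia.pp` in the operations/puppet repo reads roughly as follows (the tail braces are reconstructed):]

```puppet
class ganglia::logtailer {
    # this class pulls in everything necessary to get a
    # ganglia-logtailer instance on a machine
    package { "ganglia-logtailer":
        ensure => latest;
    }
}
```

This explains why the scripts appear under /usr/share/ganglia-logtailer on the hosts: the package ships them, and puppet only ensures the package is installed and current.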
Apache HTTP on mw1100 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.710 second response time [19:52:57] PROBLEM - Apache HTTP on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:52:57] PROBLEM - Apache HTTP on mw1150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:53:06] PROBLEM - Apache HTTP on mw1039 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [19:53:06] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.296 second response time [19:53:42] PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [19:53:42] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.305 second response time [19:53:51] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [19:54:00] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK HTTP/1.1 200 OK - 292 bytes in 0.055 seconds [19:54:09] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.302 second response time [19:54:40] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [19:54:40] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.287 second response time [19:54:45] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.291 second response time [19:54:54] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.305 second response time [19:55:30] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [19:57:54] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [19:58:08] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/44564 [19:58:08] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/23099 [19:58:15] New review: Parent5446; "The bug blocking this has been made live." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/21322 [20:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.056 seconds [20:03:54] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.302 second response time [20:03:56] paravoid: using curl with ceph gives curl: (7) couldn't connect to host every other time [20:04:02] is there rrdns or something? [20:04:30] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.565 second response time [20:04:57] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.297 second response time [20:05:09] or mark [20:05:15] which hostname? 
[20:05:15] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [20:05:31] http://ms-fe.eqiad.wmnet [20:05:42] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [20:05:42] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.532 second response time [20:06:18] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [20:06:45] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [20:06:54] AaronSchulz: pybal has only 2 out of 4 hosts pooled [20:06:56] so clearly something is up [20:07:12] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [20:07:21] and that something is, only one out of 4 is working [20:07:39] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.291 second response time [20:08:08] preilly: Dude I read the new scap code and it's really nice, I totally get how it's fast now [20:08:22] * RoanKattouw thanks Tim-away [20:08:35] it's loads better than it was [20:08:38] i'd still not call it "nice" [20:08:54] haha [20:09:00] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [20:09:06] I agree rsyncing is nasty [20:09:10] But the new features are nice [20:09:26] How long does it take now? [20:09:36] over an hour now [20:09:40] to completely empty eqiad apaches [20:09:43] dunno when synced [20:09:45] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.299 second response time [20:09:54] Damn [20:09:59] Yeah completely empty will take a while [20:10:16] At least it's using -F30 now [20:10:21] before I had puppet run sync-common on all the hosts [20:10:26] all hitting nfs1 [20:10:28] poor nfs1 [20:10:30] RoanKattouw: yeah! 
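The pybal exchange above ("pybal has only 2 out of 4 hosts pooled ... only one out of 4 is working") comes down to an HTTP health check deciding which backends stay in the pool. A toy sketch of that idea, assuming a simple probe-and-pool model (hypothetical illustration, not pybal's actual code):

```python
import urllib.request


def http_ok(url, timeout=2):
    """Probe one backend; any 2xx/3xx response counts as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.getcode() < 400
    except OSError:  # connection refused, timeout, or HTTPError (e.g. 500)
        return False


def pooled(backends, probe=http_ok):
    """Return only the (host, url) backends that pass their health check."""
    return [host for host, url in backends if probe(url)]
```

With a check like this, a backend answering 500 (as the mw10xx apaches were) drops out of the pool until it recovers.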
[20:10:30] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [20:10:55] AaronSchulz: can you use "ms-fe1001.eqiad.wmnet" instead for now? [20:10:57] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.300 second response time [20:10:58] that LVS service is broken [20:11:07] But I suppose it's still nfs1, yeah [20:11:13] only one of the 4 hosts is actually up, and the health check is broken [20:11:25] The initial set of rsync proxies have to be fed from nfs1 [20:11:33] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.708 second response time [20:11:50] RoanKattouw: sync-common on the hosts is not network aware [20:11:59] at least not without options [20:12:18] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.298 second response time [20:12:34] That's right [20:12:39] Who/what is running sync-common? [20:12:47] I thought that was only used for individual hosts that had gotten out of sync [20:13:39] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [20:13:48] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.303 second response time [20:14:06] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [20:14:07] puppet does that [20:15:18] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.301 second response time [20:17:56] mark: maybe I'll just wait till it's fixed [20:18:54] RECOVERY - Puppet freshness on mw1016 is OK: puppet ran at Fri Jan 18 20:18:38 UTC 2013 [20:18:54] RECOVERY - Puppet freshness on mw1006 is OK: puppet ran at Fri Jan 18 20:18:41 UTC 2013 [20:19:13] !log root Finished syncing Wikimedia installation... : [20:19:18] :O [20:19:23] Logged the message, Master [20:19:24] good. 
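The sync-common discussion above is about fan-out: if every apache pulls from nfs1 directly, the master drowns ("poor nfs1"); a network-aware sync instead feeds a small set of rsync proxies first and points everyone else at a nearby proxy. A minimal sketch of that plan (hypothetical helper names, not the real scap scripts):

```python
def plan_sync(master, proxies, hosts, nearest):
    """Build a two-stage copy plan: master -> proxies, then proxy -> host.

    `nearest` picks which proxy a host should pull from (e.g. same
    row or rack). Returns (source, destination) pairs in execution order.
    """
    stage1 = [(master, proxy) for proxy in proxies]
    stage2 = [(nearest(host, proxies), host) for host in hosts]
    return stage1 + stage2
```

The master then serves only len(proxies) clients instead of every apache in the cluster.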
[20:19:53] * mark restarts all those apaches [20:20:07] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [20:20:07] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.764 second response time [20:21:09] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [20:21:18] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [20:21:36] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.311 second response time [20:21:36] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.554 second response time [20:21:45] mark: might those get fixed soon or will it be a while? [20:22:09] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [20:22:09] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.313 second response time [20:22:09] AaronSchulz: the ceph frontends? 
I don't know when paravoid has time to look at it [20:22:17] i don't want to mess with it now, as it's the same lvs service as swift in tampa [20:22:21] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.285 second response time [20:22:21] RECOVERY - Puppet freshness on mw1015 is OK: puppet ran at Fri Jan 18 20:22:16 UTC 2013 [20:22:40] if you need it now, just access ms-fe1001 directly, there's no other box yet anyway [20:22:48] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.290 second response time [20:23:06] RECOVERY - Puppet freshness on mw1007 is OK: puppet ran at Fri Jan 18 20:22:50 UTC 2013 [20:23:15] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [20:23:24] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [20:23:24] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.295 second response time [20:23:42] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.293 second response time [20:24:09] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.612 second response time [20:24:18] RECOVERY - Puppet freshness on mw1003 is OK: puppet ran at Fri Jan 18 20:24:05 UTC 2013 [20:24:27] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.294 second response time [20:24:36] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.310 second response time [20:25:03] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.721 second response time [20:25:12] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.297 second response time [20:25:39] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.304 second response time [20:26:15] 
RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.314 second response time [20:26:24] RECOVERY - Puppet freshness on mw1002 is OK: puppet ran at Fri Jan 18 20:25:59 UTC 2013 [20:26:24] RECOVERY - Puppet freshness on mw1009 is OK: puppet ran at Fri Jan 18 20:26:10 UTC 2013 [20:26:24] RECOVERY - Puppet freshness on mw1004 is OK: puppet ran at Fri Jan 18 20:26:16 UTC 2013 [20:26:24] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.315 second response time [20:31:04] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44157 [20:36:09] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [20:36:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [20:40:12] RECOVERY - Puppet freshness on mw1014 is OK: puppet ran at Fri Jan 18 20:39:57 UTC 2013 [20:40:21] RECOVERY - Puppet freshness on mw1011 is OK: puppet ran at Fri Jan 18 20:40:14 UTC 2013 [20:41:24] RECOVERY - Puppet freshness on mw1012 is OK: puppet ran at Fri Jan 18 20:40:49 UTC 2013 [20:43:48] RECOVERY - Puppet freshness on mw1008 is OK: puppet ran at Fri Jan 18 20:43:20 UTC 2013 [20:43:48] RECOVERY - Puppet freshness on mw1013 is OK: puppet ran at Fri Jan 18 20:43:41 UTC 2013 [20:44:24] RECOVERY - Puppet freshness on mw1010 is OK: puppet ran at Fri Jan 18 20:43:56 UTC 2013 [20:44:51] RECOVERY - Puppet freshness on mw1005 is OK: puppet ran at Fri Jan 18 20:44:26 UTC 2013 [20:58:35] paravoid: Can you check my nginx-proxy puppet code. The file name is /etc/puppet/manifest/nginx.pp. http://justpaste.it/1t5y [20:58:57] New patchset: Ottomata; "Giving Dan Andreescu access on stat1. RT 4312." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/44703 [20:59:58] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44703 [21:09:06] mike_wang: you know you can put it in as a changeset to gerrit, and then you get simple things like link checking? [21:09:09] And amend as necessary [21:09:18] s/link/lint/ [21:11:41] New patchset: Ottomata; "Dan already had an account on stat1. Adding him to locke, emery and oxygen. RT 4312." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44704 [21:12:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44704 [21:45:36] New patchset: Cmjohnson; "Addin osm-cp1001-4 /osm-db1001-2 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44707 [21:47:18] New patchset: Pyoungmeister; "creating precise mariadb apt repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44708 [21:47:40] New review: Cmjohnson; "looks good to me" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44707 [21:47:41] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44707 [21:49:25] binasher: paravoid is that all that's needed to create a new apt repo ^^ [21:49:30] New review: Andrew Bogott; "I apologize in advance if I'm missing the whole point of this. It looks to me like you want to be a..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44690 [21:50:56] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [21:52:09] wish me luck [21:53:26] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to parse template varnish/wikimedia.vcl.erb: undefined method `each' for :undef:Symbol at /etc/puppet/manifests/varnish.pp:107 on node deployment-varnish-t.pmtpa.wmflabs [21:53:27] I am doomed [22:01:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44708 [22:01:39] hashar, this is a vcl for uploads - how did it get loaded for the mobile cache? [22:02:00] wikimedia.vcl.erb might be global [22:11:47] New patchset: Mwang; "provide web access to instances that lack a public IP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44712 [22:11:48] New patchset: Mwang; "proxy to provide web access to instances that lack a public IP."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [22:12:36] New review: Hashar; "Some variable is not properly defined for templates/varnish/wikimedia.vcl.erb :" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [22:14:50] hashar, looks like it has something to do with varnish_backends, varnish_directors or director_options [22:18:05] yeah [22:18:17] I misconfigured $varnish_backends [22:18:31] I have put the conf in role::cache::configuration::beta [22:18:43] then thought that including that class would export all the variables in the local scope [22:18:44] but [22:18:47] I need to prefix them [22:19:25] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [22:19:57] i mean https://gerrit.wikimedia.org/r/#/c/44709/2/manifests/role/cache.pp,unified [22:20:00] New patchset: Asher; "Bringing eqiad online" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44714 [22:20:24] hashar: can you please take a look at ^^^ [22:21:07] from a 11pm20 30000 feet overview it looks like something that might do what is claimed in the topic :-] [22:21:17] haha [22:21:37] sounds like a +1 to me [22:21:41] honestly, too late to review such a thing :-D [22:21:52] New patchset: Mwang; "proxy to provide web access to instances that lack a public IP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44715 [22:21:57] rfaulkner: [22:21:58] from dulwich.config import StackedConfig [22:22:05] New patchset: Pyoungmeister; "fixing no eof complaint from reprepro" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44716 [22:22:06] sc = StackedConfig(StackedConfig.default_backends()) [22:22:10] sc.get('user','email') [22:22:13] for instance [22:22:49] binasher: if you need that today (I guess not since we are friday afternoon), you might attempt anomie / sam :-D [22:22:55] else I will be glad to review that on monday
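The dulwich lines pasted above pull user.email out of the stacked git config (system, global, repo). For illustration only, a stdlib approximation of the final lookup step, assuming a flat config with no includes (dulwich's StackedConfig does the real multi-file merging):

```python
import configparser

# A simplified gitconfig; real files are usually indented, which
# configparser does not accept, so keys are left flush here.
GITCONFIG = """\
[user]
name = Example Dev
email = dev@example.org
"""


def config_get(text, section, key):
    """Look up one key, roughly what sc.get('user', 'email') returns."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get(section, key)


print(config_get(GITCONFIG, "user", "email"))  # dev@example.org
```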
[22:23:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44716 [22:24:06] binasher: sorry :( [22:24:18] MaxSem: will poke the rest on monday I guess [22:24:31] awesome! [22:27:07] oh [22:27:21] the awesomeness will actually happen on monday [22:27:33] when mark will step in and say: here you are missing "that". [22:27:41] and he will spot it in like 2 seconds [22:27:47] and that will make everything work :-] [22:28:04] oh no, this is going out now [22:28:19] Ryan_Lane: gotcha thx [22:28:43] binasher: I am sorry I can't review/validate anything this late :/ [22:28:49] I am surely going to miss something and that will cause more havoc [22:29:08] hashar: no worries! [22:29:46] but yeah that overall looks fine :] [22:30:14] Reedy: are you still awake to look up a mediawiki-config change for asher ? [22:30:23] anomie: same there :-) ^^^ [22:30:25] change is https://gerrit.wikimedia.org/r/#/c/44714/ [22:30:55] hashar- Looking at it already. Unit test fails, but so far it's a problem with the unit test. [22:31:04] ah nice [22:31:15] yeah the unit tests are not really nice [22:31:22] test 2:31 PST [22:31:32] I hacked them up months ago and never looked back at them since [22:31:32] LeslieCarr: :-] [22:31:41] eh, worst case.. the site goes down for a few minutes while i revert :) [22:31:55] i'm trying it again hashar :) [22:31:56] Ryan_Lane: https://github.com/rfaulkner/Sartoris/commits/master [22:32:20] binasher: yeah that is what we used to do :-] Especially when the apaches were reading the settings files out of NFS. Deployment was all about :w in vim :-D [22:32:20] hashar: eh, btw, not today.. but we need to talk about wikibugs some time >p [22:32:20] binasher: be bold. [22:32:46] mutante: oh the poor wikibugs [22:32:56] hashar: there is stuff in git which does not work and there is stuff in svn which does not either [22:32:57] New review: Pyoungmeister; "be bold."
[operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44714 [22:33:09] and then there is an abandoned change that was supposed to let puppet clone it from git [22:33:14] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44714 [22:33:20] anomie: ^^^^ [22:33:21] :-) [22:33:40] * hashar takes an ice cream, sits in the sofa, and waits for the cluster to NOT fall down. [22:33:43] hashar: basically i had to revert to a version in SVN like 4 changes before the latest [22:34:12] !log asher synchronized wmf-config 'eqiad/pmtpa config variances' [22:34:12] mutante: one thing is sure, the SVN version should no longer be used. I have migrated that repo to git. [22:34:14] * anomie will still fix the unit test [22:34:18] Logged the message, Master [22:34:32] mutante: but then we / I added some more patches to wikibugs in git. [22:34:34] hashar: what is in git has never been deployed before [22:34:38] and it doesn't work [22:34:40] New patchset: Asher; "Revert "Bringing eqiad online"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44732 [22:34:47] mutante: so my patches are wrong :/ [22:34:49] i am seeing fatals on meta.wikimedia.org [22:34:54] PHP fatal error in /usr/local/apache/common-local/wmf-config/CommonSettings.php line 233: [22:34:54] require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf7/../wmf-config/mc.php' (include_path='/usr/local/apache/common-local/php-1.21wmf7:/usr/local/lib/php:/usr/share/php') [22:35:01] Yeah, just reported the same. [22:35:05] I thought it was limited to wmfwiki, heh. [22:35:06] hashar: something about them , yea.. plus another problem that we can't just clone on mchenry [22:35:10] awjr: Susan yeah that is being dealt with [22:35:14] awjr: Susan bug in PHP [22:35:21] thanks hashar [22:35:23] PHP has bugs?
:o [22:35:28] a few yeah [22:35:40] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44732 [22:35:42] the last one I got was segfaulting while trying to dereference an array :( [22:36:13] !log asher synchronized wmf-config 'revert' [22:36:23] Logged the message, Master [22:36:41] mutante: honestly, I think we should drop wikibugs entirely. It was more like a hack script and having it consuming mails on mchenry is probably not the best thing ever. [22:37:02] !log asher synchronized wmf-config 'revert' [22:37:07] mutante: I think Mozilla uses supybot (a python irc bot) and a custom plugin. Should probably have a look at that. (or whatever mozilla is using nowadays) [22:37:08] sorry logged in users, all better [22:37:12] Logged the message, Master [22:37:26] Susan: awjr binasher fixed the errors! [22:37:38] binasher: was it on test.wikipedia.org or just the whole cluster? [22:37:38] far out [22:37:47] thanks binasher [22:38:01] hashar: do you know why pmtpa hosts were looking for wmf-config/mc.php and not loading mc-pmtpa.php? [22:38:06] hashar: Whole cluster. [22:38:42] binasher: maybe there is an include somewhere ? [22:38:57] or getRealmSpecificFilename() failed to match [22:39:08] ah, yeah.. damn it [22:39:18] RECOVERY - Puppet freshness on lvs1001 is OK: puppet ran at Fri Jan 18 22:39:04 UTC 2013 [22:39:41] hashar: actually, the failed include looks like it was via require( getRealmSpecificFilename( "$wmfConfigDir/mc.php" ) ); [22:40:11] hashar: i agree, i will puppetize eggdrop some day [22:40:23] then we can try Tcl scripting, heh [22:40:36] mutante: there is most probably an egg drop module for bugzilla [22:40:48] yea:) and tons of other scripts [22:40:54] and we can link bots to each other [22:41:47] binasher: yup, and that should resolve fine if and only if the file exists [22:41:58] iff!
[22:44:04] blah [22:44:05] blah [22:44:16] fnord [22:44:17] fnord [22:44:24] wth, magic healing [22:44:31] obviously strace fixes it [22:44:32] ;) [22:44:45] heh, yea, you need to strace it and it starts working for no reason [22:45:20] binasher: assuming there is NO getRealmSpecificFilename( 'mc.php' ) on the cluster right now. [22:45:35] binasher: you can connect on a pmtpa apache, touch mc-pmtpa.php [22:45:44] then in eval find out what is happening [22:46:03] php /apache/common-local/multiversion/MWScript.php maintenance/eval.php --wiki=enwiki [22:46:11] print getRealmSpecificFilename('mc.php'); [22:46:17] should give out mc-pmtpa.php [22:46:25] at least there are [22:46:25] s/maintenance\/// [22:46:25] > print $wmfRealm; [22:46:26] hashar: i have mv'd mc.php to mc-pmtpa.php on a tampa apache and it's working as expected [22:46:27] production [22:46:28] > print $wmfDatacenter; [22:46:28] pmtpa [22:46:34] ah nice [22:47:03] AaronSchulz: http://www.amazon.com/Team-Geek-Software-Developers-Working/dp/1449302440 [22:47:05] nice but.. not sure what was going on there. going to try the full change set on test [22:47:11] New patchset: Lcarr; "fix typo in docs.pp manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44735 [22:47:47] blah [22:47:55] i'm still alive [22:48:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44735 [22:49:26] mutante: regarding wikibugs, I bet marktraceur will be happy to step in too [22:49:52] New patchset: RobH; "added git-core to marmontel/blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44737 [22:50:07] * marktraceur is intrigued at hashar's certainty, what have I been volunteered for [22:50:14] now I am sleeping, have a nice afternoon / night /morning [22:50:25] New review: RobH; "what harm could this do...."
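The eval session above shows getRealmSpecificFilename('mc.php') resolving to mc-pmtpa.php once that file exists for the current datacenter, and falling through to the plain name otherwise (which is exactly how the earlier require() fatal happened). A sketch of that resolution logic in Python, as a hypothetical reimplementation for illustration only, not MediaWiki's actual code:

```python
import os
import tempfile


def realm_specific_filename(path, realm="production", datacenter="pmtpa"):
    """Return the most specific variant of `path` that exists on disk.

    Try the -<datacenter> variant first, then the -<realm> variant,
    then fall back to the plain name; a variant is used if and only
    if that file actually exists.
    """
    base, ext = os.path.splitext(path)
    for suffix in (datacenter, realm):
        candidate = "%s-%s%s" % (base, suffix, ext)
        if os.path.exists(candidate):
            return candidate
    return path


# Recreate the situation from the log: only mc-pmtpa.php exists.
confdir = tempfile.mkdtemp()
open(os.path.join(confdir, "mc-pmtpa.php"), "w").close()
resolved = realm_specific_filename(os.path.join(confdir, "mc.php"))
print(os.path.basename(resolved))  # mc-pmtpa.php
```

When neither variant is present the plain path comes back unchanged, so a missing mc.php turns into a failed require(), matching the fatal seen on meta.wikimedia.org.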
[operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44737 [22:50:26] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44737 [22:50:26] marktraceur: mutante was talking about wikibugs [22:50:47] hashar: Meaning I might be able to actually make a change to it soon? :) [22:51:39] RoanKattouw: hey, there? [22:52:00] marktraceur: na mostly wondering what to do with it. [22:52:01] or Reedy or AaronSchulz [22:52:10] there's a stacktrace that I'd like some eyes on :) [22:52:15] ? [22:52:17] marktraceur: I think we should simply drop it and replace wikibugs with something new / better / written in python [22:52:21] * marktraceur would be happy to weigh in on that [22:52:34] notpeter: Yes? [22:52:35] hashar: Or don't write our own [22:52:36] Reedy: all the api boxes in eqiad are throwing mad 500s [22:52:47] just from the health checks [22:52:48] Reedy: wfHttpError() is sometimes called with an empty $wgOut, and there is a ton of fatals right now. Comes from the API [22:52:56] mutante: A new system that we don't need to write would certainly be nice, especially if it's not Perl [22:52:56] all over the exception.log [22:52:57] :O [22:52:57] marktraceur: hashar: https://wiki.mozilla.org/Bugzilla:Addons#IRC_Bots [22:53:05] notpeter: I can haz pastebin of backtrace?
[22:53:05] Reedy: yea that list of bots [22:53:11] yeah, what hashar said :) [22:53:12] sure [22:53:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:36] Reedy: Aw, let hashar sleep :) [22:53:46] oh man, i totally fucked up that wmf-config changeset [22:53:53] ok, that's good [22:53:58] we need to log those fatals in a database to easily query them out, find out the top offenders for the last hour and so on [22:54:08] marktraceur: http://en.wikipedia.org/wiki/Eggdrop [22:54:12] http://pastebin.com/Zv7Wje89 [22:54:15] there be a stacktrace [22:54:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:54:37] Reedy: RoanKattouw [22:54:40] ^^ [22:54:51] yay!!! irc spamming is successful! [22:54:52] win! [22:54:59] :) [22:55:07] wat [22:55:17] LeslieCarr: it's like spamsgiving! [22:55:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.350 second response time [22:55:30] That backtrace is weird [22:55:33] yes [22:55:40] is there anything else I can do to help? [22:55:46] An exception came from ... somewhere? [22:56:00] And it's failing to print it because $wgOut is not yet initialized? How the hell does that happen? [22:56:08] RECOVERY - IRC spamming on neon is OK: OK [22:56:47] notpeter: Could you try a live hack for me? [22:57:07] RoanKattouw: sure [22:57:09] what you need?
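The "log those fatals in a database" idea floated above is easy to prototype: one table plus one GROUP BY answers "top offenders for the last hour". A minimal sketch with sqlite and made-up sample rows (the messages and hostnames are illustrative, not real log data):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fatals (ts REAL, host TEXT, message TEXT)")

now = time.time()
conn.executemany(
    "INSERT INTO fatals VALUES (?, ?, ?)",
    [
        (now - 10, "mw1120", "wfHttpError with uninitialized $wgOut"),
        (now - 20, "mw1121", "wfHttpError with uninitialized $wgOut"),
        (now - 30, "mw1120", "require() failed opening mc.php"),
        (now - 7200, "mw1086", "stale fatal outside the window"),
    ],
)

# Top offenders in the last hour, most frequent first.
top_offenders = conn.execute(
    """SELECT message, COUNT(*) AS hits
       FROM fatals WHERE ts > ?
       GROUP BY message ORDER BY hits DESC""",
    (now - 3600,),
).fetchall()
print(top_offenders[0])  # ('wfHttpError with uninitialized $wgOut', 2)
```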
[22:57:11] WARN - live hacking detected [22:57:16] hahahaha [22:57:19] lol [22:57:25] I don't know which box you're trying this on, but wherever it is, please edit /usr/local/apache/common-local/php-1.21wmf7/includes/GlobalFunctions.php , find the wfHttpError() function, and wrap all invocations of $wgOut in if ( $wgOut ) { $wgOut->foo(); } [22:57:35] :) [22:57:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.427 seconds [22:58:02] RoanKattouw: you should use your root for this :) [22:58:07] use mw1120 [22:58:15] really any api box in eqiad [22:58:20] not in prod, so no risk [22:59:00] heading to bed, see you later all [22:59:27] RoanKattouw: [22:59:30] actually.... [22:59:30] notpeter: I'm really supposed to be working on a bottleneck piece of code in VE ... [22:59:37] it has no wmf-config... [22:59:39] sooo [22:59:48] this is probably going to wrap up quickly :) [22:59:50] binasher- Did you already figure out the problem you had with !g 44732? [23:00:00] anomie: yep [23:00:55] Ryan_Lane: http://blog.ludovf.net/python-str-unicode/ [23:00:56] notpeter: hahahaha [23:00:59] that's what I was thinking of [23:01:37] binasher- Nifty.
I have changes to the unit test [23:04:19] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused [23:06:18] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.129 second response time [23:07:08] New patchset: Asher; "pmtpa/eqiad config variances" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44738 [23:09:09] PROBLEM - Host mw1124 is DOWN: PING CRITICAL - Packet loss = 100% [23:10:12] RECOVERY - Host mw1124 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [23:10:14] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44738 [23:10:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:32] * Aaron|laptop wonders where paravoid is [23:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [23:11:24] !log asher synchronized wmf-config 'eqiad/pmtpa config variances' [23:11:33] Logged the message, Master [23:11:47] rfaulkner: https://github.com/wikimedia/Sartoris/pull/3 [23:12:04] and the website is still up [23:12:50] just need to restart all the jobrunners [23:13:41] all jobrunners restarted [23:14:00] oh great [23:14:15] New patchset: Anomie; "Fix dbconfigTest to test all db.php variants" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44739 [23:14:18] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.291 second response time [23:14:24] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.109 second response time [23:15:24] binasher- gerrit change 44739 fixes the unit test that your change made fail [23:16:41] !log DNS update to add wikimania.asia [23:16:51] Logged the message, Master [23:17:24] !log stopping all jobrunners including tmh boxes per asher's request while troubleshooting [23:17:35] Logged the message, 
notpeter [23:18:48] New patchset: Asher; "getRealmSpecificFilename poolcounter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44740 [23:19:02] the timemediahandler extension needs patching [23:19:15] anyone available for that? single line change [23:19:30] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44740 [23:20:31] AaronSchulz: hey [23:20:33] need halp [23:20:40] re: tmh [23:20:44] you know about tmh, right? [23:21:23] !log asher synchronized wmf-config/CommonSettings.php 'dc specific poolcounter conf' [23:21:33] Logged the message, Master [23:23:05] notpeter: some [23:23:17] do you have a checkout of it? [23:23:28] New patchset: Lcarr; "splitting snmpd and snmptrapd init files - pulling these in via puppet on icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44741 [23:23:35] yeah, on my big lappy [23:23:42] basically, binasher's change in https://gerrit.wikimedia.org/r/44740 needs to be crammed into tmh [23:23:45] one line [23:24:22] that link is to a config change that was merged [23:24:29] * Aaron|laptop is confused [23:24:34] actually maybe not.. notpeter can you try restarting the tmh jobbers [23:24:47] jobbers, heh [23:26:43] !log blog software updated a few hours ago, forgot to log. just updated theme to git update from gerrit repo. [23:26:53] Logged the message, RobH [23:28:22] RobH: Thanks! :-) [23:28:45] New patchset: Dzahn; "splitting snmpd and snmptrapd init files - pulling these in via puppet on icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44741 [23:29:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44741 [23:30:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:23] woot, icinga beats nagios ! 
[23:33:38] (also much better cpu allowing like 8x simultaneous checks helps) [23:34:46] !log blog is being bitchy about new theme versus old caching, reverting to stock and clearing local cache on marmontel [23:34:57] Logged the message, RobH [23:35:03] Ryan_Lane: https://github.com/wikimedia/Sartoris/pull/5 [23:36:39] .... [23:36:45] my caching issue is a firefox being a bitch [23:36:47] damn you firefox [23:37:37] New patchset: Lcarr; "upping concurrent checks to 3200" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44742 [23:38:17] !log stupid w3 caching on blog won't let me use new wp-victor directory, has to match old .git name or caching errors result [23:38:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44742 [23:38:28] Logged the message, RobH [23:42:17] !log blog is partly offline (unreliable loads) investigating [23:42:27] Logged the message, RobH [23:42:48] New patchset: Pyoungmeister; "adding mariadb apt repo to mariadb boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44743 [23:44:15] PROBLEM - Host marmontel is DOWN: PING CRITICAL - Packet loss = 100% [23:44:17] 2013-01-18 23:38:48 mw1086 metawiki: [ad0ba9f3] /wiki/Special:BannerRandom?userlang=ru&sitename=Wikimedia+Commons&project=commons&anonymous=true&bucket=0&country=BY&slot=9 Exception from line 352 of /us [23:44:17] r/local/apache/common-local/php-1.21wmf7/includes/cache/MessageCache.php: Could not acquire 'metawiki:messages:ru:status' lock. [23:44:42] PROBLEM - Host marmontel is DOWN: CRITICAL - Host Unreachable (208.80.152.150) [23:44:43] Aaron|laptop: RoanKattouw: either of you know what that lock is? 
[23:45:00] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44743 [23:45:16] binasher: I have no freaking clue [23:45:18] RECOVERY - Host marmontel is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [23:45:36] RECOVERY - Host marmontel is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [23:45:41] New patchset: Lcarr; "removing duplicate snmpd service entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44744 [23:46:11] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44744 [23:49:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.343 second response time [23:49:46] New patchset: Pyoungmeister; "cp/paste error for mariadb repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44745 [23:50:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.600 seconds [23:50:10] Is chad gone for the day? 
[23:50:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44745 [23:50:45] PROBLEM - Backend Squid HTTP on sq72 is CRITICAL: Connection refused [23:51:51] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/40784/1/modules/mediawiki_new/manifests/cgroup.pp,unified [23:52:25] PROBLEM - Frontend Squid HTTP on sq72 is CRITICAL: Connection refused [23:52:48] PROBLEM - Backend Squid HTTP on sq72 is CRITICAL: Connection refused [23:53:39] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [23:54:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:55:03] PROBLEM - Frontend Squid HTTP on sq72 is CRITICAL: Connection refused [23:56:14] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/40785/ [23:58:29] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Fri Jan 18 23:58:19 UTC 2013 [23:58:30] RECOVERY - Puppet freshness on virt3 is OK: puppet ran at Fri Jan 18 23:58:19 UTC 2013