[00:16:50] !log flushed the varnish cache for mobile again [00:16:53] Logged the message, Master [00:33:34] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [00:59:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:41:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 247 seconds [01:43:54] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [01:45:33] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:11:55] does anyone know about the gmane setup ? ;-D [02:12:10] the wikivideo-l has been setup unidirectional so we can't post from gmane nntp gateway [02:13:17] will ask on eng ;-D [03:44:31] PROBLEM - NTP on virt1001 is CRITICAL: NTP CRITICAL: No response from NTP server [05:18:27] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:19:48] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.034 second response time on port 8123 [08:08:49] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:44:16] New patchset: Hashar; "database & memcached configuration for 'beta' project" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8448 [10:34:27] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [10:34:52] mark: I don't have the NOW password; what does this link say, is it not a battery issue after all? [10:40:24] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-5/2/1 (FPL/Level3, CV71028) [10Gbps wave]BR [10:41:52] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [10:48:16] paravoid: indeed [10:48:24] it gets in a state where the battery is no longer charging [10:48:32] and you have to do some reboot/maint mode cycle to get it charging again [10:48:45] hah [10:49:08] is that why all 4 of them got discharged? [10:49:18] I would think so [10:49:22] nice is also [10:49:24] bug status: Fixed [10:49:29] 'Fixed-In Version' [10:49:34] "This is not fixed in any ONTAP releases" [10:49:38] :-/ [10:51:24] hahahaha [10:51:48] I think I got nas1001 charging again on both controllers [11:01:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:08:21] anyone with +2 access for CommonSettings.php ? 
[11:08:33] oh mark you were around a few minutes ago [11:09:12] there are some complaints about http://www.mediawiki.org/wiki/MobileFrontendFeedback being in the main space, people would like it to be http://www.mediawiki.org/wiki/Project:MobileFrontendFeedback -- it requires a change on https://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php $wgMFRemotePostFeedbackArticle = "MobileFrontendFeedback"; to become $wgMFRemotePostFeedbackArticle = "P [11:09:12] roject:MobileFrontendFeedback"; [11:09:45] I don't do mediawiki config changes unless critical because of downtime or something [11:12:50] mark: btw, "priv set advanced" [11:13:05] hidden command that unlocks tools for "advanced" users [11:13:06] yeah [11:13:09] (oh yes) [11:13:33] and there's a "priv set admin" too iirc [11:15:57] New patchset: Thehelpfulone; "per complaints about feedback being in main space, moving to project namespace enter the commit message for your changes. Lines starting" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8586 [11:45:20] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [11:46:19] my my [11:46:29] refusing to do a controller giveback for one failed disk [11:46:46] everything to get some attention ;) [11:49:25] so will we now automatically get replacement disks sent to the dc? ;) [12:06:34] paravoid: fyi. raised that project IP quota. that stuff is on virt1 , using nova-manage https://labsconsole.wikimedia.org/wiki/Nova-manage [12:09:10] mutante: thanks... [12:21:26] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:27:53] RECOVERY - mysqld processes on db1020 is OK: PROCS OK: 1 process with command name mysqld [12:29:59] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 54921 seconds [12:34:20] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:44:32] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 53131 seconds [12:59:32] RECOVERY - mysqld processes on db1001 is OK: PROCS OK: 1 process with command name mysqld [13:02:05] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 40612 seconds [13:06:26] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: CRIT replication delay 40698 seconds [13:49:01] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [13:50:04] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [13:56:30] chris? [13:56:51] while you're in eqiad this week, could you make a precise diagram of how the netapp is connected there, with every cabling going from port to port [13:56:57] and then make sure it's exactly the same in tampa? [13:57:13] there seems to be something wrong with the one in tampa, it's complaining where the eqiad one isn't [14:01:29] hmm, puppet hiera looks interesting [14:01:33] and apparently was merged into 3.0 [14:01:48] mark: sure thing [14:02:07] thanks! [14:04:42] any idea on how to resurrect gerrit-wm? [14:06:40] restart /etc/init.d/ircecho on the box [14:06:47] manganese iirc [14:08:27] mark: thanks :-) [14:35:14] mutante, when you get a sec, try this to fix lighttpd [14:35:14] https://gerrit.wikimedia.org/r/#/c/8454/ [14:35:54] yuck [14:36:00] why not simply notify => Service[lighttpd]? [14:36:11] didn't that work? 
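A minimal sketch of the notify => Service pattern mark is asking about above, assuming a standalone lighttpd class exists; the class layout, file names, and source paths here are illustrative, not the actual operations/puppet manifests:

    class lighttpd {
        package { 'lighttpd':
            ensure => present,
        }
        service { 'lighttpd':
            ensure     => running,
            hasrestart => true,
            # Puppet has no first-class "reload" action; overriding the restart
            # command is one way to make a notify do a graceful reload instead
            # of a full restart.
            restart    => '/etc/init.d/lighttpd reload',
            require    => Package['lighttpd'],
        }
    }

    # A config snippet then notifies the service rather than using an exec,
    # and the require guarantees Service['lighttpd'] is actually defined.
    file { '/etc/lighttpd/conf-enabled/50-example.conf':
        ensure  => present,
        source  => 'puppet:///files/lighttpd/50-example.conf',
        require => Class['lighttpd'],
        notify  => Service['lighttpd'],
    }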
[14:41:23] mark: do you have any thoughts on where community analytics stuff (i.e. Faulkner's db and web stuff) should migrate? It's temporarily on aluminium and the fundraising db cluster but needs to migrate off. [14:41:44] to the analytics boxes? [14:42:02] i have no idea what it's all for [14:43:23] me either, really [14:43:58] is that an option? I don't really understand how his group and the analytics group relate [14:44:13] me neither [14:44:24] mark, that's what I did, no? [14:44:28] ha--who should I ask? ct? [14:44:49] ummm, as of right now it should not migrate to the analytics cluster [14:44:58] not sure where it should go though [14:45:05] ottomata: why is that exec there then? [14:45:19] the analytics cluster is going to be for batch/stream processing (via hadoop, datastax, storm, cassandra) [14:45:37] awesome [14:45:45] mysql replication nicely falls in that definition [14:45:47] ;-) [14:45:57] where is the service defined? [14:45:58] the db could move to the misc cluster I suppose [14:46:09] sorry, you are right mark, i am not doing that [14:46:11] ummm [14:46:58] since the lighttpd config is not necessariliy associated with a service { "lighttpd" [14:47:15] how do we know that it has been defined anywhere? [14:47:39] why else would you use that definition? [14:47:55] i dunno, man haha, i kinda wished I hadn't touched this [14:48:01] notpeter was having trouble with the notify stuff [14:48:14] notpeter: ping [14:48:25] and I fixed it in a way that I had gotten apache to reload with vhosts before [14:49:09] i mean, it does kinda suck, having no central place that service { lighttpd is defined [14:49:16] you can't be sure someone has done something like [14:49:20] include lightthpd [14:49:43] hmm, there is webserver::static [14:49:44] that's good [14:49:47] yes [14:49:53] but there is also misc::download-wikimedia [14:49:53] and then puppet breaks in that instance [14:49:57] which is fine with me for now [14:50:07] and misc::install-server::web-server [14:50:31] all of those should one day migrate to a sensible class to manage a generic lighttpd instance [14:50:39] aye [14:51:08] then the lighttpd_config could require that class [14:51:17] and we'd know for sure that the service is defined [14:53:09] mark: ottomata hey, frantically running around prepping to walk out the door to the airport. what's up? [14:53:19] ah, no rush [14:53:26] was wondering what was wrong with notify => Service[lighttpd] [14:56:06] mark, in other things, since you are here [14:56:09] I'm working on that GeoIP update thing [14:56:11] going to to #1 [14:56:23] good [14:56:26] check out what's been done already [14:56:29] so that means that puppetmaster will run geoipupdate, and download files to /usr/share/GeoIP [14:56:30] yeah I have [14:56:37] yeah [14:56:43] you can test it in labs [14:56:49] you can build a puppetmaster in labs easily [14:56:57] (if you want) [14:56:59] then I need puppet to be able to distribute those on other hosts [14:57:13] so I need the GeoIP/ dir to be available in the puppet fileserver [14:57:28] (yeah, I test on my local VM, not in labs, it is too slow and too much pain in labs) [14:57:53] but [14:58:03] in order to get /usr/share/GeoIP/ available to puppet fileserver [14:58:13] should I symlink it on puppetmaster into /etc/puppet/files/ [14:58:14] ? [14:58:27] you mean /var/lib/puppet/volatile/ [14:58:32] sure [14:58:37] yeah you can try that [14:58:37] mark: you mean as opposed to running that exec? 
[14:58:42] notpeter: indeed [14:58:45] ok [14:58:55] will that trigger a restart or a reload? [14:59:01] the exec was my suggestion because something wasn't work with notify [14:59:05] but i don't remember what wasn't working [14:59:10] notpeter: a restart [14:59:42] if restart is ok, seems much cleaner [14:59:51] yeah [14:59:54] I wish puppet knew about reload :/ [14:59:59] that'd be nice [14:59:59] there's a ticket for it at least! [15:00:10] * mark is melting [15:00:15] it suddenly got hot here [15:00:39] brb [15:01:11] so, what should I do about that then boys? [15:01:16] change back to notify => Service? [15:01:19] yes [15:01:24] isn't the reload better though? [15:01:27] why do a full restart? [15:01:28] fuck that [15:01:37] lighttpd restarts anyway [15:01:40] ok [15:13:35] New patchset: Andrew Bogott; "Add subnet for virt1000-1008." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8608 [15:24:30] mark, so I'm making a class called geoip::data::source [15:24:42] oh mark has a red dot next to his name, hmm, he might be away [15:25:09] anyway, was wondering if that class would be better in my geoip.pp file [15:25:13] or in puppetmaster.pp [15:25:16] puppetmaster is going to include it [15:25:27] but it is relevant to geoip stuff [15:25:37] i'm leaning towards putting it in geoip file [15:25:41] geoip.pp [15:34:03] !log updated dns for analytics mgmt [15:34:09] Logged the message, RobH [15:35:30] !log ns1 died on update, restarting pdns [15:35:35] Logged the message, RobH [16:06:33] Change abandoned: Andrew Bogott; "nope." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8608 [16:08:26] hey guys, puppet Q for the room [16:08:34] maybe Ryan_Lane will have an idea about this [16:08:42] I need to access this variable in a class [16:08:43] $puppetmaster::config::volatiledir [16:08:56] but I can't get at it, unless I include the puppetmaster::config class [16:09:26] and I don't want to do that, since I don't want to set up a puppetmaster [16:09:37] can/should I move those variable definitions into a separate [16:09:41] puppetmaster::variables class [16:09:45] ? [16:09:56] should I make them global varialbes? [16:10:09] $puppetmaster_config_volatiledir [16:10:14] ^ e.g.? [16:10:34] er, what are you trying to do? [16:10:42] New patchset: Thehelpfulone; "per complaints about feedback being in main space, moving to project namespace enter the commit message for your changes. Lines starting" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8612 [16:11:06] you're trying to work with something that's something internal to the puppetmaster [16:11:19] i'm trying to ensure that some files are downloaded into the $volatiledir on puppetmaster [16:11:31] so that puppet can then distribute those files in other class elsewhere [16:11:31] that probably belongs to the puppetmaster class then? [16:11:38] set of classes even [16:11:48] hm [16:12:00] Change abandoned: Thehelpfulone; "Moved to https://gerrit.wikimedia.org/r/#/c/8612/ instead." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8586 [16:12:02] i guess so, but then I have the converse question [16:12:09] hmm, hang on [16:12:15] sure :) [16:12:43] ok, yes, this class is puppetmaster specific, I asked this Q earlier in the room, and maybe it is relevant again now [16:12:52] this class is responsible for downloading the maxmind geoip files [16:13:02] but that is only done on the puppetmaster [16:13:06] I have a geoip.pp file [16:13:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:18] but I guess this is more relevant to puppetmaster.pp [16:13:25] since it only happens on puppetmasters [16:13:26] hmmmmmmmmmm [16:13:32] actually, I think I know what would be best to do [16:13:40] I can make the download generic, and the volatile stuff puppetmaster only [16:13:42] ok ok [16:13:42] yes [16:13:42] ok [16:13:44] maybe untangle the geoip stuff from puppetmaster paths? [16:13:57] i can parameterize the class, yeah [16:14:05] and pass the dl location in from puppetmaster when it is included [16:14:06] I think we said something similar [16:14:14] jajaja, much better [16:14:15] thank you! [16:14:28] yvw [16:15:35] New patchset: Thehelpfulone; "moving feedback to Project: namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8612 [16:17:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.875 seconds [16:31:41] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay 0 seconds [16:32:08] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [16:51:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [17:07:34] maplebed: can you do this one https://gerrit.wikimedia.org/r/#/c/8619/ [17:08:48] preilly: sure; soon as I finish 8612 [17:13:49] maplebed: can you also flush the varnish cache [17:14:04] preilly: after pushing tha change or right now? [17:14:07] maplebed: this is actually very high priority (the varnish cache flush) [17:14:15] maplebed: right now [17:14:38] ok, doing. [17:15:32] preilly: cache is flushed. [17:15:54] maplebed: thanks! [17:16:03] and now on to th egerrit change. [17:16:20] maplebed: http://es.m.wikipedia.org/wiki/Italia gives me a 503 [17:16:32] maplebed: okay it worked now [17:18:08] for me too. [17:18:44] Anybody about who has the Real Ultimate Power to create Gerrit repos? [17:19:31] Chad [17:19:56] or Ryan I guess [17:20:07] I believe Chad is currently enjoying the pleasures of automobile ownership. [17:20:13] And thus trying to get his car untowed. [17:20:38] And I don't see Ryan. [17:20:54] preilly: you've got a mix of spaces and tabs for your indentation in https://gerrit.wikimedia.org/r/#/c/8619/1/templates/varnish/mobile-frontend.inc.vcl.erb [17:21:07] line 102 [17:21:21] oops [17:21:32] dschoon: is it documented? ;) [17:22:01] Reedy: ...buh? [17:22:36] Is there a page on wikitech saying how to create repos? If so you might be able to get any of ops to do it.. If not, it might not be the best idea [17:22:55] the last time I tried to create a repo I fucked it up in some super-obscure way by just doing what seemed sensible. [17:23:26] isn't this a bit ridiculous that we have to move heaven and earth to get a repo? [17:23:37] yes. 
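Going back to the GeoIP thread above: a rough sketch of the parameterized arrangement ottomata and mark settle on, with the download class taking its target directory as a parameter, the puppetmaster pointing it at the volatile mount, and other hosts pulling the files back out through the fileserver. The class names, the geoip-bin package, and the geoipupdate invocation are assumptions, not taken from the real geoip.pp:

    # Generic downloader; knows nothing about the puppetmaster.
    class geoip::data::download($data_directory = '/usr/share/GeoIP') {
        # assumes geoipupdate is available and the MaxMind license is already
        # configured (e.g. in /etc/GeoIP.conf)
        package { 'geoip-bin': ensure => present }

        exec { 'geoipupdate':
            command => "/usr/bin/geoipupdate -d ${data_directory}",
            require => Package['geoip-bin'],
            # in practice this would be guarded or put on a schedule rather
            # than run on every agent run
        }
    }

    # On the puppetmaster, download straight into the volatile fileserver mount...
    class puppetmaster::geoip {
        class { 'geoip::data::download':
            data_directory => '/var/lib/puppet/volatile/GeoIP',
        }
    }

    # ...and on every other host, recursively copy the files out of that mount.
    class geoip::data {
        file { '/usr/share/GeoIP':
            ensure  => directory,
            recurse => true,
            source  => 'puppet:///volatile/GeoIP',
        }
    }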
[17:23:50] Change abandoned: preilly; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8623 [17:24:12] Yeah, I mean, my understanding is that Gerrit privs are also a shitshow [17:24:18] and gerrit does not have a 'create-repo-only' right for admins? [17:24:23] so only 3ish people have the ability to do such things. [17:24:24] no. [17:24:26] it does not. [17:24:33] it has one bit, "admin_p" [17:24:44] arghghhgh :D [17:25:24] which controls absolutely all administration powers, including purging all history, deleting users, etc [17:25:30] things i do not want the power to do when drunk. [17:26:31] New patchset: Asher; "restrict ishmael to specific ldap groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8624 [17:26:32] Bribe people to document it [17:26:50] Yeah, sending bribery email now. [17:28:35] maplebed: try this https://gerrit.wikimedia.org/r/#/c/8619 now [17:29:14] looking. [17:30:25] preilly: looks good. merging. [17:31:27] binasher: is the change in puppet to templates/apache/sites/graphite.wikimedia.org yours? [17:31:52] preilly: I'll be a minute; there's a pending unmerged change I have to resolve. [17:31:53] it is, and it's merged along with the varnish config [17:32:20] ok. it's good to push live then? [17:33:11] preilly: are you going to switch the x-carrier designations to 4 character codes? [17:33:20] maplebed: sure [17:33:36] binasher: before we go live yes [17:33:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:39] binasher: for testing no [17:33:48] that works [17:33:55] oh. I understand now; I didn't get that you merged the varnish stuff. [17:33:58] thanks! [17:34:58] !log deployed change to varnish configs for preilly; adding more carriers [17:35:02] Logged the message, Master [17:35:45] maplebed: can you also purge varnish [17:35:55] maplebed: sorry binasher buts it's needed [17:35:55] sure. [17:36:37] purged. [17:36:53] puppet's still running so if it isn't live, ask me to purge again in 5m. [17:37:36] maybe we should give preilly and awj sudo privs to a script that does the purge, but also prints out the bugzilla ticket to fix mobile static asset caching [17:37:41] and some ascii art [17:37:55] binasher: ha ha [17:42:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.482 seconds [17:44:33] PROBLEM - MySQL disk space on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:44:42] PROBLEM - MySQL Slave Running on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:44:42] PROBLEM - mysqld processes on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:00] PROBLEM - MySQL Recent Restart on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:09] PROBLEM - MySQL Slave Delay on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:51] PROBLEM - SSH on gurvin is CRITICAL: Server answer: [18:07:15] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:07:54] maplebed, you around in a bit to help me install the new udp-filter package? [18:08:24] yeah. [18:08:35] gimme a time? [18:08:36] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:09:48] PROBLEM - SSH on es1003 is CRITICAL: Server answer: [18:10:15] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [18:11:35] maplebed: is the migration for the ring change still going? 
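Back at 17:37, binasher half-jokingly suggested giving the mobile folks sudo on a canned purge script; that could be done with a wrapper script plus a sudoers fragment. Everything below (script path, group name, class name) is made up for illustration and is not an actual proposal in operations/puppet:

    # Hypothetical: let mobile deployers run one pre-approved purge script as root.
    class mobile::varnish_purge_access {
        file { '/usr/local/bin/purge-mobile-varnish':
            ensure => present,
            owner  => 'root',
            group  => 'root',
            mode   => '0555',
            source => 'puppet:///files/misc/purge-mobile-varnish',
        }
        file { '/etc/sudoers.d/mobile-varnish-purge':
            ensure  => present,
            owner   => 'root',
            group   => 'root',
            mode    => '0440',
            # hypothetical group; members get exactly this one command
            content => "%mobile-deploy ALL = (root) NOPASSWD: /usr/local/bin/purge-mobile-varnish\n",
        }
    }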
[18:11:45] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:11:53] Aaron|home: are you seeing strange things happening? [18:11:58] I haven't checked yet today but probably. [18:12:35] I'm just curious is the profiling should look any better yet [18:13:23] the profiling likely won't change. I only shifted ms-be1; I'll run sql queries directly as the test. [18:13:35] maplebed, um, 1 hour? [18:14:03] ottomata: if it'll take long, that means I should get food first... [18:14:40] Aaron|home: also I realized I made a mistake and I put them on sda and sdb, which are shared with the OS. [18:14:47] I shoulda taken sdc and sdd. [18:15:00] I might do another today with that config adn see what happens. [18:15:10] 'doh [18:15:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:55] i'm not sure what's involved, it should be just uploading the deb, apt-get update, and um, apt-get upgrade? [18:15:55] something? [18:16:01] and babysitting to make sure everything is ok [18:16:15] yeah, sure. 12:15. I sent you a calendar entry. [18:16:19] but go ahead and get food first? [18:16:20] ok [18:16:29] thanks [18:21:06] Aaron|home: to answer your question - no, it's a long long long ways away from finishing the rebalance. [18:21:43] * Aaron|home wishes we had more boxen [18:24:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.446 seconds [18:27:39] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [18:35:44] New patchset: Bhartshorne; "updating ring files to move container storage to two dedicated drives on ms-be2 as well to test container read speed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8630 [18:57:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.187 seconds [19:19:35] New patchset: Ottomata; "Rewrote geoip.pp to be more modular and to use the licensed Maxmind GeoIP data files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8677 [19:32:58] RECOVERY - NTP on virt1001 is OK: NTP OK: Offset -0.02825951576 secs [19:40:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.596 seconds [19:56:49] RECOVERY - mysqld processes on db1003 is OK: PROCS OK: 1 process with command name mysqld [19:59:23] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 2688384 seconds [20:00:07] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 2688374 seconds [20:01:47] can someone run in eval.php [20:01:54] count( $wgMemc->get( 'mw-tor-exit-nodes' ) ); [20:01:58] echo count( $wgMemc->get( 'mw-tor-exit-nodes' ) ); [20:01:59] ? [20:12:28] Platonides: on a wmf wiki? [20:13:55] New patchset: Ottomata; "Rewrote geoip.pp to be more modular and to use the licensed Maxmind GeoIP data files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8677 [20:14:16] Platonides: 642 [20:15:52] New patchset: Ottomata; "Rewrote geoip.pp to be more modular and to use the licensed Maxmind GeoIP data files." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8677 [20:17:28] thanks Aaron [20:20:30] New patchset: Asher; "re-opening ishmael to anyone with an ldap account" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8712 [20:22:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:44] has the lag on db12 already been reported? [20:30:00] this is the first I've seen of it [20:31:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.844 seconds [20:31:53] there's a long query [20:32:04] not sure whether it's safe to kill [20:34:55] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [20:50:02] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 188 seconds [20:50:20] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 185 seconds [20:51:21] New patchset: Ottomata; "role/analytics.pp - allowing analytics sudoers to sudo -u to any user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8713 [21:02:38] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:05:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.055 seconds [21:15:01] New patchset: Asher; "show dbs with broken replication as black instead of red" [operations/software] (master) - https://gerrit.wikimedia.org/r/8716 [21:28:19] New patchset: Jgreen; "setting up db1025 to copy fundraising db dumps to tridge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8717 [21:34:37] !log upgraded udp-filter to 0.2.4 on oxygen, emery, and locke (with maplebed's help) [21:34:40] Logged the message, Master [21:34:43] yeahhhhh! 
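Jgreen's db1025-to-tridge backup setup at 21:28 appears here only as a commit title; purely as an illustration of the shape of such a change (the script, paths, destination user, and schedule below are guesses, not the contents of the actual gerrit changes), an off-host backup cron in Puppet might look like:

    # Illustrative only; the real change is in gerrit 8717 and its follow-ups.
    class misc::fundraising::backup::offhost {
        cron { 'fundraising_db_offhost_backup':
            ensure  => present,
            user    => 'root',
            hour    => 2,
            minute  => 0,
            command => '/usr/local/bin/dump_fundraising_dbs && rsync -a /a/dumps/ backup@tridge:/data/db1025/',
        }
    }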
[21:34:45] first log [21:34:58] ah maaan, but as ottomata1 [21:34:59] :( [21:37:17] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:38:02] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:39:21] New patchset: Jgreen; "grr, fixed code-omission-fail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8718 [21:42:05] PROBLEM - swift-object-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:44:45] New patchset: Jgreen; "adding cron job for offhost backups of db1025" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8719 [21:45:59] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay seconds [21:46:08] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay seconds [21:46:35] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [21:46:42] !log shutting down mysql on db12 in able to restart with binlogging disabled [21:46:45] Logged the message, Master [21:48:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:53:38] RECOVERY - swift-object-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:54:14] PROBLEM - mysqld processes on db12 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:56:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.455 seconds [21:57:23] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [21:58:26] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [21:59:56] RECOVERY - mysqld processes on db12 is OK: PROCS OK: 1 process with command name mysqld [22:04:23] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 1326 seconds [22:04:41] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 1332 seconds [22:31:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [22:43:05] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [22:43:41] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [22:47:17] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [22:49:27] maplebed: can you please revert that change from earlier today? [22:53:31] preilly: the mobile one? 
[22:53:56] maplebed: this one https://gerrit.wikimedia.org/r/#/c/8619/ [22:54:09] ah that's fine ;) [23:01:51] maplebed: actually just merge this one https://gerrit.wikimedia.org/r/#/c/8723/ [23:02:53] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [23:07:34] Change abandoned: preilly; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8723 [23:08:19] or not [23:12:20] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [23:12:29] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [23:13:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:09] maplebed: actually this one https://gerrit.wikimedia.org/r/#/c/8724 [23:19:15] maplebed: never mind Ryan_Lane is doing it [23:19:18] done [23:19:26] gerrit is all fucked up [23:19:48] of course the hooks are broken [23:21:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [23:24:24] New patchset: Ryan Lane; "test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8725 [23:24:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8725 [23:24:45] New review: Ryan Lane; "test" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8725 [23:24:55] seems that fixed that [23:25:07] Change abandoned: Ryan Lane; "was a test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8725 [23:26:32] New patchset: Ryan Lane; "Fixing the commend-added hook" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8726 [23:26:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8726 [23:26:54] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8726 [23:27:17] hm [23:27:25] merge logs are still broken [23:29:30] New patchset: Ryan Lane; "Fixing change-merged" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8727 [23:29:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8727 [23:30:03] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8727 [23:30:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8727 [23:50:56] http://www.downforeveryoneorjustme.com/wikitech.wikimedia.org [23:51:34] woosters: ^ [23:52:46] seems the linode is down [23:53:20] oh noes, our operations information cluster is down! [23:53:27] part of newark is down apparently [23:53:42] I'm still on IRC so it couldn't be all of linode [23:54:36] * Ryan_Lane nods [23:54:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds