[00:01:51] RECOVERY - Disk space on ms-be7 is OK: DISK OK [00:02:54] !log maxsem synchronized php-1.23wmf9/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/105108' [00:03:14] Logged the message, Master [00:03:58] I'm done [00:04:27] woohoo [00:21:23] (03CR) 10Tim Starling: [C: 032] Set $wgULSFontRepositoryBasePath to protocol-relative URL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105115 (owner: 10Ori.livneh) [00:21:42] (03Merged) 10jenkins-bot: Set $wgULSFontRepositoryBasePath to protocol-relative URL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105115 (owner: 10Ori.livneh) [00:23:23] !log tstarling synchronized wmf-config/CommonSettings.php [00:23:53] Logged the message, Master [00:25:05] TimStarling: thanks [00:28:34] ok, well it still aborts [00:31:04] it aborts the connection and then starts a new request for the same URL after 100ms [00:36:12] do you think we could check connection_aborted() during MW request shutdown, and log to a special channel if it is? [00:45:08] TimStarling: maybe, do you have any reason to suspect the problem is prevalent? [00:45:38] I haven't been able to reproduce it, so it could still be a browser bug [00:51:14] ori: what's up? [00:51:57] paravoid: can you be around if I flip the graphite CNAME to tungsten? [00:51:59] also, hi [00:52:13] happy new year [00:52:21] same to you :) [00:52:35] yes, I can be around [00:52:39] is everything ready? [00:53:03] no, not yet. how much longer do you think you'll be around? [00:53:54] half an hour maybe? [00:54:05] but I can stay for more [00:54:12] how can I help? [00:55:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.271 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [00:56:51] RECOVERY - MySQL Slave Running on db68 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:58:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [01:02:34] paravoid: argh, I need to edit all the graph definition files in files/graphite/gdash/dashboards to change all FooMetric -> MediaWiki.FooMetric [01:02:44] since they're namespaced under 'MediaWiki' in the new instance [01:03:04] sed -i ? [01:03:06] :) [01:03:16] (i know, error prone :P) [01:03:17] FooMetric can be anything, sadly [01:03:34] e.g.: :data => 'cactiStyle(substr(EditPage.*.tp90,1,2))' [01:03:39] ('EditPage' in this case) [01:04:11] paravoid: sorry, I forgot about that. Probably best that I just ping you tomorrow. [01:14:08] (03PS4) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [01:18:01] AaronSchulz: since you're setting it to 1 you might as well explicitly check for that value [01:18:55] I did that before and then didn't, meh [01:19:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 8.612 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [01:21:04] (03PS5) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [01:21:05] meh [01:22:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [01:27:06] (03CR) 10Ori.livneh: [C: 032] Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 (owner: 10Aaron Schulz) [01:28:03] ori: https://gerrit.wikimedia.org/r/#/c/105021/ [01:28:06] eeaseh [01:30:15] !log reedy started scap: Fix Disambiguator hewiki magicwords [01:30:19] Reedy: why two separate messages? 
[01:30:47] also, why remove the code to print it to stdout? [01:31:02] Logged the message, Master [01:31:42] does it not print and assign? [01:32:30] "The -v option causes the output to be assigned to the variable var rather than being printed to the standard output." [01:32:31] srsly [01:33:10] it's like printf / sprintf [01:33:32] you could just change one line, the current dologmsg [01:34:39] well, no. print it to a var, echo the var, and interpolate it into the log msg format [01:39:12] (03PS2) 10Reedy: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 [01:42:41] mw103: mwdeploy is not in the sudoers file. This incident will be reported. [01:42:41] mw103: mwdeploy is not in the sudoers file. This incident will be reported. [01:42:41] mw103: Done [01:42:48] There's quite a few of these for different servers... [01:42:58] I'm really on santas naughty list now [01:52:18] (03PS1) 10Aaron Schulz: Do not try to run as mwdeploy in scap-2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105132 [01:52:36] yep. [01:52:53] Didn't even last a fortnight :( [01:53:24] ori: ^ stupid c/p error [01:54:21] Is my "localisation update" not going to have worked then? 
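[Editor's note] The `printf -v` behavior quoted above ("the output to be assigned to the variable var rather than being printed") can be sketched in a few lines of bash. This is a minimal illustration of the pattern being discussed, not the actual dologmsg code; the variable names and the sample elapsed time are invented here.

```shell
#!/bin/bash
# Illustrative only: format a scap duration once with printf -v, then reuse it.
elapsed=1888   # hypothetical elapsed seconds for a scap run
# printf -v assigns the formatted string to the named variable instead of
# printing it -- the printf/sprintf distinction mentioned above.
printf -v duration '%dm %ds' $(( elapsed / 60 )) $(( elapsed % 60 ))
# The one variable can now be echoed to stdout *and* interpolated into the
# IRC log message, so nothing is formatted twice.
echo "scap completed in ${duration}."
echo "!log scap completed in ${duration}"
```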
[01:55:19] it would have aborted out [01:55:21] (03PS3) 10Ori.livneh: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 (owner: 10Reedy) [01:55:29] anything after the failed sudo doesn't happen [01:55:48] (03CR) 10Ori.livneh: [C: 032] Do not try to run as mwdeploy in scap-2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105132 (owner: 10Aaron Schulz) [01:55:58] Right [01:56:00] Sigh [01:57:07] (03PS4) 10Ori.livneh: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 (owner: 10Reedy) [01:57:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 (owner: 10Reedy) [01:57:24] !log reedy finished scap: Fix Disambiguator hewiki magicwords [01:57:39] scap completed in 31m 28s. [01:58:02] Logged the message, Master [02:01:50] Hmmm [02:02:03] Localisation update is due to start nowish.. [02:03:09] i ran puppet on tin [02:03:13] anything else i can do to help? [02:03:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.726 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:03:52] I sorta need to run scap again... But it's going to be running at a very similar time to localisation update [02:04:59] so the /upstream dir must have been synced, so I can just rebuild the cdbs [02:05:18] (03CR) 10Faidon Liambotis: "Our nameservers do not provide a recursion service anyway, so yes, these are all noops and this discussion is a orthogonal, indeed." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [02:05:57] paravoid: you aren't awake are you? 
[02:06:02] I am [02:06:11] it's absurdly late there ;) [02:06:28] I'm thinking of ways to move away from dns for ldap/puppet [02:06:34] were you afraid of someone hacking my gerrit account and posting code reviews? :) [02:06:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:06:43] AaronSchulz: WFM [02:07:08] I'm not sure I understand [02:07:12] I think it's done already, just hanging on the last few boxes [02:07:14] one sec [02:07:21] need to describe it :) [02:07:27] okay :) [02:07:38] I was thinking of storing puppet data in DNS, and writing an ENC that reads it from there [02:07:40] searchidx1001...always the last to the party [02:07:54] then writing everything into DNS using designate [02:08:51] Reedy: I didn't do wmf9 though, does that matter? [02:09:01] why? [02:09:20] because I'd like the web interface to stop writing puppet and dns info into ldap [02:09:41] ah [02:09:58] this would make it easier to open api access [02:10:21] no generic key/value openstack service yet I'm assuming :) [02:10:44] well, there is, kind of [02:10:51] it wouldn't really be much of a help there, though [02:11:02] AaronSchulz: Shouldn't really. localisation update will fix that one long before it's used on hewiki [02:11:15] since it really just provides the ability to create them in tenants [02:11:37] this needs to be globally accessible and tenant writable [02:11:59] I could write a nova plugin for puppet to write it into LDAP, too [02:12:12] it doesn't necessarily need to be designate [02:13:26] DNS can be extended to be a key/value for things that don't write a lot and don't need to be queried easily. [02:13:36] not sure if it's an insane idea or not, though :D [02:13:38] switching to designate for DNS sounds kind of obvious [02:13:50] yeah, we're doing that either way [02:13:56] using DNS for puppet classes... 
dunno, I'm not thrilled by the idea :) [02:14:02] * Ryan_Lane nods [02:14:17] how would you do variables? [02:14:24] arbitrary TXT records or something? [02:14:28] yep [02:15:56] extending nova client and server api is likely easy enough too, though [02:16:14] and then what? nova server api writing to ldap? [02:16:17] yep [02:16:44] I want to move away from LDAP for DNS because pdns's implementation sucks [02:17:12] and it's unmaintained [02:17:20] and there's an openstack service for it now [02:17:25] yep [02:17:32] which makes things way easier [02:17:32] that alone sounds a good enough reason to me [02:17:38] sounds like* [02:17:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.843 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:17:44] mhoover has a working prototylpe [02:17:55] !log LocalisationUpdate completed (1.23wmf8) at Fri Jan 3 02:17:55 UTC 2014 [02:17:56] do you know what is wrong with ^^^ btw? :) [02:18:05] speaking of labs DNS [02:18:06] has that been flapping? [02:18:09] yes [02:18:10] all day [02:18:14] I haven't investigated [02:18:18] Logged the message, Master [02:18:27] opendj has an issue of some variety [02:18:38] I'm betting something absurd is querying it poorly [02:18:39] and even the recovery is 7.8s, doesn't sound very recovered to me either [02:19:01] indeed [02:19:31] checking opendj on virt1000 [02:19:42] seems fine there [02:20:23] hm. still a problem on virt0 [02:21:23] Reedy / AaronSchulz: I'm about to head out. Is there anything else that I should hang around for? [02:21:23] (03PS1) 10Springle: sanitarium scripts built during Sep13 data leak event, based on .sql files in Asher's old $HOME. [operations/software/redactatron] - 10https://gerrit.wikimedia.org/r/105135 [02:21:29] hm. someone's doing a pretty expensive query really often [02:21:37] not for me, I'm heading out too [02:22:21] I'm hoping to head to bed soon.. 
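[Editor's note] A minimal sketch of the "arbitrary TXT records" idea floated above: key/value pairs that would live in DNS, parsed by an ENC into the YAML puppet expects. Everything here is hypothetical -- the record names, the `puppetclass=`/`puppetvar-` convention, and the heredoc standing in for `dig +short TXT` output are all invented for illustration.

```shell
#!/bin/bash
# Hypothetical ENC sketch: parse TXT-style records into ENC YAML.
# In a real setup $records would come from `dig +short TXT <instance record>`.
records='"puppetclass=role::labs::instance"
"puppetvar-cluster=tools"'
classes=""; params=""
while IFS= read -r rec; do
  rec=${rec//\"/}                 # strip the quotes dig prints around TXT data
  key=${rec%%=*}; val=${rec#*=}
  case "$key" in
    puppetclass)  classes+="  - ${val}"$'\n' ;;
    puppetvar-*)  params+="  ${key#puppetvar-}: ${val}"$'\n' ;;
  esac
done <<< "$records"
printf -- '---\nclasses:\n%sparameters:\n%s' "$classes" "$params"
```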
[02:22:21] (03CR) 10Springle: [C: 032] sanitarium scripts built during Sep13 data leak event, based on .sql files in Asher's old $HOME. [operations/software/redactatron] - 10https://gerrit.wikimedia.org/r/105135 (owner: 10Springle) [02:22:41] Reedy: I can stick around if there's something I could do to assist [02:23:04] we should really have a dedicated ldap server for wikitech/gerrit/other web interfaces [02:23:05] ori: Nothing to do, just been confirmed as fixed :) [02:23:12] wee. [02:23:19] so that labs instances can't affect it [02:23:23] bar some purging it seems [02:23:27] yeah definitely [02:23:39] but the discussion above sounds very appropriate too :) [02:23:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:24:47] between the two, I think I prefer nova writing to LDAP using the puppet/ldap schema, rather than trying to use domain records to encode puppetClass/puppetVar [02:24:58] agreed [02:25:09] either way I need to write a plugin for some service [02:33:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 9.447 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:36:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:47:35] !log LocalisationUpdate completed (1.23wmf9) at Fri Jan 3 02:47:34 UTC 2014 [02:47:53] Logged the message, Master [02:52:31] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.117 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:53:37] paravoid: ^^ well seems clearing that cron may have fixed the situation [03:03:17] (03PS1) 10Ryan Lane: Specify the command to remove for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105136 [03:08:16] (03CR) 10Ryan Lane: [C: 032] Specify the command to remove for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105136 (owner: 10Ryan Lane) [03:20:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jan 3 03:20:58 UTC 2014 [03:21:20] Logged the message, Master [03:23:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:24:07] !log start schema changes for bug 59236, indexing only, ipblocks ipb_parent_block_id [03:24:54] Logged the message, Master [03:33:40] (03PS1) 10Ryan Lane: Fix misplaced closing brace for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105137 [03:33:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 6.466 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [03:35:28] (03CR) 10Ryan Lane: [C: 032] Fix misplaced closing brace for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105137 (owner: 10Ryan Lane) [03:37:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:39:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 5.797 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [04:24:14] TimStarling: is there any use keeping the postConnectionBackoff stuff around? 
[04:27:34] !log added python-keystone-redis to apt repo [04:27:35] actually if getLagTimes is the only thing used, then that could just go in LB and LM can be deleted [04:27:50] Logged the message, Master [04:28:02] I guess postConnectionBackoff can go [04:28:32] it's a whole separate hierarchy that really only supports mysql [04:28:34] I figured it was better to reduce the max_threads on the server, and let the clients get a connection error and die [04:28:47] well, the idea was that more subclasses would be added to support other DBMSes [04:29:07] and, ideally, some other information system outside of MySQL [04:29:22] that doesn't require you to actually connect to the server to check whether it is overloaded [04:30:14] can't other replication systems have a meaningful getLag() implementation? I'd hope that any replica DB has some way to get the lag somehow. [04:30:39] so the only real use would be something that grabs async stats without a direct DB query [04:31:30] we kind of hack around that with $wgMemc, the add() locks, and using stale data while locked [04:31:40] yes, we have memcached already [04:31:58] ideally you don't want to have to connect to a server in order to decide whether or not to connect to it [04:32:04] (03PS1) 10Ryan Lane: Add redis support to keystone [operations/puppet] - 10https://gerrit.wikimedia.org/r/105139 [04:32:05] it is kind of inefficient [04:32:38] I imagined LoadMonitor being a client for some system which held information on all mysql servers [04:32:50] updated by regular polling [04:36:22] it would be nice to have a non-mysql specific one...actually the mysql one might be as long as getLag is implemented [04:37:28] yeah, once postConnectionBackoff() is removed, it is not really MySQL-specific [04:38:55] of course, the DatabasePostgres class for example, would need config flags for the type of replication (http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling#Introduction) [04:39:10] with mysql, there 
aren't so many choices [04:39:33] TimStarling: so when was that backoff disabled? [04:39:41] I see it commented out with --TS [04:43:59] after 2009 and before 2013 [04:46:31] PROBLEM - MySQL Slave Running on db1026 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: Error Deadlock found when trying to get lock: try restarting transac [04:46:39] pity the history of this repo wasn't imported into git from subversion [04:47:31] RECOVERY - MySQL Slave Running on db1026 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [04:48:39] looks like it was around 2011-06-23, r1905 in the old repo [04:50:29] it was during incident response on that day [04:50:49] https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_18#June_23 [04:51:01] 16:32 RoanKattouw: Site came back up instantly after Tim disabled max_threads [04:51:01] 16:28 logmsgbot: tstarling synchronized php-1.17/wmf-config/db.php 'disabled max threads' [05:00:58] haproxy nodes in front of slaves, perhaps. it can monitor both rep lag and outages [05:08:11] aaron@fluorine:~/mw-log$ grep -P -o 'Error connecting to .+ ' dberror.log | grep -P -o '\d+\.\d+\.\d+\.\d+' | sort | uniq -c [05:08:13] 4312 10.64.0.6 [05:08:14] 8743 10.64.16.23 [05:08:16] 718 10.64.16.29 [05:08:48] springle: I always wonder if there is some way to avoid that error spam (though it only matters to the user if all the slaves give it) [05:09:43] either the weights are off, the max connections on the server are too low, or there are too many CDN misses [05:10:59] db1002, db1034, and db1040 [05:14:44] max_connections is already high. imo it should be lower [05:15:24] well that or more slaves are needed [05:15:35] so 4 possibilities ;) [05:15:37] db1002 and 34 are both s2 and have same hardware/config. probably a spike [05:15:49] yes definitely. have ordered some [05:15:59] these flood of errors happen every day [05:16:23] springle: move some from tampa too? 
>:D [05:16:43] yep :) i have 9 coming from tampa ;) [05:16:44] hmm, that emoticon looks wrong in BZ [05:17:12] dastardly smiley face is just regular angry face [05:17:21] *CZ [05:37:40] query: SELECT /* EditPage::getLastDelete EvilFreD */ [05:37:59] silly that such a simple query falls over...must be those missing log indexes [06:57:42] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [06:58:52] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:21] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:31] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 2217.21 ms [06:59:51] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64254 bytes in 5.294 second response time [06:59:53] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 16%, RTA = 118.64 ms [07:00:21] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64390 bytes in 7.692 second response time [07:00:51] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:51] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64390 bytes in 5.604 second response time [07:04:21] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:03] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2161.02 ms [07:05:03] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2214.52 ms [07:05:11] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 2058.84 ms [07:05:32] RECOVERY - Host amssq48 is UP: PING WARNING - Packet loss = 37%, RTA = 111.89 ms [07:05:32] RECOVERY - Host amssq56 is UP: PING WARNING - 
Packet loss = 37%, RTA = 146.32 ms [07:05:41] RECOVERY - Host amssq62 is UP: PING WARNING - Packet loss = 37%, RTA = 118.10 ms [07:06:12] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64390 bytes in 0.655 second response time [07:06:22] !log Disabled OSPF3 on csw2-knams:xe-1/1/0.0 [07:06:31] PROBLEM - Packetloss_Average on erbium is CRITICAL: CRITICAL: packet_loss_average is 14.6136152381 (gt 8.0) [07:06:40] Logged the message, Master [07:11:01] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.32331555556 (gt 8.0) [07:16:31] RECOVERY - Packetloss_Average on erbium is OK: OK: packet_loss_average is 0.48323969697 [07:17:01] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.43920567568 (gt 8.0) [07:21:01] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.629932833333 [07:21:11] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [07:27:01] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.644574864865 [07:37:03] !log streber from mgmt console reports eth0 link down (hence it appears down to icinga and ganglia) [07:37:20] Logged the message, Master [07:37:52] no carrier... [07:38:07] * apergos looks around for a mark [07:39:26] mark, around? any ideas about streber no carrier? 
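[Editor's note] For context on the "no carrier" report: the kernel exposes the link state that the mgmt console was showing under /sys/class/net, so it can be confirmed from a shell on the host. A sketch that runs anywhere (it uses the loopback interface instead of streber's actual NIC, which is an assumption purely for portability):

```shell
#!/bin/bash
# Check link state from sysfs. "lo" is used here only so the sketch runs
# anywhere; on the affected host this would be eth0.
iface=lo
state=$(cat "/sys/class/net/${iface}/operstate")
echo "${iface} operstate: ${state}"
# On a NIC with no carrier, operstate reads "down"; the carrier flag itself
# lives at /sys/class/net/<iface>/carrier (0 when the link is down).
```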
[07:45:07] nm I see your email nw [08:09:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [09:07:07] goooood morning [09:07:24] heyo [09:10:26] (03CR) 10Hashar: [WIP] Kibana puppet class (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 (owner: 10BryanDavis) [09:16:32] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.88 ms [09:18:31] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:22:31] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.90 ms [09:25:11] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:27:31] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [09:28:01] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 264 MB (2% inode=72%): [09:31:01] RECOVERY - Disk space on wtp1023 is OK: DISK OK [09:31:51] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:51] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:33:31] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.89 ms [09:33:41] PROBLEM - Disk space on wtp1010 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=72%): [09:36:11] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:38:41] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:41] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time [09:49:31] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [09:49:41] RECOVERY - Disk space on wtp1010 is OK: DISK OK [09:50:51] !log restarted parsoid on wtp1010 and 1023, several gigs of logs full of "Error: Can't set headers after they are sent." 
from ServerResponse.OutgoingMessage.setHeader [09:51:09] Logged the message, Master [09:58:23] (03CR) 10saper: "Oh, very imporant reasons:" [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [11:12:13] (03PS1) 10Ori.livneh: gdash: Prefix all metric names with 'MediaWiki.' [operations/puppet] - 10https://gerrit.wikimedia.org/r/105163 [11:16:10] ori: friendly reminder that it is 3am and you should sleep :-D [11:16:52] allllllllmost [11:17:14] ori: and I resist the envy of bike shedding about prefixing metrics with MediaWiki :D [11:17:45] (03PS1) 10Ori.livneh: point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 [11:17:55] (03CR) 10jenkins-bot: [V: 04-1] point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 (owner: 10Ori.livneh) [11:18:36] bah [11:19:00] (03PS2) 10Ori.livneh: point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 [11:49:05] (03CR) 10Aklapper: [C: 031] "LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/103525 (owner: 10Dzahn) [12:14:30] !log Jenkins is showing failures for tests executed on integration-slave01 (remote file system failing) [12:14:46] Logged the message, Master [12:14:56] uhuh, morebots is here :) [12:43:48] !log Jenkins still unable to use integration-slave01 (restarted the node in labs, and disconnected/re-launched slave agent afterwards, too; no effect) [12:44:02] Logged the message, Master [13:30:21] PROBLEM - MySQL Processlist on db1009 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 92 copy to table, 36 statistics [13:34:21] RECOVERY - MySQL Processlist on db1009 is OK: OK 1 unauthenticated, 0 locked, 6 copy to table, 4 statistics [13:35:46] !log killed msnbot spike on s2 [13:36:04] Logged the message, Master [13:44:07] again? [13:45:49] msn o_O [13:56:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:57:01] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [14:08:31] PROBLEM - HTTP on aluminium is CRITICAL: Connection refused [14:11:31] RECOVERY - HTTP on aluminium is OK: HTTP OK: HTTP/1.1 302 Found - 557 bytes in 0.001 second response time [14:14:51] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:51] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [14:18:11] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [14:19:12] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.446 second response time [14:23:14] !log restarting Jenkins , some git plugin are misbehaving [14:23:30] Logged the message, Master [14:30:13] hi akosiaris - per https://wikitech.wikimedia.org/wiki/Server_access_responsibilities#SSH_key_access I need my key revoked. Do you want the fingerprint, the public key, or something else? [14:30:39] (the laptop is with WMF office IT for repair) [14:31:02] (I assume I should contact akosiaris because that's the nick I see for "on RT duty" in the /topic) [14:31:53] I need username and public key (whether the fingerprint or the entire key... does not matter for revocation) [14:32:24] akosiaris -- username: sumanah [14:32:43] after you issue a new one, that one you need to upload somewhere trusted and I will fetch it from there (office wiki for example) [14:32:44] akosiaris: fingerprint: 54:66:e6:49:fd:47:1e:16:19:d8:85:94:cd:61:d3:1c [14:32:47] akosiaris: got it [14:33:05] oh.. and welcome back!! [14:33:08] thank you akosiaris! [14:33:42] I'm sorry for not revoking the key IMMEDIATELY; the laptop was being carried via a trusted person to WMF OIT and so I thought the risk was ok, but then I read the wikitech wiki page and saw that it was not. 
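[Editor's note] The fingerprint exchange above can be reproduced with `ssh-keygen -l`, which prints a public key's fingerprint so both sides can confirm they mean the same key before revoking it. A throwaway-key sketch; the temp path and key comment are invented here:

```shell
#!/bin/bash
# Generate a disposable key and print the fingerprint an admin would compare.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -C 'example-laptop-key' -f "${tmpdir}/id_ed25519"
# -lf prints the fingerprint of the given (public) key file.
fp=$(ssh-keygen -lf "${tmpdir}/id_ed25519.pub")
echo "$fp"
rm -rf "$tmpdir"
```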
[14:35:38] (03PS1) 10Mark Bergsma: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 [14:36:09] (03PS1) 10Alexandros Kosiaris: Revoke sumanah keys after her request. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105186 [14:37:41] (03CR) 10jenkins-bot: [V: 04-1] Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [14:39:23] (03CR) 10jenkins-bot: [V: 04-1] Revoke sumanah keys after her request. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105186 (owner: 10Alexandros Kosiaris) [14:39:34] (03PS2) 10Mark Bergsma: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 [14:39:40] LOST ? [14:39:55] akosiaris: huh? [14:40:08] jenkins is voting LOST ... [14:40:13] oh, jenkins. Weird! [14:40:16] on all jobs... weird... [14:40:22] will force the commit [14:40:24] : !log restarting Jenkins [14:40:34] Just wait a few minutes, CR+2 it now and it'll be handled when it is ready. [14:40:42] i should stop Zuul probably :D [14:40:51] 17 minutes ago... does it take that long ? [14:40:54] (unless it needs to be deployed right now) [14:41:00] Jenkins takes a long time to reboot. [14:41:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revoke sumanah keys after her request. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105186 (owner: 10Alexandros Kosiaris) [14:41:14] (03CR) 10jenkins-bot: [V: 04-1] Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [14:42:46] brainwane: I just merged the chage, key will be revoked from anywhere in the cluster in 30 mins top. Thanks for letting us know. [14:43:53] !log zapped /vol/root export from nas-1001-a [14:44:10] Logged the message, Master [14:44:16] thanks akosiaris - that seems sufficient to me [14:44:16] entirely? [14:45:19] mark: we can change it to another host if someone uses it. 
[14:45:32] occasionally to do netapp config changes [14:45:38] VERY annoying that you almost have to do that over nfs [14:45:46] but yeah, not on streber anymore ;p [14:45:50] I always use the CLI [14:46:00] does it have an editor? [14:46:20] or do you use that cat equivalent ;p [14:46:22] I think it does in 8.something something [14:46:30] but I don't use it [14:46:32] why can't they just ship vi or smt [14:48:16] anyway we can always set the export to another host when we need it [14:48:24] absolutely [14:53:25] !log jenkins restarted [14:53:43] Logged the message, Master [14:57:15] (03PS3) 10Hashar: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [15:00:48] (03CR) 10Mark Bergsma: [C: 032] Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [15:00:51] PROBLEM - MySQL Slave Running on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:41] RECOVERY - MySQL Slave Running on db1017 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [15:20:15] (03PS8) 10Physikerwelt: Add Mathoid module (TeX -> MathML / SVG conversion web service) [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 [15:23:20] (03PS1) 10Hashar: webproxies service entries in wmnet. [operations/dns] - 10https://gerrit.wikimedia.org/r/105189 [15:23:38] (03PS2) 10Hashar: webproxies service entries in wmnet. [operations/dns] - 10https://gerrit.wikimedia.org/r/105189 [15:24:51] (03CR) 10Mark Bergsma: [C: 032] webproxies service entries in wmnet. 
[operations/dns] - 10https://gerrit.wikimedia.org/r/105189 (owner: 10Hashar) [15:37:51] (03PS1) 10Mark Bergsma: Setup RANCID on netmon1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105191 [15:39:03] (03CR) 10Mark Bergsma: [C: 032] Setup RANCID on netmon1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105191 (owner: 10Mark Bergsma) [15:39:49] (03PS1) 10Tim Landscheidt: Tools: Install requested package python-pyexiv2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105192 [15:47:18] (03PS1) 10Mark Bergsma: Add systemuser creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105193 [15:48:23] (03CR) 10Mark Bergsma: [C: 032] Add systemuser creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105193 (owner: 10Mark Bergsma) [16:04:47] (03PS1) 10Mark Bergsma: Add /etc/rancid/rancid.conf to config management [operations/puppet] - 10https://gerrit.wikimedia.org/r/105199 [16:14:15] !log reedy synchronized php-1.23wmf9/extensions/ [16:14:32] Logged the message, Master [16:15:22] !log reedy synchronized php-1.23wmf8/extensions/ [16:15:38] Logged the message, Master [16:22:23] (03PS1) 10Mark Bergsma: Add RANCID cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/105203 [16:23:09] (03CR) 10Mark Bergsma: [C: 032] Add /etc/rancid/rancid.conf to config management [operations/puppet] - 10https://gerrit.wikimedia.org/r/105199 (owner: 10Mark Bergsma) [16:23:27] (03CR) 10Mark Bergsma: [C: 032] Add RANCID cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/105203 (owner: 10Mark Bergsma) [16:24:16] (03CR) 10Faidon Liambotis: [C: 04-1] "scap shuffles the list to avoid overloading rsync servers, as far as I can see, so this won't produce the effect you want it to produce. I" [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [16:27:12] (03CR) 10Reedy: "Outage last night; waiting for redeployment of updated l10n cache files via sync-dir to fix the problem. 
EQIAD was last, so had to wait fo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [16:28:14] Reedy: outage report? :) [16:28:33] commons and wikivoyage were broken ;) [16:28:40] and Wikidata [16:28:46] because of wikidata [16:29:11] Due to the somewhat stupid way we handle missing magic words [16:29:37] or, as in this case; we don't [16:29:52] can you send a detailed report via email or even document it on wikitech under the incident section? [16:30:16] Yeah [16:30:25] not too detailed, just what happened/what was the bug, when did it happen, why did it happen, what can we do to fix it :) [16:30:32] I'd had more than enough after the long deploy and unrelated database outage we had afterwards [16:30:52] :) [16:31:05] I understand, it's not urgent, but I was lacking context for your change [16:31:06] it was a fun day [16:31:21] oh the fun continued today, don't you worry [16:31:38] paravoid: it was literally a wtf, why do we do tampa first for anything like that [16:32:05] In theory, reversing the list would've reduced the time to deploy the fix by anything upto 50% [16:32:08] so sync-common-files takes the list as-is [16:32:18] yup [16:32:18] sync-common-file [16:32:21] line by line via dsh [16:32:25] although it is dsh -c [16:32:26] X at a time etc [16:32:29] so in theory, concurrent [16:32:43] scap shuffles, though [16:33:08] yeah because of tims proxy fan out niceness [16:35:35] (03CR) 10Faidon Liambotis: "Are we going to do both kibana.wm.org & logstash.wm.org?" [operations/dns] - 10https://gerrit.wikimedia.org/r/105105 (owner: 10BryanDavis) [16:37:05] paravoid: Would you rather have logstash as the name for the search GUI? I don't care [16:37:24] hi [16:37:27] are we going to do both? [16:37:45] logstash has a web intf too, does it not? [16:37:49] Hi and happy new year :) [16:37:54] I don't think so paravoid [16:37:56] oh yeah :) [16:38:07] Ah. 
It used to but has pretty much been deprecated [16:38:09] it used to at least [16:39:03] It had a really crappy UI and then the ruby rewrite of kibana (v2). Now it seems to be dead tech [16:39:54] So I was just planning on the kibana3 UI with logstash being backend middleware for parsing and shipping logs [16:40:15] shipping? [16:40:21] shipping how? [16:40:26] ocean going liners [16:40:52] Into elasticsearch and any other backends we end up wanting [16:41:48] log source -> logstash -> elasticsearch [16:42:06] log source being? [16:42:18] php app or syslog or ? [16:42:47] okay, "log shipping" confused me :) [16:42:58] lumberjack etc. [16:43:17] Ah. Yes. That is a possible but not probable use case for us [16:43:22] http://cookbook.logstash.net/recipes/log-shippers/ [16:43:51] I think logstash is a little heavy for running on most log generating nodes [16:44:12] yup [16:48:53] <^d> bd808: Heh, http://www.elasticsearch.org/blog/logging-elasticsearch-events-with-logstash-and-elasticsearch/ [16:50:48] (03PS1) 10Mark Bergsma: Fix rancid file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105209 [16:50:49] (03PS1) 10Mark Bergsma: Add RANCID .cloginrc file [operations/puppet] - 10https://gerrit.wikimedia.org/r/105210 [16:51:04] ^d: Feeding back in to the same cluster :) [16:51:39] I've done that before in testing but never considered it in production [16:51:45] <^d> It sounds insane! [16:51:46] <^d> :) [16:52:08] paravoid: just read _security..... [16:52:15] I know [16:52:16] <^d> bd808: Logging elastic (for cirrus) in logstash would be a good idea though, and easy.
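[The shipper role discussed above, logstash being too heavy to run on every log-generating node, hence lumberjack-style shippers, boils down to: follow a file, frame each new line as an event, and forward it to a central logstash input. A minimal illustrative sketch; the server name, port, and event fields are invented, and real shippers such as lumberjack add TLS, acks, and backpressure.]

```python
import json
import socket

def frame_event(line, source_host, log_type):
    """Wrap one raw log line in a minimal logstash-style event envelope."""
    return json.dumps({
        'message': line.rstrip('\n'),
        'host': source_host,
        'type': log_type,
    })

def ship(path, server=('logstash.example.org', 5043), log_type='apache'):
    """Follow a log file and forward each new line to a central collector.

    A real shipper would also handle TLS, reconnects, and acknowledgements.
    """
    sock = socket.create_connection(server)
    with open(path) as f:
        f.seek(0, 2)  # start at end of file, like tail -f
        while True:
            line = f.readline()
            if not line:
                continue  # a real shipper would sleep/poll here
            event = frame_event(line, socket.gethostname(), log_type)
            sock.sendall(event.encode('utf-8') + b'\n')
```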
[16:52:21] just before I left dA logstash was rolled out on all the web servers [16:52:37] with ES and kibana as the frontend (not sure what version) [16:52:59] didn't seem to be an issue to run in production, then again I don't know the performance details [16:53:06] (03CR) 10Mark Bergsma: [C: 032] Fix rancid file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105209 (owner: 10Mark Bergsma) [16:53:27] <^d> gi11es: We're explicitly having different clusters for logstash's elasticsearch and production search's elasticsearch. [16:53:36] <^d> :) [16:53:40] sounds sane [16:54:07] (03CR) 10Mark Bergsma: [C: 032] Add RANCID .cloginrc file [operations/puppet] - 10https://gerrit.wikimedia.org/r/105210 (owner: 10Mark Bergsma) [16:55:21] <^d> bd808: Do we have a preferred host to log to in the logstash setup? [16:55:53] <^d> (or, maybe we could put it behind lvs like we did with elastic10[nn] so clients don't care) [16:56:25] ^d: We haven't got that bit nailed down yet [16:57:02] There is going to be a redis input path that the php code will probably use [16:57:02] <^d> I think we should do lvs like we did on the other one. It makes things wayyyyy easier for the clients (and is pretty trivial to setup) [16:57:18] <^d> Then it won't matter what boxen is moved around or renamed. [16:57:52] That would be good. I'm always a fan of service names vs host names [16:58:04] lvs for? [16:58:08] <^d> logstash [16:58:11] for logging you mean? [16:58:17] <^d> ya [16:58:59] logging behind lvs? is that really necessary? :) [16:59:05] are you going to allow logstash usage from php-land? or will it be for purely traditional syslog entries? [16:59:21] I was hoping all clients will support writing to multiple servers [16:59:34] <^d> paravoid, mark: Maybe not necessary, I just threw the idea out there. 
[16:59:39] <^d> :) [16:59:39] gi11es: We will be logging from php for sure [16:59:45] gi11es: the first deployment is going to be mediawiki actually, syslog will come later :) [17:00:02] and I'm not even sure how are we going to do that yet tbh [17:00:07] awesome. is there a legacy equivalent? or is it the first time you do something like this? [17:00:08] logging is rather basic infrastructure, I'd rather not have that depend on something like lvs if we can avoid it [17:00:20] as logstash doesn't really have proper access controls [17:00:40] <^d> mark: Fair enough :) [17:00:41] and some syslog data are sensitive enough [17:01:05] <^d> Anyway, anything that's java-base we can easily get in elasticsearch. [17:01:16] <^d> log4j should *just work* with socket appender stuff. [17:01:18] gi11es: we have a plain old syslog writing to files now, plus at least one other specialized thing for syslog data [17:02:16] gi11es: We are basically trying to move beyond "ssh to fluorine and use grep" [17:02:24] and syslog was being written to by php? [17:03:13] There is a somewhat elaborate php -> udp -> files path for production application logs [17:05:07] stop me if I ask too many questions :) and are these oldschool application logs searchable/graphed? [17:05:43] currently, no, unless you count grep as "searchable" [17:05:49] :) [17:06:16] There are some "rate of message generation" graphs too but nothing very fancy [17:07:06] (03PS1) 10Mark Bergsma: Fix RANCID directory modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105212 [17:07:17] gi11es: You can play with the proof of concept system in labs: https://logstash.wmflabs.org/ [17:07:39] wow, yeah logstash + kibana is going to be like going from horse carriages to spaceships [17:08:20] that's the hope [17:09:03] whoaa bd808 kibana is like graphs for logstash? 
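[The "somewhat elaborate php -> udp -> files" path described above amounts to: MediaWiki emits one datagram per log line, prefixed with a channel name, and a collector appends each line to that channel's file (the grep-on-fluorine workflow). An illustrative sketch, not the actual udp2log source; the port and log directory are made up.]

```python
import socket

def split_channel(datagram):
    """A datagram is 'channel payload'; return the two halves."""
    channel, _, payload = datagram.partition(' ')
    return channel, payload

def collect(port=8420, log_dir='/tmp/mw-log'):
    """Receive datagrams and append each line to its channel's file."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', port))
    while True:
        data, _ = sock.recvfrom(65535)
        channel, payload = split_channel(data.decode('utf-8', 'replace'))
        with open('%s/%s.log' % (log_dir, channel), 'a') as f:
            f.write(payload + '\n')
```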
[17:09:15] it's a better web interface [17:09:24] initially in php, then ruby, now in-browser javascript [17:09:29] (03CR) 10Mark Bergsma: [C: 032] Fix RANCID directory modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105212 (owner: 10Mark Bergsma) [17:09:33] developed by someone else than the logstash author [17:09:39] but then both authors were hired by elasticsearch [17:09:46] full circle [17:09:58] something we did that could be useful here is we moved chat logging to logstash [17:10:18] awesoome [17:10:28] there's a bunch of things we can do, but one at a time :) [17:10:35] uhhhh, how much data can we push to this thang? [17:10:43] (webrequest, ahem ahem?) [17:10:44] heheh :p [17:10:59] the possibilities are endless, but let's pace ourselves :) [17:11:01] heheh [17:11:29] we've been talking about this for over a year now [17:12:05] And hopefully we are within a few days of seeing something that works :) [17:28:23] hey opsen, can I turn your attention to https://rt.wikimedia.org/Ticket/Display.html?id=4028 ? :) [17:36:26] heyaa manybubbles, you've got elastic search icinga alerts up, right? [17:36:48] do you have any that come directly from JMX? or are they all just http curls ? [17:48:15] (03PS1) 10Edenhill: json: output NaN as null for '\!num' modifiers [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105216 [17:48:42] (03PS2) 10Edenhill: json: output NaN as null for '\!num' modifiers [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105216 [17:52:04] !log Jenkins downgrading git plugin from 2.0 to 1.5 , we might be hit by https://issues.jenkins-ci.org/browse/JENKINS-21057 [17:52:20] Logged the message, Master [17:54:57] (03CR) 10BryanDavis: "I was planning on only using the kibana interface served via apache. 
Logstash has the ability to serve a web interface but it just serves " [operations/dns] - 10https://gerrit.wikimedia.org/r/105105 (owner: 10BryanDavis) [18:04:09] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CRIT replication delay 308 seconds [18:04:09] PROBLEM - MySQL Slave Delay on db1037 is CRITICAL: CRIT replication delay 308 seconds [18:04:18] PROBLEM - MySQL Replication Heartbeat on db1051 is CRITICAL: CRIT replication delay 312 seconds [18:04:28] PROBLEM - MySQL Slave Delay on db1052 is CRITICAL: CRIT replication delay 330 seconds [18:04:28] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 330 seconds [18:04:38] PROBLEM - MySQL Slave Delay on db1051 is CRITICAL: CRIT replication delay 337 seconds [18:04:38] PROBLEM - MySQL Slave Delay on db1049 is CRITICAL: CRIT replication delay 337 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1052 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1037 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Slave Delay on db63 is CRITICAL: CRIT replication delay 338 seconds [18:04:40] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CRIT replication delay 339 seconds [18:04:48] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 342 seconds [18:04:48] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: CRIT replication delay 343 seconds [18:04:58] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 352 seconds [18:04:58] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 358 seconds [18:05:28] RECOVERY - MySQL Slave Delay on db1052 is OK: OK replication delay 140 seconds [18:05:38] RECOVERY - 
MySQL Slave Delay on db1051 is OK: OK replication delay 147 seconds [18:05:38] RECOVERY - MySQL Replication Heartbeat on db1052 is OK: OK replication delay -0 seconds [18:05:38] RECOVERY - MySQL Slave Delay on db63 is OK: OK replication delay 148 seconds [18:05:38] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 149 seconds [18:05:45] s1 quite unhappy [18:05:55] blugh [18:06:04] I can guess why.. But it shouldn't be [18:06:08] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 72 seconds [18:06:15] Unless someone broke wfWaitForSlaves()... [18:06:18] RECOVERY - MySQL Replication Heartbeat on db1051 is OK: OK replication delay -0 seconds [18:06:23] -0 seconds! result [18:06:33] that's fast [18:06:38] RECOVERY - MySQL Slave Delay on db1049 is OK: OK replication delay 0 seconds [18:06:38] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay -1 seconds [18:06:48] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay -0 seconds [18:06:58] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay -0 seconds [18:07:04] getting reports in -tech [18:07:14] of lag [18:07:25] (03PS1) 10Odder: Add more settings related to page imports on hewikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105221 [18:07:28] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds [18:07:30] yeah, it'll take a little while for it to catch up [18:07:50] s/reports/drive-by report/ [18:07:56] wtfWaitForSlaves()... [18:07:58] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 123 seconds [18:08:38] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay -1 seconds [18:08:38] RECOVERY - MySQL Replication Heartbeat on db1037 is OK: OK replication delay -1 seconds [18:09:08] RECOVERY - MySQL Slave Delay on db1037 is OK: OK replication delay 0 seconds [18:10:24] [17:17:56] Hmm. 
It looks like it'll break DatabaseMysqlBase::masterPosWait() for non-fakeSlaveLag too. Yep, errors when I try to run update.php for wikis in the labs oauth project. [18:11:28] PROBLEM - MySQL Slave Delay on db67 is CRITICAL: CRIT replication delay 360 seconds [18:11:32] But that's master.. [18:11:47] Reedy: The problematic patch I was referring to didn't make it into wmf9 though, I don't think. [18:11:50] !log Jenkins downgrading git plugin client to 1.4.6 and restarting jenkins [18:11:51] sorry [18:11:58] anomie: Right [18:12:02] And enwiki is on wmf8 anyway [18:12:06] Logged the message, Master [18:12:13] Mostly checking as it was similar area [18:13:28] RECOVERY - MySQL Slave Delay on db67 is OK: OK replication delay 0 seconds [18:13:48] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay -1 seconds [18:16:28] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [18:20:28] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [18:21:24] paravoid: graphite dns is https://gerrit.wikimedia.org/r/#/c/105166/ [18:22:30] hmm [18:22:36] on a friday... [18:22:43] I wonder if I should press this button [18:23:40] friday [18:23:41] friday [18:24:17] ori: what about the gdash dashboards that you mentioned? [18:24:38] https://gerrit.wikimedia.org/r/#/c/105163/ [18:25:50] ouch [18:25:56] this must not have been fun to make [18:26:24] i was going to say... that was more manual than i care to admit [18:26:33] so [18:26:37] we didn't swap TTLs to 60s [18:26:52] which means that now we'll be in the awkward situation that traffic will gradually switch over the course of 1 hour [18:27:09] right, i figured just !log the fact that gdash will be wacky for a bit.
[18:27:22] It won't affect that many people [18:27:25] so, let's merge it, but puppetd --disable on professor so that gdash dashboards don't get picked up [18:27:38] gdash.wikimedia.org is already tungsten [18:27:56] oh right [18:28:52] i wonder if i could just move everything in professor under MediaWiki/ too [18:29:03] what do we need gdash.pmtpa.wmnet for? [18:29:27] i don't know, probably nothing [18:29:33] i just updated it since it was there [18:29:45] how about graphite.pmtpa.wmnet? log target? [18:31:05] it is not used in that way currently (thought it'd make sense to). i figured it was just a handy alias if graphite is failing and the on-duty person doesn't happen to remember which host it is on [18:31:14] nah [18:31:16] let's drop both [18:31:42] and let's make the graphite CNAME TTL to 600 for now [18:31:51] and I'll merge now [18:31:55] want to do it or should I? [18:32:39] i can update the patch, sure [18:33:01] is there more to it than just removing the lines? [18:33:08] nope [18:35:56] (03PS3) 10Ori.livneh: Point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 [18:36:20] !log aaron started scap: active Testing timing [18:36:37] Logged the message, Master [18:36:39] templates/apache/sites/graphite.wikimedia.org.erb: ProxyPass http://graphite.pmtpa.wmnet:81/ [18:37:09] yeah, that's all going away [18:37:16] but not for another hour [18:37:19] do you care? [18:37:56] well, i guess it doesn't hurt to change it [18:38:11] * ori updates that, too [18:38:12] do we write to both carbons atm? [18:38:18] (03CR) 10jenkins-bot: [V: 04-1] Point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 (owner: 10Ori.livneh) [18:38:37] jenkins is telling lies [18:41:04] ^d & manybubbles, will Cirrus fix http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx ? 
:P [18:41:18] it obviously doesn't matter much now and I won't hold this back, but what we generally do is set TTL to some low-value first (e.g. 1m), wait until the old TTL expires, swap while keeping the old TTL, wait to see if everything works, bump TTL back to 1h [18:41:42] MaxSem: see #wikimedia-dev [18:42:17] <^d> Search is hard. [18:42:33] !log aaron started scap [18:42:35] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 (owner: 10Ori.livneh) [18:42:35] paravoid: no, let's do this right [18:42:41] heh [18:42:43] people inventing a dozen of similar looking quotes are harder [18:42:50] Logged the message, Master [18:42:54] and I worked around the graphite.pmtpa.wmnet issue, don't bothe [18:42:55] r [18:43:22] ok, thank you [18:43:26] (03PS1) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105228 [18:44:24] ok [18:44:26] gerrit is broken [18:44:29] yay [18:44:51] oh wait [18:44:59] scratch that, it was pebcak [18:45:11] ori: yeah, MW_SCAP_BETA makes do difference, it still does the same stuff [18:45:19] *no [18:45:38] it is probably not being correctly exported to the subshell in which one of the dependent scripts is executing [18:46:11] even for the local sync-common [18:46:31] maybe it's just old and set in its ways [18:46:53] probably why the --versions thing seemed to not have a effect sometimes [18:47:02] * AaronSchulz gets the python urge again [18:47:12] ori: seems to work here [18:47:17] https://gerrit.wikimedia.org/r/#/c/103107/ has effectively been abandoned by the GCI students, any admins willing to press the button? [18:47:40] * ori hugs paravoid [18:47:54] wee, thanks. 
i thought professor was going to bury me [18:47:58] oh wait [18:48:11] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [18:48:16] "no data" [18:48:49] hmm. [18:48:56] * ori investigates where reqstats.* come from [18:48:59] no MediaWiki. there [18:49:12] well, I haven't merged the gdash change yet [18:49:21] no, but I mean, this isn't mediawiki [18:49:41] it's the squid logs -> carbon [18:49:59] my $carbon_server = "10.0.6.30"; [18:49:59] my $carbon_port = 2003; [18:50:00] yay [18:50:55] oh, christ. let's not wait a week for that. [18:51:37] i'll update it so that new stats go to tungsten [18:51:42] already on it [18:51:44] ori: https://gerrit.wikimedia.org/r/#/c/103107/ abandon please? :-) [18:51:54] can you see if you can copy the data from professor? [18:52:34] yep [18:53:01] also https://gerrit.wikimedia.org/r/#/c/103355/ was dropped by the student :-( [18:53:45] (03PS1) 10Faidon Liambotis: Move sqstat (udp2log to carbon) to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/105230 [18:54:18] (03PS2) 10Faidon Liambotis: gdash: Prefix all metric names with 'MediaWiki.' [operations/puppet] - 10https://gerrit.wikimedia.org/r/105163 (owner: 10Ori.livneh) [18:54:25] (03CR) 10Faidon Liambotis: [C: 032 V: 032] gdash: Prefix all metric names with 'MediaWiki.' 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/105163 (owner: 10Ori.livneh) [18:54:33] (03PS2) 10Faidon Liambotis: Move sqstat (udp2log to carbon) to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/105230 [18:54:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Move sqstat (udp2log to carbon) to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/105230 (owner: 10Faidon Liambotis) [18:54:57] oh wait, this is wrong [18:54:58] damn [18:55:22] ori: doing sudo -u mwdeploy makes them fall off [18:55:35] paravoid: i know, i'm fixing [18:55:54] i didn't know about reqstats [18:57:04] MediaWiki.reqstatsedits.en_wikipedia_org.tp99, [18:57:05] also typo [18:59:34] (03PS1) 10Faidon Liambotis: Fix gdash reqstats dashboards [operations/puppet] - 10https://gerrit.wikimedia.org/r/105232 [18:59:37] ori: ^^ [18:59:48] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:00:40] ori: although ideally we'd put it under cdn.* or something [19:01:03] !log apparently jenkins is back up and happy. Had to revert the git plugin to a previous version ... [19:01:18] Logged the message, Master [19:01:35] oh, you're quick [19:02:17] should had caught it in review, sorry about that [19:02:19] (03CR) 10Ori.livneh: [C: 032] Fix gdash reqstats dashboards [operations/puppet] - 10https://gerrit.wikimedia.org/r/105232 (owner: 10Faidon Liambotis) [19:03:38] ori: did you find a way to copy the data? 
[19:04:54] sorry, my isp had a hickup [19:07:57] (03PS1) 10Faidon Liambotis: gdash: fix owner/group/mode for templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/105237 [19:08:49] (03CR) 10Faidon Liambotis: [C: 032] gdash: fix owner/group/mode for templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/105237 (owner: 10Faidon Liambotis) [19:09:03] ...and now graphite times out [19:09:04] thanks [19:09:24] https://graphite.wikimedia.org/render/?title=10%20Most%20Deviant%20API%20Methods%20by%20Call%20Rate%20log(2)%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(substr(mostDeviant(10,maximumAbove(MediaWiki.API.*.count,1)),0,2)) [19:09:29] for example [19:09:35] yeah, i got a 500 just now [19:09:45] don't fix everything, let me figure it out [19:09:56] lol, ok [19:11:28] (03PS1) 10Aaron Schulz: Fixed scap variable exporting [operations/puppet] - 10https://gerrit.wikimedia.org/r/105238 [19:11:36] (03PS1) 10Jgreen: add root@backup4001's ssh key to fundraising backupmover auth keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/105239 [19:11:56] ori: ^ [19:12:08] we're a little busy with breaking graphite :P [19:13:58] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:16:33] ori: hey [19:16:39] yep? [19:16:45] (03CR) 10Jgreen: [C: 032 V: 031] add root@backup4001's ssh key to fundraising backupmover auth keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/105239 (owner: 10Jgreen) [19:16:54] are you debugging? [19:17:04] yeah, a bit stumped, but give me a moment [19:17:14] it's uwsgi, did you find that? 
[19:18:34] (03CR) 10Ottomata: [C: 032 V: 032] json: output NaN as null for '!num' modifiers [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105216 (owner: 10Edenhill) [19:18:38] the uwsgi log files report successful reqs [19:18:39] Snaps: thanks [19:18:45] I'll rebuild that on Monday and deploy it then [19:18:53] oh, no it doesn't [19:18:58] nope [19:19:04] and I also got the error myself [19:19:09]
uWSGI Error: Python application not found. Connection closed by foreign host. [19:19:31] (it's a 500, varnish thinks it's a transient error and it retries indefinitely, hence the timeout) [19:21:57] it's the graphite search index I think [19:22:08] it's _graphite:_graphite [19:22:16] and the app runs as www-data, so it can't write to it [19:22:35] ori: ^ [19:22:53] you just chowned it [19:23:00] I'll take that as a "yes" :) [19:23:18] hey [19:23:21] yeah, i just figured that out [19:23:49] that needs to be fixed more permanently, i'll take care of that [19:23:50] did you see the comment in the /usr/share/graphite-web/graphite.wsgi source? [19:24:32] sigh, no. that is a little obscure. [19:26:39] professor has 16 x 70G disks [19:26:42] lol [19:27:15] 73G 15k RPM SAS [19:27:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:28:11] i'm confused. is it just choking on the requests at the moment? [19:28:22] because memcached is empty or something? [19:28:59] Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util [19:29:02] sda 0.00 2.80 1.20 3550.00 7.20 14211.20 8.01 143.65 40.39 58.67 40.39 0.28 100.00 [19:29:05] 100% i/o [19:29:09] while rendering [19:29:46] and, presumably, flushing data to disk [19:29:56] I've seen professor go 100% i/o too at times [19:30:53] ganglia says tungsten is better iowait-wise [19:30:55] but why does reloading a graph after it has been generated make it hang? it should be caching, no?
[19:32:02] I don't know how graphite works tbh [19:32:04] 'tcpdump tcp port 11211' doesn't show any traffic, so i must have misconfigured it somewhere [19:32:43] (on an unrelated note, pypy with the version in precise doesn't seem like a great idea, it's really old and Debian explicitly didn't want to release with that version) [19:33:01] it won't be needed after today [19:33:13] !log Some migration pains while moving graphite from professor to tungsten; expect graphite & gdash flakiness [19:33:28] Logged the message, Master [19:34:12] another system with lots of files! [19:38:00] are you sure graphite uses memcache for graphs? [19:38:42] If set, enables the caching of calculated targets (including applied functions) and rendered images. [19:38:42] MEMCACHE_HOSTS: "If set, enables the caching of calculated targets (including applied functions) and rendered images. " [19:38:45] heh [19:38:46] yeah :) [19:39:45] found it [19:39:57] if MEMCACHE_HOSTS: CACHE_BACKEND = 'memcached://' + ';'.join(MEMCACHE_HOSTS) + ('/?timeout=%d' % DEFAULT_CACHE_DURATION) [19:40:01] that's django 1.2 [19:40:03] we run django 1.3 [19:40:10] which has moved to a CACHES dictionary [19:40:32] hm, django might have a backwards compatibility stanza though [19:41:17] if not settings.CACHES: legacy_backend = getattr(settings, 'CACHE_BACKEND', None) if legacy_backend: [19:41:32] how does graphite have 500G of data [19:41:34] heh [19:42:09] paravoid: tail -f /var/log/graphite-web/cache.log [19:43:17] domas: logging time series data forever for lots of things [19:43:30] with no rrd-like aggregation [19:43:54] heh [19:45:58] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:46:14] ori: found it [19:46:18] oh wait no [19:46:21] dammit :) [19:46:32] W11: Warning: File "render/datalib.py" has changed since editing started [19:46:52] i made the log output the port it's trying to connect on [19:47:57] for later: 'service uwsgi stop' 
leaves uwsgi-core instances running [19:49:19] is it supposed to be trying to hit the pickle_receiver_port? [19:49:55] netstat -nap |grep 2204 shows multiple TIME_WAIT, plus one established [19:51:52] wait [19:51:56] so it pops from connectionPool [19:52:45] if it errors out, it nevers puts the connection back again, which makes sense [19:55:34] File "/usr/lib/python2.7/dist-packages/graphite/render/datalib.py", line 191, in recv_exactly [19:55:37] data = conn.recv( num_bytes - len(buf) ) [19:55:40] timeout: timed out [19:55:42] there you go [19:55:55] that's why connections keep getting added to the pool [19:56:34] Fri Jan 03 19:47:06 2014 :: Retrieval of cactiStyle(substr(mostDeviant(10,maximumAbove(MediaWiki.API.*.count,1)),0,2)) took 219.277028 [19:56:37] Fri Jan 03 19:47:07 2014 :: Rendered PNG in 0.453549 seconds [19:56:40] Fri Jan 03 19:47:07 2014 :: Total rendering time 219.770963 seconds [19:56:43] 220 seconds [20:02:29] paravoid: templates/varnish/graphite.inc.vcl.erb sets explicit TTLs [20:02:57] ok? [20:03:03] paravoid: that's not the config that is active on tungsten; tungsten's is in templates/varnish/misc.inc.vcl.erb [20:03:09] I know [20:03:31] 220s to render a graph doesn't sound great in any case [20:03:35] do you suppose that would help? i guess not, yeah. [20:03:52] and there's no traffic on the memcached port on localhost [20:03:59] hm, now there is [20:04:05] but very minimal, still [20:06:49] the graphs are empty now, are you aware of this? 
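[The connectionPool behaviour traced above, connections are popped for a query and only returned to the pool on success, so a timed-out socket is discarded rather than reused, can be sketched as follows. This is a toy model, not the real graphite-web CarbonLinkPool, which additionally hashes each metric to a (host, port, instance) tuple.]

```python
import collections
import socket

class CarbonLinkPool(object):
    """Toy model of the pool discipline discussed above: a connection
    goes back into the pool only after a successful query; a socket
    that errored or timed out is closed and dropped instead of reused."""

    def __init__(self, host, port, timeout=0.5):
        self.host, self.port, self.timeout = host, port, timeout
        self.connections = collections.deque()

    def get_connection(self):
        # Reuse a pooled connection if one exists, else dial a new one.
        if self.connections:
            return self.connections.pop()
        return socket.create_connection((self.host, self.port), self.timeout)

    def query(self, request):
        conn = self.get_connection()
        try:
            conn.sendall(request)
            reply = conn.recv(65536)
        except socket.error:
            conn.close()               # broken socket: drop it, don't pool it
            raise
        self.connections.append(conn)  # success: return it to the pool
        return reply
```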
[20:06:59] ah that's the reqstats ones [20:10:06] let's increase CARBONLINK_TIMEOUT for starters [20:10:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [20:10:20] now it's 1s [20:10:33] so that times out, the cache lookup times out, and it's rendered every time [20:12:40] it times out even with 20 [20:12:56] so the cache isn't very much of a cache probably [20:13:18] something is badly misconfigured somewhere [20:13:22] performance should not be this bad [20:13:47] i want to investigate it more before modifying the configuration for longer timeouts [20:14:06] uhm [20:14:14] I think I found it... [20:14:28] i have both face and palm ready [20:14:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [20:15:47] fixed [20:16:34] ...... [20:17:49] (03PS1) 10Faidon Liambotis: graphite: use the correct port for carbonlink [operations/puppet] - 10https://gerrit.wikimedia.org/r/105250 [20:18:08] hm, maybe that's not right [20:18:24] well, the port is the right thing to do, but I'm looking if that hash is used elsewhere [20:19:13] yes it is, in carbon.conf [20:20:47] (03CR) 10Faidon Liambotis: [C: 04-1] "This is wrong as it changes carbon's [relay] as well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/105250 (owner: 10Faidon Liambotis) [20:22:50] ori: 1) the above, correct port for carbonlink, 2) copying reqstats data from professor, 3) using apache's mod_expires to set some more sensible TTLs (the 120/600s seemed ok to me) [20:23:16] no, wait [20:23:23] paravoid: [20:23:35] i moved reqstats/ backed to archived/, just move it one dir up [20:23:38] it is freshly rsync'ed [20:23:55] and let's try it out with the correct config before changing apache [20:23:56] oh and 4) http://graphite.wikimedia.org/ fails right away [20:24:05] why archived? 
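[The Django 1.2 to 1.3 cache-setting shim quoted earlier, side by side. The memcached host and timeout below are placeholders: graphite-web (written against Django 1.2) builds the single-string CACHE_BACKEND form, and Django 1.3's compatibility code derives the newer CACHES dictionary from it only when CACHES itself is unset.]

```python
# Placeholder values; graphite-web's settings module builds the legacy form.
MEMCACHE_HOSTS = ['127.0.0.1:11211']
DEFAULT_CACHE_DURATION = 60  # seconds

# Django 1.2 style, as constructed by graphite-web:
CACHE_BACKEND = 'memcached://' + ';'.join(MEMCACHE_HOSTS) + (
    '/?timeout=%d' % DEFAULT_CACHE_DURATION)

# The Django >= 1.3 equivalent that the compatibility shim produces:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': MEMCACHE_HOSTS,
        'TIMEOUT': DEFAULT_CACHE_DURATION,
    }
}
```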
[20:24:11] the hierarchy is the same, we didn't change it [20:24:33] i just moved it out of the way at one point because i suspected there was something about it that was causing this [20:24:42] the mismatch between the declared aggregation configs and the files themselves [20:24:44] and I did change CARBONLINK_HOSTS by hand and it just works now [20:24:54] (puppet is disabled for now) [20:26:25] do you grok the 'tip' box in https://graphite.readthedocs.org/en/latest/config-carbon.html ? [20:26:28] i'm very confused by it [20:28:46] I think it means that if you had a cache listening on 2003 & 2004 [20:29:05] you can move 2003 & 2004 to the relay, create some new port numbers for your cache and then set them to the relay's destination [20:29:24] it's poorly phrased in any case [20:29:34] which is how this is configured [20:29:37] correct [20:29:44] graphite-web was misconfigured [20:29:55] to talk with the pickle port, instead of the cache query port [20:30:13] carbon is just fine I think [20:30:37] (my puppet commit above is wrong, as it changes both, hence my -1) [20:30:54] (03PS2) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105228 [20:32:41] (03PS3) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105228 [20:34:34] paravoid: yes, you are right; http://rcrowley.org/articles/federated-graphite.html confirms too [20:34:45] i'll change local_settings.py.erb to grab the correct port number [20:37:39] it's called "cache query" for a reason I thought [20:42:21] don't be mean; i was mislead by advice that the two settings should exactly match [20:42:36] oh sorry, I didn't intend to be mean [20:42:44] I'm explaining my thought process [20:42:58] that it was more intuition because of that, rather than something else [20:43:41] well, also 11:49 is it supposed to be trying to hit the pickle_receiver_port? 
[20:43:46] but i didn't follow it [20:43:48] hehe [20:43:52] happens! [20:46:46] i need a couple more minutes for the erb, thanks for the patience [20:48:45] (PS1) Faidon Liambotis: graphite: use both virt0/virt1000 for AuthLDAPURL [operations/puppet] - https://gerrit.wikimedia.org/r/105381 [20:49:04] (CR) Faidon Liambotis: [C: 2] graphite: use both virt0/virt1000 for AuthLDAPURL [operations/puppet] - https://gerrit.wikimedia.org/r/105381 (owner: Faidon Liambotis) [20:49:42] (CR) Faidon Liambotis: [V: 2] graphite: use both virt0/virt1000 for AuthLDAPURL [operations/puppet] - https://gerrit.wikimedia.org/r/105381 (owner: Faidon Liambotis) [20:51:47] ori: another very tiny optimization that you might consider making is switching to pylibmc [20:52:14] i.e. apt-get install python-pylibmc + switch from MEMCACHE_HOSTS to Django 1.3's CACHES with the pylibmc backend [20:53:02] probably isn't worth it considering the memcache traffic I'm seeing atm [20:57:01] (PS2) Ori.livneh: graphite-web: use the correct port for carbonlink [operations/puppet] - https://gerrit.wikimedia.org/r/105250 (owner: Faidon Liambotis) [20:58:13] i don't have a VM provisioned for testing that at the moment [20:58:59] but I think it's correct. the DESTINATIONS setting was right; it's only graphite-web's CARBONLINK_HOSTS that needs to be changed [20:59:17] yes [20:59:21] that's my understanding as well [20:59:27] and it should match the relay's DESTINATIONS in terms of the order of carbon cache instances [21:00:18] it's a bit unfortunate that we infer CARBONLINK_HOSTS from DESTINATIONS, but we don't infer DESTINATIONS from the set of [cache:*] declarations [21:00:42] it might call for a graphite::carbon_cache puppet resource [21:00:42] I... was about to say this..
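The pylibmc suggestion would look something like this in graphite-web's local_settings.py — a hypothetical fragment assuming Django >= 1.3, a local memcached, and the python-pylibmc package; none of this is the actual config:

```python
# local_settings.py (hypothetical sketch)
# Replaces the legacy MEMCACHE_HOSTS list with Django's CACHES setting,
# using the libmemcached-based pylibmc backend instead of python-memcache.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': '127.0.0.1:11211',
        'TIMEOUT': 60,  # seconds to keep rendered data in memcached
    }
}
```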
[21:01:05] or it might call for going in the other direction and making things repetitive but non-magical [21:01:19] i prefer the former, as you might have guessed :P [21:01:52] because my taste for erb magic has never, ever bitten me [21:02:33] anyways, I think that change is fine [21:02:36] ok if I merge? [21:03:04] yes [21:03:20] (PS1) Faidon Liambotis: Add carbon-relay & statsd service aliases [operations/dns] - https://gerrit.wikimedia.org/r/105383 [21:03:28] ^^^ what do you think of that? [21:03:44] and before you ask me why it's carbon-relay and not carbon, we have a server named carbon... [21:04:14] I suppose restoring "graphite" would be equally good, although technically the protocol is carbon, not graphite [21:04:20] but I won't bikeshed, whatever you want :) [21:04:36] but I want to git grep tungsten in puppet & mediawiki-config and use service hostnames instead [21:05:01] the problem with having a statsd service alias is that statsd instances have a one-size-fits-all configuration for how data should be aggregated, which summary metrics should be computed, and which backends they should be routed to [21:05:29] what do you mean? [21:05:29] i worked around it by exploiting the fact that the configuration file is loaded and evaluated to hack in a plugin-like system into the puppet module [21:05:46] I want to use it in all the places we've hardcoded tungsten [21:05:46] but it's still the case that we probably want multiple instances running [21:06:06] i think we need more than one statsd instance on the cluster is what i'm saying [21:06:08] ah, I see [21:06:31] for example: https://github.com/etsy/statsd/blob/master/exampleConfig.js [21:06:39] the 'histogram' config [21:06:52] ugh [21:07:34] the bins for memcached latency are not going to be the same for page load timing [21:07:35] etc.
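For reference, the 'histogram' setting in etsy's statsd selects bins by a match on the metric name, but the whole table lives in the one per-instance config file — which is the one-size-fits-all problem being described. An illustrative fragment in the style of exampleConfig.js (the bin values are invented):

```js
{
  histogram: [
    // first match on the metric name wins
    { metric: 'memcached', bins: [ 1, 5, 10, 50, 'inf' ] },         // ms
    { metric: 'pageload',  bins: [ 100, 500, 1000, 5000, 'inf' ] }, // ms
    { metric: '', bins: [] }  // '' matches everything else: no histogram
  ]
}
```

Two services with different latency profiles that report to the same statsd instance are stuck sharing this one table, which is the argument for running an instance per service.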
[21:09:20] i would like to make it easier to specify aggregation policies by metric regex, like graphite, and submit that upstream, but that'll be a while [21:09:32] is jenkins dead? https://gerrit.wikimedia.org/r/#/c/105250/ [21:09:38] (CR) Ori.livneh: [C: 2 V: 2] graphite-web: use the correct port for carbonlink [operations/puppet] - https://gerrit.wikimedia.org/r/105250 (owner: Faidon Liambotis) [21:11:21] anyways, let me publicly embarrass you for a moment and acknowledge that you've been awesome & extremely helpful with the graphite work in general and the past hour in particular [21:12:09] shush [21:12:31] your change is broken [21:12:34] :P [21:13:09] no :a :b :c :d [21:13:10] really? it seems to have made the right change [21:13:17] ugh [21:13:24] so that's another reason it was broken before, too [21:13:25] -CARBONLINK_HOSTS = ["127.0.0.1:7102:a", "127.0.0.1:7202:b", "127.0.0.1:7302:c", "127.0.0.1:7402:d"] [21:13:26] +CARBONLINK_HOSTS = ["127.0.0.1:2104:a", "127.0.0.1:2204:b", "127.0.0.1:2304:c", "127.0.0.1:2404:d"] [21:13:34] -CARBONLINK_HOSTS = ["127.0.0.1:2104:a", "127.0.0.1:2204:b", "127.0.0.1:2304:c", "127.0.0.1:2404:d"] [21:13:37] +CARBONLINK_HOSTS = ["127.0.0.1:7102", "127.0.0.1:7202", "127.0.0.1:7302", "127.0.0.1:7402"] [21:13:40] is what I got [21:13:44] you probably ran puppet before it was merged [21:13:54] before you ran puppet-merge I suppose, because I did that [21:14:00] oh, right, ugh [21:14:11] well, it's your change, so your mistake, as far as the historical record is concerned [21:14:17] right! [21:14:17] i'll now swoop in for the fix [21:16:26] so, shall we put reqstats under CDN.* or something? [21:16:55] why CDN?
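The reason the dropped ':a'..':d' suffixes matter: both carbon-relay and graphite-web place a metric on a cache by consistent-hashing the (host, instance) pair, ignoring the port, so the instance names in CARBONLINK_HOSTS must agree with the relay's DESTINATIONS even though the ports differ. A toy Python illustration — this is not Graphite's actual ConsistentHashRing, just the shape of the idea:

```python
import hashlib

def pick_node(metric, nodes):
    """Toy stand-in for a consistent-hash ring: choose a node for a metric
    from (host, port, instance) tuples, hashing only host and instance."""
    def score(node):
        host, _port, instance = node
        return hashlib.md5(f"{host}:{instance}:{metric}".encode()).hexdigest()
    return min(nodes, key=score)

# The relay sees pickle ports, graphite-web sees cache query ports,
# but with matching instance names they agree on which cache holds a metric:
relay_view = [("127.0.0.1", 2104, "a"), ("127.0.0.1", 2204, "b")]
web_view   = [("127.0.0.1", 7102, "a"), ("127.0.0.1", 7202, "b")]
assert pick_node("reqstats.5xx", relay_view)[2] == pick_node("reqstats.5xx", web_view)[2]
```

In the real ring, if graphite-web's list drops or renames the instances, it hashes metrics differently from the relay and can query a cache that never received the datapoints.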
[21:17:06] because gdash says "(cdn)" before the graphs [21:17:07] (PS1) Ori.livneh: graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 [21:17:19] and I didn't feel like inventing a name [21:18:00] but I'm okay with anything else, I'm just saying that now that it's all neat with the mediawiki hierarchy, maybe it makes sense to not put reqstats in root [21:18:05] under root I mean [21:18:52] wouldn't that make it "cache:a" instead of "a"? [21:19:01] oh wait, no [21:19:22] (PS5) BryanDavis: Kibana puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/104172 [21:19:50] (CR) Faidon Liambotis: "Can we name it "instance" instead of "name"? (and probably "server" instead of "host", too, to match what the Graphite source uses)." [operations/puppet] - https://gerrit.wikimedia.org/r/105385 (owner: Ori.livneh) [21:19:59] I think it's odd to put reqstats in root, yeah. I don't like 'CDN', though I don't have a better idea. Can we defer that decision? [21:20:15] sure, I don't mind [21:20:21] as long as these graphs start working again, though :) [21:20:30] so it's going to have to be root for now [21:20:37] (they're very useful to me) [21:21:29] (PS2) Ori.livneh: graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 [21:22:00] (CR) Faidon Liambotis: [C: 2] graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 (owner: Ori.livneh) [21:22:07] i'll run puppet [21:22:13] (CR) Faidon Liambotis: [V: 2] graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 (owner: Ori.livneh) [21:22:24] ok [21:27:27] well, things look a lot better [21:27:35] (CR) BryanDavis: "Antoine created https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/kibana.
Someone with Push rights in that project still" [operations/puppet] - https://gerrit.wikimedia.org/r/104172 (owner: BryanDavis) [21:28:05] reqstats will take a bit to look right [21:28:16] ? [21:28:34] I ran puppet on emery but haven't restarted udp2log yet, so I don't think the change has taken effect [21:28:43] ah, i see [21:28:44] but I'd like to restore professor's data first [21:28:52] i did [21:28:54] oh [21:29:04] oh, hm [21:29:21] it still says nan? [21:29:36] not for everything, the 1wk have data [21:29:43] i must not have rsynced properly [21:29:55] re-syncing [21:31:07] ok, i have to run to the office. i'm still at home and i'm interviewing someone in half an hour. [21:31:15] eek [21:31:17] rsync is running. paravoid, ok if i take off from your perspective? [21:31:25] ok [21:31:57] thanks again (x100) for the help [22:09:59] ori: I'm restarting rsync, it was wrong [22:11:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:17:01] (PS1) Faidon Liambotis: gdash dashboards: s/white/black/ [operations/puppet] - https://gerrit.wikimedia.org/r/105392 [22:17:16] (CR) Faidon Liambotis: [C: 2] gdash dashboards: s/white/black/ [operations/puppet] - https://gerrit.wikimedia.org/r/105392 (owner: 10Faidon Liambotis) [22:17:28] (CR) Faidon Liambotis: [V: 2] gdash dashboards: s/white/black/ [operations/puppet] - https://gerrit.wikimedia.org/r/105392 (owner: Faidon Liambotis) [22:42:22] ah, the colour wasn't intentional :) [23:28:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [23:30:30] I thought that was fixed? ^^