[00:05:42] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.007 second response time on port 11000
[00:08:25] heh
[00:08:30] why'd you do that
[00:09:17] interesting, it needs mysql
[00:09:59] off the top of anyone's heads, know of a good mysql (non server) class to include ?
[00:11:13] you probably want generic::mysql::packages::client
[00:11:37] ah cool - there's something inside mysql.pp as well but i'm unsure where exactly as there's about 50 levels of {}
[00:11:39] thanks
[00:12:11] hah
[00:12:15] it's in multiple places
[00:12:16] in the same file
[00:12:57] __site__::generic::mysql::packages::client
[00:13:02] eh.. that is from doc.wikimedia.org
[00:13:22] LeslieCarr: http://doc.wikimedia.org/puppet/classes/__site__/generic/mysql/packages/client.html
[00:13:41] there's also mysql::client
[00:13:42] hehe
[00:13:46] because we have to be confusing
[00:13:50] i'll use the generic one though
[00:14:00] maybe you want mariadb client anyways?:)
[00:14:40] don't worry, that's all going away soon :)
[00:14:47] jeez, didn't you read my email?
[00:15:02] heh,the one 6 minutes ago
[00:15:09] hehe
[00:15:14] the modules one ?
[00:15:16] yeah
[00:15:34] i still don't 100% grok modules
[00:16:35] same shit, different organizational strategy ;)
[00:16:43] hehe ok
[00:16:44] New patchset: Lcarr; "ganglios check requires mysql !" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43603
[00:16:54] we're moving to them because that's the direction that puppet is going
[00:26:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43603
[00:27:26] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 267 seconds
[00:28:12] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 312 seconds
[00:28:53] New patchset: Dzahn; "puppetize bugzilla_report.php, replace change I19d5da64: Clarify weekly report "top resolvers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[00:28:56] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 357 seconds
[00:29:06] go jenkins, come on now
[00:29:43] hopefully the cronspam will stop soon
[00:30:24] notpeter: can i PM?
[00:30:28] LeslieCarr: cool! thanks
[00:30:31] sure
[00:30:36] New patchset: Ryan Lane; "Assign the status variable before accessing it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43605
[00:30:54] I knew I should have stopped working last night
[00:31:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43605
[00:32:50] AaronSchulz: ceph itself not too bad
[00:32:55] AaronSchulz: Dells on the other hand...
[00:35:13] New patchset: Lcarr; "increasing concurrent check limit in icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43606 [00:36:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43606 [00:39:28] never again will I work that tired [00:39:36] New patchset: Ryan Lane; "Fix repo location for clones" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43607 [00:40:06] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43607 [00:41:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899 [00:45:21] csteipp: so, I've fixed it [01:02:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:02:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:06:54] PROBLEM - MySQL disk space on neon is CRITICAL: DISK CRITICAL - free space: / 354 MB (3% inode=71%): [01:08:14] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay 0 seconds [01:11:24] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 189 seconds [01:11:24] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 189 seconds [01:13:09] hey those alerts are supposed to be disabled [01:13:27] binasher: might not be stickied [01:13:29] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 315 seconds [01:14:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds [01:20:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [01:20:53] !log added mysql grants for eqiad to all shards [01:21:05] Logged the message, Master [02:12:53] New patchset: Dzahn; "sudo for strace/tcpdump for demon in role appservers (RT-4066)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42791 [02:15:58] New review: Dzahn; "Jeff confirmed. 
Chris will wipe disks and remove controller card." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42908 [02:15:58] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42908 [02:17:55] bye, have a nice weekend [02:23:23] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 114883 MB (3% inode=99%): [02:24:08] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [02:24:09] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [02:24:09] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [02:24:09] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [02:24:09] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:28:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:22] !log LocalisationUpdate completed (1.21wmf7) at Sat Jan 12 02:28:21 UTC 2013 [02:28:31] Logged the message, Master [02:30:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [02:38:05] PROBLEM - Puppet freshness on sq45 is CRITICAL: Puppet has not run in the last 10 hours [02:47:33] RECOVERY - SSH on ms1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:53:40] !log LocalisationUpdate completed (1.21wmf6) at Sat Jan 12 02:53:39 UTC 2013 [02:53:49] Logged the message, Master [02:58:44] New patchset: Asher; "splitting up the per-shard host list by datacenter to aid generation of mha configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43612 [02:59:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43612 [03:01:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:30] New 
patchset: Asher; "white space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43613 [03:02:59] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43613 [03:17:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.071 seconds [03:26:41] RECOVERY - MySQL disk space on db78 is OK: DISK OK [03:49:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [04:37:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:38:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.483 seconds [05:25:54] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [05:33:46] New review: Dzahn; "see comment on ps1 and per talk with robla there will likely be requests for more users to be added ..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42791 [06:12:33] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [06:26:21] RECOVERY - MySQL disk space on neon is OK: DISK OK [08:43:09] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds [08:43:37] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 208 seconds [08:52:28] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 197 seconds [09:02:21] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:03:44] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:12:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 188 seconds [09:13:56] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 241 seconds [09:14:40] I'm getting an error at https://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_U.S._Supreme_Court_cases/Members&action=submit [09:14:50] > [09:14:51] Request: POST http://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_U.S._Supreme_Court_cases/Members&action=submit, from 10.64.0.141 via cp1013.eqiad.wmnet (squid/2.7.STABLE9) to () [09:15:00] Error: ERR_CANNOT_FORWARD, errno [No Error] at Sat, 12 Jan 2013 09:14:28 GMT [09:15:03] > [09:16:20] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:21] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:21] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:21] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:21] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:21] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:29] 
PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:30] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:39] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:46] Hmm. LeslieCarr? [09:16:48] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:48] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:48] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:48] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:48] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:48] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:48] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:49] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:49] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:50] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:50] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:50] I'm not sure anyone is alive right now... 
[09:16:51] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on srv191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:57] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:58] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:58] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:59] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:59] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:00] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:00] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:01] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:01] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:01] They will be soon with all of these pages ... 
[09:17:02] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:06] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:14] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:15] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:16] Yeah. [09:17:24] PROBLEM - Apache HTTP on srv262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:24] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:24] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:24] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:24] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:24] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:24] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:25] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:25] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:26] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:26] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:27] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:27] PROBLEM - Apache HTTP on srv278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:32] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:33] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:33] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:33] 
PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:33] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:41] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on srv265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:51] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:52] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:52] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:53] PROBLEM - Apache HTTP on srv202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:53] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:54] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:58] Is there a way to send a direct page? 
[09:18:00] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:00] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:00] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:00] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:00] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:00] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:00] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:01] PROBLEM - Apache HTTP on srv215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:01] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:02] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:02] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:03] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:09] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:10] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:10] PROBLEM 
- Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:11] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:11] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:12] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:15] well, basically that would just be calling CT/them directly [09:18:17] PROBLEM - Frontend Squid HTTP on cp1014 is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [09:18:18] PROBLEM - Frontend Squid HTTP on cp1003 is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [09:18:18] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:18] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:18] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:18] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:18] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:19] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:19] PROBLEM - Apache HTTP on srv192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:20] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:20] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:21] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:21] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:22] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:22] PROBLEM - Apache HTTP on srv214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:24] PROBLEM - Apache HTTP on srv201 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:24] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:24] PROBLEM - Apache HTTP on srv227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:24] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:25] PROBLEM - Apache HTTP on srv251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:25] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:26] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:26] PROBLEM - Apache HTTP on srv255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:27] PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:27] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:28] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:28] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:29] You might be able to do it through Nagios… not sure [09:18:35] Someone may need to. 
[09:18:36] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:36] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:36] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:36] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:36] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:36] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:36] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:37] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:45] PROBLEM - Apache HTTP on srv254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:45] PROBLEM - Apache HTTP on srv218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:45] PROBLEM - Apache HTTP on srv250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:45] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:45] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:47] yeah … if this lasts too much longer I'll dig something up [09:18:53] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:54] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:56] mark: You don't happen to be around? 
[09:19:03] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:03] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:03] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:11] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:12] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:12] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:12] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:12] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:20] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:21] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:39] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:48] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:57] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:05] RECOVERY - Frontend Squid HTTP on cp1003 is OK: HTTP OK HTTP/1.0 200 OK - 1394 bytes in 0.056 seconds [09:20:15] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:33] PROBLEM - Apache HTTP on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:33] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:41] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:21:54] PROBLEM - Frontend Squid HTTP on cp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:11] PROBLEM - Frontend Squid HTTP on cp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[09:23:15] PROBLEM - Backend Squid HTTP on cp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:15] PROBLEM - Backend Squid HTTP on cp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:24] PROBLEM - Backend Squid HTTP on cp1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:33] PROBLEM - Backend Squid HTTP on sq36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:41] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:51] PROBLEM - Frontend Squid HTTP on cp1005 is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [09:23:51] PROBLEM - Backend Squid HTTP on cp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:24:00] PROBLEM - Backend Squid HTTP on cp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:24:35] PROBLEM - Backend Squid HTTP on sq77 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:25:03] RECOVERY - Backend Squid HTTP on cp1016 is OK: HTTP OK HTTP/1.0 200 OK - 1259 bytes in 0.054 seconds [09:25:38] \o [09:25:48] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [09:25:54] Meta is unavailable. [09:25:56] PROBLEM - Backend Squid HTTP on cp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:03] Pirx_: Someone has just arrived. :-) [09:26:07] is anyone handling it? [09:26:13] yes. [09:26:14] RECOVERY - Frontend Squid HTTP on cp1014 is OK: HTTP OK HTTP/1.0 200 OK - 1392 bytes in 0.054 seconds [09:26:23] PROBLEM - Backend Squid HTTP on sq73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:24] PROBLEM - Backend Squid HTTP on sq74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:45] yes [09:26:47] okay, thank you. [09:26:58] (also pl.wikipedia.org) [09:27:08] PROBLEM - Backend Squid HTTP on cp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:11] All Wikimedia wikis, I believe. 
[09:27:17] PROBLEM - Frontend Squid HTTP on cp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:37] PROBLEM - Frontend Squid HTTP on amssq39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:44] RECOVERY - Backend Squid HTTP on cp1018 is OK: HTTP OK HTTP/1.0 200 OK - 1259 bytes in 0.054 seconds [09:27:45] PROBLEM - Backend Squid HTTP on amssq35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:47] New patchset: Asher; "pulling parsercache" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43618 [09:28:02] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out [09:28:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43618 [09:28:29] PROBLEM - Backend Squid HTTP on sq72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:56] PROBLEM - Frontend Squid HTTP on amssq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:05] RECOVERY - Frontend Squid HTTP on cp1007 is OK: HTTP OK HTTP/1.0 200 OK - 1394 bytes in 0.054 seconds [09:29:15] RECOVERY - Frontend Squid HTTP on amssq39 is OK: HTTP OK HTTP/1.0 200 OK - 1585 bytes in 0.221 seconds [09:29:23] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:33] RECOVERY - Backend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 1260 bytes in 0.054 seconds [09:29:40] !log asher synchronized wmf-config/CommonSettings.php 'pulling parsercache' [09:30:08] PROBLEM - Backend Squid HTTP on sq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:09] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:26] PROBLEM - Backend Squid HTTP on sq76 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:36] RECOVERY - Frontend Squid HTTP on cp1006 is OK: HTTP OK HTTP/1.0 200 OK 
- 1395 bytes in 0.056 seconds [09:30:45] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2584 bytes in 4.957 seconds [09:30:45] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.168 second response time [09:30:53] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.789 second response time [09:31:02] PROBLEM - Backend Squid HTTP on amssq46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:03] PROBLEM - Backend Squid HTTP on amssq43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:11] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [09:31:12] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [09:31:12] RECOVERY - Apache HTTP on srv255 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [09:31:12] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [09:31:12] PROBLEM - LVS HTTP IPv4 on wikidata-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:20] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [09:31:21] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [09:31:21] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [09:31:21] RECOVERY - Frontend Squid HTTP on cp1013 is OK: HTTP OK HTTP/1.0 200 OK - 1384 bytes in 6.705 seconds [09:31:21] RECOVERY - Backend Squid HTTP on cp1020 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 7.680 seconds [09:31:21] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.474 second response time [09:31:22] binasher: No morebots in here, BTW. 
You'll have to log that manually. [09:31:25] Or I can. [09:31:28] I guess you're busy, heh. [09:31:30] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [09:31:30] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.828 second response time [09:31:30] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.081 second response time [09:31:30] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.876 second response time [09:31:30] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.344 second response time [09:31:30] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.914 second response time [09:31:30] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.235 second response time [09:31:31] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.164 second response time [09:31:38] RECOVERY - Backend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 9.109 seconds [09:31:39] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [09:31:39] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [09:31:39] RECOVERY - Apache HTTP on srv192 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [09:31:39] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [09:31:39] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [09:31:39] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [09:31:39] oh, you're right [09:31:40] RECOVERY - Apache HTTP on srv282 is OK: HTTP 
OK - HTTP/1.1 301 Moved Permanently - 0.219 second response time [09:31:40] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [09:31:41] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.199 second response time [09:31:41] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.043 second response time [09:31:42] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.871 second response time [09:31:48] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.068 second response time [09:31:48] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [09:31:48] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.068 second response time [09:31:48] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.069 second response time [09:31:48] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [09:31:48] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [09:31:48] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [09:31:49] Susan: did the bots die too? [09:31:53] NFI what my wikitech password is. [09:31:54] hey [09:32:02] o.O I thought nagios was flood protected [09:32:04] I think morebots has its own issue. [09:32:12] Unrelated to this outage. [09:32:40] * Matthew_ doesn't have a wikitech password, so he can laugh at Susan :P [09:33:06] what's going on? [09:33:16] paravoid: Heh, they were recovering so fast the feed died :) [09:33:18] paravoid: All Wikimedia wikis became inaccessible via HTTP and HTTPS. 
[09:33:46] paravoid: Asher disabled parser cache in https://gerrit.wikimedia.org/r/43618 and synced to the site. [09:33:50] Seems better now. [09:33:58] ganglia's down? [09:34:01] (binasher: ty, btw) [09:34:17] http://nagios.wikimedia.org/nagios/cgi-bin/notifications.cgi?contact=all looks much better now. [09:34:23] paravoid: it is, something about security issues. [09:34:31] Oh, the first half, at least. [09:35:26] now it works [09:35:39] yes, works for me. Thanks! [09:36:04] hrm, so mediawiki is able to handle outright write failures to the pcache with no problem, but it has no timeout if writes hang [09:36:12] Thank you :) [09:36:46] i updated the server admin log [09:36:56] I've managed to lock myself out of the wikitech wiki. [09:36:57] ori-l: ty [09:38:14] * Matthew_ is glad OTRS didn't refresh during that time :P [09:38:43] binasher: so, you're completely on top of this? need anything? [09:39:07] https://bugzilla.wikimedia.org/show_bug.cgi?id=43897 "morebots missing from #wikimedia-operations" [09:39:31] wow, innodb bug [09:39:34] --Thread 139904037881600 has waited at btr/btr0cur.c line 482 for 44.00 seconds the semaphore: [09:39:34] S-lock on RW-latch at 0x2fe9178 created in file dict/dict0dict.c line 1637 [09:39:36] a writer (thread id 139919087257344) has reserved it in mode exclusive [09:39:37] number of readers 0, waiters flag 1, lock_word: 0 [09:39:38] Last time read locked in file btr/btr0cur.c line 482 [09:39:39] Last time write locked in file btr/btr0cur.c line 475 [09:39:56] paravoid: everything should be ok now, and it should be ok to leave the db parsercache off until tomorrow [09:40:11] cool, thanks [09:43:22] http://xkcd.com/903/ :-) [09:53:03] hehe [09:53:13] ok, sleep! 
[18:31:08] if anyone checks in, I didn't even look, just did a mass shooting of converts on the image scalers first thing, recovery was instant [18:31:26] (that's in response to the rendering pmtpa page) [18:31:30] * apergos is gone again [19:33:17] New patchset: Krinkle; "(bug 43908) Enable wgUseRCPatrol on enwikivoyage." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43624 [19:35:16] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43624 [19:38:17] !log krinkle synchronized wmf-config/InitialiseSettings.php 'Enable wgUseRCPatrol on enwikivoyage (I36dbda43b2f)' [20:41:02] New patchset: Ori.livneh; "$wgClickTrackingLog => /dev/null" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43669 [20:41:54] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43669