[00:00:15] RECOVERY - check_mysql on payments4 is OK: Uptime: 2948776 Threads: 1 Questions: 34029 Slow queries: 2 Opens: 420 Flush tables: 1 Open tables: 64 Queries per second avg: 0.11 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[00:00:16] mutante: so, same behavior as before. Does it look like a networking issue to you?
[00:01:19] dr0ptp4kt: ok, i'm off,gnight
[00:01:46] andrewbogott: eth0/eth1 are link state up, eth2/eth3 are down
[00:02:01] PROBLEM - LDAPS on virt0 is CRITICAL: Connection timed out
[00:02:06] andrewbogott: did it use eth2 and 3? then it's the cable
[00:02:15] otherwise, yea, looks like networking issue
[00:02:27] mutante, I don't know much, just judging by the packet loss.
[00:02:42] LeslieCarr: can you help us out here?
[00:03:32] * andrewbogott surprised this isn't paging
[00:03:41] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection timed out
[00:03:41] andrewbogott: from fenari, 94% packet loss , ack
[00:03:46] but not 100%
[00:03:50] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out
[00:03:50] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out
[00:04:00] mutante: andrewbogott from the office, 93%
[00:04:05] mutante, yep! Very serious.
[00:04:39] mutante, is leslie within shouting range?
[00:04:49] andrewbogott: i don't know
[00:04:54] not within mine
[00:05:00] Oh, are you in DE still?
[00:05:10] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 1.080 second response time
[00:05:19] PROBLEM - SSH on virt0 is CRITICAL: Connection timed out
[00:05:29] rfarrand: Can you see Leslie?
[00:05:47] andrewbogott: no, but both not at office
[00:05:59] RECOVERY - LDAPS on virt0 is OK: TCP OK - 3.044 second response time on port 636
[00:06:19] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 5.133 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[00:06:29] RECOVERY - LDAP on virt0 is OK: TCP OK - 3.043 second response time on port 389
[00:06:39] Reedy: nope
[00:06:56] need her?
[00:07:21] rfarrand: We do! I just texted her...
[00:08:10] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[00:08:27] so virt0 uses bonded interface, two ports on switch
[00:08:32] daniel is confirming on server
[00:08:59] seems its just two eth0 and eth1
[00:09:00] andrewbogott, Reedy: She is on vacation tomorrow and the following do so it would be best to catch her ASAP. If I see her I will let her know
[00:09:10] i'll call her.
[00:09:13] **do = day
[00:09:19] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[00:09:49] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection timed out
[00:09:50] robla: hey
[00:10:19] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out
[00:10:29] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out
[00:10:33] cmjohnson1: sorry, nevermind....I hadn't realized the wikitech thing was already being worked on, but saw you were on RT duty.
[00:10:56] cmjohnson1: I believe we are still mulling paging springle, though
[00:11:07] the wikitech thing is due to virt0 being down
[00:11:09] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.082 second response time
[00:11:15] but it may be recovering...
[00:11:18] oh...okay..springle should be on soon..what happened to virt0?
[00:11:19] PROBLEM - SSH on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:11:30] dunno, mutante rebooted but dunno before that
[00:11:40] (crash i imagine)
[00:11:44] RobH: I don't think it's recovering -- very high packet loss which means that occasionally a test succeeds.
[00:11:50] yea, prolly not
[00:11:54] It had the exact same symptoms before the reboot.
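The remark that very high packet loss lets a check succeed now and then can be made concrete with a little arithmetic. The following is a minimal sketch, not anything run during the incident: the function names are made up for illustration, and it assumes each packet is lost independently, which real congestion rarely satisfies.

```python
def loss_percent(sent: int, received: int) -> float:
    """Packet loss as a percentage, the way ping summarises it."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def chance_all_delivered(loss_pct: float, packets: int) -> float:
    """Chance that `packets` consecutive packets all get through.

    Assumption (not from the log): every packet is dropped
    independently with the same probability.
    """
    return (1.0 - loss_pct / 100.0) ** packets


if __name__ == "__main__":
    # 6 replies out of 100 probes is the 94% loss reported from fenari.
    print(loss_percent(100, 6))
    # A single-packet probe still gets through a few times in a hundred,
    # while anything needing several packets in a row almost never does.
    print(chance_all_delivered(94.0, 1))
    print(chance_all_delivered(94.0, 3))
```

Under that assumption, short-lived RECOVERY notices during an ongoing outage are expected rather than a sign of real recovery.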
[00:11:59] yea, eww
[00:12:02] yea
[00:12:08] so i texted leslie, im giving her a few minutes before i call
[00:12:19] do i need to go down there?
[00:12:19] (i cannot speak to other folks texting her, so I just disregard it happened ;)
[00:12:25] cmjohnson1: you are in tampa?
[00:12:29] yes
[00:12:39] here is the thing
[00:12:45] what exactly do i tell you to do when you are there?
[00:12:51] wait, now i even get errors from puppetmaster?
[00:12:53] its two cables, it shouldnt freak out that badly
[00:13:03] robla: paging me?
[00:13:06] so i have concerns it isnt physical layer
[00:13:16] springle!
[00:13:18] springle: there are two ongoing issues, one with wikitech (not paged you for)
[00:13:18] springle: You didn't receive a pager in your welcome package?
[00:13:26] just so you realize confusion in backlog ;]
[00:13:28] TimStarling: ^^
[00:13:37] robla: See #wikimedia-tech
[00:13:49] You're about 6 minutes behind ;)
[00:14:19] ah, there he is. I see the single "looking" now
[00:14:23] * RobH tries to use phone as phone, watches it crash
[00:14:27] technology.
[00:14:36] ops can have this channel, they like shouting over bots
[00:14:43] RobH: Have you tried turning it off and on again?
[00:14:54] I'll talk about bug 56577 in #wikimedia-tech instead
[00:15:19] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[00:15:29] RECOVERY - LDAP on virt0 is OK: TCP OK - 3.043 second response time on port 389
[00:16:06] ^^ lies
[00:16:09] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.174 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[00:16:19] ok, interesting
[00:16:27] the network display for the secondary port on virt0 shows no traffic
[00:16:36] but i dunno this bonded interface stuff
[00:16:41] so i dunno if its ok.
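For readers who likewise don't know the bonded interface stuff: on Linux the bonding driver reports per-member link state in /proc/net/bonding/<bond>. Below is a minimal sketch of reading that report. The bond name bond0 and the sample text are assumptions for illustration, not output captured from virt0. It also only shows link state, and in this incident both eth0 and eth1 reported link up, so traffic counters would still be needed to spot a member carrying no traffic.

```python
from typing import Dict

# Illustrative sample in the bonding driver's report format.
# This is NOT real output from virt0.
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps

Slave Interface: eth1
MII Status: down
Speed: Unknown
"""


def parse_bond_slaves(text: str) -> Dict[str, str]:
    """Map each member interface to its reported MII link status."""
    slaves: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # A status seen before any member belongs to the bond itself
            # and is skipped by the `current is not None` guard.
            slaves[current] = value
    return slaves


if __name__ == "__main__":
    # On a real host (path is an assumption about the bond's name):
    #     text = open("/proc/net/bonding/bond0").read()
    print(parse_bond_slaves(SAMPLE))
```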
[00:17:29] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: CRITICAL - Cannot make SSL connection
[00:18:07] And… suddenly packet loss is down to 0%
[00:18:10] Did someone do something?
[00:18:11] Robh?
[00:18:13] andrewbogott: back to 0%
[00:18:16] was about to say
[00:18:19] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.157 second response time
[00:18:21] and puppet runs on it now
[00:18:38] it's back.. wtf
[00:18:43] wikitech also up
[00:18:45] hiccup
[00:18:57] so yea
[00:19:00] eth1 is now showing traffic
[00:19:02] i did nothing.
[00:19:07] Elsie: wikitech is back
[00:19:08] other than watch network side output on switch
[00:19:17] so no goddamned clue what happened.
[00:19:23] It was a long hiccup! 40 minutes.
[00:19:36] its not something we can say 'ok its fixed'
[00:19:37] andrewbogott: puppet is now re-adding iptables rules on virt0
[00:19:38] ...
[00:19:53] we shall see if it kills it ;]
[00:19:57] mutante: Thanks. Can you test morebots?
[00:20:05] !log Test
[00:20:11] mutante, elsie, I'll kick morebots.
[00:20:21] Logged the message, Master
[00:20:31] I think it's connected... but perhaps unresp.. there it goes. :-)
[00:20:38] So something was wrong with its eth1 interface in that entire outage
[00:20:47] eth0 was showing network traffic on switch but eth1 was not
[00:20:51] https://wikitech.wikimedia.org/wiki/SAL # Whee.
[00:20:52] i have no idea what it means ;]
[00:21:11] RobH: Can you add notes to https://bugzilla.wikimedia.org/show_bug.cgi?id=56594 please? :-)
[00:21:58] that ticket is so narrow scope
[00:22:14] virt0 had issue, so it affected a LOT of things, but i guess its a start
[00:22:25] morebots, are you ok?
[00:22:25] I am a logbot running on tools-login.
[00:22:25] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[00:22:25] To log a message, type !log .
[00:22:37] hm
[00:22:37] (CR) jenkins-bot: [V: -1] Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93622 (owner: Chad)
[00:22:39] PROBLEM - MySQL Slave Running on db47 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da
[00:22:39] PROBLEM - MySQL Slave Running on db54 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in field list on query. Default data
[00:23:41] RobH, mutante, I will send an outage email
[00:24:40] i updated ticket with my observation
[00:24:59] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 311 seconds
[00:25:16] Thanks. :-)
[00:25:39] RECOVERY - MySQL Slave Running on db47 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:25:43] we all know that the true cause of this problem was Ryan getting onto an airplane.
[00:26:00] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds
[00:26:19] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 340 seconds
[00:26:29] <^d> andrewbogott: First ferries. Now airplanes. Basically none of us should be allowed to travel, ever.
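The "MySQL Slave Running" alerts above pack three fields into one status line: whether the IO thread runs, whether the SQL thread runs, and the last replication error. Here is a minimal sketch of pulling those apart. The function name and return shape are invented for illustration, and the pattern is inferred only from the alert text in this log, not from the check's source.

```python
import re
from typing import Dict, Optional, Union

# Pattern inferred from the alert lines in this log (an assumption).
_STATUS_RE = re.compile(
    r"Slave_IO_Running:\s*(\w+)\s+"
    r"Slave_SQL_Running:\s*(\w+)\s+"
    r"Last_Error:\s*(.*)"
)


def parse_slave_status(line: str) -> Optional[Dict[str, Union[bool, str]]]:
    """Extract replication thread state from one alert line."""
    match = _STATUS_RE.search(line)
    if match is None:
        return None
    io_running = match.group(1) == "Yes"
    sql_running = match.group(2) == "Yes"
    return {
        "io_running": io_running,
        "sql_running": sql_running,
        "last_error": match.group(3).strip(),
        # Replication is only healthy when both threads are running.
        "healthy": io_running and sql_running,
    }


if __name__ == "__main__":
    alert = (
        "CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No "
        "Last_Error: Error Unknown column ar_id in where clause on query. "
        "Default da"
    )
    print(parse_slave_status(alert))
```

In the alerts here the IO thread stays up while the SQL thread stops: the replica still receives events from the master but fails to apply one of them.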
[00:26:31] (PS2) RobH: new bugzilla.wikimedia.org cert [operations/puppet] - https://gerrit.wikimedia.org/r/93621
[00:26:39] RECOVERY - MySQL Slave Running on db54 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:27:59] PROBLEM - MySQL Slave Delay on db54 is CRITICAL: CRIT replication delay 394 seconds
[00:28:29] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 386 seconds
[00:29:10] PROBLEM - NTP on virt0 is CRITICAL: NTP CRITICAL: Offset unknown
[00:29:59] RECOVERY - MySQL Slave Delay on db54 is OK: OK replication delay 0 seconds
[00:30:00] *yawn*
[00:30:19] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds
[00:30:29] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds
[00:30:50] you missed all the action, apergos
[00:30:59] no I didn't
[00:33:34] apergos: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=search-pool :)
[00:33:50] are they gone?
[00:34:12] hm they are 'ok', magic!
[00:34:26] oh those are the eqiads
[00:34:26] because they are gone :)
[00:34:30] yes
[00:34:31] the rest are gone, of course!
[00:34:39] excellent!
[00:34:57] * YuviPanda screams ALL ULS' FAULT! and then runs away
[00:35:09] RECOVERY - NTP on virt0 is OK: NTP OK: Offset 0.001705765724 secs
[00:38:30] brownbag talk on govt surveillance and privacy tools starting on 6 floor in 20 minutes
[00:39:09] Here's the hangout link: https://plus.google.com/hangouts/_/72cpj4nea87b6deasmknhlqmb8
[00:39:20] kaldari: i was about to ask for that link, thanks:
[00:39:25] who's giving it?
[00:39:33] Yan Zhu
[00:39:43] Blogger/hacktivist
[00:40:23] (CR) Chad: "Tested on arsenic, pretty much works :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93622 (owner: Chad)
[00:44:17] (CR) Dzahn: [C: 2] "Issuer: C=US, O=GeoTrust, Inc., CN=RapidSSL CA" [operations/puppet] - https://gerrit.wikimedia.org/r/93621 (owner: RobH)
[00:46:34] !log running master/slave pt-table-sync on archive tables with 1M+ rows
[00:46:51] Logged the message, Master
[00:47:06] YuviPanda: ok, so… now I can go back to thinking about whatever it was I was...
[00:47:12] …thinking about before the… outage.
[00:47:21] andrewbogott: until something else dies...
[00:47:31] :P
[00:47:34] Did you see something obvious going wrong with the unicorn?
[00:47:44] no, i see that it's 500ing
[00:47:48] so perhaps a path issue?
[00:47:51] andrewbogott: where's the code?
[00:47:54] like, file path?
[00:48:17] It's a service, controlled by /etc/init/dynamicproxy-api.conf
[00:48:28] using uwsgi
[00:49:15] the actual python code is in /usr/lib/python2.7/dist-packages/invisible_unicorn
[00:49:28] andrewbogott: where's the sqlite file?
[00:49:50] /etc/dynamicproxy-api/
[00:50:35] andrewbogott: hmm, file permissions seem ok
[00:50:55] Yeah, I changed them to a+w for a bit just to see, didn't help.
[00:51:52] andrewbogott: okay, I can perform write operations on it manually, via the sqlite3 command
[00:52:17] so maybe uwsgi is running with yet another user, somehow?
[00:52:22] s/with/as/
[00:52:36] mutante, apergos: Also there's a youtube broadcast link if you just want to watch:
[00:52:37] http://youtu.be/KBpau4gGvfQ
[00:52:55] nice, clicking:)
[00:53:11] * apergos will come on late tomorrow and check out early
[00:53:14] yeah why not
[00:54:03] andrewbogott: try hitting the API again?
[00:54:23] YuviPanda: just did a GET, it looks to have failed.
[00:54:36] I see 400s
[00:56:21] andrewbogott: try again?
[00:57:10] hm 400 again
[00:57:18] This was recently working, dammit
[00:57:29] andrewbogott: try doing a PUT?
[00:57:54] put
[00:57:57] 500 it says
[00:58:02] FUCKING SHOW ME THE ERROR LOGS DAMMIT
[00:58:02] grr
[00:58:06] stupid stupid thing
[00:58:09] (PS2) Dzahn: use new SSL certs on Bugzilla [operations/puppet] - https://gerrit.wikimedia.org/r/93620
[00:58:38] YuviPanda: I thought you might know the secret place where the error logs are hidden
[00:58:54] Supposedly stderr and stdout are getting written to /var/log/uwsgi/app/invisible.log
[00:58:57] so the uwsgi people tell me
[00:59:11] (CR) Dzahn: [C: 2] use new SSL certs on Bugzilla [operations/puppet] - https://gerrit.wikimedia.org/r/93620 (owner: Dzahn)
[00:59:39] apparently not
[01:01:18] andrewbogott: err, does this have the proxy puppet class applied to it?
[01:01:18] invisible-unicorn requires that
[01:01:18] since it requires a redis instance setup locally
[01:01:18] that might be it
[01:01:18] i see a local redis instance
[01:01:34] but the nginx is still really old, and not the db
[01:01:34] *deb
[01:01:50] andrewbogott: that might be an issue?
[01:01:54] hm, ok...
[01:02:00] it should be, but maybe puppet is busted
[01:02:03] andrewbogott: RobH so at 23:42 UTC i see the port being saturated
[01:02:03] and deploying the wrong nginx
[01:02:27] !log replaced SSL certs on Bugzilla with new ones that are not star.wm
[01:02:43] LeslieCarr: hrmm
[01:02:45] Logged the message, Master
[01:02:49] the main port
[01:02:56] so what triggered saturation
[01:02:56] eth0?
[01:02:56] andrewbogott: tbh my interest in the entire thing has waned ever since the 'we need to debianize everything!' conversations, since I still don't understand what the advantage is, especially since we're just getting around it by making our own packages
[01:03:02] https://observium.wikimedia.org/device/device=12/tab=port/port=1299/
[01:03:03] ori-l: ^ now it's actually fixed, with 2 new certs . fyi
[01:03:11] i dunno, it looks like 99% of the time, there's 0 traffic
[01:03:20] yea
[01:03:22] then bam! sudden huge traffic spike
[01:03:24] i saw packet loss of 94% from fenari
[01:03:27] then bam, nada
[01:03:31] so.... there's our culprit....
[01:03:36] hrmm.
[01:03:37] then back to 0% out of nowhere
[01:03:53] andrewbogott: hmm, try again now?
[01:04:02] !log virt0 port saturated at 23:42 (recordkeeping for outage investigation)
[01:04:19] Logged the message, RobH
[01:04:23] YuviPanda: I'm asking puppet to install the latest nginx-extras, did you already upgrade that?
[01:05:00] YuviPanda: and, oops, I just clobbered your --catch-exceptions
[01:05:09] Still want me to try again?
[01:07:31] RobH: resolved the cert stuff. it's using the 2 new certs now and puppet runs and it's happy
[01:07:47] so no more star.wm
[01:08:18] ok, the talk on youtube started
[01:08:48] yay
[01:08:53] wait youtube?
[01:09:11] RobH: 16:56 < kaldari> mutante, apergos: Also there's a youtube broadcast link if you just want to watch:
[01:09:25] http://youtu.be/KBpau4gGvfQ
[01:10:29] andrewbogott: yeah
[01:10:45] ok. first a get...
[01:10:48] so, did anyone happen to do a tcpdump on virt0 ?
[01:10:52] just in case ?
[01:11:03] now a put
[01:11:14] LeslieCarr: afraid not, just traceroute to it from elsewhere
[01:11:18] ok
[01:11:32] YuviPanda, done
[01:11:42] andrewbogott: yeah, 400 and 500.
[01:11:47] no idea why :|
[01:11:50] let me do something else
[01:12:05] oh this is interesting!
[01:12:16] we got a 4gig bump of traffic going *into* our network
[01:12:18] YuviPanda: possibly uwsgi is mangling the requests
[01:12:23] its on my tv now
[01:12:24] heh
[01:12:25] during this time
[01:12:26] andrewbogott: can you hit port 5000?
[01:12:39] same machine
[01:13:00] YuviPanda: I can do a wget...
[01:13:09] Changing the gui to hit a different port is kind of a pain
[01:13:14] yeah that should enough
[01:13:16] a curl even
[01:13:26] mutante: what cant xmbc do?
[01:13:40] RobH: stream youtube?
[01:13:43] well, it can do that
[01:13:44] it is now
[01:13:47] it cannot do hangouts...
[01:14:01] heh:) makes sense, yea, nice to have the youtube
[01:14:04] YuviPanda: ok, just did a get on 5000
[01:14:07] oh, different/wrong url though
[01:14:22] You see that I have nginx configured to detect /dynamicproxy-api/whatever?
[01:14:32] andrewbogott: i got the GET, and responded with 400
[01:14:33] * YuviPanda is confused
[01:14:35] * YuviPanda checks
[01:14:46] LeslieCarr: does that mean we were DOSed?
[01:14:59] andrewbogott: it's supposed to return a 400 if that project does not exist
[01:15:02] andrewbogott: try a PUT first?
[01:15:10] YuviPanda: I can't easily do that to port 5000
[01:15:13] unless you know the curl syntax
[01:15:24] ugh, I had it in bash history somewhere
[01:15:28] burrrr
[01:16:02] yeah
[01:16:30] andrewbogott: curl -X PUT localhost:5000/v1/test/mapping -d '{"domain": "cvne.wmflabs.org", "backends": ["http://cvn-apache2.pmtpa.wmflabs"]}'
[01:16:35] YuviPanda: There's some url rewriting that should happen, too… like, I should be able to wget http://proxy-abogott-8/dynamicproxy-api/v1/project-proxy/mapping and unicorn should just get /v1/project-proxy/mapping
[01:16:37] but with IP instead of localhost?
[01:16:53] andrewbogott: right, but I'm running a debug version with the debug webserver on port 5000 now
[01:16:58] andrewbogott: so if there's an error, we know where it is
[01:17:06] andrewbogott: if there isn't, we know that it's the URL mangling happening somewhere
[01:17:10] ok, just did the put, got a 400
[01:17:24] Is the debug server not seeing the db file?
[01:17:42] * apergos raises hand (for talk)
[01:18:29] andrewbogott: again?
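For anyone who, like the participants, doesn't have the curl syntax to hand, the same PUT can be built with Python's standard library. This is a minimal sketch: the function names are invented for illustration, while the URL path and JSON body are copied from the curl command in the log. Like curl -d, it sets no explicit Content-Type, and whether the service cares about that header is not known from this log.

```python
import json
import urllib.error
import urllib.request
from typing import List, Tuple


def build_mapping_put(base_url: str, project: str, domain: str,
                      backends: List[str]) -> urllib.request.Request:
    """Build (but do not send) the PUT shown in the curl command."""
    body = json.dumps({"domain": domain, "backends": backends}).encode("utf-8")
    url = f"{base_url}/v1/{project}/mapping"
    return urllib.request.Request(url, data=body, method="PUT")


def send(req: urllib.request.Request, timeout: float = 5.0) -> Tuple[int, str]:
    """Send the request and return (status, body), including 4xx/5xx."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urlopen raises on 400/500; the error object still carries the
        # status code and whatever body the server sent.
        return err.code, err.read().decode("utf-8", "replace")


if __name__ == "__main__":
    request = build_mapping_put(
        "http://localhost:5000",
        "test",
        "cvne.wmflabs.org",
        ["http://cvn-apache2.pmtpa.wmflabs"],
    )
    print(request.get_method(), request.full_url)
    # send(request) would contact the debug server on port 5000.
```

Reading the response body of a 400 or 500 this way is sometimes a quicker route to the underlying error than finding the server's log file.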
[01:18:46] * andrewbogott PUTs
[01:19:01] andrewbogott: meh, 400 again
[01:19:10] andrewbogott: there's no code path that starts at PUT and then returns a 400
[01:19:27] i'm super confused now
[01:19:34] YuviPanda: probably best to stop nginx and uwsgi
[01:19:37] in case they're in your way
[01:19:38] since nothing works anymore
[01:19:50] not if you're hitting 5000 directly, no?
[01:19:57] i have it configured to listen on 0.0.0.0
[01:20:01] dunno
[01:20:44] andrewbogott: don't think I can debug more right now, sorry :(
[01:21:00] legoktm: think you can help? weird flask issue
[01:21:23] i'm not very good with flask but lemme read the scrollback
[01:21:35] ok
[01:21:45] sorry andrewbogott :(
[01:22:09] s'alright, maybe legoktm will see something obvious
[01:22:18] I swear this is working worse than a few hours ago with no changes made
[01:22:57] yeah...i have no clue.
[01:57:02] !log deployed Parsoid f416b8240e74e286
[01:57:19] Logged the message, Master
[02:00:09] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2)
[02:00:55] !log stopped mysql on db1008 to clone another
[02:01:16] Logged the message, Master
[02:05:10] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2)
[02:08:27] going to make tomorrow a get in late day and much shorter...
[02:08:28] night
[02:15:10] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2)
[02:15:31] mysql on db1008 coming back up
[02:16:42] !log LocalisationUpdate completed (1.23wmf2) at Tue Nov 5 02:16:42 UTC 2013
[02:16:58] Logged the message, Master
[02:20:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 298 Threads: 1 Questions: 6249 Slow queries: 0 Opens: 43 Flush tables: 2 Open tables: 52 Queries per second avg: 20.969 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[02:30:26] !log LocalisationUpdate completed (1.23wmf1) at Tue Nov 5 02:30:26 UTC 2013
[02:30:42] Logged the message, Master
[02:38:01] PROBLEM - MySQL Slave Running on db31 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da
[02:38:10] PROBLEM - MySQL Slave Running on db73 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da
[02:40:00] PROBLEM - MySQL Slave Running on db37 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in field list on query. Default data
[02:40:01] RECOVERY - MySQL Slave Running on db31 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[02:41:10] RECOVERY - MySQL Slave Running on db73 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[02:41:20] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: CRIT replication delay 303 seconds
[02:41:40] PROBLEM - MySQL Replication Heartbeat on db65 is CRITICAL: CRIT replication delay 318 seconds
[02:41:40] PROBLEM - MySQL Replication Heartbeat on db31 is CRITICAL: CRIT replication delay 319 seconds
[02:42:40] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CRIT replication delay 364 seconds
[02:42:40] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 303 seconds
[02:43:30] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 349 seconds
[02:46:40] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da
[02:47:01] wtf
[02:48:40] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay -0 seconds
[02:48:40] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay 0 seconds
[02:48:40] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[02:48:40] RECOVERY - MySQL Replication Heartbeat on db31 is OK: OK replication delay 0 seconds
[02:49:20] RECOVERY - MySQL Replication Heartbeat on db72 is OK: OK replication delay -0 seconds
[02:50:00] RECOVERY - MySQL Slave Running on db37 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[02:51:40] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 600 seconds
[02:52:40] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds
[02:58:20] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:00:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[03:08:30] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 5 03:08:30 UTC 2013
[03:08:46] Logged the message, Master
[03:14:14] !log update Parsoid to 05aa83f52cecda3 to work around Vary: Accept-Encoding issue in production
[03:14:30] Logged the message, Master
[03:25:37] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused
[03:25:57] RECOVERY - Disk space on cp1058 is OK: DISK OK
[03:25:57] PROBLEM - Varnish HTTP parsoid-frontend on cp1058 is CRITICAL: Connection refused
[03:28:37] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time
[03:28:57] RECOVERY - Varnish HTTP parsoid-frontend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.002 second response time
[03:31:57] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused
[03:32:27] PROBLEM - Varnish HTTP parsoid-frontend on cp1045 is CRITICAL: Connection refused
[03:32:50] RoanKattouw_away is clearing the Parsoid Varnishes ^^
[03:33:07] RECOVERY - Disk space on cp1045 is OK: DISK OK
[03:34:57] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.001 second response time
[03:35:27] RECOVERY - Varnish HTTP parsoid-frontend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.001 second response time
[03:45:35] !log rolled back Parsoid deployment to cdbfdbb28ba2
[03:45:51] Logged the message, Master
[03:48:14] !log catrope synchronized php-1.23wmf1/extensions/VisualEditor 'Update VE for cherry-pick'
[03:48:30] Logged the message, Master
[03:48:31] !log catrope synchronized php-1.23wmf2/extensions/VisualEditor 'Update VE for cherry-pick'
[03:48:48] Logged the message, Master
[04:10:36] Aaron|home: yt?
[04:45:15] !log begin OSC for el_id on wikis with externallinks >1M rows (smaller ones already done)
[04:45:34] Logged the message, Master
[04:48:48] ori-l: hm?
[04:50:30] Aaron|home: the memcached logs are flooded with one particular key
[04:50:39] 'enwiki:sites/SiteList#2013-02-07+Site:2013-01-23'
[04:50:56] * Aaron|home already compiled the per-key list in desc order
[04:51:06] and yeah, that one was on top
[04:51:13] http://noc.wikimedia.org/~ori/SiteSQLStore.html
[04:53:06] good to see those graphs next to each other though
[06:26:10] (PS1) Ori.livneh: Set enwiki's languageLinkSiteGroup to 'wikipedia' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93648
[06:26:56] PROBLEM - MySQL Processlist on db1049 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 100 copy to table, 0 statistics
[06:27:56] RECOVERY - MySQL Processlist on db1049 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[06:30:40] (CR) Ori.livneh: [C: 2] Set enwiki's languageLinkSiteGroup to 'wikipedia' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93648 (owner: Ori.livneh)
[06:33:00] !log ori synchronized wmf-config/CommonSettings.php 'Id33ec9366: Set enwiki's languageLinkSiteGroup to 'wikipedia' (Bug: 56602)'
[06:33:17] Logged the message, Master
[07:03:36] PROBLEM - DPKG on tmh1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:04:06] PROBLEM - DPKG on mw1153 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:04:36] RECOVERY - DPKG on tmh1001 is OK: All packages OK
[07:05:06] RECOVERY - DPKG on mw1153 is OK: All packages OK
[07:07:39] PROBLEM - DPKG on tmh1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:08:39] RECOVERY - DPKG on tmh1002 is OK: All packages OK
[07:09:19] !log video/imagescaler upgrade
[07:09:31] Logged the message, Master
[07:10:59]