[00:00:15] RECOVERY - check_mysql on payments4 is OK: Uptime: 2948776 Threads: 1 Questions: 34029 Slow queries: 2 Opens: 420 Flush tables: 1 Open tables: 64 Queries per second avg: 0.11 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:00:16] mutante: so, same behavior as before. Does it look like a networking issue to you? [00:01:19] dr0ptp4kt: ok, i'm off,gnight [00:01:46] andrewbogott: eth0/eth1 are link state up, eth2/eth3 are down [00:02:01] PROBLEM - LDAPS on virt0 is CRITICAL: Connection timed out [00:02:06] andrewbogott: did it use eth2 and 3? then it's the cable [00:02:15] otherwise, yea, looks like networking issue [00:02:27] mutante, I don't know much, just judging by the packet loss. [00:02:42] LeslieCarr: can you help us out here? [00:03:32] * andrewbogott surprised this isn't paging [00:03:41] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection timed out [00:03:41] andrewbogott: from fenari, 94% packet loss , ack [00:03:46] but not 100% [00:03:50] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out [00:03:50] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out [00:04:00] mutante: andrewbogott from the office, 93% [00:04:05] mutante, yep! Very serious. [00:04:39] mutante, is leslie within shouting range? [00:04:49] andrewbogott: i don't know [00:04:54] not within mine [00:05:00] Oh, are you in DE still? [00:05:10] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 1.080 second response time [00:05:19] PROBLEM - SSH on virt0 is CRITICAL: Connection timed out [00:05:29] rfarrand: Can you see Leslie? [00:05:47] andrewbogott: no, but both not at office [00:05:59] RECOVERY - LDAPS on virt0 is OK: TCP OK - 3.044 second response time on port 636 [00:06:19] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 5.133 seconds response time.
nagiostest.beta.wmflabs.org returns 208.80.153.219 [00:06:29] RECOVERY - LDAP on virt0 is OK: TCP OK - 3.043 second response time on port 389 [00:06:39] Reedy: nope [00:06:56] need her? [00:07:21] rfarrand: We do! I just texted her... [00:08:10] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [00:08:27] so virt0 uses bonded interface, two ports on switch [00:08:32] daniel is confirming on server [00:08:59] seems its just two eth0 and eth1 [00:09:00] andrewbogott, Reedy: She is on vacation tomorrow and the following do so it would be best to catch her ASAP. If I see her I will let her know [00:09:10] i'll call her. [00:09:13] **do = day [00:09:19] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:09:49] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection timed out [00:09:50] robla: hey [00:10:19] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out [00:10:29] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out [00:10:33] cmjohnson1: sorry, nevermind....I hadn't realized the wikitech thing was already being worked on, but saw you were on RT duty. [00:10:56] cmjohnson1: I believe we are still mulling paging springle, though [00:11:07] the wikitech thing is due to virt0 being down [00:11:09] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.082 second response time [00:11:15] but it may be recovering... [00:11:18] oh...okay..springle should be on soon..what happened to virt0? [00:11:19] PROBLEM - SSH on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:30] dunno, mutante rebooted but dunno before that [00:11:40] (crash i imagine) [00:11:44] RobH: I don't think it's recovering -- very high packet loss which means that ocassionally a test succeeds. [00:11:50] yea, prolly not [00:11:54] It had the exact same symptoms before the reboot. 
[00:11:59] yea, eww [00:12:02] yea [00:12:08] so i texted leslie, im giving her a few minutes before i call [00:12:19] do i need to go down there? [00:12:19] (i cannot speak to other folks textign her, so I just disregard it happened ;) [00:12:25] cmjohnson1: you are in tampa? [00:12:29] yes [00:12:39] here is the thing [00:12:45] what exactly do i tell you to do when you are there? [00:12:51] wait, now i even get errors from puppetmaster? [00:12:53] its two cables, it shouldnt freak out that badly [00:13:03] robla: paging me? [00:13:06] so i have concerns it isnt physical layer [00:13:16] springle! [00:13:18] springle: there are two ongoing issues, one with wikitech (not paged you for) [00:13:18] springle: You didn't recieve a pager in your welcome package? [00:13:26] just so you realize confusion in backlog ;] [00:13:28] TimStarling: ^^ [00:13:37] robla: See #wikimedia-tech [00:13:49] You're about 6 minutes behind ;) [00:14:19] ah, there he is. I see the single "looking" now [00:14:23] * RobH tries to use phone as phone, watches it crash [00:14:27] technology. [00:14:36] ops can have this channel, they like shouting over bots [00:14:43] RobH: Have you tried turning it off and on again? [00:14:54] I'll talk about bug 56577 in #wikimedia-tech instead [00:15:19] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [00:15:29] RECOVERY - LDAP on virt0 is OK: TCP OK - 3.043 second response time on port 389 [00:16:06] ^^ lies [00:16:09] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.174 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [00:16:19] ok, interesting [00:16:27] the network display for the secondary port on virt0 shows no traffic [00:16:36] but i dunno this bonded itnerface stuff [00:16:41] so i dunno if its ok. 
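The observation above (both bond members report link up, yet the switch shows no traffic on the secondary port) can also be checked from the host side by comparing per-interface byte counters over a short interval. The following is a minimal sketch, assuming a Linux host that exposes counters under `/sys/class/net/<iface>/statistics/`; the helper names `read_counters` and `idle_interfaces` are hypothetical, and the member names `eth0`/`eth1` are taken from the log rather than verified against virt0.

```python
from pathlib import Path

def read_counters(ifaces, root="/sys/class/net"):
    """Read cumulative (rx_bytes, tx_bytes) for each interface from Linux sysfs."""
    counters = {}
    for name in ifaces:
        stats = Path(root) / name / "statistics"
        rx = int((stats / "rx_bytes").read_text())
        tx = int((stats / "tx_bytes").read_text())
        counters[name] = (rx, tx)
    return counters

def idle_interfaces(before, after):
    """Return interfaces whose rx and tx counters did not move between two snapshots."""
    return sorted(
        name for name in before
        if name in after and after[name] == before[name]
    )

# Intended use on the affected host (not executed here):
#   first = read_counters(["eth0", "eth1"])
#   time.sleep(5)
#   second = read_counters(["eth0", "eth1"])
#   print("no traffic on:", idle_interfaces(first, second))
```

A member that is link-up but listed by `idle_interfaces` would match what was seen on the switch for eth1; it does not by itself say why the member went quiet.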
[00:17:29] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: CRITICAL - Cannot make SSL connection [00:18:07] And… suddely packet loss is down to 0% [00:18:10] Did someone do something? [00:18:11] Robh? [00:18:13] andrewbogott: back to 0% [00:18:16] was about to day [00:18:19] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.157 second response time [00:18:21] and puppet runs on it now [00:18:38] it's back.. wtf [00:18:43] wikitech also up [00:18:45] hiccup [00:18:57] so yea [00:19:00] eth1 is now showing traffic [00:19:02] i did nothing. [00:19:07] Elsie: wikitech is back [00:19:08] other than watch network side output on switch [00:19:17] so no goddamned clue what happened. [00:19:23] It was a long hiccup! 40 minutes. [00:19:36] its not something we can say 'ok its fixed' [00:19:37] andrewbogott: puppet is now re-adding iptables rules on virt0 [00:19:38] ... [00:19:53] we shall see if it kills it ;] [00:19:57] mutante: Thanks. Can you test morebots? [00:20:05] !log Test [00:20:11] mutante, elsie, I'll kick morebots. [00:20:21] Logged the message, Master [00:20:31] I think it's connected... but perhaps unresp.. there it goes. :-) [00:20:38] So something was wrong with its eth1 interface in that entire outage [00:20:47] eth0 was showing network traffic on switch but eth1 was not [00:20:51] https://wikitech.wikimedia.org/wiki/SAL # Whee. [00:20:52] i have no idea what it means ;] [00:21:11] RobH: Can you add notes to https://bugzilla.wikimedia.org/show_bug.cgi?id=56594 please? :-) [00:21:58] that ticket is so narrow scope [00:22:14] virt0 had issue, so it affected a LOT of things, but i guess its a start [00:22:25] morebots, are you ok? [00:22:25] I am a logbot running on tools-login. [00:22:25] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [00:22:25] To log a message, type !log . 
[00:22:37] hm [00:22:37] (CR) jenkins-bot: [V: -1] Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93622 (owner: Chad) [00:22:39] PROBLEM - MySQL Slave Running on db47 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da [00:22:39] PROBLEM - MySQL Slave Running on db54 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in field list on query. Default data [00:23:41] RobH, mutante, I will send an outage email [00:24:40] i updated ticket with my observation [00:24:59] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 311 seconds [00:25:16] Thanks. :-) [00:25:39] RECOVERY - MySQL Slave Running on db47 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:25:43] we all know that the true cause of this problem was Ryan getting onto an airplane. [00:26:00] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [00:26:19] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 340 seconds [00:26:29] <^d> andrewbogott: First ferries. Now airplanes. Basically none of us should be allowed to travel, ever.
[00:26:31] (PS2) RobH: new bugzilla.wikimedia.org cert [operations/puppet] - https://gerrit.wikimedia.org/r/93621 [00:26:39] RECOVERY - MySQL Slave Running on db54 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:27:59] PROBLEM - MySQL Slave Delay on db54 is CRITICAL: CRIT replication delay 394 seconds [00:28:29] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 386 seconds [00:29:10] PROBLEM - NTP on virt0 is CRITICAL: NTP CRITICAL: Offset unknown [00:29:59] RECOVERY - MySQL Slave Delay on db54 is OK: OK replication delay 0 seconds [00:30:00] *yawn* [00:30:19] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [00:30:29] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds [00:30:50] you missed all the action, apergos [00:30:59] no I didn't [00:33:34] apergos: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=search-pool :) [00:33:50] are they gone? [00:34:12] hm they are 'ok', magic! [00:34:26] oh those are the eqiads [00:34:26] because they are gone :) [00:34:30] yes [00:34:31] the rest are gone, of course! [00:34:39] excellent! [00:34:57] * YuviPanda screams ALL ULS' FAULT! and then runs away [00:35:09] RECOVERY - NTP on virt0 is OK: NTP OK: Offset 0.001705765724 secs [00:38:30] brownbag talk on govt surveillance and privacy tools starting on 6 floor in 20 minutes [00:39:09] Here's the hangout link: https://plus.google.com/hangouts/_/72cpj4nea87b6deasmknhlqmb8 [00:39:20] kaldari: i was about to ask for that link, thanks: [00:39:25] who's giving it?
[00:39:33] Yan Zhu [00:39:43] Blogger/hacktivist [00:40:23] (CR) Chad: "Tested on arsenic, pretty much works :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93622 (owner: Chad) [00:44:17] (CR) Dzahn: [C: 2] "Issuer: C=US, O=GeoTrust, Inc., CN=RapidSSL CA" [operations/puppet] - https://gerrit.wikimedia.org/r/93621 (owner: RobH) [00:46:34] !log running master/slave pt-table-sync on archive tables with 1M+ rows [00:46:51] Logged the message, Master [00:47:06] YuviPanda: ok, so… now I can go back to thinking about whatever it was I was... [00:47:12] …thinking about before the… outage. [00:47:21] andrewbogott: until something else dies... [00:47:31] :P [00:47:34] Did you see something obvious going wrong with the unicorn? [00:47:44] no, i see that it's 500ing [00:47:48] so perhaps a path issue? [00:47:51] andrewbogott: where's the code? [00:47:54] like, file path? [00:48:17] It's a service, controlled by /etc/init/dynamicproxy-api.conf [00:48:28] using uwsgi [00:49:15] the actual python code is in /usr/lib/python2.7/dist-packages/invisible_unicorn [00:49:28] andrewbogott: where's the sqlite file? [00:49:50] /etc/dynamicproxy-api/ [00:50:35] andrewbogott: hmm, file permissions seem ok [00:50:55] Yeah, I changed them to a+w for a bit just to see, didn't help. [00:51:52] andrewbogott: okay, I can perform write operations on it manually, via the sqlite3 command [00:52:17] so maybe uwsgi is running with yet another user, somehow? [00:52:22] s/with/as/ [00:52:36] mutante, apergos: Also there's a youtube broadcast link if you just want to watch: [00:52:37] http://youtu.be/KBpau4gGvfQ [00:52:55] nice, clicking:) [00:53:11] * apergos will come on late tomorrow and check out early [00:53:14] yeah why not [00:54:03] andrewbogott: try hitting the API again? [00:54:23] YuviPanda: just did a GET, it looks to have failed. [00:54:36] I see 400s [00:56:21] andrewbogott: try again?
[00:57:10] hm 400 again [00:57:18] This was recently working, dammit [00:57:29] andrewbogott: try doing a PUT? [00:57:54] put [00:57:57] 500 it says [00:58:02] FUCKING SHOW ME THE ERROR LOGS DAMMIT [00:58:02] grr [00:58:06] stupid stupid thing [00:58:09] (PS2) Dzahn: use new SSL certs on Bugzilla [operations/puppet] - https://gerrit.wikimedia.org/r/93620 [00:58:38] YuviPanda: I thought you might know the secret place where the error logs are hidden [00:58:54] Supposedly stderr and stdout are getting written to /var/log/uwsgi/app/invisible.log [00:58:57] so the uwsgi people tell me [00:59:11] (CR) Dzahn: [C: 2] use new SSL certs on Bugzilla [operations/puppet] - https://gerrit.wikimedia.org/r/93620 (owner: Dzahn) [00:59:39] apparently not [01:01:18] andrewbogott: err, does this have the proxy puppet class applied to it? [01:01:18] invisible-unicorn requires that [01:01:18] since it requires a redis instance setup locally [01:01:18] that might be it [01:01:18] i see a local redis instance [01:01:34] but the nginx is still really old, and not the db [01:01:34] *deb [01:01:50] andrewbogott: that might be an issue? [01:01:54] hm, ok... [01:02:00] it should be, but maybe puppet is busted [01:02:03] andrewbogott: RobH so at 23:42 UTC i see the port being saturated [01:02:03] and deploying the wrong nginx [01:02:27] !log replaced SSL certs on Bugzilla with new ones that are not star.wm [01:02:43] LeslieCarr: hrmm [01:02:45] Logged the message, Master [01:02:49] the main port [01:02:56] so what triggered saturation [01:02:56] eth0? [01:02:56] andrewbogott: tbh my interest in the entire thing has waned ever since the 'we need to debianize everything!' conversations, since I still don't understand what the advantage is, especially since we're just getting around it by making our own packages [01:03:02] https://observium.wikimedia.org/device/device=12/tab=port/port=1299/ [01:03:03] ori-l: ^ now it's actually fixed, with 2 new certs .
fyi [01:03:11] i dunno, it looks like 99% of the time, there's 0 traffic [01:03:20] yea [01:03:22] then bam! sudden huge traffic spike [01:03:24] i saw packet loss of 94% from fenari [01:03:27] then bam, nada [01:03:31] so.... there's our culprit.... [01:03:36] hrmm. [01:03:37] then back to 0% out of nowhere [01:03:53] andrewbogott: hmm, try again now? [01:04:02] !log virt0 port saturated at 23:42 (recordkeeping for outage investigation) [01:04:19] Logged the message, RobH [01:04:23] YuviPanda: I'm asking puppet to install the latest nginx-extras, did you already upgrade that? [01:05:00] YuviPanda: and, oops, I just clobbered your --catch-exceptions [01:05:09] Still want me to try again? [01:07:31] RobH: resolved the cert stuff. it's using the 2 new certs now and puppet runs and it's happy [01:07:47] so no more star.wm [01:08:18] ok, the talk on youtube started [01:08:48] yay [01:08:53] wait youtube? [01:09:11] RobH: 16:56 < kaldari> mutante, apergos: Also there's a youtube broadcast link if you just want to watch: [01:09:25] http://youtu.be/KBpau4gGvfQ [01:10:29] andrewbogott: yeah [01:10:45] ok. first a get... [01:10:48] so, did anyone happen to do a tcpdump on virt0 ? [01:10:52] just in case ? [01:11:03] now a put [01:11:14] LeslieCarr: afraid not, just traceroute to it from elsewhere [01:11:18] ok [01:11:32] YuviPanda, done [01:11:42] andrewbogott: yeah, 400 and 500. [01:11:47] no idea why :| [01:11:50] let me do something else [01:12:05] oh this is interesting! [01:12:16] we got a 4gig bump of traffic going *into* our network [01:12:18] YuviPanda: possibly uwsgi is mangling the requests [01:12:23] its on my tv now [01:12:24] heh [01:12:25] during this time [01:12:26] andrewbogott: can you hit port 5000? [01:12:39] same machine [01:13:00] YuviPanda: I can do a wget... [01:13:09] Changing the gui to hit a different port is kind of a pain [01:13:14] yeah that should enough [01:13:16] a curl even [01:13:26] mutante: what cant xmbc do? 
[01:13:40] RobH: stream youtube? [01:13:43] well, it can do that [01:13:44] it is now [01:13:47] it cannot do hangouts... [01:14:01] heh:) makes sense, yea, nice to have the youtube [01:14:04] YuviPanda: ok, just did a get on 5000 [01:14:07] oh, different/wrong url though [01:14:22] You see that I have nginx configured to detect /dynamicproxy-api/whatever? [01:14:32] andrewbogott: i got the GET, and responded with 400 [01:14:33] * YuviPanda is confused [01:14:35] * YuviPanda checks [01:14:46] LeslieCarr: does that mean we were DOSed? [01:14:59] andrewbogott: it's supposed to return a 400 if that project does not exist [01:15:02] andrewbogott: try a PUT first? [01:15:10] YuviPanda: I can't easily do that to port 5000 [01:15:13] unless you know the curl syntax [01:15:24] ugh, I had it in bash history somewhere [01:15:28] burrrr [01:16:02] yeah [01:16:30] andrewbogott: curl -X PUT localhost:5000/v1/test/mapping -d '{"domain": "cvne.wmflabs.org", "backends": ["http://cvn-apache2.pmtpa.wmflabs"]}' [01:16:35] YuviPanda: There's some url rewriting that should happen, too… like, I should be able to wget http://proxy-abogott-8/dynamicproxy-api/v1/project-proxy/mapping and unicorn should just get /v1/project-proxy/mapping [01:16:37] but with IP instead of localhist? [01:16:53] andrewbogott: right, but I'm running a debug version with the debug webserver on port 5000 now [01:16:58] andrewbogott: so if there's an error, we know where it is [01:17:06] andrewbogott: if there isn't, we know that it's the URL mangling happening somewhere [01:17:10] ok, just did the put, got a 400 [01:17:24] Is the debug server not seing the db file? [01:17:42] * apergos raises hand (for talk) [01:18:29] andrewbogott: again? 
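For anyone replaying the PUT pasted at 01:16:30 without remembering the curl syntax, the same request can be built in Python. This is a minimal sketch using only the standard library; the path shape `/v1/<project>/mapping` and the JSON fields are copied from that curl line, while the helper names `build_mapping_put` and `send` are hypothetical and nothing here is taken from the invisible-unicorn code itself.

```python
import json
import urllib.error
import urllib.request

def build_mapping_put(base_url, project, domain, backends):
    """Build (but do not send) a PUT to /v1/<project>/mapping with a JSON body.

    No Content-Type header is set, so urllib falls back to
    application/x-www-form-urlencoded, the same default curl uses with -d.
    """
    body = json.dumps({"domain": domain, "backends": backends}).encode("utf-8")
    url = f"{base_url}/v1/{project}/mapping"
    return urllib.request.Request(url, data=body, method="PUT")

def send(request, timeout=5):
    """Send the request; return (status, body) even for 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")

# Intended use against the debug server from the log (not executed here):
#   req = build_mapping_put("http://localhost:5000", "test",
#                           "cvne.wmflabs.org",
#                           ["http://cvn-apache2.pmtpa.wmflabs"])
#   print(send(req))
```

Returning the response body for error statuses is a deliberate choice here: with 400s and 500s and no readable server log, whatever text the debug server sends back may be the only available hint.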
[01:18:46] * andrewbogott PUTs [01:19:01] andrewbogott: meh, 400 again [01:19:10] andrewbogott: there's no code path that starts at PUT and then returns a 400 [01:19:27] i'm super confused now [01:19:34] YuviPanda: probably best to stop nginx and uwsgi [01:19:37] in case they're in your way [01:19:38] since nothing works anymore [01:19:50] not if you're hitting 5000 directly, no? [01:19:57] i have it configured to listen on 0.0.0.0 [01:20:01] dunno [01:20:44] andrewbogott: don't think I can debug more right now, sorry :( [01:21:00] legoktm: think you can help? weird flask issue [01:21:23] i'm not very good with flask but lemme read the scrollback [01:21:35] ok [01:21:45] sorry andrewbogott :( [01:22:09] s'alright, make legoktm will see something obvious [01:22:18] I swear this is working worse than a few hours ago with no changes made [01:22:57] yeah...i have no clue. [01:57:02] !log deployed Parsoid f416b8240e74e286 [01:57:19] Logged the message, Master [02:00:09] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [02:00:55] !log stopped mysql on db1008 to clone another [02:01:16] Logged the message, Master [02:05:10] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [02:08:27] going to make tomorrow a get in late day and much shorter... 
[02:08:28] night [02:15:10] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [02:15:31] mysql on db1008 coming back up [02:16:42] !log LocalisationUpdate completed (1.23wmf2) at Tue Nov 5 02:16:42 UTC 2013 [02:16:58] Logged the message, Master [02:20:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 298 Threads: 1 Questions: 6249 Slow queries: 0 Opens: 43 Flush tables: 2 Open tables: 52 Queries per second avg: 20.969 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [02:30:26] !log LocalisationUpdate completed (1.23wmf1) at Tue Nov 5 02:30:26 UTC 2013 [02:30:42] Logged the message, Master [02:38:01] PROBLEM - MySQL Slave Running on db31 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da [02:38:10] PROBLEM - MySQL Slave Running on db73 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. Default da [02:40:00] PROBLEM - MySQL Slave Running on db37 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in field list on query. 
Default data [02:40:01] RECOVERY - MySQL Slave Running on db31 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [02:41:10] RECOVERY - MySQL Slave Running on db73 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [02:41:20] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: CRIT replication delay 303 seconds [02:41:40] PROBLEM - MySQL Replication Heartbeat on db65 is CRITICAL: CRIT replication delay 318 seconds [02:41:40] PROBLEM - MySQL Replication Heartbeat on db31 is CRITICAL: CRIT replication delay 319 seconds [02:42:40] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CRIT replication delay 364 seconds [02:42:40] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 303 seconds [02:43:30] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 349 seconds [02:46:40] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column ar_id in where clause on query. 
Default da [02:47:01] wtf [02:48:40] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay -0 seconds [02:48:40] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay 0 seconds [02:48:40] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [02:48:40] RECOVERY - MySQL Replication Heartbeat on db31 is OK: OK replication delay 0 seconds [02:49:20] RECOVERY - MySQL Replication Heartbeat on db72 is OK: OK replication delay -0 seconds [02:50:00] RECOVERY - MySQL Slave Running on db37 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [02:51:40] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 600 seconds [02:52:40] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [02:58:20] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [03:08:30] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 5 03:08:30 UTC 2013 [03:08:46] Logged the message, Master [03:14:14] !log update Parsoid to 05aa83f52cecda3 to work around Vary: Accept-Encoding issue in production [03:14:30] Logged the message, Master [03:25:37] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused [03:25:57] RECOVERY - Disk space on cp1058 is OK: DISK OK [03:25:57] PROBLEM - Varnish HTTP parsoid-frontend on cp1058 is CRITICAL: Connection refused [03:28:37] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time [03:28:57] RECOVERY - Varnish HTTP parsoid-frontend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.002 second response time [03:31:57] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused [03:32:27] 
PROBLEM - Varnish HTTP parsoid-frontend on cp1045 is CRITICAL: Connection refused [03:32:50] RoanKattouw_away is clearing the Parsoid Varnishes ^^ [03:33:07] RECOVERY - Disk space on cp1045 is OK: DISK OK [03:34:57] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.001 second response time [03:35:27] RECOVERY - Varnish HTTP parsoid-frontend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.001 second response time [03:45:35] !log rolled back Parsoid deployment to cdbfdbb28ba2 [03:45:51] Logged the message, Master [03:48:14] !log catrope synchronized php-1.23wmf1/extensions/VisualEditor 'Update VE for cherry-pick' [03:48:30] Logged the message, Master [03:48:31] !log catrope synchronized php-1.23wmf2/extensions/VisualEditor 'Update VE for cherry-pick' [03:48:48] Logged the message, Master [04:10:36] Aaron|home: yt? [04:45:15] !log begin OSC for el_id on wikis with externallinks >1M rows (smaller ones already done) [04:45:34] Logged the message, Master [04:48:48] ori-l: hm? 
[04:50:30] Aaron|home: the memcached logs are flooded with one particular key [04:50:39] 'enwiki:sites/SiteList#2013-02-07+Site:2013-01-23' [04:50:56] * Aaron|home already compiled the per-key list in desc order [04:51:06] and yeah, that one was on top [04:51:13] http://noc.wikimedia.org/~ori/SiteSQLStore.html [04:53:06] good to see those graphs next to each other though [06:26:10] (PS1) Ori.livneh: Set enwiki's languageLinkSiteGroup to 'wikipedia' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93648 [06:26:56] PROBLEM - MySQL Processlist on db1049 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 100 copy to table, 0 statistics [06:27:56] RECOVERY - MySQL Processlist on db1049 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [06:30:40] (CR) Ori.livneh: [C: 2] Set enwiki's languageLinkSiteGroup to 'wikipedia' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93648 (owner: Ori.livneh) [06:33:00] !log ori synchronized wmf-config/CommonSettings.php 'Id33ec9366: Set enwiki's languageLinkSiteGroup to 'wikipedia' (Bug: 56602)' [06:33:17] Logged the message, Master [07:03:36] PROBLEM - DPKG on tmh1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:04:06] PROBLEM - DPKG on mw1153 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:04:36] RECOVERY - DPKG on tmh1001 is OK: All packages OK [07:05:06] RECOVERY - DPKG on mw1153 is OK: All packages OK [07:07:39] PROBLEM - DPKG on tmh1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:08:39] RECOVERY - DPKG on tmh1002 is OK: All packages OK [07:09:19] !log video/imagescaler upgrade [07:09:31] Logged the message, Master [07:10:59] PROBLEM - DPKG on mw1156 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:11:09] PROBLEM - DPKG on mw1155 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:11:09] PROBLEM - DPKG on mw1157 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:11:59] RECOVERY - DPKG on mw1156 is OK:
All packages OK [07:12:09] RECOVERY - DPKG on mw1155 is OK: All packages OK [07:12:09] RECOVERY - DPKG on mw1157 is OK: All packages OK [07:14:39] PROBLEM - DPKG on mw1158 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:15:09] PROBLEM - DPKG on mw1160 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:15:39] RECOVERY - DPKG on mw1158 is OK: All packages OK [07:16:09] RECOVERY - DPKG on mw1160 is OK: All packages OK [07:22:19] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [07:50:08] paravoid: uwsgi + nginx or apache + mod_wsgi? [08:02:06] ori-l: gunicorn? [08:02:32] node.js + express in a docker on heroku? [08:02:53] lol [08:02:54] i'll get started on the hacker news post [08:02:57] I wasn't kidding :) [08:03:15] nginx + gunicorn is the stack I've used [08:03:16] gunicorn is fine obviously, but i don't know that we use it anywhere [08:03:47] well, sure, why not [08:03:48] I've used apache/mod_wsgi, can't say I was too happy [08:03:50] hmm, andrewbogott_afk is trying to setup a nginx + uwsgi stack on labs for one of the wikitech services and neither of us have experience with uwsgi... [08:04:30] I've heard good things about uwsgi (the maintainer is a friend/ex-mentoree) but I can't say I have an informed opinion myself [08:04:50] well, it looked cool, but then things like this give me pause (from the package description): [08:04:58] "uWSGI can be run in preforking, threaded, asynchronous/evented modes and supports various forms of green threads/coroutines (such as uGreen, Greenlet, Fiber). uWSGI provides several methods of configuration: via command line, via environment variables, via XML, INI, YAML configuration files, via LDAP and more." [08:05:20] first part or second part, ori-l? 
[08:05:39] all of it [08:05:45] yeah [08:05:53] you can do everything, with any way you'd like [08:05:56] have fun [08:06:25] reminds me of http://www.imdb.com/title/tt0185014/quotes [08:06:31] Hannah Green: Grady, you know how in class you're always telling us that writers make choices? [08:06:35] Grady Tripp: Yeah. [08:06:39] Hannah Green: And even though you're book is really beautiful, I mean, amazingly beautiful, it's... it's at times... it's... very detailed. You know, with the genealogies of everyone's horses, and the dental records, and so on. And... I could be wrong, but it sort of reads in places like you didn't make any choices. At all. And I was just wondering if it might not be different if... if when you wrote you weren't always... unde [08:06:40] r the influence. [08:07:02] heh [08:08:38] ori-l: the gunicorn package has this https://chris-lamb.co.uk/posts/sysadmin-friendly-deployment-gunicorn-debian [08:08:43] (found it!) [08:09:21] gunicornctl! [08:09:31] hahaha [08:09:48] of course we'll need ctlctl to control all of those seperate *ctl utils [08:09:50] Configuration Options [08:09:51] uWSGI and the various plugins it consists of is almost infinitely configurable. [08:10:11] yes, that's panic-inducing [08:10:15] like, WHAT IS IT? 
[08:10:40] IT'S INFINITE [08:10:41] it's a framework for doing things [08:10:56] or not doing things [08:11:12] ok, gunicorn it is [08:11:24] gunicorn.d is puppet friendly [08:11:24] uwsgi is also a protocol that nginx speaks natively [08:11:28] I'd review uwsgi too, as long as I don't have to do the initial research I'm good :p [08:11:59] YuviPanda: i don't know which drawer in my head to put all that information [08:12:29] the drawer you construct in the two weeks of free time everyone has :D [08:13:40] ori-l: in any case, I'd be careful with enabling gevent, it doesn't always work well if the application wasn't made with it in mind [08:13:47] DNS is one of the first things that fail [08:13:59] psycopg is another, although not an issue here obviously [08:14:24] yes, gevent is evil [08:14:43] version 1.0 is great for generating books [08:14:48] but otherwise evil [08:14:49] also btw, would it make sense to put this behind the misc-web varnish cluster? [08:15:01] (where "this" is graphite, I'm guessing :) [08:15:16] yes, sure [08:15:20] not sure if it makes much sense, it is runtime data [08:15:26] it's currently running behind a varnish instance that was just commented out in puppet and left running [08:15:35] yeah I remember... [08:22:45] oh did you see that the memcached traffic spike revolves around a single key? [08:22:58] no [08:23:05] did you send a mail? [08:23:06] http://noc.wikimedia.org/~ori/SiteSQLStore.html [08:23:11] filed a bug and CC'd you [08:23:22] it's ze germans [08:24:05] https://bugzilla.wikimedia.org/56602 [08:24:15] yup, just saw it [08:24:17] reading [08:24:45] you know Tim found that back in September, right? 
[08:25:18] no, i didn't [08:25:22] :( [08:25:31] I even pointed you to it when you started investigating [08:25:39] I should had been more explicit, sorry :( [08:25:46] Subject: Re: [Ops] Site issues, past two weeks & in progress [08:25:49] Date: Tue, 24 Sep 2013 14:01:00 +1000 [08:25:50] From: Tim Starling [08:25:58] * ori-l reads [08:27:54] ah, the work wasn't totally redundant, now we have some folk to blame [08:28:09] * ori-l squints at aude [08:28:39] it wasn't redundant in any case, you actually filed a bug and chasing it and all that :) [08:28:49] we had nothing done since september [08:29:03] the depressing thing is that i think the key contains '2013-02-07' because that's when the data was last modified [08:29:03] I'm not trying to downplay your finding, if anything I'm very grateful :) [08:29:15] at least as best as i can determine [08:30:04] it's InitialiseSettings.php -- in the cloud! [08:32:50] so, do you have any evil plans on how to avoid such things happening again? [08:35:07] how much faster is memcached than redis? [08:35:16] no idea, why? 
[08:35:32] it's much more introspectable [08:35:50] the other thing you could do is um [08:36:20] log a sample of all gets & sets (we're doing this via memcached-serious.log on fluorine but inadvertantly) [08:36:51] canonicalize keys by splitting on ':' and removing any component that contains numbers, any wiki db name, and any language names [08:37:21] and set up ganglia monitoring with a tmax/dmax (whichever one actually matters, i always forget) of 5 minutes or so [08:38:35] that sounds great [08:38:42] then actually have someone look at ganglia maybe [08:38:44] like deployers :) [08:39:24] yeah, even if people aren't proactive it would make debugging issues like this pretty quick [08:40:05] the high-traffic key patterns would never fall out of ganglia so you'd be able to see patterns [08:41:54] ha, good to have that figured out [08:43:00] yup [08:49:32] (PS1) Mark Bergsma: Add new esams upload LVS service IPs to the https cluster [operations/puppet] - https://gerrit.wikimedia.org/r/93658 [08:51:12] (PS2) Mark Bergsma: Add new esams upload LVS service IPs to the https cluster [operations/puppet] - https://gerrit.wikimedia.org/r/93658 [08:51:17] ori-l: poke! [08:51:32] re-poke! [08:51:38] fix! :) [08:51:44] asap! [08:52:07] heh [08:52:24] we could probably turn these things into settings [08:52:32] (CR) Mark Bergsma: [C: 2] Add new esams upload LVS service IPs to the https cluster [operations/puppet] - https://gerrit.wikimedia.org/r/93658 (owner: Mark Bergsma) [08:52:47] but would a cbd implementation of the sites store for id look up also help? (like done for interwiki) [08:53:42] how much data are we talking about? [08:55:39] for looking up one's own ids, settings are fine but we have mapping of "site id" (e.g. 
enwiki) to interwiki id ('en') [08:55:59] having such mapping for all the sites, somewhere efficient for lookup would be useful [08:56:21] i'll have to take a look at the code to remember what it's doing :) [09:08:21] anyway, thanks ori-l for bringing it to my attention [09:08:42] i can come up with a better solution for this stuff :) [09:10:55] aude, cool, thank you [10:10:12] (PS1) ArielGlenn: change refs to is_labs_puppet_master, it's not top scope [operations/puppet] - https://gerrit.wikimedia.org/r/93664 [10:11:41] (CR) ArielGlenn: [C: 2] change refs to is_labs_puppet_master, it's not top scope [operations/puppet] - https://gerrit.wikimedia.org/r/93664 (owner: ArielGlenn) [10:35:42] (PS1) Mark Bergsma: Pull lvs::realserver_ips from lvs::configuration [operations/puppet] - https://gerrit.wikimedia.org/r/93666 [10:36:53] (CR) Mark Bergsma: [C: 2] Pull lvs::realserver_ips from lvs::configuration [operations/puppet] - https://gerrit.wikimedia.org/r/93666 (owner: Mark Bergsma) [10:49:08] (PS1) ArielGlenn: remove knsq1-30, decommed in rt #1943 and #5898 [operations/dns] - https://gerrit.wikimedia.org/r/93668 [10:54:00] (PS1) Mark Bergsma: Add new esams upload LVS service IPs to the protoproxies [operations/puppet] - https://gerrit.wikimedia.org/r/93669 [10:55:03] (CR) Mark Bergsma: [C: 2] Add new esams upload LVS service IPs to the protoproxies [operations/puppet] - https://gerrit.wikimedia.org/r/93669 (owner: Mark Bergsma) [11:05:25] (PS1) ArielGlenn: remove knsq* from dhcp, dsh groups, ganglia refs, decommed [operations/puppet] - https://gerrit.wikimedia.org/r/93670 [11:12:01] (CR) ArielGlenn: [C: 2] remove knsq* from dhcp, dsh groups, ganglia refs, decommed [operations/puppet] - https://gerrit.wikimedia.org/r/93670 (owner: ArielGlenn) [11:13:48] (CR) ArielGlenn: [C: 2] remove knsq1-30, decommed in rt #1943 and #5898 [operations/dns] - https://gerrit.wikimedia.org/r/93668 (owner: 
ArielGlenn) [11:38:16] (PS1) ArielGlenn: cleanup: remove db79, db80 [operations/dns] - https://gerrit.wikimedia.org/r/93672 [11:39:31] (CR) ArielGlenn: [C: 2] cleanup: remove db79, db80 [operations/dns] - https://gerrit.wikimedia.org/r/93672 (owner: ArielGlenn) [11:41:03] lunch [11:44:08] (PS1) Mark Bergsma: Swap old/new esams upload LVS service IPs [operations/puppet] - https://gerrit.wikimedia.org/r/93674 [11:44:18] (PS1) ArielGlenn: cleanup: remove neptunium [operations/dns] - https://gerrit.wikimedia.org/r/93675 [11:46:16] jenkins is out to lunch with hashar I think [11:47:27] (CR) Mark Bergsma: [C: 2] Swap old/new esams upload LVS service IPs [operations/puppet] - https://gerrit.wikimedia.org/r/93674 (owner: Mark Bergsma) [11:48:15] (CR) ArielGlenn: [C: 2] cleanup: remove neptunium [operations/dns] - https://gerrit.wikimedia.org/r/93675 (owner: ArielGlenn) [11:58:39] PROBLEM - Puppetmaster HTTPS on palladium is CRITICAL: Connection refused [11:59:19] PROBLEM - DPKG on palladium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:36] (PS1) Mark Bergsma: Add reverse DNS for new esams LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/93677 [12:02:23] (CR) Mark Bergsma: [C: 2] Add reverse DNS for new esams LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/93677 (owner: Mark Bergsma) [12:07:43] (PS1) ArielGlenn: add back professor to dhcp, not dead yet [operations/puppet] - https://gerrit.wikimedia.org/r/93678 [12:09:01] (CR) ArielGlenn: [C: 2] add back professor to dhcp, not dead yet [operations/puppet] - https://gerrit.wikimedia.org/r/93678 (owner: ArielGlenn) [12:31:58] (PS1) ArielGlenn: remove prod dns for spares argon, cobalt, neodymium, promethium [operations/dns] - https://gerrit.wikimedia.org/r/93680 [12:37:13] (PS1) ArielGlenn: move knsq23-29 to decommissioned_role in cache [operations/puppet] - 
https://gerrit.wikimedia.org/r/93682 [12:39:54] (CR) ArielGlenn: [C: 2] move knsq23-29 to decommissioned_role in cache [operations/puppet] - https://gerrit.wikimedia.org/r/93682 (owner: ArielGlenn) [13:00:23] (PS1) Faidon Liambotis: geoip: add a geoip::data::lite class [operations/puppet] - https://gerrit.wikimedia.org/r/93683 [13:00:24] (PS1) Faidon Liambotis: puppetmaster: add GeoLite to geoip, fix for Labs [operations/puppet] - https://gerrit.wikimedia.org/r/93684 [13:09:25] (CR) Faidon Liambotis: [C: 2] geoip: add a geoip::data::lite class [operations/puppet] - https://gerrit.wikimedia.org/r/93683 (owner: Faidon Liambotis) [13:09:43] (CR) Faidon Liambotis: [C: 2] puppetmaster: add GeoLite to geoip, fix for Labs [operations/puppet] - https://gerrit.wikimedia.org/r/93684 (owner: Faidon Liambotis) [13:13:18] (PS1) Faidon Liambotis: puppetmaster: fixup for geoip changes [operations/puppet] - https://gerrit.wikimedia.org/r/93685 [13:13:35] (CR) Faidon Liambotis: [C: 2 V: 2] puppetmaster: fixup for geoip changes [operations/puppet] - https://gerrit.wikimedia.org/r/93685 (owner: Faidon Liambotis) [13:18:08] (PS1) Mark Bergsma: Change upload-lb.esams IP addresses to fit the new Zero scheme [operations/dns] - https://gerrit.wikimedia.org/r/93686 [13:18:21] friggin puppet [13:23:04] (PS1) Faidon Liambotis: puppetmaster: another fixup for geoip [operations/puppet] - https://gerrit.wikimedia.org/r/93688 [13:28:20] (CR) Mark Bergsma: [C: 2] Change upload-lb.esams IP addresses to fit the new Zero scheme [operations/dns] - https://gerrit.wikimedia.org/r/93686 (owner: Mark Bergsma) [13:29:23] !log Changed upload-lb.esams IP addresses [13:29:42] Logged the message, Master [13:30:19] fun [13:30:50] with the added benefit that ipv6 is now handled directly by varnish [13:31:00] and by amslvs2 instead of amslvs1 [13:31:15] oh it wasn't directly by varnish before? 
[13:31:17] I hadn't realized [13:31:35] didn't change it yet since we moved from squid [13:31:56] for upload I mean, bits has always been direct [13:36:32] which countries shall we add to ulsfo today [13:37:50] (CR) Faidon Liambotis: [C: 2] puppetmaster: another fixup for geoip [operations/puppet] - https://gerrit.wikimedia.org/r/93688 (owner: Faidon Liambotis) [13:38:03] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/debs/gerrit] - https://gerrit.wikimedia.org/r/90715 (owner: Hashar) [13:41:16] (PS1) Faidon Liambotis: geoip: make geoliteupdate executable [operations/puppet] - https://gerrit.wikimedia.org/r/93691 [13:42:06] (CR) Faidon Liambotis: [C: 2] geoip: make geoliteupdate executable [operations/puppet] - https://gerrit.wikimedia.org/r/93691 (owner: Faidon Liambotis) [13:44:55] ok [13:44:59] I think I should take a break [13:45:07] I'm just doing stupid things now [13:52:17] (PS1) Mark Bergsma: Send traffic from some more countries to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/93692 [13:55:11] (PS1) Faidon Liambotis: geoip: remove the broken geoip::data base class [operations/puppet] - https://gerrit.wikimedia.org/r/93693 [13:55:38] (CR) Mark Bergsma: [C: 2] Send traffic from some more countries to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/93692 (owner: Mark Bergsma) [13:55:54] how's your ganglia foo, mark? [13:56:27] i haven't used it to make custom views myself if that's what you're asking [13:56:31] i've always used torrus and the like for that [13:56:33] not the question [13:56:39] everytime you add countries [13:56:44] gdnsd's stats get reset to zero [13:56:48] and a spike happens [13:56:49] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=DNS [13:56:56] because ganglia is stupid I'm guessing [13:57:03] yes [13:57:08] any ideas on how to fix? 
[13:57:08] because it doesn't understand NaN [13:57:11] yes [13:57:15] at least the python plugin doesn't get that [13:57:20] very annoying [13:57:22] the python plugin makes it 0 [13:57:24] dunno what gdnsd uses [13:57:26] yes [13:57:26] I've checked the source [13:57:30] we should fix that upstream [13:57:36] varnish has that problem too [13:57:49] (CR) Akosiaris: [C: 2] geoip: remove the broken geoip::data base class [operations/puppet] - https://gerrit.wikimedia.org/r/93693 (owner: Faidon Liambotis) [13:58:14] thanks [13:59:51] i suppose ulsfo makes sense for any asian country that is currently on eqiad, except when it happens to use routes via europe [13:59:55] but then it should be on esams anyway [14:12:44] slowly eating from eqiad's traffic [14:14:51] root@stafford:~# du -hs /var/lib/puppet/volatile/GeoIP/ [14:14:51] 88M /var/lib/puppet/volatile/GeoIP/ [14:14:53] heh [14:15:01] now, to sync that to all hosts that want to use geoip [14:15:11] not sure if it's a great idea [14:15:55] (PS1) Mark Bergsma: Depool amssq47, was used as test host [operations/puppet] - https://gerrit.wikimedia.org/r/93695 [14:16:33] (PS2) Mark Bergsma: Depool amssq47, was used as test host [operations/puppet] - https://gerrit.wikimedia.org/r/93695 [14:16:56] try it [14:17:03] nah [14:17:06] worst case you have to disable it for a few days, wouldn't be the end of the world [14:17:08] I think I'll create a definition [14:17:14] so you can pick the databases you care about [14:17:21] oh, sure [14:17:25] I thought that's what you meant [14:17:28] no [14:17:33] right now we just sync the whole dir [14:17:34] but the hosts that do need it can get it from puppet [14:17:37] right [14:17:49] I just doubled its size [14:17:54] (CR) Mark Bergsma: [C: 2] Depool amssq47, was used as test host [operations/puppet] - https://gerrit.wikimedia.org/r/93695 (owner: Mark Bergsma) [14:28:27] yoyo paravoid, any preference on that thar varnishmill name thing we are thinking 
about? [14:28:44] i'm asking because we want to dis-ambiguate the config property names [14:28:51] (PS1) Mark Bergsma: Move the text-varnish IPs in pmtpa back to text [operations/puppet] - https://gerrit.wikimedia.org/r/93698 [14:28:57] right now a lot of them start with 'log.' [14:31:43] it's fine [14:31:49] I don't care that much tbh :) [14:31:51] (CR) Mark Bergsma: [C: 2] Move the text-varnish IPs in pmtpa back to text [operations/puppet] - https://gerrit.wikimedia.org/r/93698 (owner: Mark Bergsma) [14:31:58] about naming [14:34:58] (PS1) Mark Bergsma: Correct monitoring IP references [operations/puppet] - https://gerrit.wikimedia.org/r/93700 [14:35:05] (CR) jenkins-bot: [V: -1] Correct monitoring IP references [operations/puppet] - https://gerrit.wikimedia.org/r/93700 (owner: Mark Bergsma) [14:36:47] (CR) Mark Bergsma: [C: 2 V: 2] Correct monitoring IP references [operations/puppet] - https://gerrit.wikimedia.org/r/93700 (owner: Mark Bergsma) [14:40:39] (PS1) Faidon Liambotis: puppetmaster: fixup to $is_labs_puppet_master ref [operations/puppet] - https://gerrit.wikimedia.org/r/93707 [14:40:40] (PS1) Faidon Liambotis: geoip: don't filebucket geoip databases [operations/puppet] - https://gerrit.wikimedia.org/r/93708 [14:40:41] (PS1) Faidon Liambotis: geoip: simplify the module [operations/puppet] - https://gerrit.wikimedia.org/r/93709 [14:40:58] (CR) Faidon Liambotis: [C: 2] puppetmaster: fixup to $is_labs_puppet_master ref [operations/puppet] - https://gerrit.wikimedia.org/r/93707 (owner: Faidon Liambotis) [14:41:06] (CR) Faidon Liambotis: [C: 2] geoip: don't filebucket geoip databases [operations/puppet] - https://gerrit.wikimedia.org/r/93708 (owner: Faidon Liambotis) [14:42:03] (CR) Ottomata: [C: 2] geoip: don't filebucket geoip databases [operations/puppet] - https://gerrit.wikimedia.org/r/93708 (owner: Faidon Liambotis) [14:42:14] thanks :) [14:43:04] (PS1) Mark 
Bergsma: Move esams text-varnish IPs to text [operations/puppet] - https://gerrit.wikimedia.org/r/93710 [14:43:37] jenkins died again? [14:43:55] (CR) Faidon Liambotis: [V: 2] puppetmaster: fixup to $is_labs_puppet_master ref [operations/puppet] - https://gerrit.wikimedia.org/r/93707 (owner: Faidon Liambotis) [14:44:01] (PS2) Mark Bergsma: Move esams text-varnish IPs to text [operations/puppet] - https://gerrit.wikimedia.org/r/93710 [14:44:01] i believe so [14:44:06] (CR) Faidon Liambotis: [V: 2] geoip: don't filebucket geoip databases [operations/puppet] - https://gerrit.wikimedia.org/r/93708 (owner: Faidon Liambotis) [14:46:47] (CR) Mark Bergsma: [C: 2 V: 2] Move esams text-varnish IPs to text [operations/puppet] - https://gerrit.wikimedia.org/r/93710 (owner: Mark Bergsma) [14:46:50] paravoid but what about the thousands of people that want to have puppetmasters download geoip files to a custom directory?! [14:46:52] :p [14:47:00] (jk) [14:47:12] paravoid: mark: jenkins is not dead, it is trying to catch up with all your changes :D [14:47:13] (CR) Ottomata: [C: 1] geoip: simplify the module [operations/puppet] - https://gerrit.wikimedia.org/r/93709 (owner: Faidon Liambotis) [14:47:18] I got to make the job to run in parralel [14:47:23] jobs [14:47:40] technically, I didn't hardcode a directory, just a fileserver path :) [14:47:41] isn't it good that we write changes faster than jenkins can check them [14:47:46] which you can point wherever you want [14:48:16] * paravoid wtfs at virt0's iptable/puppet integration [14:50:55] (CR) Akosiaris: [C: 2] geoip: simplify the module [operations/puppet] - https://gerrit.wikimedia.org/r/93709 (owner: Faidon Liambotis) [14:53:48] (PS1) Faidon Liambotis: puppetmaster/geoip: make the symlinks relative [operations/puppet] - https://gerrit.wikimedia.org/r/93712 [14:53:49] (PS1) Faidon Liambotis: authdns: switch to geoip::data::puppet [operations/puppet] - 
https://gerrit.wikimedia.org/r/93713 [14:54:24] (CR) Faidon Liambotis: [C: 2] puppetmaster/geoip: make the symlinks relative [operations/puppet] - https://gerrit.wikimedia.org/r/93712 (owner: Faidon Liambotis) [14:54:35] (CR) Faidon Liambotis: [C: 2] authdns: switch to geoip::data::puppet [operations/puppet] - https://gerrit.wikimedia.org/r/93713 (owner: Faidon Liambotis) [14:54:52] mark: so, now do you see why I didn't do geoip city from the beginning? [14:54:55] :P [14:55:03] YuviPanda: It's database problems. I fixed the 400s from yesterday by copying over the .db file from proxy-dammit [14:55:07] no, i'm not looking [14:55:14] YuviPanda: Now I can read but not write, current failures always happen in db.session.commit() [14:55:15] i need my eyes today [14:55:18] just the count of commits [15:11:14] (PS1) Faidon Liambotis: geoip: s/misc::geoip/geoip/ [operations/puppet] - https://gerrit.wikimedia.org/r/93714 [15:11:27] ottomata: ^ :-) [15:12:40] (CR) Faidon Liambotis: [C: 2] geoip: s/misc::geoip/geoip/ [operations/puppet] - https://gerrit.wikimedia.org/r/93714 (owner: Faidon Liambotis) [15:16:45] hm, how do we push to labs-private again? [15:17:45] (PS1) Mark Bergsma: Add new esams LVS service IPs [operations/puppet] - https://gerrit.wikimedia.org/r/93717 [15:18:35] To ssh://faidon@gerrit.wikimedia.org:29418/labs/private.git ! [remote rejected] HEAD -> refs/for/master (internal server error) [15:18:38] huh [15:18:52] I haven't done this in a while though [15:19:09] andrewbogott: around? [15:19:21] Internal server error suggests it's probably not your fault ;) [15:19:35] * andrewbogott reads backscroll [15:19:41] you'd think [15:19:54] cmjohnson1: are you here ? [15:20:16] qchris: ^ Gerrit is giving an internal server error... 
[15:20:26] (CR) Mark Bergsma: [C: 2] Add new esams LVS service IPs [operations/puppet] - https://gerrit.wikimedia.org/r/93717 (owner: Mark Bergsma) [15:20:30] paravoid: 'git review' should work with gerrit, same as anywhere else [15:20:40] with labs private you mean [15:20:48] yep. [15:21:00] labs private is just another ordinary gerrit repo. [15:21:20] * andrewbogott tries it here [15:21:21] [2013-11-05 15:21:17,018] ERROR com.google.gerrit.server.git.ReceiveCommits : Only 0 of 1 new change refs created in labs/private; aborting [15:21:38] * paravoid scratches head [15:21:40] Reedy: Is labs-private gerrit hosted? [15:21:45] it is [15:21:47] Yup [15:21:48] see the error above [15:21:54] https://git.wikimedia.org/summary/?r=labs/private.git [15:22:05] Right. [15:22:06] !log reedy synchronized php-1.23wmf2/includes/media/FormatMetadata.php 'I5f65333cc94b66fd80fcf5abcfad7e10e4669310' [15:22:18] paravoid: "I'm not slacking; gerrit won't let me push my work" [15:22:21] Logged the message, Master [15:22:24] paravoid: Yeps, I get internal server error too. [15:22:39] That was good [15:22:42] and there's ^d :) [15:22:46] ^d, you are here just in time! [15:22:47] ^d: paravoid has broken gerrit [15:22:57] * ^d goes back to bed [15:22:59] <^d> ;-) [15:23:22] ^d: ever seen "com.google.gerrit.server.git.ReceiveCommits : Only 0 of 1 new change refs created in labs/private; aborting" before? [15:23:47] <^d> Not that I remember. [15:24:19] http://comments.gmane.org/gmane.comp.version-control.repo/5939 [15:26:22] akosiaris1..yep [15:27:56] <^d> paravoid: I did the little one liner he suggested to compare change ids, our max(change_id) value is greater or equal to than what's on-disk [15:28:00] <^d> (Which should be fine) [15:28:34] cmjohnson1: I have a problem with strontium. It seems to have 0 connected ethernet cables to the switch. And I can't find a reason for it (unless descriptions are wrong on the switch). Maybe you could have a look next time you visit eqiad ? 
[15:29:17] ^d: it's not :) [15:29:27] it doesn't work I mean [15:29:28] <^d> Obviously :) [15:29:29] okay..I will be back there Thursday [15:29:36] <^d> qchris: Any thoughts? [15:29:45] cmjohnson1: thanks :-) [15:29:46] Not yet. [15:29:53] I am checking acls ... [15:30:06] <^d> acls on labs/private shouldn't have changed in *agessss* [15:30:14] Yes :-) [15:30:30] http://code.google.com/p/gerrit/issues/detail?id=1593 [15:31:29] akosiaris1: ge-4/0/27 up down strontium [15:32:58] that is why i think it is a cable issue :-( [15:33:10] tried and the other 3 obviously... no change [15:34:02] and then i set them back to being disabled [15:36:33] ^d: acls look sane. So does database. Do the logs show anything related? [15:37:44] <^d> qchris: Just a couple of entries at the end of error_log saying exactly what paravoid and nothing else. [15:38:01] Mhmm :-( [15:38:16] No stack traces or the like? Strange. [15:38:47] <^d> Here's the latest, in all its usefulness: [15:38:48] <^d> [2013-11-05 15:29:09,890] ERROR com.google.gerrit.server.git.ReceiveCommits : Only 0 of 1 new change refs created in labs/private; aborting [15:44:26] (PS1) Ottomata: Adding support for typeNames setting on output writers. [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/93728 [15:45:51] * andrewbogott goes to vote [15:48:38] !log authdns: upgrading gdnsd to 1.10.1 & switching to the proprietary GeoIP database [15:48:53] Logged the message, Master [16:08:50] (PS1) Ottomata: Adding output writer parameters to jmxtrans::metrics::jvm [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/93741 [16:09:46] (CR) Ottomata: [C: 2 V: 2] Adding support for typeNames setting on output writers. 
[operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/93728 (owner: Ottomata) [16:10:02] (CR) Ottomata: [C: 2 V: 2] Adding output writer parameters to jmxtrans::metrics::jvm [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/93741 (owner: Ottomata) [16:13:59] (PS1) Cmjohnson: Removing all entries for ms1-4 as they're being decommissioned :: [operations/puppet] - https://gerrit.wikimedia.org/r/93744 [16:14:13] (PS1) Ottomata: Adding kafka::server::jmxtrans support for jmxtrans Kafka broker metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93746 [16:15:21] (CR) Ottomata: [C: 2 V: 2] Adding kafka::server::jmxtrans support for jmxtrans Kafka broker metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93746 (owner: Ottomata) [16:21:01] (PS1) Faidon Liambotis: Switch to the GeoIP City databases & add US states [operations/dns] - https://gerrit.wikimedia.org/r/93747 [16:22:10] mark, bblack ^ [16:22:35] indendation is getting a bit ridiculous :) [16:23:30] (CR) Mark Bergsma: [C: 1] Switch to the GeoIP City databases & add US states [operations/dns] - https://gerrit.wikimedia.org/r/93747 (owner: Faidon Liambotis) [16:24:16] HI, CA, OR, WA for starters? [16:24:51] (CR) Faidon Liambotis: [C: 2] Switch to the GeoIP City databases & add US states [operations/dns] - https://gerrit.wikimedia.org/r/93747 (owner: Faidon Liambotis) [16:26:58] bblack? any suggestions? [16:29:48] (PS1) Ottomata: Now using jmxtrans instead of KafkaGanglia class. [operations/puppet] - https://gerrit.wikimedia.org/r/93750 [16:30:20] (CR) Ottomata: [C: 2 V: 2] Now using jmxtrans instead of KafkaGanglia class. [operations/puppet] - https://gerrit.wikimedia.org/r/93750 (owner: Ottomata) [16:30:52] paravoid: I see only "=> eqiad" for all the new US stuff, which is the default anyways? [16:31:16] oh, Reedy mind replying to that email from Erik about scap? 
[16:31:33] is that just to establish a baseline and then later move some to ulsfo? [16:31:40] paravoid: just to make it easier to change some to ulsfo later [16:31:44] greg-g: There's nothing to announce [16:31:49] same holds for esams atm, we should just change that to EU => esams now [16:31:54] oh yes, I should read commit messages [16:32:17] (PS1) Cmjohnson: Removing dns entries for ms1-4 [operations/dns] - https://gerrit.wikimedia.org/r/93751 [16:36:13] paravoid or mark can you look at this ...really just the nfs.pp change...should be okay but don't want to break anything https://gerrit.wikimedia.org/r/#/c/93744/ [16:37:02] (PS1) Faidon Liambotis: Point US West Coast states to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/93752 [16:37:07] cmjohnson1: should be fine [16:37:13] thx [16:37:25] mark: deploy tomorrow our morning? [16:37:32] I guess they're waking up now, might be too late [16:37:35] (CR) Cmjohnson: [C: 2] Removing all entries for ms1-4 as they're being decommissioned :: [operations/puppet] - https://gerrit.wikimedia.org/r/93744 (owner: Cmjohnson) [16:37:36] yeah fin [16:37:38] fine [16:37:54] I'll be around for a few more hours, so I could do it today [16:38:22] but maybe it's better to slowly ramp it up [16:38:31] (CR) Cmjohnson: [C: 2] Removing dns entries for ms1-4 [operations/dns] - https://gerrit.wikimedia.org/r/93751 (owner: Cmjohnson) [16:38:59] !log dns update [16:39:15] Logged the message, Master [16:40:51] !log beta: extensions update has been blocked since Oct 30th due to some bad registration in mediawiki/extensions.git . Fixed! [16:40:56] paravoid: I'd just put CA by itself in the first move. It's so huge compared to other states. [16:41:07] Logged the message, Master [16:42:01] Reedy: could you tell him there's nothing to announce? 
:) [16:42:14] I know, that's why I bundled it with the other small ones [16:42:23] if it were two large states I wouldn't do it [16:42:33] but the others should be relatively nothing compared to CA [16:42:37] ok [16:53:24] (PS1) RobH: RT: 6161 decom for relocation marmontel [operations/dns] - https://gerrit.wikimedia.org/r/93755 [16:55:39] gerrit why you so slllllowwwww [16:56:05] we give you all the hw you want [16:56:05] yet you still slow [16:56:29] jenkins? [16:56:45] ^d: i know gerrit is your baby, but its an ugly baby. [16:56:52] nah, just 'review and go to next file' [16:56:53] takes forever [16:57:03] maybe its a firefox thing. [16:57:21] RobH: jenkins is the slow part [16:57:25] but doing so even slows down my ssytem and invokes the pinwheel [16:57:34] at least it was yesterday, but it's not dead, just patience [16:57:39] ah yeah, stupid google tool only caring about google products [16:57:53] jenkins is at fault for the page loads in gerrit? that seems odd. [16:58:17] hrmm [16:58:18] apergos: reminder re RelEng/QA checkin starting in 5 [16:58:22] well 2 [16:58:22] <^d> RobH: One of the design philosophies with Gerrit is "make it work in Chrome, and fuck everybody else" [16:58:25] must be firefox, i watch daniel do same thing and no lag. [16:58:33] ah i saw the difference now [16:58:39] ^d: so i gotta run my closed source browser? [16:58:40] =p [16:58:45] well, i have Iceweasel/Firefox and on the same connection and its way faster [16:58:47] <^d> Much as I hate to say it, using any webkit/blink-based browser is likely to give a better experience :) [16:58:48] no one give me the 'you can compile the open source chrome' [16:58:51] cuz thats bullshit! [16:58:53] ;] [16:59:03] greg-g: can I read the highlights later? [16:59:18] <^d> RobH: Really, it's GWT, but I digress. 
[16:59:26] # experimental for modern Iceweasel [16:59:26] deb http://ftp.us.debian.org/debian experimental main [16:59:26] deb http://mozilla.debian.net/ experimental iceweasel-aurora [16:59:32] I don't have anything to report, and I'm slugging away on cleanup just now [16:59:34] <-- apt.sources [16:59:45] (CR) RobH: [C: 2] RT: 6161 decom for relocation marmontel [operations/dns] - https://gerrit.wikimedia.org/r/93755 (owner: RobH) [17:01:12] <^d> mutante, RobH: So we talked SVN the other day, and finding a crappy box in eqiad to stuff it on. We agree on antimony as being reasonable? [17:01:23] <^d> If so, I can start working on the puppet stuff and get the data moved over. [17:01:24] i'm going to let mutante have his moment of 'my linux laptop performs better at that' [17:01:28] ^d: nope [17:01:35] cuz when you finish temp stuff on that host [17:01:37] hashar: able to make call? [17:01:38] its getting wiped [17:01:41] its a temp scripting host. [17:01:43] <^d> RobH: antimonyyyyy [17:01:46] <^d> Not arsenic. [17:01:51] greg-g: yeah forgot to click sorry [17:01:52] oh, wait [17:01:52] damn it [17:01:58] fuckin elements, who picked that scheme... [17:02:06] ^d: does that include codereview-proxy? [17:02:09] (PS1) Faidon Liambotis: Add Canada and its provinces/territories [operations/dns] - https://gerrit.wikimedia.org/r/93756 [17:02:17] <^d> mutante: We don't need codereview-proxy anymore afaik. [17:02:28] <^d> I'm pretty sure Reedy finished dumping all the junk into CR. [17:02:30] ok antimony is gerrit stuff [17:02:31] so yea [17:02:31] ^d: i agree. [17:02:38] <^d> antimony is gitblit. [17:02:38] ^d: well, this is gitblit right? [17:02:40] ^d: somebody fixed the redirect to bugzilla to http://codereview-proxy.wikimedia.org/ [17:02:49] ^d: i have an RT to confirm if it can die or not [17:02:57] isnt it in 'development' [17:02:57] or is it a production service now? 
[17:03:13] https://rt.wikimedia.org/Ticket/Display.html?id=6146 [17:03:16] <^d> RobH: As production as gitblit is capable of being. [17:03:19] i would really like that because that's on kaulen [17:03:22] ^d: heh [17:03:42] i think it was said we cannot get rid of CRP [17:03:45] cuz of some kinda linking? [17:03:51] (or is that just svn?) [17:04:01] then please take it with you on the svn box [17:04:04] if that works [17:04:12] <^d> RobH: That's svn. [17:04:15] ok [17:04:24] yea antimony is fine to move things onto for that then [17:04:27] <^d> codereview-proxy is a relic from mayflower days and probably could've died sooner. [17:04:33] then lets kill it [17:04:41] <^d> When we needed a way to get svn from europe to us :) [17:06:58] (PS1) RobH: RT: 6161 remove all traces of marmontel [operations/puppet] - https://gerrit.wikimedia.org/r/93758 [17:08:07] ^d: Yeah, I force cached the diffs into the database for what would ever generate a diff [17:08:41] delete it [17:08:43] DELETE IT [17:09:09] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=6208 is marmontel wipe and ship to eqiad [17:09:21] usual, dont decom in racktables rack, just unrack and replace name with asset tag, etc... [17:10:24] ^cool...it's already wiping so I will update ticket in a few...I have a pile of servers going to eqiad. Should go out by next week [17:10:29] !log stopping puppet on analytics1021 to test jmxtrans kafka stuff [17:10:45] Logged the message, Master [17:11:54] !log reclaimed marmontel for shipment to eqiad, confirmed decommission steps up to powering down. 
[17:12:10] Logged the message, RobH [17:12:24] !log disabling icinga checks on search13-20 for decommissioning rt6194 [17:12:37] Logged the message, Master [17:15:20] (PS1) Dzahn: remove codereview-proxy.wm.org from DNS [operations/dns] - https://gerrit.wikimedia.org/r/93761 [17:15:46] !log ori synchronized php-1.23wmf1/extensions/WikimediaEvents/modules/ext.wikimediaEvents.ve.js 'I4085ee9cb: Update VisualEditor performance event names' [17:16:02] Logged the message, Master [17:16:20] !log ori synchronized php-1.23wmf2/extensions/WikimediaEvents/modules/ext.wikimediaEvents.ve.js 'I4085ee9cb: Update VisualEditor performance event names' [17:16:36] Logged the message, Master [17:16:37] <^d> Reedy: Easy change https://gerrit.wikimedia.org/r/#/c/92016/ [17:17:06] Wheee [17:17:10] (PS4) Reedy: Remove old tampa search config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/92016 (owner: Chad) [17:17:14] (CR) Reedy: [C: 2] Remove old tampa search config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/92016 (owner: Chad) [17:18:01] (Merged) jenkins-bot: Remove old tampa search config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/92016 (owner: Chad) [17:18:06] ^d: eh.. https://wikitech.wikimedia.org/wiki/Codereview-proxy.wikimedia.org says "from our Subversion repo, which is currently over in eqiad" [17:18:36] <^d> Lies :p [17:18:39] <^d> It's in Tampa [17:18:49] https://wikitech.wikimedia.org/w/index.php?title=Codereview-proxy.wikimedia.org&diff=80260&oldid=70298 [17:18:53] Reedy: ^ [17:19:07] ^d: lol, it was esams before [17:19:13] <^d> Yep. [17:19:16] <^d> mayflower. [17:19:21] ah [17:19:22] Ryan moved it around the DC hackathon [17:19:25] 3 years ago(?) [17:19:52] Reedy: agree on https://gerrit.wikimedia.org/r/#/c/93761/1 ? 
[17:20:01] Yup [17:20:01] i'd kill that first and then delete it from kaulen [17:20:03] cool [17:20:31] * apergos smiles at ottomata [17:20:35] (PS7) Andrew Bogott: Various dynamicproxy-api changes: [operations/puppet] - https://gerrit.wikimedia.org/r/92664 [17:20:41] :) [17:20:44] (for the log entry!) [17:20:46] (CR) Dzahn: [C: 2] remove codereview-proxy.wm.org from DNS [operations/dns] - https://gerrit.wikimedia.org/r/93761 (owner: Dzahn) [17:21:04] (PS1) Chad: Apply SVN manifests to antimony in prep for move to eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93762 [17:21:20] !log DNS update - killing codereview-proxy [17:21:34] Logged the message, Master [17:21:50] !log updating pmtpa pybal searchpool_4 and 5...setting all to false [17:22:07] Logged the message, Master [17:22:10] <^d> mutante: So, for moving svn we should be able to just apply the puppet changes, rsync the data over, then swap dns. [17:23:29] ^d: ok, sounds reasonable, want me to merge ? [17:23:39] svn-server to antimony [17:25:01] do you need us for the rsync or do you have sudo on both [17:25:43] <^d> I've got sudo on both ends. [17:25:53] <^d> So once it's merged on puppetmaster I can handle the rest I think [17:25:54] (CR) Dzahn: [C: 1] Apply SVN manifests to antimony in prep for move to eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93762 (owner: Chad) [17:26:04] ah, ok [17:26:12] (CR) Dzahn: [C: 2] Apply SVN manifests to antimony in prep for move to eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93762 (owner: Chad) [17:26:26] heya paravoid, just so I know for scrum of scrums [17:26:43] ^d: it's merged [17:26:44] the geoip module work you did allows for use of states in gdns? [17:26:55] aude: any idea what the qunit failure is about?
[17:27:29] (PS8) Andrew Bogott: Various dynamicproxy-api changes: [operations/puppet] - https://gerrit.wikimedia.org/r/92664 [17:27:33] <^d> mark: I wonder if we can put the old readonly svn behind misc-lvs while we're moving it to eqiad :) [17:27:43] sure [17:27:48] ori-l: asking me? [17:28:12] <^d> *misc-varnish, but yeah :) [17:29:12] oh, it always fails and think it has to do with how phantom does focus [17:29:21] <^d> mutante: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[apache2-mpm-prefork] is already defined in file /etc/puppet/manifests/webserver.pp at line 151; cannot redefine at /etc/puppet/manifests/webserver.pp:65 on node antimony.wikimedia.org [17:29:22] <^d> Bah [17:29:30] qunit tests otherwise do pass [17:29:45] no one has figured out how to make this work in jenkins with phantom [17:30:22] (CR) Andrew Bogott: [C: 2] Various dynamicproxy-api changes: [operations/puppet] - https://gerrit.wikimedia.org/r/92664 (owner: Andrew Bogott) [17:30:42] aude: how safe is https://gerrit.wikimedia.org/r/#/c/93661/ to deploy? [17:31:41] aude: also, does it need to be accompanied by a wmf-config change adding the config vars? [17:31:47] ^d: probably easiest fix to remove webserver setup from svn class.. i guess it should be a role for the entire server that includes webserver once and then gitblit and svn [17:32:00] ori-l: should be okay but might not merge cleanly into the branch [17:32:14] let me wail briefly once more [17:32:15] and then needs site group set [17:32:21] GANGLIAAAAAAAAAAA why u so finicky.?
[17:32:47] wikivoyage for all the wikivoyage sites, wikipedia for everything else (commons is treated as in wikipedia group) [17:33:09] ori-l: i can cherry pick into the branch [17:33:28] (PS1) Chad: antimony already has apache installed [operations/puppet] - https://gerrit.wikimedia.org/r/93764 [17:33:32] that'll be good, i think [17:33:53] give me 5 mine [17:33:54] min [17:35:31] (PS1) Cmjohnson: Removing search13-21 from puppet manifest [operations/puppet] - https://gerrit.wikimedia.org/r/93765 [17:35:57] (CR) Dzahn: [C: 2] "yea, that should fix the duplicate definition. webserver setup should probably not be in the same classes setting up a site anyways becaus" [operations/puppet] - https://gerrit.wikimedia.org/r/93764 (owner: Chad) [17:37:59] (PS2) Cmjohnson: Removing search13-21 from puppet manifest [operations/puppet] - https://gerrit.wikimedia.org/r/93765 [17:38:07] (CR) Cmjohnson: [C: 2] Removing search13-21 from puppet manifest [operations/puppet] - https://gerrit.wikimedia.org/r/93765 (owner: Cmjohnson) [17:47:06] !log enabled spam filtering on ops list (logging in case it holds a ton of them) [17:47:21] Logged the message, RobH [17:50:25] !log deleting empty /srv/org/mediawiki on kaulen [17:50:44] Logged the message, Master [17:56:52] !log more cleanup on kaulen, move old stuff like "1.16wmf4 pilot" :) ,"codereview proxy", old bugzillas etc to backup [17:57:02] yay [17:57:10] Logged the message, Master [17:57:13] does kaulen need mediawiki? [17:57:35] apergos: no, at least not nowadays [17:57:38] great [17:57:39] everything does!
[17:57:42] noooo [17:57:47] thank goodness [17:58:02] so RobH and cmjohnson1, question for you both, https://wikitech.wikimedia.org/wiki/User:ArielGlenn/Server_cleanup#Spares:_dns.2Fdhcp.2Fnetboot_or_not.3F [17:58:12] when you guys have a chance [17:58:37] cmjohnson1: i have those [17:58:47] those are my spares, i'll finish their cleanup since chris is in tampa [17:58:55] (the element ones at bottom of that page) [17:59:02] apergos: ok for me to yank them off yer page when i finish? [17:59:11] we don't need to remove mgmt dns or dhcpd when they're spares [17:59:14] well the q is whether we in fact do remove from netboot, dhcp, etc [17:59:18] ^^ [17:59:25] exactly that, do we or don't we [17:59:32] we don't [17:59:36] correction! [17:59:40] we kinda do now [17:59:45] cmjohnson1: we changed lifecycle doc last week [17:59:50] So if its spare, we dont remove its mgmt [18:00:00] we do however completely pull out dhcp lease entry [18:00:02] I don't have a horse in this race but if you guys agree on whatever, I'll do that whatever. [18:00:04] as those are FQDN [18:00:45] so the idea is a spare server may or may not use the same fqdn, as they go internal or external vlans [18:00:45] if its spare, its name is more than likely stable, and its mgmt is stable [18:01:00] but its production vlan is not [18:01:00] cmjohnson1: so in discussion last week about this, i modified the decom part of lifecycle [18:01:04] and apergos did as well [18:01:17] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission [18:01:19] okay..i wasn't part of that discussion..so yeah if that is what you came up with that works [18:01:25] Remove from puppet stored configuration files. [18:01:26] remove from site.pp (puppet:///manifests/site.pp) [18:01:26] remove from netboot.cfg (puppet:///files/autoinstall/netboot.cfg) [18:01:26] remove from DHCPD lease file (puppet:///files/dhcpd/linux-host-entries.ttyS...
filename changes based on serial console settings) [18:01:30] well it's up for discussion (in my book) [18:01:30] cmjohnson1: Yea, didnt mean to exclude [18:01:44] in fact i was on the side of 'dont delete' [18:01:47] i.e. I just kinda nodded, it doesn't impact my issues (monitoring, puppet, etc) [18:01:48] mutante was also in discussion and pointed out what i did above [18:01:48] and i changed my mind [18:02:07] nbd really probably good practice to remove [18:02:08] since i could decom say rubidium, on public vlan [18:02:16] later then use internal vlan [18:02:19] so yea, safer to remove all but mgmt [18:02:27] (which doesnt change based on system use, merely on site) [18:02:48] what do we think about netboot.cfg? keep? toss and let new use determine what it gets? [18:02:52] so yea, when a system is set to spare, we remove all trace of it except for its mgmt in dns, and its entry in racktables [18:02:56] toss [18:03:04] cuz it could change its partitioning needs [18:03:10] based on system role [18:03:29] it seems like 25% of the time we end up juggling hard disks in them as well between installs [18:03:34] adding or removing ssd, etc. [18:03:36] I didn't know that [18:04:20] ok. so... [18:04:57] RobH: you wanted to claim the last cleanup on the elements at the end? this is cool [18:05:05] https://gerrit.wikimedia.org/r/#/c/93680/ that's my dns change if you want it [18:05:06] yep, you have patchset in dns for it? [18:05:13] beat me to it [18:05:47] and yes you can toss from the page when done (as well as mark in lists earlier in page) [18:06:04] <^d> bblack: Since I'm trying to decom formey, did you ever end up grabbing those final svn backups and stashing them $somewhereSafe like we discussed? [18:06:06] my hope is in the end we have a bunch of spares marked up and everything else is gone [18:06:15] apergos: fyi, kaulen had /srv/org/mw because it once was download.mediawiki. and then there was a ticket about getting a new download server for mw.
and meanwhile it's a cluster redirect to https://www.mediawiki.org/wiki/MediaWiki because that's where you find the downloads [18:06:27] right [18:06:29] ^d: me? [18:06:40] <^d> Whoops, not you, sorry. [18:06:47] there still is that ticket and there still should be one... sometime [18:06:53] <^d> akosiaris1: Around? :) [18:06:56] apergos: yes [18:07:16] hrmm [18:07:20] where the fuck is my ticket on assigning strontium [18:07:28] added that redirect some months ago so it didn't point to empty dir on kaulen [18:07:33] those two are the new puppetmasters aren't they? [18:07:43] * apergos goes to look at site.pp [18:07:59] yay [18:08:04] 5985 [18:08:10] i didnt pull off my spares page =P [18:08:25] yeah [18:08:49] do the cp10* hosts count as new spares? I guess they are almost done [18:09:44] if they are fully decom'd and aren't going to be cp servers then yea [18:09:53] but they'll need the server names reset to asset tag [18:09:55] and such [18:10:15] cp1021-36, cp1041-42 [18:10:24] 5981 [18:11:40] well [18:11:43] almost gone, there's just cp1036 which still has a puppet cert [18:11:44] this isnt resolved [18:11:47] you guys cannot just decom stuff and NOT EMAIL ME [18:11:50] =p [18:12:07] you get all the rt mail :-P [18:12:21] ^d: yes [18:12:37] apergos: thats not telling me [18:12:38] <^d> akosiaris1: Hi :) Since I'm trying to decom formey, did you ever end up grabbing those final svn backups and stashing them $somewhereSafe like we discussed? 
[18:12:42] assigning a ticket to me is telling me [18:12:48] expecting me to read all of RT isnt cool =[ [18:13:15] that's how some of us get our news :-P [18:14:31] if you guys think that having the person who assigns spares have to find out about them by reading of all RT [18:14:37] then we are going to have the issue we have now unknown spares [18:14:46] you're already on that ticket weighing in [18:14:57] we should assign the decom tickets from one person to the next person then [18:15:01] and it wasn't assigned to you cause chris took it to do it [18:15:18] but if there is a good method to make sure you are notified of all decoms then let's do it [18:15:34] (PS1) Aude: Add siteGroup setting for Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93773 [18:16:11] but hey, im not gonna argue it anymore, too busy. [18:16:19] one sec before approving settings [18:16:20] ori-l: ^ [18:16:30] i am just double checking i have commons correct [18:16:44] sigh [18:16:44] im not saying i need to decom [18:16:47] but when a server is decom'd [18:16:48] whoever does it should not resolve ticket [18:16:52] but assign it to me so i can update my spares pool [18:17:04] cuz if it doesnt then it sits there in rack and no one knows [18:17:19] ok, if I wind up doing a decom (hey you never know) then I will make sure to do this [18:18:09] eh yea, it was simply resolved too early then [18:18:13] if the procedure gets to be well documented and regular we can open it up to lots of folks so [18:18:30] having them know to assign to you before wipe is good [18:18:49] or after? [18:19:01] (CR) Aude: [C: -1] "looking at this again, I think we want commons to have a different setting" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93773 (owner: Aude) [18:19:25] if you wanted to you could even rename ticket. before "remove from puppet", then "wipe the disk" etc.. instead of calling all of it decom..
but shrug [18:19:35] to represent the steps [18:19:42] often there is a separate ticket for wipe (for tampa) [18:20:17] ^d: seems like everything was setup to do that but never attributed to formey itself. Wanna wait about an hour ? [18:20:33] <^d> akosiaris1: Well it's just a one-time backup. [18:21:36] <^d> akosiaris1: 4 gzipped svn dumps, totalling 14GB :) [18:21:50] (PS2) Aude: Add siteGroup setting for Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93773 [18:22:05] ori-l: ready ^ [18:22:07] lifecycle updated to say 'if spare, assign ticket to robh' [18:22:07] they dont have to do more than that, so issue now resolved for future. [18:23:09] ^d: ok. gimme an hour and it will be backed up. Btw... the svn::server class is going away after that or are we keeping it ? [18:23:09] so it's assign to you at the end after the wipe, yes? [18:23:32] <^d> akosiaris1: Keeping, it's still a read-only service we're just moving out of Tampa. [18:23:50] <^d> Just don't want to waste time moving backups around that aren't going to change anymore. [18:24:03] who is responsible for tmh1 and tmh2? (video encoders) [18:24:58] ok then. I will place the definition there so that all svn::servers from now on get backed up and scheduling a run to backup /svnroot [18:26:20] cmjohnson1: not sure anyone is [18:26:21] <^d> akosiaris1: Not all of /svnroot. Just /svnroot/final-backup/ [18:26:30] used to be notpeter I guess on our side [18:26:42] <^d> There's no point in having continuing backups.
[18:26:57] cmjohnson1: i think they are going to be spare [18:27:03] lemme check into it and let you know in a few [18:27:18] okay..thx [18:29:45] (PS1) Akosiaris: Backup svn::server svnroot [operations/puppet] - https://gerrit.wikimedia.org/r/93775 [18:30:49] ah so RobH yeah strontium and palladium are marked up as the new puppetmasters in site.pp [18:31:04] (CR) Akosiaris: [C: 2] Backup svn::server svnroot [operations/puppet] - https://gerrit.wikimedia.org/r/93775 (owner: Akosiaris) [18:31:27] yep, i found ticket for it [18:31:29] and pulled off my spares [18:31:35] cool [18:44:31] cmjohnson1: so tmh is interesting case [18:44:37] it seems the eqiad ones are online and such [18:44:44] but since tampa is not doing most things, those are set to false there [18:44:48] class { role::applicationserver::videoscaler: run_jobs_enabled => false } [18:46:22] hrmm [18:46:22] so i dont really see a resolution [18:46:33] rt3298 [18:46:33] ie: who handled them after os [18:48:12] RobH: paravoid : that's Jan Gerber, like in the recent mails about libav [18:48:53] <^d> mutante: rsync'ing data over now :) [18:49:03] ^d: cool:) [18:49:24] cmjohnson1: So the tmh1/2 are not active on the cluster right now it seems [18:49:24] Does anyone know which of the deployment-prep hosts on beta labs are varnish systems? Is it all 4 of the deployment-cache servers? Any others? [18:49:27] im still investigating [18:53:29] kaldari: i would expect just the -cache ones, but to really check you can look at the roles applied to the instances in labsconsole/wikitech [18:54:15] like if you go to configure an instance there should be checkboxes [18:54:53] LeslieCarr: hey, question for you [18:54:54] mutante: thanks!
I'll try that [18:55:15] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/93752/ & https://gerrit.wikimedia.org/r/#/c/93756/ [18:55:41] LeslieCarr: these are fine for starters, but I'd like to know which ones we should/can add next and where should we draw the line [18:56:02] LeslieCarr: it depends a lot on network connectivity between states and I'm not that familiar with it [18:56:10] I thought you might be :) [18:57:50] yes, take it easy [18:58:00] ulsfo has only a combined 10G link for transit & transport [18:58:09] and an as yet almost unused peering port [18:59:14] sure, but it's useful to know topology/latency nevertheless [18:59:25] sure [18:59:31] all i'm saying, don't change many states at once ;) [18:59:39] CA might be a big one? [19:00:07] I did california, oregon, washington, hawaii and alaska [19:00:27] with the guess that california would account for most of that, so the rest don't matter as much [19:00:32] yup [19:00:38] 38M, should be fine [19:01:16] I also did (prepared) british columbia and yukon, which is also peanuts :) [19:02:49] (CR) Ori.livneh: [C: 2] Add siteGroup setting for Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93773 (owner: Aude) [19:03:04] paravoid: just checked the test results from the weekend, cassandra ran out of heap space again with the default config [19:03:18] upped it back to 7G now, will restart the test [19:03:22] mark: 54 million in total [19:03:50] gwicke: cool [19:04:08] http://www.wolframalpha.com/input/?i=what+is+the+population+of+california%2C+washington%2C+oregon%2C+hawaii%2C+alaska [19:06:44] !log ori synchronized wmf-config/CommonSettings.php 'I4d6862c3a: add wmgWikibaseSiteGroup' [19:06:58] Logged the message, Master [19:07:37] !log ori synchronized wmf-config/InitialiseSettings.php 'I4d6862c3a: add wmgWikibaseSiteGroup' [19:07:48] ori-l: fixed?
[19:07:55] Logged the message, Master [19:08:18] s/fixed/is that the fix for the memcached increased traffic bug?/ [19:08:56] paravoid: it should, two more syncs needed tho [19:09:04] !log ori synchronized php-1.23wmf2/extensions/Wikibase 'I9d52070a4: Re-introduce siteGroup setting for performance reasons' [19:09:05] cool [19:09:22] Logged the message, Master [19:09:37] !log ori synchronized php-1.23wmf1/extensions/Wikibase 'I9d52070a4: Re-introduce siteGroup setting for performance reasons' [19:09:57] Logged the message, Master [19:10:11] thanks ori-l [19:10:51] greg-g: what was the issue with the deploy of wmf1.23wmf2? [19:10:56] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Memcached%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1383675029&g=network_report&z=large [19:10:59] is this documented somewhere? [19:11:10] AaronSchulz: \o/ [19:11:27] aude: ^ [19:11:29] good nosedive [19:11:34] thanks again for the quick fix [19:12:55] (PS1) Jforrester: Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93780 [19:13:06] (CR) jenkins-bot: [V: -1] Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93780 (owner: Jforrester) [19:13:55] (PS2) Jforrester: Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93780 [19:13:59] (PS4) Nemo bis: Simplify misc::maintenance::update_special_pages a bit [operations/puppet] - https://gerrit.wikimedia.org/r/90117 [19:14:10] (CR) jenkins-bot: [V: -1] Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93780 (owner: Jforrester) [19:14:41] (PS3) Jforrester: Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - 
https://gerrit.wikimedia.org/r/93780 [19:16:30] (PS1) Chad: Puppetize enabling Apache/SVN module [operations/puppet] - https://gerrit.wikimedia.org/r/93781 [19:16:37] matanya: I need to copy/paste the email to a pastebin, it was sent to an engineering only list, basically: bad deploy causing fatal exceptions due to a botched deploy of a new extension. [19:17:53] thanks greg-g, details would be welcome, if possible to share of course [19:19:16] ori-l: still lots of memcached-serious spam though [19:19:29] memcached is serious business [19:19:42] (CR) GWicke: [C: -2] "This depends on Parsoid master being deployed. Please hold back until that is the case." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93780 (owner: Jforrester) [19:19:55] and those redis auth errors are annoying too [19:20:04] that must be some kind of client or server bug [19:20:30] I know some clients forget to use the password on reconnect...not sure if it's phpredis [19:20:46] (PS2) Chad: puppetize enabling apache svn authz module [operations/puppet] - https://gerrit.wikimedia.org/r/93781 [19:24:57] paravoid: write tests are restarted [19:28:15] !log aaron synchronized php-1.23wmf2/includes/clientpool/RedisConnectionPool.php '9e0a5d83fc5350218336a65130e0bb7fd1a5b4a1' [19:28:30] Logged the message, Master [19:30:07] (CR) Dzahn: [C: 1] "all good, mini style improvement, quote it to avoid: WARNING: unquoted resource title on line 45" [operations/puppet] - https://gerrit.wikimedia.org/r/93781 (owner: Chad) [19:41:19] (CR) RobH: [C: 2] RT: 6161 remove all traces of marmontel [operations/puppet] - https://gerrit.wikimedia.org/r/93758 (owner: RobH) [19:41:43] (CR) Dzahn: [C: 2] puppetize enabling apache svn authz module [operations/puppet] - https://gerrit.wikimedia.org/r/93781 (owner: Chad) [19:42:07] (CR) RobH: [C: 2] remove prod dns for spares argon, cobalt, neodymium, promethium [operations/dns] - 
https://gerrit.wikimedia.org/r/93680 (owner: ArielGlenn) [19:43:10] (PS1) RobH: removing spares from dhcpd argon neodyium promethium [operations/puppet] - https://gerrit.wikimedia.org/r/93789 [19:44:25] gwicke: have a sec? [19:44:34] (PS1) Dzahn: quoting fixes in svn.pp (puppet-lint) [operations/puppet] - https://gerrit.wikimedia.org/r/93791 [19:44:58] matanya: sure [19:45:05] (PS2) RobH: removing spares from dhcpd argon neodyium promethium [operations/puppet] - https://gerrit.wikimedia.org/r/93789 [19:45:53] gwicke: in modules/mediawiki/templates/jobrunner/jobs-loop.sh.erb you have added a todo a month ago-ish, to remove Legacy Parsoid jobs [19:45:58] damn you rebase [19:46:04] (CR) Dzahn: [C: 2] quoting fixes in svn.pp (puppet-lint) [operations/puppet] - https://gerrit.wikimedia.org/r/93791 (owner: Dzahn) [19:46:24] are they drained already, or still in use? [19:47:09] (CR) RobH: [C: 2] removing spares from dhcpd argon neodyium promethium [operations/puppet] - https://gerrit.wikimedia.org/r/93789 (owner: RobH) [19:47:45] matanya: yes, those are now drained and can be removed [19:47:50] (PS1) Chad: Switch SVN to its new home in eqiad [operations/dns] - https://gerrit.wikimedia.org/r/93792 [19:47:51] those job runners are idle [19:48:03] (PS3) Dzahn: ensure /srv/org/wikimedia exists on bugzilla server ..and some minor formatting [operations/puppet] - https://gerrit.wikimedia.org/r/93503 [19:48:44] !log traces of neodyium, promethium, & argon removed and reclaimed to eqiad spare [19:48:54] !log marmontel decommissioned for shipment to eqiad [19:49:01] Logged the message, RobH [19:49:08] ^d: tell me when it's ready to switch:) [19:49:13] Logged the message, RobH [19:49:14] <^d> One more test. [19:49:23] gwicke: i'll push a patch [19:49:28] ^d: might run puppet one more time after my lint fix [19:50:30] matanya: thanks!
[19:50:39] (CR) Dzahn: [C: 2] ensure /srv/org/wikimedia exists on bugzilla server ..and some minor formatting [operations/puppet] - https://gerrit.wikimedia.org/r/93503 (owner: Dzahn) [19:51:11] matanya: apart from the job runners there is also a rule to exclude those jobs from the default queue; that can be removed too [19:51:34] gwicke: where is that rule? [19:51:46] matanya: looking.. [19:52:54] <^d> mutante: Can still checkout stuff anonymously over http(s) as designed, svn+ssh access is no more \o/ [19:52:57] <^d> Viewvc working. [19:53:01] <^d> I think we can swap now [19:53:36] matanya: in mediawiki-config/wmf-config/jobqueue-equiad.php, the first of the parsoid jobs [19:54:27] ok, i'll clone that and patch there too gwicke [19:54:47] (CR) Dzahn: [C: 2] "< ^d> mutante: Can still checkout stuff anonymously over http(s) as designed, svn+ssh access is no more \o/" [operations/dns] - https://gerrit.wikimedia.org/r/93792 (owner: Chad) [19:54:49] matanya: awesome, thanks [19:55:23] !log DNS update - switching SVN over to antimony (eqiad) [19:55:28] (PS1) Cmjohnson: Removing search13-20 from dsh and dhcpd files [operations/puppet] - https://gerrit.wikimedia.org/r/93796 [19:55:32] (PS1) RobH: removing spare servers from dsh node groups [operations/puppet] - https://gerrit.wikimedia.org/r/93797 [19:55:38] Logged the message, Master [19:56:15] ;; ANSWER SECTION: [19:56:15] svn.wikimedia.org. 3600 IN CNAME antimony.wikimedia.org. [19:56:20] ^d: :) [19:56:31] <^d> :) [19:57:08] yay, thanks:) i'm looking for RT to close [19:57:10] <^d> We should wait a few days to make sure nobody spots anything broken [19:57:20] <^d> But should be ready to decom after that [19:57:36] ^d: is there more stuff on formey or was that all?
[19:57:40] (CR) RobH: [C: 2] removing spare servers from dsh node groups [operations/puppet] - https://gerrit.wikimedia.org/r/93797 (owner: RobH) [19:58:11] <^d> mutante: Ryan said something about using it for a git-deploy test host. [19:58:16] <^d> But dunno what the status of that is. [19:58:19] apergos: so on your list of spares on the misc server page of yours [19:58:27] all the spares at bottom i put strike through, they are handled now. [19:58:36] <^d> mutante: That's the last thing I know of though. [19:58:42] ok. I'll clean up the lists tomorrow, thanks! [19:58:43] ^d: gotcha, ok, and i don't see any other services in DNS pointing to formey either [19:58:45] they now only exist on racktables and mgmt dns entries, as spares should [19:58:47] nice [19:58:50] sweet! [19:58:52] (PS1) Matanya: jobrunner: remove legacy parsoid jobs [operations/puppet] - https://gerrit.wikimedia.org/r/93798 [20:00:02] gwicke: ^ [20:00:41] (PS2) Cmjohnson: Removing search13-20 from dsh and dhcpd files [operations/puppet] - https://gerrit.wikimedia.org/r/93796 [20:01:57] (CR) Cmjohnson: [C: 2] Removing search13-20 from dsh and dhcpd files [operations/puppet] - https://gerrit.wikimedia.org/r/93796 (owner: Cmjohnson) [20:06:36] (PS7) Dzahn: Update font packages to not use virtual packages [operations/puppet] - https://gerrit.wikimedia.org/r/88441 (owner: Reedy) [20:08:21] (PS1) Matanya: jobrunner: remove legacy parsoid jobs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93800 [20:08:24] ^d: /root on formey might be worth backing up for historical reasons.. mysqlfb.tar.gz , phase3.tar.gz, userstofix.txt, svnusers-delete.ldif.. shrug [20:08:52] bunch of .ldif [20:09:10] i thought you guys backed it all up [20:09:17] so now its not backed up >_< [20:09:22] ? [20:09:30] gwicke: both pushed. please review in your spare time and merge :) [20:09:49] So who is taking care of formey data copy?
[20:09:55] one person needs to be the team lead on this server. [20:10:01] so i can make the ticket their problem. [20:10:49] <^d> RobH: I was taking care of migrating all the SVN stuff over, that's it. I can wash my hands of anything else. [20:10:54] k [20:11:08] i guess mutante is copying it to tridge [20:11:15] so when its done we'll add to ticket and email ops list [20:11:24] saying this is where yer shits at, formey is deeeeeaaad [20:11:46] then someone can chime in that i should have told them before emailing ops list and ops list isnt appropriate way to check on server status [20:11:49] like last month [20:14:46] <^d> List archives -> totally legit way to keep track of things. [20:16:08] !log backing up /root on formey to tridge [20:16:24] Logged the message, Master [20:17:10] !log shutting down search13-20, decommissoned [20:17:25] Logged the message, Master [20:20:11] Reedy: /home/reedy/build/* on formey fwiw [20:21:00] mutante: i would like to help with rt tickets, who should i talk to and find out if it is needed and if so, how do i sign an NDA? [20:22:01] matanya: unfortunately i don't know the answer to how to sign an NDA, but if you have one just create a ticket to get an account and happy to create it [20:22:24] (PS1) Cmjohnson: Removing dns entries for search13-20 [operations/dns] - https://gerrit.wikimedia.org/r/93803 [20:22:53] probably legal dep. [20:23:13] ksnider: do you know? [20:24:49] you have to get an nda signed like the other volunteers yea [20:24:52] ct handled with legal before [20:24:57] so it would be ksnider with legal now [20:25:11] we have two other volunteers in there that i am aware of [20:25:25] RobH: petan and who else? [20:25:42] THO [20:26:17] (CR) Cmjohnson: [C: 2] Removing dns entries for search13-20 [operations/dns] - https://gerrit.wikimedia.org/r/93803 (owner: Cmjohnson) [20:26:19] matanya: Legal is still figuring out how best to track these, but I don't think this should hold up your signing.
I'll get this sorted. Thanks for your patience. :) [20:26:45] if petan is in there [20:26:46] then we have three. [20:26:48] !log dns update [20:26:51] thanks for this ksnider. I come to serve :) [20:27:02] Logged the message, Master [20:27:09] Don't we all. ;) [20:27:16] matanya: So just a heads up when you do get access to RT. Please never, ever respond directly to vendors =] [20:27:27] otherwise yea, its all good [20:27:43] RobH: you sound like my boss :) [20:28:14] i can fix that i bet. 'here is a 25% raise, you should take an additional two months of paid vacation yearly.' [20:28:42] if that still sounds like your boss, you have a better job than i do ;] [20:28:53] actually he said the first part a month ago RobH :P [20:29:25] congrats then [20:29:27] heh [20:29:42] (i guess, it really matters on scope doesnt it, but lets go with congrats! ;) [20:30:10] with raise comes more job, but well, meh [20:31:33] RobH: mind merging my recent patches? [20:32:44] the parsoid jobrunner stuff? [20:32:52] ottomata: how much of this deploy do you want to do? want to make the mediawiki-config changes, for example? [20:33:20] yes! [20:33:28] sorry, i will be with you in ummm 7 mins? [20:33:29] 5 mins. [20:34:00] yes RobH [20:34:23] this is independent of the mobile frontend one that i dont wanna even touch right? ;] [20:34:37] i can handle those yea [20:34:56] ^d: Job: formey.wikimedia.org-Monthly-1st-Sat-production-svnroot.2013-11-05_18.40.08_03, Termination: Backup OK [20:35:20] and with that... day's end [20:35:32] ^d: RobH: /root and /home and /etc backed up (formey). [20:36:35] ok, ill take ticket [20:36:35] and do email to ops that its gonna die [20:36:48] and folks have a week to get their shit [20:38:07] yep, agreed, good [20:38:26] it's very unlikely but good to give it a few days [20:38:40] in the meantime. I am going to pull it from monitoring and other related decom groups [20:38:51] (PS1) Ottomata: Fixing up Kafka object queries.
Many MBeans had to include escaped double quotes. [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93806 [20:38:51] going to leave its dns alone and network alone [20:38:59] but it doesnt need to do puppet stuffs [20:39:05] !log formey puppet agent disabled [20:39:12] (CR) Ottomata: [C: 2 V: 2] Fixing up Kafka object queries. Many MBeans had to include escaped double quotes. [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93806 (owner: Ottomata) [20:39:20] Logged the message, RobH [20:39:40] there's still a bunch of sudo_privs on formey in site.pp [20:39:43] for LDAP stuff [20:39:47] (PS1) Ottomata: Updating kafka module to use better jmxtrans queries [operations/puppet] - https://gerrit.wikimedia.org/r/93807 [20:39:57] (PS2) Ottomata: Updating kafka module to use better jmxtrans queries [operations/puppet] - https://gerrit.wikimedia.org/r/93807 [20:39:57] mutante: eh? [20:40:06] (CR) Ottomata: [C: 2 V: 2] Updating kafka module to use better jmxtrans queries [operations/puppet] - https://gerrit.wikimedia.org/r/93807 (owner: Ottomata) [20:40:08] mutante: meaning what though, that i shouldnt have just killed its puppet stuff? [20:40:14] RobH: i mean what is in site.pp for formey currently [20:40:33] !log re-enabling puppet on analytics1021 [20:40:39] cuz i just killed puppetstoredconfig, puppetkey and salt key [20:40:39] oh, ok [20:40:47] i'm going to yank that via patchset and link in my email [20:40:47] and not merge it., [20:40:48] Logged the message, Master [20:41:17] RobH: so far just pointing it out to make sure it's also been migrated [20:41:20] yea [20:41:25] ok manybubbles [20:41:31] f.e. sudo_user { [ "robla", "sumanah", "reedy" ]: privileges => $sudo_privs } [20:41:44] oh, meh [20:41:45] that is to let them modify LDAP users [20:41:51] where is that now? [20:42:00] ok wait, if it's going to be up and accessible, it should run puppet [20:42:04] i mean, where do they go now, or was it still formey?
[20:42:28] yeah manybubbles, i want to do everything! [20:42:39] apergos: its only accessible to pull data in the week before its powered off [20:42:44] its otherwise decommissioned [20:42:55] so let puppet be running on it during that week [20:42:57] i can make it work again [20:42:58] ottomata: ok! so the first thing is to regiser the new wikis as cirrus secondaries in mediawiki-config. [20:43:00] oh. no certs eh? [20:43:00] but now i have to sign keys [20:43:03] RobH: let's have Ryan review .. hmm [20:43:19] oook [20:43:21] never done this before! [20:43:32] meh [20:43:35] i dunno why im invovled then [20:43:40] cuz i thoguht it was coy data and decom [20:43:42] now its not decom [20:43:46] so i'll resign all the certs and such now [20:43:57] that's what we're trying to find out with this ticket [20:44:12] no one should involve me in the decom until its decom ;p [20:44:20] n theory when we disable puppet we're committing to power off... yep, agree [20:44:24] *in theory [20:44:40] dang it tricked... into talking about work again [20:44:44] * apergos goes away [20:45:13] yea, the usual problem. finding out if it can be decom'ed [20:45:56] no way without team work [20:46:40] i'm gonna comment on ticket [20:46:56] matanya: sorry, im working on fixing this mistake beofre i go back to your patchset [20:47:06] puppet key signed for formey [20:47:15] salt isnt pending yet, when it shows in queue it can be signed [20:47:21] i unassigned myself from the ticket [20:47:27] no problem. if you ask me, leave formey for a week [20:47:44] that was the plan [20:47:55] with puppet, i mean [20:47:56] but i think it should sit in mostly decom state to ease of use later [20:48:02] but its now back to full service for the week. 
[20:48:10] cuz i dont care enough to work on it any longer than i must [20:48:18] =P [20:48:19] moving nagios and the like is half decom'ed :) [20:49:09] (03PS1) 10Hashar: beta: log exceptions in exception-json bucket [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93811 [20:50:32] (03PS1) 10Ottomata: Setting up CirrusSearch for 5 wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93812 [20:52:11] (03CR) 10RobH: "chatted with matanya and gabriel about this, merging" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93798 (owner: 10Matanya) [20:52:20] hrmm [20:52:26] needs verified, ohhh [20:52:29] trusted user thing [20:52:30] hrmm [20:52:39] RobH: Let Jenkins do its thing [20:52:42] i am not sure on the appropriate way to proceed [20:52:43] RoanKattouw: it did [20:52:52] it only did +1 though as matanya isnt trust whatever [20:52:53] i guess [20:52:59] RobH: do you have a link to the other patch? [20:53:00] Yeah no I mean gate-and-submit [20:53:04] Or is that not set up for that repo? [20:53:08] didn't get a mail for some reason [20:53:09] https://gerrit.wikimedia.org/r/#/c/93798/ [20:53:23] RobH: Give it CR+2 but no V vote [20:53:31] Then in theory I believe Jenkins should come along and run the full tests [20:53:32] RobH: is that the only one? [20:53:35] RobH: i'm trusted only on my secondry mail [20:53:42] ping hashar to fix :) [20:54:08] gwicke: so one for puppet main repo, one for mediawiki [20:54:08] https://gerrit.wikimedia.org/r/#/c/93798/ [20:54:08] yer a reviewer on both [20:54:09] So I don't do a lot of reviews for others, so I don't normally have this issue. [20:54:20] But what is the policy to get this approved properly? We cannot add every user to trust tested [20:54:24] trusted tests even. 
[20:54:27] bleh [20:54:30] (03CR) 10Catrope: [C: 032] "Voting +2 per Rob" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93798 (owner: 10Matanya) [20:54:44] is there some flag that can get set to fire off the tests that only trusted submissions get? [20:54:51] (03CR) 10GWicke: [C: 032] jobrunner: remove legacy parsoid jobs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93800 (owner: 10Matanya) [20:54:54] RobH: Now in theory ---^^ should make it rerun the tests [20:54:54] seems like we would want those tests run before its in production no? [20:54:58] here's how to whitelist a user: https://gerrit.wikimedia.org/r/#/c/92773/ [20:55:01] (03Merged) 10jenkins-bot: jobrunner: remove legacy parsoid jobs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93800 (owner: 10Matanya) [20:55:02] RobH: thanks [20:55:07] oh, so my +2 and NO merge should run tests? [20:55:08] That's how MediaWiki is set up at least, I don't know if puppet is different [20:55:11] RoanKattouw: ? [20:55:14] I think so [20:55:19] At least that's how it works for MW and VE [20:55:23] cool, its way simpler than i thought. [20:55:27] But ... not in puppet? [20:55:30] gotta love the regex [20:55:33] https://gerrit.wikimedia.org/r/#/c/92773/1/layout.yaml [20:55:34] ahh, mails now arrived [20:55:37] Cause it's not running it [20:55:47] stupid gmail imap slowness I guess [20:55:51] matanya: you can whitelist yourself by submitting a change in integration/zuul-config.git look at the history for past examples . Hint: you have to put your regex quoted email at two places. [20:56:02] hashar: Is gate-and-submit not set up for the puppet repo? (See https://gerrit.wikimedia.org/r/93798 ) [20:56:07] matanya: see my link above when i did that for odder [20:56:19] hrmm [20:56:26] i have no clue whats up with it now, it hates it [20:56:27] heh [20:56:43] RoanKattouw: it is not. 
Only operations members (+ bunch of blessed mortals) are allowed to submit patches [20:56:50] RoanKattouw: or that would give me root access :D [20:56:52] hrmm [20:56:53] https://gerrit.wikimedia.org/r/#/c/93800/ shows some build failure after the merge [20:56:58] mutante: hashar it is just a simple patch i can submit, but it looks bad to whitelist yourself, isn't it? :) [20:57:02] So how do I get it to run full tests? [20:57:09] (03CR) 10Catrope: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93798 (owner: 10Matanya) [20:57:16] indeed, i need full testing on a non ops member patchset in puppet/operations [20:57:25] so how does one do that? [20:57:34] RoanKattouw: one has needs its email address to be added to a manually maintained whitelist which is in Zuul. [20:57:41] eww [20:57:45] hashar: Well sure that's how you get to be trusted [20:57:50] i dont wanna make him trusted [20:57:55] RobH: by making them whitelisted users, see above [20:57:55] ahh [20:57:58] i just wanna run trusted tests manually against his patchset [20:58:01] hashar: But if a non-trusted user puts a change in we still need to be able to run all tests after review [20:58:07] on a per patchset basis [20:58:13] so to run tests, you would have to resend the patch since you are trusted that will trigger tests [20:58:17] mutante: thats overkill solution [20:58:20] ... [20:58:26] that workflow sucks! ;] [20:58:33] the simplest way is to edit the commit summary and insert a newline above the Change-Id: line [20:58:34] Can we change it? [20:58:38] having to edit that regex kind of does, yea [20:58:52] Gerrit would strip the new line and create a new patchset for you with you as a committer [20:58:59] and that would trigger the trusted tests. 
[20:59:00] so matanya, sorry for using you as an example but [20:59:08] feel free [20:59:18] I dont wanna make matanya a trusted user for our repo, just run tests against patchsets [20:59:18] i'm a walking qa :) [20:59:31] RobH: i'm already trusted [20:59:34] that workflow will be fixed whenever I get time to integrates a farm of isolated VM in a labs project :D [20:59:41] just not with this mail address [20:59:51] hashar: so i can cherrypick his commit and change and get the tests? [20:59:57] or i have to make a new patchset from it? [21:00:02] (03CR) 10Manybubbles: [C: 04-1] "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93812 (owner: 10Ottomata) [21:00:02] RobH: yeah or edit the commit isummarty directly in gerrit [21:00:14] (03CR) 10Anomie: [C: 031] "Seems right to me." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93811 (owner: 10Hashar) [21:00:19] RobH: adding a new line above the Change-Id would be fine. [21:00:24] anomie: oh common +2 :D [21:01:18] k [21:01:18] thats not horrible [21:01:32] lets see if that works [21:01:36] hashar: I don't usually +2 operations/mediawiki-config changes, because I'm not usually prepared to deploy them [21:01:36] remote: (W) No changes between prior commit d965faf and new commit 370b18d [21:01:48] newline doesnt seem to be enough for git review [21:01:56] (03PS2) 10Ottomata: Setting up CirrusSearch for 5 wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93812 [21:01:57] anomie: ah that is true. Forgot about rebasing the repo on tin. I will do it [21:02:15] RobH: You can edit the commit msg in Gerrit itself [21:02:17] anomie: and thanks for all the time spend on reviewing the json serialization of exceptions. [21:02:28] (03CR) 10Hashar: [C: 032] "deploying / updating tin" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93811 (owner: 10Hashar) [21:02:33] RoanKattouw: madness? [21:02:35] how? 
[21:02:39] (Merged) jenkins-bot: beta: log exceptions in exception-json bucket [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93811 (owner: Hashar) [21:02:43] (CR) Manybubbles: [C: 1] Setting up CirrusSearch for 5 wikis: [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93812 (owner: Ottomata) [21:03:08] RobH: Notepad icon at the top right of the commit msg box [21:03:29] oh cool [21:03:36] (PS2) RobH: jobrunner: remove legacy parsoid jobs [operations/puppet] - https://gerrit.wikimedia.org/r/93798 (owner: Matanya) [21:03:46] dance gerrit puppet dance! [21:03:56] so that should trigger a trusted user test of the patchset? [21:04:04] cuz if so, that is an easy workaround. [21:04:32] hashar / RoanKattouw thx for help! [21:04:35] RobH: bonus, whitelist me on this email :) [21:04:52] i dont whitelist folks on puppetoperations [21:04:58] thats a good way to get fellow opsen angry [21:05:00] gwicke: matanya: so are you deploying the removal of the 'ParsoidCacheUpdateJob' job ? [21:05:09] that was the plan [21:05:21] but if gwicke is about now [21:05:23] I have sneaked another change in the repository, that impacts beta only [21:05:23] im gonna abaondon this to him and go make lunch. [21:05:31] not going to be an issue [21:05:31] can do [21:05:45] (but i wasnt leaving until i figured out how to do full tesitng on non trusted user patchsets) [21:05:46] (CR) Chad: [C: 2] Setting up CirrusSearch for 5 wikis: [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93812 (owner: Ottomata) [21:05:47] since that is going to happen more and more [21:05:52] gwicke: I have fetched on tin, not rebased yet. Would let you do :-] [21:05:52] RobH: as said, i'm already whitelisted [21:05:59] my change is d560c3a - beta: log exceptions in exception-json bucket [21:06:01] hashar: no, go ahead [21:06:03] matanya: ues, but not on that email!
[21:06:07] gwicke: ok :-) [21:06:08] (03Merged) 10jenkins-bot: Setting up CirrusSearch for 5 wikis: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93812 (owner: 10Ottomata) [21:06:15] so yea, i could have whitelisted you, but i wanted to learn the non whitelist work around [21:06:24] oh, you mean in regards to my 'i dont like adding whitelisted folks' [21:06:27] that makes more sense. [21:06:37] yes, in that regard [21:06:55] * RobH goes through backlog to find the whitelist location [21:06:58] matanya: are you sure the ParsoidCacheUpdateJob queue has been drained ? [21:07:07] hashar: I am [21:07:10] \O/ [21:07:11] deploying [21:07:11] yes, gwicke confimed [21:07:25] it drained roughly five minutes after we deployed the new PHP extension ~2 weeks ago [21:08:04] if the cluster dies [21:08:07] !log hashar synchronized wmf-config/jobqueue-eqiad.php 'removing ParsoidCacheUpdateJob from default queue. Drained a few weeks ago (gwicke)' [21:08:07] I look for a new job [21:08:20] !log otto synchronized wmf-config/InitialiseSettings.php 'Configuring CirrusSearch for 5 new wikis' [21:08:23] hehe [21:08:25] Logged the message, Master [21:08:38] Logged the message, Master [21:09:01] !log hashar synchronized wmf-config/logging-labs.php [21:09:17] Logged the message, Master [21:09:20] I am giving a conference in a couple weeks to expose how we handle code deployment in production [21:09:24] I should take a screenshot [21:09:53] is enwiki down? [21:09:59] (just kidding) [21:10:11] * hashar opens jobvite [21:10:19] (03CR) 10Dzahn: "still no reply from requestor or legal, stalled" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91075 (owner: 10Dzahn) [21:10:20] RobH: layout.yaml in zuul [21:10:34] RobH: e.g. https://gerrit.wikimedia.org/r/#/c/92773/1/layout.yaml [21:10:35] yea [21:10:38] i am looking at it now [21:10:45] i just said aloud 'oh god, i recall this file, this thing sucks!' [21:11:02] cmjohnson1: are you really on duty 2 weeks in a row? 
[21:11:12] reminds you of admins.pp RobH ? [21:11:20] no..i think it's ottomata [21:11:43] you might want to get out of the topic then [21:12:45] oh i am [21:12:48] crapcrap [21:12:51] sorry bout that [21:12:59] and i just somehow really fubar'd my repo. [21:12:59] sigh [21:14:43] ori-l: https://bugzilla.wikimedia.org/show_bug.cgi?id=53497 is this solved? [21:15:19] !log aaron synchronized php-1.23wmf1/includes/clientpool/RedisConnectionPool.php '9e0a5d83fc5350218336a65130e0bb7fd1a5b4a1' [21:15:35] Logged the message, Master [21:17:17] hashar: I made you reviewer on the adding whitelisted user [21:17:24] below your pay grade but still ;] [21:17:33] (03PS1) 10Cmjohnson: Removing dhcpd and netboot entries for tola,kuo,celsus and lardner [operations/puppet] - 10https://gerrit.wikimedia.org/r/93857 [21:17:49] RobH: hehe [21:17:50] matanya: so its submitted, but i want someone else to check me since i had not done that [21:17:58] RobH: you forgot an entry, each email need to be added twice :D [21:18:02] oh. [21:18:06] can you comment and i fix? [21:18:09] i wanna make it right [21:18:27] took my words out of my keyboard hashar [21:18:41] hashar: ohh, i see where [21:18:44] no need to comment [21:19:23] RobH: replied :] [21:20:33] hashar: fixed, i appreciate the help/pointers/cluing in [21:20:33] =] [21:20:52] I am going to make you the Zuul master for SF timezone :D [21:21:03] Everyone prepare for diappointment. [21:21:11] damn, so disappointing i cannot spell. [21:21:31] to be honest, that config need integration tests [21:21:36] hehe [21:21:55] i had noticed a lack of them. [21:22:31] apergos: did you fix the image tarballs dumps? [21:22:52] * hashar waits for Zuul/Jenkins [21:22:54] no. they are low priority on my list right now [21:23:27] apergos: but planned, yeah? 
[21:23:33] yes, absolutely [21:23:45] thanks [21:23:51] yw [21:25:32] (03Abandoned) 10Dzahn: add account for Gerrit Padgham and add to stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91075 (owner: 10Dzahn) [21:31:56] ori-l: 10.64.0.201 seems to have more AUTH errors than 10.64.32.76, and in any case they are all from job runners [21:32:10] RobH: while you are around I unleashed lanthanum, it is running phpunit tests nowadays :D [21:33:58] (03PS1) 10Edenhill: Configuration property namespace fixup [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/93858 [21:34:14] gerrit/zuul has moar speedz [21:34:36] (well, i assume with more testing servers it does) [21:34:52] too bad more users are using it and making it slow, damned acceptable adoption rate [21:34:54] heh [21:36:02] (03CR) 10Ottomata: [C: 032 V: 032] Configuration property namespace fixup [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/93858 (owner: 10Edenhill) [21:36:22] hashar: https://bugzilla.wikimedia.org/show_bug.cgi?id=46771 <-- i think you can comment here ... [21:45:51] matanya: posting a summary [21:46:15] matanya: err no. nothing more to add. That requires the Doxygen packages to be updated on gallium. [21:53:04] ok, thanks [22:06:13] (03CR) 10Ottomata: "Alex, you sure we want sudo_group here? We'd have to either create a new parsoidadmins posix group, or put gwicke and catrope in the exis" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [22:09:21] (03PS1) 10Hashar: udp2log: demux now filter groups stricly [operations/puppet] - 10https://gerrit.wikimedia.org/r/93864 [22:16:43] greg-g: ottomata and I are done with the active portin of our deploy. right now we expect the search indexes to finish building sometime after we're asleep. but we're totally done syncing files and stuff. We yield the remainder of our time to the floor. 
[22:17:31] jajaj, cool, be back in 10 mins [22:18:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:05] PROBLEM - RAID on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:06] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:20:06] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:20:45] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:20:55] PROBLEM - mysqld processes on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:05] RECOVERY - RAID on db1047 is OK: OK: State is Optimal, checked 3 logical drive(s), 6 physical drive(s) [22:21:15] PROBLEM - DPKG on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:45] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:21:45] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [22:22:05] RECOVERY - DPKG on db1047 is OK: All packages OK [22:22:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:35] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:24:06] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:05] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:25:06] PROBLEM - RAID on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:25:25] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:26:01] manybubbles: :) [22:26:43] ^d: I'm actually happy to report that we have real statistics now: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=testsearch1003.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1383687167&v=8623519&m=es_indexes&vl=indexes%2Fsec&ti=es_indexes&z=large [22:27:05] RECOVERY - RAID on db1047 is OK: OK: State is Optimal, checked 3 logical drive(s), 6 physical drive(s) [22:29:42] <^d> manybubbles: Fantastic! [22:32:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:35] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [22:37:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:35] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:39:35] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:39:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:40:55] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [22:41:35] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:41:45] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:58] !log sleeping. 
[22:42:00] aerh [22:42:01] no [22:42:02] cancel [22:42:04] wrong chan [22:42:09] * hashar escapes [22:42:20] Logged the message, Master [22:42:35] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2359140 seconds since restart [22:42:55] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:45:55] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [22:46:12] (03CR) 10Cmjohnson: [C: 032] Removing dhcpd and netboot entries for tola,kuo,celsus and lardner [operations/puppet] - 10https://gerrit.wikimedia.org/r/93857 (owner: 10Cmjohnson) [22:46:55] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:48:35] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [22:49:55] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [22:51:35] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:51:55] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:03:45] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:42] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:42] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:42] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:22] PROBLEM - DPKG on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:02] PROBLEM - MySQL disk space on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:12] RECOVERY - DPKG on db1047 is OK: All packages OK [23:23:42] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:52] RECOVERY - MySQL disk space on db1047 is OK: DISK OK [23:24:42] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:25:25] !log mflaschen synchronized php-1.23wmf1/extensions/GettingStarted/ 'Deploy GettingStarted to 1.23wmf1' [23:25:46] Logged the message, Master [23:25:59] !log mflaschen synchronized php-1.23wmf1/extensions/GuidedTour/ 'Deploy GuidedTour to 1.23wmf1' [23:26:16] Logged the message, Master [23:26:31] !log mflaschen synchronized php-1.23wmf2/extensions/GettingStarted/ 'Deploy GettingStarted to 1.23wmf2' [23:26:32] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [23:26:42] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:48] Logged the message, Master [23:26:57] !log mflaschen synchronized php-1.23wmf2/extensions/GuidedTour/ 'Deploy GuidedTour to 1.23wmf2' [23:27:12] Logged the message, Master [23:27:12] PROBLEM - RAID on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:16] !log Growth team deployment complete [23:27:22] PROBLEM - DPKG on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:33] Logged the message, Master [23:28:42] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:28:42] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:28:52] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:28:52] PROBLEM - mysqld processes on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:29:42] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [23:29:52] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2361970 seconds since restart [23:30:32] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [23:30:42] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:31:12] RECOVERY - DPKG on db1047 is OK: All packages OK [23:33:12] RECOVERY - RAID on db1047 is OK: OK: State is Optimal, checked 3 logical drive(s), 6 physical drive(s) [23:33:42] PROBLEM - MySQL Processlist on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:36:27] (PS1) Ottomata: Cleaning up kafka::server::jmxtrans and adding some more metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93888 [23:36:49] (CR) jenkins-bot: [V: -1] Cleaning up kafka::server::jmxtrans and adding some more metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93888 (owner: Ottomata) [23:40:47] (Abandoned) Ottomata: Cleaning up kafka::server::jmxtrans and adding some more metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93888 (owner: Ottomata) [23:41:06] (PS1) Ottomata: Cleaning up kafka::server::jmxtrans and adding some more metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93890 [23:41:19] (CR) Ottomata: [C: 2 V: 2] Cleaning up kafka::server::jmxtrans and adding some more metrics [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93890 (owner: Ottomata) [23:42:26] (PS1) Ottomata: Updating kafka module with more jmxtrans metrics to ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/93891 [23:42:34] (PS2) Ottomata: Updating kafka module with more jmxtrans metrics to ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/93891 [23:42:39] (CR) Ottomata: [C: 2 V: 2] Updating kafka module
with more jmxtrans metrics to ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/93891 (owner: Ottomata) [23:48:43] (PS1) Ottomata: Being consistent with ISRExpands capitalization in jmxtrans resultAlias [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93894 [23:49:00] (CR) Ottomata: [C: 2 V: 2] Being consistent with ISRExpands capitalization in jmxtrans resultAlias [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/93894 (owner: Ottomata) [23:50:34] (PS1) Ottomata: Updating kafka module for jmxtrans resultAlias fix [operations/puppet] - https://gerrit.wikimedia.org/r/93895 [23:50:45] (PS2) Ottomata: Updating kafka module for jmxtrans resultAlias fix [operations/puppet] - https://gerrit.wikimedia.org/r/93895 [23:50:50] (CR) Ottomata: [C: 2 V: 2] Updating kafka module for jmxtrans resultAlias fix [operations/puppet] - https://gerrit.wikimedia.org/r/93895 (owner: Ottomata) [23:50:51] (CR) jenkins-bot: [V: -1] Updating kafka module for jmxtrans resultAlias fix [operations/puppet] - https://gerrit.wikimedia.org/r/93895 (owner: Ottomata)