[00:02:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[00:56:29] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[00:58:21] hm. no clue why ldap died there
[00:58:38] nothing in the ldap server logs, nothing in syslog and nothing in dmesg
[00:58:55] that ldap server needs to be upgraded to 2.5
[01:22:30] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
[01:23:39] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms
[01:31:30] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.000356554985 secs
[01:32:30] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.004959225655 secs
[01:40:55] New patchset: coren; "LabsNFS: switch to 8k transmit size" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69024
[01:41:01] Ryan_Lane: ^^
[01:41:28] +2's
[01:41:30] +'d
[01:41:41] seems jenkins didn't run yet
[01:41:51] PROBLEM - Disk space on mc15 is CRITICAL: Timeout while attempting connection
[01:42:40] RECOVERY - Disk space on mc15 is OK: DISK OK
[01:46:10] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:10] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:10] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:10] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:10] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:10] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:11] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:12] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:12] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:13] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[01:46:13] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:56:12] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69024
[02:03:45] !log LocalisationUpdate completed (1.22wmf6) at Mon Jun 17 02:03:45 UTC 2013
[02:03:58] Logged the message, Master
[02:06:18] !log LocalisationUpdate completed (1.22wmf7) at Mon Jun 17 02:06:18 UTC 2013
[02:06:28] Logged the message, Master
[02:12:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 17 02:12:33 UTC 2013
[02:12:41] Logged the message, Master
[02:45:10] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused
[03:01:10] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:28:25] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused
[03:30:25] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[04:08:35] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:09:28] RECOVERY - DPKG on mc15 is OK: All packages OK
[04:44:52] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:02:02] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:35:26] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused
[05:45:30] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time
[06:05:08] PROBLEM - Disk space on mc15 is CRITICAL: Timeout while attempting connection
[06:05:58] RECOVERY - Disk space on mc15 is OK: DISK OK
[06:11:26] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused
[06:11:26] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused
[06:16:27] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[06:21:47] PROBLEM - DPKG on mc15 is CRITICAL: Timeout while attempting connection
[06:23:36] RECOVERY - DPKG on mc15 is OK: All packages OK
[06:27:26] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time
[06:31:26] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[06:35:06] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused
[06:56:05] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time
[07:19:35] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused
[07:40:32] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time
[07:40:38] New patchset: ArielGlenn; "make media storage in beta labs more like production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68172
[07:59:53] New review: ArielGlenn; "note that squid conf changes already had errors (fixed)" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/68172
[07:59:53] Change merged: ArielGlenn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68172
[08:01:42] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.00147998333 secs
[08:03:43] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.003772616386 secs
[08:50:08] New patchset: Akosiaris; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999
[08:51:34] New review: Akosiaris; "Fixed an error lintian was producing about zero debian revision. Other than that we are good to go." [operations/debs/buck] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/67999
[09:15:26] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused
[09:17:56] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:36] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[09:26:59] New review: Hashar; "This is unfortunately not going to work. The hookhelper needs to be slightly enhanced." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69010
[09:39:23] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time
[10:04:33] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused
[10:23:28] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[10:47:24] New patchset: Hashar; "conf: publish 'all-labs.dblist'" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69051
[10:48:16] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69051
[10:49:52] !log manually rebased mediawiki-config on fenari 7d10d3f --> 1d2093e
[10:50:01] Logged the message, Master
[11:19:13] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[11:19:31] New patchset: Jeroen De Dauw; "Update Wikidata and SMW IRC notification repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69010
[11:20:24] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[11:32:43] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:32:45] New patchset: Hashar; "beta: cleanup math configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69061
[11:33:16] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69061
[11:33:33] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[11:47:04] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:04] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:04] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:04] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:05] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:06] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:06] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[11:47:07] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:51:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:52:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[12:14:15] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused
[12:22:16] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
[12:24:45] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused
[12:31:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:32:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[12:39:11] !log restarting opendj on sanger
[12:39:20] Logged the message, Master
[12:40:14] RECOVERY - LDAP on sanger is OK: TCP OK - 0.027 second response time on port 389
[12:43:14] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[12:48:34] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[12:49:38] ugh so opendj dies shortly after it's restarted
[12:49:46] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time
[12:49:57] which stops all wm mail delivery
[13:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:01:56] ok well
[13:02:08] [17/Jun/2013:13:02:00 +0000] category=SYNC severity=SEVERE_ERROR msgID=14942259 msg=The hostname sfo-intranet.corp.wikimedia.org could not be resolved as an IP address
[13:02:13] lots of these in the opendj error log
[13:02:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[13:02:35] The LDAP connection handler defined in configuration entry cn=LDAP Connection Handler,cn=Connection Handlers,cn=config has experienced consecutive failures while trying to accept client connections
[13:02:43] and then it stops listening. so grrr
[13:10:37] !log need to find out why sfo-intranet.corp.wikimedia.org can't be resolved in order to have opendj back and restore mail service
[13:10:46] Logged the message, Master
[13:11:33] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time
[13:17:01] apergos: ah you are already working on it, great
[13:17:27] well I'm not getting very far, seems like ns1.corp and ns2.corp which shooould be authoritative don't resolve it
[13:17:38] get nxdomain
[13:17:43] PROBLEM - Host barium is DOWN: CRITICAL - Plugin timed out after 15 seconds
[13:17:50] at this point I want to ask oit to check things
[13:18:20] apergos: yes, you should send email to them ;)
[13:18:50] ha ha ha
[13:20:12] ok need to find someone who can sms them (local us number would be good I guess)
[13:20:57] apergos: email down? you better email them…
[13:20:58] *runs*
[13:21:11] Nikerabbit already made the joke
[13:21:28] tried turning it off and back on?
[13:21:30] this means automatic bloddletting for the rest of the joketellers who failed to be first
[13:21:39] cna't, I'm nowhere near the office
[13:21:47] thousands of miles away in fact...
[13:22:26] but anyway, no email is a good thing >.>
[13:23:11] apergos: edit the main page of the officewiki? with a large font-size?
[13:23:17] save on the text costs
[13:23:50] but they will sleep through that
[13:25:21] hmm mw wiki feels a bit slow and unresponsive to me... although thats probably my isp being crappy again
[13:25:28] chris is texting them, we'll see
[13:25:37] (for which, thanks)
[13:31:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:32:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[13:35:24] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[14:00:12] RECOVERY - LDAP on sanger is OK: TCP OK - 0.028 second response time on port 389
[14:02:02] New review: Hoo man; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69010
[14:02:13] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused
[14:02:33] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused
[14:03:12] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[14:03:55] New review: Hoo man; "(1 comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/69010
[14:05:32] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.028 second response time
[14:08:02] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:38] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[14:09:35] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385
[14:11:20] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused
[14:11:46] New patchset: Jeroen De Dauw; "Update Wikidata and SMW IRC notification repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69010
[14:11:48] RECOVERY - LDAP on sanger is OK: TCP OK - 0.027 second response time on port 389
[14:12:18] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.027 second response time on port 636
[14:17:34] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385
[14:18:18] PROBLEM - Puppet freshness on mw1152 is CRITICAL: No successful Puppet run in the last 10 hours
[14:18:19] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:19:52] New review: Ottomata; "Replied to most things inline." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385
[14:27:18] PROBLEM - Puppet freshness on mw1129 is CRITICAL: No successful Puppet run in the last 10 hours
[14:28:54] it says replication is up
[14:29:18] PROBLEM - Puppet freshness on mw1074 is CRITICAL: No successful Puppet run in the last 10 hours
[14:45:40] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused
[14:56:39] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.021 second response time
[15:03:31] PROBLEM - Exim SMTP on sodium is CRITICAL: Connection refused
[15:07:53] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[15:10:24] I read that as LAPD on sanger and was confused :p
[15:11:51] RECOVERY - LDAP on sanger is OK: TCP OK - 0.034 second response time on port 389
[15:16:32] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.032 sec. response time
[15:25:31] PROBLEM - Exim SMTP on sodium is CRITICAL: Connection refused
[15:28:01] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused
[15:41:30] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused
[15:42:00] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.010 second response time
[15:57:22] mark, looks like after your refactoring varnish is broken on labs
[15:57:32] http://en.m.wikipedia.beta.wmflabs.org/
[15:57:34] quite likely
[15:58:01] hashar, ^^
[15:59:11] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: STALE
[15:59:22] PSSSHHh
[15:59:27] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: STALE
[15:59:32] WHAA
[15:59:36] what's up with this stale stuf
[15:59:37] PROBLEM - Packetloss_Average on analytics1009 is CRITICAL: STALE
[15:59:41] this has been happening for a while now
[15:59:48] grr
[16:10:50] MaxSem: bug fill it please. I can't look at it right now (cooking + daughter + dinner etc...)
[16:11:08] hashar, BZ or RT?
[16:11:13] bugzilla
[16:11:20] Wikimedia Labs -> deployment-prep (beta)
[16:11:25] MaxSem: thanks :)
[16:12:05] heya LeslieCarr, you there?
[16:12:13] i think something is still weird with neon disk space or something
[16:12:41] i'm getting STALE icinga alerts again for ganglios values
[16:12:55] i just happened to run this on neon (was checking something)
[16:12:55] # aptitude show python
[16:12:55] E: Unable to open /root/.aptitude/config for writing - apt_init (28: No space left on device)
[16:12:55] Package: python
[16:12:59] shit
[16:13:03] inodes again
[16:13:05] yeah
[16:13:15] what's happening over there?
[16:13:29] /var/spool/snmptt/
[16:13:30] ah yup
[16:13:31] /dev/md0 610800 610800 0 100% /
[16:13:38] I'll try to clear some out, it's going to take a little
[16:19:17] thanks apergos
[16:19:33] yw, but be a bit patient
[16:19:41] it's a very full directory
[16:21:02] why's it get so full?
[16:21:12] i'm fine, it just keeps scaring me when I get all these alerts!
[16:22:52] /var/spool/snmptt gets filled full of crap, typically these files would be removed daily
[16:24:49] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time
[16:25:59] MaxSem: mobile should be back on beta ( http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page )
[16:26:03] MaxSem: an apache was down
[16:26:11] oh...:)
[16:26:40] not sure why it did not try on the other apache though
[16:26:40] :(
[16:26:48] hashar, thanks a lot!
[16:26:52] don't blame it on wikidata :)
[16:28:15] :-D
[16:32:08] !blame | aude
[16:32:27] meh, have to blame Domas manually
[16:32:36] Domas, I blame you!:P
[16:34:32] Hey all, I'm in the office now and people are starting to ping me about the email issues. Would one of you like to send out a staff email about it, or should I? If the latter, would someone DM me the details of the status?
[16:35:08] nm, cmjohnson1 just updated me
[16:37:03] cndiv: hey chip
[16:37:12] cndiv: I gave an update on #-staff about 45' ago
[16:37:19] cndiv: let me know if you need more details
[16:37:31] thanks paravoid, if you already sent one that's great
[16:37:33] I'll send an outage report to ops@ soon
[16:37:39] no i didn't send a mail to staff
[16:37:42] I just informed the irc channel
[16:37:51] oh, would you like me to inform staff? or just wait it out?
[16:37:55] I can send an email to wmfall I guess?
[16:38:09] that would be best, I don't want to give them incorrect or incomplete information
[16:38:12] but they should know something
[16:40:27] !log neon out of inodes again, clearing out /var/spool/snmpttp, will return to service shortly
[16:40:36] Logged the message, Master
[16:48:35] not highly relevant to much, but i was curious: How much memory does WMF memcached have available? Is it one cluster shared between all WMF wikis? Memcached pushes things out of memory based or LRU, about how old are the things getting evicted?
[16:48:57] is there anywhere i could find these answers, or is it more just ask the right person?
[16:52:15] ebernhardson: grep for the memcached config in operations/puppet
[16:52:21] 96G http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=mc1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1371487864&g=mem_report&z=large&c=Memcached%20eqiad
[16:52:45] basically, look site.pp for hostnames, then see ganglia
[16:53:08] ok good stuff i'll take a look through, thanks!
[16:53:16] 2) yes
[16:53:32] one cluster for all wiki's? ok good to know
[16:53:37] well, probably per data center though :)
[16:53:43] got em
[16:56:01] PROBLEM - Puppet freshness on mw1055 is CRITICAL: No successful Puppet run in the last 10 hours
[16:56:01] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[16:57:11] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused
[17:01:12] RECOVERY - Puppet freshness on mw1152 is OK: puppet ran at Mon Jun 17 17:01:08 UTC 2013
[17:01:12] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused
[17:01:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:02:01] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.087 sec. response time
[17:02:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:02:57] ok these are loooking logical
[17:03:45] now that inodes are freeing up?
[17:03:58] indodes are freed up, finished thta
[17:04:02] restarted some services
[17:04:14] just ws double checking I didn't leave any loose ends
[17:04:56] !log neon should be back in business
[17:05:06] Logged the message, Master
[17:06:32] will that happen again? apergos?
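The LRU eviction behaviour ebernhardson asks about above can be illustrated with a toy cache. This is a sketch only: real memcached additionally partitions memory into slab classes (so eviction age varies per object-size class), and the actual WMF sizing lives in puppet and ganglia as answered in the log.

```python
import time
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: at capacity, the least-recently-used key is evicted.
    Illustrative only -- not how memcached is implemented internally."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # key -> (value, stored_at)

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # a read refreshes recency
        return self.data[key][0]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = (value, time.time())
        if len(self.data) > self.capacity:
            # Drop the least-recently-used entry; "how old are the things
            # getting evicted" would be now minus its stored_at timestamp.
            victim, (_, stored_at) = self.data.popitem(last=False)
            return victim
        return None

cache = LRUCache(3)
for k in "abc":
    cache.set(k, k.upper())
cache.get("a")               # touching 'a' makes 'b' the LRU entry
evicted = cache.set("d", "D")
print(evicted)  # -> b
```

The point of the demonstration: eviction order tracks access recency, not insertion order, which is why a read of "a" sacrifices "b" instead.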
[17:06:43] it seems to ahve happened a few times over the last couple of weeks
[17:09:13] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.013 second response time
[17:10:13] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
[17:10:38] yes unless we put a cron job in to clear them out; what I dn't know is how many of them it creates in a short interval
[17:10:56] hm
[17:10:57] i.e. does it go crazy for a few days? or does it lose it in the space of an hour?
[17:11:11] by the time we have trouble it's hard to look at em sorted (basically you can't)
[17:11:26] and the few times I've hopped on to look it' sbeen well behaved (of course)
[17:13:59] !log rebooting es7 for lvm snapshot cleanup (again)
[17:14:08] Logged the message, notpeter
[17:15:15] yeah, i'm not sure apergos, I don't always catch it
[17:15:30] usually when I do, I see STALE values from ganglia
[17:15:35] but ganglia actually has good data
[17:15:47] so, I get the alerts, make sure my stuff isnt' actually dying
[17:15:52] and then ping in here and ignore :/
[17:16:15] worksforme
[17:16:44] PROBLEM - Host es7 is DOWN: PING CRITICAL - Packet loss = 100%
[17:18:03] RECOVERY - Host es7 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[17:29:15] awjr: ping
[17:29:23] pong preilly
[17:29:32] awjr: May I send you a PM
[17:29:35] of course!
[17:31:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:32:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time
[17:33:14] jdlrobson: http://app.inthis.com/experience/140/
[17:44:40] Change abandoned: Matthias Mullie; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68167
[17:56:02] binasher, are you in charge of databases?
[17:56:08] for labs
[17:56:38] no, but i might be able to help with problems. what's up?
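The daily cleanup apergos alludes to ("unless we put a cron job in to clear them out") could look like the following sketch. The one-day retention window and the helper name are assumptions, and the demonstration deliberately runs against a scratch directory rather than the real /var/spool/snmptt.

```python
import os
import tempfile
import time

def purge_older_than(directory, max_age_days):
    """Delete regular files in `directory` whose mtime is older than
    max_age_days; return how many were removed. Equivalent in spirit to
    a daily `find <dir> -type f -mtime +N -delete` cron job."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed += 1
    return removed

# Demonstrate against a scratch directory, NOT the real spool.
d = tempfile.mkdtemp()
for i in range(5):
    open(os.path.join(d, "trap%d.log" % i), "w").close()
two_days_ago = time.time() - 2 * 86400
for name in ("trap0.log", "trap1.log"):
    # back-date two files past the retention window
    os.utime(os.path.join(d, name), (two_days_ago, two_days_ago))

n = purge_older_than(d, 1)
print(n)  # -> 2
```

This clears inodes (each small trap file costs one) as well as bytes, which is the failure mode seen on neon, where the filesystem ran out of inodes rather than space.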
[17:56:59] WHat's the status of S7?
[17:57:20] And has legal made any headway for the archive table?
[17:57:28] Cyberpower678: I believe we're still waiting on legal, but Coren can likely give the best update on that.
[17:58:55] who wants to review some C code?
[17:58:56] s7 should go online later this week. i don't know about the archive table issues, Coren is indeed the person to talk to.
[17:59:21] petan said he killed Coren. :p
[17:59:59] he is still working as a zombie
[18:00:26] it is good for him because nobody needs to pay taxes for him now
[18:00:31] petan, don't let him feast on your grey matter
[18:00:31] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: everything non 'pedia to 1.22wmf7
[18:00:45] Logged the message, Master
[18:01:47] New patchset: Reedy; "Everything non 'pedia to 1.22wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69127
[18:03:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69127
[18:10:50] New patchset: awjrichards; "Disable EventLogging and NavigationTiming on betalabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69130
[18:13:43] New patchset: awjrichards; "Disable EventLogging and NavigationTiming on betalabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69130
[18:14:12] Reedy, will I step on your toes if I merge a labs-only config change?
[18:14:48] MaxSem: Nope. I was done within 31s of the window starting
[18:15:06] :-)
[18:15:10] New patchset: Petrb; "updating motd to make it clear which server is which" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133
[18:15:31] !log aaron synchronized php-1.22wmf7/includes/db/DatabaseMysql.php 'ac676a2d83af283701dee111a28a52ea1730d657'
[18:15:40] Logged the message, Master
[18:17:00] New review: Reedy; "Needs more cowbellsay" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133
[18:18:33] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69130
[18:31:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:32:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time
[18:33:16] New patchset: Pyoungmeister; "setting db66 as pmtpa snapshot host for s3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69135
[18:33:21] New review: Dzahn; "support Reedy, cowsay / figlet / toilet :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133
[18:33:41] New patchset: Pyoungmeister; "setting db66 as pmtpa snapshot host for s3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69136
[18:37:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69135
[18:39:00] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69136
[18:40:48] !log py synchronized wmf-config/db-pmtpa.php 'bookkeeping changes for pmtpa dbs'
[18:40:57] Logged the message, Master
[18:45:35] RobH: hello mister RT duty man. Mind processing 5312 today, plz?
[18:49:20] !log upgrading packages on gallium
[18:49:32] Logged the message, Master
[18:50:10] greg-g: done, try now
[19:12:44] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused
[19:15:44] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time
[19:19:05] RECOVERY - Solr on solr1001 is OK: All OK
[19:21:40] New patchset: Ottomata; "Making cdh4::hadoop::directory { '/user/hdfs': require Cdh4::Hadoop::Directory['/user']," [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69146
[19:22:28] New patchset: Ottomata; "Making cdh4::hadoop::directory { '/user/hdfs': require Cdh4::Hadoop::Directory['/user']," [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69146
[19:23:05] Change merged: Ottomata; [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69146
[19:28:45] !log updated Parsoid to d1ae92720c4e
[19:28:54] Logged the message, Master
[19:33:29] New patchset: Ori.livneh; "Compatibility hack to ease roll-out of EL API module" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69149
[19:37:19] !log dist-upgrade and reboot indium
[19:37:28] Logged the message, Master
[19:42:58] New patchset: Ottomata; "Using 4 spaces instead of 2 for indent (WMF puppet code guideline)" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69150
[19:43:54] Change merged: Ottomata; [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69150
[20:08:36] RECOVERY - Puppet freshness on mw1129 is OK: puppet ran at Mon Jun 17 20:08:26 UTC 2013
[20:08:46] RECOVERY - Puppet freshness on mw1074 is OK: puppet ran at Mon Jun 17 20:08:36 UTC 2013
[20:09:26] RECOVERY - Puppet freshness on mw1055 is OK: puppet ran at Mon Jun 17 20:09:23 UTC 2013
[20:39:18] ^demon: something up with gerrit?
[20:41:37] wow 335 ms wait
[20:42:16] <^demon> The heck...
[20:43:03] <^demon> Ryan_Lane: Is something up with db1048? I'm seeing a bunch of (seemingly) unrelated mysql problems.
[20:46:04] <^demon> Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
[20:46:05] <^demon> at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3017)
[20:46:05] <^demon> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3467)
[20:46:06] <^demon> ... 37 more
[20:46:08] <^demon> eg:
[20:47:09] db1048 looks idle and isn't logging errors
[20:48:46] <^demon> qchris_away: What mysql-connector are we supposed to have with 2.6-rc0-154-gfcdb34b?
[20:50:14] no slow queries logged on db1048 either
[20:54:38] binasher: any no queries logged
[20:54:45] * preilly ha ha ha
[21:01:27] gerrit is soooo sloooow
[21:01:44] * odder goes make some tea while doing a 'git pull'
[21:02:52] AaronSchulz: heya, we're over in 37, our room was moved
[21:03:03] ok
[21:06:07] New patchset: Odder; "(bug 49335) Modify wgNamespacesToBeSearchedDefault for ukwikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69160
[21:07:16] Change abandoned: Yurik; "will do this later after we stop varying on X-CS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66034
[21:19:09] New patchset: Asher; "Add AFTv5 archive feedback cron job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62602
[21:20:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62602
[21:22:37] ^demon: mysql-connector-java-5.1.21.jar
[21:23:23] <^demon> Hmm, that's what I thought.
[21:23:35] ^demon: We had the "Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost." problem before. Back then gerrit was restarted and the problem gone IIRC
[21:23:55] <^demon> I think it stopped freaking out
[21:24:38] Ok.
[21:24:55] Are we seeing those kind of problems more often recently?
[21:27:39] New patchset: Ori.livneh; "Don't manage eventlogging.conf during upgrade" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69205
[21:28:44] ^ can someone merge that for me? it just comments out the file { '/etc/supervisor/conf.d/eventlogging.conf' } resource stanza so i can twiddle with it while i roll out new back-end code; I'll reintroduce it in a couple of days.
[21:30:13] ori-l: that is not going to make it disappear
[21:30:24] oh
[21:30:26] hashar: I know
[21:30:28] you want to edit it sorry
[21:30:54] New review: Hashar; "merely to prevent puppet from overriding the configuration :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/69205
[21:34:47] New patchset: Asher; "fix puppet error on terbium / when misc::maintenance::refreshlinks is disabled" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69247
[21:36:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69247
[21:47:16] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:16] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:16] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:16] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:17] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:17] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:18] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:18] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:19] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[21:47:19] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[21:53:06] New review: Tim Landscheidt; "Hmmm. Can't we instead define a global Puppet parameter "$testing", or something like that, that, i..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133
[21:56:05] gerrit, oh gerrit, why do you do that thing you do... sloooowly
[21:58:48] 23:01 * odder goes make some tea while doing a 'git pull'
[22:17:15] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[22:19:12] !log built package and upgrading to php-luasandbox 1.7-1
[22:19:22] Logged the message, Master
[22:21:15] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours
[22:21:31] !log updated Parsoid to b2a7b101e
[22:21:40] Logged the message, Master
[22:40:37] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:43:45] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:01:25] New review: Ori.livneh; "Don't shoot me. Shoot Reedy." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/68690
[23:02:50] nice
[23:02:50] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68690
[23:03:24] greg-g: he's bulletproof anyway
[23:03:31] Reedy is wonderful.
[23:03:46] yes!
[23:04:11] !log Gracefully reloading Zuul to deploy {{gerrit|I773e84e6f75512}}
[23:04:19] Logged the message, Master
[23:04:45] Ryan_Lane: ^ https://gerrit.wikimedia.org/r/#/c/69260/
[23:05:05] Krinkle: awesome. thanks
[23:05:58] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Change I95a8785c7 / Bug 49312: Add 'Programs' NS to MetaWiki'
[23:06:06] Logged the message, Master
[23:06:07] RoanKattouw: done
[23:06:24] Thanks
[23:06:26] I'll go now
[23:07:13] (thanks greg-g)
[23:09:17] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011
[23:09:18] * greg-g waves
[23:10:53] !log install package upgrades on streber
[23:11:01] Logged the message, Master
[23:12:40] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011
[23:13:50] andrewbogott: could you do https://gerrit.wikimedia.org/r/69205 too by any chance? it's very straightforward
[23:14:17] ori-l: I'll look shortly; in the middle of deploying ^^
[23:14:28] kk
[23:14:29] thanks
[23:17:05] ori-l, do you want that merged right now? Or wait until you're ready to upgrade?
[23:17:54] ori-l: soooo. it's possible to have git-deploy actually call out the thing you do manually, if you like
[23:18:15] andrewbogott: now; I am ready to upgrade.
[23:18:15] not sure if you wanted to continue with that, or do it another way
[23:18:36] Ryan_Lane: this is not a standard deployment; I am migrating the back-end from supervisord to upstart
[23:18:43] ah.
great [23:18:57] that's much better :) [23:19:05] ^demon: Gerrit is extremely slow right now [23:19:15] New review: Andrew Bogott; "lgtm" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/69205 [23:19:36] Minor submodule updates take noticeable time, and it just took three minutes for me to +2-submit-and-merge a commit (see timestamps @ https://gerrit.wikimedia.org/r/#/c/69262/ ) [23:19:50] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69205 [23:20:06] Ryan_Lane: I do need to run 'python setup.py install' after each git-deploy, but the solution to that is not to get git-deploy to run it, but rather to make the whole setup runnable from its deployment target rather than $PATH. I'm going to do that, too, as part of this migration. [23:20:16] the whole EL setup I mean [23:20:26] yeah [23:20:38] I'm also a fan of that :) [23:21:27] good. the less hacks we need to put into deployment, the better [23:21:33] service restart is an easy one [23:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:53] ori-l: merged on sockpuppet. Don't forget to revert when you're done :) [23:23:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [23:25:02] andrewbogott: I won't forget -- thanks very much. [23:30:38] !log Running scap to update VisualEditor [23:30:46] Logged the message, Mr. Obvious [23:34:59] <^demon> !log restarting gerrit [23:35:09] Logged the message, Master [23:39:38] !log catrope Started syncing Wikimedia installation... 
: Updating VisualEditor to master [23:39:46] Logged the message, Master [23:44:15] New patchset: Andrew Bogott; "Revert "Refactor exim::rt to use the new exim template."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69263 [23:47:04] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69263 [23:47:28] PROBLEM - Exim SMTP on streber is CRITICAL: Connection refused [23:51:08] !log catrope Finished syncing Wikimedia installation... : Updating VisualEditor to master [23:51:16] Logged the message, Master [23:51:21] !log restarting exim on streber [23:51:29] Logged the message, Master [23:52:56] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69264 [23:54:49] New review: Andrew Bogott; "Let's try that again!" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/69264 [23:54:49] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69264 [23:56:29] RECOVERY - Exim SMTP on streber is OK: SMTP OK - 0.109 sec. response time
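[Editor's note] The change ori-l asks to have merged at 21:28 (gerrit 69205) uses a common Puppet pattern: commenting out a `file` resource so the agent stops enforcing the file's contents while it is edited by hand. As hashar notes at 21:30, this does not delete the file; Puppet simply stops managing it. A minimal sketch of the pattern — not the actual contents of change 69205, and with a hypothetical `source` path — might look like:

```puppet
# Temporarily disabled while the eventlogging back end is migrated from
# supervisord to upstart; to be reintroduced once the migration is done.
# Commenting the resource out means Puppet no longer manages the file,
# but the file itself stays on disk and can be edited freely.
#
# file { '/etc/supervisor/conf.d/eventlogging.conf':
#     ensure => present,
#     source => 'puppet:///files/eventlogging/eventlogging.conf',  # hypothetical path
#     owner  => 'root',
#     group  => 'root',
#     mode   => '0444',
# }
```

An alternative sometimes used instead of commenting the stanza out is `ensure => present` with no `source`/`content`, or a `replace => false` attribute, so the resource stays in the catalog but Puppet leaves existing contents alone.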