[00:20:20] PROBLEM - MySQL Slave Running on es1004 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Incorrect key file for table ./enwiki/blobs_cluster22.MYI: t [00:20:52] and es1004 needs to be reslaved [00:27:05] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:29:47] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:27:35] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [01:42:08] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 204 seconds [01:47:50] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 3 seconds [03:33:13] PROBLEM - Host mw4 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:22] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [05:32:16] !log flushing mobile varnish cache [05:32:18] Logged the message, Mistress of the network gear. [07:25:36] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:29:48] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:57:39] hello [08:02:10] !log running l10nupdate for {{bug|34938}} [08:02:13] Logged the message, Master [08:13:35] !log rerunning l10nupdate for {{bug|34938}} [08:13:38] Logged the message, Master [08:26:51] cp: cannot create regular file `/home/wikipedia/common/php-1.19/cache/l10n/l10n_cache-zh-my.cdb': Permission denied [08:26:55] WONDERFUL [08:26:56] :D [08:39:54] !log Gave up running l10nupdate script it has some file permissions issues. Opened {{bug|36119}} and {{bug|36120}} [08:39:57] Logged the message, Master [09:35:40] PROBLEM - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:49] PROBLEM - LVS HTTP on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:36:52] PROBLEM - LVS HTTP on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:01] PROBLEM - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:04] that's not good [09:37:28] no [09:40:20] RECOVERY - LVS HTTP on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 79844 bytes in 0.193 seconds [09:40:37] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:41] and at the same time my internet connection dropped [09:40:43] did you look at it? [09:41:58] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:42:34] RECOVERY - LVS HTTP on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 79844 bytes in 0.162 seconds [09:42:43] RECOVERY - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79459 bytes in 0.193 seconds [09:43:01] RECOVERY - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79459 bytes in 0.275 seconds [09:43:36] no, not yet, I didn't realize you had been dropped off [09:43:39] started to connect to lvs1001.mgmt after i just saw the SSH crit, but it was back in that second [09:43:48] useless [09:55:34] so why did it happen? 
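The LVS alerts in the 09:35-09:43 window come from plain HTTP and HTTPS probes against the load-balanced service addresses, so when the cause is unclear the same check can be reproduced by hand. A minimal sketch, assuming curl and the probe's 10-second timeout:

    # Probe the LVS services the way the monitoring check does: fetch the
    # landing page through the load balancer with a hard 10-second timeout.
    for url in http://wikipedia-lb.eqiad.wikimedia.org/ https://wikimedia-lb.eqiad.wikimedia.org/; do
        curl -sSk -o /dev/null -m 10 -w "%{url_effective} -> %{http_code} in %{time_total}s (%{size_download} bytes)\n" "$url"
    done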
[10:27:31] !log Sending Brazil upload traffic to eqiad [10:27:34] Logged the message, Master [10:54:00] New patchset: Mark Bergsma; "Set PATH" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5417 [10:54:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5417 [10:54:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5417 [10:54:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5417 [11:19:18] !log Sending US upload traffic to eqiad as well [11:19:21] Logged the message, Master [11:28:49] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [11:42:10] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:52] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 7.443 seconds [11:45:19] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:46] PROBLEM - SSH on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:49] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:58] RECOVERY - SSH on ms-fe1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:48:01] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 2.223 seconds [11:48:01] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 3.976 seconds [11:50:25] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:52:40] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:55:36] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:48] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.059 seconds [11:59:48] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 3.720 seconds [12:00:29] New patchset: Mark Bergsma; "Halve the number of proxy workers to reduce memory usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5419 [12:00:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5419 [12:00:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5419 [12:00:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5419 [12:02:39] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:07:09] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:15] switf is down [12:08:18] :o [12:08:24] no thumbs [12:08:39] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:51] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:12:16] petan|wk: overloaded, m ark is on it, trying to tune [12:12:30] ok [12:12:39] stop the stupid m ark thing [12:12:51] you are pinged on that? 
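The r5419 change above ("Halve the number of proxy workers to reduce memory usage") trades Swift proxy concurrency for memory headroom. A rough sketch of how the effect might be checked on a proxy host such as ms-fe1 — the config path and the ps/awk accounting are assumptions, not taken from the change itself:

    # The worker count lives in proxy-server.conf; compare it against the
    # resident memory the swift-proxy-server processes are actually using.
    grep -n '^workers' /etc/swift/proxy-server.conf
    ps axo rss,cmd | grep '[s]wift-proxy-server' | \
        awk '{sum += $1; n++} END {printf "%d workers, %.0f MB RSS total\n", n, sum/1024}'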
[12:12:52] :D [12:13:11] sure, somehow thought you like less highlights when i saw others doing it [12:13:49] that's like when Ryan Lane told me that !Ryan pings him as well [12:20:03] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:20:52] mutante: can you restart ns2? [12:20:56] yep [12:22:45] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.128 seconds response time. www.wikipedia.org returns 208.80.154.225 [12:22:56] !log restarted pdns on ns2 [12:22:59] Logged the message, Master [12:25:27] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 4.144 seconds [12:25:41] :) [12:25:45] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.608 seconds [12:30:27] New patchset: Mark Bergsma; "Move Swift iptables inclusion to the role classes Disable iptables firewall for all-internal pmtpa prod cluster - it's causing issues with full conntrack tables and dropped packets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5420 [12:30:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5420 [12:31:56] New patchset: Mark Bergsma; "Move Swift iptables inclusion to the role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5420 [12:32:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5420 [12:32:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5420 [12:32:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5420 [12:40:35] !log Disabled iptables firewalls on internal prod swift cluster servers as it's dropping packets [12:40:38] Logged the message, Master [12:41:03] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 1.168 seconds [13:07:14] New patchset: Dzahn; "write logs for refreshLinks cron jobs to mwdeploy home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5425 [13:07:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5425 [13:08:59] New patchset: Dzahn; "write logs for refreshLinks cron jobs to mwdeploy home, ugh, capitalization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5425 [13:09:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5425 [13:10:29] !log Sending India upload traffic to upload-lb.eqiad [13:10:31] Logged the message, Master [13:11:20] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5425 [13:11:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5425 [13:16:37] New patchset: Dzahn; "refreshlinks - move log dir creation out of the definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5426 [13:16:53] mutante: can I ask you to look at some rewrite rules for shorturl? [13:16:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5426 [13:17:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5426 [13:17:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5426 [13:22:11] hexmode: in a little while,ok? or can you assign something in bz? other stuff i waiting to get done before weekend. urgent? [13:22:40] mutante: it has sat for a while, if you have someone else I should ask... [13:23:01] mutante: also, I could just submit these in gerrit if you point me to the place [13:24:20] if it's not urgent before the weekend feel free to assign to me and if i don't have answers i can find someone to forward it to [13:24:58] are these edits to redirects.conf ? [13:27:10] hexmode: just submitting to gerrit sounds very good, about the right place i'll have to look at the request [13:28:30] mutante: redirect.conf, yes... which git repo? [13:28:47] redirect.conf is not in a git repo [13:28:55] yea, thats still fenari stuff [13:28:57] it is just in local to fenari svn repo :D [13:29:07] but it would be nice if it could go through gerrit [13:29:53] mutante: hashar: Could you put it in gerrit? [13:30:24] even if it isn't deployed from there right now, at least then I could submit via gerrit [13:30:26] hrm [13:30:34] maybe I can get it from noc [13:31:07] yes,its on noc [13:32:43] hexmode: even if it's not served from there yet, like you said, but i guess the place should be operations/puppet production and ./puppet/files/apache/ [13:33:06] or _maybe_ ./templates/ instead of ./files/ [13:33:25] mutante: k, I'll figure out the right one and do that. tyvm [13:33:26] if having it as a template is an advantage here [13:35:58] <^demon> Trying to move that stuff to puppet is more than a quick copy+paste from noc to files/ or templates/ [13:46:25] can someone review and approve this commit please ok thanks bye [13:46:26] https://gerrit.wikimedia.org/r/#change,5350 [13:47:18] anal interns ftw [13:47:58] haha, wait me? [13:48:21] i am not an intern! and I am not toooooo anal, but I am analytics [13:48:38] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:55:39] New review: Ottomata; "Hi guys, I'm waiting on this one to start looking at and working on udp2log stuff. Robla and Dieder..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5350 [13:59:23] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:35] ottomata: there's a group, 'anal interns' [14:01:44] I guess someone tagged you right, anal people [14:01:47] !!!11 [14:01:52] heheh [14:02:23] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:00] New patchset: Mark Bergsma; "Add the pmtpa Squids as a thumbnail backend to help Swift out of its misery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5429 [14:03:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5429 [14:03:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5429 [14:03:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5429 [14:06:44] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 9.103 seconds [14:06:44] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:07:56] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 2.628 seconds [14:08:05] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.997 seconds [14:11:50] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:18:53] New patchset: MarkAHershberger; ".conf files from noc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [14:19:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [14:22:57] New patchset: MarkAHershberger; "--amend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5432 [14:23:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5432 [14:23:51] New review: Demon; "I believe you meant to --amend the parent, not make a new commit with the summary of "--amend"" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/5432 [14:24:16] Change abandoned: MarkAHershberger; "learning git moar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5432 [14:25:37] New patchset: MarkAHershberger; ".conf files from noc, with w/s removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [14:25:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [14:29:06] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3815 [14:29:24] New review: Ottomata; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [14:30:45] !log Disabled down-pref of Tampa AS2828 routes [14:30:48] Logged the message, Master [14:34:52] New patchset: MarkAHershberger; ".conf files from noc, with w/s removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [14:35:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [14:41:35] New patchset: MarkAHershberger; "Bug #1450 ? rewrite rules for ShortURL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5433 [14:41:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5433 [14:46:47] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Timeout reading from 10.0.11.27:11000 [14:48:28] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [14:48:51] ok. 
recovers before you can hit enter on restart [14:50:25] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:51:41] !log starting swift-container-auditor on ms-be1 [14:51:44] Logged the message, Master [14:51:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:01:30] !log Converted OSPF directly connected redistributed routes from type 2 to type 1 [15:01:31] New patchset: Dzahn; "add $nagios_group memcached to memcached monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5435 [15:01:32] Logged the message, Master [15:01:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5435 [15:03:40] New review: Dzahn; "just for monitoring group URLs" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5435 [15:03:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5435 [15:07:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.644 seconds [15:10:02] !log apache error log on stafford has ruby exceptions re: phusion_passenger [15:10:04] Logged the message, Master [15:10:46] that's probably because you just changed monitoring and thus nagios config changed and thus blocks puppet for a long time [15:11:38] the socket timeout / spence? true [15:18:37] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [15:31:12] New patchset: Mark Bergsma; "Implement Nagios time periods to restrict pages to awake hours (timezone dependent)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5436 [15:31:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5436 [15:33:33] someone can explain how the localisation cache is being rebuild? [15:33:35] on prod [15:33:48] does it run for every wiki? [15:34:08] nvm mutante just replied :o [15:38:00] ok, mutante doesn't know :P so anyone else? [15:38:15] I mean the rebuildLocalizationCache script [15:40:41] I assume we run it per project [15:40:43] lemme poke around [15:41:55] scap does it (so to all servers with mw installed) [15:42:25] apergos: but that would mean next project override the previous one [15:42:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:32] because all projects share same folder [15:42:33] and then l10update runs it [15:42:37] no [15:42:47] howcome [15:42:48] nothing gets overidden [15:44:14] one by one doesn't work [15:44:26] it removed the cache of previous wiki [15:44:39] so in fact only last wiki has cache :o [15:44:53] is that script you use in prod available for download? 
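For reference, the per-wiki rebuild being discussed here is normally driven through MediaWiki's rebuildLocalizationCache.php maintenance script, run once per wiki. A minimal sketch of such a loop — the mwscript wrapper, the dblist path, and the thread count are assumptions about the local setup, not details confirmed in this log:

    # Rebuild the CDB localisation cache for every wiki in the farm, one at a time.
    while read -r wiki; do
        mwscript rebuildLocalizationCache.php --wiki="$wiki" --threads=4
    done < /usr/local/apache/common/all.dblist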
[15:45:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.725 seconds [15:45:43] yes [15:46:05] where [15:46:07] there have been some issues with l10nupdate recently [15:46:12] a permissions thing, I noticed recently [15:46:22] it's in the server admin log but I don't know more than that [15:47:34] each set of messages goes into a file with the wiki name in it [15:47:41] like this: [15:47:43] messages-zhwiki-zh-cn [15:48:04] that's why there is no overwrite [15:49:44] you can see the global variables that affect it at noc (CommonSettings, InitialiseSettings), and then I would just read the script itself. but the l10nupdate script wrapper is in the public puppet repo [15:50:00] apergos: all configuration files on prod and labs are the same [15:50:09] we use CommonSettings from prod [15:50:23] so I am pretty sure if there is any variable, we have it [15:50:27] but it doesn't work though [15:50:38] because I don't have that script you use [15:53:30] I'm not sure the one in puppet uses rebuildLocalizationCache, it looks like it's another piece of the puzzle [15:53:58] I guess it's using something else [15:55:10] uh huh, update.php from the localization extension [15:57:18] ah here's what those look like [15:57:20] l10nupdate-xh.cache [15:57:41] and l10n_cache-zh.cdb [15:58:18] but like I say there was some sort of issue with a recent run [15:58:27] where they couldn't complete it I guess [16:02:08] apergos: not sure, but I think you're talking about upgrading 1.18->1.19 but I believe that issue (whatever it was) got resolved right away? [16:02:34] don't know [16:02:39] it was in the last day or two [16:03:44] apergos: then I'm mistaken, or else the issue was not fixed permanently [16:03:54] ok [16:04:05] I dunno, I just saw something scroll by about it in one of the channels [16:04:48] 08:39 hashar: Gave up running l10nupdate script it has some file permissions issues. Opened bug 36119 and bug 36120 [16:04:50] here it is [16:05:04] yup [16:05:15] oops, sorry for gratuitous ping [16:05:24] I tried running l10nupdate this morning to fix some sawiki bug :/ [16:05:44] I have opened two bugs and sent an email to Roan Kattouw [16:05:54] cool [16:06:12] that is all really messy [16:06:31] I got that impression yeah [16:11:30] !log add missing memcached servicegroup to nagios, restarted [16:11:32] Logged the message, Master [16:18:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.795 seconds [16:34:52] New review: Lcarr; "Thank god!" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/5436 [16:35:07] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Timeout reading from 10.0.8.23:11000 [16:36:37] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [16:49:50] New patchset: Mark Bergsma; "Create Swift filesystems with inode size 1024, as suggested in the docs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5439 [16:50:07] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5439 [16:59:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.620 seconds [17:07:59] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Timeout reading from 10.0.8.39:11000 [17:09:20] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [17:11:55] New patchset: Pyoungmeister; "part 1 of getting multicast relay going" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5335 [17:12:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5335 [17:15:12] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5335 [17:15:31] !log stopping puppet on locke and emery. just to be safe... [17:15:34] Logged the message, notpeter [17:16:58] New patchset: Catrope; "Per bug 36120 , use cp --force in l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5443 [17:17:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5443 [17:18:08] diederik: just fyi, I'm going to be mucking about with udp2log data slightly [17:18:58] New patchset: Catrope; "(bug 36119) Fix location of clearMessageBlobs.php , it moved in 1.19" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5444 [17:19:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5444 [17:29:04] New patchset: Pyoungmeister; "part 1 of getting multicast relay going" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5335 [17:29:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5335 [17:29:45] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5335 [17:29:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5335 [17:31:12] LeslieCarr: ok... I've got a debian package for bz4... now to start on the puppet config [17:40:58] mark: the full conntrack tables is easily solvable without removing the rules [17:40:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:12] we can just increase the amount of memory the kernel assigns for conntrack entries [17:43:31] :-) [17:43:53] why does it need to track conns?!!? [17:45:23] New patchset: Pyoungmeister; "cleanin'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5446 [17:45:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5446 [17:47:27] New patchset: Pyoungmeister; "cleanin'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5446 [17:47:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5446 [17:47:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.279 seconds [17:48:16] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5446 [17:48:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5446 [17:51:11] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:51:47] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:51:47] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:51:57] notpeter: Did you just break HTTPS? --- ^^ [17:52:05] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:05] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:12] RoanKattouw: it did that earlier.. Or yesteday [17:52:14] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:52:14] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:14] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:20] No, was today [17:52:23] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:41] very well might have [17:53:44] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50125 bytes in 0.775 seconds [17:53:53] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:54:01] oh, awesome [17:54:10] I guess nginx is subscribed to the conf [17:54:10] Give it a few minutes [17:54:21] unless you did break it :p [17:54:21] what conf? [17:54:24] thus "bring down https whenever you push out a new conf" is enabled [17:54:31] Oooh, it's restarting all nginx procs at the same time? [17:55:03] yep! [17:55:14] ah yes, the ol' "dos the site" setup... [17:55:23] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59467 bytes in 0.804 seconds [17:55:42] New patchset: preilly; "Add ACL for new carriers and redirect support for carriers landing page on m. domain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5447 [17:55:50] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:55:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5447 [17:56:00] notpeter: do you know what's going on? [17:56:10] yes [17:56:11] file { [17:56:11] "/etc/nginx/nginx.conf": [17:56:11] content => template('nginx/nginx.conf.erb'), [17:56:13] notify => Service['nginx'], [17:56:14] did you change nginx's config? [17:56:17] I pushed out a new conf [17:56:18] yes [17:56:23] and it notified nginx [17:56:26] and now puppet is restarting everything, argh. [17:56:26] all at the same time [17:56:40] did you run puppet by hand? [17:56:45] this seems like... a problematic design to me [17:56:51] no. 
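The root of the exchange above is that every SSL terminator picked up the new nginx.conf and bounced nginx inside the same window, because the file resource notifies Service['nginx'] on all of them. One common way to avoid that, sketched here with an assumed host list and 2012-era puppetd flags, is to disable the agents before merging a change that restarts a critical service, then re-enable and run them one host at a time:

    # Disable puppet on the SSL terminators before merging the risky change.
    hosts="ssl1 ssl2 ssl3 ssl4 ssl1001 ssl1002 ssl1003 ssl1004"
    for h in $hosts; do ssh "$h" 'sudo puppetd --disable'; done

    # After the merge, roll it out one terminator at a time, letting each
    # nginx come back and re-pool behind LVS before moving on.
    for h in $hosts; do
        ssh "$h" 'sudo puppetd --enable && sudo puppetd --test'
        sleep 60
    done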
[17:56:53] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:57:10] then shouldn't puppet run at different times on all of these hosts? [17:57:14] yes [17:57:19] but hey, here we are ;) [17:57:26] so yeah, that's something to change [17:57:47] PROBLEM - LVS HTTPS on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:58:12] notpeter: Will this fix itself automatically if we just wait a while? [17:58:18] yes [17:58:19] If it's a restart it sounds like it should [17:58:21] they're just restarting [17:59:09] what's up? [17:59:26] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:59:32] are you sure? something seems off [17:59:40] mark: Mass nginx restart due to updated config file, HTTPS down everywhere [17:59:41] mark: I pushed a new nginx conf [17:59:44] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:59:48] it notified nginx [17:59:53] they all restarted at once [17:59:53] like, I can connect with s_client, getting the certificate chain then it immediately closes the connection [18:00:11] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3981 bytes in 0.442 seconds [18:00:20] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.442 seconds [18:00:25] you mean you ran puppet all at once on all boxes? [18:00:28] of course it restarted nginx [18:00:29] paravoid: You can see this in Nagios, some say cannot make SSL conn, others say no data received [18:00:35] what did you expect? [18:00:38] RECOVERY - LVS HTTPS on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3930 bytes in 0.111 seconds [18:00:56] RoanKattouw: I am not talking about nagios [18:01:08] don't we have puppet to run every 30' or so? [18:01:37] yes [18:01:59] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60170 bytes in 0.887 seconds [18:02:21] paravoid: Weren't you saying that you could connect to one of these HTTPS proxies, it would give you certs, then close the conn? [18:02:26] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39068 bytes in 0.707 seconds [18:02:26] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 44133 bytes in 0.777 seconds [18:03:30] so for future reference what should have been done differently here? [18:03:47] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43449 bytes in 0.779 seconds [18:04:08] config went through gerrit, and puppet went mad applying it simultaneously, right? [18:04:14] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 55816 bytes in 0.770 seconds [18:04:14] yes [18:04:23] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:04:28] so would you have slain puppet everywhere first? [18:04:32] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:04:37] RoanKattouw: yes and it still happens to me. [18:04:39] I guess you need to stop puppet first everywhere [18:04:42] I don't think the problem is going to fix itself [18:05:18] notpeter: sigh. finest quality. [18:05:26] hurray for landmines [18:05:32] *** Fatal error: The TLS connection was non-properly terminated. 
[18:05:32] *** Handshake has failed [18:05:32] GnuTLS error: The TLS connection was non-properly terminated. [18:05:41] (same with openssl s_client) [18:05:44] paravoid: That's exactly what I said I was seeing in Nagios: some LVS IPs say "CRITICAL - Cannot make SSL connection" and others say "HTTP CRITICAL - No data received from host". I'm assuming that what you're seeing is the latter, and that other LVSes won't even give you certs [18:05:47] paravoid: to what host? [18:05:52] Oh, OK, so what you're seeing is the /former/ , sorry [18:06:12] try en.wikipedia.org, for example [18:07:27] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:07:27] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:07:27] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:07:32] !log restarting nginx on ssl1002 and ssl1004 as they are not back up [18:07:34] Logged the message, notpeter [18:08:46] [17754967.392078] nginx[20541]: segfault at 0 ip 00000000004764fd sp 00007fff8f75fb80 error 4 in nginx[400000+92000] [18:08:48] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:08:50] [17754967.451980] nginx[20524]: segfault at 0 ip 00000000004764fd sp 00007fff8f75fbb0 error 4 in nginx[400000+92000] [18:08:53] [17754967.951921] nginx[20546]: segfault at 0 ip 00000000004764fd sp 00007fff8f75fb30 error 4 in nginx[400000+92000] [18:08:57] oh how nice [18:08:58] nginxs are segfaulting [18:09:02] there are more of these too [18:09:11] paravoid: what box? [18:09:16] ssl1 [18:09:24] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:09:24] ssl2 [18:09:35] ssl3... [18:09:35] of course they are [18:09:39] there's an exploit in the wild [18:09:51] # dmesg | grep segfault |grep nginx | wc -l [18:09:51] 1453 [18:09:55] yay. [18:10:18] awesome [18:10:46] paravoid: suggestions? [18:10:53] hah, I was about to ask you [18:11:03] so, restarting nginx on ssl1? [18:11:07] see if it still segfaults [18:11:15] kk [18:11:16] what's the configuration change? [18:11:34] + access_udplog 208.80.154.15:8419 squid_combined; [18:11:36] that's it [18:11:39] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43448 bytes in 0.777 seconds [18:11:44] really nothing interesting [18:11:57] PROBLEM - LVS HTTPS on upload-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:12:07] paravoid: I'm going to restart nginx on ssl1 [18:12:12] I did [18:12:15] nothing changed [18:12:19] I did a full stop/start cycle [18:12:19] kk [18:12:29] can we rollback the config change to see if it'll make any difference? [18:12:37] sure [18:13:00] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50128 bytes in 0.774 seconds [18:13:18] RECOVERY - LVS HTTPS on upload-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 598 bytes in 0.112 seconds [18:13:36] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 57556 bytes in 0.771 seconds [18:14:21] PROBLEM - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:15:01] notpeter: I'm doing manually on ssl1? [18:15:29] notpeter - reverted? 
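The failed handshakes paravoid pasted above can be reproduced against a single hostname with openssl's client, which also shows whether the connection dies during the handshake or right after it. A minimal sketch, using the en.wikipedia.org example he suggested earlier:

    # A healthy terminator prints the certificate chain and a "Verify return code"
    # line; a crashing worker drops the connection partway through instead.
    echo | openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org 2>&1 | tail -n 6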
[18:15:38] I just reverted it manually on ssl1 and restarted nginx [18:15:51] RECOVERY - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 56725 bytes in 0.166 seconds [18:15:52] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:15:56] I don't see anothera segfault so far [18:17:02] my git repo is fucked [18:17:08] paravoid: can you check it in [18:17:08] did the same on ssl2 and it seems to be okay there as well [18:17:14] I'm going to revert it in puppet now [18:17:18] thanks [18:17:21] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:17:30] PROBLEM - LVS HTTPS on upload-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:17:39] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 53321 bytes in 0.775 seconds [18:17:41] although this will cause them all to restart again.... [18:17:48] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:17:52] it wouldn't do that on the same time [18:18:09] all of the hosts end up running puppet at the same time :/ [18:18:15] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.442 seconds [18:18:19] no that's not wha it happened [18:18:31] ah, ok [18:18:33] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3980 bytes in 0.447 seconds [18:18:35] fair enough [18:18:55] sec, fixing it and will explain my theory [18:18:59] ja [18:20:05] i cannot get to https://www.mediawiki.org/wiki/MediaWiki [18:20:09] gerrit comes up [18:20:14] is the site dead? [18:20:20] jpostlethwaite: HTTPS is broked [18:20:23] jpostlethwaite: known issue, being worked on [18:20:27] thanjs [18:20:30] thanks [18:21:06] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 67322 bytes in 0.815 seconds [18:21:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:51] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39066 bytes in 0.701 seconds [18:21:52] RECOVERY - LVS HTTPS on upload-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 599 bytes in 0.120 seconds [18:21:58] New patchset: Faidon; "Revert a access_udplog addition to nginx's config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5448 [18:22:00] PROBLEM - LVS HTTPS on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:22:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5448 [18:22:27] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:22:45] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:22:47] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5448 [18:22:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5448 [18:22:54] ok, so now we need to run puppet on all ssl hosts [18:23:06] alright, I'll go after esams [18:23:21] RECOVERY - LVS HTTPS on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3931 bytes in 0.112 seconds [18:23:30] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59499 bytes in 0.771 seconds [18:24:28] they're all already running, and hammering the puppetmaster... [18:24:48] yay [18:24:51] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:24:51] PROBLEM - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:25:18] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:26:03] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:26:48] oh fuck [18:26:51] huh? [18:26:58] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3979 bytes in 0.441 seconds [18:26:59] does that change applies to puppetmaster's nginx too? [18:27:11] :S [18:27:12] ah, we have apache/passenger for that, nvm [18:27:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.628 seconds [18:27:24] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43451 bytes in 0.779 seconds [18:28:32] so.... I'm still shocked that that change caused nginx to segault all over the place [18:28:46] running puppet in esams [18:29:02] New review: Asher; "There was a bug in the rewrite rule (the first / breaks it) but also, we don't deploy any of this st..." 
[operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/5433 [18:29:03] RECOVERY - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79457 bytes in 0.445 seconds [18:29:03] 2 at a time [18:29:12] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:29:19] puppet doesn't want to cooperate, I've modified it manually on ssl1-4 and ssl1001-1004 [18:30:34] doing same in esams [18:31:27] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:31:45] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:31:45] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39066 bytes in 0.712 seconds [18:32:03] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 53321 bytes in 0.775 seconds [18:32:03] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60171 bytes in 0.882 seconds [18:32:12] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59983 bytes in 0.778 seconds [18:32:30] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.442 seconds [18:32:30] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 67322 bytes in 0.824 seconds [18:32:32] done in esams [18:32:33] wow [18:32:39] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.447 second response time [18:32:48] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3960 bytes in 0.551 seconds [18:33:06] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 44134 bytes in 0.773 seconds [18:33:06] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43451 bytes in 0.776 seconds [18:33:06] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50128 bytes in 0.774 seconds [18:33:24] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80040 bytes in 0.831 seconds [18:34:55] so, things are getting back to normal, hopefully [18:35:04] Looks like HTTPS in esams is back up [18:35:10] yeah [18:38:26] New review: MarkAHershberger; "As ^demon said in IRC:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5433 [18:39:11] paravoid: also, I just ran puppet on ssl3001 to make sure that it didn't revert the by-hand changes and it didn't [18:39:44] great [18:41:44] so, the theory is [18:45:07] puppet run at different times within the 30 minute window [18:45:15] yes [18:45:25] so, not all at the same time [18:45:35] made the change, and restared nginx [18:45:41] ja [18:45:46] which should have been okay, because it wouldn't do it at the same time [18:45:51] yep [18:45:59] so, nginxs were restarted at different points [18:46:11] but, nginxs started going haywire with the segfault [18:46:13] but they were all segfaulting constantly after the restart [18:46:15] yeah [18:46:18] right. [18:46:34] which made the problem persist and aggravate itself [18:46:40] yep [18:47:45] *sigh* [18:49:40] Speaking of SSL [18:49:43] Have you guys seen https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Issues_with_the_security_certificate.3F ? 
[18:49:58] It looks like some people going to *.wikipedia.org on HTTPS got the wikimedia cert instead [18:58:18] New patchset: Pyoungmeister; "moving out a file so it's only deployed once." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5451 [18:58:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5451 [18:59:17] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5451 [18:59:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5451 [19:01:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:32] diederik: so.... no oxygen today. apparently sending it logs from our ssl terminators causes them to segfault constantly and take down https for all of our site. [19:08:35] as I just learned [19:09:16] for wikipedia? [19:09:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.067 seconds [19:09:40] notpeter: and if we would exclude ssl traffic? [19:10:25] ottomata: actually, for all of our projects :) [19:10:33] diederik: yeah, I'm working on still getting the rest of it up [19:10:33] geez [19:10:54] ok, if we can just get it set up ready to run log filters [19:10:58] even if it is not running them [19:11:01] that would be helpful [19:11:04] notpeter: well i don't think we did not see that one coming :( [19:11:15] diederik: noooooo, that was... very unexpected. [19:11:37] just a big plain bummer [19:11:57] ottomata: that part is done. it can run filters. now it's down to getting all of the right data sources pointed at in [19:12:00] diederik: yep [19:15:09] New patchset: Pyoungmeister; "everything gets its own name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5487 [19:15:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5487 [19:16:11] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5487 [19:16:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5487 [19:17:17] ok cool [19:17:21] now if only i could get access :) [19:17:30] hello hello helloooooo [19:17:39] (echoes) [19:18:04] https://gerrit.wikimedia.org/r/#change,5350 [19:22:21] ottomata: hhhmmm, yes. I'm not 100% on who has to approve cluster access [19:22:28] I will ask CT when he's back from lunch [19:22:52] i have forwarded CT the RT ticket [19:22:56] and mablebed [19:23:00] oh maplebed isonline now! [19:23:02] and mark [19:24:33] Ping LeslieCarr when you restart the deletion script, could you set the date to feb 15 instead of feb 5? 
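notpeter's note that oxygen "can run filters" refers to udp2log's filter config, where each line pairs a sampling factor with a file or pipe sink. The sketch below is illustrative only — the config path, the 1-in-1000 factor, and even the exact directive syntax are assumptions rather than anything quoted in this log:

    # Append an example entry to the udp2log instance's filter config:
    # keep one request line in every thousand in a flat file.
    echo 'file 1000 /var/log/udp2log/sampled-1000.tsv.log' >> /etc/udp2log/oxygen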
[19:35:32] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 216 seconds [19:35:32] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 216 seconds [19:39:53] RECOVERY - Host mw4 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [19:41:05] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [19:41:05] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [19:42:26] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [19:42:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:38] PROBLEM - Apache HTTP on mw4 is CRITICAL: Connection refused [19:44:46] hmm [19:45:05] !Log powercycled mw4, it was unresponsive to pings and via mgmt console [19:45:18] hmm I wonder if it wil take Log instead of log [19:45:24] what is the volatile puppet file source? [19:45:33] puppet:///volatile/squid/squid.conf [19:45:36] !log powercycled mw4, it was unresponsive to pings and via mgmt [19:45:39] Logged the message, Master [19:45:47] silly case-sensitive bot [19:45:57] ottomata: not sure. but I need to add someting in there as well :) [19:45:57] or, really the same thing as you [19:46:11] i'm looking at this RT ticket http://rt.wikimedia.org/Ticket/Display.html?id=2745 [19:46:16] and another one [19:46:20] but startting with that one [19:47:03] ah, gotcha [19:47:25] I need to figure out where that's kept so as to get squids shooting traffic at o2 [19:48:17] ah hmm, i need to understand how udp2log works, but i thought thought o2 would latch on to a multicast addy? [19:48:19] maybe not [19:48:53] squid-logging-multicast-relay.conf [19:48:57] exec /usr/bin/socat UDP-RECV:8420,su=nobody UDP4-DATAGRAM:233.58.59.1:8420,ip-multicast-ttl=10 [19:49:26] the udp2log instance is listening to the relay [19:49:29] *relay [19:49:42] on o2? [19:49:46] but squid and nginx are not yet sending to the relay [19:49:46] yes [19:49:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.174 seconds [19:51:34] so, in abot 5 minutes, o2's udp2log instance will be getting logs from all traffic that goes through varnish, via multicast [19:51:46] (up until now, the multicast relay was not yet in use) [19:51:59] emery+locke don't use it? [19:52:21] no/not yet [19:52:47] k [19:55:00] New patchset: Pyoungmeister; "need to comment out. they keep spawning..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5489 [19:55:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5489 [19:55:36] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5489 [19:55:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5489 [19:57:35] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.324 second response time [19:57:53] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [20:02:29] these are interesting [20:02:32] Apr 18 20:43:26 mw4 kernel: [18446744003.567280] BUG: soft lockup - CPU#18 stuck for 17163091968s! [php:12060] [20:02:38] Apr 18 20:43:26 mw4 kernel: [18446744003.574654] BUG: soft lockup - CPU#19 stuck for 17163091968s! 
[apache2:10091] [20:04:14] not any kind of usefull call trace though, output must have been truncated [20:07:28] ottomata: so, one possibility is to have nginx send to just the relay, and then have it send traffic off to locke/emery, which might get us around the "three logging hosts causes nginx to die repeatedly" problem [20:08:34] would that just turn into "one logging host causes nginx to die"? [20:08:48] hahahaha, could be ;) [20:13:12] this is just to get the logs to o2? [20:13:17] i really don't know how this system works at all [20:13:36] i emailed Tim Starling to see if I could set up a call with him monday eve to get an overview [20:13:45] but man, i am so in the dark about so many things i'd like to help with [20:13:53] and i don't know how to find out [20:13:59] how, or where to look [20:19:47] PROBLEM - NTP on mw4 is CRITICAL: NTP CRITICAL: Offset unknown [20:22:38] RECOVERY - NTP on mw4 is OK: NTP OK: Offset 0.1306298971 secs [20:23:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.587 seconds [21:01:15] notpeter: I was talking to Ryan in person just now, he said the segfault came from his udp2log plugin for nginx [21:01:23] (C code that he wrote and that wasn't reviewed by Tim) [21:01:47] Apparently there's some weird-ass bug in his code that makes it so you can have 0, 1 or 2 access_udplog rules, but not 3 [21:03:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:05:01] ah, ok [21:05:21] RoanKattouw: then I'll attempt to make use of the relay [21:05:45] good to know ;) [21:06:11] ah, weird [21:11:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.554 seconds [21:12:57] !log starting swift delete script on ms-be2 [21:12:59] Logged the message, Mistress of the network gear. [21:18:14] New patchset: Pyoungmeister; "this should do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [21:18:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5491 [21:19:28] root@ssl1:~# grep pem /etc/nginx/sites-enabled/* [21:19:57] /etc/nginx/sites-enabled/wikinews: ssl_certificate /etc/ssl/certs/star.wikinews.org.chained.pem; [21:20:01] /etc/nginx/sites-enabled/wikipedia: ssl_certificate /etc/ssl/certs/test-star.wikipedia.org.chained.pem; [21:20:06] /etc/nginx/sites-enabled/wikiquote: ssl_certificate /etc/ssl/certs/star.wikiquote.org.chained.pem; [21:20:12] why is it test-* for wikipedia? [21:20:36] lol [21:20:46] Can you check whether that .pem file contains the correct cert? [21:21:02] I have seen reports that the SSL terminators for wikipedia.org will occasionally serve the wikiMedia.org cert [21:21:45] ah, test-star has *.m.wikipedia.org too [21:30:05] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [21:44:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.656 seconds [22:01:32] New patchset: Pyoungmeister; "this should do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [22:01:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5491 [22:14:48] New patchset: Jeremyb; "make spacing consistent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5492 [22:15:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5492 [22:19:24] New patchset: Pyoungmeister; "this should do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [22:19:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5491 [22:24:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:05] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5491 [22:26:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [22:32:45] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [22:33:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.220 seconds [22:37:16] New patchset: Pyoungmeister; "perms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5494 [22:37:29] take that, mobile traffic stats! [22:37:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5494 [22:38:01] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5494 [22:38:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5494 [22:41:49] New patchset: Pyoungmeister; "naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5495 [22:42:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5495 [22:43:01] New patchset: Pyoungmeister; "naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5495 [22:43:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5495 [22:43:47] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5495 [22:51:08] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5495 [22:51:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5495 [22:54:33] New patchset: Pyoungmeister; "something ain't right. leaving ina stable state until monday" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5497 [22:54:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5497 [22:55:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5497 [22:55:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5497 [23:07:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:46] binasher: should we worry about db1001, db1020, or db1047 ? 
[23:13:03] ACKNOWLEDGEMENT - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 55780 MB (3% inode=99%): asher this will stay full until we purchase a disk shelf [23:13:04] i ack'd the space alert on db1047 [23:13:33] db1001 had a hardware failure and was repaired, but i haven't put it back in service yet [23:14:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.841 seconds [23:14:33] db1047 is analinterns i think. idk abou thte other 2 [23:14:34] hrm, i thought there was an rt ticket about 1020/22 [23:16:40] oh, db1020 ticket was resolved 3/23 - it had to have its raid card replaced [23:17:32] db1022 ticket was resolved too [23:18:46] so those are both dbs that were configured as slaves, then needed hardware repair (so they were in nagios) and in the meantime other hosts took over their former place. so they're to-be-allocated dbs that are in nagios from before they failed [23:24:33] ACKNOWLEDGEMENT - mysqld processes on db1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld asher need to re-slave [23:25:03] ACKNOWLEDGEMENT - mysqld processes on db1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld asher need to re-slave [23:32:26] db1022 also has checks disabled. they're enabled on db1020 [23:47:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.463 seconds
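Both ends of the day circle back to the same task: es1004 at 00:20 and db1020/db1022 just above all need reslaving. The usual shape of that operation, sketched very loosely — hostnames and binlog coordinates below are placeholders, and in practice the data reload would come from a hot copy of a healthy slave rather than a plain dump:

    # On the broken slave: stop replication and confirm what failed.
    mysql -e 'STOP SLAVE; SHOW SLAVE STATUS\G' | egrep 'Running|Last_Error|Master_Log'

    # Reload a consistent copy taken from a healthy host, noting its binlog
    # coordinates, then point replication at those coordinates and restart it.
    master=db-master.example            # placeholder master hostname
    file=binlog.000123; pos=4           # placeholder coordinates from the copy
    mysql -e "CHANGE MASTER TO MASTER_HOST='$master', MASTER_LOG_FILE='$file', MASTER_LOG_POS=$pos; START SLAVE;"
    mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Running|Seconds_Behind'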