[00:20:20] PROBLEM - MySQL Slave Running on es1004 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Incorrect key file for table ./enwiki/blobs_cluster22.MYI: t [00:20:52] and es1004 needs to be reslaved [00:27:05] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:29:47] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:27:35] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [01:42:08] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 204 seconds [01:47:50] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 3 seconds [03:33:13] PROBLEM - Host mw4 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:22] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [05:32:16] !log flushing mobile varnish cache [05:32:18] Logged the message, Mistress of the network gear. [07:25:36] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:29:48] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:57:39] hello [08:02:10] !log running l10nupdate for {{bug|34938}} [08:02:13] Logged the message, Master [08:13:35] !log rerunning l10nupdate for {{bug|34938}} [08:13:38] Logged the message, Master [08:26:51] cp: cannot create regular file `/home/wikipedia/common/php-1.19/cache/l10n/l10n_cache-zh-my.cdb': Permission denied [08:26:55] WONDERFUL [08:26:56] :D [08:39:54] !log Gave up running l10nupdate script it has some file permissions issues. Opened {{bug|36119}} and {{bug|36120}} [08:39:57] Logged the message, Master [09:35:40] PROBLEM - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:49] PROBLEM - LVS HTTP on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:36:52] PROBLEM - LVS HTTP on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:01] PROBLEM - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:04] that's not good [09:37:28] no [09:40:20] RECOVERY - LVS HTTP on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 79844 bytes in 0.193 seconds [09:40:37] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:41] and at the same time my internet connection dropped [09:40:43] did you look at it? [09:41:58] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:42:34] RECOVERY - LVS HTTP on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 79844 bytes in 0.162 seconds [09:42:43] RECOVERY - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79459 bytes in 0.193 seconds [09:43:01] RECOVERY - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79459 bytes in 0.275 seconds [09:43:36] no, not yet, I didn't realize you had been dropped off [09:43:39] started to connect to lvs1001.mgmt after i just saw the SSH crit, but it was back in that second [09:43:48] useless [09:55:34] so why did it happen? 
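The LVS alerts in the 09:35-09:43 window come from plain HTTP and HTTPS probes against the load-balanced service addresses, so when the cause is unclear the same check can be reproduced by hand. A minimal sketch, assuming curl and the probe's 10-second timeout:

    # Probe the LVS services the way the monitoring check does: fetch the
    # landing page through the load balancer with a hard 10-second timeout.
    for url in http://wikipedia-lb.eqiad.wikimedia.org/ https://wikimedia-lb.eqiad.wikimedia.org/; do
        curl -sSk -o /dev/null -m 10 -w "%{url_effective} -> %{http_code} in %{time_total}s (%{size_download} bytes)\n" "$url"
    done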
[10:27:31] !log Sending Brazil upload traffic to eqiad [10:27:34] Logged the message, Master [10:54:00] New patchset: Mark Bergsma; "Set PATH" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5417 [10:54:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5417 [10:54:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5417 [10:54:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5417 [11:19:18] !log Sending US upload traffic to eqiad as well [11:19:21] Logged the message, Master [11:28:49] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [11:42:10] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:52] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 7.443 seconds [11:45:19] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:46] PROBLEM - SSH on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:49] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:58] RECOVERY - SSH on ms-fe1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:48:01] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 2.223 seconds [11:48:01] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 3.976 seconds [11:50:25] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:52:40] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:55:36] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:48] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.059 seconds [11:59:48] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 3.720 seconds [12:00:29] New patchset: Mark Bergsma; "Halve the number of proxy workers to reduce memory usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5419 [12:00:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5419 [12:00:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5419 [12:00:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5419 [12:02:39] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:07:09] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:15] switf is down [12:08:18] :o [12:08:24] no thumbs [12:08:39] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:51] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:12:16] petan|wk: overloaded, m ark is on it, trying to tune [12:12:30] ok [12:12:39] stop the stupid m ark thing [12:12:51] you are pinged on that? 
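The r5419 change above ("Halve the number of proxy workers to reduce memory usage") trades Swift proxy concurrency for memory headroom. A rough sketch of how the effect might be checked on a proxy host such as ms-fe1 — the config path and the ps/awk accounting are assumptions, not taken from the change itself:

    # The worker count lives in proxy-server.conf; compare it against the
    # resident memory the swift-proxy-server processes are actually using.
    grep -n '^workers' /etc/swift/proxy-server.conf
    ps axo rss,cmd | grep '[s]wift-proxy-server' | \
        awk '{sum += $1; n++} END {printf "%d workers, %.0f MB RSS total\n", n, sum/1024}'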
[12:12:52] :D [12:13:11] sure, somehow thought you like less highlights when i saw others doing it [12:13:49] that's like when Ryan Lane told me that !Ryan pings him as well [12:20:03] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:20:52] mutante: can you restart ns2? [12:20:56] yep [12:22:45] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.128 seconds response time. www.wikipedia.org returns 208.80.154.225 [12:22:56] !log restarted pdns on ns2 [12:22:59] Logged the message, Master [12:25:27] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 4.144 seconds [12:25:41] :) [12:25:45] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.608 seconds [12:30:27] New patchset: Mark Bergsma; "Move Swift iptables inclusion to the role classes Disable iptables firewall for all-internal pmtpa prod cluster - it's causing issues with full conntrack tables and dropped packets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5420 [12:30:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5420 [12:31:56] New patchset: Mark Bergsma; "Move Swift iptables inclusion to the role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5420 [12:32:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5420 [12:32:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5420 [12:32:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5420 [12:40:35] !log Disabled iptables firewalls on internal prod swift cluster servers as it's dropping packets [12:40:38] Logged the message, Master [12:41:03] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 1.168 seconds [13:07:14] New patchset: Dzahn; "write logs for refreshLinks cron jobs to mwdeploy home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5425 [13:07:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5425 [13:08:59] New patchset: Dzahn; "write logs for refreshLinks cron jobs to mwdeploy home, ugh, capitalization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5425 [13:09:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5425 [13:10:29] !log Sending India upload traffic to upload-lb.eqiad [13:10:31] Logged the message, Master [13:11:20] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5425 [13:11:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5425 [13:16:37] New patchset: Dzahn; "refreshlinks - move log dir creation out of the definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5426 [13:16:53] mutante: can I ask you to look at some rewrite rules for shorturl? [13:16:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5426 [13:17:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5426 [13:17:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5426 [13:22:11] hexmode: in a little while,ok? or can you assign something in bz? other stuff i waiting to get done before weekend. urgent? [13:22:40] mutante: it has sat for a while, if you have someone else I should ask... [13:23:01] mutante: also, I could just submit these in gerrit if you point me to the place [13:24:20] if it's not urgent before the weekend feel free to assign to me and if i don't have answers i can find someone to forward it to [13:24:58] are these edits to redirects.conf ? [13:27:10] hexmode: just submitting to gerrit sounds very good, about the right place i'll have to look at the request [13:28:30] mutante: redirect.conf, yes... which git repo? [13:28:47] redirect.conf is not in a git repo [13:28:55] yea, thats still fenari stuff [13:28:57] it is just in local to fenari svn repo :D [13:29:07] but it would be nice if it could go through gerrit [13:29:53] mutante: hashar: Could you put it in gerrit? [13:30:24] even if it isn't deployed from there right now, at least then I could submit via gerrit [13:30:26] hrm [13:30:34] maybe I can get it from noc [13:31:07] yes,its on noc [13:32:43] hexmode: even if it's not served from there yet, like you said, but i guess the place should be operations/puppet production and ./puppet/files/apache/ [13:33:06] or _maybe_ ./templates/ instead of ./files/ [13:33:25] mutante: k, I'll figure out the right one and do that. tyvm [13:33:26] if having it as a template is an advantage here [13:35:58] <^demon> Trying to move that stuff to puppet is more than a quick copy+paste from noc to files/ or templates/ [13:46:25] can someone review and approve this commit please ok thanks bye [13:46:26] https://gerrit.wikimedia.org/r/#change,5350 [13:47:18] anal interns ftw [13:47:58] haha, wait me? [13:48:21] i am not an intern! and I am not toooooo anal, but I am analytics [13:48:38] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:55:39] New review: Ottomata; "Hi guys, I'm waiting on this one to start looking at and working on udp2log stuff. Robla and Dieder..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5350 [13:59:23] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:35] ottomata: there's a group, 'anal interns' [14:01:44] I guess someone tagged you right, anal people [14:01:47] !!!11 [14:01:52] heheh [14:02:23] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:00] New patchset: Mark Bergsma; "Add the pmtpa Squids as a thumbnail backend to help Swift out of its misery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5429 [14:03:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5429 [14:03:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5429 [14:03:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5429 [14:06:44] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 9.103 seconds [14:06:44] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:07:56] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 2.628 seconds [14:08:05] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.997 seconds [14:11:50] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:18:53] New patchset: MarkAHershberger; ".conf files from noc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [14:19:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [14:22:57] New patchset: MarkAHershberger; "--amend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5432 [14:23:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5432 [14:23:51] New review: Demon; "I believe you meant to --amend the parent, not make a new commit with the summary of "--amend"" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/5432 [14:24:16] Change abandoned: MarkAHershberger; "learning git moar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5432 [14:25:37] New patchset: MarkAHershberger; ".conf files from noc, with w/s removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [14:25:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [14:29:06] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3815 [14:29:24] New review: Ottomata; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [14:30:45] !log Disabled down-pref of Tampa AS2828 routes [14:30:48] Logged the message, Master [14:34:52] New patchset: MarkAHershberger; ".conf files from noc, with w/s removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5431 [14:35:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5431 [14:41:35] New patchset: MarkAHershberger; "Bug #1450 ? rewrite rules for ShortURL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5433 [14:41:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5433 [14:46:47] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Timeout reading from 10.0.11.27:11000 [14:48:28] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [14:48:51] ok. 
recovers before you can hit enter on restart [14:50:25] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:51:41] !log starting swift-container-auditor on ms-be1 [14:51:44] Logged the message, Master [14:51:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:01:30] !log Converted OSPF directly connected redistributed routes from type 2 to type 1 [15:01:31] New patchset: Dzahn; "add $nagios_group memcached to memcached monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5435 [15:01:32] Logged the message, Master [15:01:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5435 [15:03:40] New review: Dzahn; "just for monitoring group URLs" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5435 [15:03:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5435 [15:07:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.644 seconds [15:10:02] !log apache error log on stafford has ruby exceptions re: phusion_passenger [15:10:04] Logged the message, Master [15:10:46] that's probably because you just changed monitoring and thus nagios config changed and thus blocks puppet for a long time [15:11:38] the socket timeout / spence? true [15:18:37] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [15:31:12] New patchset: Mark Bergsma; "Implement Nagios time periods to restrict pages to awake hours (timezone dependent)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5436 [15:31:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5436 [15:33:33] someone can explain how the localisation cache is being rebuild? [15:33:35] on prod [15:33:48] does it run for every wiki? [15:34:08] nvm mutante just replied :o [15:38:00] ok, mutante doesn't know :P so anyone else? [15:38:15] I mean the rebuildLocalizationCache script [15:40:41] I assume we run it per project [15:40:43] lemme poke around [15:41:55] scap does it (so to all servers with mw installed) [15:42:25] apergos: but that would mean next project override the previous one [15:42:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:32] because all projects share same folder [15:42:33] and then l10update runs it [15:42:37] no [15:42:47] howcome [15:42:48] nothing gets overidden [15:44:14] one by one doesn't work [15:44:26] it removed the cache of previous wiki [15:44:39] so in fact only last wiki has cache :o [15:44:53] is that script you use in prod available for download? 
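For reference, the per-wiki rebuild being discussed here is normally driven through MediaWiki's rebuildLocalizationCache.php maintenance script, run once per wiki. A minimal sketch of such a loop — the mwscript wrapper, the dblist path, and the thread count are assumptions about the local setup, not details confirmed in this log:

    # Rebuild the CDB localisation cache for every wiki in the farm, one at a time.
    while read -r wiki; do
        mwscript rebuildLocalizationCache.php --wiki="$wiki" --threads=4
    done < /usr/local/apache/common/all.dblist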
[15:45:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.725 seconds [15:45:43] yes [15:46:05] where [15:46:07] there have been some issues with l10nupdate recently [15:46:12] a permissions thing, I noticed recently [15:46:22] it's in the server admin log but I don't know more than that [15:47:34] each set of messages goes into a file with the wiki name in it [15:47:41] like this: [15:47:43] messages-zhwiki-zh-cn [15:48:04] that's why there is no overwrite [15:49:44] you can see the global variables that affect it at noc (CommonSettings, InitialiseSettings), and then I would just read the script itself. but the l10nupdate script wrapper is in the public puppet repo [15:50:00] apergos: all configuration files on prod and labs are the same [15:50:09] we use CommonSettings from prod [15:50:23] so I am pretty sure if there is any variable, we have it [15:50:27] but it doesn't work though [15:50:38] because I don't have that script you use [15:53:30] I'm not sure the one in puppet uses rebuildLocalizationCache, it looks like it's another piece of the puzzle [15:53:58] I guess it's using something else [15:55:10] uh huh, update.php from the localization extension [15:57:18] ah here's what those look like [15:57:20] l10nupdate-xh.cache [15:57:41] and l10n_cache-zh.cdb [15:58:18] but like I say there was some sort of issue with a recent run [15:58:27] where they couldn't complete it I guess [16:02:08] apergos: not sure, but I think you're talking about upgrading 1.18->1.19 but I believe that issue (whatever it was) got resolved right away? [16:02:34] don't know [16:02:39] it was in the last day or two [16:03:44] apergos: then I'm mistaken, or else the issue was not fixed permanently [16:03:54] ok [16:04:05] I dunno, I just saw something scroll by about it in one of the channels [16:04:48] 08:39 hashar: Gave up running l10nupdate script it has some file permissions issues. Opened bug 36119 and bug 36120 [16:04:50] here it is [16:05:04] yup [16:05:15] oops, sorry for gratuitous ping [16:05:24] I tried running l10nupdate this morning to fix some sawiki bug :/ [16:05:44] I have opened two bugs and sent an email to Roan Kattouw [16:05:54] cool [16:06:12] that is all really messy [16:06:31] I got that impression yeah [16:11:30] !log add missing memcached servicegroup to nagios, restarted [16:11:32] Logged the message, Master [16:18:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.795 seconds [16:34:52] New review: Lcarr; "Thank god!" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/5436 [16:35:07] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Timeout reading from 10.0.8.23:11000 [16:36:37] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [16:49:50] New patchset: Mark Bergsma; "Create Swift filesystems with inode size 1024, as suggested in the docs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5439 [16:50:07] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5439 [16:59:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.620 seconds [17:07:59] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Timeout reading from 10.0.8.39:11000 [17:09:20] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [17:11:55] New patchset: Pyoungmeister; "part 1 of getting multicast relay going" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5335 [17:12:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5335 [17:15:12] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5335 [17:15:31] !log stopping puppet on locke and emery. just to be safe... [17:15:34] Logged the message, notpeter [17:16:58] New patchset: Catrope; "Per bug 36120 , use cp --force in l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5443 [17:17:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5443 [17:18:08] diederik: just fyi, I'm going to be mucking about with udp2log data slightly [17:18:58] New patchset: Catrope; "(bug 36119) Fix location of clearMessageBlobs.php , it moved in 1.19" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5444 [17:19:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5444 [17:29:04] New patchset: Pyoungmeister; "part 1 of getting multicast relay going" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5335 [17:29:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5335 [17:29:45] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5335 [17:29:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5335 [17:31:12] LeslieCarr: ok... I've got a debian package for bz4... now to start on the puppet config [17:40:58] mark: the full conntrack tables is easily solvable without removing the rules [17:40:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:12] we can just increase the amount of memory the kernel assigns for conntrack entries [17:43:31] :-) [17:43:53] why does it need to track conns?!!? [17:45:23] New patchset: Pyoungmeister; "cleanin'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5446 [17:45:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5446 [17:47:27] New patchset: Pyoungmeister; "cleanin'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5446 [17:47:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5446 [17:47:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.279 seconds [17:48:16] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5446 [17:48:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5446 [17:51:11] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:51:47] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:51:47] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:51:57] notpeter: Did you just break HTTPS? --- ^^ [17:52:05] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:05] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:12] RoanKattouw: it did that earlier.. Or yesteday [17:52:14] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:52:14] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:14] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:20] No, was today [17:52:23] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:52:41] very well might have [17:53:44] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50125 bytes in 0.775 seconds [17:53:53] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:54:01] oh, awesome [17:54:10] I guess nginx is subscribed to the conf [17:54:10] Give it a few minutes [17:54:21] unless you did break it :p [17:54:21] what conf? [17:54:24] thus "bring down https whenever you push out a new conf" is enabled [17:54:31] Oooh, it's restarting all nginx procs at the same time? [17:55:03] yep! [17:55:14] ah yes, the ol' "dos the site" setup... [17:55:23] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59467 bytes in 0.804 seconds [17:55:42] New patchset: preilly; "Add ACL for new carriers and redirect support for carriers landing page on m. domain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5447 [17:55:50] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:55:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5447 [17:56:00] notpeter: do you know what's going on? [17:56:10] yes [17:56:11] file { [17:56:11] "/etc/nginx/nginx.conf": [17:56:11] content => template('nginx/nginx.conf.erb'), [17:56:13] notify => Service['nginx'], [17:56:14] did you change nginx's config? [17:56:17] I pushed out a new conf [17:56:18] yes [17:56:23] and it notified nginx [17:56:26] and now puppet is restarting everything, argh. [17:56:26] all at the same time [17:56:40] did you run puppet by hand? [17:56:45] this seems like... a problematic design to me [17:56:51] no. 
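The root of the exchange above is that every SSL terminator picked up the new nginx.conf and bounced nginx inside the same window, because the file resource notifies Service['nginx'] on all of them. One common way to avoid that, sketched here with an assumed host list and 2012-era puppetd flags, is to disable the agents before merging a change that restarts a critical service, then re-enable and run them one host at a time:

    # Disable puppet on the SSL terminators before merging the risky change.
    hosts="ssl1 ssl2 ssl3 ssl4 ssl1001 ssl1002 ssl1003 ssl1004"
    for h in $hosts; do ssh "$h" 'sudo puppetd --disable'; done

    # After the merge, roll it out one terminator at a time, letting each
    # nginx come back and re-pool behind LVS before moving on.
    for h in $hosts; do
        ssh "$h" 'sudo puppetd --enable && sudo puppetd --test'
        sleep 60
    done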
[17:56:53] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:57:10] then shouldn't puppet run at different times on all of these hosts? [17:57:14] yes [17:57:19] but hey, here we are ;) [17:57:26] so yeah, that's something to change [17:57:47] PROBLEM - LVS HTTPS on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:58:12] notpeter: Will this fix itself automatically if we just wait a while? [17:58:18] yes [17:58:19] If it's a restart it sounds like it should [17:58:21] they're just restarting [17:59:09] what's up? [17:59:26] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [17:59:32] are you sure? something seems off [17:59:40] mark: Mass nginx restart due to updated config file, HTTPS down everywhere [17:59:41] mark: I pushed a new nginx conf [17:59:44] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [17:59:48] it notified nginx [17:59:53] they all restarted at once [17:59:53] like, I can connect with s_client, getting the certificate chain then it immediately closes the connection [18:00:11] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3981 bytes in 0.442 seconds [18:00:20] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.442 seconds [18:00:25] you mean you ran puppet all at once on all boxes? [18:00:28] of course it restarted nginx [18:00:29] paravoid: You can see this in Nagios, some say cannot make SSL conn, others say no data received [18:00:35] what did you expect? [18:00:38] RECOVERY - LVS HTTPS on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3930 bytes in 0.111 seconds [18:00:56] RoanKattouw: I am not talking about nagios [18:01:08] don't we have puppet to run every 30' or so? [18:01:37] yes [18:01:59] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60170 bytes in 0.887 seconds [18:02:21] paravoid: Weren't you saying that you could connect to one of these HTTPS proxies, it would give you certs, then close the conn? [18:02:26] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39068 bytes in 0.707 seconds [18:02:26] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 44133 bytes in 0.777 seconds [18:03:30] so for future reference what should have been done differently here? [18:03:47] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43449 bytes in 0.779 seconds [18:04:08] config went through gerrit, and puppet went mad applying it simultaneously, right? [18:04:14] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 55816 bytes in 0.770 seconds [18:04:14] yes [18:04:23] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:04:28] so would you have slain puppet everywhere first? [18:04:32] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:04:37] RoanKattouw: yes and it still happens to me. [18:04:39] I guess you need to stop puppet first everywhere [18:04:42] I don't think the problem is going to fix itself [18:05:18] notpeter: sigh. finest quality. [18:05:26] hurray for landmines [18:05:32] *** Fatal error: The TLS connection was non-properly terminated. 
[18:05:32] *** Handshake has failed [18:05:32] GnuTLS error: The TLS connection was non-properly terminated. [18:05:41] (same with openssl s_client) [18:05:44] paravoid: That's exactly what I said I was seeing in Nagios: some LVS IPs say "CRITICAL - Cannot make SSL connection" and others say "HTTP CRITICAL - No data received from host". I'm assuming that what you're seeing is the latter, and that other LVSes won't even give you certs [18:05:47] paravoid: to what host? [18:05:52] Oh, OK, so what you're seeing is the /former/ , sorry [18:06:12] try en.wikipedia.org, for example [18:07:27] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:07:27] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:07:27] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:07:32] !log restarting nginx on ssl1002 and ssl1004 as they are not back up [18:07:34] Logged the message, notpeter [18:08:46] [17754967.392078] nginx[20541]: segfault at 0 ip 00000000004764fd sp 00007fff8f75fb80 error 4 in nginx[400000+92000] [18:08:48] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:08:50] [17754967.451980] nginx[20524]: segfault at 0 ip 00000000004764fd sp 00007fff8f75fbb0 error 4 in nginx[400000+92000] [18:08:53] [17754967.951921] nginx[20546]: segfault at 0 ip 00000000004764fd sp 00007fff8f75fb30 error 4 in nginx[400000+92000] [18:08:57] oh how nice [18:08:58] nginxs are segfaulting [18:09:02] there are more of these too [18:09:11] paravoid: what box? [18:09:16] ssl1 [18:09:24] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:09:24] ssl2 [18:09:35] ssl3... [18:09:35] of course they are [18:09:39] there's an exploit in the wild [18:09:51] # dmesg | grep segfault |grep nginx | wc -l [18:09:51] 1453 [18:09:55] yay. [18:10:18] awesome [18:10:46] paravoid: suggestions? [18:10:53] hah, I was about to ask you [18:11:03] so, restarting nginx on ssl1? [18:11:07] see if it still segfaults [18:11:15] kk [18:11:16] what's the configuration change? [18:11:34] + access_udplog 208.80.154.15:8419 squid_combined; [18:11:36] that's it [18:11:39] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43448 bytes in 0.777 seconds [18:11:44] really nothing interesting [18:11:57] PROBLEM - LVS HTTPS on upload-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:12:07] paravoid: I'm going to restart nginx on ssl1 [18:12:12] I did [18:12:15] nothing changed [18:12:19] I did a full stop/start cycle [18:12:19] kk [18:12:29] can we rollback the config change to see if it'll make any difference? [18:12:37] sure [18:13:00] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50128 bytes in 0.774 seconds [18:13:18] RECOVERY - LVS HTTPS on upload-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 598 bytes in 0.112 seconds [18:13:36] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 57556 bytes in 0.771 seconds [18:14:21] PROBLEM - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:15:01] notpeter: I'm doing manually on ssl1? [18:15:29] notpeter - reverted? 
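The failed handshakes paravoid pasted above can be reproduced against a single hostname with openssl's client, which also shows whether the connection dies during the handshake or right after it. A minimal sketch, using the en.wikipedia.org example he suggested earlier:

    # A healthy terminator prints the certificate chain and a "Verify return code"
    # line; a crashing worker drops the connection partway through instead.
    echo | openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org 2>&1 | tail -n 6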
[18:15:38] I just reverted it manually on ssl1 and restarted nginx [18:15:51] RECOVERY - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 56725 bytes in 0.166 seconds [18:15:52] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:15:56] I don't see anothera segfault so far [18:17:02] my git repo is fucked [18:17:08] paravoid: can you check it in [18:17:08] did the same on ssl2 and it seems to be okay there as well [18:17:14] I'm going to revert it in puppet now [18:17:18] thanks [18:17:21] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:17:30] PROBLEM - LVS HTTPS on upload-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:17:39] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 53321 bytes in 0.775 seconds [18:17:41] although this will cause them all to restart again.... [18:17:48] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:17:52] it wouldn't do that on the same time [18:18:09] all of the hosts end up running puppet at the same time :/ [18:18:15] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.442 seconds [18:18:19] no that's not wha it happened [18:18:31] ah, ok [18:18:33] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3980 bytes in 0.447 seconds [18:18:35] fair enough [18:18:55] sec, fixing it and will explain my theory [18:18:59] ja [18:20:05] i cannot get to https://www.mediawiki.org/wiki/MediaWiki [18:20:09] gerrit comes up [18:20:14] is the site dead? [18:20:20] jpostlethwaite: HTTPS is broked [18:20:23] jpostlethwaite: known issue, being worked on [18:20:27] thanjs [18:20:30] thanks [18:21:06] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 67322 bytes in 0.815 seconds [18:21:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:51] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39066 bytes in 0.701 seconds [18:21:52] RECOVERY - LVS HTTPS on upload-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 599 bytes in 0.120 seconds [18:21:58] New patchset: Faidon; "Revert a access_udplog addition to nginx's config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5448 [18:22:00] PROBLEM - LVS HTTPS on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:22:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5448 [18:22:27] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:22:45] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:22:47] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5448 [18:22:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5448 [18:22:54] ok, so now we need to run puppet on all ssl hosts [18:23:06] alright, I'll go after esams [18:23:21] RECOVERY - LVS HTTPS on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3931 bytes in 0.112 seconds [18:23:30] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59499 bytes in 0.771 seconds [18:24:28] they're all already running, and hammering the puppetmaster... [18:24:48] yay [18:24:51] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:24:51] PROBLEM - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:25:18] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:26:03] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:26:48] oh fuck [18:26:51] huh? [18:26:58] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3979 bytes in 0.441 seconds [18:26:59] does that change applies to puppetmaster's nginx too? [18:27:11] :S [18:27:12] ah, we have apache/passenger for that, nvm [18:27:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.628 seconds [18:27:24] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43451 bytes in 0.779 seconds [18:28:32] so.... I'm still shocked that that change caused nginx to segault all over the place [18:28:46] running puppet in esams [18:29:02] New review: Asher; "There was a bug in the rewrite rule (the first / breaks it) but also, we don't deploy any of this st..." 
[operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/5433 [18:29:03] RECOVERY - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79457 bytes in 0.445 seconds [18:29:03] 2 at a time [18:29:12] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host [18:29:19] puppet doesn't want to cooperate, I've modified it manually on ssl1-4 and ssl1001-1004 [18:30:34] doing same in esams [18:31:27] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:31:45] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [18:31:45] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39066 bytes in 0.712 seconds [18:32:03] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 53321 bytes in 0.775 seconds [18:32:03] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60171 bytes in 0.882 seconds [18:32:12] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59983 bytes in 0.778 seconds [18:32:30] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.442 seconds [18:32:30] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 67322 bytes in 0.824 seconds [18:32:32] done in esams [18:32:33] wow [18:32:39] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.447 second response time [18:32:48] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3960 bytes in 0.551 seconds [18:33:06] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 44134 bytes in 0.773 seconds [18:33:06] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43451 bytes in 0.776 seconds [18:33:06] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50128 bytes in 0.774 seconds [18:33:24] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80040 bytes in 0.831 seconds [18:34:55] so, things are getting back to normal, hopefully [18:35:04] Looks like HTTPS in esams is back up [18:35:10] yeah [18:38:26] New review: MarkAHershberger; "As ^demon said in IRC:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5433 [18:39:11] paravoid: also, I just ran puppet on ssl3001 to make sure that it didn't revert the by-hand changes and it didn't [18:39:44] great [18:41:44] so, the theory is [18:45:07] puppet run at different times within the 30 minute window [18:45:15] yes [18:45:25] so, not all at the same time [18:45:35] made the change, and restared nginx [18:45:41] ja [18:45:46] which should have been okay, because it wouldn't do it at the same time [18:45:51] yep [18:45:59] so, nginxs were restarted at different points [18:46:11] but, nginxs started going haywire with the segfault [18:46:13] but they were all segfaulting constantly after the restart [18:46:15] yeah [18:46:18] right. [18:46:34] which made the problem persist and aggravate itself [18:46:40] yep [18:47:45] *sigh* [18:49:40] Speaking of SSL [18:49:43] Have you guys seen https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Issues_with_the_security_certificate.3F ? 
[18:49:58] It looks like some people going to *.wikipedia.org on HTTPS got the wikimedia cert instead [18:58:18] New patchset: Pyoungmeister; "moving out a file so it's only deployed once." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5451 [18:58:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5451 [18:59:17] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5451 [18:59:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5451 [19:01:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:32] diederik: so.... no oxygen today. apparently sending it logs from our ssl terminators causes them to segfault constantly and take down https for all of our site. [19:08:35] as I just learned [19:09:16] for wikipedia? [19:09:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.067 seconds [19:09:40] notpeter: and if we would exclude ssl traffic? [19:10:25] ottomata: actually, for all of our projects :) [19:10:33] diederik: yeah, I'm working on still getting the rest of it up [19:10:33] geez [19:10:54] ok, if we can just get it set up ready to run log filters [19:10:58] even if it is not running them [19:11:01] that would be helpful [19:11:04] notpeter: well i don't think we did not see that one coming :( [19:11:15] diederik: noooooo, that was... very unexpected. [19:11:37] just a big plain bummer [19:11:57] ottomata: that part is done. it can run filters. now it's down to getting all of the right data sources pointed at in [19:12:00] diederik: yep [19:15:09] New patchset: Pyoungmeister; "everything gets its own name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5487 [19:15:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5487 [19:16:11] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5487 [19:16:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5487 [19:17:17] ok cool [19:17:21] now if only i could get access :) [19:17:30] hello hello helloooooo [19:17:39] (echoes) [19:18:04] https://gerrit.wikimedia.org/r/#change,5350 [19:22:21] ottomata: hhhmmm, yes. I'm not 100% on who has to approve cluster access [19:22:28] I will ask CT when he's back from lunch [19:22:52] i have forwarded CT the RT ticket [19:22:56] and mablebed [19:23:00] oh maplebed isonline now! [19:23:02] and mark [19:24:33] Ping LeslieCarr when you restart the deletion script, could you set the date to feb 15 instead of feb 5? 
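notpeter's note that oxygen "can run filters" refers to udp2log's filter config, where each line pairs a sampling factor with a file or pipe sink. The sketch below is illustrative only — the config path, the 1-in-1000 factor, and even the exact directive syntax are assumptions rather than anything quoted in this log:

    # Append an example entry to the udp2log instance's filter config:
    # keep one request line in every thousand in a flat file.
    echo 'file 1000 /var/log/udp2log/sampled-1000.tsv.log' >> /etc/udp2log/oxygen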
[19:35:32] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 216 seconds [19:35:32] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 216 seconds [19:39:53] RECOVERY - Host mw4 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [19:41:05] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [19:41:05] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [19:42:26] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [19:42:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:38] PROBLEM - Apache HTTP on mw4 is CRITICAL: Connection refused [19:44:46] hmm [19:45:05] !Log powercycled mw4, it was unresponsive to pings and via mgmt console [19:45:18] hmm I wonder if it wil take Log instead of log [19:45:24] what is the volatile puppet file source? [19:45:33] puppet:///volatile/squid/squid.conf [19:45:36] !log powercycled mw4, it was unresponsive to pings and via mgmt [19:45:39] Logged the message, Master [19:45:47] silly case-sensitive bot [19:45:57] ottomata: not sure. but I need to add someting in there as well :) [19:45:57] or, really the same thing as you [19:46:11] i'm looking at this RT ticket http://rt.wikimedia.org/Ticket/Display.html?id=2745 [19:46:16] and another one [19:46:20] but startting with that one [19:47:03] ah, gotcha [19:47:25] I need to figure out where that's kept so as to get squids shooting traffic at o2 [19:48:17] ah hmm, i need to understand how udp2log works, but i thought thought o2 would latch on to a multicast addy? [19:48:19] maybe not [19:48:53] squid-logging-multicast-relay.conf [19:48:57] exec /usr/bin/socat UDP-RECV:8420,su=nobody UDP4-DATAGRAM:233.58.59.1:8420,ip-multicast-ttl=10 [19:49:26] the udp2log instance is listening to the relay [19:49:29] *relay [19:49:42] on o2? [19:49:46] but squid and nginx are not yet sending to the relay [19:49:46] yes [19:49:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.174 seconds [19:51:34] so, in abot 5 minutes, o2's udp2log instance will be getting logs from all traffic that goes through varnish, via multicast [19:51:46] (up until now, the multicast relay was not yet in use) [19:51:59] emery+locke don't use it? [19:52:21] no/not yet [19:52:47] k [19:55:00] New patchset: Pyoungmeister; "need to comment out. they keep spawning..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5489 [19:55:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5489 [19:55:36] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5489 [19:55:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5489 [19:57:35] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.324 second response time [19:57:53] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [20:02:29] these are interesting [20:02:32] Apr 18 20:43:26 mw4 kernel: [18446744003.567280] BUG: soft lockup - CPU#18 stuck for 17163091968s! [php:12060] [20:02:38] Apr 18 20:43:26 mw4 kernel: [18446744003.574654] BUG: soft lockup - CPU#19 stuck for 17163091968s! 
[apache2:10091] [20:04:14] not any kind of usefull call trace though, output must have been truncated [20:07:28] ottomata: so, one possibility is to have nginx send to just the relay, and then have it send traffic off to locke/emery, which might get us around the "three logging hosts causes nginx to die repeatedly" problem [20:08:34] would that just turn into "one logging host causes nginx to die"? [20:08:48] hahahaha, could be ;) [20:13:12] this is just to get the logs to o2? [20:13:17] i really don't know how this system works at all [20:13:36] i emailed Tim Starling to see if I could set up a call with him monday eve to get an overview [20:13:45] but man, i am so in the dark about so many things i'd like to help with [20:13:53] and i don't know how to find out [20:13:59] how, or where to look [20:19:47] PROBLEM - NTP on mw4 is CRITICAL: NTP CRITICAL: Offset unknown [20:22:38] RECOVERY - NTP on mw4 is OK: NTP OK: Offset 0.1306298971 secs [20:23:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.587 seconds [21:01:15] notpeter: I was talking to Ryan in person just now, he said the segfault came from his udp2log plugin for nginx [21:01:23] (C code that he wrote and that wasn't reviewed by Tim) [21:01:47] Apparently there's some weird-ass bug in his code that makes it so you can have 0, 1 or 2 access_udplog rules, but not 3 [21:03:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:05:01] ah, ok [21:05:21] RoanKattouw: then I'll attempt to make use of the relay [21:05:45] good to know ;) [21:06:11] ah, weird [21:11:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.554 seconds [21:12:57] !log starting swift delete script on ms-be2 [21:12:59] Logged the message, Mistress of the network gear. [21:18:14] New patchset: Pyoungmeister; "this should do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [21:18:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5491 [21:19:28] root@ssl1:~# grep pem /etc/nginx/sites-enabled/* [21:19:57] /etc/nginx/sites-enabled/wikinews: ssl_certificate /etc/ssl/certs/star.wikinews.org.chained.pem; [21:20:01] /etc/nginx/sites-enabled/wikipedia: ssl_certificate /etc/ssl/certs/test-star.wikipedia.org.chained.pem; [21:20:06] /etc/nginx/sites-enabled/wikiquote: ssl_certificate /etc/ssl/certs/star.wikiquote.org.chained.pem; [21:20:12] why is it test-* for wikipedia? [21:20:36] lol [21:20:46] Can you check whether that .pem file contains the correct cert? [21:21:02] I have seen reports that the SSL terminators for wikipedia.org will occasionally serve the wikiMedia.org cert [21:21:45] ah, test-star has *.m.wikipedia.org too [21:30:05] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [21:44:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.656 seconds [22:01:32] New patchset: Pyoungmeister; "this should do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [22:01:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5491 [22:14:48] New patchset: Jeremyb; "make spacing consistent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5492 [22:15:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5492 [22:19:24] New patchset: Pyoungmeister; "this should do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [22:19:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5491 [22:24:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:05] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5491 [22:26:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5491 [22:32:45] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [22:33:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.220 seconds [22:37:16] New patchset: Pyoungmeister; "perms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5494 [22:37:29] take that, mobile traffic stats! [22:37:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5494 [22:38:01] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5494 [22:38:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5494 [22:41:49] New patchset: Pyoungmeister; "naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5495 [22:42:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5495 [22:43:01] New patchset: Pyoungmeister; "naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5495 [22:43:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5495 [22:43:47] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5495 [22:51:08] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5495 [22:51:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5495 [22:54:33] New patchset: Pyoungmeister; "something ain't right. leaving ina stable state until monday" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5497 [22:54:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5497 [22:55:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5497 [22:55:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5497 [23:07:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:46] binasher: should we worry about db1001, db1020, or db1047 ? 
[23:13:03] ACKNOWLEDGEMENT - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 55780 MB (3% inode=99%): asher this will stay full until we purchase a disk shelf [23:13:04] i ack'd the space alert on db1047 [23:13:33] db1001 had a hardware failure and was repaired, but i haven't put it back in service yet [23:14:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.841 seconds [23:14:33] db1047 is analinterns i think. idk abou thte other 2 [23:14:34] hrm, i thought there was an rt ticket about 1020/22 [23:16:40] oh, db1020 ticket was resolved 3/23 - it had to have its raid card replaced [23:17:32] db1022 ticket was resolved too [23:18:46] so those are both dbs that were configured as slaves, then needed hardware repair (so they were in nagios) and in the meantime other hosts took over their former place. so they're to-be-allocated dbs that are in nagios from before they failed [23:24:33] ACKNOWLEDGEMENT - mysqld processes on db1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld asher need to re-slave [23:25:03] ACKNOWLEDGEMENT - mysqld processes on db1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld asher need to re-slave [23:32:26] db1022 also has checks disabled. they're enabled on db1020 [23:47:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.463 seconds
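Both ends of the day circle back to the same task: es1004 at 00:20 and db1020/db1022 just above all need reslaving. The usual shape of that operation, sketched very loosely — hostnames and binlog coordinates below are placeholders, and in practice the data reload would come from a hot copy of a healthy slave rather than a plain dump:

    # On the broken slave: stop replication and confirm what failed.
    mysql -e 'STOP SLAVE; SHOW SLAVE STATUS\G' | egrep 'Running|Last_Error|Master_Log'

    # Reload a consistent copy taken from a healthy host, noting its binlog
    # coordinates, then point replication at those coordinates and restart it.
    master=db-master.example            # placeholder master hostname
    file=binlog.000123; pos=4           # placeholder coordinates from the copy
    mysql -e "CHANGE MASTER TO MASTER_HOST='$master', MASTER_LOG_FILE='$file', MASTER_LOG_POS=$pos; START SLAVE;"
    mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Running|Seconds_Behind'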