[00:03:10] !log maxsem synchronized php-1.21wmf6/extensions/MobileFrontend/
[00:03:20] Logged the message, Master
[00:10:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds
[00:10:51] New patchset: Lwelling; "Experimentally disable Captcha on enwiki so we can monitor effect" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42892
[00:12:53] New review: Tim Starling; "sync-apache is an rsync-based sync script independent of scap." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871
[00:12:57] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds
[00:13:32] wait, what
[00:13:33] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 205 seconds
[00:13:44] when are we disabling captcha on enwiki?
[00:13:59] and how is this being communicated?
[00:14:26] LeslieCarr: /usr/sbin/snmptrapd -On -Lsd
[00:14:27] -p /var/run/snmptrapd.pid
[00:14:38] /usr/sbin/snmptrapd -On -Lsd-p /var/run/snmptrapd.pid
[00:15:20] RECOVERY - MySQL disk space on neon is OK: DISK OK
[00:15:50] New review: Asher; "effect: spam." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[00:18:56] New patchset: Ryan Lane; "Checkout and fetch don't require tag" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42895
[00:19:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42895
[00:22:37] notpeter: if you're around … it's aliiiiiiiiiiive!
[00:22:44] New patchset: Ryan Lane; "Don't give warning if repo depends aren't used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42896
[00:22:45] :)
[00:23:10] reason: like Labs Nagios
[00:23:15] snmtrapd
[00:27:48] New review: Mattflaschen; "This is an experiment. Obviously, we will not leave this off for long for the initial test." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[00:28:44] marktraceur: https://gerrit.wikimedia.org/r/#/c/42890/1 <- trivial change :)
[00:29:04] New patchset: Ryan Lane; "Make dependencies optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42896
[00:29:28] AaronSchulz: What's with trivial changes today? Nobody makes complex changes to UW anymore? Jeez.
[00:29:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42896
[00:29:52] today is a good day to kill srv266
[00:30:38] !log powercycling frozen magnesium
[00:30:48] Logged the message, Master
[00:31:29] Oh it wasn't UW.
[00:33:02] RECOVERY - Host magnesium is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[00:34:23] RECOVERY - Host srv238 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[00:34:33] !log maxsem synchronized php-1.21wmf6/includes/resourceloader 'https://gerrit.wikimedia.org/r/#/c/42894/'
[00:34:42] Logged the message, Master
[00:35:10] New review: Lwelling; "The specific result hoped for from the experiment is a significant increase in users registering, an..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[00:36:00] !log maxsem synchronized php-1.21wmf6/resources/Resources.php 'https://gerrit.wikimedia.org/r/#/c/42894/'
[00:36:09] Logged the message, Master
[00:36:19] New review: MaxSem; "Are you ready to clean up the resulting mess yourselves?" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[00:38:17] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused
[00:41:41] ACKNOWLEDGEMENT - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn hardware failure, Chris recommend decom already RT-4208
[00:42:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:23] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time
[00:47:12] New review: Ori.livneh; "MaxSem," [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[00:49:24] New patchset: Dzahn; "puppetize bugzilla_report.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[00:51:42] New patchset: Ryan Lane; "Ask for user input before continuing deploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42900
[00:52:09] New review: Swalling; ""Do we track number of people who click "create account" and number of people who successfully submi..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[00:52:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42900
[00:55:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.876 seconds
[00:56:15] New patchset: Dzahn; "puppetize bugzilla_report.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[00:59:46] New patchset: Dzahn; "puppetize bugzilla_report.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[01:00:32] RoanKattouw: can you help us debug a mobile RL issue? CSS on http://en.m.wikipedia.org/ is blank; loads OK with debug=1; has been like this for ~15 mins
[01:00:55] This is the URI: http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=mobile.head%2Cstyles%7Cext.eventLogging%7Cschema.MobileBetaWatchlist&only=styles&skin=mobile&version=1357684914&*
[01:01:01] New patchset: Dzahn; "puppetize bugzilla_report.php, replace change I19d5da64: Clarify weekly report "top resolvers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[01:02:07] oddly if you change debug=false to debug=0 it loads
[01:03:11] preilly: http://wikitech.wikimedia.org/view/MobileFrontend#Flushing_the_cache
[01:03:19] LeslieCarr: thanks
[01:03:36] thanks LeslieCarr, preilly
[01:03:36] !search flush
[01:03:36] http://bots.wmflabs.org/~wm-bot/searchlog/index.php?action=search&channel=%23wikimedia-labs
[01:04:16] ori-l: That loads for me
[01:04:36] New patchset: Lwelling; "Experimentally disable Captcha for new accounts on enwiki so we can monitor effect" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42892
[01:05:43] New patchset: Ryan Lane; "Return dependency status from checkout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42901
[01:08:09] James_F: https://gerrit.wikimedia.org/r/#/c/42892/ first comment made me lol
[01:08:34] New patchset: Ryan Lane; "Return dependency status from checkout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42901
[01:08:52] New patchset: Dzahn; "puppetize bugzilla_report.php, replace change I19d5da64: Clarify weekly report "top resolvers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[01:08:53] !log preilly synchronized php-1.21wmf7/extensions/MobileFrontend
[01:09:03] Logged the message, Master
[01:09:08] AaronSchulz: Well, yeah. Not a hugely on-point response, frankly.
[01:09:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42901
[01:10:33] RECOVERY - Puppet freshness on magnesium is OK: puppet ran at Wed Jan 9 01:10:23 UTC 2013
[01:10:49] !log preilly synchronized php-1.21wmf6/extensions/MobileFrontend
[01:11:11] preilly: that fixed it
[01:12:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds
[01:12:36] Logged the message, Master
[01:14:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds
[01:14:45] TimStarling: can look at https://gerrit.wikimedia.org/r/#/c/36697/ ? It's being lingering around a bit, and the first part was merged a while ago.
[01:16:07] ok
[01:16:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37165
[01:17:42] New patchset: Ori.livneh; "Disable wgMFForceSecureLogin on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42906
[01:18:19] New patchset: Dzahn; "decom knsq25 - disk fail - RT-2918" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42907
[01:18:46] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42906
[01:19:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42907
[01:19:56] have you tested that set_time_limit()?
[01:19:56] !log olivneh synchronized wmf-config/InitialiseSettings.php
[01:20:10] Logged the message, Master
[01:20:30] I think it will suffer the same problem as ulimit4.sh
[01:21:37] both measure CPU time, not wall clock time
[01:21:46] New patchset: Dzahn; "decom storage3 - hardware issues per Chris and out of warranty (RT-4208)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42908
[01:21:58] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:23:38] TimStarling: well then it's not any worse ;)
[01:23:46] using ulimit is broken for that though
[01:24:16] TimStarling: this is the flaw that hit us with the scalars lately with tiffs right?
[01:24:24] yes
[01:24:40] I think wfShellExec() should limit wall clock time via a separate configuration variable
[01:25:02] can it set both to the same value?
[01:25:44] possibly
[01:25:54] hrm
[01:26:01] I see that $wgMaxShellTime already specifically mentions cpu
[01:26:10] then I guess a separate var makes sense then
[01:26:20] the implementation most likely needs a monitor process
[01:26:36] and kill -9 powers ;)
[01:27:06] it could be done in shell script, but I think it's getting to the stage where a real programming language is needed
[01:27:18] bash is not a real language?
[01:27:21] * AaronSchulz ducks
[01:27:26] mp!
[01:27:27] no!
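The wall-clock limit discussed above would need a monitor that watches elapsed time rather than CPU time and can kill the child when the budget runs out. A minimal PHP sketch of that monitor-process idea follows; it is not MediaWiki's actual wfShellExec() implementation, and the function name and $maxWallClockSeconds parameter are hypothetical:

    <?php
    // Illustrative sketch only: enforce a wall-clock limit on a child command,
    // independent of any CPU-time ulimit. Not MediaWiki API.
    function runWithWallClockLimit( $cmd, $maxWallClockSeconds ) {
        $spec = array( 1 => array( 'pipe', 'w' ), 2 => array( 'pipe', 'w' ) );
        $proc = proc_open( $cmd, $spec, $pipes );
        if ( !is_resource( $proc ) ) {
            return false;
        }
        stream_set_blocking( $pipes[1], false );
        stream_set_blocking( $pipes[2], false );
        $start = microtime( true );
        $out = '';
        while ( true ) {
            $out .= stream_get_contents( $pipes[1] );
            stream_get_contents( $pipes[2] ); // drain stderr so the child cannot block on a full pipe
            $status = proc_get_status( $proc );
            if ( !$status['running'] ) {
                break; // command finished within the budget
            }
            if ( microtime( true ) - $start > $maxWallClockSeconds ) {
                proc_terminate( $proc, 9 ); // the "kill -9 powers" mentioned above
                break;
            }
            usleep( 100000 ); // poll every 100 ms
        }
        fclose( $pipes[1] );
        fclose( $pipes[2] );
        proc_close( $proc );
        return $out;
    }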
[01:29:36] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[01:29:46] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[01:30:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:31:04] I sometimes like the challenge of trying to do things in bash
[01:42:01] New patchset: Lwelling; "Disable Captcha for new accounts on enwiki so we can monitor effect for 1-3 hours" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42892
[01:46:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[02:11:20] sorry, had to look after my daughter for half an hour
[02:20:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:08] New review: Mattflaschen; ""and little enough extra spam that the cost is worthwhile, right?"" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42892
[02:26:09] !log LocalisationUpdate completed (1.21wmf7) at Wed Jan 9 02:26:08 UTC 2013
[02:26:20] Logged the message, Master
[02:26:22] !log wiping archive of 'ee' mailing list (RT-4294)
[02:26:33] Logged the message, Master
[02:33:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.024 seconds
[02:52:57] !log LocalisationUpdate completed (1.21wmf6) at Wed Jan 9 02:52:56 UTC 2013
[02:53:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:08] Logged the message, Master
[02:53:08] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[02:56:48] New patchset: Ryan Lane; "Properly update submodules in a generic way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42910
[03:21:24] LeslieCarr: hey
[03:21:26] you there?
[03:21:36] looks like eqiad is fucked across the board
[03:23:29] nvm, recovered
[03:24:02] New patchset: Ryan Lane; "Path and missing string fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42912
[03:25:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42910
[03:25:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42912
[03:32:46] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:47] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:32:53] there it goes
[03:32:55] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:32:56] PROBLEM - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:56] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:56] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:33:04] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:05] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:05] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:05] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:13] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:14] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:22] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:24] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:24] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:33:31] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on wikidata-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:33] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:40] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:49] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:33:58] PROBLEM - LVS HTTP IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:59] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:59] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:07] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:07] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:17] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:17] PROBLEM - LVS HTTP IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:43] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:55] weirdly my ping of 8.8.8.8 is dying at the same time ……
[03:35:04] eeep
[03:35:05] what's happening
[03:35:08] just got the pages
[03:35:13] all of eqiad
[03:35:17] looks like it's not getting traffic
[03:35:18] notpeter: what did you do ?!?!
[03:35:19] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3834 bytes in 9.088 seconds
[03:35:27] can you take a look at networkign equip
[03:35:32] ok
[03:35:33] looking now
[03:35:38] LeslieCarr: nothing, I just had friends tell me that wikipedia was down ;)
[03:36:04] RECOVERY - LVS HTTP IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63848 bytes in 9.368 seconds
[03:36:13] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 3.503 seconds
[03:36:22] rawr. while I'm eating :(
[03:36:23] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.157 seconds
[03:36:23] RECOVERY - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.184 seconds
[03:36:23] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.738 seconds
[03:36:23] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 1.738 seconds
[03:36:31] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.334 seconds
[03:36:32] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.383 seconds
[03:36:32] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.460 seconds
[03:36:32] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 1.139 seconds
[03:36:32] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 3.148 seconds
[03:36:32] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 3.151 seconds
[03:36:40] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.189 seconds
[03:36:41] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.183 seconds
[03:36:44] well, looks like it recovered....
[03:36:49] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 20580 bytes in 0.181 seconds
[03:36:50] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.354 seconds
[03:36:57] Ryan_Lane: it flapped once before
[03:36:58] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.136 seconds
[03:36:59] RECOVERY - LVS HTTP IPv4 on wikidata-lb.eqiad.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 0.055 second response time
[03:36:59] RECOVERY - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63848 bytes in 0.135 seconds
[03:36:59] RECOVERY - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.138 seconds
[03:36:59] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.150 seconds
[03:36:59] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.169 seconds
[03:36:59] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.197 seconds
[03:37:00] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.194 seconds
[03:37:02] ugh
[03:37:04] very briefly
[03:37:07] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.137 seconds
[03:37:09] if oyu look at ganglia
[03:37:12] I do wonder if something larger is happening… it may be coincidence but I was tailing a ping of the google DNS server and it died (then recovered then died again and now recovered) at almost the same time as similar notifications were being sent here
[03:37:25] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.139 seconds
[03:37:25] RECOVERY - LVS HTTP IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.136 seconds
[03:37:25] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.168 seconds
[03:37:26] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.221 seconds
[03:37:34] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 20630 bytes in 0.082 seconds
[03:37:35] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.169 seconds
[03:37:36] yeah, Ithink that this is a transit thing
[03:37:41] maybe unrelated but we're logging a lot of
[03:37:41] Exception from line 94 of /usr/local/apache/common-local/php-1.21wmf6/extensions/ConfirmEdit/FancyCaptcha.class.php: Ran out of captcha images
[03:37:42] now
[03:37:43] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.135 seconds
[03:37:46] this is weird
[03:37:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds
[03:38:02] well xo is off but i don't see any other conenctions down
[03:38:06] and xo has been down for a little while
[03:40:07] grr, time is way out of sync on lvs servers
[03:40:43] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[03:40:49] yep. has been for a while
[03:41:03] there's an rt open for that
[03:43:04] so i see an lvs server losing its bgp state at 7:30
[03:43:10] pybal was getting monitoring timeouts from the squids
[03:43:11] 2013-01-09 03:29:10.679263 [wikibookslb ProxyFetch] cp1006.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.001 s
[03:43:11] 2013-01-09 03:29:10.722461 [mediawikilb ProxyFetch] cp1005.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.000 s
[03:43:12] 2013-01-09 03:29:10.783407 [foundationlb ProxyFetch] cp1015.eqiad.wmnet (enabled/partially up/pooled): Fetch failed, 30.001 s
[03:43:13] 2013-01-09 03:29:10.816945 [mediawikilb ProxyFetch] cp1012.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.001 s
[03:43:14] 2013-01-09 03:29:10.927765 [wikivoyagelb ProxyFetch] cp1004.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.001 s
[03:43:29] thats lvs1001, time is off by 7min
[03:44:50] no such timeouts at all on lvs1002
[03:45:26] ah yep that was lvs1001 that had the bgp flap as well
[03:47:40] LeslieCarr: any other signs of network weirdness that would effect lvs1001?
[03:48:05] not that i can tell … lemme check hardware log just in case something there happened (and was logged)
[03:49:17] nope, no hardware events logged
[03:49:39] damn
[03:49:58] hrm,
[04:00:23] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[04:06:59] New patchset: Ryan Lane; "Fix reference to l10n dependency script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42913
[04:07:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42913
[04:13:00] heh, it's ironic this happens just as i was watching the jimmy wales interview
[04:13:20] New patchset: Ryan Lane; "Capture stderr when pulling pillar data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42914
[04:14:55] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[04:16:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42914
[04:17:18] LeslieCarr: I managed to watch that interview this morning without incident ;)
[04:59:31] New patchset: Dereckson; "(bug 43760) Enable WikiLove on is.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42915
[04:59:57] New patchset: Andrew Bogott; "Objectify adminlogbot." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/42916
[07:14:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:14:56] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[07:16:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds
[07:16:53] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[07:31:00] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[07:48:08] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours
[07:50:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:52:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.158 seconds
[08:02:05] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:42] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 197 seconds
[08:23:59] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 201 seconds
[08:27:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:30:44] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[08:31:02] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[08:42:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.895 seconds
[08:48:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:17:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:27:27] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 184 seconds
[09:28:21] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 217 seconds
[09:28:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.150 seconds
[09:49:12] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 211 seconds
[09:51:00] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[09:51:27] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[10:02:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[10:52:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:05:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[11:12:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42389
[11:16:30] !log reedy synchronized wmf-config/
[11:16:39] Logged the message, Master
[11:17:43] ori-l: thanks for resolving those bugs
[11:18:01] ori-l: what's your interest in CORS if I may ask?
[11:23:12] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:29:33] New review: Ori.livneh; "Asher: if you have a chance, could you weigh in? It'd be useful to know what you think." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[11:38:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:53:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[12:24:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:32:08] MaxSem: any idea wth is going on: http://www.wikidata.org/w/index.php?title=Translations:Wikidata:Glossary/25/fi&action=edit&loadgroup=page-Wikidata%3AGlossary&loadtask=view ?
[12:35:05] Nikerabbit, I observed this when my server was in process of meltdown, however vanadium looks responsive and search returns results
[12:35:12] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[12:35:30] Nikerabbit, I say look in the logs
[12:35:55] MaxSem: where are the logs?
[12:36:07] https://wikitech.wikimedia.org/view/Solr#Logs
[12:37:11] MaxSem: am I supposed to have access to vanadium?
[12:37:29] let's see in puppet;)
[12:38:20] which one?
[12:38:21] no you're not:)
[12:38:45] and it'll likely require root to view them anyway:)
[12:39:06] hmph
[12:39:20] see `node "vanadium.eqiad.wmnet"` in site.pp
[12:39:56] it includes admins::restricted and various single accounts, but not you or other mortals
[12:40:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds
[12:41:11] well, as a fallback I can extend my try catch block to more lines... but I'm very curious wtf is going on there
[12:41:52] but it is very hard if not impossible to get the actual result and query for inspecting
[12:42:12] it's easy if you have logs acess
[12:42:30] * MaxSem looks around for euro pos
[12:42:35] MaxSem: only the query is logged if even that
[12:42:36] *euro ops
[12:42:50] but of course then I can do the query myself
[12:43:51] apergos or paravoid, can you help us? ^^^
[12:44:02] yes?
[12:45:14] apergos, can you look if there's anything bad in jetty logs on vanadium?
[12:47:17] doesn't look like it
[12:54:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:54:44] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:56:06] apergos: is it logging the queries?
[12:57:31] I see stuff like GET /solr/select?wt=json&q= and lots of stuff
[12:57:32] current
[12:57:47] does grep "language-specific name" find anything?
[13:00:04] no
[13:00:16] ah
[13:00:18] yes
[13:00:20] space -> +
[13:00:43] returns 200
[13:00:54] the most recent of those anyways
[13:01:41] POST /solr/update?wt=json HTTP/1.0" 200 43 is the result
[13:02:12] what else did your query have in it?
[13:03:05] apergos: what do you mean? Could you paste the full query somewhere for me?
[13:03:15] ok
[13:04:49] POST /solr/update is an update query while wehat fails is search, someting like GET /solr/search?....
[13:14:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds
[13:50:48] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[13:51:12] New review: Hashar; "PS4: use mount{} instead of symlink (per Ryan)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743
[13:53:37] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27175
[13:53:40] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325
[13:55:40] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34748
[14:02:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:19] New patchset: Hashar; "(bug 43141) jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620
[14:16:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.636 seconds
[14:16:35] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[14:28:12] New patchset: Hashar; "beta: makes two wikis to use WMF branches" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42951
[14:29:37] New review: Hashar; "Merging that to deploy it on beta." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42951
[14:29:47] New review: Hashar; "Merging that to deploy it on beta." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42951
[14:29:48] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42951
[14:38:52] yo opsen, does anybody know anything about the vumi setup (an sms service that you can ask for a wikipedia article and get it delivered as a couple of sms messages). is this already in production?
[14:40:38] no idea
[14:41:36] :D
[14:43:06] New patchset: Nikerabbit; "Workaround for exception preventing translation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42954
[14:43:19] MaxSem: could you have a look ^^
[14:43:52] after hour(s) of debugging I found out "Operation timed out after 5000 milliseconds with 0 bytes received"
[14:44:24] ehm
[14:44:45] what did it use by default?
[14:44:58] MaxSem: 5s
[14:45:11] i mean which http client
[14:45:18] MaxSem: file_get_contents
[14:45:44] so you think increasing the timeout will help?
[14:46:12] shouldn't you investigate why it takes so long?
[14:46:17] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42954
[14:46:33] strdist
[14:47:43] I'll add some length limits and extend the try-catch block, but this should help now
[14:48:01] ehm, on texts sometimes kilobytes long?
[14:48:09] it has quadratic complexity
[14:48:27] MaxSem: indeed, hence length limits
[14:48:56] do you sort by strdist?
[14:49:19] MaxSem: it's sorted by score which is the return value of strdist
[14:49:21] mebbe it's worth trying to get the relevance score do the trick?
[14:50:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:50:31] this is the best compromise so far
[14:52:14] Nikerabbit, are you going to deploy it?
[14:52:35] MaxSem: doing i
[14:53:14] * MaxSem doesn't like undeployed changes
[14:53:17] afk
[14:53:37] !log nikerabbit synchronized wmf-config/CommonSettings.php 'Translation memory tweak'
[14:53:47] Logged the message, Master
[14:59:39] New patchset: Mark Bergsma; "Move misc::url-downloader out of a completely unrelated file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42958
[15:00:37] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[15:01:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42958
[15:03:04] mark: can you possibly restart memcached on virt0 please ? That renders labsconsole useless :/ http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=virt0&service=Memcached
[15:04:02] ok
[15:04:19] danke!
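Two sketches related to the timeout and "length limits" workaround discussed above. The endpoint URL, the 10-second value, and the 1000-character cap are placeholder assumptions for illustration, not the actual Translate/translation-memory configuration:

    <?php
    // Give a plain file_get_contents() HTTP fetch an explicit timeout via a
    // stream context instead of relying on the default.
    $context = stream_context_create( array(
        'http' => array(
            'method'  => 'GET',
            'timeout' => 10, // seconds of wall-clock time before the read gives up
        ),
    ) );
    $result = file_get_contents( 'http://solr.example.org/solr/select?wt=json&q=...', false, $context );
    if ( $result === false ) {
        // Handle the timeout/error instead of letting it surface as an exception.
    }

    // The "length limits" idea: cap inputs before an O(n*m) similarity measure,
    // so kilobyte-long texts cannot blow up the scoring step.
    function cappedSimilarity( $a, $b, $cap = 1000 ) {
        $a = mb_substr( $a, 0, $cap );
        $b = mb_substr( $b, 0, $cap );
        similar_text( $a, $b, $percent ); // similar_text() is also quadratic, hence the cap
        return $percent;
    }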
[15:04:31] seems to be running again
[15:04:35] !log Started memcached on virt0
[15:04:46] Logged the message, Master
[15:05:52] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[15:06:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[15:22:25] New patchset: Demon; "Configure ExtensionDistributor in preparation for new version" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42966
[15:26:16] New patchset: Mark Bergsma; "Allow HTTPS (CONNECT) requests on the copy-by-url proxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42967
[15:27:29] New patchset: Mark Bergsma; "Allow HTTPS (CONNECT) requests on the copy-by-url proxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42967
[15:29:11] New review: Demon; "I don't know much about squid config, but this *looks* sane :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42967
[15:32:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42967
[15:35:15] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42966
[15:35:54] !log demon synchronized wmf-config/CommonSettings.php 'Deploying Id2755235'
[15:36:04] Logged the message, Master
[15:37:17] New patchset: Hashar; "mw-update-l10n l10n cache rebuild is now verbose" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42970
[15:37:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:51] New review: Hashar; "Please cast your vote :-]? A typical use case is refreshing the cache on the 'beta' cluster." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42970
[15:52:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.086 seconds
[16:03:55] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[16:05:21] New patchset: Dereckson; "(bug 43760) Enable Collection on is.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42976
[16:08:24] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused
[16:13:05] New review: Anomie; "Note that mw-update-l10n is called from scap, so that will also be more verbose." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42970
[16:13:40] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[16:16:06] New patchset: Dereckson; "(bug 43769) Close ik.wiktionary and zh-min-nan.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42978
[16:25:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:37:36] !log authdns update adding osm-cp1003/4 production to zone files
[16:37:46] Logged the message, Master
[16:42:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[17:01:40] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds
[17:02:33] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 230 seconds
[17:13:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:45] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[17:17:51] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[17:25:26] New review: MF-Warburg; "Maybe it's better to keep the configuration (i.e. localized logo for ik.wikt; language code correcti..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42978
[17:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds
[17:33:15] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[17:33:45] !log aaron synchronized php-1.21wmf7/includes/upload/AssembleUploadChunks.php 'deployed 4bb28e000bb6609e23046b611be335382aa74618'
[17:33:55] Logged the message, Master
[17:43:18] !log aaron synchronized php-1.21wmf7/includes/ 'deployed 5185259101b27e4780618b3cd7718b9a0c51e1c4'
[17:43:19] PROBLEM - SSH on lvs6 is CRITICAL: Server answer:
[17:43:29] Logged the message, Master
[17:45:07] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:45:34] hrmmmmm, lvs6?
[17:49:28] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours
[18:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:03:34] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:08:14] !log aaron synchronized php-1.21wmf7/extensions/UploadWizard 'deployed bd33047d1cb938f3b4923a51862d887e6c831b65'
[18:08:24] Logged the message, Master
[18:17:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[18:18:06] New patchset: Hashar; "(bug 43729) beta mw installs use /dev/vdb mounted on /srv" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[18:18:23] New review: Hashar; "PS5: rephrased summary" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743
[18:18:33] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:19:01] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:22:50] hashar- How's the deploying of those config changes going?
[18:24:01] New review: Hashar; "I guess we can either:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/38307
[18:24:19] anomie: so I think all changes got deployed
[18:24:24] !log aaron Started syncing Wikimedia installation... :
[18:24:33] Logged the message, Master
[18:24:47] anomie: I found at least one instance of all.dblist that did not use the getRealmSpecificFilename() wrapper
[18:25:04] but that is only going to cause issues on beta
[18:25:14] anomie: Tim merged the last remaining change during my vacations
[18:25:35] anomie: I also have to rename wmfRealm to wmgRealm per a Tim comment . wmf = wikimedia function :-D
[18:25:40] does not make sense for a global
[18:26:02] $wmfAllOfTheLols();
[18:26:24] Ok then, good. There's still 33388 to look at, too.
[18:26:59] We have a few other $wmf* variables, though. When I looked, it seemed that $wmg was used for configuration toggles and $wmf for a few miscellaneous things.
[18:27:02] hashar, have you heard about PHP Notice: Undefined variable: wmfRealm in /home/wikipedia/common/wmf-config/InitialiseSettings.php on line 12244 ?
[18:27:07] reedy: or \Wikimedia\Globals::singleton()->forRealm( 'labs' )->getFilename( 'all.dblist' );
[18:27:25] MaxSem: nop, that is nasty
[18:28:01] hashar, when running maint scripts
[18:28:28] hmm
[18:28:41] it is set in CommonSettings.php
[18:28:48] maybe need to make it global
[18:28:57] minimum repro:
[18:29:02] maxsem@fenari:/home/wikipedia/common/php-1.21wmf7/extensions/WikimediaMaintenance$ mwscript eval.php testwiki
[18:29:03] > $wgConf->loadFullData();
[18:29:03] sounds like a simple enough fix
[18:29:34] Add it to the global list in CommonSettings.php line 150?
[18:29:55] Shouldn't $wmfUdp2logDest be moved there too?
[18:30:05] just for consistency
[18:30:09] you know, cause we like that
[18:30:53] fixing it
[18:31:31] New patchset: Hashar; "fix wmfDatacenter / wmfRealm scope in IS.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42994
[18:31:40] anomie: Reedy : ^^^^
[18:32:19] why not global $wmfUdp2logDest, $wmfDatacenter, $wmfRealm; ?
[18:32:33] for git blame! ;-D
[18:32:47] Though, per anomie...
[18:32:50] function wmfLoadInitialiseSettings( $conf ) {
[18:32:50] global $wmfConfigDir, $wgConf, $wmfUdp2logDest;
[18:32:50] # $wgConf =& $conf; # b/c alias
[18:32:50] require( "$wmfConfigDir/InitialiseSettings.php" );
[18:33:15] hashar, who else do we need to get https://gerrit.wikimedia.org/r/#/c/39711/ approved - it doesn't seem to be too much of a big deal IMO?
[18:33:54] Does InitialiseSettings get included anywhere else?
[18:34:02] Thehelpfulone: there is no real process yet :D
[18:34:15] Thehelpfulone: will merge in and deploy
[18:34:47] Reedy- Elsewhere in CommonSettings.php (but not inside a function). A grep doesn't turn up anything else, at least not in operations/mediawiki-config
[18:35:29] Maybe we should move them all into InitialiseSettings.php so they won't cause a problem in either place
[18:36:02] One or the other place, anyway. I have no opinion on which place is better.
[18:36:17] indeed
[18:36:24] Doing it once seems better than doing it twice
[18:36:45] Thehelpfulone: I have deployed the change. Thanks for the ping :-]
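The notice discussed above comes down to PHP variable scoping: a file pulled in with require() from inside a function runs in that function's scope, so a variable set at the top level of the including script is invisible there unless it is imported with global, which is what the fix adds. A self-contained sketch (the names echo the conversation, but this is not the actual wmf-config code):

    <?php
    // Minimal illustration of the scoping issue; not the real wmf-config code.
    $wmfRealm = 'production'; // set at file (top-level) scope

    function wmfLoadSettingsBroken() {
        // $wmfRealm is NOT visible here: a function's scope starts empty, and a
        // require() executed in here would see that same empty scope.
        return isset( $wmfRealm ) ? $wmfRealm : 'undefined';
    }

    function wmfLoadSettingsFixed() {
        global $wmfRealm; // import the top-level variable, as the fix above does
        return $wmfRealm;
    }

    echo wmfLoadSettingsBroken(), "\n"; // prints "undefined"
    echo wmfLoadSettingsFixed(), "\n";  // prints "production"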
[18:36:50] hashar, heh thanks :)
[18:37:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[18:37:50] * Reedy amends hashars commit
[18:39:35] New patchset: Reedy; "fix wmfDatacenter / wmfRealm scope in IS.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42994
[18:44:10] # Protocol settings for urls
[18:44:10] $urlprotocol = "";
[18:44:20] ^ Why are we using that all over the place if it's set to ""?
[18:45:07] Relic from before we changed to protocol-relative links everywhere?
[18:45:08] because it used to be 'http'
[18:45:29] DIEDIEDIE
[18:46:05] Reedy, http://youtu.be/RbIGuLXCziU
[18:47:10] $wgNoticeCounterSource = . '//wikimediafoundation.org/wiki/Special:ContributionTotal' .
[18:47:16] That passes php -l..
[18:48:11] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995
[18:49:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:49:08] hey Reedy, I got a question for you
[18:49:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:50:14] notpeter: ya?
[18:50:50] Reedy: nvm. see other channel for context
[18:50:56] sorry to interupt!
[18:51:14] why is puppet disabled on lvs1004?
[18:52:39] New patchset: Hashar; "$::realm is 'labs' not 'wmflabs'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42996
[18:52:45] New review: Hashar; "The realm check used 'wmflabs' instead of 'labs'. Fixed with: https://gerrit.wikimedia.org/r/42996" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[18:54:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42996
[18:56:08] New review: John Erling Blad; "Not sure, but this seems like a typo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40561
[18:59:10] New patchset: Reedy; "Remove wgUseTagFilter. Same as default and no longer needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43004
[19:00:17] New patchset: Reedy; "sewikipedia -> sewiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43005
[19:00:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43005
[19:01:35] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix sewiki typo'
[19:01:44] Logged the message, Master
[19:05:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[19:05:39] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:37] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 194 seconds
[19:07:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 196 seconds
[19:09:29] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf7
[19:09:37] Logged the message, Master
[19:12:18] Ryan_Lane - Do we have git-deploy deploying anywhere to the point that we can run maintenance scripts on it?
[19:13:01] !log aaron Finished syncing Wikimedia installation... :
[19:13:10] Logged the message, Master
[19:13:14] anomie: you mean mediawiki configured in such a way that they'll work?
[19:13:27] maybe beta?
[19:13:27] PROBLEM - Puppet freshness on sq86 is CRITICAL: Puppet has not run in the last 10 hours
[19:13:28] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours
[19:13:42] Ryan_Lane: are you the most logical person to review https://gerrit.wikimedia.org/r/#/c/42887/ ?
[19:14:13] maybe as a secondary reviewer
[19:14:30] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours
[19:14:32] for a sanity check for git-deploy, but otherwise I'm not really a mediawiki expert
[19:15:04] I suppose we might need to add AaronSchulz in the mix, in addition to Reedy (who's on the list already)
[19:15:34] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours
[19:15:34] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours
[19:15:34] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours
[19:15:35] Ryan_Lane- I see /src/deployment exists in beta, but nothing under that
[19:15:42] err, /srv/deployment
[19:15:50] on the bastion?
[19:16:00] yeah, I was going to start setting up beta today
[19:16:10] On deployment-bastion
[19:16:22] * Ryan_Lane nods
[19:16:48] well, I think a number of reviews are pending or merged in for the common repo
[19:16:52] that's what needed to make this work
[19:17:30] PROBLEM - Puppet freshness on mc5 is CRITICAL: Puppet has not run in the last 10 hours
[19:17:31] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours
[19:17:31] PROBLEM - Puppet freshness on search35 is CRITICAL: Puppet has not run in the last 10 hours
[19:17:31] PROBLEM - Puppet freshness on search32 is CRITICAL: Puppet has not run in the last 10 hours
[19:17:36] Aren't those in another branch?
[19:17:41] probably, yes
[19:17:58] might be only on tin though
[19:18:05] I think that's where Tim committed your stuff
[19:18:08] ah
[19:18:25] PROBLEM - Puppet freshness on mc16 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:25] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:25] PROBLEM - Puppet freshness on mc13 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:33] actually, it's possible that scripts will work on that node
[19:19:28] PROBLEM - Puppet freshness on search1024 is CRITICAL: Puppet has not run in the last 10 hours
[19:19:28] I do remember that mwversionsinuse worked
[19:19:33] though I was using its target
[19:19:36] not the script
[19:21:17] mwversionsinuse works on tin (if you change the path in the script). mwscript doesn't, though, because /srv/deployment/mediawiki/common/wikiversions.cdb doesn't exist
[19:21:25] PROBLEM - Puppet freshness on sq72 is CRITICAL: Puppet has not run in the last 10 hours
[19:21:40] /srv/deployment/mediawiki/common/multiversion/activeMWVersions --extended --withdb
[19:22:05] And, of course, there's no way to go from the "1.21wmf4" returned by mwversionsinuse to whichever slot it's in
[19:22:35] yes there is
[19:22:43] Oh, what is it?
[19:22:49] in /srv/deployment/mediawiki/common we can have symlinks to the slots
[19:23:23] BTW, the equivalent command for mwscript is php /srv/deployment/mediawiki/common/multiversion/MWScript.php; pass args like --wiki=enwiki eval.php for a simple test
[19:23:50] Oh, ok. But we don't have those symlinks on tin, yet.
[19:23:58] just add them
[19:24:01] New patchset: Andrew Bogott; "Objectify adminlogbot." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/42916
[19:26:13] New patchset: preilly; "add WikipediaMobileFirefoxOS to bits docroot as Submodule" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43009
[19:26:31] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:33] PROBLEM - Puppet freshness on mw1121 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on mw1129 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on sq56 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on sq65 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on wtp1 is CRITICAL: Puppet has not run in the last 10 hours
[19:30:03] New review: Reedy; "Looks alright to me, not quite sure what's up with Jenkins. That submodule clones fine for me:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43009
[19:30:43] andrewbogott: quick review? https://gerrit.wikimedia.org/r/#/c/43007/
[19:30:54] New patchset: preilly; "add WikipediaMobileFirefoxOS to bits docroot as Submodule" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43009
[19:31:21] * andrewbogott looks
[19:35:31] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms
[19:35:46] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43009
[19:36:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:06] !log authdns update adding mc1016/mc1017 to zone file
[19:47:15] Logged the message, Master
[19:47:42] New patchset: Reedy; "enwiki to 1.21wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43011
[19:47:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43011
[19:48:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.388 seconds
[19:49:00] RECOVERY - Puppet freshness on sq56 is OK: puppet ran at Wed Jan 9 19:48:50 UTC 2013
[19:50:05] Reedy, can you also deploy https://gerrit.wikimedia.org/r/#/c/42994/ - I've reviewed it
[19:50:12] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Wed Jan 9 19:50:03 UTC 2013
[19:50:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42994
[19:53:33] New patchset: Hashar; "beta: fix /dev/vdb mounting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43013
[19:53:40] !log reedy synchronized wmf-config/
[19:53:49] Logged the message, Master
[19:56:49] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[19:57:07] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[19:58:06] Reedy, cheers
[19:59:39] PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:01:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43013
[20:05:27] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995
[20:18:21] New review: preilly; "I'd really like to know what we hope to gain from this experiment?" [operations/mediawiki-config] (master); V: -1 C: -1; - https://gerrit.wikimedia.org/r/42892
[20:23:14] New review: Swalling; "Hey Patrick:" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[20:24:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:27:07] New patchset: Lcarr; "expanding the eqiad ip range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43015
[20:27:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43015
[20:28:44] figured out why analytics machines all had icinga failures :)
[20:28:56] i mean puppet timestamp failures according to icinga
[20:28:59] New review: Nemo bis; "Patrick, "what we gain" seems quite clear, data for further research; now also linked." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[20:29:02] firewall?
[20:29:27] New patchset: Hashar; "phase out imagescaler::labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:31:51] New patchset: Hashar; "phase out imagescaler::labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:35:43] New patchset: Hashar; "phase out imagescaler::labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:36:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:40:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[20:41:03] New review: Ryan Lane; "I'm not sure if you've ever needed to clean up a spammed wiki before. If you have a wiki without acc..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[20:43:20] New review: Nemo bis; "Ryan, you're assuming what this test is meant to verify..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[20:47:19] New review: Mattflaschen; "As far as I'm concerned, this test is meant to study the behavior of good-faith humans trying to cre..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42892
[20:52:07] New review: Asher; "> Asher: if you have a chance, could you weigh in? It'd be useful to know what you think." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[20:56:25] wooo 418 PHP Fatal error: require() [function.require]: Failed opening required '/InitialiseSettings.php'
[20:57:31] who borked wikipedia?:P
[20:58:12] Reedy: ^^^
[20:58:44] l
[20:58:53] * robla looks
[20:59:04] I didn't touch that line..
[20:59:19] looks like a sync fail - the site is functional but some apache(s) are out of date
[20:59:27] oh, shit
[20:59:32] I removed the global
[20:59:35] synching
[20:59:45] * Damianz looks at Reedy's scope
[21:00:02] binasher, Ryan_Lane - thanks. I'll write up an email later today to ops-l to explain the rationale and to ask what the right way is to deploy this (if at all).
[21:00:04] !log reedy synchronized wmf-config/CommonSettings.php
[21:00:08] I win
[21:00:16] Logged the message, Master
[21:00:43] fixed
[21:01:07] New patchset: Reedy; "Restore $wmfConfigDir global in wmfLoadInitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43020
[21:01:18] ori-l: I don't think ops@ are qualified to answer that, we have too little to do with higher levels such as these
[21:01:53] * Nemo_bis thinks people are greatly overestimating the effects of fancycaptcha – among those trying different captcha configs nobody found the ones used by Wikimedia projects to have any effect
[21:02:09] paravoid: well, there are a number of stakeholders, which makes this complicated, but ops is one because of potential site stability implications, flagged by asher
[21:02:12] I think asher and ryan is who you can expect at most to reply to that thread :)
[21:02:43] Nemo_bis, at least it stopped the guy who bruteforced sysop accounts on enwiki years ago
[21:02:58] paravoid: that's fine; gerrit is just not an ideal forum for discussions
[21:03:00] MaxSem: did it?
[21:03:11] and anyway it was years ago and only a single person
[21:03:20] New patchset: Reedy; "Restore $wmfConfigDir global in wmfLoadInitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43020
[21:03:29] it did. with an immediate effect
[21:03:31] Nemo_bis: the answer to "the captchas are broken" is not "turn off the captchas", it's "fix the captchas"
[21:03:39] ori-l: sure, but may I suggest picking a different list, e.g. engineering?
[21:03:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43020
[21:03:42] or even wikitech
[21:03:52] paravoid: OK, i thought about that too. wikitech it is.
[21:03:55] <^demon> wikitech is good.
[21:04:06] yeah
[21:04:12] ooh yeah, announcing no captchas on a public list!
[21:04:28] <^demon> No worse than committing the change to a public git repo ;-)
[21:04:43] binasher: there was never any intent to push this through without discussion
[21:04:52] binasher: if anything, it'll work in favor of your argument, so... :-)
[21:05:08] i find it useful to point people to gerrit changes to discuss proposals because it's easier to be specific in code
[21:05:42] paravoid: i asked e3 to notify ops before a deployment that may effect us so we can be aware of the cause of potential system impact, not to email ops so we could have a fire side chat on our thoughts behind the behavior of anonymous people using wikipedia
[21:07:05] binasher: fair enough! all I'm saying is that if there's a need for a discussion, ops@ is probably a bad place for that
[21:07:07] i'm sorry to spark a fire and then run, but i'm late for an appointment :( the bottom line: this won't go out without sign-off from ops and an OK from the community.
[21:08:20] So no need to rage, etc. but feedback about _how_ to get some good data about the efficacy of captchas would be much appreciated. (Coordinating with ops to preempt site stability issues is a good point.)
[21:09:00] New patchset: Lcarr; "fixing iptables::purges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43021
[21:09:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43021
[21:10:12] Ryan_Lane/hashar - ok to merge your imagescaler::labs change ?
[21:10:16] on sockpuppet
[21:10:19] yep
[21:10:32] cool
[21:10:34] merging now
[21:11:26] *boom*
[21:11:36] LeslieCarr: yup
[21:11:44] New patchset: Hashar; "role::applicationserver::appserver::beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43023
[21:11:46] LeslieCarr: sorry forgot to merge it :/
[21:12:06] oh no I can't pull on sock puppet .. :D
[21:12:43] yep, :)
[21:14:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:14:55] Ryan_Lane: it's just a test
[21:15:12] Ryan_Lane: I'm only saying that all this fuss "omg the world will end in those 3 hours" doesn't make any sense
[21:15:22] but it's an ineffectual test
[21:15:36] 1. it's not running long enough to actually gather proper data
[21:15:59] 2. it's going to tell us that disabling captchas make it easier for people to create accounts, which we already know
[21:16:09] 2. Is false.
[21:16:27] We don't already know that it stops legitimate users, we only suspect it.
[21:16:49] 3. if the rate of spam accounts does increase by a lot, then admins will start blocking all new user accounts to stop the flood
[21:16:56] It makes for a horrid ux, for every attempt failed you loose % users
[21:17:20] 4. if we're going to properly test this, it should run for a reasonable amount of time in an A/B test
[21:17:40] it may be an effectual test - have the wikipedia admins been informed ?
[21:17:43] 4. is the same as 1.
[21:17:44] the non techie side, that is
[21:18:02] and admins should have some way of telling that a user account was created with or without a captcha
[21:18:02] that would probably help a lot - if we were working in conjunction
[21:18:05] New patchset: Hashar; "role::applicationserver::appserver::beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43023
[21:18:05] as for 1, 3 – I'm not qualified to answer, I thought the WMF had a team to do this? :)
[21:18:06] to avoid #3
[21:18:14] 4 is not the same as 1
[21:18:22] it is
[21:18:22] we're not doing A/B testing
[21:18:51] Well, enough of [[MeatBall:DefendEachOther]] for today.
[21:19:08] New review: Hashar; "So PS1 did not really solve the problem :/ PS2 include the role::applicationserver::webserver and t..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43023
[21:23:52] LeslieCarr: https://meta.wikimedia.org/wiki/Research:Account_creation_UX/CAPTCHA#Metrics . we also have two enwiki admins on E3 that are committed to monitoring/cleanup, and the rest of us intend to watch Special:RecentChanges as well.
[21:23:57] cool
[21:24:06] well that solves #3 :)
[21:24:46] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.443 seconds
[21:27:37] Ryan_Lane: we'll be tagging captcha / non-captcha accounts. We thought about that too.
[21:32:55] New patchset: Lcarr; "fixing icinga purges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43024 [21:34:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43024 [21:39:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43023 [21:43:15] New patchset: Pyoungmeister; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [21:56:05] New patchset: Pyoungmeister; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [21:58:20] chmod -R g+w /srv/deployment/mediawiki/common/.git/objects/ [21:58:29] ^ Can someone please run that on tin for me? [21:58:48] Reedy: sure [21:59:02] done [21:59:10] thanks [21:59:15] no prob [21:59:20] I'm going to run that on the whole repo [21:59:46] I did fix my umask just now too [22:00:02] Seems Tim had some with no group write [22:00:29] git-deploy requires you to start with umask 0002 [22:00:54] but that doesn't stop someone from running git pull with screwed up permissions [22:01:10] or root from doing it [22:02:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:50] Totally could make an alias to set umask on git... but then crazy people escape stuff to stop people making you do bad stuff, so aliases would be ignored [22:02:53] As that checkout of mediawiki-config is just over a month old, I was going to rebase your patch [22:03:33] * Ryan_Lane nods [22:07:44] reedy@tin:/srv/deployment/mediawiki/common$ git rebase origin [22:07:44] It seems that there is already a rebase-apply directory, and [22:07:44] hah [22:14:01] New review: preilly; "@Swalling ? Thanks, so much for posting that link. It really helped me." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892 [22:15:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.857 seconds [22:16:26] !log aaron synchronized php-1.21wmf7/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'deployed 7482243e3249ae38d6aedbaf06db7107dc3516f5' [22:16:36] Logged the message, Master [22:22:00] That was relatively painless [22:22:35] * ^demon whacks Reedy with a steel pipe [22:22:37] <^demon> No pain, no gain! [22:27:01] Reedy: where are we up to? [22:28:23] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 189 seconds [22:36:15] TimStarling: was just bringing things up to date, finding out there were bad permissions on the git objects [22:36:47] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [22:36:56] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [22:38:32] You've probably seen in your email that misc/scripts/mw-deployment-vars.erb is missing from 42887 [22:40:11] yes [22:40:28] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [22:40:32] there's the fix for it [22:40:51] great [22:40:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [22:41:09] Logged the message, Master [22:41:17] New review: Tim Starling; "PS3: added missing file and fixed its location." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42887
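The chmod/umask exchange above comes down to keeping the shared deployment checkout on tin group-writable. A minimal sketch of that idea, using the path quoted in the log and assuming sudo access; the core.sharedRepository setting at the end is an optional extra not mentioned here, just a standard git knob for the same problem:

    # Restore group-write on the object store after a pull done with a strict umask
    # (path as quoted above).
    sudo chmod -R g+w /srv/deployment/mediawiki/common/.git/objects/

    # git-deploy expects deployers to work with a group-friendly umask:
    umask 0002

    # Optional (assumption, not from the log): tell git itself to keep new objects
    # group-writable regardless of the umask of whoever runs git pull next.
    cd /srv/deployment/mediawiki/common && git config core.sharedRepository group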
[22:44:51] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/42916 [22:45:43] In your email were you meaning a test apache instance? [22:49:12] Looks like extract2 still needs updating (using /apache) [22:49:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:14] then the symlinks too [22:54:17] New review: Dereckson; "Indeed, we still need the logo on the closed wiki." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42978 [22:54:34] so, Ryan was telling me last night that we need symlinks from /srv/deployment/mediawiki/common/php-1.21wmf7 to /srv/deployment/mediawiki/slot0 etc. [22:55:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:55:41] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:59:36] Ryan_Lane: I guess .deploy is from git-deploy? Can we just add it to .gitignore? [23:00:10] I feel like git deploy deserves it's own channel [23:02:49] its [23:03:19] I don't really understand what "Git deploy" is. [23:03:44] http://wikitech.wikimedia.org/view/Git Heh. [23:04:03] TimStarling: So do those manually and track them in mediawiki-config? [23:04:31] yes [23:04:42] and then we can have $IP be /srv/deployment/mediawiki/common/php-1.21wmf7 [23:04:53] mutante: so for some reason it looks like snmp isn't getting read again [23:04:58] mutante: on neon, that is [23:04:59] which means that the common directory will be $IP/.. like with scap [23:06:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [23:06:22] heh, tin can't access the internet so it can't load the docroot/bits/WikipediaMobileFirefoxOS git submodule [23:14:31] New patchset: Andrew Bogott; "Added a few inline puppet docstrings." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43104 [23:15:13] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43104 [23:21:04] Reedy: I'll merge in these puppet changes and test them [23:25:44] Great [23:26:05] New patchset: Tim Starling; "Split scap scripts from other scripts useful on deployment hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [23:26:18] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [23:26:25] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:26:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:26:32] Do we care much about live-1.5? [23:27:19] well, we do need somewhere to put our multiversion wrappers [23:27:27] what in particular do you want to do with it? [23:27:56] I was wondering if the files/symlinks etc. need updating [23:28:09] Not much work to do them [23:32:26] can those symlinks just be relative links instead of absolute?
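A minimal sketch of the symlink layout being discussed, with the directory names quoted above; the relative-versus-absolute question asked here is settled just below, so both forms are shown:

    # php-1.21wmf7 inside the common checkout points at the slot directory
    # managed by git-deploy.
    cd /srv/deployment/mediawiki/common

    # Relative form:
    ln -sfn ../slot0 php-1.21wmf7
    # Absolute form, equivalent on a single host:
    # ln -sfn /srv/deployment/mediawiki/slot0 php-1.21wmf7

    # Either way, $IP resolves to /srv/deployment/mediawiki/common/php-1.21wmf7
    # and the common directory is $IP/.. , as with scap.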
[23:33:08] I guess it doesn't matter [23:33:18] but yes, they need updating one way or another [23:38:16] puppetd -tv has been running on fenari for 5 minutes now [23:39:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:34] Doesn't look to be doing a great deal [23:44:11] waiting for stafford probably [23:44:14] root@stafford:~# uptime [23:44:15] 23:44:04 up 191 days, 22:02, 3 users, load average: 64.94, 55.49, 33.56 [23:45:08] stafford is often cpu-bound [23:45:23] :( [23:46:15] stafford gets pegged from time to time [23:46:28] and stuck [23:46:28] we need to investigate why at some point [23:47:02] New patchset: Dereckson; "(bug 43769) Close ik.wiktionary and zh-min-nan.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42978 [23:47:30] for some reason the "splay" feature doesn't work [23:47:50] New review: Dereckson; "PS2: Keep the lang/logo config, the wiki is locked, not deleted." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42978 [23:48:41] New patchset: Dereckson; "(bug 43769) Close ik.wiktionary and zh-min-nan.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42978 [23:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds [23:54:19] root@fenari:/# puppet config print splay [23:54:20] false [23:54:32] root@mw2:~# puppet config print splay [23:54:32] false [23:54:43] RECOVERY - Puppet freshness on db1036 is OK: puppet ran at Wed Jan 9 23:54:38 UTC 2013 [23:55:31] I'm continuing on here from last time puppet stopped my work for 20 minutes and I started working out why while I waited [23:56:39] RECOVERY - MySQL disk space on db1036 is OK: DISK OK [23:56:40] RECOVERY - MySQL Idle Transactions on db1036 is OK: OK longest blocking idle transaction sleeps for seconds [23:56:57] RECOVERY - MySQL Recent Restart on db1036 is OK: OK seconds since restart [23:56:58] RECOVERY - MySQL Slave Running on db1036 is OK: OK replication [23:57:15] RECOVERY - MySQL Replication Heartbeat on db1036 is OK: OK replication delay seconds [23:57:25] RECOVERY - MySQL Slave Delay on db1036 is OK: OK replication delay seconds [23:57:25] RECOVERY - Full LVS Snapshot on db1036 is OK: OK no full LVM snapshot volumes
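For reference, puppet's splay setting staggers agent start times precisely to avoid the kind of thundering-herd load seen on stafford above, so confirming whether it is actually enabled is a reasonable first check. A minimal sketch of the diagnostics used in this exchange, with hostnames taken from the log; the puppet.conf path assumes a stock 2013-era agent layout and may differ locally:

    # Is splay enabled for this agent? (prints "false" in the log above)
    puppet config print splay
    grep -i splay /etc/puppet/puppet.conf   # assumed default config path

    # How loaded is the puppetmaster right now?
    ssh stafford uptime

    # Run the agent once in the foreground with verbose output, as done on fenari:
    puppetd -tv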