[00:03:10] !log maxsem synchronized php-1.21wmf6/extensions/MobileFrontend/
[00:03:20] Logged the message, Master
[00:10:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds
[00:10:51] New patchset: Lwelling; "Experimentally disable Captcha on enwiki so we can monitor effect" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42892
[00:12:53] New review: Tim Starling; "sync-apache is an rsync-based sync script independent of scap." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871
[00:12:57] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds
[00:13:32] wait, what
[00:13:33] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 205 seconds
[00:13:44] when are we disabling captcha on enwiki?
[00:13:59] and how is this being communicated?
[00:14:26] LeslieCarr: /usr/sbin/snmptrapd -On -Lsd
[00:14:27] -p /var/run/snmptrapd.pid
[00:14:38] /usr/sbin/snmptrapd -On -Lsd-p /var/run/snmptrapd.pid
[00:15:20] RECOVERY - MySQL disk space on neon is OK: DISK OK
[00:15:50] New review: Asher; "effect: spam." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[00:18:56] New patchset: Ryan Lane; "Checkout and fetch don't require tag" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42895
[00:19:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42895
[00:22:37] notpeter: if you're around … it's aliiiiiiiiiiive!
[00:22:44] New patchset: Ryan Lane; "Don't give warning if repo depends aren't used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42896
[00:22:45] :)
[00:23:10] reason: like Labs Nagios
[00:23:15] snmtrapd
[00:27:48] New review: Mattflaschen; "This is an experiment. Obviously, we will not leave this off for long for the initial test." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[00:28:44] marktraceur: https://gerrit.wikimedia.org/r/#/c/42890/1 <- trivial change :)
[00:29:04] New patchset: Ryan Lane; "Make dependencies optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42896
[00:29:28] AaronSchulz: What's with trivial changes today? Nobody makes complex changes to UW anymore? Jeez.
[00:29:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42896
[00:29:52] today is a good day to kill srv266
[00:30:38] !log powercycling frozen magnesium
[00:30:48] Logged the message, Master
[00:31:29] Oh it wasn't UW.
[00:33:02] RECOVERY - Host magnesium is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[00:34:23] RECOVERY - Host srv238 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[00:34:33] !log maxsem synchronized php-1.21wmf6/includes/resourceloader 'https://gerrit.wikimedia.org/r/#/c/42894/'
[00:34:42] Logged the message, Master
[00:35:10] New review: Lwelling; "The specific result hoped for from the experiment is a significant increase in users registering, an..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[00:36:00] !log maxsem synchronized php-1.21wmf6/resources/Resources.php 'https://gerrit.wikimedia.org/r/#/c/42894/'
[00:36:09] Logged the message, Master
[00:36:19] New review: MaxSem; "Are you ready to clean up the resulting mess yourselves?" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[00:38:17] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused
[00:41:41] ACKNOWLEDGEMENT - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn hardware failure, Chris recommend decom already RT-4208
[00:42:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:23] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time
[00:47:12] New review: Ori.livneh; "MaxSem," [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[00:49:24] New patchset: Dzahn; "puppetize bugzilla_report.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[00:51:42] New patchset: Ryan Lane; "Ask for user input before continuing deploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42900
[00:52:09] New review: Swalling; ""Do we track number of people who click "create account" and number of people who successfully submi..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[00:52:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42900
[00:55:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.876 seconds
[00:56:15] New patchset: Dzahn; "puppetize bugzilla_report.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[00:59:46] New patchset: Dzahn; "puppetize bugzilla_report.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[01:00:32] RoanKattouw: can you help us debug a mobile RL issue? CSS on http://en.m.wikipedia.org/ is blank; loads OK with debug=1; has been like this for ~15 mins
[01:00:55] This is the URI: http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=mobile.head%2Cstyles%7Cext.eventLogging%7Cschema.MobileBetaWatchlist&only=styles&skin=mobile&version=1357684914&*
[01:01:01] New patchset: Dzahn; "puppetize bugzilla_report.php, replace change I19d5da64: Clarify weekly report "top resolvers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[01:02:07] oddly if you change debug=false to debug=0 it loads
[01:03:11] preilly: http://wikitech.wikimedia.org/view/MobileFrontend#Flushing_the_cache
[01:03:19] LeslieCarr: thanks
[01:03:36] thanks LeslieCarr, preilly
[01:03:36] !search flush
[01:03:36] http://bots.wmflabs.org/~wm-bot/searchlog/index.php?action=search&channel=%23wikimedia-labs
[01:04:16] ori-l: That loads for me
[01:04:36] New patchset: Lwelling; "Experimentally disable Captcha for new accounts on enwiki so we can monitor effect" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42892
[01:05:43] New patchset: Ryan Lane; "Return dependency status from checkout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42901
[01:08:09] James_F: https://gerrit.wikimedia.org/r/#/c/42892/ first comment made me lol
[01:08:34] New patchset: Ryan Lane; "Return dependency status from checkout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42901
[01:08:52] New patchset: Dzahn; "puppetize bugzilla_report.php, replace change I19d5da64: Clarify weekly report "top resolvers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42899
[01:08:53] !log preilly synchronized php-1.21wmf7/extensions/MobileFrontend
[01:09:03] Logged the message, Master
[01:09:08] AaronSchulz: Well, yeah. Not a hugely on-point response, frankly.
[01:09:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42901
[01:10:33] RECOVERY - Puppet freshness on magnesium is OK: puppet ran at Wed Jan 9 01:10:23 UTC 2013
[01:10:49] !log preilly synchronized php-1.21wmf6/extensions/MobileFrontend
[01:11:11] preilly: that fixed it
[01:12:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds
[01:12:36] Logged the message, Master
[01:14:10] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds
[01:14:45] TimStarling: can look at https://gerrit.wikimedia.org/r/#/c/36697/ ? It's being lingering around a bit, and the first part was merged a while ago.
[01:16:07] ok
[01:16:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37165
[01:17:42] New patchset: Ori.livneh; "Disable wgMFForceSecureLogin on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42906
[01:18:19] New patchset: Dzahn; "decom knsq25 - disk fail - RT-2918" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42907
[01:18:46] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42906
[01:19:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42907
[01:19:56] have you tested that set_time_limit()?
[01:19:56] !log olivneh synchronized wmf-config/InitialiseSettings.php
[01:20:10] Logged the message, Master
[01:20:30] I think it will suffer the same problem as ulimit4.sh
[01:21:37] both measure CPU time, not wall clock time
[01:21:46] New patchset: Dzahn; "decom storage3 - hardware issues per Chris and out of warranty (RT-4208)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42908
[01:21:58] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:23:38] TimStarling: well then it's not any worse ;)
[01:23:46] using ulimit is broken for that though
[01:24:16] TimStarling: this is the flaw that hit us with the scalars lately with tiffs right?
[01:24:24] yes
[01:24:40] I think wfShellExec() should limit wall clock time via a separate configuration variable
[01:25:02] can it set both to the same value?
[01:25:44] possibly
[01:25:54] hrm
[01:26:01] I see that $wgMaxShellTime already specifically mentions cpu
[01:26:10] then I guess a separate var makes sense then
[01:26:20] the implementation most likely needs a monitor process
[01:26:36] and kill -9 powers ;)
[01:27:06] it could be done in shell script, but I think it's getting to the stage where a real programming language is needed
[01:27:18] bash is not a real language?
[01:27:21] * AaronSchulz ducks
[01:27:26] mp!
[01:27:27] no!
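The wall-clock limit discussed above would need a monitor that watches elapsed time rather than CPU time and can kill the child when the budget runs out. A minimal PHP sketch of that monitor-process idea follows; it is not MediaWiki's actual wfShellExec() implementation, and the function name and $maxWallClockSeconds parameter are hypothetical:

    <?php
    // Illustrative sketch only: enforce a wall-clock limit on a child command,
    // independent of any CPU-time ulimit. Not MediaWiki API.
    function runWithWallClockLimit( $cmd, $maxWallClockSeconds ) {
        $spec = array( 1 => array( 'pipe', 'w' ), 2 => array( 'pipe', 'w' ) );
        $proc = proc_open( $cmd, $spec, $pipes );
        if ( !is_resource( $proc ) ) {
            return false;
        }
        stream_set_blocking( $pipes[1], false );
        stream_set_blocking( $pipes[2], false );
        $start = microtime( true );
        $out = '';
        while ( true ) {
            $out .= stream_get_contents( $pipes[1] );
            stream_get_contents( $pipes[2] ); // drain stderr so the child cannot block on a full pipe
            $status = proc_get_status( $proc );
            if ( !$status['running'] ) {
                break; // command finished within the budget
            }
            if ( microtime( true ) - $start > $maxWallClockSeconds ) {
                proc_terminate( $proc, 9 ); // the "kill -9 powers" mentioned above
                break;
            }
            usleep( 100000 ); // poll every 100 ms
        }
        fclose( $pipes[1] );
        fclose( $pipes[2] );
        proc_close( $proc );
        return $out;
    }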
[01:29:36] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[01:29:46] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[01:30:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:31:04] I sometimes like the challenge of trying to do things in bash
[01:42:01] New patchset: Lwelling; "Disable Captcha for new accounts on enwiki so we can monitor effect for 1-3 hours" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42892
[01:46:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[02:11:20] sorry, had to look after my daughter for half an hour
[02:20:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:08] New review: Mattflaschen; ""and little enough extra spam that the cost is worthwhile, right?"" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42892
[02:26:09] !log LocalisationUpdate completed (1.21wmf7) at Wed Jan 9 02:26:08 UTC 2013
[02:26:20] Logged the message, Master
[02:26:22] !log wiping archive of 'ee' mailing list (RT-4294)
[02:26:33] Logged the message, Master
[02:33:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.024 seconds
[02:52:57] !log LocalisationUpdate completed (1.21wmf6) at Wed Jan 9 02:52:56 UTC 2013
[02:53:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:08] Logged the message, Master
[02:53:08] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[02:56:48] New patchset: Ryan Lane; "Properly update submodules in a generic way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42910
[03:21:24] LeslieCarr: hey
[03:21:26] you there?
[03:21:36] looks like eqiad is fucked across the board
[03:23:29] nvm, recovered
[03:24:02] New patchset: Ryan Lane; "Path and missing string fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42912
[03:25:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42910
[03:25:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42912
[03:32:46] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:47] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:32:53] there it goes
[03:32:55] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:32:56] PROBLEM - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:56] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:56] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:33:04] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:05] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:05] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:05] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:13] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:14] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:22] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:24] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:24] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:33:31] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on wikidata-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:32] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:33] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:40] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:49] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway
[03:33:58] PROBLEM - LVS HTTP IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:59] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:59] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:07] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:07] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:17] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:17] PROBLEM - LVS HTTP IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:43] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:55] weirdly my ping of 8.8.8.8 is dying at the same time ……
[03:35:04] eeep
[03:35:05] what's happening
[03:35:08] just got the pages
[03:35:13] all of eqiad
[03:35:17] looks like it's not getting traffic
[03:35:18] notpeter: what did you do ?!?!
[03:35:19] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3834 bytes in 9.088 seconds
[03:35:27] can you take a look at networkign equip
[03:35:32] ok
[03:35:33] looking now
[03:35:38] LeslieCarr: nothing, I just had friends tell me that wikipedia was down ;)
[03:36:04] RECOVERY - LVS HTTP IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63848 bytes in 9.368 seconds
[03:36:13] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 3.503 seconds
[03:36:22] rawr. while I'm eating :(
[03:36:23] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.157 seconds
[03:36:23] RECOVERY - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.184 seconds
[03:36:23] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.738 seconds
[03:36:23] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 1.738 seconds
[03:36:31] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.334 seconds
[03:36:32] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.383 seconds
[03:36:32] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.460 seconds
[03:36:32] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 1.139 seconds
[03:36:32] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 3.148 seconds
[03:36:32] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 3.151 seconds
[03:36:40] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.189 seconds
[03:36:41] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.183 seconds
[03:36:44] well, looks like it recovered....
[03:36:49] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 20580 bytes in 0.181 seconds
[03:36:50] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.354 seconds
[03:36:57] Ryan_Lane: it flapped once before
[03:36:58] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.136 seconds
[03:36:59] RECOVERY - LVS HTTP IPv4 on wikidata-lb.eqiad.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 0.055 second response time
[03:36:59] RECOVERY - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63848 bytes in 0.135 seconds
[03:36:59] RECOVERY - LVS HTTP IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.138 seconds
[03:36:59] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.150 seconds
[03:36:59] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.169 seconds
[03:36:59] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63469 bytes in 0.197 seconds
[03:37:00] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.194 seconds
[03:37:02] ugh
[03:37:04] very briefly
[03:37:07] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.137 seconds
[03:37:09] if oyu look at ganglia
[03:37:12] I do wonder if something larger is happening… it may be coincidence but I was tailing a ping of the google DNS server and it died (then recovered then died again and now recovered) at almost the same time as similar notifications were being sent here
[03:37:25] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.139 seconds
[03:37:25] RECOVERY - LVS HTTP IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.136 seconds
[03:37:25] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.168 seconds
[03:37:26] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.221 seconds
[03:37:34] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 20630 bytes in 0.082 seconds
[03:37:35] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 63467 bytes in 0.169 seconds
[03:37:36] yeah, Ithink that this is a transit thing
[03:37:41] maybe unrelated but we're logging a lot of
[03:37:41] Exception from line 94 of /usr/local/apache/common-local/php-1.21wmf6/extensions/ConfirmEdit/FancyCaptcha.class.php: Ran out of captcha images
[03:37:42] now
[03:37:43] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 63846 bytes in 0.135 seconds
[03:37:46] this is weird
[03:37:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds
[03:38:02] well xo is off but i don't see any other conenctions down
[03:38:06] and xo has been down for a little while
[03:40:07] grr, time is way out of sync on lvs servers
[03:40:43] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[03:40:49] yep. has been for a while
[03:41:03] there's an rt open for that
[03:43:04] so i see an lvs server losing its bgp state at 7:30
[03:43:10] pybal was getting monitoring timeouts from the squids
[03:43:11] 2013-01-09 03:29:10.679263 [wikibookslb ProxyFetch] cp1006.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.001 s
[03:43:11] 2013-01-09 03:29:10.722461 [mediawikilb ProxyFetch] cp1005.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.000 s
[03:43:12] 2013-01-09 03:29:10.783407 [foundationlb ProxyFetch] cp1015.eqiad.wmnet (enabled/partially up/pooled): Fetch failed, 30.001 s
[03:43:13] 2013-01-09 03:29:10.816945 [mediawikilb ProxyFetch] cp1012.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.001 s
[03:43:14] 2013-01-09 03:29:10.927765 [wikivoyagelb ProxyFetch] cp1004.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 30.001 s
[03:43:29] thats lvs1001, time is off by 7min
[03:44:50] no such timeouts at all on lvs1002
[03:45:26] ah yep that was lvs1001 that had the bgp flap as well
[03:47:40] LeslieCarr: any other signs of network weirdness that would effect lvs1001?
[03:48:05] not that i can tell … lemme check hardware log just in case something there happened (and was logged)
[03:49:17] nope, no hardware events logged
[03:49:39] damn
[03:49:58] hrm,
[04:00:23] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[04:06:59] New patchset: Ryan Lane; "Fix reference to l10n dependency script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42913
[04:07:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42913
[04:13:00] heh, it's ironic this happens just as i was watching the jimmy wales interview
[04:13:20] New patchset: Ryan Lane; "Capture stderr when pulling pillar data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42914
[04:14:55] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:56] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[04:16:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42914
[04:17:18] LeslieCarr: I managed to watch that interview this morning without incident ;)
[04:59:31] New patchset: Dereckson; "(bug 43760) Enable WikiLove on is.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42915
[04:59:57] New patchset: Andrew Bogott; "Objectify adminlogbot." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/42916
[07:14:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:14:56] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[07:16:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds
[07:16:53] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[07:31:00] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[07:48:08] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours
[07:50:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:52:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.158 seconds
[08:02:05] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:42] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 197 seconds
[08:23:59] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 201 seconds
[08:27:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:30:44] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[08:31:02] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[08:42:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.895 seconds
[08:48:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:17:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:27:27] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 184 seconds
[09:28:21] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 217 seconds
[09:28:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.150 seconds
[09:49:12] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 211 seconds
[09:51:00] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[09:51:27] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[10:02:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[10:52:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:05:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[11:12:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42389
[11:16:30] !log reedy synchronized wmf-config/
[11:16:39] Logged the message, Master
[11:17:43] ori-l: thanks for resolving those bugs
[11:18:01] ori-l: what's your interest in CORS if I may ask?
[11:23:12] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:29:33] New review: Ori.livneh; "Asher: if you have a chance, could you weigh in? It'd be useful to know what you think." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[11:38:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:53:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[12:24:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:32:08] MaxSem: any idea wth is going on: http://www.wikidata.org/w/index.php?title=Translations:Wikidata:Glossary/25/fi&action=edit&loadgroup=page-Wikidata%3AGlossary&loadtask=view ?
[12:35:05] Nikerabbit, I observed this when my server was in process of meltdown, however vanadium looks responsive and search returns results
[12:35:12] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[12:35:30] Nikerabbit, I say look in the logs
[12:35:55] MaxSem: where are the logs?
[12:36:07] https://wikitech.wikimedia.org/view/Solr#Logs
[12:37:11] MaxSem: am I supposed to have access to vanadium?
[12:37:29] let's see in puppet;)
[12:38:20] which one?
[12:38:21] no you're not:)
[12:38:45] and it'll likely require root to view them anyway:)
[12:39:06] hmph
[12:39:20] see `node "vanadium.eqiad.wmnet"` in site.pp
[12:39:56] it includes admins::restricted and various single accounts, but not you or other mortals
[12:40:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds
[12:41:11] well, as a fallback I can extend my try catch block to more lines... but I'm very curious wtf is going on there
[12:41:52] but it is very hard if not impossible to get the actual result and query for inspecting
[12:42:12] it's easy if you have logs acess
[12:42:30] * MaxSem looks around for euro pos
[12:42:35] MaxSem: only the query is logged if even that
[12:42:36] *euro ops
[12:42:50] but of course then I can do the query myself
[12:43:51] apergos or paravoid, can you help us? ^^^
[12:44:02] yes?
[12:45:14] apergos, can you look if there's anything bad in jetty logs on vanadium?
[12:47:17] doesn't look like it
[12:54:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:54:44] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:56:06] apergos: is it logging the queries?
[12:57:31] I see stuff like GET /solr/select?wt=json&q= and lots of stuff
[12:57:32] current
[12:57:47] does grep "language-specific name" find anything?
[13:00:04] no
[13:00:16] ah
[13:00:18] yes
[13:00:20] space -> +
[13:00:43] returns 200
[13:00:54] the most recent of those anyways
[13:01:41] POST /solr/update?wt=json HTTP/1.0" 200 43 is the result
[13:02:12] what else did your query have in it?
[13:03:05] apergos: what do you mean? Could you paste the full query somewhere for me?
[13:03:15] ok
[13:04:49] POST /solr/update is an update query while wehat fails is search, someting like GET /solr/search?....
[13:14:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds
[13:50:48] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[13:51:12] New review: Hashar; "PS4: use mount{} instead of symlink (per Ryan)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743
[13:53:37] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27175
[13:53:40] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325
[13:55:40] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34748
[14:02:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:19] New patchset: Hashar; "(bug 43141) jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620
[14:16:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.636 seconds
[14:16:35] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[14:28:12] New patchset: Hashar; "beta: makes two wikis to use WMF branches" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42951
[14:29:37] New review: Hashar; "Merging that to deploy it on beta." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42951
[14:29:47] New review: Hashar; "Merging that to deploy it on beta." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42951
[14:29:48] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42951
[14:38:52] yo opsen, does anybody know anything about the vumi setup (an sms service that you can ask for a wikipedia article and get it delivered as a couple of sms messages). is this already in production?
[14:40:38] no idea
[14:41:36] :D
[14:43:06] New patchset: Nikerabbit; "Workaround for exception preventing translation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42954
[14:43:19] MaxSem: could you have a look ^^
[14:43:52] after hour(s) of debugging I found out "Operation timed out after 5000 milliseconds with 0 bytes received"
[14:44:24] ehm
[14:44:45] what did it use by default?
[14:44:58] MaxSem: 5s
[14:45:11] i mean which http client
[14:45:18] MaxSem: file_get_contents
[14:45:44] so you think increasing the timeout will help?
[14:46:12] shouldn't you investigate why it takes so long?
[14:46:17] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42954
[14:46:33] strdist
[14:47:43] I'll add some length limits and extend the try-catch block, but this should help now
[14:48:01] ehm, on texts sometimes kilobytes long?
[14:48:09] it has quadratic complexity
[14:48:27] MaxSem: indeed, hence length limits
[14:48:56] do you sort by strdist?
[14:49:19] MaxSem: it's sorted by score which is the return value of strdist
[14:49:21] mebbe it's worth trying to get the relevance score do the trick?
[14:50:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:50:31] this is the best compromise so far
[14:52:14] Nikerabbit, are you going to deploy it?
[14:52:35] MaxSem: doing i
[14:53:14] * MaxSem doesn't like undeployed changes
[14:53:17] afk
[14:53:37] !log nikerabbit synchronized wmf-config/CommonSettings.php 'Translation memory tweak'
[14:53:47] Logged the message, Master
[14:59:39] New patchset: Mark Bergsma; "Move misc::url-downloader out of a completely unrelated file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42958
[15:00:37] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[15:01:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42958
[15:03:04] mark: can you possibly restart memcached on virt0 please ? That renders labsconsole useless :/ http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=virt0&service=Memcached
[15:04:02] ok
[15:04:19] danke!
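Two sketches related to the timeout and "length limits" workaround discussed above. The endpoint URL, the 10-second value, and the 1000-character cap are placeholder assumptions for illustration, not the actual Translate/translation-memory configuration:

    <?php
    // Give a plain file_get_contents() HTTP fetch an explicit timeout via a
    // stream context instead of relying on the default.
    $context = stream_context_create( array(
        'http' => array(
            'method'  => 'GET',
            'timeout' => 10, // seconds of wall-clock time before the read gives up
        ),
    ) );
    $result = file_get_contents( 'http://solr.example.org/solr/select?wt=json&q=...', false, $context );
    if ( $result === false ) {
        // Handle the timeout/error instead of letting it surface as an exception.
    }

    // The "length limits" idea: cap inputs before an O(n*m) similarity measure,
    // so kilobyte-long texts cannot blow up the scoring step.
    function cappedSimilarity( $a, $b, $cap = 1000 ) {
        $a = mb_substr( $a, 0, $cap );
        $b = mb_substr( $b, 0, $cap );
        similar_text( $a, $b, $percent ); // similar_text() is also quadratic, hence the cap
        return $percent;
    }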
[15:04:31] seems to be running again
[15:04:35] !log Started memcached on virt0
[15:04:46] Logged the message, Master
[15:05:52] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[15:06:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[15:22:25] New patchset: Demon; "Configure ExtensionDistributor in preparation for new version" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42966
[15:26:16] New patchset: Mark Bergsma; "Allow HTTPS (CONNECT) requests on the copy-by-url proxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42967
[15:27:29] New patchset: Mark Bergsma; "Allow HTTPS (CONNECT) requests on the copy-by-url proxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42967
[15:29:11] New review: Demon; "I don't know much about squid config, but this *looks* sane :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42967
[15:32:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42967
[15:35:15] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42966
[15:35:54] !log demon synchronized wmf-config/CommonSettings.php 'Deploying Id2755235'
[15:36:04] Logged the message, Master
[15:37:17] New patchset: Hashar; "mw-update-l10n l10n cache rebuild is now verbose" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42970
[15:37:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:51] New review: Hashar; "Please cast your vote :-]? A typical use case is refreshing the cache on the 'beta' cluster." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42970
[15:52:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.086 seconds
[16:03:55] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[16:05:21] New patchset: Dereckson; "(bug 43760) Enable Collection on is.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42976
[16:08:24] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused
[16:13:05] New review: Anomie; "Note that mw-update-l10n is called from scap, so that will also be more verbose." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42970
[16:13:40] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[16:16:06] New patchset: Dereckson; "(bug 43769) Close ik.wiktionary and zh-min-nan.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42978
[16:25:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:37:36] !log authdns update adding osm-cp1003/4 production to zone files
[16:37:46] Logged the message, Master
[16:42:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[17:01:40] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds
[17:02:33] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 230 seconds
[17:13:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:45] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[17:17:51] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[17:25:26] New review: MF-Warburg; "Maybe it's better to keep the configuration (i.e. localized logo for ik.wikt; language code correcti..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42978
[17:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds
[17:33:15] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[17:33:45] !log aaron synchronized php-1.21wmf7/includes/upload/AssembleUploadChunks.php 'deployed 4bb28e000bb6609e23046b611be335382aa74618'
[17:33:55] Logged the message, Master
[17:43:18] !log aaron synchronized php-1.21wmf7/includes/ 'deployed 5185259101b27e4780618b3cd7718b9a0c51e1c4'
[17:43:19] PROBLEM - SSH on lvs6 is CRITICAL: Server answer:
[17:43:29] Logged the message, Master
[17:45:07] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:45:34] hrmmmmm, lvs6?
[17:49:28] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours
[18:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:03:34] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:08:14] !log aaron synchronized php-1.21wmf7/extensions/UploadWizard 'deployed bd33047d1cb938f3b4923a51862d887e6c831b65'
[18:08:24] Logged the message, Master
[18:17:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[18:18:06] New patchset: Hashar; "(bug 43729) beta mw installs use /dev/vdb mounted on /srv" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[18:18:23] New review: Hashar; "PS5: rephrased summary" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743
[18:18:33] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:19:01] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:22:50] hashar- How's the deploying of those config changes going?
[18:24:01] New review: Hashar; "I guess we can either:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/38307
[18:24:19] anomie: so I think all changes got deployed
[18:24:24] !log aaron Started syncing Wikimedia installation... :
[18:24:33] Logged the message, Master
[18:24:47] anomie: I found at least one instance of all.dblist that did not use the getRealmSpecificFilename() wrapper
[18:25:04] but that is only going to cause issues on beta
[18:25:14] anomie: Tim merged the last remaining change during my vacations
[18:25:35] anomie: I also have to rename wmfRealm to wmgRealm per a Tim comment . wmf = wikimedia function :-D
[18:25:40] does not make sense for a global
[18:26:02] $wmfAllOfTheLols();
[18:26:24] Ok then, good. There's still 33388 to look at, too.
[18:26:59] We have a few other $wmf* variables, though. When I looked, it seemed that $wmg was used for configuration toggles and $wmf for a few miscellaneous things.
[18:27:02] hashar, have you heard about PHP Notice: Undefined variable: wmfRealm in /home/wikipedia/common/wmf-config/InitialiseSettings.php on line 12244 ?
[18:27:07] reedy: or \Wikimedia\Globals::singleton()->forRealm( 'labs' )->getFilename( 'all.dblist' );
[18:27:25] MaxSem: nop, that is nasty
[18:28:01] hashar, when running maint scripts
[18:28:28] hmm
[18:28:41] it is set in CommonSettings.php
[18:28:48] maybe need to make it global
[18:28:57] minimum repro:
[18:29:02] maxsem@fenari:/home/wikipedia/common/php-1.21wmf7/extensions/WikimediaMaintenance$ mwscript eval.php testwiki
[18:29:03] > $wgConf->loadFullData();
[18:29:03] sounds like a simple enough fix
[18:29:34] Add it to the global list in CommonSettings.php line 150?
[18:29:55] Shouldn't $wmfUdp2logDest be moved there too?
[18:30:05] just for consistency
[18:30:09] you know, cause we like that
[18:30:53] fixing it
[18:31:31] New patchset: Hashar; "fix wmfDatacenter / wmfRealm scope in IS.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42994
[18:31:40] anomie: Reedy : ^^^^
[18:32:19] why not global $wmfUdp2logDest, $wmfDatacenter, $wmfRealm; ?
[18:32:33] for git blame! ;-D
[18:32:47] Though, per anomie...
[18:32:50] function wmfLoadInitialiseSettings( $conf ) {
[18:32:50] global $wmfConfigDir, $wgConf, $wmfUdp2logDest;
[18:32:50] # $wgConf =& $conf; # b/c alias
[18:32:50] require( "$wmfConfigDir/InitialiseSettings.php" );
[18:33:15] hashar, who else do we need to get https://gerrit.wikimedia.org/r/#/c/39711/ approved - it doesn't seem to be too much of a big deal IMO?
[18:33:54] Does InitialiseSettings get included anywhere else?
[18:34:02] Thehelpfulone: there is no real process yet :D
[18:34:15] Thehelpfulone: will merge in and deploy
[18:34:47] Reedy- Elsewhere in CommonSettings.php (but not inside a function). A grep doesn't turn up anything else, at least not in operations/mediawiki-config
[18:35:29] Maybe we should move them all into InitialiseSettings.php so they won't cause a problem in either place
[18:36:02] One or the other place, anyway. I have no opinion on which place is better.
[18:36:17] indeed
[18:36:24] Doing it once seems better than doing it twice
[18:36:45] Thehelpfulone: I have deployed the change. Thanks for the ping :-]
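The notice discussed above comes down to PHP variable scoping: a file pulled in with require() from inside a function runs in that function's scope, so a variable set at the top level of the including script is invisible there unless it is imported with global, which is what the fix adds. A self-contained sketch (the names echo the conversation, but this is not the actual wmf-config code):

    <?php
    // Minimal illustration of the scoping issue; not the real wmf-config code.
    $wmfRealm = 'production'; // set at file (top-level) scope

    function wmfLoadSettingsBroken() {
        // $wmfRealm is NOT visible here: a function's scope starts empty, and a
        // require() executed in here would see that same empty scope.
        return isset( $wmfRealm ) ? $wmfRealm : 'undefined';
    }

    function wmfLoadSettingsFixed() {
        global $wmfRealm; // import the top-level variable, as the fix above does
        return $wmfRealm;
    }

    echo wmfLoadSettingsBroken(), "\n"; // prints "undefined"
    echo wmfLoadSettingsFixed(), "\n";  // prints "production"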
[18:36:50] hashar, heh thanks :)
[18:37:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[18:37:50] * Reedy amends hashars commit
[18:39:35] New patchset: Reedy; "fix wmfDatacenter / wmfRealm scope in IS.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42994
[18:44:10] # Protocol settings for urls
[18:44:10] $urlprotocol = "";
[18:44:20] ^ Why are we using that all over the place if it's set to ""?
[18:45:07] Relic from before we changed to protocol-relative links everywhere?
[18:45:08] because it used to be 'http'
[18:45:29] DIEDIEDIE
[18:46:05] Reedy, http://youtu.be/RbIGuLXCziU
[18:47:10] $wgNoticeCounterSource = . '//wikimediafoundation.org/wiki/Special:ContributionTotal' .
[18:47:16] That passes php -l..
[18:48:11] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995
[18:49:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:49:08] hey Reedy, I got a question for you
[18:49:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:50:14] notpeter: ya?
[18:50:50] Reedy: nvm. see other channel for context
[18:50:56] sorry to interupt!
[18:51:14] why is puppet disabled on lvs1004?
[18:52:39] New patchset: Hashar; "$::realm is 'labs' not 'wmflabs'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42996
[18:52:45] New review: Hashar; "The realm check used 'wmflabs' instead of 'labs'. Fixed with: https://gerrit.wikimedia.org/r/42996" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743
[18:54:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42996
[18:56:08] New review: John Erling Blad; "Not sure, but this seems like a typo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40561
[18:59:10] New patchset: Reedy; "Remove wgUseTagFilter. Same as default and no longer needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43004
[19:00:17] New patchset: Reedy; "sewikipedia -> sewiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43005
[19:00:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43005
[19:01:35] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix sewiki typo'
[19:01:44] Logged the message, Master
[19:05:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[19:05:39] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:37] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 194 seconds
[19:07:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 196 seconds
[19:09:29] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf7
[19:09:37] Logged the message, Master
[19:12:18] Ryan_Lane - Do we have git-deploy deploying anywhere to the point that we can run maintenance scripts on it?
[19:13:01] !log aaron Finished syncing Wikimedia installation... :
[19:13:10] Logged the message, Master
[19:13:14] anomie: you mean mediawiki configured in such a way that they'll work?
[19:13:27] maybe beta?
[19:13:27] PROBLEM - Puppet freshness on sq86 is CRITICAL: Puppet has not run in the last 10 hours
[19:13:28] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours
[19:13:42] Ryan_Lane: are you the most logical person to review https://gerrit.wikimedia.org/r/#/c/42887/ ?
[19:14:13] maybe as a secondary reviewer
[19:14:30] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours
[19:14:32] for a sanity check for git-deploy, but otherwise I'm not really a mediawiki expert
[19:15:04] I suppose we might need to add AaronSchulz in the mix, in addition to Reedy (who's on the list already)
[19:15:34] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours
[19:15:34] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours
[19:15:34] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours
[19:15:35] Ryan_Lane- I see /src/deployment exists in beta, but nothing under that
[19:15:42] err, /srv/deployment
[19:15:50] on the bastion?
[19:16:00] yeah, I was going to start setting up beta today
[19:16:10] On deployment-bastion
[19:16:22] * Ryan_Lane nods
[19:16:48] well, I think a number of reviews are pending or merged in for the common repo
[19:16:52] that's what needed to make this work
[19:17:30] PROBLEM - Puppet freshness on mc5 is CRITICAL: Puppet has not run in the last 10 hours
[19:17:31] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours
[19:17:31] PROBLEM - Puppet freshness on search35 is CRITICAL: Puppet has not run in the last 10 hours
[19:17:31] PROBLEM - Puppet freshness on search32 is CRITICAL: Puppet has not run in the last 10 hours
[19:17:36] Aren't those in another branch?
[19:17:41] probably, yes
[19:17:58] might be only on tin though
[19:18:05] I think that's where Tim committed your stuff
[19:18:08] ah
[19:18:25] PROBLEM - Puppet freshness on mc16 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:25] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:25] PROBLEM - Puppet freshness on mc13 is CRITICAL: Puppet has not run in the last 10 hours
[19:18:33] actually, it's possible that scripts will work on that node
[19:19:28] PROBLEM - Puppet freshness on search1024 is CRITICAL: Puppet has not run in the last 10 hours
[19:19:28] I do remember that mwversionsinuse worked
[19:19:33] though I was using its target
[19:19:36] not the script
[19:21:17] mwversionsinuse works on tin (if you change the path in the script). mwscript doesn't, though, because /srv/deployment/mediawiki/common/wikiversions.cdb doesn't exist
[19:21:25] PROBLEM - Puppet freshness on sq72 is CRITICAL: Puppet has not run in the last 10 hours
[19:21:40] /srv/deployment/mediawiki/common/multiversion/activeMWVersions --extended --withdb
[19:22:05] And, of course, there's no way to go from the "1.21wmf4" returned by mwversionsinuse to whichever slot it's in
[19:22:35] yes there is
[19:22:43] Oh, what is it?
[19:22:49] in /srv/deployment/mediawiki/common we can have symlinks to the slots
[19:23:23] BTW, the equivalent command for mwscript is php /srv/deployment/mediawiki/common/multiversion/MWScript.php; pass args like --wiki=enwiki eval.php for a simple test
[19:23:50] Oh, ok. But we don't have those symlinks on tin, yet.
[19:23:58] just add them
[19:24:01] New patchset: Andrew Bogott; "Objectify adminlogbot." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/42916
[19:26:13] New patchset: preilly; "add WikipediaMobileFirefoxOS to bits docroot as Submodule" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43009
[19:26:31] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:33] PROBLEM - Puppet freshness on mw1121 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on mw1129 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on sq56 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on sq65 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:34] PROBLEM - Puppet freshness on wtp1 is CRITICAL: Puppet has not run in the last 10 hours
[19:30:03] New review: Reedy; "Looks alright to me, not quite sure what's up with Jenkins. That submodule clones fine for me:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43009
[19:30:43] andrewbogott: quick review? https://gerrit.wikimedia.org/r/#/c/43007/
[19:30:54] New patchset: preilly; "add WikipediaMobileFirefoxOS to bits docroot as Submodule" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43009
[19:31:21] * andrewbogott looks
[19:35:31] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms
[19:35:46] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43009
[19:36:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:06] !log authdns update adding mc1016/mc1017 to zone file
[19:47:15] Logged the message, Master
[19:47:42] New patchset: Reedy; "enwiki to 1.21wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43011
[19:47:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43011
[19:48:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.388 seconds
[19:49:00] RECOVERY - Puppet freshness on sq56 is OK: puppet ran at Wed Jan 9 19:48:50 UTC 2013
[19:50:05] Reedy, can you also deploy https://gerrit.wikimedia.org/r/#/c/42994/ - I've reviewed it
[19:50:12] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Wed Jan 9 19:50:03 UTC 2013
[19:50:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42994
[19:53:33] New patchset: Hashar; "beta: fix /dev/vdb mounting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43013
[19:53:40] !log reedy synchronized wmf-config/
[19:53:49] Logged the message, Master
[19:56:49] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[19:57:07] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[19:58:06] Reedy, cheers
[19:59:39] PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:01:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43013
[20:05:27] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995
[20:18:21] New review: preilly; "I'd really like to know what we hope to gain from this experiment?" [operations/mediawiki-config] (master); V: -1 C: -1; - https://gerrit.wikimedia.org/r/42892
[20:23:14] New review: Swalling; "Hey Patrick:" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[20:24:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:27:07] New patchset: Lcarr; "expanding the eqiad ip range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43015
[20:27:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43015
[20:28:44] figured out why analytics machines all had icinga failures :)
[20:28:56] i mean puppet timestamp failures according to icinga
[20:28:59] New review: Nemo bis; "Patrick, "what we gain" seems quite clear, data for further research; now also linked." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[20:29:02] firewall?
[20:29:27] New patchset: Hashar; "phase out imagescaler::labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:31:51] New patchset: Hashar; "phase out imagescaler::labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:35:43] New patchset: Hashar; "phase out imagescaler::labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:36:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43016
[20:40:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[20:41:03] New review: Ryan Lane; "I'm not sure if you've ever needed to clean up a spammed wiki before. If you have a wiki without acc..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[20:43:20] New review: Nemo bis; "Ryan, you're assuming what this test is meant to verify..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42892
[20:47:19] New review: Mattflaschen; "As far as I'm concerned, this test is meant to study the behavior of good-faith humans trying to cre..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42892
[20:52:07] New review: Asher; "> Asher: if you have a chance, could you weigh in? It'd be useful to know what you think." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892
[20:56:25] wooo 418 PHP Fatal error: require() [function.require]: Failed opening required '/InitialiseSettings.php'
[20:57:31] who borked wikipedia?:P
[20:58:12] Reedy: ^^^
[20:58:44] l
[20:58:53] * robla looks
[20:59:04] I didn't touch that line..
[20:59:19] looks like a sync fail - the site is functional but some apache(s) are out of date
[20:59:27] oh, shit
[20:59:32] I removed the global
[20:59:35] synching
[20:59:45] * Damianz looks at Reedy's scope
[21:00:02] binasher, Ryan_Lane - thanks. I'll write up an email later today to ops-l to explain the rationale and to ask what the right way is to deploy this (if at all).
[21:00:04] !log reedy synchronized wmf-config/CommonSettings.php
[21:00:08] I win
[21:00:16] Logged the message, Master
[21:00:43] fixed
[21:01:07] New patchset: Reedy; "Restore $wmfConfigDir global in wmfLoadInitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43020
[21:01:18] ori-l: I don't think ops@ are qualified to answer that, we have too little to do with higher levels such as these
[21:01:53] * Nemo_bis thinks people are greatly overestimating the effects of fancycaptcha – among those trying different captcha configs nobody found the ones used by Wikimedia projects to have any effect
[21:02:09] paravoid: well, there are a number of stakeholders, which makes this complicated, but ops is one because of potential site stability implications, flagged by asher
[21:02:12] I think asher and ryan is who you can expect at most to reply to that thread :)
[21:02:43] Nemo_bis, at least it stopped the guy who bruteforced sysop accounts on enwiki years ago
[21:02:58] paravoid: that's fine; gerrit is just not an ideal forum for discussions
[21:03:00] MaxSem: did it?
[21:03:11] and anyway it was years ago and only a single person
[21:03:20] New patchset: Reedy; "Restore $wmfConfigDir global in wmfLoadInitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43020
[21:03:29] it did. with an immediate effect
[21:03:31] Nemo_bis: the answer to "the captchas are broken" is not "turn off the captchas", it's "fix the captchas"
[21:03:39] ori-l: sure, but may I suggest picking a different list, e.g. engineering?
[21:03:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43020
[21:03:42] or even wikitech
[21:03:52] paravoid: OK, i thought about that too. wikitech it is.
[21:03:55] <^demon> wikitech is good.
[21:04:06] yeah
[21:04:12] ooh yeah, announcing no captchas on a public list!
[21:04:28] <^demon> No worse than committing the change to a public git repo ;-)
[21:04:43] binasher: there was never any intent to push this through without discussion
[21:04:52] binasher: if anything, it'll work in favor of your argument, so... :-)
[21:05:08] i find it useful to point people to gerrit changes to discuss proposals because it's easier to be specific in code
[21:05:42] paravoid: i asked e3 to notify ops before a deployment that may effect us so we can be aware of the cause of potential system impact, not to email ops so we could have a fire side chat on our thoughts behind the behavior of anonymous people using wikipedia
[21:07:05] binasher: fair enough! all I'm saying is that if there's a need for a discussion, ops@ is probably a bad place for that
[21:07:07] i'm sorry to spark a fire and then run, but i'm late for an appointment :( the bottom line: this won't go out without sign-off from ops and an OK from the community.
[21:08:20] So no need to rage, etc. but feedback about _how_ to get some good data about the efficacy of captchas would be much appreciated. (Coordinating with ops to preempt site stability issues is a good point.)
[21:09:00] New patchset: Lcarr; "fixing iptables::purges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43021
[21:09:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43021
[21:10:12] Ryan_Lane/hashar - ok to merge your imagescaler::labs change ?
[21:10:16] on sockpuppet
[21:10:19] yep
[21:10:32] cool
[21:10:34] merging now
[21:11:26] *boom*
[21:11:36] LeslieCarr: yup
[21:11:44] New patchset: Hashar; "role::applicationserver::appserver::beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43023
[21:11:46] LeslieCarr: sorry forgot to merge it :/
[21:12:06] oh no I can't pull on sock puppet .. :D
[21:12:43] yep, :)
[21:14:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:14:55] Ryan_Lane: it's just a test
[21:15:12] Ryan_Lane: I'm only saying that all this fuss "omg the world will end in those 3 hours" doesn't make any sense
[21:15:22] but it's an ineffectual test
[21:15:36] 1. it's not running long enough to actually gather proper data
[21:15:59] 2. it's going to tell us that disabling captchas make it easier for people to create accounts, which we already know
[21:16:09] 2. Is false.
[21:16:27] We don't already know that it stops legitimate users, we only suspect it.
[21:16:49] 3. if the rate of spam accounts does increase by a lot, then admins will start blocking all new user accounts to stop the flood
[21:16:56] It makes for a horrid ux, for every attempt failed you loose % users
[21:17:20] 4. if we're going to properly test this, it should run for a reasonable amount of time in an A/B test
[21:17:40] it may be an effectual test - have the wikipedia admins been informed ?
[21:17:43] 4. is the same as 1.
[21:17:44] the non techie side, that is
[21:18:02] and admins should have some way of telling that a user account was created with or without a captcha
[21:18:02] that would probably help a lot - if we were working in conjunction
[21:18:05] New patchset: Hashar; "role::applicationserver::appserver::beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43023
[21:18:05] as for 1, 3 – I'm not qualified to answer, I thought the WMF had a team to do this? :)
[21:18:06] to avoid #3
[21:18:14] 4 is not the same as 1
[21:18:22] it is
[21:18:22] we're not doing A/B testing
[21:18:51] Well, enough of [[MeatBall:DefendEachOther]] for today.
[21:19:08] New review: Hashar; "So PS1 did not really solve the problem :/ PS2 include the role::applicationserver::webserver and t..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43023
[21:23:52] LeslieCarr: https://meta.wikimedia.org/wiki/Research:Account_creation_UX/CAPTCHA#Metrics . we also have two enwiki admins on E3 that are committed to monitoring/cleanup, and the rest of us intend to watch Special:RecentChanges as well.
[21:23:57] cool
[21:24:06] well that solves #3 :)
[21:24:46] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.443 seconds
[21:27:37] Ryan_Lane: we'll be tagging captcha / non-captcha accounts. We thought about that too.
[21:32:55] New patchset: Lcarr; "fixing icinga purges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43024 [21:34:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43024 [21:39:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43023 [21:43:15] New patchset: Pyoungmeister; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [21:56:05] New patchset: Pyoungmeister; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [21:58:20] chmod -R g+w /srv/deployment/mediawiki/common/.git/objects/ [21:58:29] ^ Can someone please run that on tin for me? [21:58:48] Reedy: sure [21:59:02] done [21:59:10] thanks [21:59:15] no prob [21:59:20] I'm going to run that on the whole repo [21:59:46] I did fix my umask just now too [22:00:02] Seems Tim had some with no group write [22:00:29] git-deploy requires you to start with umask 0002 [22:00:54] but that doesn't stop someone from running git pull with screwed up permissions [22:01:10] or root from doing it [22:02:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:50] Totally could make an alias to set umask on git... but then crazy people escape stuff to stop people making you do bad stuff, so aliases would be ignored [22:02:53] As that checkout of mediawiki-config is just over a month old, I was going to rebase your patch [22:03:33] * Ryan_Lane nods [22:07:44] reedy@tin:/srv/deployment/mediawiki/common$ git rebase origin [22:07:44] It seems that there is already a rebase-apply directory, and [22:07:44] hah [22:14:01] New review: preilly; "@Swalling ? Thanks, so much for posting that link. It really helped me." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42892 [22:15:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.857 seconds [22:16:26] !log aaron synchronized php-1.21wmf7/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'deployed 7482243e3249ae38d6aedbaf06db7107dc3516f5' [22:16:36] Logged the message, Master [22:22:00] That was relatively painless [22:22:35] * ^demon whacks Reedy with a steel pipe [22:22:37] <^demon> No pain, no gain! [22:27:01] Reedy: where are we up to? [22:28:23] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 189 seconds [22:36:15] TimStarling: was just bringing things up to date, finding out there were bad permissions on the git objects [22:36:47] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [22:36:56] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [22:38:32] You've probably seen in your email that misc/scripts/mw-deployment-vars.erb is missing from 42887 [22:40:11] yes [22:40:28] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [22:40:32] there's the fix for it [22:40:51] great [22:40:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [22:41:09] Logged the message, Master [22:41:17] New review: Tim Starling; "PS3: added missing file and fixed its location." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42887
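The chmod/umask exchange above comes down to keeping the shared deployment checkout on tin group-writable. A minimal sketch of that idea, using the path quoted in the log and assuming sudo access; the core.sharedRepository setting at the end is an optional extra not mentioned here, just a standard git knob for the same problem:

    # Restore group-write on the object store after a pull done with a strict umask
    # (path as quoted above).
    sudo chmod -R g+w /srv/deployment/mediawiki/common/.git/objects/

    # git-deploy expects deployers to work with a group-friendly umask:
    umask 0002

    # Optional (assumption, not from the log): tell git itself to keep new objects
    # group-writable regardless of the umask of whoever runs git pull next.
    cd /srv/deployment/mediawiki/common && git config core.sharedRepository group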
[22:44:51] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/42916 [22:45:43] In your email were you meaning a test apache instance? [22:49:12] Looks like extract2 still needs updating (using /apache) [22:49:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:14] then the symlinks too [22:54:17] New review: Dereckson; "Indeed, we still need the logo on the closed wiki." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42978 [22:54:34] so, Ryan was telling me last night that we need symlinks from /srv/deployment/mediawiki/common/php-1.21wmf7 to /srv/deployment/mediawiki/slot0 etc. [22:55:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:55:41] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:59:36] Ryan_Lane: I guess .deploy is from git-deploy? Can we just add it to .gitignore? [23:00:10] I feel like git deploy deserves it's own channel [23:02:49] its [23:03:19] I don't really understand what "Git deploy" is. [23:03:44] http://wikitech.wikimedia.org/view/Git Heh. [23:04:03] TimStarling: So do those manually and track them in mediawiki-config? [23:04:31] yes [23:04:42] and then we can have $IP be /srv/deployment/mediawiki/common/php-1.21wmf7 [23:04:53] mutante: so for some reason it looks like snmp isn't getting read again [23:04:58] mutante: on neon, that is [23:04:59] which means that the common directory will be $IP/.. like with scap [23:06:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [23:06:22] heh, tin can't access the internet so it can't load the docroot/bits/WikipediaMobileFirefoxOS git submodule [23:14:31] New patchset: Andrew Bogott; "Added a few inline puppet docstrings." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43104 [23:15:13] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43104 [23:21:04] Reedy: I'll merge in these puppet changes and test them [23:25:44] Great [23:26:05] New patchset: Tim Starling; "Split scap scripts from other scripts useful on deployment hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [23:26:18] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [23:26:25] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:26:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:26:32] Do we care much about live-1.5? [23:27:19] well, we do need somewhere to put our multiversion wrappers [23:27:27] what in particular do you want to do with it? [23:27:56] I was wondering if the files/symlinks etc. need updating [23:28:09] Not much work to do them [23:32:26] can those symlinks just be relative links instead of absolute?
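A minimal sketch of the symlink layout being discussed, with the directory names quoted above; the relative-versus-absolute question asked here is settled just below, so both forms are shown:

    # php-1.21wmf7 inside the common checkout points at the slot directory
    # managed by git-deploy.
    cd /srv/deployment/mediawiki/common

    # Relative form:
    ln -sfn ../slot0 php-1.21wmf7
    # Absolute form, equivalent on a single host:
    # ln -sfn /srv/deployment/mediawiki/slot0 php-1.21wmf7

    # Either way, $IP resolves to /srv/deployment/mediawiki/common/php-1.21wmf7
    # and the common directory is $IP/.. , as with scap.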
[23:33:08] I guess it doesn't matter [23:33:18] but yes, they need updating one way or another [23:38:16] puppetd -tv has been running on fenari for 5 minutes now [23:39:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:34] Doesn't look to be doing a great deal [23:44:11] waiting for stafford probably [23:44:14] root@stafford:~# uptime [23:44:15] 23:44:04 up 191 days, 22:02, 3 users, load average: 64.94, 55.49, 33.56 [23:45:08] stafford is often cpu-bound [23:45:23] :( [23:46:15] stafford gets pegged from time to time [23:46:28] and stuck [23:46:28] we need to investigate why at some point [23:47:02] New patchset: Dereckson; "(bug 43769) Close ik.wiktionary and zh-min-nan.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42978 [23:47:30] for some reason the "splay" feature doesn't work [23:47:50] New review: Dereckson; "PS2: Keep the lang/logo config, the wiki is locked, not deleted." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42978 [23:48:41] New patchset: Dereckson; "(bug 43769) Close ik.wiktionary and zh-min-nan.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42978 [23:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds [23:54:19] root@fenari:/# puppet config print splay [23:54:20] false [23:54:32] root@mw2:~# puppet config print splay [23:54:32] false [23:54:43] RECOVERY - Puppet freshness on db1036 is OK: puppet ran at Wed Jan 9 23:54:38 UTC 2013 [23:55:31] I'm continuing on here from last time puppet stopped my work for 20 minutes and I started working out why while I waited [23:56:39] RECOVERY - MySQL disk space on db1036 is OK: DISK OK [23:56:40] RECOVERY - MySQL Idle Transactions on db1036 is OK: OK longest blocking idle transaction sleeps for seconds [23:56:57] RECOVERY - MySQL Recent Restart on db1036 is OK: OK seconds since restart [23:56:58] RECOVERY - MySQL Slave Running on db1036 is OK: OK replication [23:57:15] RECOVERY - MySQL Replication Heartbeat on db1036 is OK: OK replication delay seconds [23:57:25] RECOVERY - MySQL Slave Delay on db1036 is OK: OK replication delay seconds [23:57:25] RECOVERY - Full LVS Snapshot on db1036 is OK: OK no full LVM snapshot volumes
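For reference, puppet's splay setting staggers agent start times precisely to avoid the kind of thundering-herd load seen on stafford above, so confirming whether it is actually enabled is a reasonable first check. A minimal sketch of the diagnostics used in this exchange, with hostnames taken from the log; the puppet.conf path assumes a stock 2013-era agent layout and may differ locally:

    # Is splay enabled for this agent? (prints "false" in the log above)
    puppet config print splay
    grep -i splay /etc/puppet/puppet.conf   # assumed default config path

    # How loaded is the puppetmaster right now?
    ssh stafford uptime

    # Run the agent once in the foreground with verbose output, as done on fenari:
    puppetd -tv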