[00:27:27] !log olivneh synchronized php-1.21wmf3/extensions/EventLogging 'Updating EventLogging on test2' [00:27:41] Logged the message, Master [00:28:46] !log aaron synchronized php-1.21wmf3/extensions/TimedMediaHandler 'deployed c1ac05640377f4f99cbe2a094e80d3d25d63b93d' [00:29:00] Logged the message, Master [00:29:54] !log aaron synchronized php-1.21wmf2/extensions/TimedMediaHandler 'deployed 18e51f3b06b84d1d5fbf47272d9ebfc5008dc879' [00:30:08] Logged the message, Master [00:43:08] !log olivneh synchronized php-1.21wmf2/extensions/EventLogging [00:43:23] Logged the message, Master [00:55:10] New review: Aaron Schulz; "I don't like how this has to enumerate the standard ones...what if those change? Is there a way arou..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29768 [01:01:59] !log olivneh synchronized php-1.21wmf3/extensions/EventLogging 'Updating EventLogging on test2' [01:02:13] Logged the message, Master [01:04:35] New patchset: Ori.livneh; "Re-enable EventLogging on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30747 [01:05:45] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30747 [01:09:12] !log olivneh synchronized php-1.21wmf2/extensions/EventLogging [01:09:26] Logged the message, Master [01:09:41] !log olivneh synchronized wmf-config/InitialiseSettings.php [01:09:55] Logged the message, Master [01:24:40] seems to be an outage for at least some people [01:24:40] including myself :) [01:29:32] Prodego: what kind? [01:33:28] jeremyb: probably something in the middle, I just get a could not connect error [01:33:31] from chrome [01:33:49] huh [01:34:11] ok, there's other kinds of outages you could have now. like power! [01:34:56] true, true [01:39:17] Hmm [01:39:27] I think theres possibly an apache or 2 out of sync.. 
[01:40:05] Every so often for wikidata, I'm getting the apache error page: Not Found - The requested URL /wiki/Wikidata:Main_Page was not found on this server. - Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request. [01:41:39] I'm currently failing to connect to irc.wikimedia.org [01:41:50] * Connecting to ekrem.wikimedia.org (208.80.152.178) port 6667... [01:41:50] * Connection failed. Error: Connection timed out [01:42:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 235 seconds [01:43:10] hrmmm [01:43:17] mobile site is red in watchmouse [01:43:29] IRC is green! [01:43:44] irc WFM [01:43:58] ditto [01:45:10] idk about mobile. WFM [01:46:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:53:54] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:00:57] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 310 seconds [02:06:03] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:24:05] !log LocalisationUpdate completed (1.21wmf2) at Tue Oct 30 02:24:04 UTC 2012 [02:24:24] Logged the message, Master [02:49:48] !log LocalisationUpdate completed (1.21wmf3) at Tue Oct 30 02:49:48 UTC 2012 [02:50:03] Logged the message, Master [02:57:21] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Tue Oct 30 02:57:12 UTC 2012 [03:35:59] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [03:35:59] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:35:59] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [03:49:39] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [04:49:30] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:19:17] PROBLEM - Apache HTTP on 
srv278 is CRITICAL: Connection refused [06:37:28] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [07:26:50] hello [07:29:40] υο [07:29:44] er [07:29:45] anyways [07:33:15] !log Replaced 2 bits @ esams servers with 4 new servers cp3019-cp3022 [07:33:29] Logged the message, Master [07:36:37] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [07:37:03] huh [07:37:26] there are a lot of ipv6 someloss esams emails this morning [07:38:13] no, it's because i'm an idiot [07:38:30] * apergos raises an eyebrow [07:39:14] New patchset: Mark Bergsma; "Add IPv6 addresses to new bits servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30757 [07:39:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30757 [07:39:55] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3913 bytes in 3.246 seconds [07:41:20] I see [07:42:58] hm? [07:43:10] you're up early [07:43:10] the jobrunners have some really odd graphs (suspiciously dropping off at midnight utc) and yet when I go look at the counts for a few large projects they are low, and some jobs seem to be processed. [07:43:16] well couldn't sleep much [07:43:25] sorry for that [07:43:26] that was to mark :P [07:43:29] ah [07:43:47] it's not particuarly early for you apergos [07:43:55] mm guess not [07:44:07] !log Fixed IPv6 addresses on new esams bits servers [07:44:22] Logged the message, Master [07:44:54] I saw "bond0" in the add_ip6 puppet stanza, and immediately discarded it as part of the link aggregation setup [07:45:07] hey mark [07:45:11] hi [07:45:25] exciting death of NY, eh ? [07:45:33] is it dead? [07:45:36] hmm? 
[07:45:47] not exactly, but a lot of network issues [07:45:48] oh didn't notice yet [07:45:51] 111 8th is having generator issues on some floors [07:45:56] AC2 cable's down [07:46:01] oh boy [07:46:04] whoops [07:46:16] lots of (our) providers are having outages [07:46:30] oh wow [07:46:47] ah water in the metro tunnels. nice [07:47:26] huh this really did have a huge impact, how about that [07:47:35] yeah, i thought that the news was overreacting, but nope [07:47:40] it's serious [07:50:04] :( [07:52:26] New patchset: Mark Bergsma; "Swap out ganglia aggregators for Bits caches esams group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30758 [07:52:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30758 [07:53:09] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [07:53:23] wtf [07:53:27] bad mark [07:53:37] * apergos raises the other eyebrow [07:54:43] oh [07:54:48] ARGH. [07:56:28] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3902 bytes in 0.243 seconds [07:59:07] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [08:00:46] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:05:16] I think we've established that I need more coffee. [08:07:47] three time's a charm :p [08:08:40] heh [08:08:46] * apergos goes to get some tea [08:38:27] New patchset: Mark Bergsma; "Repurpose cp3001/cp3002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30759 [08:39:32] New patchset: Mark Bergsma; "Repurpose cp3001/cp3002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30759 [08:40:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30759 [08:42:03] did you replace bits already? [08:42:07] wow [08:45:56] sure [08:46:05] why not [08:46:18] no more session bug? [08:46:32] session bug? 
[08:46:46] it wasn't "why", it was "wow you're fast" :) [08:46:54] a bit too fast, before my coffee ;) [08:47:07] i wanted to do it yesterday, right before the ops meet [08:47:15] but then the stupid geoip linking issue delayed me a bit [08:47:20] what was that problem with the concurrent sessions that we were having? [08:47:31] i don't know [08:47:35] i'll look at it again next time it happens [08:47:47] I don't know the details, I just knew that I'd have to restart both servers at the same time or depool/pool esams [08:47:58] might be threads queuing up for one very popular object or something [08:48:54] i'm inclined to install upload there now as well [08:49:08] and perhaps try my varying thumbs idea [08:55:18] hehe [08:58:14] New patchset: Mark Bergsma; "Puppetise domain-maplist, and add wikidata/wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30760 [09:00:19] why is gerrit down so much [09:00:45] is it a requirement for every java program to suck massively or something? 
[09:00:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30760 [09:07:03] !log Fixed wikidata.org language cnames issue [09:07:16] Logged the message, Master [09:19:50] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'Narayam and Webfonts: bug 41460, bug 39200, bug 41359' [09:20:06] Logged the message, Master [09:30:38] [ 9.980557] bnx2x 0000:01:00.0: eth0: Warning: Unqualified SFP+ module detected, Port 0 from LEONI part number L45593-C100-D10 [09:37:48] aaaaargh [09:41:37] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'Temporarily enable beta mappings for am wikis' [09:41:51] Logged the message, Master [09:43:59] it does work ;) [09:44:09] it's in the bits servers now serving production traffic [09:49:09] just a warning [09:49:10] that's good [09:56:38] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [09:56:38] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:56:38] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:04:19] paravoid: will you be available for some review later this afternoon or tomorrow ? 
That is for the Zuul puppet classes I have been working on on labs [10:06:35] yeah [10:07:02] should not cause too much trouble, except the git::clone stuff :-] [10:08:16] !log nikerabbit synchronized php-1.21wmf2/extensions/NewUserMessage/NewUserMessage.class.php 'I0f93ee53' [10:08:29] Logged the message, Master [11:27:17] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [12:07:29] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:26:50] PROBLEM - SSH on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:04] RECOVERY - SSH on srv229 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:47:01] New patchset: J; "Bug 41528 - need more memory for video thumbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30773 [12:58:37] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:05:58] PROBLEM - Memcached on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:01] PROBLEM - SSH on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:04] RECOVERY - Memcached on srv229 is OK: TCP OK - 0.002 second response time on port 11000 [13:14:49] RECOVERY - SSH on srv229 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:23:13] PROBLEM - SSH on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:07] PROBLEM - Memcached on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:52] RECOVERY - Memcached on srv229 is OK: TCP OK - 0.002 second response time on port 11000 [13:34:51] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:57] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.506 second response time [13:35:57] RECOVERY - SSH on srv229 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:37:24] PROBLEM - Puppet freshness on analytics1001 is
CRITICAL: Puppet has not run in the last 10 hours [13:37:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:37:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:41:00] PROBLEM - SSH on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:03] PROBLEM - Memcached on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:15] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:45] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.378 second response time [14:00:30] PROBLEM - NTP on srv229 is CRITICAL: NTP CRITICAL: No response from NTP server [14:01:47] New patchset: Mark Bergsma; "Add bits@eqiad as esams bits backend as well, in round-robin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30777 [14:03:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30777 [14:08:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:09:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.526 seconds [14:17:03] New patchset: Mark Bergsma; "Random director is easier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30779 [14:17:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30779 [14:26:36] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:26:56] New patchset: Mark Bergsma; "Double bits cache memory to 2 GB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30781 [14:27:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30781 [14:28:15] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.076 second response time [14:31:42] 
PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:12] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.775 second response time [14:38:32] !log powercycling srv229, dead for unknown reason so far [14:38:45] Logged the message, Master [14:38:58] I tried stopping and restarting memcached on it but it never got done with the stop part [14:39:12] (took ages to get on the host, etc) [14:39:42] use the server admin log? [14:40:08] I would have logged it if it ever completed [14:40:22] but it wasn't going to, a powercycle was next anyways [14:40:52] it was certainly in swapdeath but dunno why [14:41:54] RECOVERY - SSH on srv229 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:43:00] RECOVERY - Memcached on srv229 is OK: TCP OK - 0.006 second response time on port 11000 [14:43:00] New patchset: Hashar; "/etc/wikimedia-cluster containing $::cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30784 [14:44:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:44:16] New review: Mark Bergsma; "In that case it should be called "site" for consistency. cluster is very ambiguous." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/30784 [14:44:47] mark: indeed :) [14:44:55] mark: I thought about $::datacenter hehe [14:45:03] RECOVERY - NTP on srv229 is OK: NTP OK: Offset -0.05753302574 secs [14:45:23] New review: Mark Bergsma; "A better idea btw would be to make one file containing some variables (realm, site, etc.), which can..." 
[operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/30784 [14:45:52] i merged your -lGeoIP change btw [14:46:09] seen that, thanks :-] [14:46:20] thanks for fixing my issue too ;) [14:46:26] got that while setting up varnish on a Precise labs instance [14:46:36] got that while setting up varnish on precise in production yesterday [14:46:43] so i had to fix it, hehe [14:48:01] might find other bugs though [14:48:01] New patchset: Hashar; "/etc/wikimedia-site containing $::site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30784 [14:48:06] I haven't reloaded my varnish instance yet [14:48:29] it's working fine [14:48:31] it's serving european traffic since this morning [14:48:43] anyway, check my other comment on that gerrit change [14:48:57] New review: Hashar; "renamed the file to /etc/wikimedia-site which now provides $::site. Will add a new file containing ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30784 [14:49:04] templates for the win! [14:49:15] something like /etc/wikimedia-conf.php ? [14:49:18] not even needed [14:49:43] you can just do file { "/etc/wikimedia-vars": content => "WIKIMEDIA_REALM=$::realm\nWIKIMEDIA_SITE=$::site\n" } ;) [14:49:53] .php ?! [14:50:10] that is going to be loaded from the MediaWiki configuration files (commonsettings.php / initialisettings.php ...) [14:50:22] we could provide both a shell and a php version [14:50:27] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [14:50:36] meh [14:50:42] can't mediawiki just parse the very simple shell version? [14:51:00] then we have to write a parsing function [14:51:10] whereas we could just include("/etc/wikimedia.php"); [14:51:27] now I want a python version as well [14:51:34] let s do it! 
[14:51:35] and a C object [14:51:52] perhaps a java servlet as well [14:51:52] yeah so hmm [14:52:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.563 seconds [14:53:42] <^demon> mark: How about putting that in /etc/defaults/? [14:53:55] New patchset: J; "Dont overwrite $wgMaxShellMemory in labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30785 [14:54:21] it's not really for a specific program though [14:54:30] lsb_release isn't in defaults either [14:54:48] RECOVERY - Puppet freshness on srv229 is OK: puppet ran at Tue Oct 30 14:54:34 UTC 2012 [14:55:21] <^demon> Hmm, ok. /etc makes enough sense. [14:55:31] so shell format [14:55:34] and we parse it in PHP ? [14:55:42] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [14:55:50] I am a bit afraid of our custom function choking whenever the file is not the expected format [14:55:50] i do agree that if it needs to be used from php, separate files would be easier [14:56:09] just read contents and strip odd chars [14:56:29] (we could use a neutral format such as json, yaml or xml) but I guess it does not play nice with bash :/ [14:56:39] <^demon> Why not just plain text? [14:56:56] <^demon> I see no harm in something like /etc/wikimedia-realm being just a text file. [14:57:07] like key\tvalue [14:57:07] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [14:57:28] just use your original idea of separate files if it becomes complicated [14:57:37] <^demon> Separate files is best imho. [14:57:40] <^demon> Easiest to maintain. [14:57:40] my idea was to facilitate bash scripts, but that's not the only use [14:57:42] <^demon> Less to parse. [14:57:43] agreed [14:57:51] fine [14:57:56] should we put them in /etc/wikimedia/ ? [14:58:05] named conf or something? 
[14:58:21] that's a directory to maintain, more complicated ;) [14:58:26] /etc/wikimedia-site is fine for now [15:00:14] mark: means you merge in https://gerrit.wikimedia.org/r/#/c/30784/ ? ; ) [15:00:38] yes [15:00:49] New review: Mark Bergsma; "u" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/30784 [15:01:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30784 [15:02:17] thanks! [15:09:41] RECOVERY - SSH on storage3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:18:31] New patchset: Mark Bergsma; "Add missing servers strontium/palladium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30787 [15:18:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30787 [15:27:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:41] mark: mediawiki-config already use $site to describe the project name ;-] [15:30:59] whereas puppet / ops assume site is the datacenter name :-] [15:31:01] lovely semantic issue [15:31:52] hehe [15:32:03] I guess we will load wikimedia-site in our variable [15:32:11] yes [15:32:12] named something like $datacenter or $wikimedia-dc [15:32:48] if we were a big company, we would create a dictionary of the data and hire a team of consultant to write us an ETL [15:32:58] that would take care of the transformation between groups :-] [15:33:04] of course that would cost millions of dollars [15:33:11] and takes a few years [15:33:16] Heh, Wikimedia DC has a whole different meaning... [15:33:28] It's a chapter for the District of Columbia. [15:41:45] also a site can span multiple datacenters [15:42:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [16:00:05] notpeter, can you help me with the Solr package? 
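The /etc/wikimedia-site discussion above settled on one value per file rather than a shell-format variables file that MediaWiki's PHP config would have to parse. A minimal sketch of the two options that were debated (illustrative only; the log does not show the code that was actually deployed):

```python
def read_single_value(text: str) -> str:
    """Read a one-value file like /etc/wikimedia-site: take the contents
    and strip surrounding whitespace, per 'just read contents and strip
    odd chars' at [14:56:09]."""
    return text.strip()

def parse_shell_vars(text: str) -> dict:
    """Parse a shell-format file such as the proposed /etc/wikimedia-vars
    (WIKIMEDIA_REALM=..., WIKIMEDIA_SITE=...) -- the custom parser the
    discussion wanted to avoid having to write and harden."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        out[key.strip()] = value.strip()
    return out
```

Separate single-value files keep every consumer trivial: bash reads `$(cat /etc/wikimedia-site)`, PHP reads `trim(file_get_contents(...))`, and there is no custom parser to choke when the file is not in the expected format, which was exactly the worry raised at [14:55:50].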
[16:00:21] New patchset: Anomie; "Add switching for eqiad-specific configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30792 [16:03:27] New review: J; "from reading the source code there is currently no way to just add headers:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29768 [16:07:08] paravoid: you around? [16:07:09] yes [16:07:36] paravoid: I might need your help in the next 30 minutes [16:07:44] for? [16:07:50] paravoid: on VUMI USSD stuff on Silver [16:07:53] paravoid: will you be available? [16:07:55] ah, hi Patrick [16:07:58] I will [16:08:25] and you just reminded me to reply to Dan, which I forgot to do earlier. [16:08:28] paravoid: sorry, I didn't realize I lost my nick [16:09:30] paravoid: can you run sudo supervisorctl status [16:09:52] paravoid: and send me the output on pastebin.mozilla.org [16:10:14] all of them RUNNING [16:10:58] paravoid: can I have the output please [16:11:27] http://pastebin.com/Rjd0sQLm [16:11:48] paravoid: not the pastebin I requested [16:12:10] no, does it matter?! [16:12:24] paravoid: you forced me to see ads [16:12:28] paravoid: not nice ;-) [16:12:43] dude, have you heard of adblock? :-) [16:12:55] paravoid: I shouldn't have to use adblock [16:15:04] what are you looking for? [16:15:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:15:37] relaying commands that you tell me to run doesn't make much sense [16:16:01] paravoid: that is why I need root [16:16:08] either you should get access and do it, or we (ops) should understand what's going on there and try to fix things [16:16:23] you know, "operate" :-) [16:16:43] paravoid: also for the record I don't like "AdBlock" because it can access your data on all websites and access your tabs and browsing activity [16:17:03] paravoid: we are on a conference call with TATA in India right now [16:17:14] paravoid: would you like to join the call?
[16:18:12] heh, I guess that's what I get for asking, isn't it? :-) [16:18:36] paravoid: ha ha ha [16:18:56] paravoid: International Callers [16:18:56] 0091-22-67934444 / +91-22-67914444/55 [16:19:09] Participant Pin Code [16:19:09] 2225886 [16:20:08] do you need me there? [16:20:16] sbernardin: hi & welcome! :) [16:20:58] paravoid: I only need root access [16:21:05] hi paravoid...Thanks [16:21:12] paravoid: But if you want to hear the call you can too [16:21:23] hi sbernardin ! [16:21:31] Everyone! sbernardin is our new data center contractor in Tampa. [16:21:53] paravoid: otherwise I'm forced to relay commands that I tell you to run but that doesn't make much sense [16:21:59] Please welcome him [16:22:00] sbernardin: welcome [16:22:01] welcome, sbernardin! [16:22:58] welcome, sbernardin [16:23:20] Thanks everyone...happy to be on board! [16:27:16] New patchset: Demon; "Set isGithubRepo = true for github replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30796 [16:29:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [16:30:18] new people? [16:30:29] welcome sbernardin [16:30:33] New patchset: Demon; "Resource references should be capitalized" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30797 [16:31:29] Change abandoned: Demon; "This didn't work like I'd hoped." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30797 [16:33:54] paravoid: ping [16:34:12] woosters: is anybody from operations available right now? 
[16:34:24] woosters: I'm on a call with TATA in India [16:34:37] woosters: nevermind paravoid responded [16:34:37] let me try to get paravoid [16:34:41] ok [16:34:51] woosters: he already responded thanks [16:38:14] !log stopped stray jobrunner on srv278 [16:38:29] Logged the message, Master [16:38:30] slow bots get beaten [16:44:51] New patchset: Mark Bergsma; "Prepare for Varnish upload @ esams: parameterize storage sizes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30799 [16:45:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30799 [16:54:02] !log reedy synchronized wmf-config/ExtensionMessages-1.21wmf3.php [16:54:15] Logged the message, Master [17:01:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:27] Jeff_Green: nice work on apache-fast-test [17:11:20] New patchset: Mark Bergsma; "Define backends for upload-backend in esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30802 [17:14:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.879 seconds [17:14:15] Reedy: ah you're using it? great! [17:14:34] Yup, mutante mentioned it [17:14:58] ya. that was written out of pure fear . . . of production apache conf changes.
[17:15:06] Had a suspicion 2 apaches were out of sync, that confirmed which it was [17:15:14] Which makes finding out which to kick MUCH easier [17:15:19] yup [17:15:28] turns out we're frequently out of sync [17:15:31] New patchset: Mark Bergsma; "Define backends for upload-backend in esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30802 [17:15:54] !log reedy synchronized wmf-config/omgtestfile [17:16:09] Logged the message, Master [17:17:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30802 [17:19:49] !log reedy synchronized wmf-config/CommonSettings.php 'wgMaxAnimatedGifArea = 2.5e7' [17:20:02] Logged the message, Master [17:24:11] kaldari: hi there! [17:24:11] (this is faidon) [17:25:34] New review: Demon; "This won't change anything on merge, it's just pre-configuring something before I deploy the hack." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/30796 [17:26:39] garg stupid git tricks [17:26:40] paravoid: Oh hey :) [17:27:05] what does one do about the dreaded ahead-of-origin/production by 2 commits [17:27:16] Heh, I never put the IRC handle together with the face [17:28:09] paravoid: Anyway, like I said, we don't need very accurate IPv6 look-up, just something that will return a country value that is hopefully close to their actual country. [17:28:56] We can even add the lookup support ourselves, if Ops is willing to review the code and sign-off on it. [17:33:01] at this point it would be less risky for us to do that than changing CentralNotice back to the old banner-loading scheme [17:35:23] Jeff_Green: so, how's the ganglia [17:35:53] the part you fixed is great, the part I need to fix via puppet is enmired in git stupidity [17:36:23] fun times [17:38:05] so, mind if i lock down the payments <-> admin zone again ? 
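The apache-fast-test exchange above ([17:15:06]-[17:15:28]) boils down to: send the same request to every apache and flag the one whose answer differs. A hypothetical sketch of that comparison step (the log only says the tool confirmed which apache was out of sync, not how it is implemented):

```python
from collections import defaultdict

def find_out_of_sync(responses):
    """Given {host: response_body} collected by fetching the same URL
    from every apache, return the hosts whose body differs from the
    majority -- the out-of-sync backends to kick."""
    groups = defaultdict(list)
    for host, body in responses.items():
        groups[body].append(host)
    majority = max(groups.values(), key=len)
    return sorted(h for hosts in groups.values()
                  if hosts is not majority for h in hosts)
```

With `{"srv1": "a", "srv2": "a", "srv3": "b"}` this returns `["srv3"]`; when all backends agree it returns an empty list, which is the state "turns out we're frequently out of sync" says was not being reached.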
[17:38:16] kaldari: hey, sorry, was in a phonecall [17:38:17] also, i'll open up port 80 to maxmind [17:38:51] so, I don't recommend using this for the reasons I stated, but in the end, the decision is yours/fr's [17:39:13] have in mind that it's not about being inaccurate, the database is also incomplete [17:39:30] the database's README: http://geolite.maxmind.com/download/geoip/database/GeoLiteCityv6-beta/README [17:39:38] BETA GeoLiteCityv6 BETA [17:39:41] Here is the first GeoLiteCityv6 database to resolve IPv6 addresses. The current [17:39:45] IPv6 support is rather poor and is currently only a GeoLiteCity database with [17:39:53] teredo and 6to4 support on a city level. [17:39:57] <^demon> paravoid: Any chance you could poke https://gerrit.wikimedia.org/r/#/c/30796/ for me? [17:39:59] I wouldn't use something that it's stated as "rather poor" [17:40:30] We don't use city resolution at all, only country. How functional do you estimate the country-level resolution? [17:40:45] LeslieCarr: sure, go for it [17:41:03] ottomata: i did a restart on analytics1003 ganglia --- i want to not touch it whatsoever for a few hours [17:41:05] kaldari: I think we shouldn't risk it [17:42:15] paravoid: kaldari: if we don't use that data, then we return country code XX for all IPv6 and default to "Unknown country" [17:42:31] what about what asher sent? [17:42:59] pgehres: paravoid is suggesting we move back to geoiplookup.wikimedia.org and rely on the dual-stack behaviour [17:43:33] K4-713: my only fear with that is getting back to the insane lookup times [17:43:46] oops, sorry K4-713, meant for kaldari [17:43:48] ottomata: well that failed :-/ [17:44:03] * K4-713 goes back to her stack trace [17:44:04] paravoid: that's even more risky for us, but we may trial it at some point during the fundraiser [17:44:21] asher's thing you mean? 
[17:44:29] yeah [17:44:36] ottomata: it looks like it stopped right when "Oct 30 17:43:01 analytics1003 CRON[12767]: (root) CMD ([ -f /var/lib/puppet/state/puppetdlock ] && find /var/lib/puppet/state/puppetdlock -ctime +1 -delete)" happened [17:44:37] well, what he mentioned [17:44:46] we just came up with that idea last week [17:45:26] kaldari: we can certainly try it /me shurgs [17:46:28] K4-713: did you guys ever figure out the issue with meta/upload.wikimedia.org. I wonder if it's a similar issue to the geoiplookup.wikimedia.org load problems. [17:46:50] kaldari: part of that issue seemed to be wikiminiatlas [17:46:57] I think we got as far as seeing that there was an issue... [17:47:10] how? [17:47:10] what data did you use? [17:47:25] Watchmouse browser load timing data. [17:47:54] have you seen the recent thread about watchmouse data? [17:48:14] I have probably seen them all, yes. Which is why the process has come to something of a halt. [17:48:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:36] kaldari: so, an example that I used the other day, was that a friend got a new IPv4 and a new IPv6 space the same day [17:49:50] that was about a month ago [17:50:33] MaxMind now correctly geolocates the IPv4, but can't find the IPv6 at all [17:50:51] I expect more of such cases too [17:51:06] if MaxMind doesn't find the IPv6, what does it return? [17:51:23] I'm trying on their web interface now [17:51:33] since we don't have code to test against that, although I can easily make that too [17:52:00] the web lookup says "The ip address you passed is not in our database" [17:52:46] New patchset: Pyoungmeister; "temp removing self from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30807 [17:53:19] OK, so in those cases we would probably have it return 'XX' or something to know it failed... 
[17:53:21] !log Installed Precise on new upload esams servers cp3003-cp3010 [17:53:34] Logged the message, Master [17:53:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30807 [17:53:58] paravoid: I agree that using dual stack is far preferable, but here's the downside for doing that... [17:54:13] kaldari: bits.mw.org/geoiplookup returns "Geo = {}" when I try it from my IPv6-enabled desktop [17:55:18] ...but that lookup is not ipv6 enabled [17:56:12] mark: I know, I'm just saying that the current code returns when it fails [17:56:27] fails to find a match [17:56:33] I think {} is better than returning "XX" :) [17:57:49] New patchset: Jgreen; "move fundraising db config to role/fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30808 [17:58:59] Adding the extra hostname lookup is apparently a significant variable on load time. Right now, the banner loading process starts as soon as the SiteNotice div is loaded in the page. If the GeoIPLookup hasn't completed by that time, they are not assigned to a country and they don't receive any banners. 
[17:59:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30808 [18:00:14] In some browsers it will also block loading of further content [18:01:48] Even if we were to not support IPv6 at all, we might still want to keep the lookups on bits in order to have reliably fast lookups [18:02:11] so that banners load quickly for everyone [18:02:54] it's a trade-off for us either way [18:03:30] it would be nice if maxmind had a "confidence" parameter [18:03:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [18:04:44] I think it does [18:05:27] but we can't/want to check v6 and if confidence is low retry on v4 [18:05:40] let me discuss with other fundraising folks and see what other people think about this [18:05:41] /* confidence factor for Country/Region/City/Postal */ [18:05:41] unsigned char country_conf, region_conf, city_conf, postal_conf; [18:05:44] int accuracy_radius; [18:06:51] would also love to get further input from Jeff_Green if he has any thoughts on it [18:07:36] kaldari: sorry--i've been heads down on a puppet/git/ganglia issue. reading backscroll [18:08:33] Jeff_Green: basically the suggestion is moving geoiplookup back to it's own hostname so that we can use dual-stack fallback for IPv6 users [18:08:57] my concern is how this will affect banner loading for everyone else [18:11:28] paravoid: And I assume that removing IPv6 DNS entries for bits isn't an option [18:12:15] kaldari: could you use javascript to choose where to do the lookup? [18:12:42] probably, that's a sort of brilliant idea [18:12:45] :) [18:12:50] ha [18:13:27] hahaha [18:13:34] i'll trade. can you find puppet and terminate (in the murder sense) it for me? [18:13:55] lemme confer with the fundraising devs and see if we can do that. It seems like it would work though [18:14:17] New review: Hashar; "Overall we have a real mess in our configuration file which Brad is cleaning up." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30792 [18:14:20] fwiw, I fixed the Varnish code to read the IPv6 database [18:14:40] Jeff_Green: I am friends with the inventor of puppet, so I'll pass along the suggestion to him ;) [18:14:55] kaldari: thank you [18:14:55] geoipupdate doesn't update the free GeoLite databases though, so some simple puppet machinery is needed to update that too [18:14:58] he's also from Nashville [18:15:12] which might explain some things [18:15:25] i was just running that in my head... [18:15:28] it may explain all the twang anyway [18:17:12] paravoid: thanks, I'm sure we'll need to migrate it to fully support IPv6 at some point, even if it isn't this year [18:24:15] !log giving sbernardin racktables permissions [18:24:23] Logged the message, Master [18:24:51] cmjohnson1: i think it should work now for Steve. He was lacking permissions [18:25:17] ah..okay [18:25:26] Wikimedia Foundation : Main page : Configuration : Permissions [18:25:26] binasher: is the CentralNotice schema change in place for today's deployment? [18:25:36] cmjohnson1: allow {$username_sbernardin} [18:25:49] kaldari: i'm doing it right now [18:25:49] cool [18:25:54] binasher: cool, thanks [18:26:22] kaldari: yeah, and the API is the same for GeoIP and GeoLite (free) databases [18:26:43] although the current scheme of AJAX requests sounds a bit wasteful to me, but if you like it... 
:) [18:28:54] paravoid: Oh no, I hate the current loading scheme :) [18:30:40] paravoid: I'm looking forward to implementing the all-in-one solution on varnish (which may also require a bit of assistance from Ops) [18:31:40] !log completed centralnotice patch-bucketing.sql migration [18:31:52] we just don't want to make such a dramatic change immediately before the fundraiser starts [18:31:53] Logged the message, Master [18:32:08] nod [18:37:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:15] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [18:41:26] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [18:45:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26571 [18:45:21] New patchset: preilly; "update ACLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30814 [18:45:30] binasher: ^^ [18:46:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30814 [18:46:10] New review: Hashar; "wmf-config/transcoding-wmflabs.php still has a $wgMaxShellMemory set to 3000000 (3GB?). Maybe you c..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/30785 [18:48:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.991 seconds [18:58:55] New review: J; "wmf-config/transcoding-wmflabs.php is not in git and not loaded anywhere, so yes, that should be rem..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/30785 [19:06:00] New patchset: Jgreen; "moving fundraising db config from role/db.pp to role/fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30818 [19:08:05] New review: jan; "I have reduced the number of classes and have created the two classes php::most_used and php::nearly..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [19:08:30] New patchset: jan; "Refactor the webserver classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30147 [19:14:17] binasher: I forgot to request the CentralNotice schema change as test as well. Is there any chance of getting that in place? [19:14:39] ...schema change to testwiki as well... [19:17:05] is Asher at lunch now? [19:17:20] New patchset: Jgreen; "moving fundraising db config from role/db.pp to role/fundraising.pp (typo)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30818 [19:17:36] kaldari: is it an expensive change? [19:18:20] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30818 [19:18:22] Reedy: https://gerrit.wikimedia.org/r/#/c/29614/10/patches/patch-bucketing.sql [19:18:50] kaldari: so... [19:18:50] Just on meta? [19:18:56] kaldari: how do we proceed? [19:19:05] it's already in place on meta, just need it on test wiki as well [19:19:42] paravoid: We're going to move ahead with Jeff's solution... 
[19:20:11] paravoid: which means that we don't need IPv6 data in the lookups [19:20:40] paravoid: although we should plan to add it once MaxMind's data is more reliable [19:21:37] Reedy: I'll buy you a wikibeer for a schema update to test.wiki :) [19:22:04] oh, if it's on testwiki [19:22:14] or if you think it's fine to do it myself, I can do it [19:23:05] Yeah, just do it [19:23:08] The first 2 tables have 2 rows in them [19:23:20] the second has 172 [19:23:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:39] and it's just adding columns, so it'll take only a few seconds per item [19:24:57] done [19:25:26] :) [19:27:10] kaldari: crap! we have a IndexPager bug *grumble* [19:30:30] New patchset: Matthias Mullie; "GeoCrumbs maintenance" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30822 [19:31:28] puppet is exploding badly [19:36:04] can I get a second set of eyes on stafford/puppet, it's exploding and I'm concerned it's somehow related to the change I jsut merged [19:36:23] I don't *think* so but it worries me [19:38:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [19:39:07] New patchset: Anomie; "Add ability for switching for eqiad-specific configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30792 [19:40:30] New patchset: Kaldari; "Update settings for CentralNotice banner loading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30823 [19:40:54] New review: Anomie; "So this version goes to using $realm for all the switching, which is logically equivalent to the old..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/30792 [19:43:35] New patchset: Kaldari; "Update settings for CentralNotice banner loading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30823 [19:44:22] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30823 [19:44:40] New patchset: Anomie; "Add ability for switching for eqiad-specific configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30792 [19:45:32] New review: Anomie; "Rebase" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/30792 [19:46:46] !log kaldari synchronized wmf-config/CommonSettings.php 'updating settings for CentralNotice banner loading' [19:46:59] Logged the message, Master [19:48:28] New patchset: Jgreen; "fundraising db config cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30825 [19:49:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30825 [19:54:53] LeslieCarr: I finally got the rest of the hosts onto the fundraising ganglia multicast IP and it's working [19:56:36] RECOVERY - Puppet freshness on storage3 is OK: puppet ran at Tue Oct 30 19:56:12 UTC 2012 [19:57:42] New patchset: Jgreen; "more fundraising db config cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30826 [19:58:14] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [19:58:14] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:58:14] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:58:48] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30826 [19:59:36] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay seconds [19:59:41] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [20:05:43] Jeff_Green: 
huzzah [20:05:44] :) [20:06:14] puppet gave me a heart attackackackackack though [20:06:32] New patchset: Jgreen; "last bit of fundraising db conf cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30828 [20:06:47] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:42] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30828 [20:08:28] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [20:10:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:21] New patchset: Kaldari; "Update settings for CentralNotice bannerloading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30829 [20:19:57] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30829 [20:21:44] !log kaldari synchronized wmf-config/CommonSettings.php 'updating wgCentralBannerDispatcher for CentralNotice' [20:21:57] Logged the message, Master [20:22:20] awight, mwalker: config updates complete [20:22:34] about to run scap if you guys are ready [20:22:41] yep yep [20:22:43] k [20:23:27] hold on to your seats [20:23:39] scap launched! [20:25:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [20:28:15] bigdelete on the way warning in #-tech [20:28:28] can someone say they're around to watch, etc.? [20:29:42] LeslieCarr: ^ ? [20:29:53] New patchset: Faidon; "Add IPv6 GeoIP support to Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30836 [20:30:11] or paravoid is around? ;) [20:31:01] paravoid: what about the "most IPv6 people are tunneled so geo will be wrong" concern? [20:31:16] ok [20:31:44] LeslieCarr: so he/she should pull the trigger? [20:31:46] jeremyb: it still applies. I don't recommend to use this. 
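Putting together mark's earlier suggestion (use JavaScript to choose where to do the lookup) with the IPv6 caveat paravoid notes above, the dual-stack fallback could look roughly like this Python stand-in for the client logic. Both hostnames and the `fetch` callable are illustrative assumptions, not real endpoints:

```python
def geo_lookup_with_fallback(fetch,
                             dual_stack_host="geoiplookup.example.org",
                             v4_only_host="geoiplookup.ipv4.example.org"):
    """Try the dual-stack lookup host first; if the (possibly IPv6)
    lookup finds no country -- e.g. an address missing from MaxMind's
    v6 database -- retry against an IPv4-only hostname.

    `fetch` is a caller-supplied function: hostname -> dict (or None).
    """
    geo = fetch(dual_stack_host) or {}
    if geo.get("country"):
        return geo
    # Fall back: a v4-only hostname forces the client to connect
    # over IPv4, whose geolocation data is more complete.
    return fetch(v4_only_host) or {}
```

The cost, as discussed above, is a possible extra hostname lookup on the slow path, which is exactly the banner-load-time trade-off fundraising was weighing.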
[20:31:58] jeremyb: I implemented this nevertheless, so as to give the fr-tech guys more options [20:32:13] heh, okey [20:36:08] honestly, I'm afraid to say "pull the trigger" jeremyb -- i'd rather wait until another ops person who's better at DB's is around [20:36:22] unless he's got a rate limiting script or something [20:37:30] !log kaldari Started syncing Wikimedia installation... : [20:37:30] LeslieCarr: erm? it's just clicking delete once on a page that is too big for a normal sysop to be permitted to delete. (there's a separate permission "bigdelete") [20:37:33] ah ok [20:37:36] LeslieCarr: so no rate limiting is possible [20:37:39] Logged the message, Master [20:37:51] i don't see a binasher or domas [20:38:06] idk who else knows DBs that well [20:38:08] hrm [20:38:14] Ryan? [20:38:52] idk, you tell me ;) [20:38:58] hehe [20:54:07] !log stopping mysql on db1013 for cloning [20:54:19] Logged the message, Master [20:54:57] Jeff_Green: do you know who I need to talk to in order to get something on bits purged? [20:55:12] not really [20:55:17] probably mark or asher? [20:55:40] mark: do you happen to know anything about purging resource loader entities on bits? [20:56:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:57:11] Jeff_Green: how comfortable are you with DBs? [20:57:22] 6! [20:57:46] errr, hrmmm [20:58:01] that's an index to some hashtable i guess ;) [20:58:15] on a scale of "1 to thrilling" [20:58:38] PROBLEM - mysqld processes on db1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:59:32] mark: actually; sorry ignore me, it looks like I'm still waiting for a scap run to complete for the resource loader things to invalidate [20:59:42] Jeff_Green: a steward's been waiting to run a page delete which may break the DB.
looking for someone to be around in case the DB falls over [21:00:11] i can't do it today--it's 5P here and i've got about 30 mins before I have parenting duties [21:00:25] oh, right you're in my TZ [21:00:31] ya--sorry [21:00:39] you're in NYC? [21:00:43] ya [21:01:01] you have power and you're online--that's good! [21:01:17] my microwave clock has not needed a reset [21:01:25] nice [21:01:36] but i saw some flickers [21:04:57] New patchset: Dzahn; "update star.wikipedia.org SSL cert, key stays the same (RT-3639)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30889 [21:06:59] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30889 [21:08:59] !log kaldari synchronized php-1.21wmf2/extensions/CentralNotice/CentralNotice.php 'updating centralNotice to autoload ApiCentralNoticeAllocations class' [21:09:13] Logged the message, Master [21:09:55] mutante: are you interested in DB babysitting? (see above) otherwise i guess i wait for someone else to come online [21:12:03] * jeremyb wonders why logmsgbot and gerrit-wm are both duplicated [21:13:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds [21:14:50] !log kaldari synchronized php-1.21wmf2/extensions/CentralNotice 'updating centralNotice to autoload ApiCentralNoticeAllocations class' [21:15:06] Logged the message, Master [21:20:16] New patchset: Jgreen; "fixed typo ptmpa to pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30894 [21:22:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30894 [21:23:41] !log kaldari Finished syncing Wikimedia installation... 
: [21:23:57] Logged the message, Master [21:24:32] !log kaldari synchronized php-1.21wmf2/extensions/CentralNotice 'updating centralNotice to autoload ApiCentralNoticeAllocations class' [21:24:46] Logged the message, Master [21:30:30] !log kaldari synchronized php-1.21wmf3/extensions/CentralNotice 'updating centralNotice to autoload ApiCentralNoticeAllocations class' [21:30:43] Logged the message, Master [21:31:14] Reedy: wmf3 is updated now as well. Thanks for the assistance! [21:31:47] New patchset: Jgreen; "changing substring match for fr-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30895 [21:32:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30895 [21:35:27] New patchset: Reedy; "Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28656 [21:36:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28656 [21:37:22] New patchset: Reedy; "Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28625 [21:38:58] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28625 [21:40:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30436 [21:43:39] !log reedy synchronized wmf-config/InitialiseSettings.php [21:43:49] Logged the message, Master [21:45:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:02] !log restarting nginx on ssl hosts for cert update [21:47:15] Logged the message, Master [21:54:22] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29824 [21:57:29] !log asher synchronized wmf-config/mc.php 'testing MemcachedPeclBagOStuff on test2wiki' [21:57:42] Logged the message, Master [21:58:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 
Bad Request - 336 bytes in 0.192 seconds [21:58:35] testwiki: Memcached error for key "testwiki:messages:en:lock" on server "10.0.12.11:11000": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [21:58:58] poor test2wiki [21:59:12] that commit had the wrong ports heh [21:59:46] no one notices that? :D [21:59:53] New patchset: Asher; "fix memcached ports" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30903 [22:00:08] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30903 [22:00:44] !log asher synchronized wmf-config/mc.php 'testing MemcachedPeclBagOStuff on test2wiki (serverlist fix)' [22:00:58] Logged the message, Master [22:01:00] better [22:02:02] okay, my deployment window [22:02:32] !log asher synchronized wmf-config/CommonSettings.php 'memcached logging group' [22:02:40] ok, all yours MaxSem [22:02:52] Logged the message, Master [22:02:56] :) [22:03:40] AaronSchulz: was it intended for both memcached clients to log to memcached-serious.log? [22:03:54] just the pecl one uses -serious [22:04:17] AaronSchulz: that doesn't seem to be the case [22:04:33] the php one is logging there too. on the other hand, at least its logging somewhere now [22:06:20] yeah, I see there are some debug calls in the old client to use -serious that I didn't notice [22:07:13] though we see the ip/ports so it doesn't really matter [22:07:58] sure are lots of timeouts [22:08:55] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:10:17] Aaron|home: when would you feel comfortable moving everything over to das pecl? [22:11:28] maybe we can move order a larger wiki first, and then do everything else [22:12:10] that could be wen+thur? 
[22:13:10] that sounds good to me [22:14:57] i'm going to failover the RE mastership on cr2-eqiad -- with graceful-restart routing it should be no impact [22:15:10] !log failing over RE mastership on cr2-eqiad to upgrade re1 [22:15:16] binasher: we might want to do the multiwrite thing though, so we can switch back [22:15:24] Logged the message, Mistress of the network gear. [22:15:30] Aaron|home: i think i'd like to try changing some settings [22:17:13] applyDefaultParams doesn't actually set defaults for everything [22:17:13] !log kaldari synchronized php-1.21wmf3/extensions/CentralNotice 'autoloading ApiCentralNoticeAllocations for CentralNotice' [22:17:13] Logged the message, Master [22:17:13] i think we should set compress_threshold to something [22:17:27] binasher: sure, and can change testwiki to use a multiwrite cache after that [22:17:52] do you just want to multiwrite testwiki to test multiwriting? [22:19:39] yes, I assume we will want to multiwrite to other wikis for a while before switching the read order and then removing the old caches from writes [22:19:43] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [22:20:38] LeslieCarr: is that alert scary? ^^^ [22:20:55] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 112.46 ms [22:21:14] not too scary [22:21:16] looking [22:21:28] i think i've never seen it quite like that before [22:22:25] yeah [22:23:28] New patchset: Dzahn; "protoproxy - use star.wp SSL cert rather than "test-star"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30913 [22:23:28] ipv6? pfft [22:24:35] binasher: so compress_threshold is set to 1500 [22:24:37] who uses that? it's an old protocol, more than 14 years old... [22:24:50] yeah, we're switching to ipv7 [22:26:03] Aaron|home: ah, so it is.
hmm, large [22:26:33] kaldari, you're deploying in mobile team's window - please ask next time [22:27:04] MaxSem: sorry about that! [22:27:35] Aaron|home: setting Memcached::OPT_NO_BLOCK sounds desirable [22:27:40] should be finished now, will ping you next time [22:28:11] or whoever I need to ping for that window :) [22:29:15] Aaron|home: setting Memcached::OPT_SERVER_FAILURE_LIMIT to a positive integer might be good too, it defaults to 0 [22:29:50] Memcached::OPT_RETRY_TIMEOUT would have to be set to a positive integer at the same time [22:30:06] kaldari, no problems - we're testing right now so you didn't break anything, but better ask to avoid hurting Wikipedia:) [22:30:22] MaxSem: agreed [22:30:23] LeslieCarr: ipv7.net :) [22:31:56] binasher: so if you set(x) and then get(x) you may not see the changes? [22:32:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:27] New review: Dzahn; "RT-3639" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/30913 [22:34:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30913 [22:34:36] Aaron|home: there's a small chance of that [22:35:04] not that that should be happening in the same request [22:35:22] yeah, well I bet it is in lots of places ;) [22:35:28] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [22:36:23] not really down, but lost network? [22:36:49] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:36:50] actually, not down at all. [22:36:58] yeah, better [22:39:54] binasher: I've forgotten (again). Where's the database stats/profiling information? 
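binasher's question above ("if you set(x) and then get(x) you may not see the changes?") is about `Memcached::OPT_NO_BLOCK`, which buffers writes instead of applying them synchronously. A toy model in plain Python (no real memcached involved; this only illustrates the read-your-writes caveat):

```python
class BufferedWriteCache:
    """Toy model of a non-blocking cache client: set() queues the
    write, and the 'server' only sees it after a flush. This is why
    an immediate get() after set() can return stale data."""

    def __init__(self):
        self._server = {}   # what the server has actually applied
        self._pending = []  # writes buffered client-side

    def set(self, key, value):
        self._pending.append((key, value))  # queued, not applied

    def get(self, key):
        return self._server.get(key)  # reads bypass the write buffer

    def flush(self):
        for key, value in self._pending:
            self._server[key] = value
        self._pending.clear()
```

Here `set("x", 1)` followed by `get("x")` returns `None` until `flush()` runs, mirroring the small race that binasher and Aaron agree is unlikely within one request but not impossible.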
[22:40:08] Reedy: ishmael.wikimedia.org [22:40:22] thanks [22:41:37] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [22:46:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.633 seconds [22:50:14] New patchset: Aaron Schulz; "Defined memcached multiwrite backend and switched the testwikis to use it." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30918 [22:50:48] binasher: ^ [22:52:29] Aaron|home: newMemcached is the old memcached? heh [22:52:50] well "new" as in "constructor-ish" [22:53:19] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [22:55:03] paravoid: https://gerrit.wikimedia.org/r/#/c/30919 [22:55:41] let me know if that IPv6 detection seems sane [22:59:55] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:03:15] Aaron|home: how do you feel about adding 'persistent' => true to the memcached-pecl array? although i'd be ok testing that later on [23:04:13] Tim wasn't planning on using that, but it should work (I remember ironing bugs testing it locally) [23:04:28] do you remember why? [23:04:32] I suppose we can try it [23:04:55] binasher: he thought we still had connection pooling disabled at the time [23:05:37] oh, i think i talked to him about it and was concerned about the total number of connections but it doesn't seem to be a problem based on total number of apache children we currently have [23:05:46] actually I thought that too at the time, though it turns out we were/are still using pooling with the old client [23:05:51] and when it would become a problem, it's probably udp time [23:06:09] we aren't currently, wgMemCachedPersistent = false [23:06:22] binasher: did you switch it off? [23:06:29] a while ago [23:06:39] when you were looking at the timeout value, right?
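The multiwrite migration plan discussed earlier (write every key to both the old and new memcached pools, read in a preferred order, and later flip that order before dropping the old pool from writes) can be sketched as follows, with plain dicts standing in for the two pools:

```python
class MultiWriteCache:
    """Writes go to every tier; reads fall through the tiers in
    order. During a migration the new pool is listed first for reads
    while both pools keep receiving writes, so the switch-over stays
    reversible, as Aaron notes above."""

    def __init__(self, tiers):
        self.tiers = list(tiers)  # e.g. [new_pool, old_pool]

    def set(self, key, value):
        for tier in self.tiers:
            tier[key] = value

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                return tier[key]
        return None
```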
[23:06:39] there's something fundamentally broken with it in the old client [23:08:16] !log replaced wikipedia SSL cert with new one by GeoTrust [23:08:31] Logged the message, Master [23:12:34] Aaron|home: i guess there's no point to using persistent connections in the new client right now either [23:13:21] we still have MaxRequestsPerChild 4 set, and apache children live for < 1 minute [23:14:48] yeah, we can look at it later [23:14:56] heads up: we've started receiving a certificate warning on mobile WP [23:14:56] Aaron|home: do you know if TimStarling ever enforced a hard 1.5mil pp-node limit? [23:14:59] mutante, ^^^ [23:15:16] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30918 [23:16:02] the highest pp-node-count for an article is now frwiki: 1235773 Ore [23:16:14] MaxSem: before or after the change? [23:16:19] mutante, right now [23:16:20] and the next 100 highest are all on zhwiki [23:16:29] MaxSem: what does it say [23:16:29] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30822 [23:16:34] binasher: yes [23:16:49] wrong host name [23:17:19] MaxSem: wth...did they not include the m. in this one? thanks..checking [23:17:33] mutante, the new cert is just *.wikipedia.org [23:17:40] arrg [23:17:57] it was supposed to replace the older one which had both [23:17:58] so it should be safe to set MaxRequestPerChild back to what it was [23:18:20] MaxSem: are you still in your deploy window? just checking [23:19:08] binasher, yes - for 40 minutes, about to run scap [23:19:08] i'm surprised to see lots of jobqueue deadlocks on JobQueueDB::claim just for wikidatawiki [23:19:32] Yeah.. I noticed those earlier [23:19:42] is wikidatawiki one of the only wikis using the new code? [23:19:55] err [23:19:56] When did we merge it..
[23:20:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:26] testwiki, test2wiki, mediawikiwiki and wikidatawiki are all on 1.21wmf3.. [23:21:04] New patchset: Dzahn; "do not use new SSL cert for mobile, does not include the .m... grrr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30921 [23:21:08] MaxSem: fixing, should be gone soon [23:21:31] thanks [23:21:38] binasher: I think so, yeah [23:22:04] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30921 [23:25:46] getting reports of slow page saves and occasional timeouts on save on en.wiki: Wikipedia:Village_pump_(technical)#Very_slow_page_loading [23:25:51] Example error: [23:25:57] Request: POST http://en.wikipedia.org/w/index.php?title=Talk:Derek_McCulloch&action=submit, from 216.38.130.162 via cp1011.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.132 (10.64.0.132) [23:25:57] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 30 Oct 2012 23:22:30 GMT [23:28:16] Aaron|home: job_cmd_token index gap lock is at issue: RECORD LOCKS space id 0 page no 5563409 n bits 144 index `job_cmd_token` of table `wikidatawiki`.`job` trx id 4B9B23CD9 lock_mode X locks gap before rec insert intention waiting [23:28:16] * MaxSem runs scap [23:28:31] kaldari: thanks, going to take a look [23:28:39] insert? [23:28:59] * Aaron|home was looking at the wiki and all of its 9 jobs [23:29:20] nope, it's all on the JobQueueDB::claim update query [23:30:48] where is the "insert intention waiting" coming from? [23:33:21] kaldari: a lot of what's in that village pump report doesn't seem ops/perf related, but like very strange software behavior [23:33:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.966 seconds [23:34:31] binasher: I'll see if there's anything specific to those articles... [23:35:00] "While I'm editing, it suddenly decides to do a save without my telling it to.
Or, it will throw me out of the edit box altogether and not save anything." [23:36:28] binasher: yeah, no idea about that one [23:36:45] nothing currently looks amiss on the enwiki or es masters [23:37:17] yeah, the graphs look ok [23:38:32] eh, Warning: include_once(): Failed opening '/home/wikipedia/common/php-1.21wmf3/extensions/TocTree/TocTree.php' for inclusion [23:38:34] during scap [23:38:36] btw, Aaron|home: thank you for this! https://ishmael.wikimedia.org/sample/?hours=260&host=db63 [23:38:38] Reedy, ^^^ [23:38:54] that's based on the time of all queries running against the enwiki master, not actually slow ones [23:38:55] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [23:38:55] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:38:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:38:55] MaxSem: it's alright, ignore them [23:39:09] Reedy, I've already aborted:P [23:39:21] restarting... [23:40:12] Reedy, and Warning: file_put_contents(/home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf3.php): failed to open stream: Permission denied in /home/wikipedia/common/php-1.21wmf3/maintenance/mergeMessageFileList.php on line 119 [23:40:17] ? [23:40:33] Can ignore that one aswell [23:40:47] Though, I should ask someone to chown it [23:42:00] Can someone please run: chown mwdeploy:wikidev /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf3.php [23:43:41] Reedy: done [23:43:47] thanks [23:43:55] np [23:44:10] !log maxsem Started syncing Wikimedia installation... : [23:44:25] Logged the message, Master [23:50:09] binasher: what do you think is with the deadlocks? 
[23:51:17] maybe it should select from a slave to get job_id and update on that (PK) [23:51:36] well select the other stuff while at it of course to make the object [23:52:18] if the select uses job_random, there shouldn't be too many jobs doing the same row [23:52:41] binasher: It doesn't seem like something silly, like not being in autocommit mode, I keep double checking the code [23:57:28] binasher: looks like it's some templates that are causing the extreme slowness, trying to isolate further.
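Aaron's suggestion above -- select a candidate job_id first (possibly from a slave), then claim it by primary key -- narrows the UPDATE to a single row, avoiding the gap lock InnoDB takes when the claim query ranges over the `job_cmd_token` index. A hedged sketch of that two-step claim using SQLite and a simplified `job` table (the real MediaWiki schema has more columns and runs on MySQL):

```python
import sqlite3

def claim_job(conn, token):
    """Two-step claim: SELECT one unclaimed job_id, then UPDATE by
    primary key with a job_token guard, so a concurrent claimer that
    won the race just leaves our rowcount at 0 instead of deadlocking
    on a range lock."""
    row = conn.execute(
        "SELECT job_id FROM job WHERE job_token = '' "
        "ORDER BY job_random LIMIT 1").fetchone()
    if row is None:
        return None  # queue is empty
    (job_id,) = row
    cur = conn.execute(
        "UPDATE job SET job_token = ? "
        "WHERE job_id = ? AND job_token = ''",
        (token, job_id))
    return job_id if cur.rowcount == 1 else None
```

Using `job_random` in the SELECT, as noted above, also spreads concurrent runners across different rows so they rarely even contend for the same claim.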