[00:35:30] New patchset: DamianZaremba; "Since the hostname used in labs is not standard and does not match the one in monitoring snmp traps fail. Fake the hostname using the instancename configured locally and the site the instance is in." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45081 [00:50:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45081 [01:57:04] PROBLEM - MySQL disk space on neon is CRITICAL: DISK CRITICAL - free space: / 354 MB (3% inode=74%): [02:26:15] !log LocalisationUpdate completed (1.21wmf7) at Tue Jan 22 02:26:14 UTC 2013 [02:26:27] Logged the message, Master [02:42:30] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 188 seconds [02:42:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 205 seconds [02:48:38] !log LocalisationUpdate completed (1.21wmf8) at Tue Jan 22 02:48:37 UTC 2013 [02:48:49] Logged the message, Master [02:52:51] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:00] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:09] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:18] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:25] are we starting the switch or what? 
[02:53:27] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:31] * Jasper_Deng sees Apaches running away [02:54:31] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [02:54:39] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [02:54:40] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [02:54:40] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [02:54:40] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [02:54:40] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:54:40] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.075 second response time [02:54:40] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.069 second response time [02:54:41] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time [02:54:41] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.077 second response time [02:54:42] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.079 second response time [02:54:49] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:54:57] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [02:54:58] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [02:54:58] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [02:55:07] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.060 second response time [02:56:45] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.080 second response time [03:00:39] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:02:01] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:27:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [03:47:10] New patchset: Ryan Lane; "Use paged results in nslcd and set mininum uid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45086 [03:47:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45086 [03:53:57] New patchset: Ryan Lane; "nss_min_uid is only valid in precise+" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45087 [03:54:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45087 [04:05:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:00] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 181 seconds [04:10:21] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 197 seconds [04:16:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.807 seconds [04:36:54] 
RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [04:37:31] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [05:07:06] New patchset: Ryan Lane; "Add more explicit scopes for groups and users" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45089 [05:09:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45089 [05:11:23] New patchset: Ryan Lane; "Followup for scoping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45090 [05:11:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45090 [05:11:39] New patchset: Asher; "EQIAD SWITCH: set $::mw_primary = eqiad (sets mobile api backend, redis+mha conf)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45091 [06:17:19] New patchset: Ryan Lane; "Follow up to scoping. Set group map properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45093 [06:19:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45093 [06:26:51] RECOVERY - MySQL disk space on neon is OK: DISK OK [06:38:21] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on srv247 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [07:11:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:17:09] New patchset: Ryan Lane; "Increase the group and passwd nscd suggested-size" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45094 [07:17:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45094 [07:18:44] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [07:39:26] New review: Hashar; "Should probably use ${::instancename} directly instead of (cat /etc/wmflabs-instancename) :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45081 [08:02:27] New patchset: DamianZaremba; "Tidying up exec to use info direct from puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45096 [08:10:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45096 [08:29:47] TimStarling: is there a way to know which is the millionth it.wiki's article? 
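For readers following the r/45081 and r/45096 exchange above (faking the SNMP trap hostname on labs, then tidying the exec to take its data straight from puppet, per Hashar's review suggestion to use ${::instancename} rather than "cat /etc/wmflabs-instancename"), here is a minimal sketch of that pattern. Class and path names below are placeholders, not the merged code:

    # Placeholder names throughout -- the shape of the change, not its text.
    class snmp::labs_trap_hostname {
        # Build the hostname monitoring expects from facts ($::instancename,
        # $::site), instead of an exec that cats /etc/wmflabs-instancename
        # at apply time.
        $trap_hostname = "${::instancename}.${::site}.wmflabs"

        file { '/etc/snmp/trap-hostname':
            ensure  => present,
            owner   => 'root',
            group   => 'root',
            mode    => '0444',
            content => "${trap_hostname}\n",
        }
    }

The "instancename.site.wmflabs" shape matches the labs node names that appear later in this log (e.g. i-0000031a.pmtpa.wmflabs); the file path is purely illustrative.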
[08:33:16] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 206 seconds [08:33:51] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 220 seconds [08:38:30] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 6 seconds [08:39:07] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:19:11] New patchset: Hashar; "(bug 26784) IRC bots process need nagios monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [09:20:49] New review: Hashar; "I guess that would be good enough. Leslie / Peter mind reviewing this? Maybe the Nagios check shoul..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45097 [09:34:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [09:35:31] RECOVERY - Puppet freshness on srv247 is OK: puppet ran at Tue Jan 22 09:35:02 UTC 2013 [09:35:32] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 202 seconds [09:43:27] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:44:12] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:47:23] New patchset: Silke Meyer; "Icons for Wikidata demo instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45098 [09:52:22] Vito: I don't think so [09:53:04] New patchset: Tim Starling; "Remove apaches.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45099 [09:54:21] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45099 [10:01:17] New patchset: Tim Starling; "Fix duplicate definition error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45100 [10:02:10] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45100 [10:09:27] !log on searchidx2: replaced /etc/sudoers with the distro default so that /etc/sudoers.d/* from puppet can take effect [10:09:37] Logged the message, Master [10:14:24] !log jenkins updated all plugins and restarting [10:14:33] Logged the message, Master [10:18:10] !log gallium : restarted puppet [10:18:20] Logged the message, Master [10:30:37] !log also replaced /etc/sudoers on fenari and hume. Hume needs a puppet change which will follow shortly [10:30:47] Logged the message, Master [10:37:54] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:44:54] New patchset: Tim Starling; "Introduce role::applicationserver::maintenance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45106 [10:45:09] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45106 [10:47:38] New patchset: Silke Meyer; "Added variables to configure Wikibase client via labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45107 [10:49:47] New review: Silke Meyer; "Split into two separate commits (45098 for logos and 45107 for variables). Abandoning this change here." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44690 [10:50:15] Change abandoned: Silke Meyer; "The content has been split into two separate commits." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44690 [10:51:15] New review: Silke Meyer; "Andrew, you were right. We don't need the variables in mediawiki.pp." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/45107 [10:52:04] hume got a lot of good changes from being put into the right puppet classes [10:53:23] like this: [10:53:25] -;igbinary.compact_strings=On [10:53:25] +igbinary.compact_strings=Off [10:53:37] stuff you forget if you have to manage it separately [10:59:48] New patchset: Hashar; "gerrit: mw/tools IRC notifs to #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45108 [11:01:31] TimStarling: are you on a puppet sprint? :) [11:03:26] you know if you can't find something in your house despite spending half an hour looking for it, that's a sign you need to tidy up [11:03:49] I did that last week end [11:03:57] ended up throwing a ton of useless stuff [11:04:10] well done [11:04:17] no I can buy more :-] [11:04:37] so puppet was looking pretty untidy to my eyes, and it was slowing me down and breaking things [11:04:52] so it was time to do a bit of work on it [11:05:08] we also have to split the manifests in nice module [11:05:27] fenari is broken right now [11:05:30] including all the manifests on each run (via the myriad of require in manifests/site.pp) is not really efficient :-D [11:05:37] I don't understand the applicationserver/mediawiki_new split [11:06:00] I think the idea was to refactor the mediawiki class [11:06:09] apparently to run maintenance scripts you need applicationserver, but it installs apache for no apparent reason, and configures it [11:06:13] and have it applied on eqiad app servers [11:08:44] I think I have it figured out... [11:20:12] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [11:20:23] New patchset: Tim Starling; "Split apache off from applicationserver::packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45113 [11:20:34] paravoid/hashar: either of you want to review that? [11:22:25] TimStarling: looking [11:26:46] i don't get it [11:27:13] it's to fix this: [11:27:15] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[libapache2-mod-php5] is already defined in file /var/lib/git/operations/puppet/modules/applicationserver/manifests/packages.pp at line 7; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/noc.pp:8 on node fenari.wikimedia.org [11:27:27] ah the infamous dupe entries [11:27:30] grmblbl [11:27:32] because I just changed fenari to include role::applicationserver::maintenance [11:27:46] which includes applicationserver::packages which conflicts with noc.pp [11:27:50] I guess whoever includes applicationserver::apache_packages [11:27:55] will turns out to have the same issue [11:28:06] since you get the libapache2-mod-php5 defined in two places [11:28:17] (in apache_packages.pp and in packages.pp) [11:28:50] ah, right [11:29:14] one way is to create a dummy class that package{} [11:29:31] then include the dummy class whenever you want it [11:29:46] New patchset: Tim Starling; "Split apache off from applicationserver::packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45113 [11:30:04] New review: Tim Starling; "PS2: actually remove apache from applicationserver::packages" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45113 [11:30:11] better now? 
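The "dummy class" workaround Hashar suggests above is the standard way out of Puppet's duplicate-definition error quoted earlier: declare the package exactly once, in its own class, and have every manifest that needs it include that class, since include is idempotent while a second package {} declaration is not. A hedged sketch with illustrative class bodies (not the actual diff of r/45113):

    # Declared exactly once; safe to include from any number of manifests.
    class applicationserver::apache_packages {
        package { 'libapache2-mod-php5':
            ensure => latest,
        }
    }

    class role::applicationserver::maintenance {
        include applicationserver::apache_packages
    }

    # noc.pp previously declared the same package directly, which is what
    # produced the duplicate-definition error on fenari; including the
    # shared class instead is safe.
    class misc::noc {
        include applicationserver::apache_packages
    }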
[11:30:23] * hashar refreshes [11:30:59] ahh [11:31:15] I had edited it inside my brain but not in the actual file [11:31:47] so [11:31:50] if I am correct [11:31:59] the jobrunner / videoscaler will no more have apache running [11:32:06] since they include applicationserver::packages [11:32:09] yes [11:32:21] there's no ensure=>absent but that is how it will be after reinstall [11:32:40] same goes for the lucene indexer too [11:32:41] but it would have been broken anyway since they didn't include applicationserver::service [11:32:47] yes, I said so in my commit message [11:33:28] ah and tin == deployment [11:33:42] sounds good so [11:34:06] will probably want to update tin / lucene etc to have the apache mod installed though [11:34:09] I guess tin relies on it [11:35:01] what does tin need it for? [11:35:11] New review: Hashar; "Sounds good. PS1 had packages to still include libapache2-mod-php5 which was really confusing." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/45113 [11:35:20] ah maybe it does not need libapache2-mod-php5 [11:35:22] just apache [11:35:44] I think the git-deploy minion fetch from an http:// url served by tin [11:35:50] though probably does not need php5 [11:35:52] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45113 [11:36:11] lucene, I have ZERO idea [11:37:58] my labs instances are not happy hehe [11:37:59] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/usr/local/apache] is already defined in file /etc/puppet/manifests/nfs.pp at line 187; cannot redefine at /etc/puppet/modules/applicationserver/manifests/config/apache.pp:29 on node i-0000031a.pmtpa.wmflabs [11:38:05] most probably unrelated though [11:38:27] lucene doesn't need apache [11:41:47] ok, well I ran puppet on fenari and it doesn't seem horribly broken [11:41:55] :-] [11:42:03] and on app servers? :D [11:42:27] I will try that, and searchidx2 [11:43:58] you know why puppet is slow? [11:45:05] I ran a profile today, it seems to be mostly the manifest parser [11:45:52] we use import, rather than autoload, so we get the whole manifests directory for any server [11:46:36] app servers are fine, no change [11:46:42] searchidx2 is fine [11:47:08] gtg [11:48:31] Tim-away: bye bye :) [11:48:36] New patchset: Hashar; "beta: /usr/local/apache dupe definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45115 [11:48:43] Tim-away: for your backlog, we can get more autoloading by splitting our manifests to submodules :-] [11:48:47] err modules [12:26:02] <^demon> !log restarted ircecho on manganese [12:26:14] Logged the message, Master [14:52:27] New patchset: Mark Bergsma; "Add misc_pmtpa and misc_eqiad hostgroups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45138 [14:54:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45138 [14:57:26] New patchset: Mark Bergsma; "EQIAD SWITCH: Use eqiad bits app servers exclusively." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/44252 [14:57:26] New patchset: Mark Bergsma; "EQIAD SWITCH: Use eqiad bits app servers in eqiad (not in pmtpa)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44251 [14:57:26] New patchset: Mark Bergsma; "EQIAD SWITCH: Migrate mobile backend appservers to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44257 [15:22:50] !log Changed AS14907->AS43821 routing [15:23:00] Logged the message, Master [16:14:58] paravoid: ping [16:15:05] pong [16:15:43] paravoid: howdy... Ryan_Lane said that you are the man that I need to talk to :) [16:15:51] hi :) [16:15:52] what for? [16:15:54] I work on swift [16:16:19] okay [16:16:24] I heard that you guys need global replication, which we don't have right now [16:16:36] New patchset: Mark Bergsma; "EQIAD SWITCHOVER: set pmtpa application servers read-only" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45149 [16:16:49] among other things, yes [16:16:53] hehe [16:17:07] we're in contact with the SwiftStack folks [16:17:14] so, I would be curious to hear a little more [16:17:15] ahh [16:17:16] John has said that it's in the works [16:17:52] New patchset: Mark Bergsma; "EQIAD SWITCHOVER: set eqiad application servers read-write" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45150 [16:17:59] what would you like to hear? [16:18:09] <^demon> mark: I think Asher already has 2 patches in for that. [16:18:23] <^demon> https://gerrit.wikimedia.org/r/#/c/44845/ and https://gerrit.wikimedia.org/r/#/c/44847/ [16:18:41] well, the things inside swiftstack often don't get shared with the larger community it seems :/ [16:18:57] paravoid: I was just curious to hear a little more about your use cases [16:19:53] like to start, what you envision the global setup to look like [16:19:58] how many DCs [16:19:59] well, we currently use swift for media storage -- basically images & videos for all wikimedia projects incl. 
wikipedia [16:20:10] how much bandwidth will be availabe between the dcs [16:20:24] paravoid: yup [16:20:36] we have two sites in the US, a smaller caching site in EU and we're about to set up a new caching site in the west coast as well [16:20:58] the initial target was replication between the two sites in the US [16:21:17] one active, the other one hot standby (or even active/active) [16:21:21] k [16:21:32] bandwidth between DCs is not an issue [16:21:42] there's latency though [16:21:43] k [16:22:01] note that while waiting for swift to gain geo-replication we've been evaluating Ceph in the other DC [16:22:07] ceph + radosgw [16:22:17] yeah that is what I heard [16:22:36] also note that we've attempted to use container sync to sync containers across [16:22:42] but that was a huge failure [16:22:48] yeah, I wouldn't recommend that [16:22:49] I've filed multiple bugs about that on launchpad [16:23:15] swift couldn't scale for the number of files per container that we wanted, so we've been sharding containers into 256 shards [16:23:20] We should have communicated a bit better in the docs that it was a bit of an experimental feature [16:23:25] that has resulted in us having about 36k containers [16:24:05] container sync already sucks as it is, with 36k containers/200 million files is just impossible for us [16:24:14] I totally understand [16:25:02] note that ceph also has a number of advantages over swift so far [16:25:11] (georeplication is unfortunately not one of them) [16:25:31] paravoid: meaning ceph doesn't have georeplication? [16:25:47] not really, no [16:25:51] k [16:26:04] it does have hierarchical zoning [16:26:07] ok so sticking with replication for a bit [16:26:15] I would like to dig in a bit more [16:26:23] but there's no way of preferring the local DC for reads for example [16:26:26] (and thanks for your time, this is a lot of good feedback) [16:26:27] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [16:26:28] sure [16:26:45] gah, that paged again [16:27:32] So in an ideal world, would you like to have those two DCs act as one physical cluster? [16:31:51] creiht: what do you mean by that? [16:32:12] what's up with the ms-fe.eqiad pages for the last 24 hours [16:32:19] yeah I'm looking into that now [16:32:19] sorry [16:32:33] I'm not sure why it started paging all of a sudden [16:32:34] paravoid: I can wait if you need to work on the pages [16:32:45] it never worked [16:33:21] but, to rephrase, are you looking to have 1 cluster that is spread across several locations, or 2 separate mirrored clusters that can be accessed individually [16:33:31] in an ideal world [16:33:52] but I've been working on fixing it properly; I need to restart pybal on lvs1003 [16:33:58] and let's just say that I'm a bit... 
reluctant :-) [16:34:33] lvs1006 has the new config and has been restarted and works [16:34:38] not sure if I can just kill pybal on lvs1003 [16:34:45] - 10.64.1.6 - - [22/Jan/2013:16:34:36 +0000] "GET /monitoring/backend HTTP/1.0" 200 247 "-" "Twisted PageGetter" [16:34:48] - 10.64.1.3 - - [22/Jan/2013:16:34:42 +0000] "GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/monitoring/pybaltestfile.txt HTTP/1.0" 400 272 "-" "Twisted PageGetter" [16:35:20] the second URL is swift-specific and won't work; the first is what I've provisioned yesterday (with a swift rewrite.py change) as to properly fix it and make it swift/ceph agnostic [16:36:22] creiht: we haven't thought about it much; I can see arguments for both, but what really matters is to have some form of replication between the two sites... [16:36:31] k [16:36:45] my instinct would say two separate clusters as to increase redundancy, ease/stage software upgrades etc. [16:36:52] so having the redundancy is more important [16:36:57] cool [16:37:11] as long as replication is not per container or anything [16:37:15] that's completely nuts :) [16:37:35] paravoid: yeah that was meant for one small use case [16:37:49] and a bit of an experiment [16:38:06] if only we knew that from the get go [16:38:11] yeah sorry :/ [16:38:15] I took over media storage quite late [16:38:32] but before me there was strong hope that container sync would be a workable solution [16:38:37] ahh [16:38:57] and actually putting internal deadlines on that assumption [16:39:02] ouch [16:39:28] so, yeah, you really need to document better that it's not not even close to being production-ready :) [16:39:29] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [16:40:05] paravoid: yeah, I have been thinking about how we can better communicate which features are well tested, and which are a bit more experimental [16:40:21] binasher: btw, the latest page was because nagios was completely broken again and mark just fixed it [16:40:33] creiht: yep, good idea [16:40:46] paravoid: I was working on a different project for about a year, so just getting back to work on swift stuff [16:41:36] unfortunately it seems documentation has taken a bit of a back seat during that time [16:41:46] paravoid: that is an excellent reason to receive a page [16:41:53] that's one of the things that I want to work on [16:41:55] anyways [16:42:03] binasher: hehe [16:42:42] so you mentioned being able to prefer a local DC [16:42:54] how important would that be to you? [16:43:03] hiya [16:49:02] creiht: what do you mean? [16:49:34] it's important to not take a latency hit [16:50:16] yeah, so on any request, you would like to use the nearest location, if possible [16:50:51] New patchset: Pyoungmeister; "allowing jobrunners to be enabled or disabled via role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45156 [16:51:08] would mark or binasher or paravoid take a look at that ^^ [16:51:12] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 247 bytes in 0.054 seconds [16:51:40] binasher: I was also thinking of basing it on $::mw_primary [16:51:45] there [16:51:55] notpeter: is that for spinning up the jobrunners in eqiad today? 
or just later on [16:51:59] looking [16:52:04] well, today and both [16:52:15] I put in a hack to stop them in eqiad based on host name [16:52:18] but that's a hack [16:52:21] this seems more proper [16:52:28] i wonder if there's a reason not to run jobrunners in both sites [16:52:35] some jobs may process very slowly due to latency [16:52:51] mark: not with the current job runner [16:53:04] sure, could also run from both sites with half the procs [16:53:12] lemme know what you'd like :) [16:53:13] why half? [16:53:14] its too uncoordinated and can stamped databases with too many runners [16:53:18] ah [16:53:20] ok [16:53:36] cool. then I stand by my patchset :) [16:54:00] New patchset: Silke Meyer; "Moves Wikidata items to main namespace, imports main page to different namespace" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [16:54:59] so whom of platform engineering is here? :) [16:55:08] roan [16:55:10] robla just walked in [16:55:17] er, roan's not on platform, is he? [16:55:18] notpeter: that changeset doesn't disable pmtpa jobbers? [16:55:22] chris [16:55:31] roan knows mediawiki, that's all that matters [16:55:33] New review: Faidon; "I'd prefer a run_jobs or something, but I guess enabled is also obvious enough. Looks good otherwise." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45156 [16:55:39] binasher: nein. I'd like to make a sperate patcheset for that [16:56:10] notpeter is right, I'm on features, not platform [16:56:23] RoanKattouw: but you know stuff about things [16:56:24] * ^demon is pinged on platform [16:56:26] <^demon> What's up? [16:56:36] <^demon> mark: Reedy and I both are around. [16:56:44] mediawiki working correctly is a feature as far as I'm concerned today ;-) [16:56:51] hahaha [16:56:53] Of course I was in platform's annual planning meeting last year while I wasn't in the one for features.... but I'm in features, I swear! :D [16:59:29] New patchset: Pyoungmeister; "allowing jobrunners to be enabled or disabled via role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45156 [16:59:31] ok, paravoid, more verbose var name, just for you :) [16:59:44] haha [16:59:50] a nitpick really :) [16:59:57] en, you're right [17:00:03] /en/eh [17:00:12] my var naming is usually underly verbose [17:00:19] <^demon> RoanKattouw: If you ever want to come back to platform, we'd welcome you :) [17:00:20] binasher: i assume the DB's are sufficiently warm? ;-) [17:01:14] mark: there are warmups running now, but yes :) [17:02:51] mark: ready to start with the bits migration? [17:02:58] we are going off of https://wikitech.wikimedia.org/view/Eqiad_Migration_Planning/Steps [17:03:25] yeah i'm currently checking a few bits urls against bits apaches [17:03:32] great [17:04:19] ping me if you need anything [17:05:15] !log staging cname switch for pmtpa/eqiad dbs on sockpuppet [17:05:25] notpeter: ty! [17:05:28] Logged the message, notpeter [17:06:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45156 [17:06:44] getting just 404s for the top bits urls [17:06:47] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:57] lemme try tampa [17:07:03] perhaps that's normal ;) [17:07:06] mark: you're querying the apaches? [17:07:09] yes [17:07:19] are you emulating the varnish rewrites? 
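A rough sketch of the pattern behind r/45156 discussed above: a role class decides per site whether the job runners actually run, replacing the hostname-based hack notpeter mentions, and (per mark and binasher) only one site runs them at a time so uncoordinated runners cannot stampede the databases. Names below are illustrative, not the merged change; paravoid's review suggested run_jobs for the switch:

    # Illustrative only.
    class applicationserver::jobrunner( $run_jobs = true ) {
        $jobs_ensure = $run_jobs ? {
            true    => 'running',
            default => 'stopped',
        }

        service { 'mw-job-runner':
            ensure => $jobs_ensure,
            enable => $run_jobs,
        }
    }

    # Per-site role classes flip the switch; this could equally be derived
    # from $::mw_primary, as floated above.
    class role::applicationserver::jobrunner::eqiad {
        class { 'applicationserver::jobrunner': run_jobs => true }
    }
    class role::applicationserver::jobrunner::pmtpa {
        class { 'applicationserver::jobrunner': run_jobs => false }
    }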
[17:07:28] doh [17:07:29] watch the actual requests hitting pmtpa apaches [17:07:33] no ;) [17:07:35] and try those [17:07:36] thanks [17:07:38] i'll check a couple too [17:07:52] varnishtop -b -i TxURL [17:08:25] yeah looking better [17:08:38] New patchset: Pyoungmeister; "EQIAD SWITCH: switching job runners and tmh to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45159 [17:09:38] seems ok, but got some slow responses [17:10:31] like > 2s [17:10:39] i hope that's not gonna be a problem [17:10:43] resourceloader can be slow without its intermediary caches being warm [17:10:52] replay the same set of requests? [17:11:01] faster of course [17:11:08] fortunately there aren't many backend requests for bits [17:11:09] we'll see [17:11:13] let's do it [17:11:29] the contents of bits varnish shouldn't all expire at once either [17:12:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44251 [17:12:38] running puppet on arsenic [17:12:56] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [17:13:33] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [17:13:50] PROBLEM - NTP on db1013 is CRITICAL: NTP CRITICAL: No response from NTP server [17:14:25] !log dns svn repo on sockpuppet has changes staged and checked in, not yet deployed on dobson [17:14:26] seems fine [17:14:35] Logged the message, notpeter [17:15:37] going to run it on the other bits servers [17:15:59] yep, varnishstat on arsenic looks good, and see traffic to the bits apaches [17:16:03] yup [17:17:26] PROBLEM - SSH on db1013 is CRITICAL: Connection refused [17:17:41] hopefully the resourceloader in read only mode issues are really fixed, theres a small trickle of write attempts (RoanKattouw) [17:17:54] Yes, they should be [17:17:59] Are we in r/o mode now? [17:18:02] eqiad is [17:18:04] Or is it just the bits apaches [17:18:06] right [17:18:09] all done ine qiad [17:18:10] eqiad [17:18:12] Yeah there will still be write attempts [17:18:13] sweet [17:18:17] will switch pmtpa bits varnish now [17:18:29] It doesn't bother to check for r/o mode specifically because DB writes may fail for other reasons anyway [17:18:43] mark: can you merge the two mobile changes with it [17:18:48] sure [17:19:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44252 [17:19:17] where are they? 
you didn't use the eqiad-switchover topic [17:19:24] ah linked in the steps doc [17:19:41] oh sorry about that [17:19:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45091 [17:19:45] one is yours tho :) [17:20:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44257 [17:20:42] merging on sockpuppet [17:21:43] php logs all look good from bits eqiad apaches [17:21:44] merged [17:21:45] running puppet [17:22:03] running puppet on cp1041 [17:23:13] do you want me to do the rest? [17:23:26] paravoid: ?? [17:23:28] no [17:23:43] 1042/1043/1044 [17:23:47] no [17:23:49] k [17:24:05] looks fine on 1041 [17:24:09] yup [17:24:13] ok, doing the rest [17:24:29] RECOVERY - SSH on db1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:24:44] mobile has quite some traffic of course [17:24:58] with its hit rate [17:25:26] and while it warms up our caches with slow responses, people think it's their mobile network being slow - brilliant! [17:25:32] hahaha [17:25:34] haha [17:25:40] yep :) [17:25:58] SF folks: if anyone is interested in ridiculously sugary breakfast, come to R31 [17:26:11] don't you love american breakfast [17:26:23] "breakfast" [17:26:25] mark: you mean coffee? [17:26:33] i mean muffins [17:26:44] i need coffee in the morning but can't stand muffins [17:26:48] We have stuff like sticky buns here [17:27:00] RoanKattouw: is there any fresh fruit in there? [17:27:13] Some of the pastries contain fresh fruit [17:27:15] Among .... other things [17:27:21] Like, hidden in a pound of sugar [17:27:25] fucking hell, 600 Mbps of mobile miss traffic already [17:27:36] the next appserver batch should come out of mobile's budget I think ;-) [17:27:42] hah [17:27:56] oh and that's not the half of it [17:28:23] they went and duplicated the entire parsercache on their own in memcached, doubling the usage, without telling us / anyone [17:28:36] so my capacity estimates used for the mc servers, totally wrong [17:28:36] hahaha [17:28:38] anyways [17:28:49] binasher: Mobile did? [17:28:59] now, we are going to wait 5-10 minutes [17:29:10] preilly: yes.. but anyways, later. [17:29:47] ridiculous [17:29:52] consider my waiting... with a sugary american muffin [17:30:20] i'll checkout the pmtpa readonly change in the mean time [17:30:21] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 220 seconds [17:30:22] ready to sync-file [17:30:47] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 235 seconds [17:31:03] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44845 [17:31:47] mark: thanks - i'll deploy that one [17:31:59] ok [17:32:23] mark: nice [17:33:25] ~ 800 Mbps of mobile traffic [17:33:35] mark: actually you can if you have it ready, but hold off for a bit more [17:33:50] sure [17:34:36] DBs look bored in ganglia [17:35:35] all but the s3 snapshot db which is lagging [17:35:42] mark: would there be a reason why I can't login to Ganglia? [17:35:56] preilly: assuming you have the password, no [17:36:25] mark: Okay thanks [17:36:44] mark: ok, are you fairly ready for the squid changes? 
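Several of the changes merged during this window (e.g. r/45091, "set $::mw_primary = eqiad (sets mobile api backend, redis+mha conf)") hang datacentre-dependent behaviour off that one global. A hedged sketch of the pattern, with placeholder hostnames and paths rather than production values:

    # In site/realm configuration: which DC is the MediaWiki primary.
    $mw_primary = 'eqiad'

    # Consumers branch on it; for example a redis replica following the
    # master DC (hostnames and paths here are placeholders).
    class redis::replica {
        $master_host = $::mw_primary ? {
            'eqiad' => 'redis-master.eqiad.wmnet',
            default => 'redis-master.pmtpa.wmnet',
        }

        file { '/etc/redis/replication.conf':
            ensure  => present,
            content => "slaveof ${master_host} 6379\n",
        }
    }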
[17:37:04] commits are done, i need to merge, generate config, sync em out [17:37:06] but shouldn't take long [17:37:10] i'll start on that now [17:37:16] if you do the read only step [17:37:26] cool, let me know when its generated but not synced [17:37:35] and then i'll deploy read only in pmtpa [17:37:45] trying to minimize the ro time :) [17:38:14] diffing [17:38:26] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:52] generated configs look good to me [17:39:05] i can sync out any time [17:39:45] i'll sync out to cp1001 first [17:39:47] then the rest [17:39:56] readonly going out now [17:40:20] !log asher synchronized wmf-config/db-pmtpa.php 'setting pmtpa to readonly' [17:40:30] Logged the message, Master [17:40:35] mark: ok, go ahead with cp1001 [17:40:36] cp1001 going out now [17:41:45] that's an api backend [17:41:52] it seems happy about its api backend ;) [17:42:08] let's do the rest? [17:42:10] !log db1013 wiped, reimaged, powered off, and un-monitored [17:42:22] Logged the message, Master [17:42:36] mark: go for it [17:42:38] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [17:42:47] i'm going to switch eqiad redis instances to masters [17:42:56] done [17:43:05] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [17:43:18] RoanKattouw: did you RL patch get merged? [17:43:21] *your [17:43:40] and running puppet updates on the pmtpa mc/redis hosts [17:44:03] AaronSchulz: I believe so [17:44:19] * RoanKattouw logs in to check fatal.log [17:44:45] looking at traffic nose dive in pmtpa and spike in eqiad is nice [17:44:53] yes :-) [17:45:06] would be nice if it checked readonly just to avoid log spam, though not a huge deal [17:45:13] only two fatals from eqiad apaches so far, both very typical [17:45:29] ok, i'm going to do the db master switches [17:45:35] ok [17:45:41] exception.log is full of query errors [17:45:47] :D [17:45:51] i'll put the read-write config change on fenari [17:45:57] A minute ago it was mostly block insertions/deletions [17:45:58] ready for sync-file once you're happy [17:46:12] RoanKattouw: look like all attempts to write to r-o dbs though [17:46:14] Via the API, it seems [17:46:19] don't see anything else [17:46:20] binasher: lemme know when dns change [17:46:21] They are, yes [17:46:22] Tue Jan 22 17:46:07 UTC 2013 mw1077 enwiki OAIRepo::logRequest [17:46:26] RoanKattouw: ;) [17:46:37] notpeter: go ahead now [17:46:37] I guess the API isn't very well-equipped for r/o mode [17:46:37] Oh well, too late to fix that now [17:46:51] !log authdns-update [17:47:01] Logged the message, notpeter [17:47:06] started master swaps [17:47:17] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44847 [17:47:17] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:26] PROBLEM - Apache HTTP on mw1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:31] uh oh [17:47:35] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:36] all dns servers are happy [17:47:44] PROBLEM - MySQL Slave Delay on db1038 is CRITICAL: CRIT replication delay 183 seconds [17:47:45] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:45] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:45] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[17:47:54] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 186 seconds [17:47:54] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 186 seconds [17:47:54] PROBLEM - MySQL Replication Heartbeat on db1011 is CRITICAL: CRIT replication delay 186 seconds [17:48:02] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:09] which pool are those apaches in [17:48:19] ^ that one [17:48:36] binasher: Application servers eqiad [17:49:05] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.026 second response time [17:49:14] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.720 second response time [17:49:23] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 219 seconds [17:49:24] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time [17:49:24] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.097 second response time [17:49:24] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 219 seconds [17:49:24] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.724 second response time [17:49:30] i think it's fine [17:49:50] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:08] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: CRIT replication delay 232 seconds [17:50:13] RoanKattouw: re: resourceloader cache heat, I imagine if writes are disabled does it choose to serve available cache or generate responses on the fly and be unable to save them? [17:50:26] nvm, caches aren't in the database, and there is still varnish in front. [17:50:35] good [17:50:43] There is some caching in the DB, as well as dependency tracking [17:50:52] right [17:50:53] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: NRPE: Unable to read output [17:51:02] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: NRPE: Unable to read output [17:51:10] but that's relatively cheap to compute as long as the final response will still be cached [17:51:11] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: NRPE: Unable to read output [17:51:13] The worst that can happen is that dependencies (images embedded in CSS) fail to register, breaking cache invalidation down the line as the cache isn't invalidated when those images change [17:51:20] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: NRPE: Unable to read output [17:51:21] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: NRPE: Unable to read output [17:51:26] But in practice, this should only happen when new code is deployed [17:51:26] right [17:51:29] PROBLEM - MySQL Replication Heartbeat on es1005 is CRITICAL: NRPE: Unable to read output [17:51:29] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: NRPE: Unable to read output [17:51:29] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.107 second response time [17:51:29] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: NRPE: Unable to read output [17:51:29] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:35] The writes that it's actually trying are probably for message blobs [17:51:45] slaving seems happy :) [17:51:46] yay! 
[17:51:47] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: NRPE: Unable to read output [17:51:48] PROBLEM - MySQL Replication Heartbeat on es1009 is CRITICAL: NRPE: Unable to read output [17:51:52] And if it can't cache those it'll just regenerate them for every request [17:51:55] aside from nrpe sucking ass [17:51:56] PROBLEM - MySQL Replication Heartbeat on db1039 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: NRPE: Unable to read output [17:51:57] PROBLEM - MySQL Replication Heartbeat on db1026 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on es10 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on es1010 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: NRPE: Unable to read output [17:52:14] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: NRPE: Unable to read output [17:52:15] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: NRPE: Unable to read output [17:52:15] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: NRPE: Unable to read output [17:52:15] PROBLEM - MySQL Replication Heartbeat on db65 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on es1006 is CRITICAL: NRPE: Unable to read output [17:52:24] PROBLEM - MySQL Replication Heartbeat on db31 is CRITICAL: NRPE: Unable to read output [17:52:24] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: NRPE: Unable to read output [17:52:25] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: NRPE: Unable to read output [17:52:32] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on es1008 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: NRPE: Unable to read output [17:52:41] PROBLEM - MySQL Replication Heartbeat on db51 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db1047 is 
CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on es6 is CRITICAL: NRPE: Unable to read output [17:52:43] PROBLEM - MySQL Replication Heartbeat on db1027 is CRITICAL: NRPE: Unable to read output [17:52:43] PROBLEM - MySQL Replication Heartbeat on es8 is CRITICAL: NRPE: Unable to read output [17:52:50] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: NRPE: Unable to read output [17:52:51] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: NRPE: Unable to read output [17:52:59] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: NRPE: Unable to read output [17:52:59] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: NRPE: Unable to read output [17:53:00] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: NRPE: Unable to read output [17:53:00] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: NRPE: Unable to read output [17:53:00] PROBLEM - MySQL Replication Heartbeat on es1007 is CRITICAL: NRPE: Unable to read output [17:53:06] notpeter: is the Nagios Remote Plugin Executor just messed up right now? [17:53:08] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.112 second response time [17:53:09] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: NRPE: Unable to read output [17:53:13] Pryes [17:53:17] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: NRPE: Unable to read output [17:53:17] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: NRPE: Unable to read output [17:53:18] preilly: yes [17:53:26] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: NRPE: Unable to read output [17:53:26] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: NRPE: Unable to read output [17:53:26] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: NRPE: Unable to read output [17:53:26] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64509 bytes in 9.312 seconds [17:53:27] PROBLEM - MySQL Replication Heartbeat on es9 is CRITICAL: NRPE: Unable to read output [17:53:28] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: NRPE: Unable to read output [17:53:32] replication actually loks fine, from checking by hand [17:53:35] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: NRPE: Unable to read output [17:53:36] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: NRPE: Unable to read output [17:53:36] PROBLEM - MySQL Replication Heartbeat on es5 is CRITICAL: NRPE: Unable to read output [17:53:36] PROBLEM - MySQL Replication Heartbeat on db48 is CRITICAL: NRPE: Unable to read output [17:53:40] "Swap the $shard-primary and $shard-secondary dns records. This is required for heartbeat monitoring, will erroneously show CRIT until then. 
" [17:53:44] PROBLEM - MySQL Replication Heartbeat on es7 is CRITICAL: NRPE: Unable to read output [17:53:53] PROBLEM - MySQL Replication Heartbeat on db34 is CRITICAL: NRPE: Unable to read output [17:54:02] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: NRPE: Unable to read output [17:54:02] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:11] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: NRPE: Unable to read output [17:54:20] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.100 second response time [17:54:29] PROBLEM - MySQL Replication Heartbeat on db1036 is CRITICAL: NRPE: Unable to read output [17:54:36] mark: do you think that it's dns propogation time? as I did that... [17:54:39] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: NRPE: Unable to read output [17:54:43] !log aaron synchronized php-1.21wmf7/includes/Block.php 'deployed b0d5ff9bfb09f15ccbab6849564511d372ecd8d4' [17:54:47] PROBLEM - MySQL Replication Heartbeat on labsdb1003 is CRITICAL: NRPE: Unable to read output [17:54:54] Logged the message, Master [17:55:09] yeah, what's the ttl? [17:55:13] 5M [17:55:14] 5m [17:55:14] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: NRPE: Unable to read output [17:55:23] PROBLEM - MySQL Replication Heartbeat on db1048 is CRITICAL: NRPE: Unable to read output [17:55:38] shall I try kicking heartbeat on one box? [17:56:27] it seems to have propogated [17:56:44] RECOVERY - MySQL Slave Delay on db1038 is OK: OK replication delay NULL seconds [17:56:45] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:00] what is aaron syncing? [17:57:08] fyi, i've been manually migrating s4.. the eqiad master was lagged once traffic hit it and mha won't do anything in that case [17:57:33] binasher: ok, thanks. i was just going to look at that [17:57:33] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time [17:57:38] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:38] Meh [17:58:10] mark: Every 1% of requests that inspect the blocks table, MW purges expired blocks [17:58:14] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [17:58:16] It needs to not try that in r/o mode [17:58:23] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.096 second response time [17:58:27] ok, but can we coordinate all syncs here? [17:58:33] I'm going to switch jobrunning to eqiad [17:58:37] AaronSchulz: I also see backtraces of writes in CentralAuth from within User::loadOptions() [17:58:42] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:46] AaronSchulz: Per mark, please coordinate code syncs here :) [17:58:58] mark: I just told him in person [17:58:58] mark: ok? 
[17:58:59] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 0 seconds [17:59:08] notpeter: ok [17:59:14] RoanKattouw: sure, but I don't plan on doing anything else [17:59:17] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:17] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.125 second response time [17:59:20] OK [17:59:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45159 [17:59:47] AaronSchulz: Just saying, today's fatal.log will probably point out a bunch of other r/o issues as well [17:59:48] !log moving jobrunning to eqiad [17:59:50] RoanKattouw: what is it writing [17:59:58] Logged the message, notpeter [18:00:20] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.092 second response time [18:00:31] eqiad apache cpu usage slowly going down [18:00:38] ok, and got the s4 pmtpa master replicating off the eqiad one.. dbs should be done, but a couple more checks before turning off read only [18:00:43] notpeter: are you looking at the heatbeat nagios thing? [18:00:56] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.845 second response time [18:01:02] binasher: no. I can, though. I'm going to try restarting heartbeat on one slave [18:01:23] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:27] notpeter: is it just DNS related? [18:01:36] preilly: dns is correct [18:01:37] AaronSchulz: User::loadOptions() indirectly calls CentralAuthHooks::attemptAddUser() which calls User::addToDatabase() which obviously calls Database::insert() [18:01:38] Ryan_Lane: (catching up on backscroll…) As far as I know I haven't merged any of Mike's work yet. What are you seeing? [18:01:39] notpeter: e.g., NRPE: Unable to read output [18:01:56] yeah, seems like a "running the nagios check" problem [18:01:58] RoanKattouw: ugh, that sounds about right [18:02:04] andrewbogott_afk: sockpuppet's puppet working tree was in a a "mikepatch1" branch [18:02:09] ok [18:02:32] Also, lots and lots of API stuff [18:02:42] Although the API paths now seem to be triggering dieReadOnly() [18:02:51] Previously the DB was r/o but $wgReadOnly was false [18:03:03] That caused the API to freak out quite a bit [18:03:03] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.768 second response time [18:03:29] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:47] Fatal error: wikiversions.cdb has no version entry for `DB connection error: No working slave server: Unknown error (10.0.6.46)`. [18:03:50] RoanKattouw: lol [18:04:05] notpeter: I can haz Ganglia login? [18:04:15] AaronSchulz: hahahaaha [18:04:22] roan: /home/w/doc/ [18:04:29] that would be an interesting DB name [18:04:39] RoanKattouw: /home/wikipedia/doc/ganglia.htaccess [18:04:43] paravoid: Oh…. dammit [18:04:45] Thanks man [18:05:06] are we finally switching to reasonable database names derived from domains while moving? [18:05:08] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.314 second response time [18:05:41] paravoid: Holy crap, I must've had a serious case of typing-in-the-wrong-window yesterday. [18:05:48] Danny_B: We're not touching the DBs, no [18:05:55] :-( [18:05:57] oh, awesome, the heartbeat nrpe check is spitting out python stack traces.... 
[18:06:02] pybal is unhappy about a lot of apaches, not responding within 5s [18:06:44] BTW, Ganglia monitoring on the LVS groups is broken [18:07:06] All three of them broke between 17:50 and 18:00 UTC [18:07:12] Hey Erik [18:07:18] eqiad first, them pmtpa, then esams [18:07:26] strange [18:07:48] I think just masterdom is wrong for nagios [18:07:53] yeah [18:07:55] one ine change [18:08:00] thanks preilly ! [18:08:05] notpeter: np [18:08:33] notpeter: modules/coredb_mysql/files/utils/master_id.py:10:masterdom = '.pmtpa.wmnet' [18:08:42] notpeter: and files/mysql/master_id.py:10:masterdom = '.pmtpa.wmnet' [18:08:53] yeah, i'm going to have change it to a template [18:08:56] so that it's no hardcoded [18:08:58] but whatevs [18:09:39] kinda redundant that it appends -master and also needs a masterdom [18:09:43] TimStarling, Ryan_Lane, thanks for cleaning up my mess; my head is now officially Hung In Shame. [18:09:50] heh [18:10:13] in general, don't do anything nonstandard when merging on sockpuppet ;) [18:10:16] andrewbogott_afk: it happens. we never switch branches on sockpuppet, though [18:10:17] indeed [18:10:18] (And, now I'm going to figure out how to make my non-local/non-labs sessions use a different background color or something) [18:10:24] ah [18:11:09] ok. need to head into the office [18:11:37] binasher: what's the status of DBs? [18:12:08] nice timing.. i think i'm satisfied with their state [18:12:15] ready for read-write? [18:12:43] notpeter: any progress with heartbeat monitoring? [18:12:52] yeha, about to check in a patch [18:13:26] mark: oh, you already merged disabling readonly in eqiad :) [18:13:32] ready for sync-file [18:13:36] go for it [18:13:44] going out now [18:14:18] restarting redis in pmtpa, set to slave from eqiad [18:14:20] New patchset: Pyoungmeister; "setting masterdom by $mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:14:29] preilly: the other file is no longer used and I need to clean up [18:14:31] does that look right? [18:14:34] notpeter: ah, that! thanks [18:14:36] !log root synchronized wmf-config/db-eqiad.php 'eqiad read-write!' [18:14:37] more eyes!!! [18:14:46] looking [18:14:48] Logged the message, Master [18:14:54] flood of exceptions on failed writes has stopped [18:15:19] paravoid: are you okay with https://gerrit.wikimedia.org/r/#/c/45164/1/modules/coredb_mysql/templates/master_id.py.erb [18:15:34] no [18:15:40] I still see them [18:15:42] oops, s4 master was still read-only since i manually migrated that one… writeable now [18:15:42] the definition hasn't changed to reference the template [18:15:47] still references the file [18:15:55] ah, thanks [18:15:55] notpeter: ^^^ [18:15:56] derp [18:16:26] no dberror and exceptions look clean [18:16:31] there we go [18:16:32] Edits flowing in again [18:16:36] where are the maintenance cronjobs going from hume? [18:16:41] or not [18:16:49] notpeter: also can you amend the commit to include, "EQIAD MIGRATION" [18:16:55] well fuck. 
i was aiming to be done by 10am [18:17:05] but 10:16 isn't bad [18:17:10] not bad at all [18:17:13] binasher: yeah it's a disaster isn't it [18:17:14] :P [18:17:15] 30 mins readonly [18:17:17] ok, now they are gone for real [18:17:20] but pybal isn't happy about many apaches [18:17:22] checking now [18:17:34] Error connecting to 10.64.0.8: Access denied for user 'wikiadmin'@'10.64.16.98' (using password: YES) [18:17:39] it has only 98 pooled now [18:18:01] hm [18:18:01] what is that host [18:18:13] db1004 [18:18:17] no [18:18:20] happier now [18:18:21] mw118 [18:18:26] New patchset: Pyoungmeister; "eqiad migration: setting masterdom by $mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:18:27] why is that using wikiadmin [18:18:29] 1118 and others [18:18:34] db1004 [18:18:43] * preilly is too slow [18:18:59] RoanKattouw: AaronSchulz: when do the www's use wikiadmin? [18:18:59] notpeter: you forgot the .erb [18:19:02] plwiki still read-only? http://pl.wikipedia.org/ [18:19:08] :( [18:19:11] I'm derping hard today [18:19:13] thank you! [18:19:15] np [18:19:26] binasher: I don't think they're supposed to do that at all [18:19:29] binasher: I don't know, it should just be cli scripts [18:19:44] New patchset: Pyoungmeister; "eqiad migration: setting masterdom by $mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:19:47] i didn't grant wikiadmin to the www apache hosts on purpose [18:19:49] only wikiuser [18:19:50] but at leas tI know when to ask for review! [18:19:57] Hmm, looks like maintenance scripts might do that [18:20:03] Is mw1118 a job runner? [18:20:17] no [18:20:20] that's an api, I beleive [18:20:31] paravoid: what do you think of https://gerrit.wikimedia.org/r/#/c/45164/ now? [18:20:55] the job runner hosts should be able to use wikiadmin [18:21:09] ah, I see [18:21:16] go for it and we'll see [18:21:18] mw1118 is an apache.. so.. weird [18:21:25] I thought it used wikiadmin when getDbType() was set to admin [18:21:34] apparently any maintenance script does [18:21:35] I can't find where $wgDBuser is even being set [18:21:45] wikiadmin is in AdminSettings.php [18:21:57] chrismcmahon: no, plwiki is getting edits [18:22:45] i can change the wikiadmin grants to cover all of the apaches if this is urgent, but i'd like to keep it more restricted now [18:22:45] mw1118 is an api apache [18:22:58] binasher: just saw that switch. right now getting no DNS for dewiki here at least http://de.wikimedia.org/ [18:23:07] eek, scratch that [18:23:08] merging my little script change [18:23:14] binasher: I bet it's just for commons [18:23:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:23:22] chrismcmahon: there have been no external dns changes [18:23:27] * preilly — API application servers eqiad > mw1118.eqiad.wmnet  [18:23:39] binasher: I made a typo, sorry [18:24:01] we should set system role better on apaches [18:24:02] Oooooh hahaha [18:24:08] to indicate what they are, regular, api, bits, etc [18:24:10] 'wikiuser' is MW's default value for $wgDBuser [18:24:12] That's ... fragilee [18:24:25] mark: I'll do that today [18:24:38] Tim-away would know why the api occasionally uses wikiadmin off the top of his head, i bet [18:24:41] just remove the current "wikimedia-task-appserver" one, that's old format ;) [18:24:55] ok [18:25:22] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds [18:25:27] RoanKattouw: wikiuser or wikiadmin? 
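[Editor's note on the masterdom change reviewed above: the fix is to stop shipping master_id.py as a static file with '.pmtpa.wmnet' hard-coded and instead render it from an ERB template driven by $mw_primary, and, the part initially missed, to make the file resource reference that template rather than the old source file. A rough sketch of the pattern (paths, mode and variable scoping are illustrative, not the literal patchset):]

    # modules/coredb_mysql/templates/master_id.py.erb (fragment)
    masterdom = '.<%= mw_primary %>.wmnet'

    # manifest fragment: render the template instead of copying the static file
    file { '/usr/local/bin/master_id.py':
        content => template('coredb_mysql/master_id.py.erb'),
        # previously: source => 'puppet:///modules/coredb_mysql/utils/master_id.py'
        mode    => '0555',
    }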
[18:25:28] there we go [18:25:34] let the recovery flood begin! [18:25:36] notpeter: sweet I'm glad that worked [18:25:37] :) [18:25:39] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 0 seconds [18:25:39] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [18:25:44] notpeter: thanks! [18:25:48] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [18:25:48] thanks for the checking preilly, paravoid :) [18:25:51] hehe [18:25:52] AaronSchulz: wikiuser is the default. But we should set it explicitly somewhere in wmf-config rather than relying on DefaultSettings [18:25:54] check this: http://torrus.wikimedia.org/torrus/Facilities?path=/Power_usage/Total_power_usage/Power_per_site&view=last24h [18:25:57] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [18:26:01] Also, AdminSettings.php has wikiadmin, so all maintenance scripts use wikiadmin [18:26:08] That doesn't explain mw1118 though [18:26:15] mark: oh, cool [18:26:20] mark: haha! [18:26:20] RoanKattouw: oh we mean the software default, haha [18:26:37] Yeah [18:26:39] RoanKattouw: it does, probably UW [18:26:41] Filing a bug for this [18:26:51] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [18:26:52] Hmm haven't checked UW [18:26:59] maintenance should use wikiuser except if the type is set to admin [18:27:02] How would it get wikiadmin, though? Which config variable holds that value? [18:27:06] there is already functionality for this in core [18:27:18] RECOVERY - MySQL Replication Heartbeat on es5 is OK: OK replication delay 0 seconds [18:27:18] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [18:27:24] I like graphs like: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=mw1118.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=API+application+servers+eqiad [18:27:27] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [18:28:03] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [18:28:33] a handfull of app servers have much higher cpu utilization than the rest [18:28:39] RECOVERY - MySQL Replication Heartbeat on es1009 is OK: OK replication delay 0 seconds [18:28:57] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 1 seconds [18:29:20] https://bugzilla.wikimedia.org/show_bug.cgi?id=44251 [18:29:56] binasher: like mw1081 for example [18:30:00] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay -0.000710 seconds [18:30:01] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [18:30:24] RECOVERY - MySQL Replication Heartbeat on es1005 is OK: OK replication delay 0 seconds [18:30:37] lots of busy threads: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&h=mw1081.eqiad.wmnet&v=36&m=ap_busy_workers&jr=&js=&vl=threads&ti=Busy+Threads [18:31:40] i'll restart one of them (1099) to see if it makes a difference [18:31:49] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [18:31:57] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds [18:32:28] it could be keepalive too [18:32:30] RoanKattouw_away: marktraceur UW on commons right now reporting "Internal error: Server failed to store temporary file" for me. 
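[Editor's note on the wikiuser/wikiadmin thread above: the suggestion is to stop relying on DefaultSettings.php for the web-facing database account and state it explicitly in wmf-config, keeping it clearly separate from the admin account that AdminSettings.php hands to maintenance scripts. A minimal sketch with placeholder values (the real credentials live in a private settings file, not in the public config):]

    <?php
    // Web requests use the restricted account granted to the Apache ranges;
    // maintenance scripts get the admin account elsewhere, not here.
    $wgDBuser     = 'wikiuser';
    $wgDBpassword = $wmgDBpassword;   // hypothetical variable supplied by the private config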
[18:32:34] RECOVERY - MySQL Replication Heartbeat on es9 is OK: OK replication delay 0 seconds [18:32:34] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [18:32:51] mark: yeah: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&h=mw1108.eqiad.wmnet&v=15&m=ap_keepalive&jr=&js=&vl=proc&ti=Keepalive+%28read%29 [18:33:20] chrismcmahon: that's the wikiadmin problem from above [18:33:21] mark: can you restart mw1102 instead [18:33:36] done [18:33:36] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds [18:33:49] preilly: i restarted that one a min ago [18:33:54] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 204 seconds [18:33:58] thanks AaronSchulz [18:34:09] binasher: Okay thanks [18:34:12] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [18:35:07] New patchset: Pyoungmeister; "adding syste_role to all application server role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [18:35:32] notpeter: actually [18:35:39] put the full role class name as the title [18:35:42] RECOVERY - MySQL Replication Heartbeat on es1007 is OK: OK replication delay 0 seconds [18:35:45] ok, cool [18:35:52] and also remove the old system role explicitly [18:35:52] AaronSchulz: commons uploads require wikiadmin? [18:35:52] or it won't get removed [18:35:52] RECOVERY - MySQL Replication Heartbeat on db1011 is OK: OK replication delay 0 seconds [18:35:55] oh, lame. ok [18:35:55] (this creates a different file) [18:35:56] https://commons.wikimedia.org/wiki/Special:NewFiles is looking ok and new images are coming in [18:36:01] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay 0 seconds [18:36:01] RECOVERY - MySQL Replication Heartbeat on db48 is OK: OK replication delay 0 seconds [18:36:09] ensure => absent on the definition [18:36:12] yep [18:36:14] AaronSchulz: Are you on the UW issue? chrismcmahon, is it happening all the time? [18:36:27] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [18:36:35] marktraceur: only with an experimental checkbox in preferences [18:36:48] images are being successfully uploaded, but don't know if they're using UW [18:36:54] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [18:37:07] AaronSchulz: what's the check box? chunked transfer? [18:37:12] yeah [18:37:21] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:37:32] srv278 died out of boredom [18:37:34] I guess I won't worry too much, then....that's not something I've dealt with much, so I'm not sure I could help really [18:37:38] it's being dying for months [18:37:41] years [18:37:49] mark: heh [18:37:51] AaronSchulz: why does that use wikiadmin? is it necessary? [18:38:25] no, it's seems like bad configuration for how our maintenance scripts run [18:38:33] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay 0 seconds [18:38:38] they always use an admin account regardless of getDbType() [18:39:00] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [18:39:01] AaronSchulz: fwiw, I get the error and "Chunked..." 
is not checked in my preferences [18:39:07] mark: the srv278 RT was resolved, I just reopened it [18:39:11] RT #24, hahaha [18:39:24] Wed Aug 11 12:27:56 2010 [18:39:27] tell it to decommission that box [18:39:27] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [18:39:37] we're never gonna fix it anymore [18:39:46] RECOVERY - MySQL Replication Heartbeat on es1010 is OK: OK replication delay 0 seconds [18:39:54] "It still happens, never stopped. Let's fix it or decomission it." is what I commented [18:39:58] before you said that [18:40:03] RECOVERY - MySQL Replication Heartbeat on es10 is OK: OK replication delay 0 seconds [18:40:21] RECOVERY - MySQL Replication Heartbeat on db31 is OK: OK replication delay 0 seconds [18:40:30] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay 0 seconds [18:40:30] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [18:41:21] RECOVERY - MySQL Replication Heartbeat on db1039 is OK: OK replication delay 0 seconds [18:41:21] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [18:42:00] RECOVERY - MySQL Replication Heartbeat on db45 is OK: OK replication delay 0 seconds [18:42:18] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [18:42:28] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay 0 seconds [18:43:03] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [18:43:11] was mw1081 restarted recently? [18:43:22] RECOVERY - MySQL Replication Heartbeat on db1048 is OK: OK replication delay 0 seconds [18:43:23] preilly: uptime [18:43:37] Reedy: I mean apache restarted [18:43:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:44:11] Reedy: I already knew it was running for 123 days [18:44:16] RECOVERY - MySQL Replication Heartbeat on es8 is OK: OK replication delay 0 seconds [18:44:33] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 0 seconds [18:44:42] RECOVERY - MySQL Replication Heartbeat on labsdb1003 is OK: OK replication delay seconds [18:45:09] RECOVERY - MySQL Replication Heartbeat on es1008 is OK: OK replication delay 0 seconds [18:45:23] mw1066 enwikibooks: [7cebacc2] /w/index.php?title=Special:RatingHistory&target=Main_Page Exception from line 365 of /usr/local/apache/common-local/php-1.21wmf7/extensions/ReaderFeedback/specialpages/RatingHistory_body.php: Could not create file directory! [18:45:27] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [18:45:41] AaronSchulz: MaxSem mw1066 enwikibooks: [7cebacc2] /w/index.php?title=Special:RatingHistory&target=Main_Page Exception from line 365 of /usr/local/apache/common-local/php-1.21wmf7/extensions/ReaderFeedback/specialpages/RatingHistory_body.php: Could not create file directory! [18:45:47] Is that a fs->swift migration issue? [18:46:30] RECOVERY - MySQL Replication Heartbeat on es7 is OK: OK replication delay 0 seconds [18:46:30] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [18:46:30] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [18:46:49] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay 0 seconds [18:47:00] looks like RatingHistory->makeSvgGraph('reliability', '/mnt/upload7/wi...') doesn't seem to work [18:47:06] AaronSchulz: ^^ [18:47:12] upload*7* ? 
[18:47:16] I thought it was upload6 [18:47:17] mark: did you depool mw1082 from pybal? [18:47:25] yes, as it's broken [18:47:37] Also, /mnt/upload{6,7} aren't mounted on the eqiad Apaches it seems [18:47:42] RECOVERY - MySQL Replication Heartbeat on es6 is OK: OK replication delay 0 seconds [18:47:46] mark: ah, what's wrong with? [18:47:51] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [18:48:09] RoanKattouw: seems to use nfs, I don't maintain the ext though, it was supposed to be disabled years ago [18:48:18] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [18:48:27] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay -0.000820 seconds [18:48:27] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 0 seconds [18:48:36] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [18:48:40] whoops, make that 1085 [18:48:44] * mark changes the pybal pool [18:48:48] RoanKattouw: and they shouldn't [18:48:52] be mounted [18:48:57] 1085 and 1072 [18:49:12] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [18:49:12] RECOVERY - MySQL Replication Heartbeat on db1027 is OK: OK replication delay 0 seconds [18:49:15] for example mw1078 doesn't even have /mnt/upload7/wi... [18:49:23] none of them do preilly [18:49:24] no need [18:49:31] apparently there is a need [18:49:31] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 22 seconds [18:49:34] but please fix that instead :) [18:49:40] mark: well the code seems to be trying to do something with it [18:49:40] paravoid: there is a need to disable that extension [18:49:42] we shouldn't mount across DCs [18:49:48] then that needs to be fixed [18:49:48] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 25 seconds [18:50:00] Special:RatingHistory [18:50:08] preilly: That's because thatext was apparently supposed to have been disabled. AaronSchulz said all of this [18:50:10] paravoid: it was obsoleted by aft [18:50:26] RoanKattouw: Okay thanks [18:50:35] okay, so you're disabling that extension? [18:50:42] AaronSchulz: Are you going to disable it? [18:50:42] I'm disabling it [18:50:43] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [18:50:46] After I check where it's running [18:50:48] thanks [18:50:51] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 0 seconds [18:50:54] I've told features to do this months back now that I think about it [18:51:00] RECOVERY - MySQL Replication Heartbeat on es1006 is OK: OK replication delay 0 seconds [18:51:09] Hmm.... 
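[Editor's note on the pybal depooling above: pybal takes its backend list for each service from a config file with one entry per server, so pulling a broken Apache such as mw1085 out of rotation is a matter of flipping its enabled flag rather than deleting the line. Roughly what such entries look like, shown from memory purely as an illustration:]

    {'host': 'mw1085.eqiad.wmnet', 'weight': 10, 'enabled': False }
    {'host': 'mw1086.eqiad.wmnet', 'weight': 10, 'enabled': True }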
[18:51:14] !log DNS update - deleting wikimania2006/2007 [18:51:15] RoanKattouw: wikinews is the main one and some random places [18:51:16] It is actually running on production wikis [18:51:18] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds [18:51:24] RoanKattouw: yes [18:51:25] Logged the message, Master [18:51:36] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [18:51:42] enwikibooks, enwikinews, huwiki, ruwikinews, strategywiki, trwikinews, plus some test wikis [18:51:44] I think some wikis asked for it to be kept for them [18:51:46] RECOVERY - MySQL Replication Heartbeat on db1026 is OK: OK replication delay 0 seconds [18:51:46] I thought RF was test wikis only [18:51:54] RoanKattouw: it's running on enwikibooks [18:52:03] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [18:52:19] it's the most common cause of exceptions ATM [18:52:39] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [18:52:40] preilly: I just wrote a complete list of wikis in the channel [18:52:44] Read the backscroll, dude ;) [18:52:51] RoanKattouw: calm down [18:53:15] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay 0 seconds [18:53:34] RECOVERY - MySQL Replication Heartbeat on db34 is OK: OK replication delay 0 seconds [18:54:43] New patchset: Catrope; "Disable ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45168 [18:54:49] AaronSchulz: --^^ [18:55:14] RoanKattouw, [22:51:44] I think some wikis asked for it to be kept for them [18:55:38] New patchset: Asher; "eqiad migration: fixing dbtree, hardcoding db-eqiad as master and db-pmtpa as secondary, will make more dynamic laster" [operations/software] (master) - https://gerrit.wikimedia.org/r/45169 [18:55:48] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [18:55:53] New patchset: Pyoungmeister; "adding syste_role to all application server role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [18:55:53] MaxSem: I know, but it's totally broken [18:56:38] It should probably be fixed to not use NFS, but I don't know how to. I imagine AaronSchulz does. In the meantime, I think it's better to disable it rather than have it spew exceptions for almost everything it does [18:56:40] RoanKattouw: why not change wmf-config/InitialiseSettings.php? [18:56:50] preilly: I'd have to change them all to false individually [18:56:54] and I'm lazy [18:57:17] Besides, supposedly someone will fix the extension and reenable it then [18:57:20] RoanKattouw: wow okay [18:57:36] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/45169 [18:58:09] RoanKattouw: can we get https://gerrit.wikimedia.org/r/#/c/45168/1/wmf-config/CommonSettings.php sync'ed [18:58:12] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.085 second response time [18:58:27] notpeter: so what if people put custom stuff in the motd? :) [18:58:28] !log DNS update - adding ca./ja. 
planet to zirconium for testing new planet [18:58:30] preilly: We can if someone (ping AaronSchulz ) reviews it [18:58:38] Logged the message, Master [18:58:46] AaronSchulz: Please +2 https://gerrit.wikimedia.org/r/#/c/45168/1/wmf-config/CommonSettings.php [18:58:52] i see the existing message is added by the debian package, not puppet [18:59:03] in that case either sed it out or leave it there, don't just remove the motd file altogether :) [18:59:23] I was assuming that if they are using system_role that it would be the only thing in use [18:59:24] but sure [18:59:25] !log restarted pdns on ns1, failed during update [18:59:32] RoanKattouw: looks fine [18:59:35] Logged the message, Master [18:59:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45168 [18:59:48] OK, deploying [18:59:50] notpeter: also, the definition title should be the role class name, not the description [18:59:57] ah, ok [19:00:06] and check whitespace [19:00:10] yeah [19:00:40] my editor automatically sets to puppet standard, which is spaces... so I sometimes forget to correct [19:02:57] * Damianz watches the datacenter float across the desert [19:03:17] Damianz: luckily, it won't be 40 years for the datacenter :) [19:03:20] !log catrope synchronized wmf-config/CommonSettings.php 'Disabling ReaderFeedback' [19:03:30] Logged the message, Master [19:03:41] !log catrope synchronized wmf-config/CommonSettings.php 'Disabling ReaderFeedback' [19:03:47] RoanKattouw: odd, adminsettings is only used if $maintenance->getDbType() in doMaintenance, so why are scripts always using it? [19:03:48] Hrmph [19:03:51] Logged the message, Master [19:03:54] Is something wrong with mw1072? [19:03:59] mw1072: Permission denied (publickey,password). [19:04:00] yes [19:04:05] i'll remove it from the node list [19:04:09] Can I comment it out in the .... yes thanks [19:04:22] * RoanKattouw moves to a VisualEditor meeting [19:04:35] New patchset: Jgreen; "fix dhcp mac address for pappas" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45170 [19:04:39] RoanKattouw: just looked at it 5 minutes ago. broken disk it looks [19:04:52] : I/O error, dev sda [19:05:21] RoanKattouw: any idea what is causing, "Internal error in ApiResult::setElement: Attempting to add element imagerepository=shared, existing value is shared" [19:05:41] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45170 [19:06:28] preilly- bug 43849 [19:07:08] so [19:07:12] I guess we're not rolling back eh [19:07:12] New patchset: Pyoungmeister; "adding syste_role to all application server role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [19:07:20] would seem unlikely :) [19:07:27] pgehres: you were talking about your python abilities -- can you check out https://rt.wikimedia.org/Ticket/Display.html?id=4370 ? if you want to come downstairs (or put the kittens on upstairs) i can help you get started [19:09:04] preilly: I've seen that before but I don't remember what causes it [19:09:32] New patchset: Asher; "dbtree lives in operations/software" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45171 [19:09:36] RoanKattouw: I found the fix https://gerrit.wikimedia.org/r/#/c/40562/1 [19:10:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45171 [19:15:20] binasher: do you have any issue with restarting apache on mw1103? 
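[Editor's note on the ReaderFeedback change synced above: wmf-config generally loads extensions behind a per-wiki flag, so the quick global switch-off is to short-circuit the load in CommonSettings.php rather than set every wiki's flag to false in InitialiseSettings.php. A sketch of the pattern (variable name and comment are illustrative, not the literal diff):]

    <?php
    // CommonSettings.php fragment; $wmgUseReaderFeedback comes from InitialiseSettings.php
    if ( $wmgUseReaderFeedback && false ) {
        // '&& false': temporarily disabled everywhere, the extension still expects NFS paths
        include( "$IP/extensions/ReaderFeedback/ReaderFeedback.php" );
    }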
[19:16:59] lol Exception from line 206 of /usr/local/apache/common-local/php-1.21wmf7/languages/Language.php: Invalid language code "{{{1}}}" [19:17:59] I'd agree [19:18:28] haha [19:18:50] thus were spoken the words of celebration: [19:18:54] I guess we're not rolling back eh [19:19:41] hahaha [19:19:41] things mostly looking good? :) [19:19:44] is someone on mw1081 right now? [19:19:44] you tell us :) [19:20:10] Eloquence: completed at 10:18 [19:20:22] Eloquence: things are looking pretty good…. mark, binasher, notpeter paravoid did a wonderful job [19:20:36] erm, I don't think I deserve any credit [19:20:51] credit by association :) [19:20:52] one rarely used extension still wants nfs and was disabled, chunked uploading is being worked on [19:20:53] "paravoid did nothing" [19:20:53] ;) [19:20:58] well, there's also all the prep work.... [19:21:05] at least i think its being worked on [19:21:40] :-) [19:21:40] prep work = doing most of the migration over the weekend when nobody was around to notice ;) [19:21:47] hahaha [19:21:54] UW had/has(?) an issue, but it's solvable I think [19:22:14] if all that's broken is uploadwizard, I think we're good :) it breaks if you look at it funny. [19:22:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [19:22:17] Yes, UW and maintenance scripts are using wikiadmin rather than wikiuser to access the DB apparently [19:22:25] And the eqiad DBs didn't like that [19:22:57] \o/ here's to our hosting future with reduced risk of imminent disaster [19:23:38] well. [19:23:41] provided we keep tampa in shape. [19:23:48] which is gonna be pretty tough [19:24:39] we're getting rather low on capacity on some clusters in tampa already, e.g. squid/varnish [19:24:47] paravoid: when might you get the chance to poke at lvs and the ceph rgws? [19:24:48] and I'm wondering if we should invest on new hardware in tampa or somewhere else instead [19:25:10] mark, I just approved a $260K hardware order for tampa last week :p [19:25:25] AaronSchulz: poke at what? [19:25:58] can someone take down the technical maintenance banner [19:26:02] mark upgrade critical hardware at tampa to buy time to build a new DC? [19:26:05] paravoid: the url for the gateway works 1/2 of the time (2/4 nodes pooled, one that actually works) [19:26:39] it is true that test2wiki is now at eqiad, yes? [19:26:42] Eloquence: new apaches yes [19:26:51] preilly found a fix for the API error lying around in Gerrit, but it has a rebase conflict. Its author is anomie so I've asked him to rebase [19:26:53] was that order of new apaches just for tampa? [19:27:05] as far as I'm aware [19:27:11] the same should be replicated in eqiad.. apache capacity is tight [19:27:20] AaronSchulz: when was the last time you checked? [19:27:25] those new app servers are much nicer that what's in eqiad [19:27:36] paravoid: last week, was it fixed lately? [19:27:38] yes [19:27:39] but tampa is gonna need quite a bit more than that if it needs to keep up ;) [19:27:44] binasher, 60 servers for Tampa, according to CT [19:27:50] app server capacity is seriously tight in both dc's [19:28:56] RoanKattouw- Talking about me like I'm not even here? 
;) [19:29:13] anyway [19:29:17] let's discuss that in february :) [19:29:19] Can we just decommission tampa and use SF instead [19:29:29] not our current SF location [19:29:30] anomie: haha sorry [19:31:55] Eloquence: Just order a truck too :D [19:32:39] RoanKattouw- There you go, rebased [19:33:21] Thanks man [19:35:36] LeslieCarr: I was not touting my python skills, just saying that I much prefer it over PHP :-p [19:35:55] sadly though, I am booked today writing JS [19:35:56] yeah, we'll need to shut down labs for the extent of the truck move ;) [19:36:09] would be nice to get eqiad up before that [19:36:12] RoanKattouw: ahhhh, it's PrivateSettings [19:36:21] if ( php_sapi_name() == 'cli' ) { [19:36:39] also... [19:36:46] no wonder I couldn't find where it was being set, I was looking in a bunch of git repos [19:36:48] latency is gonna be quite a bit worse between eqiad-SF [19:36:51] ok [19:37:02] some things which are okish now, will be unbearable from there ;) [19:37:07] !log upgrading labsconsole to 1.21wmf8 [19:37:18] Logged the message, Master [19:37:18] mark: how much worse? [19:37:19] pgehres: free tomorrow? ;) [19:37:34] 10ms? [19:37:35] Ryan_Lane: why don't you tell me [19:37:37] LeslieCarr: we shall see. this whole job thing gets in the way sometimes [19:37:37] ping eqiad [19:37:42] 60ms or so? [19:37:46] hehe [19:37:54] so 30 worse? [19:37:59] 40ms worse [19:38:07] it's only 20 now? [19:38:11] I really could just ping [19:38:11] :D [19:38:12] 26 [19:38:15] last time I checked it was 26 [19:38:16] yes you slacker [19:38:18] :D [19:39:38] so? [19:39:43] New review: Andrew Bogott; "lgtm" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45098 [19:39:44] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45098 [19:40:35] it's funny, yesterday's esams outage probably had more impact than the switchover today :-) [19:41:00] Ryan_Lane: i see 81ms from eqiad to the office [19:41:07] wow [19:41:07] may not be the most optimal path [19:41:15] just saying ;) [19:41:19] I keep forgetting how large the US are [19:41:40] 81ms is much more that what I'd guess [19:42:49] New review: Andrew Bogott; "Much better, thanks for rewriting!" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45107 [19:42:50] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45107 [19:45:35] preilly: Thanks for finding that fix, I merged it [19:45:54] mark: Are you OK with me deploying a fix for that API error preilly found, and AaronSchulz deploying a fix for the wikiadmin behavior? [19:46:12] yes [19:46:19] cool [19:46:34] OK, I'll do the API fix now [19:51:57] New review: Andrew Bogott; "One question, one minor request" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/45157 [19:52:50] New patchset: Dzahn; "add ja./ca. planet Apache configs for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45174 [19:55:09] New review: Dzahn; "it's just on zirconium, not cluster apache conf" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45174 [19:55:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45174 [19:55:18] RoanKattouw: the change will break asher's rate limiting mysql trick for the job queue though (unless runJobs is changed to use wikiadmin) [19:55:29] hmm [19:55:37] Better clear it with binasher first then [19:55:49] (e.g. 
connection limit) [19:56:02] ...who I don't see around here [19:56:11] where are the maintenance cronjobs going to from hume? [19:57:40] New patchset: Jgreen; "attempt to configure pappas to pxeboot off of boron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45175 [19:58:41] Oh I think he's having lunch [19:59:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45175 [20:03:01] Meh, cherry-picking the API fix but it's conflicting again [20:04:39] RoanKattouw- Probably just spaces. Want me to prepare the patch against wmf7? [20:04:59] I got it [20:05:05] nifty [20:05:08] It really was just spaces [20:05:11] But I had to check [20:05:20] And RELEASE-NOTES was being annoying [20:05:33] RELEASE-NOTES is *always* annoying. [20:06:54] from #wikimedia-tech just now: [20:06:55] (01:04:13 PM) yehar: I edited an equation in an article and now it says: "Failed to parse (Missing texvc executable; please see math/README to configure.)" [20:06:55] (01:04:20 PM) yehar: http://en.wikipedia.org/w/index.php?title=Equivalent_rectangular_bandwidth [20:07:03] Ah, whoops [20:07:10] Is texvc not installed on the eqiad apaches? [20:07:20] AaronSchulz: ---^^ [20:07:25] thanks RoanKattouw [20:08:03] texvc does not appear to be puppetized [20:08:16] I think it's supposed to be in the wikimedia-task-appserver package? [20:09:10] hmm.. don't see it in dpkg -L ... [20:09:15] RoanKattouw: there is the apache script, but there is no bastion one [20:09:16] out for lunch though ..bbiab [20:09:33] meh, I'll just recompile [20:09:55] Thanks man [20:10:08] it used to compile on every scap [20:10:46] Maybe the scap rewrite took that out? [20:10:57] no it was taken out a long time ago [20:11:15] I have the API fixes cherry-picked and ready, just waiting on Jenkins to confirm that the tests aren't broken [20:11:21] anyway, git deploy will eventually deal with this sanity (neither manual for new branches nor all the time for no reason ideally) [20:12:21] !log Recompiled textvc on all hosts [20:12:27] RoanKattouw: http://en.wikipedia.org/wiki/Equivalent_rectangular_bandwidth looks fine after purge [20:12:32] Logged the message, Master [20:12:34] AaronSchulz: so how is it deployed now? [20:12:49] it's on http://wikitech.wikimedia.org/view/Hetdeploy [20:12:58] a line ddsh cmd ;) [20:13:16] a scap-recompile [20:13:18] ok [20:13:39] good [20:13:41] the 1 line should be a script on the deploy host just like scap [20:14:02] though like I said git deploy will redo this stuff anyway [20:14:08] you can make that happen! :) [20:15:16] WTF... of course Jenkins breaks on me just as hashar leaves town [20:15:27] RoanKattouw: I am there still [20:15:32] RoanKattouw: what is happening ? 
:-D [20:15:48] wmf7/wmf8 commits are failing Jenkins for apparently no reason [20:15:49] See https://gerrit.wikimedia.org/r/#/c/45177/ [20:16:08] eah [20:16:08] Wait, only the wmf7 one failed [20:16:09] yaeya [20:16:12] Its sibling https://gerrit.wikimedia.org/r/#/c/45178/ succeeded [20:16:13] let s revert my changes [20:16:29] the jenkins job share the workspace between branches [20:16:34] so if a job build master then wmf [20:16:40] it ends up having to fetch ALL submodules [20:16:43] which lead to a failure [20:16:54] if the job next to it runs on wmf, it just refresh the modules [20:17:20] Oooh [20:17:22] Right [20:17:22] anyway change 45177 had a  lint failure apparently [20:17:28] So it fails the first time and succeeds the second time [20:17:29] https://integration.mediawiki.org/ci/job/mediawiki-core-lint/4832/console [20:17:32] holy hell [20:17:32] It claims to [20:17:40] yeah that failed [20:17:41] That's not a lint failure [20:17:47] That's an internal failure [20:18:33] sorry about that Roan :( [20:18:43] No worries [20:18:51] mark: Is it accurate to say that our data center migration is "over"? [20:18:58] yes [20:19:01] I ask because that was just tweeted and I was like "hah? That's quick" [20:19:04] Alright [20:19:12] That's a pleasant surprise then :) [20:19:18] well, the app servers part of that ;) [20:19:30] Yeah, I know, Swift grumble grubmle [20:19:36] and some misc bits here and there [20:19:41] but all in all, we're almost there now [20:20:18] hashar: How do I trigger another run, "recheck", right? [20:20:36] Oh nm it's picking it up [20:20:55] RoanKattouw: add a cover message in gerrit with a CR+2 vote [20:22:50] RoanKattouw: failing for whatever reason :( [20:23:17] Yeah it's failing again [20:23:23] I'm just gonna bypass Jenkins for this commit [20:23:31] RoanKattouw: yeah [20:23:39] RoanKattouw: i know there are some issues with the wmf branches [20:23:56] OK now deploying the API fix [20:23:56] I assume that this is an issue, albeit a minor one, considering it's not public-facing: https://noc.wikimedia.org/cgi-bin/report.py [20:24:38] RoanKattouw: I guess you can most often safely ignore the tests on wmf branches :-D [20:25:05] phuzion: heh, whoops [20:25:15] RoanKattouw: I assume that's not intentional ;) [20:26:41] !log catrope synchronized php-1.21wmf7/includes/api/ApiQueryImageInfo.php 'Fix API error in imageinfo' [20:26:52] Logged the message, Master [20:26:54] !log catrope synchronized php-1.21wmf8/includes/api/ApiQueryImageInfo.php 'Fix API error in imageinfo' [20:27:05] Logged the message, Master [20:32:10] ok [20:32:18] if all is well, I'll go offline now [20:32:30] call/text if not ;) [20:34:00] mark: I think that everyone is afk now :) [20:39:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [20:59:16] AaronSchulz: hey [20:59:26] AaronSchulz: so, everything okay with ms-fe.eqiad? [20:59:38] I haven't checked again yet [21:00:04] okay [21:00:18] we're thinking of possibly enabling mw writes to ceph this week [21:00:29] but only if potential failures are going to be non-fatal [21:00:47] so, maybe originals + async journal replaying? [21:00:49] what do you think? [21:01:49] that's what I was thinking, doing it async [21:02:09] paravoid: is ceph supposed to have all the originals? I get 404s for stuff that is in swift on testwiki [21:02:31] I've stopped the replication process a few days ago [21:02:38] when I started on thumbs [21:02:46] is this an old file? [21:02:50] could you give me the filename? 
[21:03:10] lvs setup seems ok now [21:03:17] http://test.wikipedia.org/wiki/File:Testfile_1357988954.jpg [21:04:39] 12 Jan [21:04:44] so yeah, could be that [21:04:54] same with http://test.wikipedia.org/wiki/File:Wikipedia-logo-test_%28proposed%29.png from 2006 [21:07:05] oh testwiki? [21:07:13] I wonder if I synced those containers [21:07:14] paravoid: a file listing of all the public zone is empty [21:07:19] * paravoid checks [21:08:37] yeah, didn't do those [21:08:53] what *was* done? [21:09:02] ^wikipedia-commons-local-(public|deleted)\.[0-9a-f]{2}$ [21:09:02] ^wikipedia-[a-z][a-z]-local-(public|deleted)$ [21:09:02] ^wikipedia-[a-z][a-z]-local-(public|deleted)\.[0-9a-f]{2}$' [21:09:02] ^wikipedia-[a-z]{3}-local-(public|deleted)$ [21:09:02] ^wikipedia-(be-x-old|wg-en|zh-classical|zh-classical|zh-min-nan|zh-min-nan|zh-yue|zh-yue-local)-local-(public|deleted)$ [21:09:05] ^(wikibooks|wikimedia|wikinews|wikiquote|wikisource|wikiversity|wikivoyage|wiktionary)-.*-local-(public|deleted)(\.[0-9a-f]{2})?$ [21:09:08] ^global-.* [21:09:10] .*-timeline-* [21:09:17] * Damianz finds paravoid's medication [21:09:30] the last one is wrong [21:09:46] I wonder if my notes are wrong or if it was done wrong :) [21:10:07] global/timeline are not done anyway [21:10:27] they suffer from the small file size issue that affects thumbs too [21:18:18] New patchset: DamianZaremba; "BZ 26784 - Adding process monitoring for ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:21:39] paravoid: is there an ms-be1008? [21:22:02] it died when chris fitted the H710 on it [21:22:05] doesn't power up anymore [21:22:44] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 203 seconds [21:23:21] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 215 seconds [21:24:04] cmjohnson1: btw, do you want me to file a ticket about that or are you handling it already? [21:24:12] paravoid: so i called dell today ...and after I was on phone w/them i noticed it was on [21:24:20] New review: DamianZaremba; "That process check is local, need to do it via nrpe." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:24:22] haha [21:24:28] I didn't do anything [21:24:30] but there is still a problem...cos i cycled it and it has not come back up and it's been 2 hours [21:24:33] I mean, I couldn't [21:24:43] couldn't login to mgmt at all [21:24:46] and i created a ticekt already [21:24:54] neither could i until today [21:25:02] most likely a board change [21:25:11] nod [21:25:12] thanks :) [21:26:21] New patchset: DamianZaremba; "BZ 26784 - Adding process monitoring for ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:28:01] paravoid: testing around it seems to work fine [21:28:06] cool [21:28:08] New review: DamianZaremba; "Sorry hashy" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/45097 [21:28:35] getDirectoryList() for empty dirs returns array( false ), but that may be a MW bug [21:28:55] paravoid: and the leveldb data is on regular hdds? 
[21:29:38] seems reasonably quick in any case [21:29:42] yes [21:30:00] haven't done anything about leveldb, so I guess so :) [21:30:04] note that it's also replicating right now [21:30:15] from 1001-1004 to 1007, 1009-1012 [21:30:20] multiple MB/s [21:30:38] so it should less fast that normally [21:34:51] New patchset: DamianZaremba; "BZ 26784 - Adding process monitoring for ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:38:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:42:44] <^demon> Ryan_Lane: hi. https://gerrit.wikimedia.org/r/#/c/44436/ plz. ty. [21:43:22] warn folks of a restart ;) [21:43:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44436 [21:44:04] force running puppet [21:45:01] <^demon> !log gerrit restarting. don't panic. [21:45:10] bleh. upgrading labsconsole is harder now that we have the submodles in the branch :D [21:45:14] Logged the message, Master [21:45:28] I'm really kidding, it's nice, but it's a change [21:45:40] <^demon> which branch? [21:45:45] wmf branches [21:46:04] <^demon> we've had those. labs is late to the party [21:46:23] this is the first upgrade I've done since then [21:46:37] <^demon> :) [21:52:35] New review: Silke Meyer; "Concerning your second comment - it is related to the rest: I had to add delete "git clone Wikibase"..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/45157 [21:53:04] 23:30 < paravoid> so it should less fast that normally [21:53:06] wtf [21:53:11] what was I thinking? [21:53:30] *I* can't even parse that [21:54:06] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [21:54:53] New patchset: Ori.livneh; "Change UDP logging prefix for EventLogging to "EventLogging"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45243 [21:55:15] !log msw-a7-eqiad power was accidently removed while switch was being moved [21:55:26] Logged the message, Master [21:55:26] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [21:57:06] paravoid: :D [21:57:23] urgh, bloodsugar trying to kill me [21:58:46] * Damianz gives RobH a kit-kat [21:59:08] opposite [21:59:14] gimme insulin ;] [21:59:25] that's just no fun [22:01:01] New review: Ori.livneh; "-2ing to block deployment until eqiad deployment freeze is over." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/45243 [22:05:25] New patchset: Dzahn; "turn ja./ca. planets into real virtul hosts with doc roots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45244 [22:05:29] PROBLEM - MySQL disk space on neon is CRITICAL: DISK CRITICAL - free space: / 353 MB (3% inode=74%): [22:08:40] paravoid: for clarity, when was the ceph copy script first started? [22:10:07] ganglia says 12/21 [22:16:39] hello [22:16:46] hi Tim [22:16:58] TimStarling: you'll be happy to know you won't be greeted with smoke and fumes [22:17:08] (yet) [22:17:09] hahaha [22:17:13] excellent [22:17:30] yes, that'll start when the SF people go home [22:17:57] At least peak traffic should be us time... [22:18:45] so I'll just read the IRC logs then? [22:19:12] paravoid: so there is nothing blocking ceph atm? [22:19:23] I'm afraid there is [22:19:34] controllers? 
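[Editor's note on the ircecho monitoring change merged above: the earlier review point was that a plain process check only works on the host running it, so the Icinga/Nagios server has to call the check over NRPE and the monitored host needs the command defined locally; alerts of the form "NRPE: Command check_ircecho not defined", seen further down, are what it looks like when only one half is in place. A rough sketch of the two halves (paths, names and thresholds are illustrative):]

    # on the monitored host (nrpe config): define the local command
    command[check_ircecho]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a ircecho

    # on the monitoring server: call that command remotely over NRPE
    define service {
        host_name             neon
        service_description   ircecho_service_running
        check_command         nrpe_check!check_ircecho
    }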
[22:19:36] we're waiting for the h310s to be fully migrated to h710s [22:19:46] that needs a few days it seems [22:19:57] it's been running for ~24h now [22:20:13] then after that happens we need to see how well thumbs replication will work [22:20:18] hopefully a lot better [22:20:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45244 [22:20:33] we have the varnish rewrite.py bits in gerrit, these needs a bit polishing and deployment [22:20:46] the VCL replacing rewrite.py that is [22:20:55] paravoid: what kind of replication? [22:21:07] we still completely lack puppetization & monitoring which sucks [22:21:07] the thumbs one? [22:21:11] swiftrepl [22:21:22] ah, that script [22:21:26] with the H310 it was copying with 5-6MB/s [22:21:31] disks were at 100% [22:21:51] ssd journal is (probably) going to be a huge help with small writes [22:21:54] but not with the h310 [22:22:02] https://bugzilla.wikimedia.org/show_bug.cgi?id=44258 "wikibugs missing from #mediawiki" [22:22:10] we still have more than 20T to go, so 5-6MB/s isn't really going to work :-) [22:22:32] btw, cleanupUploadStash has been running since Friday :-) [22:22:36] still cleaning up commons [22:22:46] the rest are done [22:22:57] is puppet preventing cron overruns? [22:23:01] nope [22:23:17] but I've been running it in a screen [22:23:21] well hopefully it will be faster after the initial run [22:23:32] yes [22:23:39] we're already down 2TB I think [22:23:56] and it was 5? or something [22:24:34] so it probably needs a week more or less [22:25:28] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:25:31] anyone have a minute to check out ^^ ? [22:25:43] yeah it was 5 [22:26:59] https://blog.wikimedia.org/2013/01/19/wikimedia-sites-move-to-primary-data-center-in-ashburn-virginia/#comments [22:27:02] so I think the next step after the h310 would be to start writing originals to ceph on a more permanent basis [22:27:09] i.e. journal/your script [22:27:13] I love how pressing "page 2" gives no results [22:27:18] ahh, typical wordpress [22:27:46] paravoid: I'd want to run copyFileBackend a few times first (with --missingonly) [22:27:53] what's that? [22:27:54] * AaronSchulz doesn't trust stuff :) [22:28:14] Deploying a JS fix now [22:28:19] paravoid: copies containers from one backend to another [22:28:41] containers? [22:28:44] oh, you really should [22:28:46] * AaronSchulz would also run setZoneAccess after that [22:28:55] and then the sync script [22:29:00] wait, you mean the container themselves or their contents? [22:29:13] the contents [22:29:17] ah [22:29:22] !log catrope synchronized php-1.21wmf7/resources/jquery/jquery.client.js '8a06466e1697f58717e92532437be642a796d5a1' [22:29:26] LeslieCarr: 22:25:36 err: Could not parse for environment production: Syntax error at ':'; expected '}' at /var/lib/jenkins/jobs/operations-puppet-validate/workspace/manifests/nagios.pp:323 [22:29:33] Logged the message, Master [22:29:35] !log catrope synchronized php-1.21wmf8/resources/jquery/jquery.client.js '8a06466e1697f58717e92532437be642a796d5a1' [22:29:43] note that swiftrepl doesn't account for deletes [22:29:46] Logged the message, Master [22:29:47] we really need the journal for this [22:29:55] The Journal. 
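[Editor's note on the cron-overrun question above: once a long cleanup like cleanupUploadStash moves from a screen session into cron, the usual guard against overlapping runs is a non-blocking lock around the job, so a new invocation exits immediately if the previous one is still running. A minimal sketch (schedule, lock path and the exact maintenance invocation are illustrative):]

    # crontab fragment: flock -n fails straight away, rather than queueing,
    # if the lock is still held by the previous run
    0 1 * * * flock -n /var/lock/cleanupUploadStash.lock \
        /usr/local/bin/mwscript maintenance/cleanupUploadStash.php --wiki=commonswiki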
[22:30:21] http://online.wsj.com/home-page [22:30:39] * AaronSchulz ducks [22:31:59] pay wall [22:35:20] http://www.thejournal.org/ [22:36:26] /var/lib/puppet on neon is 5.8GB [22:36:30] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:36:32] hahaha [22:36:35] that seems to be the main reason it's almost out of disk space [22:36:38] PROBLEM - ircecho_service_running on neon is CRITICAL: NRPE: Command check_ircecho not defined [22:36:44] specifically /var/lib/puppet/clientbucket [22:36:45] filebucket [22:36:47] yeah [22:36:48] kill that [22:37:20] just rm -rf /var/lib/puppet/clientbucket/* ? [22:37:25] yeah [22:37:42] that's archived versions of previous configuration files [22:37:47] icinga's I presume [22:37:52] PROBLEM - ircecho_service_running on spence is CRITICAL: NRPE: Command check_ircecho not defined [22:38:14] yep, seems so [22:38:40] !log on neon: cleaned up /var/lib/puppet/clientbucket/* since it was out of disk space [22:38:45] thanks [22:38:50] Logged the message, Master [22:39:01] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:39:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:41:54] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:43:22] why do some of the eqiad apaches have 12 CPUs and some 24? [22:43:37] TimStarling, HT [22:45:14] isn't that disabled by default now? [22:46:00] in theory;) [22:47:32] New patchset: Dzahn; "modify index.html template to link to prod sites instead of labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45256 [22:48:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45256 [22:49:21] yes, it does seem to be hyperthreading [22:49:39] the test for that has changed since I last had to care about it [22:50:29] New patchset: Silke Meyer; "Moves Wikidata items to main namespace, imports main page to different namespace" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [22:52:02] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:52:23] is it disabled by defualt though ? [22:52:40] i guess obviously not ;) [22:53:20] i dont think Dell even delivers all boxes with the same BIOS settings :p [22:53:25] https://gerrit.wikimedia.org/r/45251 mutante ? [22:53:47] TBH that's probably what happened - we just stuck with whatever dell had delivered [22:55:13] there's a kernel boot parameter "noht" which disables hyperthreading, it's not specified [22:55:23] ^demon: TimStarling: Download distributor seems broken [22:55:26] https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Vector [22:55:28] https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Gadgets [22:55:29] empty list [22:55:44] (not even master) [22:58:46] New patchset: Silke Meyer; "Moves Wikidata items to main namespace, imports main page to different namespace" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [22:59:08] <^demon> Krinkle: Looking. [22:59:51] <^demon> Hmm, wfm locally. Wonder if it's a proxy issue. 
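[Editor's note on the clientbucket cleanup above: the one-off rm frees the space, but the directory grows back because the puppet client filebuckets a copy of every file it replaces. A common follow-up is a periodic age-based prune so old backups expire on their own; a sketch (the 14-day threshold is illustrative):]

    # drop filebucketed copies older than 14 days, then clear out now-empty directories
    find /var/lib/puppet/clientbucket -type f -mtime +14 -delete
    find /var/lib/puppet/clientbucket -type d -empty -delete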
[22:59:53] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [23:01:48] broke for me locally [23:02:10] we're using a CONNECT method proxy? [23:02:14] <^demon> Yep. [23:03:00] what URL works for you? [23:03:02] !log labsconsole upgraded to 1.21wmf8 [23:03:13] Logged the message, Master [23:03:39] hello tim [23:03:44] are you fixing all the problems we left behind? [23:04:01] ^demon hasn't left yet [23:04:03] !log echo enabled on labsconsole [23:04:09] :o [23:04:14] Logged the message, Master [23:04:19] very courteous of him ;) [23:04:24] !log Alex Monk's echo changes to OpenStackManager deployed on labsconsole [23:04:36] Logged the message, Master [23:04:53] ^demon: I tried https://api.github.com/repos/wikimedia/mediawiki-extensions-Vector/master and I got a 404 [23:06:14] <^demon> string(81) "https://api.github.com/repos/wikimedia/mediawiki-extensions-Vector/tarball/master" [23:06:15] <^demon> string(86) "https://nodeload.github.com/wikimedia/mediawiki-extensions-Vector/legacy.tar.gz/master" [23:06:18] <^demon> Both wfm locally. [23:06:25] <^demon> (I'm not using a proxy) [23:06:35] shall I fix one issue then? [23:07:13] ah right, I missed the /tarball/ [23:07:34] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [23:08:00] New patchset: Mark Bergsma; "Add eqiad internal range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45262 [23:08:25] <^demon> That's what I thought needed doing ^ [23:08:46] curl: (56) Received HTTP code 403 from proxy after CONNECT [23:08:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45262 [23:08:53] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [23:09:38] try now :) [23:10:22] <^demon> WFM on Ext:APC [23:10:32] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [23:10:37] did someone reboot srv278 ? [23:10:40] like on purpose ? [23:10:46] please kill that box [23:10:47] hard [23:10:51] Ahm, it's srv278 [23:10:59] ok, seems like it rebooted on its own ... [23:11:01] <^demon> Krinkle: Seems to be working now. Some extensions may *look* broken for a bit until the cache entries expire. [23:11:05] from my quick check at syslog [23:11:07] yes [23:11:10] srv278 does that [23:11:10] It has been broken for as long as I've worked here [23:11:12] for years already [23:11:17] haha [23:11:24] goddamn [23:11:26] ticket #24 [23:11:28] yes [23:11:42] Which really means the issue predates RT [23:11:54] yes [23:12:06] ^demon: thx [23:12:09] the first 50 tickets or so I flushed straight from memory on the first day [23:12:21] Cause I remember you guys filed ~100 tickets during the first weekend in DC [23:12:28] damn :) [23:12:38] anyone got a few minutes and want to check out https://gerrit.wikimedia.org/r/#/c/45251/5 ? [23:12:52] i mean https://gerrit.wikimedia.org/r/#/c/45251/6 [23:13:02] i'm out again ;) [23:13:05] hehe [23:13:05] bye [23:13:07] LeslieCarr: I did look at it, it looked complicated [23:13:13] hehe [23:13:19] kaldari: I deployed Echo to labsconsole, and OpenStackManager also has some Echo support [23:13:23] kaldari: thanks to krenair [23:13:30] have you seen all the different ways we create apache servers? ;) [23:13:46] also, yes, we need to use the new module everywhere - however that's a larger set of patchsets (not just 1!) 
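[Editor's note on change 45251 being discussed above: the intent in the patchset title is that every class which sets up an Apache also removes Debian's default vhost, with the removal defined once in the shared webserver class so individual manifests such as gerrit.pp no longer carry their own copy. A rough sketch of the shared piece (resource shape is illustrative; as noted in the channel, the real tree had several competing ways of creating Apache servers):]

    # in the common webserver/apache class, pulled in by everything that installs apache2
    file { '/etc/apache2/sites-enabled/000-default':
        ensure  => absent,
        require => Package['apache2'],
        notify  => Service['apache2'],
    }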
[23:14:27] Ryan_Lane: cool, hope we don't break anything :) [23:15:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [23:15:23] We tested this first (labsception :D). It shouldn't break anything. [23:15:59] heh [23:16:01] indeed [23:16:59] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [23:17:06] LeslieCarr: the idea is to ensure the default site is absent everywhere? [23:17:17] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [23:17:20] yep [23:17:30] what about gerrit.pp? [23:17:44] there was an ensure=>absent that you took out [23:17:58] so gerrit uses webserver => apache [23:18:03] i mean webserver:apache [23:18:07] which has the ensure => absent in that [23:18:12] don't want to define it twice [23:18:15] right [23:18:54] <^demon> Shorter gerrit.pp makes me happy. [23:19:14] PROBLEM - Puppet freshness on db60 is CRITICAL: Puppet has not run in the last 10 hours [23:19:21] oh in that case, i better put a bunch of comments in to fix that ;) [23:19:30] robla: so one thing still left is to fix the use of AdminSettings in PrivateSettings [23:19:41] <^demon> We still use AdminSettings in production? [23:19:45] <^demon> I thought I killed that ages ago. [23:20:17] this would make cli scripts not use the admin db user all the time (which is not usable by apaches, at least in eqiad) [23:20:37] there really is no reason to run as wikiadmin all the time anyway [23:20:57] How's pmtpa -> equiad going, btw? Going to go to press soon. [23:21:00] RoanKattouw: heh, I think asher disappeared :) [23:21:07] More stuff to do tomorrow, day after I assume? [23:21:19] maybe I should just hack runJobs to use wikiadmin and the go through with it [23:22:24] Jarry1250: for all press statements the pr folks are the people to talk to [23:22:29] LeslieCarr: looks like you did duplicate the default site in openstack.pp [23:22:33] oh [23:22:36] thanks for the fyi [23:22:38] Why is "(bug 44141)" not being linked in gerrit? [23:22:40] https://gerrit.wikimedia.org/r/#/c/44825/ [23:22:44] it includes webserver::apache2 which has the default site [23:22:59] LeslieCarr: Well we're only talking the Signpost here [23:23:15] did our comment parsers not survive the migration (if gerrit was migrated today as well) [23:23:17] "press" is, I suppose, an exagerration. [23:23:28] *exaggeration [23:23:35] <^demon> Krinkle: We haven't upgraded yet. [23:23:38] <^demon> Current regex is "\\b([bB][uU][gG]\\:\\s+#?)(\\d+)\\b" [23:23:45] we have cancelled the other maintenance windows this week and expect to be running smooth (afaik) [23:23:48] <^demon> Only difference was me adding \\: the other day. [23:23:55] of course, as with any day, unexpected things happen ;) [23:24:02] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [23:24:03] ^demon: ? Why? that would break everything [23:24:03] Oh, great to hear. [23:24:10] who puts colons in between? I've never seent aht [23:24:19] Krinkle: gerrit was already at eqiad [23:24:20] New review: Tim Starling; "Fine except for the duplicate default site in openstack.pp" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45251 [23:24:27] <^demon> Krinkle: To support linking in the new Bug: 1234 format you can use in footers. [23:24:32] ^demon is ahead of the curve [23:24:50] I did have one question: Under the new scheme, will any traffic ordinarily be routed through pmtpa? [23:24:50] what "new format"? 
And why would it have to break existing commit messages with "bug 123" inline
[23:24:51] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251
[23:24:59] Bug: 123 implies it is on a stand alone line
[23:25:04] in sentence flow it looks off
[23:25:08] <^demon> Yep, and that's how we should be doing it from now on.
[23:25:11] <^demon> Lemme find the example.
[23:25:17] I don't need an example
[23:25:23] I understand completely
[23:25:29] but it should've been added as a new thing, not replace
[23:25:44] <^demon> https://gerrit.wikimedia.org/r/#/c/44434/
[23:25:50] It is perfectly valid to just "mention" a bug in a sentence
[23:25:56] <^demon> Yeah, I should've made the :? as optional.
[23:25:59] <^demon> That was a whoops.
[23:26:02] k
[23:26:08] PROBLEM - Puppet freshness on db55 is CRITICAL: Puppet has not run in the last 10 hours
[23:27:14] New patchset: Demon; "The colon in bug references should be optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45264
[23:27:18] <^demon> Ryan_Lane: ^
[23:28:02] RECOVERY - NTP on srv278 is OK: NTP OK: Offset -0.002769708633 secs
[23:28:04] <^demon> Krinkle: This is why I want it the new way though :) https://gerrit.wikimedia.org/r/#/q/bug:43523,n,z
[23:28:31] <^demon> I also added matching for RT.
[23:28:34] ^demon: hm.. interesting. no "message:" prefix
[23:28:40] ^demon: Everytime you restart gerrit kittens get very worried.
[23:28:48] it is special cased for one-liner key/value paris?
[23:28:52] I guess nobody has looked at fatal.log recently, it shows tmh100* being broken
[23:29:08] [22-Jan-2013 23:23:17] Fatal error: wikiversions.cdb has no version entry for `DB connection error: Access denied for user 'wikiadmin'@'10.64.16.146' (using password: YES) (10.0.6.43)`.
[23:29:10] <^demon> Krinkle: It has to be in the footer. Gerrit calls them "tracking ids" internally.
[23:29:18] funny kind of version
[23:30:11] huh, I though that was ephemeral when I checked it
[23:30:29] * AaronSchulz remembers joking about that to roan
[23:30:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45264
[23:30:42] ^demon: ^^
[23:30:59] force running puppet
[23:31:00] <^demon> thanks.
[23:31:14] <^demon> Damianz: People should panic less :)
[23:31:14] PROBLEM - Puppet freshness on db65 is CRITICAL: Puppet has not run in the last 10 hours
[23:31:14] PROBLEM - Puppet freshness on db57 is CRITICAL: Puppet has not run in the last 10 hours
[23:32:52] ^demon: Topic for #wikimedia-panic: Panic room | Abandon all hope, ye who join here
[23:33:37] I guess it comes from nextJobDB.php
[23:34:00] yep
[23:36:06] <^demon> Krinkle: Bug linking fixed. Thanks for noticing.
[23:36:11] yw
[23:36:28] TimStarling: would it work as wikiuser?
[23:37:19] the grant is broken
[23:37:22] best to just fix it
[23:37:25] | 10.64.0.0/255.255.252.0 | wikiadmin |
[23:37:36] mysql doesn't do CIDR last time I checked
[23:38:37] Sorry, just to check, under the new scheme, will any traffic ordinarily be routed through pmtpa?
[23:41:03] Jarry1250: I believe a small amount of traffic will still go to pmtpa, because some upload-related services are still there
[23:41:09] I guess that https://bugzilla.wikimedia.org/show_bug.cgi?id=44259 is INVALID as test.wikipedia.org will be dead now (no NFS anymore)?
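For reference, the bug-linking regression discussed above can be checked against sample commit-message text. The "current" pattern below is the one ^demon quoted (with the config-file double escaping removed); the "fixed" pattern is only a reconstruction of the optional-colon change in Gerrit change 45264, not the exact text that was merged, and Python's re module stands in for Gerrit's own commentlink matching.

```python
# Sketch of the commentlink regression, purely for illustration; Gerrit
# applies its pattern in the commentlink config, not in Python.
import re

# Pattern as quoted by ^demon, minus the config-file double escaping:
current = re.compile(r"\b([bB][uU][gG]\:\s+#?)(\d+)\b")
# Assumed shape of the fix ("The colon in bug references should be optional"):
fixed = re.compile(r"\b([bB][uU][gG]\:?\s+#?)(\d+)\b")

samples = [
    "Bug: 44141",                      # new footer style
    "(bug 44141) inline mention",      # old inline style
]
for text in samples:
    print(f"{text!r:35} current={bool(current.search(text))} "
          f"fixed={bool(fixed.search(text))}")
# The mandatory colon is why inline "(bug 44141)" mentions stopped linking.
```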
Somebody please correct me if I'm wrong
[23:41:37] RoanKattouw: Okay, thanks
[23:41:48] <^demon> andre__: I think testwiki should be made into a normal cluster wiki like test2wiki.
[23:41:56] <^demon> I don't see any reason to totally delete it.
[23:42:26] Jarry1250: Mind you I'm not 100% sure that's accurate
[23:42:45] ^demon, oh sure, wasn't after that, more like explaining properly in that bug report the reasons :)
[23:42:58] Actually, this looks pretty conclusive: http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[23:43:06] Several hundreds of megabits/s going to Swift pmtpa
[23:43:31] RoanKattouw: scalars are still there
[23:43:55] <^demon> andre__: I don't know when it'll get unlocked. It won't be enabled again as-is. Probably needs apache config changes first.
[23:43:58] they will migrate later
[23:44:02] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155
[23:44:06] I see scalers in eqiad too, or are those not being used?
[23:44:08] ^demon, okay, thanks!
[23:44:09] <^demon> But as far as keeping test.wp.o, I think it's nice having 2 testwikis.
[23:44:16] well, they are there, not used though
[23:44:29] (Incidentally, why is ganglia password protected these days?)
[23:44:37] Jarry1250, XSS
[23:44:42] Jarry1250: Because we found XSS vulnerabilities in Ganglia's web interface
[23:44:58] | GRANT ALL PRIVILEGES ON `%a%`.* TO 'wikiadmin'@'10.0.%' |
[23:45:01] Oh, makes sense.
[23:45:03] * RoanKattouw makes his disappointed why-do-authors-of-system-level-tools-always-screw-up-web-security face
[23:45:07] * Jamesofur notes that if we're going to cut back to only 1 test wiki again we should probably have that 1 be test.wp.o rather then test2 ;)
[23:45:15] right, so wikiadmin gets access to any database with "a" in its name
[23:45:23] hahahaha
[23:46:02] TimStarling: that's nearly all of our DBs :)
[23:46:06] <^demon> Like enwiki...oh wait
[23:46:14] should have been 'w' ;)
[23:46:42] New patchset: Ryan Lane; "Combine sysadmin and netadmin into projectadmin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45267
[23:47:07] AaronSchulz: Will the imagescaler requests be switched over at some point?
[23:47:12] I assume that's the plan.
[23:47:28] [15:43] AaronSchulz they will migrate later
[23:47:43] it is later :P
[23:47:45] some varnish vcl stuff needed first
[23:49:35] TimStarling: what should nextJobDB do with errors? Call wfLogDBError() and return an empty string (same as "no dbs with jobs")?
[23:50:11] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 186 seconds
[23:50:43] Exception.php should be modified to send its errors to STDERR
[23:50:51] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 199 seconds
[23:50:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds
[23:50:59] stderr isn't captured by backticks
[23:51:19] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 204 seconds
[23:51:44] actually it seems to be trying to do that already
[23:51:45] Mmm, guillom says the WMF has "885 servers" -- does that include just pmtpa and eqiad, or places like Amsterdam too, do you reckon?
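The wikiversions.cdb fatal quoted earlier is exactly the capture problem Tim and Aaron are discussing: the job loop captures the script's stdout as "the next database with jobs", so an error message printed to stdout gets parsed as a database name. Here is a minimal illustration in Python; the real code path is PHP's nextJobDB.php plus shell backticks, so treat this as an analogy rather than the actual implementation.

```python
# Analogy for the nextJobDB.php problem, not the actual PHP code. The parent
# captures only the child's stdout, exactly as shell backticks do; anything
# the child writes to stdout is taken to be a database name.
import subprocess
import sys

ERROR = "DB connection error: Access denied for user 'wikiadmin'"
child_stdout = f'import sys; sys.stdout.write("{ERROR}")'
child_stderr = f'import sys; sys.stderr.write("{ERROR}")'

for label, code in (("error on stdout", child_stdout),
                    ("error on stderr", child_stderr)):
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True)
    next_db = result.stdout.strip()  # what backticks would have captured
    print(f"{label}: captured {next_db!r} (stderr: {result.stderr.strip()!r})")
# With the error on stderr, the captured value is empty, which the caller can
# safely treat the same as "no databases with jobs".
```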
[23:51:48] not sure why it wouldn't work
[23:54:40] !log wikiadmin permissions for eqiad were broken, attempting to use CIDR notation, fixing them on all masters
[23:54:53] Logged the message, Master
[23:58:36] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155
[23:58:51] !log on db1006: fixing permissions for root@fenari
[23:59:02] Logged the message, Master
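On the broken wikiadmin grant: the account host quoted earlier is an address/netmask pair ('10.64.0.0/255.255.252.0'), and the fatal named the denied client 10.64.16.146. A quick check with Python's ipaddress module, a sketch rather than anything that was run on the cluster, shows the intended /22 range does contain that client; the denial is therefore consistent with Tim's note that MySQL of that era does not accept CIDR-style host specifications (as far as I know it only honoured whole-octet netmasks), rather than with the range itself being wrong, which is why the grants were rewritten on all masters.

```python
# Sanity check of the failing grant, purely illustrative. The host pattern
# and client IP are the ones quoted in the log above.
import ipaddress

grant_host = ipaddress.ip_network("10.64.0.0/255.255.252.0")  # from the user table
denied_client = ipaddress.ip_address("10.64.16.146")          # from the fatal error

print(grant_host)                   # 10.64.0.0/22
print(grant_host.netmask)           # 255.255.252.0
print(denied_client in grant_host)  # True: the intended range did cover the client
# So the range was fine; the failure is on the MySQL side, whose account-host
# matching did not understand this netmask ("mysql doesn't do CIDR"), hence
# the grants being rewritten.
```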