[00:35:30] New patchset: DamianZaremba; "Since the hostname used in labs is not standard and does not match the one in monitoring snmp traps fail. Fake the hostname using the instancename configured locally and the site the instance is in." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45081 [00:50:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45081 [01:57:04] PROBLEM - MySQL disk space on neon is CRITICAL: DISK CRITICAL - free space: / 354 MB (3% inode=74%): [02:26:15] !log LocalisationUpdate completed (1.21wmf7) at Tue Jan 22 02:26:14 UTC 2013 [02:26:27] Logged the message, Master [02:42:30] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 188 seconds [02:42:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 205 seconds [02:48:38] !log LocalisationUpdate completed (1.21wmf8) at Tue Jan 22 02:48:37 UTC 2013 [02:48:49] Logged the message, Master [02:52:51] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:00] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:01] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:09] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:18] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:25] are we starting the switch or what? 
[02:53:27] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:31] * Jasper_Deng sees Apaches running away [02:54:31] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [02:54:39] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [02:54:40] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [02:54:40] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [02:54:40] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [02:54:40] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:54:40] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.075 second response time [02:54:40] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.069 second response time [02:54:41] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time [02:54:41] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.077 second response time [02:54:42] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.079 second response time [02:54:49] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:54:57] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [02:54:58] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [02:54:58] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [02:55:07] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.060 second response time [02:56:45] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.080 second response time [03:00:39] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:02:01] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:27:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [03:47:10] New patchset: Ryan Lane; "Use paged results in nslcd and set mininum uid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45086 [03:47:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45086 [03:53:57] New patchset: Ryan Lane; "nss_min_uid is only valid in precise+" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45087 [03:54:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45087 [04:05:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:00] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 181 seconds [04:10:21] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 197 seconds [04:16:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.807 seconds [04:36:54] 
RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [04:37:31] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [05:07:06] New patchset: Ryan Lane; "Add more explicit scopes for groups and users" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45089 [05:09:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45089 [05:11:23] New patchset: Ryan Lane; "Followup for scoping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45090 [05:11:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45090 [05:11:39] New patchset: Asher; "EQIAD SWITCH: set $::mw_primary = eqiad (sets mobile api backend, redis+mha conf)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45091 [06:17:19] New patchset: Ryan Lane; "Follow up to scoping. Set group map properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45093 [06:19:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45093 [06:26:51] RECOVERY - MySQL disk space on neon is OK: DISK OK [06:38:21] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on srv247 is CRITICAL: Puppet has not run in the last 10 hours [07:11:32] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [07:11:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:17:09] New patchset: Ryan Lane; "Increase the group and passwd nscd suggested-size" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45094 [07:17:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45094 [07:18:44] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [07:39:26] New review: Hashar; "Should probably use ${::instancename} directly instead of (cat /etc/wmflabs-instancename) :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45081 [08:02:27] New patchset: DamianZaremba; "Tidying up exec to use info direct from puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45096 [08:10:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45096 [08:29:47] TimStarling: is there a way to know which is the millionth it.wiki's article? 
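For readers following the r/45081 and r/45096 exchange above (faking the SNMP trap hostname on labs, then tidying the exec to take its data straight from puppet, per Hashar's review suggestion to use ${::instancename} rather than "cat /etc/wmflabs-instancename"), here is a minimal sketch of that pattern. Class and path names below are placeholders, not the merged code:

    # Placeholder names throughout -- the shape of the change, not its text.
    class snmp::labs_trap_hostname {
        # Build the hostname monitoring expects from facts ($::instancename,
        # $::site), instead of an exec that cats /etc/wmflabs-instancename
        # at apply time.
        $trap_hostname = "${::instancename}.${::site}.wmflabs"

        file { '/etc/snmp/trap-hostname':
            ensure  => present,
            owner   => 'root',
            group   => 'root',
            mode    => '0444',
            content => "${trap_hostname}\n",
        }
    }

The "instancename.site.wmflabs" shape matches the labs node names that appear later in this log (e.g. i-0000031a.pmtpa.wmflabs); the file path is purely illustrative.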
[08:33:16] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 206 seconds [08:33:51] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 220 seconds [08:38:30] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 6 seconds [08:39:07] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:19:11] New patchset: Hashar; "(bug 26784) IRC bots process need nagios monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [09:20:49] New review: Hashar; "I guess that would be good enough. Leslie / Peter mind reviewing this? Maybe the Nagios check shoul..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45097 [09:34:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [09:35:31] RECOVERY - Puppet freshness on srv247 is OK: puppet ran at Tue Jan 22 09:35:02 UTC 2013 [09:35:32] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 202 seconds [09:43:27] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:44:12] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:47:23] New patchset: Silke Meyer; "Icons for Wikidata demo instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45098 [09:52:22] Vito: I don't think so [09:53:04] New patchset: Tim Starling; "Remove apaches.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45099 [09:54:21] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45099 [10:01:17] New patchset: Tim Starling; "Fix duplicate definition error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45100 [10:02:10] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45100 [10:09:27] !log on searchidx2: replaced /etc/sudoers with the distro default so that /etc/sudoers.d/* from puppet can take effect [10:09:37] Logged the message, Master [10:14:24] !log jenkins updated all plugins and restarting [10:14:33] Logged the message, Master [10:18:10] !log gallium : restarted puppet [10:18:20] Logged the message, Master [10:30:37] !log also replaced /etc/sudoers on fenari and hume. Hume needs a puppet change which will follow shortly [10:30:47] Logged the message, Master [10:37:54] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:44:54] New patchset: Tim Starling; "Introduce role::applicationserver::maintenance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45106 [10:45:09] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45106 [10:47:38] New patchset: Silke Meyer; "Added variables to configure Wikibase client via labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45107 [10:49:47] New review: Silke Meyer; "Split into two separate commits (45098 for logos and 45107 for variables). Abandoning this change here." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44690 [10:50:15] Change abandoned: Silke Meyer; "The content has been split into two separate commits." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44690 [10:51:15] New review: Silke Meyer; "Andrew, you were right. We don't need the variables in mediawiki.pp." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/45107 [10:52:04] hume got a lot of good changes from being put into the right puppet classes [10:53:23] like this: [10:53:25] -;igbinary.compact_strings=On [10:53:25] +igbinary.compact_strings=Off [10:53:37] stuff you forget if you have to manage it separately [10:59:48] New patchset: Hashar; "gerrit: mw/tools IRC notifs to #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45108 [11:01:31] TimStarling: are you on a puppet sprint? :) [11:03:26] you know if you can't find something in your house despite spending half an hour looking for it, that's a sign you need to tidy up [11:03:49] I did that last week end [11:03:57] ended up throwing a ton of useless stuff [11:04:10] well done [11:04:17] no I can buy more :-] [11:04:37] so puppet was looking pretty untidy to my eyes, and it was slowing me down and breaking things [11:04:52] so it was time to do a bit of work on it [11:05:08] we also have to split the manifests in nice module [11:05:27] fenari is broken right now [11:05:30] including all the manifests on each run (via the myriad of require in manifests/site.pp) is not really efficient :-D [11:05:37] I don't understand the applicationserver/mediawiki_new split [11:06:00] I think the idea was to refactor the mediawiki class [11:06:09] apparently to run maintenance scripts you need applicationserver, but it installs apache for no apparent reason, and configures it [11:06:13] and have it applied on eqiad app servers [11:08:44] I think I have it figured out... [11:20:12] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [11:20:23] New patchset: Tim Starling; "Split apache off from applicationserver::packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45113 [11:20:34] paravoid/hashar: either of you want to review that? [11:22:25] TimStarling: looking [11:26:46] i don't get it [11:27:13] it's to fix this: [11:27:15] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[libapache2-mod-php5] is already defined in file /var/lib/git/operations/puppet/modules/applicationserver/manifests/packages.pp at line 7; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/noc.pp:8 on node fenari.wikimedia.org [11:27:27] ah the infamous dupe entries [11:27:30] grmblbl [11:27:32] because I just changed fenari to include role::applicationserver::maintenance [11:27:46] which includes applicationserver::packages which conflicts with noc.pp [11:27:50] I guess whoever includes applicationserver::apache_packages [11:27:55] will turns out to have the same issue [11:28:06] since you get the libapache2-mod-php5 defined in two places [11:28:17] (in apache_packages.pp and in packages.pp) [11:28:50] ah, right [11:29:14] one way is to create a dummy class that package{} [11:29:31] then include the dummy class whenever you want it [11:29:46] New patchset: Tim Starling; "Split apache off from applicationserver::packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45113 [11:30:04] New review: Tim Starling; "PS2: actually remove apache from applicationserver::packages" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45113 [11:30:11] better now? 
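The "dummy class" workaround Hashar suggests above is the standard way out of Puppet's duplicate-definition error quoted earlier: declare the package exactly once, in its own class, and have every manifest that needs it include that class, since include is idempotent while a second package {} declaration is not. A hedged sketch with illustrative class bodies (not the actual diff of r/45113):

    # Declared exactly once; safe to include from any number of manifests.
    class applicationserver::apache_packages {
        package { 'libapache2-mod-php5':
            ensure => latest,
        }
    }

    class role::applicationserver::maintenance {
        include applicationserver::apache_packages
    }

    # noc.pp previously declared the same package directly, which is what
    # produced the duplicate-definition error on fenari; including the
    # shared class instead is safe.
    class misc::noc {
        include applicationserver::apache_packages
    }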
[11:30:23] * hashar refreshes [11:30:59] ahh [11:31:15] I had edited it inside my brain but not in the actual file [11:31:47] so [11:31:50] if I am correct [11:31:59] the jobrunner / videoscaler will no more have apache running [11:32:06] since they include applicationserver::packages [11:32:09] yes [11:32:21] there's no ensure=>absent but that is how it will be after reinstall [11:32:40] same goes for the lucene indexer too [11:32:41] but it would have been broken anyway since they didn't include applicationserver::service [11:32:47] yes, I said so in my commit message [11:33:28] ah and tin == deployment [11:33:42] sounds good so [11:34:06] will probably want to update tin / lucene etc to have the apache mod installed though [11:34:09] I guess tin relies on it [11:35:01] what does tin need it for? [11:35:11] New review: Hashar; "Sounds good. PS1 had packages to still include libapache2-mod-php5 which was really confusing." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/45113 [11:35:20] ah maybe it does not need libapache2-mod-php5 [11:35:22] just apache [11:35:44] I think the git-deploy minion fetch from an http:// url served by tin [11:35:50] though probably does not need php5 [11:35:52] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45113 [11:36:11] lucene, I have ZERO idea [11:37:58] my labs instances are not happy hehe [11:37:59] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/usr/local/apache] is already defined in file /etc/puppet/manifests/nfs.pp at line 187; cannot redefine at /etc/puppet/modules/applicationserver/manifests/config/apache.pp:29 on node i-0000031a.pmtpa.wmflabs [11:38:05] most probably unrelated though [11:38:27] lucene doesn't need apache [11:41:47] ok, well I ran puppet on fenari and it doesn't seem horribly broken [11:41:55] :-] [11:42:03] and on app servers? :D [11:42:27] I will try that, and searchidx2 [11:43:58] you know why puppet is slow? [11:45:05] I ran a profile today, it seems to be mostly the manifest parser [11:45:52] we use import, rather than autoload, so we get the whole manifests directory for any server [11:46:36] app servers are fine, no change [11:46:42] searchidx2 is fine [11:47:08] gtg [11:48:31] Tim-away: bye bye :) [11:48:36] New patchset: Hashar; "beta: /usr/local/apache dupe definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45115 [11:48:43] Tim-away: for your backlog, we can get more autoloading by splitting our manifests to submodules :-] [11:48:47] err modules [12:26:02] <^demon> !log restarted ircecho on manganese [12:26:14] Logged the message, Master [14:52:27] New patchset: Mark Bergsma; "Add misc_pmtpa and misc_eqiad hostgroups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45138 [14:54:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45138 [14:57:26] New patchset: Mark Bergsma; "EQIAD SWITCH: Use eqiad bits app servers exclusively." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/44252 [14:57:26] New patchset: Mark Bergsma; "EQIAD SWITCH: Use eqiad bits app servers in eqiad (not in pmtpa)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44251 [14:57:26] New patchset: Mark Bergsma; "EQIAD SWITCH: Migrate mobile backend appservers to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44257 [15:22:50] !log Changed AS14907->AS43821 routing [15:23:00] Logged the message, Master [16:14:58] paravoid: ping [16:15:05] pong [16:15:43] paravoid: howdy... Ryan_Lane said that you are the man that I need to talk to :) [16:15:51] hi :) [16:15:52] what for? [16:15:54] I work on swift [16:16:19] okay [16:16:24] I heard that you guys need global replication, which we don't have right now [16:16:36] New patchset: Mark Bergsma; "EQIAD SWITCHOVER: set pmtpa application servers read-only" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45149 [16:16:49] among other things, yes [16:16:53] hehe [16:17:07] we're in contact with the SwiftStack folks [16:17:14] so, I would be curious to hear a little more [16:17:15] ahh [16:17:16] John has said that it's in the works [16:17:52] New patchset: Mark Bergsma; "EQIAD SWITCHOVER: set eqiad application servers read-write" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45150 [16:17:59] what would you like to hear? [16:18:09] <^demon> mark: I think Asher already has 2 patches in for that. [16:18:23] <^demon> https://gerrit.wikimedia.org/r/#/c/44845/ and https://gerrit.wikimedia.org/r/#/c/44847/ [16:18:41] well, the things inside swiftstack often don't get shared with the larger community it seems :/ [16:18:57] paravoid: I was just curious to hear a little more about your use cases [16:19:53] like to start, what you envision the global setup to look like [16:19:58] how many DCs [16:19:59] well, we currently use swift for media storage -- basically images & videos for all wikimedia projects incl. 
wikipedia [16:20:10] how much bandwidth will be availabe between the dcs [16:20:24] paravoid: yup [16:20:36] we have two sites in the US, a smaller caching site in EU and we're about to set up a new caching site in the west coast as well [16:20:58] the initial target was replication between the two sites in the US [16:21:17] one active, the other one hot standby (or even active/active) [16:21:21] k [16:21:32] bandwidth between DCs is not an issue [16:21:42] there's latency though [16:21:43] k [16:22:01] note that while waiting for swift to gain geo-replication we've been evaluating Ceph in the other DC [16:22:07] ceph + radosgw [16:22:17] yeah that is what I heard [16:22:36] also note that we've attempted to use container sync to sync containers across [16:22:42] but that was a huge failure [16:22:48] yeah, I wouldn't recommend that [16:22:49] I've filed multiple bugs about that on launchpad [16:23:15] swift couldn't scale for the number of files per container that we wanted, so we've been sharding containers into 256 shards [16:23:20] We should have communicated a bit better in the docs that it was a bit of an experimental feature [16:23:25] that has resulted in us having about 36k containers [16:24:05] container sync already sucks as it is, with 36k containers/200 million files is just impossible for us [16:24:14] I totally understand [16:25:02] note that ceph also has a number of advantages over swift so far [16:25:11] (georeplication is unfortunately not one of them) [16:25:31] paravoid: meaning ceph doesn't have georeplication? [16:25:47] not really, no [16:25:51] k [16:26:04] it does have hierarchical zoning [16:26:07] ok so sticking with replication for a bit [16:26:15] I would like to dig in a bit more [16:26:23] but there's no way of preferring the local DC for reads for example [16:26:26] (and thanks for your time, this is a lot of good feedback) [16:26:27] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [16:26:28] sure [16:26:45] gah, that paged again [16:27:32] So in an ideal world, would you like to have those two DCs act as one physical cluster? [16:31:51] creiht: what do you mean by that? [16:32:12] what's up with the ms-fe.eqiad pages for the last 24 hours [16:32:19] yeah I'm looking into that now [16:32:19] sorry [16:32:33] I'm not sure why it started paging all of a sudden [16:32:34] paravoid: I can wait if you need to work on the pages [16:32:45] it never worked [16:33:21] but, to rephrase, are you looking to have 1 cluster that is spread across several locations, or 2 separate mirrored clusters that can be accessed individually [16:33:31] in an ideal world [16:33:52] but I've been working on fixing it properly; I need to restart pybal on lvs1003 [16:33:58] and let's just say that I'm a bit... 
reluctant :-) [16:34:33] lvs1006 has the new config and has been restarted and works [16:34:38] not sure if I can just kill pybal on lvs1003 [16:34:45] - 10.64.1.6 - - [22/Jan/2013:16:34:36 +0000] "GET /monitoring/backend HTTP/1.0" 200 247 "-" "Twisted PageGetter" [16:34:48] - 10.64.1.3 - - [22/Jan/2013:16:34:42 +0000] "GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/monitoring/pybaltestfile.txt HTTP/1.0" 400 272 "-" "Twisted PageGetter" [16:35:20] the second URL is swift-specific and won't work; the first is what I've provisioned yesterday (with a swift rewrite.py change) as to properly fix it and make it swift/ceph agnostic [16:36:22] creiht: we haven't thought about it much; I can see arguments for both, but what really matters is to have some form of replication between the two sites... [16:36:31] k [16:36:45] my instinct would say two separate clusters as to increase redundancy, ease/stage software upgrades etc. [16:36:52] so having the redundancy is more important [16:36:57] cool [16:37:11] as long as replication is not per container or anything [16:37:15] that's completely nuts :) [16:37:35] paravoid: yeah that was meant for one small use case [16:37:49] and a bit of an experiment [16:38:06] if only we knew that from the get go [16:38:11] yeah sorry :/ [16:38:15] I took over media storage quite late [16:38:32] but before me there was strong hope that container sync would be a workable solution [16:38:37] ahh [16:38:57] and actually putting internal deadlines on that assumption [16:39:02] ouch [16:39:28] so, yeah, you really need to document better that it's not not even close to being production-ready :) [16:39:29] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [16:40:05] paravoid: yeah, I have been thinking about how we can better communicate which features are well tested, and which are a bit more experimental [16:40:21] binasher: btw, the latest page was because nagios was completely broken again and mark just fixed it [16:40:33] creiht: yep, good idea [16:40:46] paravoid: I was working on a different project for about a year, so just getting back to work on swift stuff [16:41:36] unfortunately it seems documentation has taken a bit of a back seat during that time [16:41:46] paravoid: that is an excellent reason to receive a page [16:41:53] that's one of the things that I want to work on [16:41:55] anyways [16:42:03] binasher: hehe [16:42:42] so you mentioned being able to prefer a local DC [16:42:54] how important would that be to you? [16:43:03] hiya [16:49:02] creiht: what do you mean? [16:49:34] it's important to not take a latency hit [16:50:16] yeah, so on any request, you would like to use the nearest location, if possible [16:50:51] New patchset: Pyoungmeister; "allowing jobrunners to be enabled or disabled via role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45156 [16:51:08] would mark or binasher or paravoid take a look at that ^^ [16:51:12] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 247 bytes in 0.054 seconds [16:51:40] binasher: I was also thinking of basing it on $::mw_primary [16:51:45] there [16:51:55] notpeter: is that for spinning up the jobrunners in eqiad today? 
or just later on [16:51:59] looking [16:52:04] well, today and both [16:52:15] I put in a hack to stop them in eqiad based on host name [16:52:18] but that's a hack [16:52:21] this seems more proper [16:52:28] i wonder if there's a reason not to run jobrunners in both sites [16:52:35] some jobs may process very slowly due to latency [16:52:51] mark: not with the current job runner [16:53:04] sure, could also run from both sites with half the procs [16:53:12] lemme know what you'd like :) [16:53:13] why half? [16:53:14] its too uncoordinated and can stamped databases with too many runners [16:53:18] ah [16:53:20] ok [16:53:36] cool. then I stand by my patchset :) [16:54:00] New patchset: Silke Meyer; "Moves Wikidata items to main namespace, imports main page to different namespace" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [16:54:59] so whom of platform engineering is here? :) [16:55:08] roan [16:55:10] robla just walked in [16:55:17] er, roan's not on platform, is he? [16:55:18] notpeter: that changeset doesn't disable pmtpa jobbers? [16:55:22] chris [16:55:31] roan knows mediawiki, that's all that matters [16:55:33] New review: Faidon; "I'd prefer a run_jobs or something, but I guess enabled is also obvious enough. Looks good otherwise." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45156 [16:55:39] binasher: nein. I'd like to make a sperate patcheset for that [16:56:10] notpeter is right, I'm on features, not platform [16:56:23] RoanKattouw: but you know stuff about things [16:56:24] * ^demon is pinged on platform [16:56:26] <^demon> What's up? [16:56:36] <^demon> mark: Reedy and I both are around. [16:56:44] mediawiki working correctly is a feature as far as I'm concerned today ;-) [16:56:51] hahaha [16:56:53] Of course I was in platform's annual planning meeting last year while I wasn't in the one for features.... but I'm in features, I swear! :D [16:59:29] New patchset: Pyoungmeister; "allowing jobrunners to be enabled or disabled via role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45156 [16:59:31] ok, paravoid, more verbose var name, just for you :) [16:59:44] haha [16:59:50] a nitpick really :) [16:59:57] en, you're right [17:00:03] /en/eh [17:00:12] my var naming is usually underly verbose [17:00:19] <^demon> RoanKattouw: If you ever want to come back to platform, we'd welcome you :) [17:00:20] binasher: i assume the DB's are sufficiently warm? ;-) [17:01:14] mark: there are warmups running now, but yes :) [17:02:51] mark: ready to start with the bits migration? [17:02:58] we are going off of https://wikitech.wikimedia.org/view/Eqiad_Migration_Planning/Steps [17:03:25] yeah i'm currently checking a few bits urls against bits apaches [17:03:32] great [17:04:19] ping me if you need anything [17:05:15] !log staging cname switch for pmtpa/eqiad dbs on sockpuppet [17:05:25] notpeter: ty! [17:05:28] Logged the message, notpeter [17:06:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45156 [17:06:44] getting just 404s for the top bits urls [17:06:47] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:57] lemme try tampa [17:07:03] perhaps that's normal ;) [17:07:06] mark: you're querying the apaches? [17:07:09] yes [17:07:19] are you emulating the varnish rewrites? 
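A rough sketch of the pattern behind r/45156 discussed above: a role class decides per site whether the job runners actually run, replacing the hostname-based hack notpeter mentions, and (per mark and binasher) only one site runs them at a time so uncoordinated runners cannot stampede the databases. Names below are illustrative, not the merged change; paravoid's review suggested run_jobs for the switch:

    # Illustrative only.
    class applicationserver::jobrunner( $run_jobs = true ) {
        $jobs_ensure = $run_jobs ? {
            true    => 'running',
            default => 'stopped',
        }

        service { 'mw-job-runner':
            ensure => $jobs_ensure,
            enable => $run_jobs,
        }
    }

    # Per-site role classes flip the switch; this could equally be derived
    # from $::mw_primary, as floated above.
    class role::applicationserver::jobrunner::eqiad {
        class { 'applicationserver::jobrunner': run_jobs => true }
    }
    class role::applicationserver::jobrunner::pmtpa {
        class { 'applicationserver::jobrunner': run_jobs => false }
    }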
[17:07:28] doh [17:07:29] watch the actual requests hitting pmtpa apaches [17:07:33] no ;) [17:07:35] and try those [17:07:36] thanks [17:07:38] i'll check a couple too [17:07:52] varnishtop -b -i TxURL [17:08:25] yeah looking better [17:08:38] New patchset: Pyoungmeister; "EQIAD SWITCH: switching job runners and tmh to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45159 [17:09:38] seems ok, but got some slow responses [17:10:31] like > 2s [17:10:39] i hope that's not gonna be a problem [17:10:43] resourceloader can be slow without its intermediary caches being warm [17:10:52] replay the same set of requests? [17:11:01] faster of course [17:11:08] fortunately there aren't many backend requests for bits [17:11:09] we'll see [17:11:13] let's do it [17:11:29] the contents of bits varnish shouldn't all expire at once either [17:12:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44251 [17:12:38] running puppet on arsenic [17:12:56] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [17:13:33] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:13:33] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [17:13:50] PROBLEM - NTP on db1013 is CRITICAL: NTP CRITICAL: No response from NTP server [17:14:25] !log dns svn repo on sockpuppet has changes staged and checked in, not yet deployed on dobson [17:14:26] seems fine [17:14:35] Logged the message, notpeter [17:15:37] going to run it on the other bits servers [17:15:59] yep, varnishstat on arsenic looks good, and see traffic to the bits apaches [17:16:03] yup [17:17:26] PROBLEM - SSH on db1013 is CRITICAL: Connection refused [17:17:41] hopefully the resourceloader in read only mode issues are really fixed, theres a small trickle of write attempts (RoanKattouw) [17:17:54] Yes, they should be [17:17:59] Are we in r/o mode now? [17:18:02] eqiad is [17:18:04] Or is it just the bits apaches [17:18:06] right [17:18:09] all done ine qiad [17:18:10] eqiad [17:18:12] Yeah there will still be write attempts [17:18:13] sweet [17:18:17] will switch pmtpa bits varnish now [17:18:29] It doesn't bother to check for r/o mode specifically because DB writes may fail for other reasons anyway [17:18:43] mark: can you merge the two mobile changes with it [17:18:48] sure [17:19:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44252 [17:19:17] where are they? 
you didn't use the eqiad-switchover topic [17:19:24] ah linked in the steps doc [17:19:41] oh sorry about that [17:19:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45091 [17:19:45] one is yours tho :) [17:20:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44257 [17:20:42] merging on sockpuppet [17:21:43] php logs all look good from bits eqiad apaches [17:21:44] merged [17:21:45] running puppet [17:22:03] running puppet on cp1041 [17:23:13] do you want me to do the rest? [17:23:26] paravoid: ?? [17:23:28] no [17:23:43] 1042/1043/1044 [17:23:47] no [17:23:49] k [17:24:05] looks fine on 1041 [17:24:09] yup [17:24:13] ok, doing the rest [17:24:29] RECOVERY - SSH on db1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:24:44] mobile has quite some traffic of course [17:24:58] with its hit rate [17:25:26] and while it warms up our caches with slow responses, people think it's their mobile network being slow - brilliant! [17:25:32] hahaha [17:25:34] haha [17:25:40] yep :) [17:25:58] SF folks: if anyone is interested in ridiculously sugary breakfast, come to R31 [17:26:11] don't you love american breakfast [17:26:23] "breakfast" [17:26:25] mark: you mean coffee? [17:26:33] i mean muffins [17:26:44] i need coffee in the morning but can't stand muffins [17:26:48] We have stuff like sticky buns here [17:27:00] RoanKattouw: is there any fresh fruit in there? [17:27:13] Some of the pastries contain fresh fruit [17:27:15] Among .... other things [17:27:21] Like, hidden in a pound of sugar [17:27:25] fucking hell, 600 Mbps of mobile miss traffic already [17:27:36] the next appserver batch should come out of mobile's budget I think ;-) [17:27:42] hah [17:27:56] oh and that's not the half of it [17:28:23] they went and duplicated the entire parsercache on their own in memcached, doubling the usage, without telling us / anyone [17:28:36] so my capacity estimates used for the mc servers, totally wrong [17:28:36] hahaha [17:28:38] anyways [17:28:49] binasher: Mobile did? [17:28:59] now, we are going to wait 5-10 minutes [17:29:10] preilly: yes.. but anyways, later. [17:29:47] ridiculous [17:29:52] consider my waiting... with a sugary american muffin [17:30:20] i'll checkout the pmtpa readonly change in the mean time [17:30:21] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 220 seconds [17:30:22] ready to sync-file [17:30:47] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 235 seconds [17:31:03] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44845 [17:31:47] mark: thanks - i'll deploy that one [17:31:59] ok [17:32:23] mark: nice [17:33:25] ~ 800 Mbps of mobile traffic [17:33:35] mark: actually you can if you have it ready, but hold off for a bit more [17:33:50] sure [17:34:36] DBs look bored in ganglia [17:35:35] all but the s3 snapshot db which is lagging [17:35:42] mark: would there be a reason why I can't login to Ganglia? [17:35:56] preilly: assuming you have the password, no [17:36:25] mark: Okay thanks [17:36:44] mark: ok, are you fairly ready for the squid changes? 
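Several of the changes merged during this window (e.g. r/45091, "set $::mw_primary = eqiad (sets mobile api backend, redis+mha conf)") hang datacentre-dependent behaviour off that one global. A hedged sketch of the pattern, with placeholder hostnames and paths rather than production values:

    # In site/realm configuration: which DC is the MediaWiki primary.
    $mw_primary = 'eqiad'

    # Consumers branch on it; for example a redis replica following the
    # master DC (hostnames and paths here are placeholders).
    class redis::replica {
        $master_host = $::mw_primary ? {
            'eqiad' => 'redis-master.eqiad.wmnet',
            default => 'redis-master.pmtpa.wmnet',
        }

        file { '/etc/redis/replication.conf':
            ensure  => present,
            content => "slaveof ${master_host} 6379\n",
        }
    }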
[17:37:04] commits are done, i need to merge, generate config, sync em out [17:37:06] but shouldn't take long [17:37:10] i'll start on that now [17:37:16] if you do the read only step [17:37:26] cool, let me know when its generated but not synced [17:37:35] and then i'll deploy read only in pmtpa [17:37:45] trying to minimize the ro time :) [17:38:14] diffing [17:38:26] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:52] generated configs look good to me [17:39:05] i can sync out any time [17:39:45] i'll sync out to cp1001 first [17:39:47] then the rest [17:39:56] readonly going out now [17:40:20] !log asher synchronized wmf-config/db-pmtpa.php 'setting pmtpa to readonly' [17:40:30] Logged the message, Master [17:40:35] mark: ok, go ahead with cp1001 [17:40:36] cp1001 going out now [17:41:45] that's an api backend [17:41:52] it seems happy about its api backend ;) [17:42:08] let's do the rest? [17:42:10] !log db1013 wiped, reimaged, powered off, and un-monitored [17:42:22] Logged the message, Master [17:42:36] mark: go for it [17:42:38] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [17:42:47] i'm going to switch eqiad redis instances to masters [17:42:56] done [17:43:05] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [17:43:18] RoanKattouw: did you RL patch get merged? [17:43:21] *your [17:43:40] and running puppet updates on the pmtpa mc/redis hosts [17:44:03] AaronSchulz: I believe so [17:44:19] * RoanKattouw logs in to check fatal.log [17:44:45] looking at traffic nose dive in pmtpa and spike in eqiad is nice [17:44:53] yes :-) [17:45:06] would be nice if it checked readonly just to avoid log spam, though not a huge deal [17:45:13] only two fatals from eqiad apaches so far, both very typical [17:45:29] ok, i'm going to do the db master switches [17:45:35] ok [17:45:41] exception.log is full of query errors [17:45:47] :D [17:45:51] i'll put the read-write config change on fenari [17:45:57] A minute ago it was mostly block insertions/deletions [17:45:58] ready for sync-file once you're happy [17:46:12] RoanKattouw: look like all attempts to write to r-o dbs though [17:46:14] Via the API, it seems [17:46:19] don't see anything else [17:46:20] binasher: lemme know when dns change [17:46:21] They are, yes [17:46:22] Tue Jan 22 17:46:07 UTC 2013 mw1077 enwiki OAIRepo::logRequest [17:46:26] RoanKattouw: ;) [17:46:37] notpeter: go ahead now [17:46:37] I guess the API isn't very well-equipped for r/o mode [17:46:37] Oh well, too late to fix that now [17:46:51] !log authdns-update [17:47:01] Logged the message, notpeter [17:47:06] started master swaps [17:47:17] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44847 [17:47:17] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:26] PROBLEM - Apache HTTP on mw1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:31] uh oh [17:47:35] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:36] all dns servers are happy [17:47:44] PROBLEM - MySQL Slave Delay on db1038 is CRITICAL: CRIT replication delay 183 seconds [17:47:45] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:45] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:45] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[17:47:54] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 186 seconds [17:47:54] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 186 seconds [17:47:54] PROBLEM - MySQL Replication Heartbeat on db1011 is CRITICAL: CRIT replication delay 186 seconds [17:48:02] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:09] which pool are those apaches in [17:48:19] ^ that one [17:48:36] binasher: Application servers eqiad [17:49:05] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.026 second response time [17:49:14] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.720 second response time [17:49:23] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 219 seconds [17:49:24] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time [17:49:24] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.097 second response time [17:49:24] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 219 seconds [17:49:24] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.724 second response time [17:49:30] i think it's fine [17:49:50] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:08] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: CRIT replication delay 232 seconds [17:50:13] RoanKattouw: re: resourceloader cache heat, I imagine if writes are disabled does it choose to serve available cache or generate responses on the fly and be unable to save them? [17:50:26] nvm, caches aren't in the database, and there is still varnish in front. [17:50:35] good [17:50:43] There is some caching in the DB, as well as dependency tracking [17:50:52] right [17:50:53] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: NRPE: Unable to read output [17:51:02] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: NRPE: Unable to read output [17:51:10] but that's relatively cheap to compute as long as the final response will still be cached [17:51:11] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: NRPE: Unable to read output [17:51:13] The worst that can happen is that dependencies (images embedded in CSS) fail to register, breaking cache invalidation down the line as the cache isn't invalidated when those images change [17:51:20] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: NRPE: Unable to read output [17:51:21] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: NRPE: Unable to read output [17:51:26] But in practice, this should only happen when new code is deployed [17:51:26] right [17:51:29] PROBLEM - MySQL Replication Heartbeat on es1005 is CRITICAL: NRPE: Unable to read output [17:51:29] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: NRPE: Unable to read output [17:51:29] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.107 second response time [17:51:29] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: NRPE: Unable to read output [17:51:29] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:35] The writes that it's actually trying are probably for message blobs [17:51:45] slaving seems happy :) [17:51:46] yay! 
[17:51:47] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: NRPE: Unable to read output [17:51:48] PROBLEM - MySQL Replication Heartbeat on es1009 is CRITICAL: NRPE: Unable to read output [17:51:52] And if it can't cache those it'll just regenerate them for every request [17:51:55] aside from nrpe sucking ass [17:51:56] PROBLEM - MySQL Replication Heartbeat on db1039 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: NRPE: Unable to read output [17:51:56] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: NRPE: Unable to read output [17:51:57] PROBLEM - MySQL Replication Heartbeat on db1026 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on es10 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on es1010 is CRITICAL: NRPE: Unable to read output [17:52:05] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: NRPE: Unable to read output [17:52:14] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: NRPE: Unable to read output [17:52:15] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: NRPE: Unable to read output [17:52:15] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: NRPE: Unable to read output [17:52:15] PROBLEM - MySQL Replication Heartbeat on db65 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: NRPE: Unable to read output [17:52:23] PROBLEM - MySQL Replication Heartbeat on es1006 is CRITICAL: NRPE: Unable to read output [17:52:24] PROBLEM - MySQL Replication Heartbeat on db31 is CRITICAL: NRPE: Unable to read output [17:52:24] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: NRPE: Unable to read output [17:52:25] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: NRPE: Unable to read output [17:52:32] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on es1008 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: NRPE: Unable to read output [17:52:33] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: NRPE: Unable to read output [17:52:41] PROBLEM - MySQL Replication Heartbeat on db51 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db1047 is 
CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: NRPE: Unable to read output [17:52:42] PROBLEM - MySQL Replication Heartbeat on es6 is CRITICAL: NRPE: Unable to read output [17:52:43] PROBLEM - MySQL Replication Heartbeat on db1027 is CRITICAL: NRPE: Unable to read output [17:52:43] PROBLEM - MySQL Replication Heartbeat on es8 is CRITICAL: NRPE: Unable to read output [17:52:50] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: NRPE: Unable to read output [17:52:51] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: NRPE: Unable to read output [17:52:59] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: NRPE: Unable to read output [17:52:59] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: NRPE: Unable to read output [17:53:00] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: NRPE: Unable to read output [17:53:00] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: NRPE: Unable to read output [17:53:00] PROBLEM - MySQL Replication Heartbeat on es1007 is CRITICAL: NRPE: Unable to read output [17:53:06] notpeter: is the Nagios Remote Plugin Executor just messed up right now? [17:53:08] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.112 second response time [17:53:09] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: NRPE: Unable to read output [17:53:13] Pryes [17:53:17] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: NRPE: Unable to read output [17:53:17] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: NRPE: Unable to read output [17:53:18] preilly: yes [17:53:26] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: NRPE: Unable to read output [17:53:26] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: NRPE: Unable to read output [17:53:26] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: NRPE: Unable to read output [17:53:26] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64509 bytes in 9.312 seconds [17:53:27] PROBLEM - MySQL Replication Heartbeat on es9 is CRITICAL: NRPE: Unable to read output [17:53:28] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: NRPE: Unable to read output [17:53:32] replication actually loks fine, from checking by hand [17:53:35] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: NRPE: Unable to read output [17:53:36] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: NRPE: Unable to read output [17:53:36] PROBLEM - MySQL Replication Heartbeat on es5 is CRITICAL: NRPE: Unable to read output [17:53:36] PROBLEM - MySQL Replication Heartbeat on db48 is CRITICAL: NRPE: Unable to read output [17:53:40] "Swap the $shard-primary and $shard-secondary dns records. This is required for heartbeat monitoring, will erroneously show CRIT until then. 
" [17:53:44] PROBLEM - MySQL Replication Heartbeat on es7 is CRITICAL: NRPE: Unable to read output [17:53:53] PROBLEM - MySQL Replication Heartbeat on db34 is CRITICAL: NRPE: Unable to read output [17:54:02] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: NRPE: Unable to read output [17:54:02] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:11] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: NRPE: Unable to read output [17:54:20] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.100 second response time [17:54:29] PROBLEM - MySQL Replication Heartbeat on db1036 is CRITICAL: NRPE: Unable to read output [17:54:36] mark: do you think that it's dns propogation time? as I did that... [17:54:39] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: NRPE: Unable to read output [17:54:43] !log aaron synchronized php-1.21wmf7/includes/Block.php 'deployed b0d5ff9bfb09f15ccbab6849564511d372ecd8d4' [17:54:47] PROBLEM - MySQL Replication Heartbeat on labsdb1003 is CRITICAL: NRPE: Unable to read output [17:54:54] Logged the message, Master [17:55:09] yeah, what's the ttl? [17:55:13] 5M [17:55:14] 5m [17:55:14] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: NRPE: Unable to read output [17:55:23] PROBLEM - MySQL Replication Heartbeat on db1048 is CRITICAL: NRPE: Unable to read output [17:55:38] shall I try kicking heartbeat on one box? [17:56:27] it seems to have propogated [17:56:44] RECOVERY - MySQL Slave Delay on db1038 is OK: OK replication delay NULL seconds [17:56:45] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:00] what is aaron syncing? [17:57:08] fyi, i've been manually migrating s4.. the eqiad master was lagged once traffic hit it and mha won't do anything in that case [17:57:33] binasher: ok, thanks. i was just going to look at that [17:57:33] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time [17:57:38] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:38] Meh [17:58:10] mark: Every 1% of requests that inspect the blocks table, MW purges expired blocks [17:58:14] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [17:58:16] It needs to not try that in r/o mode [17:58:23] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.096 second response time [17:58:27] ok, but can we coordinate all syncs here? [17:58:33] I'm going to switch jobrunning to eqiad [17:58:37] AaronSchulz: I also see backtraces of writes in CentralAuth from within User::loadOptions() [17:58:42] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:46] AaronSchulz: Per mark, please coordinate code syncs here :) [17:58:58] mark: I just told him in person [17:58:58] mark: ok? 
[17:58:59] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 0 seconds [17:59:08] notpeter: ok [17:59:14] RoanKattouw: sure, but I don't plan on doing anything else [17:59:17] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:17] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.125 second response time [17:59:20] OK [17:59:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45159 [17:59:47] AaronSchulz: Just saying, today's fatal.log will probably point out a bunch of other r/o issues as well [17:59:48] !log moving jobrunning to eqiad [17:59:50] RoanKattouw: what is it writing [17:59:58] Logged the message, notpeter [18:00:20] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.092 second response time [18:00:31] eqiad apache cpu usage slowly going down [18:00:38] ok, and got the s4 pmtpa master replicating off the eqiad one.. dbs should be done, but a couple more checks before turning off read only [18:00:43] notpeter: are you looking at the heatbeat nagios thing? [18:00:56] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.845 second response time [18:01:02] binasher: no. I can, though. I'm going to try restarting heartbeat on one slave [18:01:23] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:27] notpeter: is it just DNS related? [18:01:36] preilly: dns is correct [18:01:37] AaronSchulz: User::loadOptions() indirectly calls CentralAuthHooks::attemptAddUser() which calls User::addToDatabase() which obviously calls Database::insert() [18:01:38] Ryan_Lane: (catching up on backscroll…) As far as I know I haven't merged any of Mike's work yet. What are you seeing? [18:01:39] notpeter: e.g., NRPE: Unable to read output [18:01:56] yeah, seems like a "running the nagios check" problem [18:01:58] RoanKattouw: ugh, that sounds about right [18:02:04] andrewbogott_afk: sockpuppet's puppet working tree was in a a "mikepatch1" branch [18:02:09] ok [18:02:32] Also, lots and lots of API stuff [18:02:42] Although the API paths now seem to be triggering dieReadOnly() [18:02:51] Previously the DB was r/o but $wgReadOnly was false [18:03:03] That caused the API to freak out quite a bit [18:03:03] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.768 second response time [18:03:29] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:47] Fatal error: wikiversions.cdb has no version entry for `DB connection error: No working slave server: Unknown error (10.0.6.46)`. [18:03:50] RoanKattouw: lol [18:04:05] notpeter: I can haz Ganglia login? [18:04:15] AaronSchulz: hahahaaha [18:04:22] roan: /home/w/doc/ [18:04:29] that would be an interesting DB name [18:04:39] RoanKattouw: /home/wikipedia/doc/ganglia.htaccess [18:04:43] paravoid: Oh…. dammit [18:04:45] Thanks man [18:05:06] are we finally switching to reasonable database names derived from domains while moving? [18:05:08] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.314 second response time [18:05:41] paravoid: Holy crap, I must've had a serious case of typing-in-the-wrong-window yesterday. [18:05:48] Danny_B: We're not touching the DBs, no [18:05:55] :-( [18:05:57] oh, awesome, the heartbeat nrpe check is spitting out python stack traces.... 
[18:06:02] pybal is unhappy about a lot of apaches, not responding within 5s [18:06:44] BTW, Ganglia monitoring on the LVS groups is broken [18:07:06] All three of them broke between 17:50 and 18:00 UTC [18:07:12] Hey Erik [18:07:18] eqiad first, them pmtpa, then esams [18:07:26] strange [18:07:48] I think just masterdom is wrong for nagios [18:07:53] yeah [18:07:55] one ine change [18:08:00] thanks preilly ! [18:08:05] notpeter: np [18:08:33] notpeter: modules/coredb_mysql/files/utils/master_id.py:10:masterdom = '.pmtpa.wmnet' [18:08:42] notpeter: and files/mysql/master_id.py:10:masterdom = '.pmtpa.wmnet' [18:08:53] yeah, i'm going to have change it to a template [18:08:56] so that it's no hardcoded [18:08:58] but whatevs [18:09:39] kinda redundant that it appends -master and also needs a masterdom [18:09:43] TimStarling, Ryan_Lane, thanks for cleaning up my mess; my head is now officially Hung In Shame. [18:09:50] heh [18:10:13] in general, don't do anything nonstandard when merging on sockpuppet ;) [18:10:16] andrewbogott_afk: it happens. we never switch branches on sockpuppet, though [18:10:17] indeed [18:10:18] (And, now I'm going to figure out how to make my non-local/non-labs sessions use a different background color or something) [18:10:24] ah [18:11:09] ok. need to head into the office [18:11:37] binasher: what's the status of DBs? [18:12:08] nice timing.. i think i'm satisfied with their state [18:12:15] ready for read-write? [18:12:43] notpeter: any progress with heartbeat monitoring? [18:12:52] yeha, about to check in a patch [18:13:26] mark: oh, you already merged disabling readonly in eqiad :) [18:13:32] ready for sync-file [18:13:36] go for it [18:13:44] going out now [18:14:18] restarting redis in pmtpa, set to slave from eqiad [18:14:20] New patchset: Pyoungmeister; "setting masterdom by $mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:14:29] preilly: the other file is no longer used and I need to clean up [18:14:31] does that look right? [18:14:34] notpeter: ah, that! thanks [18:14:36] !log root synchronized wmf-config/db-eqiad.php 'eqiad read-write!' [18:14:37] more eyes!!! [18:14:46] looking [18:14:48] Logged the message, Master [18:14:54] flood of exceptions on failed writes has stopped [18:15:19] paravoid: are you okay with https://gerrit.wikimedia.org/r/#/c/45164/1/modules/coredb_mysql/templates/master_id.py.erb [18:15:34] no [18:15:40] I still see them [18:15:42] oops, s4 master was still read-only since i manually migrated that one… writeable now [18:15:42] the definition hasn't changed to reference the template [18:15:47] still references the file [18:15:55] ah, thanks [18:15:55] notpeter: ^^^ [18:15:56] derp [18:16:26] no dberror and exceptions look clean [18:16:31] there we go [18:16:32] Edits flowing in again [18:16:36] where are the maintenance cronjobs going from hume? [18:16:41] or not [18:16:49] notpeter: also can you amend the commit to include, "EQIAD MIGRATION" [18:16:55] well fuck. 
i was aiming to be done by 10am [18:17:05] but 10:16 isn't bad [18:17:10] not bad at all [18:17:13] binasher: yeah it's a disaster isn't it [18:17:14] :P [18:17:15] 30 mins readonly [18:17:17] ok, now they are gone for real [18:17:20] but pybal isn't happy about many apaches [18:17:22] checking now [18:17:34] Error connecting to 10.64.0.8: Access denied for user 'wikiadmin'@'10.64.16.98' (using password: YES) [18:17:39] it has only 98 pooled now [18:18:01] hm [18:18:01] what is that host [18:18:13] db1004 [18:18:17] no [18:18:20] happier now [18:18:21] mw118 [18:18:26] New patchset: Pyoungmeister; "eqiad migration: setting masterdom by $mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:18:27] why is that using wikiadmin [18:18:29] 1118 and others [18:18:34] db1004 [18:18:43] * preilly is too slow [18:18:59] RoanKattouw: AaronSchulz: when do the www's use wikiadmin? [18:18:59] notpeter: you forgot the .erb [18:19:02] plwiki still read-only? http://pl.wikipedia.org/ [18:19:08] :( [18:19:11] I'm derping hard today [18:19:13] thank you! [18:19:15] np [18:19:26] binasher: I don't think they're supposed to do that at all [18:19:29] binasher: I don't know, it should just be cli scripts [18:19:44] New patchset: Pyoungmeister; "eqiad migration: setting masterdom by $mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:19:47] i didn't grant wikiadmin to the www apache hosts on purpose [18:19:49] only wikiuser [18:19:50] but at leas tI know when to ask for review! [18:19:57] Hmm, looks like maintenance scripts might do that [18:20:03] Is mw1118 a job runner? [18:20:17] no [18:20:20] that's an api, I beleive [18:20:31] paravoid: what do you think of https://gerrit.wikimedia.org/r/#/c/45164/ now? [18:20:55] the job runner hosts should be able to use wikiadmin [18:21:09] ah, I see [18:21:16] go for it and we'll see [18:21:18] mw1118 is an apache.. so.. weird [18:21:25] I thought it used wikiadmin when getDbType() was set to admin [18:21:34] apparently any maintenance script does [18:21:35] I can't find where $wgDBuser is even being set [18:21:45] wikiadmin is in AdminSettings.php [18:21:57] chrismcmahon: no, plwiki is getting edits [18:22:45] i can change the wikiadmin grants to cover all of the apaches if this is urgent, but i'd like to keep it more restricted now [18:22:45] mw1118 is an api apache [18:22:58] binasher: just saw that switch. right now getting no DNS for dewiki here at least http://de.wikimedia.org/ [18:23:07] eek, scratch that [18:23:08] merging my little script change [18:23:14] binasher: I bet it's just for commons [18:23:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45164 [18:23:22] chrismcmahon: there have been no external dns changes [18:23:27] * preilly — API application servers eqiad > mw1118.eqiad.wmnet  [18:23:39] binasher: I made a typo, sorry [18:24:01] we should set system role better on apaches [18:24:02] Oooooh hahaha [18:24:08] to indicate what they are, regular, api, bits, etc [18:24:10] 'wikiuser' is MW's default value for $wgDBuser [18:24:12] That's ... fragilee [18:24:25] mark: I'll do that today [18:24:38] Tim-away would know why the api occasionally uses wikiadmin off the top of his head, i bet [18:24:41] just remove the current "wikimedia-task-appserver" one, that's old format ;) [18:24:55] ok [18:25:22] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds [18:25:27] RoanKattouw: wikiuser or wikiadmin? 
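[Editor's note on the masterdom change reviewed above: the fix is to stop shipping master_id.py as a static file with '.pmtpa.wmnet' hard-coded and instead render it from an ERB template driven by $mw_primary, and, the part initially missed, to make the file resource reference that template rather than the old source file. A rough sketch of the pattern (paths, mode and variable scoping are illustrative, not the literal patchset):]

    # modules/coredb_mysql/templates/master_id.py.erb (fragment)
    masterdom = '.<%= mw_primary %>.wmnet'

    # manifest fragment: render the template instead of copying the static file
    file { '/usr/local/bin/master_id.py':
        content => template('coredb_mysql/master_id.py.erb'),
        # previously: source => 'puppet:///modules/coredb_mysql/utils/master_id.py'
        mode    => '0555',
    }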
[18:25:28] there we go [18:25:34] let the recovery flood begin! [18:25:36] notpeter: sweet I'm glad that worked [18:25:37] :) [18:25:39] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 0 seconds [18:25:39] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [18:25:44] notpeter: thanks! [18:25:48] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [18:25:48] thanks for the checking preilly, paravoid :) [18:25:51] hehe [18:25:52] AaronSchulz: wikiuser is the default. But we should set it explicitly somewhere in wmf-config rather than relying on DefaultSettings [18:25:54] check this: http://torrus.wikimedia.org/torrus/Facilities?path=/Power_usage/Total_power_usage/Power_per_site&view=last24h [18:25:57] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [18:26:01] Also, AdminSettings.php has wikiadmin, so all maintenance scripts use wikiadmin [18:26:08] That doesn't explain mw1118 though [18:26:15] mark: oh, cool [18:26:20] mark: haha! [18:26:20] RoanKattouw: oh we mean the software default, haha [18:26:37] Yeah [18:26:39] RoanKattouw: it does, probably UW [18:26:41] Filing a bug for this [18:26:51] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [18:26:52] Hmm haven't checked UW [18:26:59] maintenance should use wikiuser except if the type is set to admin [18:27:02] How would it get wikiadmin, though? Which config variable holds that value? [18:27:06] there is already functionality for this in core [18:27:18] RECOVERY - MySQL Replication Heartbeat on es5 is OK: OK replication delay 0 seconds [18:27:18] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [18:27:24] I like graphs like: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=mw1118.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=API+application+servers+eqiad [18:27:27] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [18:28:03] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [18:28:33] a handfull of app servers have much higher cpu utilization than the rest [18:28:39] RECOVERY - MySQL Replication Heartbeat on es1009 is OK: OK replication delay 0 seconds [18:28:57] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 1 seconds [18:29:20] https://bugzilla.wikimedia.org/show_bug.cgi?id=44251 [18:29:56] binasher: like mw1081 for example [18:30:00] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay -0.000710 seconds [18:30:01] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [18:30:24] RECOVERY - MySQL Replication Heartbeat on es1005 is OK: OK replication delay 0 seconds [18:30:37] lots of busy threads: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&h=mw1081.eqiad.wmnet&v=36&m=ap_busy_workers&jr=&js=&vl=threads&ti=Busy+Threads [18:31:40] i'll restart one of them (1099) to see if it makes a difference [18:31:49] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [18:31:57] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds [18:32:28] it could be keepalive too [18:32:30] RoanKattouw_away: marktraceur UW on commons right now reporting "Internal error: Server failed to store temporary file" for me. 
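[Editor's note on the wikiuser/wikiadmin thread above: the suggestion is to stop relying on DefaultSettings.php for the web-facing database account and state it explicitly in wmf-config, keeping it clearly separate from the admin account that AdminSettings.php hands to maintenance scripts. A minimal sketch with placeholder values (the real credentials live in a private settings file, not in the public config):]

    <?php
    // Web requests use the restricted account granted to the Apache ranges;
    // maintenance scripts get the admin account elsewhere, not here.
    $wgDBuser     = 'wikiuser';
    $wgDBpassword = $wmgDBpassword;   // hypothetical variable supplied by the private config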
[18:32:34] RECOVERY - MySQL Replication Heartbeat on es9 is OK: OK replication delay 0 seconds [18:32:34] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [18:32:51] mark: yeah: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&h=mw1108.eqiad.wmnet&v=15&m=ap_keepalive&jr=&js=&vl=proc&ti=Keepalive+%28read%29 [18:33:20] chrismcmahon: that's the wikiadmin problem from above [18:33:21] mark: can you restart mw1102 instead [18:33:36] done [18:33:36] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds [18:33:49] preilly: i restarted that one a min ago [18:33:54] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 204 seconds [18:33:58] thanks AaronSchulz [18:34:09] binasher: Okay thanks [18:34:12] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [18:35:07] New patchset: Pyoungmeister; "adding syste_role to all application server role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [18:35:32] notpeter: actually [18:35:39] put the full role class name as the title [18:35:42] RECOVERY - MySQL Replication Heartbeat on es1007 is OK: OK replication delay 0 seconds [18:35:45] ok, cool [18:35:52] and also remove the old system role explicitly [18:35:52] AaronSchulz: commons uploads require wikiadmin? [18:35:52] or it won't get removed [18:35:52] RECOVERY - MySQL Replication Heartbeat on db1011 is OK: OK replication delay 0 seconds [18:35:55] oh, lame. ok [18:35:55] (this creates a different file) [18:35:56] https://commons.wikimedia.org/wiki/Special:NewFiles is looking ok and new images are coming in [18:36:01] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay 0 seconds [18:36:01] RECOVERY - MySQL Replication Heartbeat on db48 is OK: OK replication delay 0 seconds [18:36:09] ensure => absent on the definition [18:36:12] yep [18:36:14] AaronSchulz: Are you on the UW issue? chrismcmahon, is it happening all the time? [18:36:27] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [18:36:35] marktraceur: only with an experimental checkbox in preferences [18:36:48] images are being successfully uploaded, but don't know if they're using UW [18:36:54] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [18:37:07] AaronSchulz: what's the check box? chunked transfer? [18:37:12] yeah [18:37:21] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:37:32] srv278 died out of boredom [18:37:34] I guess I won't worry too much, then....that's not something I've dealt with much, so I'm not sure I could help really [18:37:38] it's being dying for months [18:37:41] years [18:37:49] mark: heh [18:37:51] AaronSchulz: why does that use wikiadmin? is it necessary? [18:38:25] no, it's seems like bad configuration for how our maintenance scripts run [18:38:33] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay 0 seconds [18:38:38] they always use an admin account regardless of getDbType() [18:39:00] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [18:39:01] AaronSchulz: fwiw, I get the error and "Chunked..." 
is not checked in my preferences [18:39:07] mark: the srv278 RT was resolved, I just reopened it [18:39:11] RT #24, hahaha [18:39:24] Wed Aug 11 12:27:56 2010 [18:39:27] tell it to decommission that box [18:39:27] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [18:39:37] we're never gonna fix it anymore [18:39:46] RECOVERY - MySQL Replication Heartbeat on es1010 is OK: OK replication delay 0 seconds [18:39:54] "It still happens, never stopped. Let's fix it or decomission it." is what I commented [18:39:58] before you said that [18:40:03] RECOVERY - MySQL Replication Heartbeat on es10 is OK: OK replication delay 0 seconds [18:40:21] RECOVERY - MySQL Replication Heartbeat on db31 is OK: OK replication delay 0 seconds [18:40:30] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay 0 seconds [18:40:30] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [18:41:21] RECOVERY - MySQL Replication Heartbeat on db1039 is OK: OK replication delay 0 seconds [18:41:21] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [18:42:00] RECOVERY - MySQL Replication Heartbeat on db45 is OK: OK replication delay 0 seconds [18:42:18] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [18:42:28] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay 0 seconds [18:43:03] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [18:43:11] was mw1081 restarted recently? [18:43:22] RECOVERY - MySQL Replication Heartbeat on db1048 is OK: OK replication delay 0 seconds [18:43:23] preilly: uptime [18:43:37] Reedy: I mean apache restarted [18:43:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:44:11] Reedy: I already knew it was running for 123 days [18:44:16] RECOVERY - MySQL Replication Heartbeat on es8 is OK: OK replication delay 0 seconds [18:44:33] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 0 seconds [18:44:42] RECOVERY - MySQL Replication Heartbeat on labsdb1003 is OK: OK replication delay seconds [18:45:09] RECOVERY - MySQL Replication Heartbeat on es1008 is OK: OK replication delay 0 seconds [18:45:23] mw1066 enwikibooks: [7cebacc2] /w/index.php?title=Special:RatingHistory&target=Main_Page Exception from line 365 of /usr/local/apache/common-local/php-1.21wmf7/extensions/ReaderFeedback/specialpages/RatingHistory_body.php: Could not create file directory! [18:45:27] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [18:45:41] AaronSchulz: MaxSem mw1066 enwikibooks: [7cebacc2] /w/index.php?title=Special:RatingHistory&target=Main_Page Exception from line 365 of /usr/local/apache/common-local/php-1.21wmf7/extensions/ReaderFeedback/specialpages/RatingHistory_body.php: Could not create file directory! [18:45:47] Is that a fs->swift migration issue? [18:46:30] RECOVERY - MySQL Replication Heartbeat on es7 is OK: OK replication delay 0 seconds [18:46:30] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [18:46:30] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [18:46:49] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay 0 seconds [18:47:00] looks like RatingHistory->makeSvgGraph('reliability', '/mnt/upload7/wi...') doesn't seem to work [18:47:06] AaronSchulz: ^^ [18:47:12] upload*7* ? 
[18:47:16] I thought it was upload6 [18:47:17] mark: did you depool mw1082 from pybal? [18:47:25] yes, as it's broken [18:47:37] Also, /mnt/upload{6,7} aren't mounted on the eqiad Apaches it seems [18:47:42] RECOVERY - MySQL Replication Heartbeat on es6 is OK: OK replication delay 0 seconds [18:47:46] mark: ah, what's wrong with? [18:47:51] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [18:48:09] RoanKattouw: seems to use nfs, I don't maintain the ext though, it was supposed to be disabled years ago [18:48:18] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [18:48:27] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay -0.000820 seconds [18:48:27] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 0 seconds [18:48:36] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [18:48:40] whoops, make that 1085 [18:48:44] * mark changes the pybal pool [18:48:48] RoanKattouw: and they shouldn't [18:48:52] be mounted [18:48:57] 1085 and 1072 [18:49:12] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [18:49:12] RECOVERY - MySQL Replication Heartbeat on db1027 is OK: OK replication delay 0 seconds [18:49:15] for example mw1078 doesn't even have /mnt/upload7/wi... [18:49:23] none of them do preilly [18:49:24] no need [18:49:31] apparently there is a need [18:49:31] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 22 seconds [18:49:34] but please fix that instead :) [18:49:40] mark: well the code seems to be trying to do something with it [18:49:40] paravoid: there is a need to disable that extension [18:49:42] we shouldn't mount across DCs [18:49:48] then that needs to be fixed [18:49:48] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 25 seconds [18:50:00] Special:RatingHistory [18:50:08] preilly: That's because thatext was apparently supposed to have been disabled. AaronSchulz said all of this [18:50:10] paravoid: it was obsoleted by aft [18:50:26] RoanKattouw: Okay thanks [18:50:35] okay, so you're disabling that extension? [18:50:42] AaronSchulz: Are you going to disable it? [18:50:42] I'm disabling it [18:50:43] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [18:50:46] After I check where it's running [18:50:48] thanks [18:50:51] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 0 seconds [18:50:54] I've told features to do this months back now that I think about it [18:51:00] RECOVERY - MySQL Replication Heartbeat on es1006 is OK: OK replication delay 0 seconds [18:51:09] Hmm.... 
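[Editor's note on the pybal depooling above: pybal takes its backend list for each service from a config file with one entry per server, so pulling a broken Apache such as mw1085 out of rotation is a matter of flipping its enabled flag rather than deleting the line. Roughly what such entries look like, shown from memory purely as an illustration:]

    {'host': 'mw1085.eqiad.wmnet', 'weight': 10, 'enabled': False }
    {'host': 'mw1086.eqiad.wmnet', 'weight': 10, 'enabled': True }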
[18:51:14] !log DNS update - deleting wikimania2006/2007 [18:51:15] RoanKattouw: wikinews is the main one and some random places [18:51:16] It is actually running on production wikis [18:51:18] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds [18:51:24] RoanKattouw: yes [18:51:25] Logged the message, Master [18:51:36] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [18:51:42] enwikibooks, enwikinews, huwiki, ruwikinews, strategywiki, trwikinews, plus some test wikis [18:51:44] I think some wikis asked for it to be kept for them [18:51:46] RECOVERY - MySQL Replication Heartbeat on db1026 is OK: OK replication delay 0 seconds [18:51:46] I thought RF was test wikis only [18:51:54] RoanKattouw: it's running on enwikibooks [18:52:03] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [18:52:19] it's the most common cause of exceptions ATM [18:52:39] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [18:52:40] preilly: I just wrote a complete list of wikis in the channel [18:52:44] Read the backscroll, dude ;) [18:52:51] RoanKattouw: calm down [18:53:15] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay 0 seconds [18:53:34] RECOVERY - MySQL Replication Heartbeat on db34 is OK: OK replication delay 0 seconds [18:54:43] New patchset: Catrope; "Disable ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45168 [18:54:49] AaronSchulz: --^^ [18:55:14] RoanKattouw, [22:51:44] I think some wikis asked for it to be kept for them [18:55:38] New patchset: Asher; "eqiad migration: fixing dbtree, hardcoding db-eqiad as master and db-pmtpa as secondary, will make more dynamic laster" [operations/software] (master) - https://gerrit.wikimedia.org/r/45169 [18:55:48] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [18:55:53] New patchset: Pyoungmeister; "adding syste_role to all application server role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [18:55:53] MaxSem: I know, but it's totally broken [18:56:38] It should probably be fixed to not use NFS, but I don't know how to. I imagine AaronSchulz does. In the meantime, I think it's better to disable it rather than have it spew exceptions for almost everything it does [18:56:40] RoanKattouw: why not change wmf-config/InitialiseSettings.php? [18:56:50] preilly: I'd have to change them all to false individually [18:56:54] and I'm lazy [18:57:17] Besides, supposedly someone will fix the extension and reenable it then [18:57:20] RoanKattouw: wow okay [18:57:36] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/45169 [18:58:09] RoanKattouw: can we get https://gerrit.wikimedia.org/r/#/c/45168/1/wmf-config/CommonSettings.php sync'ed [18:58:12] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.085 second response time [18:58:27] notpeter: so what if people put custom stuff in the motd? :) [18:58:28] !log DNS update - adding ca./ja. 
planet to zirconium for testing new planet [18:58:30] preilly: We can if someone (ping AaronSchulz ) reviews it [18:58:38] Logged the message, Master [18:58:46] AaronSchulz: Please +2 https://gerrit.wikimedia.org/r/#/c/45168/1/wmf-config/CommonSettings.php [18:58:52] i see the existing message is added by the debian package, not puppet [18:59:03] in that case either sed it out or leave it there, don't just remove the motd file altogether :) [18:59:23] I was assuming that if they are using system_role that it would be the only thing in use [18:59:24] but sure [18:59:25] !log restarted pdns on ns1, failed during update [18:59:32] RoanKattouw: looks fine [18:59:35] Logged the message, Master [18:59:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45168 [18:59:48] OK, deploying [18:59:50] notpeter: also, the definition title should be the role class name, not the description [18:59:57] ah, ok [19:00:06] and check whitespace [19:00:10] yeah [19:00:40] my editor automatically sets to puppet standard, which is spaces... so I sometimes forget to correct [19:02:57] * Damianz watches the datacenter float across the desert [19:03:17] Damianz: luckily, it won't be 40 years for the datacenter :) [19:03:20] !log catrope synchronized wmf-config/CommonSettings.php 'Disabling ReaderFeedback' [19:03:30] Logged the message, Master [19:03:41] !log catrope synchronized wmf-config/CommonSettings.php 'Disabling ReaderFeedback' [19:03:47] RoanKattouw: odd, adminsettings is only used if $maintenance->getDbType() in doMaintenance, so why are scripts always using it? [19:03:48] Hrmph [19:03:51] Logged the message, Master [19:03:54] Is something wrong with mw1072? [19:03:59] mw1072: Permission denied (publickey,password). [19:04:00] yes [19:04:05] i'll remove it from the node list [19:04:09] Can I comment it out in the .... yes thanks [19:04:22] * RoanKattouw moves to a VisualEditor meeting [19:04:35] New patchset: Jgreen; "fix dhcp mac address for pappas" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45170 [19:04:39] RoanKattouw: just looked at it 5 minutes ago. broken disk it looks [19:04:52] : I/O error, dev sda [19:05:21] RoanKattouw: any idea what is causing, "Internal error in ApiResult::setElement: Attempting to add element imagerepository=shared, existing value is shared" [19:05:41] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45170 [19:06:28] preilly- bug 43849 [19:07:08] so [19:07:12] I guess we're not rolling back eh [19:07:12] New patchset: Pyoungmeister; "adding syste_role to all application server role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [19:07:20] would seem unlikely :) [19:07:27] pgehres: you were talking about your python abilities -- can you check out https://rt.wikimedia.org/Ticket/Display.html?id=4370 ? if you want to come downstairs (or put the kittens on upstairs) i can help you get started [19:09:04] preilly: I've seen that before but I don't remember what causes it [19:09:32] New patchset: Asher; "dbtree lives in operations/software" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45171 [19:09:36] RoanKattouw: I found the fix https://gerrit.wikimedia.org/r/#/c/40562/1 [19:10:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45171 [19:15:20] binasher: do you have any issue with restarting apache on mw1103? 
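[Editor's note on the ReaderFeedback change synced above: wmf-config generally loads extensions behind a per-wiki flag, so the quick global switch-off is to short-circuit the load in CommonSettings.php rather than set every wiki's flag to false in InitialiseSettings.php. A sketch of the pattern (variable name and comment are illustrative, not the literal diff):]

    <?php
    // CommonSettings.php fragment; $wmgUseReaderFeedback comes from InitialiseSettings.php
    if ( $wmgUseReaderFeedback && false ) {
        // '&& false': temporarily disabled everywhere, the extension still expects NFS paths
        include( "$IP/extensions/ReaderFeedback/ReaderFeedback.php" );
    }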
[19:16:59] lol Exception from line 206 of /usr/local/apache/common-local/php-1.21wmf7/languages/Language.php: Invalid language code "{{{1}}}" [19:17:59] I'd agree [19:18:28] haha [19:18:50] thus were spoken the words of celebration: [19:18:54] I guess we're not rolling back eh [19:19:41] hahaha [19:19:41] things mostly looking good? :) [19:19:44] is someone on mw1081 right now? [19:19:44] you tell us :) [19:20:10] Eloquence: completed at 10:18 [19:20:22] Eloquence: things are looking pretty good…. mark, binasher, notpeter paravoid did a wonderful job [19:20:36] erm, I don't think I deserve any credit [19:20:51] credit by association :) [19:20:52] one rarely used extension still wants nfs and was disabled, chunked uploading is being worked on [19:20:53] "paravoid did nothing" [19:20:53] ;) [19:20:58] well, there's also all the prep work.... [19:21:05] at least i think its being worked on [19:21:40] :-) [19:21:40] prep work = doing most of the migration over the weekend when nobody was around to notice ;) [19:21:47] hahaha [19:21:54] UW had/has(?) an issue, but it's solvable I think [19:22:14] if all that's broken is uploadwizard, I think we're good :) it breaks if you look at it funny. [19:22:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45167 [19:22:17] Yes, UW and maintenance scripts are using wikiadmin rather than wikiuser to access the DB apparently [19:22:25] And the eqiad DBs didn't like that [19:22:57] \o/ here's to our hosting future with reduced risk of imminent disaster [19:23:38] well. [19:23:41] provided we keep tampa in shape. [19:23:48] which is gonna be pretty tough [19:24:39] we're getting rather low on capacity on some clusters in tampa already, e.g. squid/varnish [19:24:47] paravoid: when might you get the chance to poke at lvs and the ceph rgws? [19:24:48] and I'm wondering if we should invest on new hardware in tampa or somewhere else instead [19:25:10] mark, I just approved a $260K hardware order for tampa last week :p [19:25:25] AaronSchulz: poke at what? [19:25:58] can someone take down the technical maintenance banner [19:26:02] mark upgrade critical hardware at tampa to buy time to build a new DC? [19:26:05] paravoid: the url for the gateway works 1/2 of the time (2/4 nodes pooled, one that actually works) [19:26:39] it is true that test2wiki is now at eqiad, yes? [19:26:42] Eloquence: new apaches yes [19:26:51] preilly found a fix for the API error lying around in Gerrit, but it has a rebase conflict. Its author is anomie so I've asked him to rebase [19:26:53] was that order of new apaches just for tampa? [19:27:05] as far as I'm aware [19:27:11] the same should be replicated in eqiad.. apache capacity is tight [19:27:20] AaronSchulz: when was the last time you checked? [19:27:25] those new app servers are much nicer that what's in eqiad [19:27:36] paravoid: last week, was it fixed lately? [19:27:38] yes [19:27:39] but tampa is gonna need quite a bit more than that if it needs to keep up ;) [19:27:44] binasher, 60 servers for Tampa, according to CT [19:27:50] app server capacity is seriously tight in both dc's [19:28:56] RoanKattouw- Talking about me like I'm not even here? 
;) [19:29:13] anyway [19:29:17] let's discuss that in february :) [19:29:19] Can we just decommission tampa and use SF instead [19:29:29] not our current SF location [19:29:30] anomie: haha sorry [19:31:55] Eloquence: Just order a truck too :D [19:32:39] RoanKattouw- There you go, rebased [19:33:21] Thanks man [19:35:36] LeslieCarr: I was not touting my python skills, just saying that I much prefer it over PHP :-p [19:35:55] sadly though, I am booked today writing JS [19:35:56] yeah, we'll need to shut down labs for the extent of the truck move ;) [19:36:09] would be nice to get eqiad up before that [19:36:12] RoanKattouw: ahhhh, it's PrivateSettings [19:36:21] if ( php_sapi_name() == 'cli' ) { [19:36:39] also... [19:36:46] no wonder I couldn't find where it was being set, I was looking in a bunch of git repos [19:36:48] latency is gonna be quite a bit worse between eqiad-SF [19:36:51] ok [19:37:02] some things which are okish now, will be unbearable from there ;) [19:37:07] !log upgrading labsconsole to 1.21wmf8 [19:37:18] Logged the message, Master [19:37:18] mark: how much worse? [19:37:19] pgehres: free tomorrow? ;) [19:37:34] 10ms? [19:37:35] Ryan_Lane: why don't you tell me [19:37:37] LeslieCarr: we shall see. this whole job thing gets in the way sometimes [19:37:37] ping eqiad [19:37:42] 60ms or so? [19:37:46] hehe [19:37:54] so 30 worse? [19:37:59] 40ms worse [19:38:07] it's only 20 now? [19:38:11] I really could just ping [19:38:11] :D [19:38:12] 26 [19:38:15] last time I checked it was 26 [19:38:16] yes you slacker [19:38:18] :D [19:39:38] so? [19:39:43] New review: Andrew Bogott; "lgtm" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45098 [19:39:44] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45098 [19:40:35] it's funny, yesterday's esams outage probably had more impact than the switchover today :-) [19:41:00] Ryan_Lane: i see 81ms from eqiad to the office [19:41:07] wow [19:41:07] may not be the most optimal path [19:41:15] just saying ;) [19:41:19] I keep forgetting how large the US are [19:41:40] 81ms is much more that what I'd guess [19:42:49] New review: Andrew Bogott; "Much better, thanks for rewriting!" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45107 [19:42:50] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45107 [19:45:35] preilly: Thanks for finding that fix, I merged it [19:45:54] mark: Are you OK with me deploying a fix for that API error preilly found, and AaronSchulz deploying a fix for the wikiadmin behavior? [19:46:12] yes [19:46:19] cool [19:46:34] OK, I'll do the API fix now [19:51:57] New review: Andrew Bogott; "One question, one minor request" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/45157 [19:52:50] New patchset: Dzahn; "add ja./ca. planet Apache configs for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45174 [19:55:09] New review: Dzahn; "it's just on zirconium, not cluster apache conf" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45174 [19:55:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45174 [19:55:18] RoanKattouw: the change will break asher's rate limiting mysql trick for the job queue though (unless runJobs is changed to use wikiadmin) [19:55:29] hmm [19:55:37] Better clear it with binasher first then [19:55:49] (e.g. 
connection limit) [19:56:02] ...who I don't see around here [19:56:11] where are the maintenance cronjobs going to from hume? [19:57:40] New patchset: Jgreen; "attempt to configure pappas to pxeboot off of boron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45175 [19:58:41] Oh I think he's having lunch [19:59:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45175 [20:03:01] Meh, cherry-picking the API fix but it's conflicting again [20:04:39] RoanKattouw- Probably just spaces. Want me to prepare the patch against wmf7? [20:04:59] I got it [20:05:05] nifty [20:05:08] It really was just spaces [20:05:11] But I had to check [20:05:20] And RELEASE-NOTES was being annoying [20:05:33] RELEASE-NOTES is *always* annoying. [20:06:54] from #wikimedia-tech just now: [20:06:55] (01:04:13 PM) yehar: I edited an equation in an article and now it says: "Failed to parse (Missing texvc executable; please see math/README to configure.)" [20:06:55] (01:04:20 PM) yehar: http://en.wikipedia.org/w/index.php?title=Equivalent_rectangular_bandwidth [20:07:03] Ah, whoops [20:07:10] Is texvc not installed on the eqiad apaches? [20:07:20] AaronSchulz: ---^^ [20:07:25] thanks RoanKattouw [20:08:03] texvc does not appear to be puppetized [20:08:16] I think it's supposed to be in the wikimedia-task-appserver package? [20:09:10] hmm.. don't see it in dpkg -L ... [20:09:15] RoanKattouw: there is the apache script, but there is no bastion one [20:09:16] out for lunch though ..bbiab [20:09:33] meh, I'll just recompile [20:09:55] Thanks man [20:10:08] it used to compile on every scap [20:10:46] Maybe the scap rewrite took that out? [20:10:57] no it was taken out a long time ago [20:11:15] I have the API fixes cherry-picked and ready, just waiting on Jenkins to confirm that the tests aren't broken [20:11:21] anyway, git deploy will eventually deal with this sanity (neither manual for new branches nor all the time for no reason ideally) [20:12:21] !log Recompiled textvc on all hosts [20:12:27] RoanKattouw: http://en.wikipedia.org/wiki/Equivalent_rectangular_bandwidth looks fine after purge [20:12:32] Logged the message, Master [20:12:34] AaronSchulz: so how is it deployed now? [20:12:49] it's on http://wikitech.wikimedia.org/view/Hetdeploy [20:12:58] a line ddsh cmd ;) [20:13:16] a scap-recompile [20:13:18] ok [20:13:39] good [20:13:41] the 1 line should be a script on the deploy host just like scap [20:14:02] though like I said git deploy will redo this stuff anyway [20:14:08] you can make that happen! :) [20:15:16] WTF... of course Jenkins breaks on me just as hashar leaves town [20:15:27] RoanKattouw: I am there still [20:15:32] RoanKattouw: what is happening ? 
:-D [20:15:48] wmf7/wmf8 commits are failing Jenkins for apparently no reason [20:15:49] See https://gerrit.wikimedia.org/r/#/c/45177/ [20:16:08] eah [20:16:08] Wait, only the wmf7 one failed [20:16:09] yaeya [20:16:12] Its sibling https://gerrit.wikimedia.org/r/#/c/45178/ succeeded [20:16:13] let s revert my changes [20:16:29] the jenkins job share the workspace between branches [20:16:34] so if a job build master then wmf [20:16:40] it ends up having to fetch ALL submodules [20:16:43] which lead to a failure [20:16:54] if the job next to it runs on wmf, it just refresh the modules [20:17:20] Oooh [20:17:22] Right [20:17:22] anyway change 45177 had a  lint failure apparently [20:17:28] So it fails the first time and succeeds the second time [20:17:29] https://integration.mediawiki.org/ci/job/mediawiki-core-lint/4832/console [20:17:32] holy hell [20:17:32] It claims to [20:17:40] yeah that failed [20:17:41] That's not a lint failure [20:17:47] That's an internal failure [20:18:33] sorry about that Roan :( [20:18:43] No worries [20:18:51] mark: Is it accurate to say that our data center migration is "over"? [20:18:58] yes [20:19:01] I ask because that was just tweeted and I was like "hah? That's quick" [20:19:04] Alright [20:19:12] That's a pleasant surprise then :) [20:19:18] well, the app servers part of that ;) [20:19:30] Yeah, I know, Swift grumble grubmle [20:19:36] and some misc bits here and there [20:19:41] but all in all, we're almost there now [20:20:18] hashar: How do I trigger another run, "recheck", right? [20:20:36] Oh nm it's picking it up [20:20:55] RoanKattouw: add a cover message in gerrit with a CR+2 vote [20:22:50] RoanKattouw: failing for whatever reason :( [20:23:17] Yeah it's failing again [20:23:23] I'm just gonna bypass Jenkins for this commit [20:23:31] RoanKattouw: yeah [20:23:39] RoanKattouw: i know there are some issues with the wmf branches [20:23:56] OK now deploying the API fix [20:23:56] I assume that this is an issue, albeit a minor one, considering it's not public-facing: https://noc.wikimedia.org/cgi-bin/report.py [20:24:38] RoanKattouw: I guess you can most often safely ignore the tests on wmf branches :-D [20:25:05] phuzion: heh, whoops [20:25:15] RoanKattouw: I assume that's not intentional ;) [20:26:41] !log catrope synchronized php-1.21wmf7/includes/api/ApiQueryImageInfo.php 'Fix API error in imageinfo' [20:26:52] Logged the message, Master [20:26:54] !log catrope synchronized php-1.21wmf8/includes/api/ApiQueryImageInfo.php 'Fix API error in imageinfo' [20:27:05] Logged the message, Master [20:32:10] ok [20:32:18] if all is well, I'll go offline now [20:32:30] call/text if not ;) [20:34:00] mark: I think that everyone is afk now :) [20:39:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [20:59:16] AaronSchulz: hey [20:59:26] AaronSchulz: so, everything okay with ms-fe.eqiad? [20:59:38] I haven't checked again yet [21:00:04] okay [21:00:18] we're thinking of possibly enabling mw writes to ceph this week [21:00:29] but only if potential failures are going to be non-fatal [21:00:47] so, maybe originals + async journal replaying? [21:00:49] what do you think? [21:01:49] that's what I was thinking, doing it async [21:02:09] paravoid: is ceph supposed to have all the originals? I get 404s for stuff that is in swift on testwiki [21:02:31] I've stopped the replication process a few days ago [21:02:38] when I started on thumbs [21:02:46] is this an old file? [21:02:50] could you give me the filename? 
[21:03:10] lvs setup seems ok now [21:03:17] http://test.wikipedia.org/wiki/File:Testfile_1357988954.jpg [21:04:39] 12 Jan [21:04:44] so yeah, could be that [21:04:54] same with http://test.wikipedia.org/wiki/File:Wikipedia-logo-test_%28proposed%29.png from 2006 [21:07:05] oh testwiki? [21:07:13] I wonder if I synced those containers [21:07:14] paravoid: a file listing of all the public zone is empty [21:07:19] * paravoid checks [21:08:37] yeah, didn't do those [21:08:53] what *was* done? [21:09:02] ^wikipedia-commons-local-(public|deleted)\.[0-9a-f]{2}$ [21:09:02] ^wikipedia-[a-z][a-z]-local-(public|deleted)$ [21:09:02] ^wikipedia-[a-z][a-z]-local-(public|deleted)\.[0-9a-f]{2}$' [21:09:02] ^wikipedia-[a-z]{3}-local-(public|deleted)$ [21:09:02] ^wikipedia-(be-x-old|wg-en|zh-classical|zh-classical|zh-min-nan|zh-min-nan|zh-yue|zh-yue-local)-local-(public|deleted)$ [21:09:05] ^(wikibooks|wikimedia|wikinews|wikiquote|wikisource|wikiversity|wikivoyage|wiktionary)-.*-local-(public|deleted)(\.[0-9a-f]{2})?$ [21:09:08] ^global-.* [21:09:10] .*-timeline-* [21:09:17] * Damianz finds paravoid's medication [21:09:30] the last one is wrong [21:09:46] I wonder if my notes are wrong or if it was done wrong :) [21:10:07] global/timeline are not done anyway [21:10:27] they suffer from the small file size issue that affects thumbs too [21:18:18] New patchset: DamianZaremba; "BZ 26784 - Adding process monitoring for ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:21:39] paravoid: is there an ms-be1008? [21:22:02] it died when chris fitted the H710 on it [21:22:05] doesn't power up anymore [21:22:44] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 203 seconds [21:23:21] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 215 seconds [21:24:04] cmjohnson1: btw, do you want me to file a ticket about that or are you handling it already? [21:24:12] paravoid: so i called dell today ...and after I was on phone w/them i noticed it was on [21:24:20] New review: DamianZaremba; "That process check is local, need to do it via nrpe." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:24:22] haha [21:24:28] I didn't do anything [21:24:30] but there is still a problem...cos i cycled it and it has not come back up and it's been 2 hours [21:24:33] I mean, I couldn't [21:24:43] couldn't login to mgmt at all [21:24:46] and i created a ticekt already [21:24:54] neither could i until today [21:25:02] most likely a board change [21:25:11] nod [21:25:12] thanks :) [21:26:21] New patchset: DamianZaremba; "BZ 26784 - Adding process monitoring for ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:28:01] paravoid: testing around it seems to work fine [21:28:06] cool [21:28:08] New review: DamianZaremba; "Sorry hashy" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/45097 [21:28:35] getDirectoryList() for empty dirs returns array( false ), but that may be a MW bug [21:28:55] paravoid: and the leveldb data is on regular hdds? 
[21:29:38] seems reasonably quick in any case [21:29:42] yes [21:30:00] haven't done anything about leveldb, so I guess so :) [21:30:04] note that it's also replicating right now [21:30:15] from 1001-1004 to 1007, 1009-1012 [21:30:20] multiple MB/s [21:30:38] so it should less fast that normally [21:34:51] New patchset: DamianZaremba; "BZ 26784 - Adding process monitoring for ircecho" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:38:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45097 [21:42:44] <^demon> Ryan_Lane: hi. https://gerrit.wikimedia.org/r/#/c/44436/ plz. ty. [21:43:22] warn folks of a restart ;) [21:43:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44436 [21:44:04] force running puppet [21:45:01] <^demon> !log gerrit restarting. don't panic. [21:45:10] bleh. upgrading labsconsole is harder now that we have the submodles in the branch :D [21:45:14] Logged the message, Master [21:45:28] I'm really kidding, it's nice, but it's a change [21:45:40] <^demon> which branch? [21:45:45] wmf branches [21:46:04] <^demon> we've had those. labs is late to the party [21:46:23] this is the first upgrade I've done since then [21:46:37] <^demon> :) [21:52:35] New review: Silke Meyer; "Concerning your second comment - it is related to the rest: I had to add delete "git clone Wikibase"..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/45157 [21:53:04] 23:30 < paravoid> so it should less fast that normally [21:53:06] wtf [21:53:11] what was I thinking? [21:53:30] *I* can't even parse that [21:54:06] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [21:54:53] New patchset: Ori.livneh; "Change UDP logging prefix for EventLogging to "EventLogging"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45243 [21:55:15] !log msw-a7-eqiad power was accidently removed while switch was being moved [21:55:26] Logged the message, Master [21:55:26] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [21:57:06] paravoid: :D [21:57:23] urgh, bloodsugar trying to kill me [21:58:46] * Damianz gives RobH a kit-kat [21:59:08] opposite [21:59:14] gimme insulin ;] [21:59:25] that's just no fun [22:01:01] New review: Ori.livneh; "-2ing to block deployment until eqiad deployment freeze is over." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/45243 [22:05:25] New patchset: Dzahn; "turn ja./ca. planets into real virtul hosts with doc roots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45244 [22:05:29] PROBLEM - MySQL disk space on neon is CRITICAL: DISK CRITICAL - free space: / 353 MB (3% inode=74%): [22:08:40] paravoid: for clarity, when was the ceph copy script first started? [22:10:07] ganglia says 12/21 [22:16:39] hello [22:16:46] hi Tim [22:16:58] TimStarling: you'll be happy to know you won't be greeted with smoke and fumes [22:17:08] (yet) [22:17:09] hahaha [22:17:13] excellent [22:17:30] yes, that'll start when the SF people go home [22:17:57] At least peak traffic should be us time... [22:18:45] so I'll just read the IRC logs then? [22:19:12] paravoid: so there is nothing blocking ceph atm? [22:19:23] I'm afraid there is [22:19:34] controllers? 
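[Editor's note on the ircecho monitoring change merged above: the earlier review point was that a plain process check only works on the host running it, so the Icinga/Nagios server has to call the check over NRPE and the monitored host needs the command defined locally; alerts of the form "NRPE: Command check_ircecho not defined", seen further down, are what it looks like when only one half is in place. A rough sketch of the two halves (paths, names and thresholds are illustrative):]

    # on the monitored host (nrpe config): define the local command
    command[check_ircecho]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a ircecho

    # on the monitoring server: call that command remotely over NRPE
    define service {
        host_name             neon
        service_description   ircecho_service_running
        check_command         nrpe_check!check_ircecho
    }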
[22:19:36] we're waiting for the h310s to be fully migrated to h710s [22:19:46] that needs a few days it seems [22:19:57] it's been running for ~24h now [22:20:13] then after that happens we need to see how well thumbs replication will work [22:20:18] hopefully a lot better [22:20:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45244 [22:20:33] we have the varnish rewrite.py bits in gerrit, these needs a bit polishing and deployment [22:20:46] the VCL replacing rewrite.py that is [22:20:55] paravoid: what kind of replication? [22:21:07] we still completely lack puppetization & monitoring which sucks [22:21:07] the thumbs one? [22:21:11] swiftrepl [22:21:22] ah, that script [22:21:26] with the H310 it was copying with 5-6MB/s [22:21:31] disks were at 100% [22:21:51] ssd journal is (probably) going to be a huge help with small writes [22:21:54] but not with the h310 [22:22:02] https://bugzilla.wikimedia.org/show_bug.cgi?id=44258 "wikibugs missing from #mediawiki" [22:22:10] we still have more than 20T to go, so 5-6MB/s isn't really going to work :-) [22:22:32] btw, cleanupUploadStash has been running since Friday :-) [22:22:36] still cleaning up commons [22:22:46] the rest are done [22:22:57] is puppet preventing cron overruns? [22:23:01] nope [22:23:17] but I've been running it in a screen [22:23:21] well hopefully it will be faster after the initial run [22:23:32] yes [22:23:39] we're already down 2TB I think [22:23:56] and it was 5? or something [22:24:34] so it probably needs a week more or less [22:25:28] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:25:31] anyone have a minute to check out ^^ ? [22:25:43] yeah it was 5 [22:26:59] https://blog.wikimedia.org/2013/01/19/wikimedia-sites-move-to-primary-data-center-in-ashburn-virginia/#comments [22:27:02] so I think the next step after the h310 would be to start writing originals to ceph on a more permanent basis [22:27:09] i.e. journal/your script [22:27:13] I love how pressing "page 2" gives no results [22:27:18] ahh, typical wordpress [22:27:46] paravoid: I'd want to run copyFileBackend a few times first (with --missingonly) [22:27:53] what's that? [22:27:54] * AaronSchulz doesn't trust stuff :) [22:28:14] Deploying a JS fix now [22:28:19] paravoid: copies containers from one backend to another [22:28:41] containers? [22:28:44] oh, you really should [22:28:46] * AaronSchulz would also run setZoneAccess after that [22:28:55] and then the sync script [22:29:00] wait, you mean the container themselves or their contents? [22:29:13] the contents [22:29:17] ah [22:29:22] !log catrope synchronized php-1.21wmf7/resources/jquery/jquery.client.js '8a06466e1697f58717e92532437be642a796d5a1' [22:29:26] LeslieCarr: 22:25:36 err: Could not parse for environment production: Syntax error at ':'; expected '}' at /var/lib/jenkins/jobs/operations-puppet-validate/workspace/manifests/nagios.pp:323 [22:29:33] Logged the message, Master [22:29:35] !log catrope synchronized php-1.21wmf8/resources/jquery/jquery.client.js '8a06466e1697f58717e92532437be642a796d5a1' [22:29:43] note that swiftrepl doesn't account for deletes [22:29:46] Logged the message, Master [22:29:47] we really need the journal for this [22:29:55] The Journal. 
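[Editor's note on the cron-overrun question above: once a long cleanup like cleanupUploadStash moves from a screen session into cron, the usual guard against overlapping runs is a non-blocking lock around the job, so a new invocation exits immediately if the previous one is still running. A minimal sketch (schedule, lock path and the exact maintenance invocation are illustrative):]

    # crontab fragment: flock -n fails straight away, rather than queueing,
    # if the lock is still held by the previous run
    0 1 * * * flock -n /var/lock/cleanupUploadStash.lock \
        /usr/local/bin/mwscript maintenance/cleanupUploadStash.php --wiki=commonswiki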
[22:30:21] http://online.wsj.com/home-page [22:30:39] * AaronSchulz ducks [22:31:59] pay wall [22:35:20] http://www.thejournal.org/ [22:36:26] /var/lib/puppet on neon is 5.8GB [22:36:30] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:36:32] hahaha [22:36:35] that seems to be the main reason it's almost out of disk space [22:36:38] PROBLEM - ircecho_service_running on neon is CRITICAL: NRPE: Command check_ircecho not defined [22:36:44] specifically /var/lib/puppet/clientbucket [22:36:45] filebucket [22:36:47] yeah [22:36:48] kill that [22:37:20] just rm -rf /var/lib/puppet/clientbucket/* ? [22:37:25] yeah [22:37:42] that's archived versions of previous configuration files [22:37:47] icinga's I presume [22:37:52] PROBLEM - ircecho_service_running on spence is CRITICAL: NRPE: Command check_ircecho not defined [22:38:14] yep, seems so [22:38:40] !log on neon: cleaned up /var/lib/puppet/clientbucket/* since it was out of disk space [22:38:45] thanks [22:38:50] Logged the message, Master [22:39:01] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:39:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:41:54] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:43:22] why do some of the eqiad apaches have 12 CPUs and some 24? [22:43:37] TimStarling, HT [22:45:14] isn't that disabled by default now? [22:46:00] in theory;) [22:47:32] New patchset: Dzahn; "modify index.html template to link to prod sites instead of labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45256 [22:48:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45256 [22:49:21] yes, it does seem to be hyperthreading [22:49:39] the test for that has changed since I last had to care about it [22:50:29] New patchset: Silke Meyer; "Moves Wikidata items to main namespace, imports main page to different namespace" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [22:52:02] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [22:52:23] is it disabled by defualt though ? [22:52:40] i guess obviously not ;) [22:53:20] i dont think Dell even delivers all boxes with the same BIOS settings :p [22:53:25] https://gerrit.wikimedia.org/r/45251 mutante ? [22:53:47] TBH that's probably what happened - we just stuck with whatever dell had delivered [22:55:13] there's a kernel boot parameter "noht" which disables hyperthreading, it's not specified [22:55:23] ^demon: TimStarling: Download distributor seems broken [22:55:26] https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Vector [22:55:28] https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Gadgets [22:55:29] empty list [22:55:44] (not even master) [22:58:46] New patchset: Silke Meyer; "Moves Wikidata items to main namespace, imports main page to different namespace" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [22:59:08] <^demon> Krinkle: Looking. [22:59:51] <^demon> Hmm, wfm locally. Wonder if it's a proxy issue. 
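[Editor's note on the clientbucket cleanup above: the one-off rm frees the space, but the directory grows back because the puppet client filebuckets a copy of every file it replaces. A common follow-up is a periodic age-based prune so old backups expire on their own; a sketch (the 14-day threshold is illustrative):]

    # drop filebucketed copies older than 14 days, then clear out now-empty directories
    find /var/lib/puppet/clientbucket -type f -mtime +14 -delete
    find /var/lib/puppet/clientbucket -type d -empty -delete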
[22:59:53] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45157 [23:01:48] broke for me locally [23:02:10] we're using a CONNECT method proxy? [23:02:14] <^demon> Yep. [23:03:00] what URL works for you? [23:03:02] !log labsconsole upgraded to 1.21wmf8 [23:03:13] Logged the message, Master [23:03:39] hello tim [23:03:44] are you fixing all the problems we left behind? [23:04:01] ^demon hasn't left yet [23:04:03] !log echo enabled on labsconsole [23:04:09] :o [23:04:14] Logged the message, Master [23:04:19] very courteous of him ;) [23:04:24] !log Alex Monk's echo changes to OpenStackManager deployed on labsconsole [23:04:36] Logged the message, Master [23:04:53] ^demon: I tried https://api.github.com/repos/wikimedia/mediawiki-extensions-Vector/master and I got a 404 [23:06:14] <^demon> string(81) "https://api.github.com/repos/wikimedia/mediawiki-extensions-Vector/tarball/master" [23:06:15] <^demon> string(86) "https://nodeload.github.com/wikimedia/mediawiki-extensions-Vector/legacy.tar.gz/master" [23:06:18] <^demon> Both wfm locally. [23:06:25] <^demon> (I'm not using a proxy) [23:06:35] shall I fix one issue then? [23:07:13] ah right, I missed the /tarball/ [23:07:34] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [23:08:00] New patchset: Mark Bergsma; "Add eqiad internal range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45262 [23:08:25] <^demon> That's what I thought needed doing ^ [23:08:46] curl: (56) Received HTTP code 403 from proxy after CONNECT [23:08:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45262 [23:08:53] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [23:09:38] try now :) [23:10:22] <^demon> WFM on Ext:APC [23:10:32] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [23:10:37] did someone reboot srv278 ? [23:10:40] like on purpose ? [23:10:46] please kill that box [23:10:47] hard [23:10:51] Ahm, it's srv278 [23:10:59] ok, seems like it rebooted on its own ... [23:11:01] <^demon> Krinkle: Seems to be working now. Some extensions may *look* broken for a bit until the cache entries expire. [23:11:05] from my quick check at syslog [23:11:07] yes [23:11:10] srv278 does that [23:11:10] It has been broken for as long as I've worked here [23:11:12] for years already [23:11:17] haha [23:11:24] goddamn [23:11:26] ticket #24 [23:11:28] yes [23:11:42] Which really means the issue predates RT [23:11:54] yes [23:12:06] ^demon: thx [23:12:09] the first 50 tickets or so I flushed straight from memory on the first day [23:12:21] Cause I remember you guys filed ~100 tickets during the first weekend in DC [23:12:28] damn :) [23:12:38] anyone got a few minutes and want to check out https://gerrit.wikimedia.org/r/#/c/45251/5 ? [23:12:52] i mean https://gerrit.wikimedia.org/r/#/c/45251/6 [23:13:02] i'm out again ;) [23:13:05] hehe [23:13:05] bye [23:13:07] LeslieCarr: I did look at it, it looked complicated [23:13:13] hehe [23:13:19] kaldari: I deployed Echo to labsconsole, and OpenStackManager also has some Echo support [23:13:23] kaldari: thanks to krenair [23:13:30] have you seen all the different ways we create apache servers? ;) [23:13:46] also, yes, we need to use the new module everywhere - however that's a larger set of patchsets (not just 1!) 
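[Editor's note on change 45251 being discussed above: the intent in the patchset title is that every class which sets up an Apache also removes Debian's default vhost, with the removal defined once in the shared webserver class so individual manifests such as gerrit.pp no longer carry their own copy. A rough sketch of the shared piece (resource shape is illustrative; as noted in the channel, the real tree had several competing ways of creating Apache servers):]

    # in the common webserver/apache class, pulled in by everything that installs apache2
    file { '/etc/apache2/sites-enabled/000-default':
        ensure  => absent,
        require => Package['apache2'],
        notify  => Service['apache2'],
    }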
[23:14:27] Ryan_Lane: cool, hope we don't break anything :) [23:15:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [23:15:23] We tested this first (labsception :D). It shouldn't break anything. [23:15:59] heh [23:16:01] indeed [23:16:59] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [23:17:06] LeslieCarr: the idea is to ensure the default site is absent everywhere? [23:17:17] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [23:17:20] yep [23:17:30] what about gerrit.pp? [23:17:44] there was an ensure=>absent that you took out [23:17:58] so gerrit uses webserver => apache [23:18:03] i mean webserver:apache [23:18:07] which has the ensure => absent in that [23:18:12] don't want to define it twice [23:18:15] right [23:18:54] <^demon> Shorter gerrit.pp makes me happy. [23:19:14] PROBLEM - Puppet freshness on db60 is CRITICAL: Puppet has not run in the last 10 hours [23:19:21] oh in that case, i better put a bunch of comments in to fix that ;) [23:19:30] robla: so one thing still left is to fix the use of AdminSettings in PrivateSettings [23:19:41] <^demon> We still use AdminSettings in production? [23:19:45] <^demon> I thought I killed that ages ago. [23:20:17] this would make cli scripts not use the admin db user all the time (which is not usable by apaches, at least in eqiad) [23:20:37] there really is no reason to run as wikiadmin all the time anyway [23:20:57] How's pmtpa -> equiad going, btw? Going to go to press soon. [23:21:00] RoanKattouw: heh, I think asher disappeared :) [23:21:07] More stuff to do tomorrow, day after I assume? [23:21:19] maybe I should just hack runJobs to use wikiadmin and the go through with it [23:22:24] Jarry1250: for all press statements the pr folks are the people to talk to [23:22:29] LeslieCarr: looks like you did duplicate the default site in openstack.pp [23:22:33] oh [23:22:36] thanks for the fyi [23:22:38] Why is "(bug 44141)" not being linked in gerrit? [23:22:40] https://gerrit.wikimedia.org/r/#/c/44825/ [23:22:44] it includes webserver::apache2 which has the default site [23:22:59] LeslieCarr: Well we're only talking the Signpost here [23:23:15] did our comment parsers not survive the migration (if gerrit was migrated today as well) [23:23:17] "press" is, I suppose, an exagerration. [23:23:28] *exaggeration [23:23:35] <^demon> Krinkle: We haven't upgraded yet. [23:23:38] <^demon> Current regex is "\\b([bB][uU][gG]\\:\\s+#?)(\\d+)\\b" [23:23:45] we have cancelled the other maintenance windows this week and expect to be running smooth (afaik) [23:23:48] <^demon> Only difference was me adding \\: the other day. [23:23:55] of course, as with any day, unexpected things happen ;) [23:24:02] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [23:24:03] ^demon: ? Why? that would break everything [23:24:03] Oh, great to hear. [23:24:10] who puts colons in between? I've never seent aht [23:24:19] Krinkle: gerrit was already at eqiad [23:24:20] New review: Tim Starling; "Fine except for the duplicate default site in openstack.pp" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45251 [23:24:27] <^demon> Krinkle: To support linking in the new Bug: 1234 format you can use in footers. [23:24:32] ^demon is ahead of the curve [23:24:50] I did have one question: Under the new scheme, will any traffic ordinarily be routed through pmtpa? [23:24:50] what "new format"? 
And why would it have to break existing commit messages with "bug 123" inline
[23:24:51] New patchset: Lcarr; "modifying apache containing classes to ensure default site is absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251
[23:24:59] Bug: 123 implies it is on a stand alone line
[23:25:04] in sentence flow it looks off
[23:25:08] <^demon> Yep, and that's how we should be doing it from now on.
[23:25:11] <^demon> Lemme find the example.
[23:25:17] I don't need an example
[23:25:23] I understand completely
[23:25:29] but it should've been added as a new thing, not replace
[23:25:44] <^demon> https://gerrit.wikimedia.org/r/#/c/44434/
[23:25:50] It is perfectly valid to just "mention" a bug in a sentence
[23:25:56] <^demon> Yeah, I should've made the :? as optional.
[23:25:59] <^demon> That was a whoops.
[23:26:02] k
[23:26:08] PROBLEM - Puppet freshness on db55 is CRITICAL: Puppet has not run in the last 10 hours
[23:27:14] New patchset: Demon; "The colon in bug references should be optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45264
[23:27:18] <^demon> Ryan_Lane: ^
[23:28:02] RECOVERY - NTP on srv278 is OK: NTP OK: Offset -0.002769708633 secs
[23:28:04] <^demon> Krinkle: This is why I want it the new way though :) https://gerrit.wikimedia.org/r/#/q/bug:43523,n,z
[23:28:31] <^demon> I also added matching for RT.
[23:28:34] ^demon: hm.. interesting. no "message:" prefix
[23:28:40] ^demon: Everytime you restart gerrit kittens get very worried.
[23:28:48] it is special cased for one-liner key/value paris?
[23:28:52] I guess nobody has looked at fatal.log recently, it shows tmh100* being broken
[23:29:08] [22-Jan-2013 23:23:17] Fatal error: wikiversions.cdb has no version entry for `DB connection error: Access denied for user 'wikiadmin'@'10.64.16.146' (using password: YES) (10.0.6.43)`.
[23:29:10] <^demon> Krinkle: It has to be in the footer. Gerrit calls them "tracking ids" internally.
[23:29:18] funny kind of version
[23:30:11] huh, I though that was ephemeral when I checked it
[23:30:29] * AaronSchulz remembers joking about that to roan
[23:30:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45264
[23:30:42] ^demon: ^^
[23:30:59] force running puppet
[23:31:00] <^demon> thanks.
[23:31:14] <^demon> Damianz: People should panic less :)
[23:31:14] PROBLEM - Puppet freshness on db65 is CRITICAL: Puppet has not run in the last 10 hours
[23:31:14] PROBLEM - Puppet freshness on db57 is CRITICAL: Puppet has not run in the last 10 hours
[23:32:52] ^demon: Topic for #wikimedia-panic: Panic room | Abandon all hope, ye who join here
[23:33:37] I guess it comes from nextJobDB.php
[23:34:00] yep
[23:36:06] <^demon> Krinkle: Bug linking fixed. Thanks for noticing.
[23:36:11] yw
[23:36:28] TimStarling: would it work as wikiuser?
[23:37:19] the grant is broken
[23:37:22] best to just fix it
[23:37:25] | 10.64.0.0/255.255.252.0 | wikiadmin |
[23:37:36] mysql doesn't do CIDR last time I checked
[23:38:37] Sorry, just to check, under the new scheme, will any traffic ordinarily be routed through pmtpa?
[23:41:03] Jarry1250: I believe a small amount of traffic will still go to pmtpa, because some upload-related services are still there
[23:41:09] I guess that https://bugzilla.wikimedia.org/show_bug.cgi?id=44259 is INVALID as test.wikipedia.org will be dead now (no NFS anymore)?
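For reference, the bug-linking regression discussed above can be checked against sample commit-message text. The "current" pattern below is the one ^demon quoted (with the config-file double escaping removed); the "fixed" pattern is only a reconstruction of the optional-colon change in Gerrit change 45264, not the exact text that was merged, and Python's re module stands in for Gerrit's own commentlink matching.

```python
# Sketch of the commentlink regression, purely for illustration; Gerrit
# applies its pattern in the commentlink config, not in Python.
import re

# Pattern as quoted by ^demon, minus the config-file double escaping:
current = re.compile(r"\b([bB][uU][gG]\:\s+#?)(\d+)\b")
# Assumed shape of the fix ("The colon in bug references should be optional"):
fixed = re.compile(r"\b([bB][uU][gG]\:?\s+#?)(\d+)\b")

samples = [
    "Bug: 44141",                      # new footer style
    "(bug 44141) inline mention",      # old inline style
]
for text in samples:
    print(f"{text!r:35} current={bool(current.search(text))} "
          f"fixed={bool(fixed.search(text))}")
# The mandatory colon is why inline "(bug 44141)" mentions stopped linking.
```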
Somebody please correct me if I'm wrong
[23:41:37] RoanKattouw: Okay, thanks
[23:41:48] <^demon> andre__: I think testwiki should be made into a normal cluster wiki like test2wiki.
[23:41:56] <^demon> I don't see any reason to totally delete it.
[23:42:26] Jarry1250: Mind you I'm not 100% sure that's accurate
[23:42:45] ^demon, oh sure, wasn't after that, more like explaining properly in that bug report the reasons :)
[23:42:58] Actually, this looks pretty conclusive: http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[23:43:06] Several hundreds of megabits/s going to Swift pmtpa
[23:43:31] RoanKattouw: scalars are still there
[23:43:55] <^demon> andre__: I don't know when it'll get unlocked. It won't be enabled again as-is. Probably needs apache config changes first.
[23:43:58] they will migrate later
[23:44:02] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155
[23:44:06] I see scalers in eqiad too, or are those not being used?
[23:44:08] ^demon, okay, thanks!
[23:44:09] <^demon> But as far as keeping test.wp.o, I think it's nice having 2 testwikis.
[23:44:16] well, they are there, not used though
[23:44:29] (Incidentally, why is ganglia password protected these days?)
[23:44:37] Jarry1250, XSS
[23:44:42] Jarry1250: Because we found XSS vulnerabilities in Ganglia's web interface
[23:44:58] | GRANT ALL PRIVILEGES ON `%a%`.* TO 'wikiadmin'@'10.0.%' |
[23:45:01] Oh, makes sense.
[23:45:03] * RoanKattouw makes his disappointed why-do-authors-of-system-level-tools-always-screw-up-web-security face
[23:45:07] * Jamesofur notes that if we're going to cut back to only 1 test wiki again we should probably have that 1 be test.wp.o rather then test2 ;)
[23:45:15] right, so wikiadmin gets access to any database with "a" in its name
[23:45:23] hahahaha
[23:46:02] TimStarling: that's nearly all of our DBs :)
[23:46:06] <^demon> Like enwiki...oh wait
[23:46:14] should have been 'w' ;)
[23:46:42] New patchset: Ryan Lane; "Combine sysadmin and netadmin into projectadmin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45267
[23:47:07] AaronSchulz: Will the imagescaler requests be switched over at some point?
[23:47:12] I assume that's the plan.
[23:47:28] [15:43] AaronSchulz they will migrate later
[23:47:43] it is later :P
[23:47:45] some varnish vcl stuff needed first
[23:49:35] TimStarling: what should nextJobDB do with errors? Call wfLogDBError() and return an empty string (same as "no dbs with jobs")?
[23:50:11] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 186 seconds
[23:50:43] Exception.php should be modified to send its errors to STDERR
[23:50:51] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 199 seconds
[23:50:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds
[23:50:59] stderr isn't captured by backticks
[23:51:19] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 204 seconds
[23:51:44] actually it seems to be trying to do that already
[23:51:45] Mmm, guillom says the WMF has "885 servers" -- does that include just pmtpa and eqiad, or places like Amsterdam too, do you reckon?
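The wikiversions.cdb fatal quoted earlier is exactly the capture problem Tim and Aaron are discussing: the job loop captures the script's stdout as "the next database with jobs", so an error message printed to stdout gets parsed as a database name. Here is a minimal illustration in Python; the real code path is PHP's nextJobDB.php plus shell backticks, so treat this as an analogy rather than the actual implementation.

```python
# Analogy for the nextJobDB.php problem, not the actual PHP code. The parent
# captures only the child's stdout, exactly as shell backticks do; anything
# the child writes to stdout is taken to be a database name.
import subprocess
import sys

ERROR = "DB connection error: Access denied for user 'wikiadmin'"
child_stdout = f'import sys; sys.stdout.write("{ERROR}")'
child_stderr = f'import sys; sys.stderr.write("{ERROR}")'

for label, code in (("error on stdout", child_stdout),
                    ("error on stderr", child_stderr)):
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True)
    next_db = result.stdout.strip()  # what backticks would have captured
    print(f"{label}: captured {next_db!r} (stderr: {result.stderr.strip()!r})")
# With the error on stderr, the captured value is empty, which the caller can
# safely treat the same as "no databases with jobs".
```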
[23:51:48] not sure why it wouldn't work
[23:54:40] !log wikiadmin permissions for eqiad were broken, attempting to use CIDR notation, fixing them on all masters
[23:54:53] Logged the message, Master
[23:58:36] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155
[23:58:51] !log on db1006: fixing permissions for root@fenari
[23:59:02] Logged the message, Master
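On the broken wikiadmin grant: the account host quoted earlier is an address/netmask pair ('10.64.0.0/255.255.252.0'), and the fatal named the denied client 10.64.16.146. A quick check with Python's ipaddress module, a sketch rather than anything that was run on the cluster, shows the intended /22 range does contain that client; the denial is therefore consistent with Tim's note that MySQL of that era does not accept CIDR-style host specifications (as far as I know it only honoured whole-octet netmasks), rather than with the range itself being wrong, which is why the grants were rewritten on all masters.

```python
# Sanity check of the failing grant, purely illustrative. The host pattern
# and client IP are the ones quoted in the log above.
import ipaddress

grant_host = ipaddress.ip_network("10.64.0.0/255.255.252.0")  # from the user table
denied_client = ipaddress.ip_address("10.64.16.146")          # from the fatal error

print(grant_host)                   # 10.64.0.0/22
print(grant_host.netmask)           # 255.255.252.0
print(denied_client in grant_host)  # True: the intended range did cover the client
# So the range was fine; the failure is on the MySQL side, whose account-host
# matching did not understand this netmask ("mysql doesn't do CIDR"), hence
# the grants being rewritten.
```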