[00:17:03] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures
[00:34:12] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:58:31] any ops around?
[01:58:47] please kill 28157 on labsdb1004.eqiad.wmnet using select pg_cancel_backend(28157);
[02:02:08] yurik: what happened?
[02:02:26] godog, not much, just my own create index that i cannot cancel
[02:02:41] and i cannot cancel because i am not a root on the box
[02:02:59] and postgres requires root unless you cancel from within the same session it seems
[02:03:59] ah curious, yeah I was reading pg_cancel_backend docs and it is supposed to work from the same user
[02:04:04] well the same role
[02:05:17] yurik: anyways, done
[02:05:48] godog, thx. If you want, google: ERROR: must be superuser to signal other server processes
[02:06:30] godog, could you send me the result of select * from pg_stat_activity;
[02:06:53] not sure what's going on with the server
[02:08:32] is something up you think?
[02:15:29] godog, max ran some index creation a while ago, because it turned out that that server was not set up correctly. I don't know the status, and ganglia is not giving full picture
[02:15:30] http://ganglia.wikimedia.org/latest/?r=hour&cs=6%2F7%2F2015+19%3A35&ce=6%2F7%2F2015+19%3A37&m=cpu_report&c=Miscellaneous+eqiad&h=labsdb1004.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS
[02:21:14] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 07m 07s)
[02:21:20] Logged the message, Master
[02:23:24] yurik: https://phabricator.wikimedia.org/P739
[02:23:50] yurik: btw that's what I meant by 'what happened' earlier, if you thought sth was not working as it should
[02:25:38] godog, thanks! i think postgres is importing gis update... i wonder if it will barf if it tries to create an index and insert new data at the same time
[02:26:15] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-08 02:25:12+00:00
[02:26:20] Logged the message, Master
[02:29:33] godog, i think its even worse than i thought ... not for sure though - some of the update queries were started on may 27 - two weeks ago
[02:29:46] akosiaris, ^
[02:40:07] Blocked-on-Operations, operations, Maps, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1344638 (Yurik) Update: Thanks to @fgiunchedi, I got the result of `select * from pg_stat_activity;` --...
[02:40:31] yurik: yup the create index is probably waiting on that COPY from the 27th
[02:41:10] godog, any clue of what it might be doing? akosiaris said everything is ready for our use
[02:47:28] yurik: no idea tbh, not a postgres expert
[03:58:51] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail
[04:16:02] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:25:12] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 8 05:24:08 UTC 2015 (duration 24m 7s)
[05:25:15] Logged the message, Master
[05:58:55] Nemo_bis: sorry, I still owe you an update for that Lime Survey patch. I got side-tracked by , which is an amazingly depressing read.
[05:59:42] ori: so depressing that I refrained from reading more than 5 or so messages; is there something relevant to us in there?
[06:00:23] (If there is, sorry for not reading all if myself.)
[06:02:01] I'm not sure how to handle packaging. The standard requirement for deploying third-party software is to use Debian packages, and if there are no Debian packages, to make them ourselves. Ops are typically only OK with deviating from that when the application has some properties that make it impossible or impractical to package for Debian.
[06:02:58] Lime Survey is a really strange case: there appear to be packages for it in Debian's package staging environment that meet all the stringent requirements of the Debian project and which have been extensively reviewed and refined
[06:04:06] but for some bizarre reason they are stuck in Debian purgatory
[06:04:56] so I'm not sure how we're supposed to handle that. Ask one of our opsen who is also a Debian project member to help the packages land, or just import the packages to apt.wikimedia.org, or just not use the packages at all.
[06:05:54] when _joe_ arrives maybe he can advise us
[06:08:32] PROBLEM - puppet last run on mw2108 is CRITICAL Puppet has 1 failures
[06:15:47] I also thought about trying to get the package unstuck, but it's been so long since the last comments there
[06:25:22] RECOVERY - puppet last run on mw2108 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:13] PROBLEM - puppet last run on db1023 is CRITICAL Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 2 failures
[06:33:42] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures
[06:33:42] PROBLEM - puppet last run on analytics1010 is CRITICAL Puppet has 1 failures
[06:33:43] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:34:22] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 2 failures
[06:34:23] PROBLEM - puppet last run on mw1172 is CRITICAL Puppet has 1 failures
[06:34:32] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures
[06:34:32] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:34:32] PROBLEM - puppet last run on mw1213 is CRITICAL Puppet has 1 failures
[06:34:42] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:35:22] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:35:22] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:35:31] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:45:45] (CR) Alexandros Kosiaris: [C: 1] ssh: Disable agent forwarding for production [puppet] - https://gerrit.wikimedia.org/r/199936 (owner: Chad)
[06:45:52] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:12] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:22] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on db1023 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:12] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:47:12] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:12] RECOVERY - puppet last run on analytics1010 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:47:12] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:47:22] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:01] RECOVERY - puppet last run on mw1172 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:48:02] RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:48:02] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:38] (PS2) Ori.livneh: [WIP] Stub LimeSurvey configuration [puppet] - https://gerrit.wikimedia.org/r/213579 (https://phabricator.wikimedia.org/T94807) (owner: Nemo bis)
[07:11:18] (PS5) Giuseppe Lavagetto: etcd: setup servers/ganglia stuff [puppet] - https://gerrit.wikimedia.org/r/216099
[07:13:40] (PS6) Giuseppe Lavagetto: etcd: setup servers/ganglia stuff [puppet] - https://gerrit.wikimedia.org/r/216099
[07:14:31] (CR) Giuseppe Lavagetto: [C: 2] etcd: setup servers/ganglia stuff [puppet] - https://gerrit.wikimedia.org/r/216099 (owner: Giuseppe Lavagetto)
[07:19:40] (CR) Muehlenhoff: [C: 1] ssh: Disable agent forwarding for production [puppet] - https://gerrit.wikimedia.org/r/199936 (owner: Chad)
[07:20:32] (PS1) ArielGlenn: xml dumps: remove logs batchsize from config, obsolete [puppet] - https://gerrit.wikimedia.org/r/216617
[07:21:32] (CR) ArielGlenn: [C: 2] xml dumps: remove logs batchsize from config, obsolete [puppet] - https://gerrit.wikimedia.org/r/216617 (owner: ArielGlenn)
[07:24:46] (CR) Joal: [C: 1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/216328 (owner: Ottomata)
[07:38:43] (PS1) Giuseppe Lavagetto: etcd: fix mac address, add partman recipe [puppet] - https://gerrit.wikimedia.org/r/216618
[07:40:01] (PS2) Giuseppe Lavagetto: etcd: fix mac address, add partman recipe [puppet] - https://gerrit.wikimedia.org/r/216618
[07:50:14] (PS3) Giuseppe Lavagetto: etcd: fix mac address, add partman recipe [puppet] - https://gerrit.wikimedia.org/r/216618
[07:50:44] good morning
[07:50:55] morning hashar
[07:51:42] morning
[07:52:10] (CR) Giuseppe Lavagetto: [C: 2] etcd: fix mac address, add partman recipe [puppet] - https://gerrit.wikimedia.org/r/216618 (owner: Giuseppe Lavagetto)
[08:00:38] (PS1) Giuseppe Lavagetto: partman: adding virtual.cfg to the etcd VMs [puppet] - https://gerrit.wikimedia.org/r/216619
[08:00:48] <_joe_> akosiaris: ^^ correct?
[08:01:40] (CR) Alexandros Kosiaris: [C: 1] partman: adding virtual.cfg to the etcd VMs [puppet] - https://gerrit.wikimedia.org/r/216619 (owner: Giuseppe Lavagetto)
[08:01:50] (CR) Giuseppe Lavagetto: [C: 2] partman: adding virtual.cfg to the etcd VMs [puppet] - https://gerrit.wikimedia.org/r/216619 (owner: Giuseppe Lavagetto)
[08:06:38] (CR) Giuseppe Lavagetto: [C: 1] "Given the upgrade plan. this is correct." [puppet] - https://gerrit.wikimedia.org/r/216429 (owner: Jcrespo)
[08:07:09] thank you, _joe_!
[08:09:24] (CR) Alexandros Kosiaris: [C: 1] Upgrade es1007 from MariaDB 5.5 to 10 [puppet] - https://gerrit.wikimedia.org/r/216429 (owner: Jcrespo)
[08:10:50] (CR) Ori.livneh: [C: 1] Upgrade es1007 from MariaDB 5.5 to 10 [puppet] - https://gerrit.wikimedia.org/r/216429 (owner: Jcrespo)
[08:14:25] (PS2) Jcrespo: Upgrade es1007 from MariaDB 5.5 to 10 [puppet] - https://gerrit.wikimedia.org/r/216429
[08:14:59] (CR) Jcrespo: [C: 2] Upgrade es1007 from MariaDB 5.5 to 10 [puppet] - https://gerrit.wikimedia.org/r/216429 (owner: Jcrespo)
[08:23:42] operations: The debian installer for jessie is broken - https://phabricator.wikimedia.org/T101684#1344917 (Joe) NEW
[08:24:32] PROBLEM - mysqld processes on es1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[08:31:11] RECOVERY - mysqld processes on es1007 is OK: PROCS OK: 1 process with command name mysqld
[08:31:36] (PS1) Ori.livneh: Add fluorine's IPv6 address to dataset1001's 'hosts allow' [puppet] - https://gerrit.wikimedia.org/r/216620
[08:32:36] (CR) Ori.livneh: [C: 2] "Going with this for now, but continuing to look for a better solution." [puppet] - https://gerrit.wikimedia.org/r/216620 (owner: Ori.livneh)
[08:34:02] PROBLEM - MySQL Slave Running on es1007 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: Unable to load replication GTID slave state from mysql.gtid_slave_pos
[08:34:58] (PS1) KartikMistry: CX: Add wikis for deployment on 20150609 [puppet] - https://gerrit.wikimedia.org/r/216621 (https://phabricator.wikimedia.org/T100764)
[08:35:57] (PS1) KartikMistry: CX: Add wikis for deployment on 20150609 [mediawiki-config] - https://gerrit.wikimedia.org/r/216622 (https://phabricator.wikimedia.org/T100764)
[08:47:45] ori: ayill awake ?
[08:47:49] *stil
[08:48:00] <_joe_> matanya: don't enable him :P
[08:48:07] :D
[08:48:30] I just need an update for https://phabricator.wikimedia.org/T97251#1336024 after some of my changes were merged
[08:48:40] a new paste, i guess
[08:51:15] operations, Patch-For-Review: Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251#1345054 (ori) Update: {P741}
[08:52:05] <_joe_> I could've done it :)
[08:52:33] I'm really off
[08:52:49] thanks! good night
[08:53:19] <_joe_> good night!
[08:53:52] ah, good progress :) neat, on the right track
[08:55:08] 134 --> 23
[08:55:40] <_joe_> matanya: \o/
[08:56:00] :)
[09:01:59] (PS1) Matanya: ganglia: qualify var [puppet] - https://gerrit.wikimedia.org/r/216624
[09:04:12] operations, Patch-For-Review: Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251#1345111 (Matanya) I think i submitted patches for all of those which are not analytics-related. @ottomata can you please fix your, or catch me on IRC to work it out together?
[09:08:07] Puppet, Beta-Cluster: Puppet failures on deployment-mx: can't find puppet://private/dkim/wikimedia.org-wiki-mail.key - https://phabricator.wikimedia.org/T87848#1345113 (hashar)
[09:13:21] Puppet, Beta-Cluster: Puppet failures on deployment-mx: can't find puppet://private/dkim/wikimedia.org-wiki-mail.key - https://phabricator.wikimedia.org/T87848#1345116 (hashar) Puppet can't find puppet://private/dkim/wikimedia.org-wiki-mail.key . The instance has the puppet class `role::mail::mx` and I g...
[09:27:41] operations, Labs: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#1345141 (yuvipanda) NEW
[09:28:41] _joe_: was there a change to the job queue may 27th that doubled the job queue ?
[09:28:58] <_joe_> matanya: not that I know of
[09:29:07] <_joe_> but probably, yes
[09:29:32] maybe i am not understanding the graph correctly
[09:30:22] _joe_: https://graphite.wikimedia.org/dashboard/#temporary-2
[09:31:24] <_joe_> matanya: uhm pretty strange, I'll take a look later
[09:31:31] thanks
[09:43:31] Blocked-on-Operations, Labs, Maps, Scrum-of-Scrums: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1345174 (yuvipanda) This also requires labsdb1005 to be upgraded since that's the postgres slave for this instance, but it's also the master f...
[09:44:11] operations: The debian installer for jessie is broken - https://phabricator.wikimedia.org/T101684#1345176 (MoritzMuehlenhoff) The Release file on the host is truncated for some reason: /target/var/lib/apt/lists/debootstrap.invalid_dists_jessie_Release has 126196 bytes, which the regular jessie Release file ha...
[09:50:11] operations: The debian installer for jessie is broken - https://phabricator.wikimedia.org/T101684#1345177 (MoritzMuehlenhoff) The 126196 version is the Release file of the initial 8.0 jessie release
[09:54:32] _joe_: i collected some weird things in that dashboard. enjoy looking, bye
[09:55:07] Blocked-on-Operations, Labs, Maps, Scrum-of-Scrums: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1345183 (akosiaris) Trusty, which is the easy upgrade path, has postgis 2.1.2 and postgres 9.3. @maxsem, @yurik, are those sufficient ? Otherw...
[10:00:54] (CR) Paladox: "@Dzahn or @Brian Wolff how can the link work if it is for example like this Bug:T5432 Please see https://git.wikimedia.org/commit/phabrica" [puppet] - https://gerrit.wikimedia.org/r/215247 (owner: Paladox)
[10:04:59] !log ganglia gmetad problems
[10:05:04] Logged the message, Master
[10:06:45] PROBLEM - HTTP on uranium is CRITICAL: Connection refused
[10:12:30] (CR) Manybubbles: elasticsearch: allow control of dynamic scripting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/216325 (owner: BryanDavis)
[10:27:06] RECOVERY - HTTP on uranium is OK: HTTP OK: HTTP/1.1 302 Found - 426 bytes in 0.007 second response time
[10:27:28] !log disabled puppet on uranium, investigating ganglia problems
[10:27:33] Logged the message, Master
[10:35:47] (Abandoned) KartikMistry: Fix tabs and spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/215572 (owner: KartikMistry)
[10:36:19] (Abandoned) KartikMistry: WIP: Do not use registry and fallback to config.default.js [puppet] - https://gerrit.wikimedia.org/r/191263 (https://phabricator.wikimedia.org/T89803) (owner: KartikMistry)
[10:36:42] (Abandoned) KartikMistry: config: Remove wgContentTranslationTranslateInTarget for existing wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/197854 (owner: KartikMistry)
[10:45:59] (CR) Jcrespo: "I'm ok with it, but Springle should check it too as it directly affects our production masters." [puppet/mariadb] - https://gerrit.wikimedia.org/r/195779 (https://phabricator.wikimedia.org/T91545) (owner: Thcipriani)
[10:49:01] (PS1) Giuseppe Lavagetto: ganglia: allow individual clusters to be migrated to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/216629
[10:49:07] <_joe_> akosiaris: ^^
[10:49:15] <_joe_> akosiaris: substantially, the ganglia
[10:50:02] <_joe_> err ganglia_aggregator_config function could not distinguish clusters based on their own config, only on the site where they were in.
[10:50:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[10:50:45] (Abandoned) Hashar: Disable Extension:Oversight [mediawiki-config] - https://gerrit.wikimedia.org/r/169612 (https://bugzilla.wikimedia.org/60373) (owner: Reedy)
[10:50:52] (CR) Alexandros Kosiaris: [C: 1] ganglia: allow individual clusters to be migrated to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/216629 (owner: Giuseppe Lavagetto)
[10:51:12] <_joe_> so now my idea is we use the "new" config whenever the aggregators are not defined
[10:51:17] <_joe_> basically
[10:51:40] <_joe_> let's see if this works
[10:52:15] (PS2) Jcrespo: Repool es1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/216430
[10:52:22] (CR) Giuseppe Lavagetto: [C: 2] ganglia: allow individual clusters to be migrated to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/216629 (owner: Giuseppe Lavagetto)
[10:52:38] (CR) Jcrespo: [C: 2] Repool es1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/216430 (owner: Jcrespo)
[10:55:40] !log jynus Synchronized wmf-config/db-eqiad.php: repool es1007 (duration: 01m 08s)
[10:55:44] Logged the message, Master
[10:57:45] (CR) Giuseppe Lavagetto: "On second thoughts, the way you can scrutinize puppet changes now is so horrible that it's worth having the config not pretty-formatted" [puppet] - https://gerrit.wikimedia.org/r/200603 (owner: Giuseppe Lavagetto)
[11:01:31] (PS3) Giuseppe Lavagetto: ganglia::web::view: order resources in template [puppet] - https://gerrit.wikimedia.org/r/200603
[11:02:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[11:07:05] operations, Labs, network: permit syslog from labs to lithium - https://phabricator.wikimedia.org/T90695#1345250 (akosiaris) Open>Resolved a:akosiaris Rules have been updated on cr{1,2} and now the packets flow through. Resolving
[11:07:29] (CR) Yuvipanda: [C: 1] "So who wants to push this through?" [puppet] - https://gerrit.wikimedia.org/r/199936 (owner: Chad)
[11:08:13] operations, Labs, network: permit syslog from labs subnet to lithium - https://phabricator.wikimedia.org/T90695#1345253 (yuvipanda)
[11:09:50] operations, Labs, network: permit syslog from labs hosts subnets to lithium - https://phabricator.wikimedia.org/T90695#1345256 (akosiaris)
[11:09:57] this is more accurate ^
[11:10:49] (PS2) Hashar: Add another source to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/216316 (https://phabricator.wikimedia.org/T101513) (owner: Odder)
[11:12:42] (CR) Hashar: [C: 1] Add another source to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/216316 (https://phabricator.wikimedia.org/T101513) (owner: Odder)
[11:16:31] (CR) Dan-nl: "thanks hashar, who can +2 it? is it possible to have people with that permission added to these types of patches automatically?" [mediawiki-config] - https://gerrit.wikimedia.org/r/216316 (https://phabricator.wikimedia.org/T101513) (owner: Odder)
[11:19:54] (PS1) Faidon Liambotis: exim: untangle exim4.conf between roles & simplify [puppet] - https://gerrit.wikimedia.org/r/216635
[11:19:56] (PS1) Faidon Liambotis: exim: split phab_relay into a separate config erb [puppet] - https://gerrit.wikimedia.org/r/216636
[11:19:58] (PS1) Faidon Liambotis: exim: split rt_relay into a separate config erb [puppet] - https://gerrit.wikimedia.org/r/216637
[11:20:00] (PS1) Faidon Liambotis: exim: split OTRS config into a separate config erb [puppet] - https://gerrit.wikimedia.org/r/216638
[11:20:02] (PS1) Faidon Liambotis: exim: split mailman config into a separate config erb [puppet] - https://gerrit.wikimedia.org/r/216639
[11:20:04] (PS1) Faidon Liambotis: exim: split main MX config into a separate config erb [puppet] - https://gerrit.wikimedia.org/r/216640
[11:20:06] (PS1) Faidon Liambotis: exim: kill one-size-fits-all SMTP_IMAP_MM template [puppet] - https://gerrit.wikimedia.org/r/216641
[11:20:08] (PS1) Faidon Liambotis: mail: remove secondary MX role from sodium (2nd take) [puppet] - https://gerrit.wikimedia.org/r/216642
[11:20:10] (PS1) Faidon Liambotis: exim: remove $smart_route_list [puppet] - https://gerrit.wikimedia.org/r/216643
[11:20:12] (PS1) Faidon Liambotis: exim: inline @local_domains [puppet] - https://gerrit.wikimedia.org/r/216644
[11:20:14] (PS1) Faidon Liambotis: exim: kill unused exim::roled parameters [puppet] - https://gerrit.wikimedia.org/r/216645
[11:20:16] (PS1) Faidon Liambotis: exim: kill all exim::* classes except for ::roled [puppet] - https://gerrit.wikimedia.org/r/216646
[11:20:18] (PS1) Faidon Liambotis: exim: remove defer_domains for single-domain MXes [puppet] - https://gerrit.wikimedia.org/r/216647
[11:20:20] (PS1) Faidon Liambotis: exim: use exim4 directly from Phab/RT [puppet] - https://gerrit.wikimedia.org/r/216648
[11:20:22] (PS1) Faidon Liambotis: exim: use exim4 directly from role::mail::lists [puppet] - https://gerrit.wikimedia.org/r/216649
[11:20:24] (PS1) Faidon Liambotis: exim: use exim4 directly from role::otrs [puppet] - https://gerrit.wikimedia.org/r/216650
[11:20:26] (PS1) Faidon Liambotis: exim: fold exim::roled into role::mail::mx [puppet] - https://gerrit.wikimedia.org/r/216651
[11:20:28] (PS1) Faidon Liambotis: mail: rename role::mail::lists to role::lists [puppet] - https://gerrit.wikimedia.org/r/216652
[11:20:31] operations: The debian installer for jessie is broken - https://phabricator.wikimedia.org/T101684#1345279 (MoritzMuehlenhoff) I did some digging, but there's no obvious bug in the d-i logic to download the Release file. This might simply be a case of Squid serving an outdated cached copy of the Release file.
[11:22:28] (CR) jenkins-bot: [V: -1] mail: rename role::mail::lists to role::lists [puppet] - https://gerrit.wikimedia.org/r/216652 (owner: Faidon Liambotis)
[11:22:33] oh shit :)
[11:24:24] (PS15) Yuvipanda: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - https://gerrit.wikimedia.org/r/202924 (owner: Andrew Bogott)
[11:24:29] (PS2) Faidon Liambotis: mail: rename role::mail::lists to role::lists [puppet] - https://gerrit.wikimedia.org/r/216652
[11:24:31] (PS2) Faidon Liambotis: exim: fold exim::roled into role::mail::mx [puppet] - https://gerrit.wikimedia.org/r/216651
[11:25:28] (PS1) Alexandros Kosiaris: Avoid caching autoinstall files [puppet] - https://gerrit.wikimedia.org/r/216654 (https://phabricator.wikimedia.org/T101419)
[11:25:56] chasemp: https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+topic:kill-exim-roled,n,z request for review :)
[11:26:34] !log updated mc2* to 2:2.8.17-1+deb8u1
[11:26:39] Logged the message, Master
[11:31:41] operations, vm-requests, discovery-system: eqiad: 3 VM request for ETCD - https://phabricator.wikimedia.org/T101506#1345300 (Joe) @faidon ok I didn't got that bit (upgrading to jessie). Creating a blocking task is perfectly fine by me. I'll do it.
[11:39:10] (CR) Alexandros Kosiaris: [C: 1] "I am not overly fond of collectors, they can introduce very nasty dependencies, but this LGTM" [puppet] - https://gerrit.wikimedia.org/r/215352 (owner: Faidon Liambotis)
[11:42:15] <_joe_> !log purging squid cache on carbon
[11:42:19] Logged the message, Master
[11:45:23] (PS1) Alexandros Kosiaris: Remove redundant logrotate installation [debs/etherpad-lite] - https://gerrit.wikimedia.org/r/216656
[11:46:22] (CR) Giuseppe Lavagetto: [C: 1] Avoid caching autoinstall files [puppet] - https://gerrit.wikimedia.org/r/216654 (https://phabricator.wikimedia.org/T101419) (owner: Alexandros Kosiaris)
[11:57:15] (CR) Alexandros Kosiaris: [C: 2 V: 2] Remove redundant logrotate installation [debs/etherpad-lite] - https://gerrit.wikimedia.org/r/216656 (owner: Alexandros Kosiaris)
[12:01:44] operations: The debian installer for jessie is broken - https://phabricator.wikimedia.org/T101684#1345326 (Joe) Purging the squid cache solved the problem, thanks!
[12:01:51] operations: The debian installer for jessie is broken - https://phabricator.wikimedia.org/T101684#1345327 (Joe) Open>Resolved
[12:12:26] (PS1) Giuseppe Lavagetto: ocg: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/216658
[12:13:23] (CR) Giuseppe Lavagetto: [C: 2] ocg: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/216658 (owner: Giuseppe Lavagetto)
[12:18:46] <_joe_> \o/
[12:18:55] <_joe_> ocg migrated to ganglia_new
[12:21:54] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /srv/dib/tmp/image.fsqDjv7T/mnt/tmp/ccache is not accessible: Permission denied
[12:23:35] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[12:26:25] PROBLEM - puppet last run on labnodepool1001 is CRITICAL Puppet has 1 failures
[12:29:02] (PS2) Alexandros Kosiaris: Avoid caching autoinstall files [puppet] - https://gerrit.wikimedia.org/r/216654 (https://phabricator.wikimedia.org/T101419)
[12:31:16] (CR) Alexandros Kosiaris: [C: 1] "OK, lets merge. This will be helpful during the ruby 1.9+ puppet migration anyways" [puppet] - https://gerrit.wikimedia.org/r/200603 (owner: Giuseppe Lavagetto)
[12:31:56] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /srv/dib/tmp/image.6Ty04wte/mnt is not accessible: Permission denied
[12:32:38] (PS1) Hashar: Depends on parted [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/216660
[12:32:47] (CR) Alexandros Kosiaris: "it probably is worth sending an email to engineering about this before pushing it. Just as a heads up so that people don't waste that much" [puppet] - https://gerrit.wikimedia.org/r/199936 (owner: Chad)
[12:33:35] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[12:41:39] operations, Labs, Labs-Infrastructure, Labs-Sprint-100: Rsync live labstore filesystem to local eqiad copy - https://phabricator.wikimedia.org/T101011#1345407 (coren)
[12:41:42] operations, Labs, Labs-Infrastructure, Labs-Sprint-100: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1345408 (coren)
[12:41:45] operations, Labs, Labs-Infrastructure, Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1345405 (coren) Open>Resolved The copy is complete, and is mounted at the destination. A caveat worth nothing: since the sou...
[12:43:24] RECOVERY - puppet last run on labnodepool1001 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[12:44:32] operations, Labs, Labs-Infrastructure, Labs-Sprint-100, Labs-Sprint-101: Rsync live labstore filesystem to local eqiad copy - https://phabricator.wikimedia.org/T101011#1345425 (coren)
[12:44:54] operations, Labs, Labs-Infrastructure, Labs-Sprint-100, Labs-Sprint-101: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1345427 (coren)
[12:52:04] (PS1) Yuvipanda: openstack: Add a check to see if Tool Labs instances are spread enough [puppet] - https://gerrit.wikimedia.org/r/216661 (https://phabricator.wikimedia.org/T101635)
[12:52:15] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.CLmyjc5s/mnt is not accessible: Permission denied
[12:52:43] (CR) jenkins-bot: [V: -1] openstack: Add a check to see if Tool Labs instances are spread enough [puppet] - https://gerrit.wikimedia.org/r/216661 (https://phabricator.wikimedia.org/T101635) (owner: Yuvipanda)
[12:53:49] (PS2) Yuvipanda: openstack: Add a check to see if Tool Labs instances are spread enough [puppet] - https://gerrit.wikimedia.org/r/216661 (https://phabricator.wikimedia.org/T101635)
[12:53:55] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[13:03:06] !log added strongswan_5.3.0-1+wmf2 to jessie-wikimedia on carbon
[13:03:10] Logged the message, Master
[13:09:41] operations: terbium - dpkg reports broken packages - https://phabricator.wikimedia.org/T101583#1345484 (chasemp) p:Triage>Normal
[13:10:23] (PS1) Alexandros Kosiaris: Remove graphite hosts from backups [puppet] - https://gerrit.wikimedia.org/r/216665
[13:10:28] operations, Labs: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1345486 (yuvipanda) p:Triage>High
[13:10:41] operations, Labs, Labs-Sprint-101: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344262 (yuvipanda)
[13:10:57] operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632#1345489 (chasemp) p:Triage>Normal I am not sure if there are any other reasons this switch would be difficult now. May be a good thing to ask in Ops meeting.
[13:11:46] operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632#1345492 (yuvipanda) +1, I'll bring it up.
[13:12:18] operations, ops-codfw: ms-be2008.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T101665#1345493 (chasemp)
[13:12:20] operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632#1345494 (yuvipanda) Probably all the people who are still 'pushing to production' and their local scripts.
[13:23:34] PROBLEM - DPKG on zirconium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:24:25] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.soYRuEtB/mnt/tmp/ccache is not accessible: Permission denied [13:24:55] PROBLEM - puppet last run on zirconium is CRITICAL Puppet has 1 failures [13:26:05] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [13:32:02] 6operations, 6Services: Define and then implement a way for a future service owner to provide the info required to have a new service brought into production - https://phabricator.wikimedia.org/T97031#1345510 (10akosiaris) After contemplating quite a bit on this, I think the best way to get this done is via a... [13:32:53] (03CR) 10Alexandros Kosiaris: "@filippo, does this make sense now that we got graphite2001 as well?" [puppet] - 10https://gerrit.wikimedia.org/r/216665 (owner: 10Alexandros Kosiaris) [13:36:54] RECOVERY - DPKG on zirconium is OK: All packages OK [13:37:01] (03CR) 10BBlack: [C: 04-1] certs: replace require by collector ordering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215352 (owner: 10Faidon Liambotis) [13:39:04] 6operations, 6Services: Define and then implement a way for a future service owner to provide the info required to have a new service brought into production - https://phabricator.wikimedia.org/T97031#1345526 (10mobrovac) I'm all for it. The only inconvenient thing I see with this approach is that the Phabrica... [13:39:06] (03PS1) 10Alexandros Kosiaris: etherpad: ensure present [puppet] - 10https://gerrit.wikimedia.org/r/216672 [13:40:06] RECOVERY - puppet last run on zirconium is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:40:32] (03PS1) 10Andrew Bogott: Remove use_dnsmasq logic. [puppet] - 10https://gerrit.wikimedia.org/r/216673 [13:48:43] (03CR) 10Yuvipanda: [C: 04-1] "Should also grep for the dnsmasq IP and remove it from places?" 
[puppet] - 10https://gerrit.wikimedia.org/r/216673 (owner: 10Andrew Bogott) [13:48:43] (03PS1) 10Giuseppe Lavagetto: role::etcd: fix ferm, add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/216675 [13:48:43] andrewbogott: ^ [13:48:46] andrewbogott: I think that's just setting the nameserver variable in realm.pp [13:48:46] (03CR) 10Giuseppe Lavagetto: [C: 032] role::etcd: fix ferm, add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/216675 (owner: 10Giuseppe Lavagetto) [13:49:44] PROBLEM - High load average on labstore1001 is CRITICAL 88.89% of data above the critical threshold [24.0] [14:00:16] (03PS1) 10Giuseppe Lavagetto: ectd::monitoring: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/216678 [14:00:16] (03CR) 10Giuseppe Lavagetto: [C: 032] ectd::monitoring: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/216678 (owner: 10Giuseppe Lavagetto) [14:00:17] (03PS2) 10Andrew Bogott: Remove use_dnsmasq logic. [puppet] - 10https://gerrit.wikimedia.org/r/216673 [14:00:17] YuviPanda: ^ [14:00:18] (03CR) 10Yuvipanda: [C: 04-1] Remove use_dnsmasq logic. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/216673 (owner: 10Andrew Bogott) [14:00:18] andrewbogott: nit [14:00:18] trailing space :) [14:00:18] ok, looking [14:00:21] ori: thanks for the reminder :) https://phabricator.wikimedia.org/T98563#1345570 [14:00:21] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 80.00% of data above the critical threshold [35.0] [14:00:21] (03PS3) 10Andrew Bogott: Remove use_dnsmasq logic. [puppet] - 10https://gerrit.wikimedia.org/r/216673 [14:00:22] YuviPanda: ^ [14:00:22] (03CR) 10Yuvipanda: [C: 031] Remove use_dnsmasq logic. 
[puppet] - 10https://gerrit.wikimedia.org/r/216673 (owner: 10Andrew Bogott) [14:00:31] (03PS9) 10Faidon Liambotis: certs: replace require by collector ordering [puppet] - 10https://gerrit.wikimedia.org/r/215352 [14:00:59] YuviPanda: I’m going to run a test as soon as nfs is back… [14:01:03] Maybe will have breakfast in the meantime [14:01:10] oh, it’s back! [14:06:08] PROBLEM - etcd service on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:09] PROBLEM - puppet last run on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:09] (03PS7) 10Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - 10https://gerrit.wikimedia.org/r/216421 [14:06:09] PROBLEM - salt-minion processes on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:10] PROBLEM - DPKG on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:10] PROBLEM - Disk space on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:10] PROBLEM - Etcd cluster health on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:10] PROBLEM - RAID on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:20] PROBLEM - configured eth on etcd1001 is CRITICAL: Timeout while attempting connection [14:06:28] (03CR) 10BBlack: [C: 031] certs: replace require by collector ordering [puppet] - 10https://gerrit.wikimedia.org/r/215352 (owner: 10Faidon Liambotis) [14:06:39] PROBLEM - dhclient process on etcd1001 is CRITICAL: Timeout while attempting connection [14:11:09] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [14:11:28] <_joe_> sorry for the alarms [14:11:54] (03CR) 10Giuseppe Lavagetto: [C: 032] role::etcd: define etcd::host [puppet] - 10https://gerrit.wikimedia.org/r/216681 (owner: 10Giuseppe Lavagetto)