[00:04:39] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[00:04:59] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[00:06:59] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 00:06:52 UTC 2014
[00:07:19] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 00:07:16 UTC 2014
[00:07:40] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 12:06:52 AM UTC
[00:07:59] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 12:07:16 AM UTC
[00:28:39] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[00:29:29] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[01:04:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[01:04:57] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[01:06:46] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 01:06:40 UTC 2014
[01:06:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:06:40 AM UTC
[01:07:06] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 01:07:00 UTC 2014
[01:07:56] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:07:00 AM UTC
[02:03:47] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[02:03:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[02:06:56] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 02:06:53 UTC 2014
[02:07:17] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 02:07:09 UTC 2014
[02:07:47] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:07:09 AM UTC
[02:07:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:06:53 AM UTC
[02:26:42] !log LocalisationUpdate completed (1.23wmf8) at Thu Jan 2 02:26:41 UTC 2014
[02:29:36] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[02:30:26] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[02:33:36] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[02:34:34] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[02:37:34] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[02:38:24] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[02:38:41] !log LocalisationUpdate completed (1.23wmf7) at Thu Jan 2 02:38:41 UTC 2014
[03:02:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jan 2 03:02:15 UTC 2014
[03:21:34] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[03:23:24] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[03:46:34] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[03:47:24] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.152 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[03:48:48] (03PS1) 10Rschen7754: add templateeditor right for testwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912
[03:50:34] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[03:51:24] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.148 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[03:51:56] (03PS2) 10Rschen7754: add templateeditor right for testwiki: bug: 59084 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912
[04:03:21] (03PS3) 10Rschen7754: add templateeditor right for testwiki: bug: 59084 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912
[04:04:00] (03PS4) 10Legoktm: Add templateeditor right, group, and restriction [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912 (owner: 10Rschen7754)
[04:04:47] (03PS5) 10John F. Lewis: Add templateeditor right, group, and restriction [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912 (owner: 10Rschen7754)
[04:05:12] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[04:05:12] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[04:05:16] (03CR) 10John F. Lewis: [C: 031] "Configuration seems fine. Commit message is fine too now." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912 (owner: 10Rschen7754)
[04:06:42] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 04:06:36 UTC 2014
[04:07:02] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 04:06:56 UTC 2014
[04:07:12] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 04:06:56 AM UTC
[04:07:12] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 04:06:36 AM UTC
[04:53:32] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[04:55:22] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.150 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[05:04:36] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[05:04:46] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[05:07:07] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 05:07:01 UTC 2014
[05:07:16] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 05:07:11 UTC 2014
[05:07:36] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 05:07:01 AM UTC
[05:07:46] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 05:07:11 AM UTC
[05:21:33] (03CR) 10MZMcBride: "I don't think there's any need for this. The bug is still awaiting a rationale. The commit message attempts to provide one, but it's prett" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104912 (owner: 10Rschen7754)
[05:48:29] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[05:49:19] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[06:05:38] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[06:05:48] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[06:07:28] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 06:07:23 UTC 2014
[06:07:28] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 06:07:23 UTC 2014
[06:07:38] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 06:07:23 AM UTC
[06:07:48] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 06:07:23 AM UTC
[07:04:35] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[07:04:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[07:06:55] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 07:06:45 UTC 2014
[07:06:56] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 07:06:45 UTC 2014
[07:07:35] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 07:06:45 AM UTC
[07:07:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 07:06:45 AM UTC
[08:04:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[08:04:52] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[08:06:52] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 08:06:44 UTC 2014
[08:07:03] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 08:06:59 UTC 2014
[08:07:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 08:06:44 AM UTC
[08:07:52] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 08:06:59 AM UTC
[08:36:24] !log gallium / jenkins upgrading Jenkins from 1.509.4 to 1.532.1
[08:36:55] ...
[08:46:45] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[08:47:35] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 5.435 seconds response time. nagiostest.beta.wmflabs.org returns
[09:05:04] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[09:05:14] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[09:07:24] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 09:07:16 UTC 2014
[09:07:34] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 09:07:31 UTC 2014
[09:07:57] !log jenkins restarted
[09:08:04] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 09:07:31 AM UTC
[09:08:14] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 09:07:16 AM UTC
[09:08:33] (03PS1) 10Hashar: stages.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104919
[09:16:44] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[09:17:34] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[09:22:06] (03PS1) 10Hashar: applicationserver: pass puppetlint / retab [operations/puppet] - 10https://gerrit.wikimedia.org/r/104920
[09:31:36] (03PS1) 10Hashar: ganglia_new: fix puppet-lint issues [operations/puppet] - 10https://gerrit.wikimedia.org/r/104921
[09:53:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Redirect kr.wikimedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/101220 (owner: 10John F. Lewis)
[10:03:08] (03CR) 10Alexandros Kosiaris: [C: 032] 2 new education redirects [operations/apache-config] - 10https://gerrit.wikimedia.org/r/102753 (owner: 10Jeremyb)
[10:04:42] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[10:04:42] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[10:07:02] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 10:06:54 UTC 2014
[10:07:14] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 10:07:09 UTC 2014
[10:07:42] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:06:54 AM UTC
[10:07:42] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:07:09 AM UTC
[10:08:22] (03PS1) 10Hashar: beta: allow ssh from gallium on parsoid instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/104924
[10:08:46] akosiaris: good morning :) whenever you are done with apache config could you merge in https://gerrit.wikimedia.org/r/104924 please ? :-D that is for beta / ci
[10:15:10] (03CR) 10Alexandros Kosiaris: [C: 032] beta: allow ssh from gallium on parsoid instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/104924 (owner: 10Hashar)
[10:16:06] hashar: Good morning to you too. And happy new year!! :)
[10:16:54] akosiaris1: thanks :-D and yeah happy 0x7DE
[10:18:37] :-)
[10:19:39] akosiaris1: do you have any clue how I could get a debian package uploaded on debian repo ?
[10:19:59] I have updated the python-statsd package by bumping the changelog entry but no clue whom to ask :D
[10:21:14] the official one ? you are supposed to wait for an uploader.... and I 've been advised not to push things on that front.
[10:21:45] aka "patience is a virtue"
[10:23:29] akosiaris1: well I haven't even uploaded the file :D
[10:23:47] maybe I should poke the debian python folks
[10:26:14] your best best I 'd say
[10:46:53] (03PS1) 10Alexandros Kosiaris: Install python{,3}-dev packages on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/104929
[10:47:57] (03CR) 10jenkins-bot: [V: 04-1] Install python{,3}-dev packages on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/104929 (owner: 10Alexandros Kosiaris)
[10:51:52] (03PS2) 10Alexandros Kosiaris: Install python{,3}-dev packages on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/104929
[10:54:42] (03CR) 10Alexandros Kosiaris: [C: 032] Install python{,3}-dev packages on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/104929 (owner: 10Alexandros Kosiaris)
[11:04:32] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[11:04:32] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[11:07:12] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 11:07:08 UTC 2014
[11:07:32] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 11:07:08 AM UTC
[11:07:32] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 11:07:29 UTC 2014
[11:08:32] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 11:07:29 AM UTC
[11:08:45] (03PS1) 10Hashar: beta: parsoid switch to jenkins deployed parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/104932
[11:09:25] akosiaris1: and another lame beta/parsoid change https://gerrit.wikimedia.org/r/#/c/104932/ update the parsoid upstart configuration to have it load the NPM module from the proper place
[11:09:34] akosiaris1: no impact on production since prod does not use upstart yet :-]
[11:09:50] (03PS2) 10Hashar: beta: parsoid switch to jenkins deployed parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/104932
[11:10:00] amended summary to reflect it has no impact on prod
[11:10:49] don't forget we hope to rip off as much of that for pro as possible though ;-)
[11:10:53] *prod
[11:11:52] yeah got to update that later on with gwicke
[11:11:59] they want to use upstart as well
[11:12:37] that will be very nice indeed
[11:14:37] mind merging that meanwhile ? :-]
[11:14:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "There is a typo. And a nitpick. Otherwise LGTM" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104932 (owner: 10Hashar)
[11:14:43] ah
[11:15:06] upstart is nice ?
[11:16:19] upstart is ok ;)
[11:16:26] (03CR) 10Hashar: beta: parsoid switch to jenkins deployed parsoid (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104932 (owner: 10Hashar)
[11:16:32] I like upstart
[11:16:41] (03PS3) 10Hashar: beta: parsoid switch to jenkins deployed parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/104932
[11:16:56] Ubuntu upstart cookbook is worth a read http://upstart.ubuntu.com/cookbook/
[11:17:00] I dislike upstart :P. even systemd seems better
[11:17:06] no systemd is not better
[11:17:24] oh please, lets not bring the Debian "systemd vs upstart" drama here :-D
[11:17:32] why not?
[11:17:33] done brought it
[11:17:39] that is not a debian drama
[11:18:11] I have never used systemd myself nor do I have any clue what are the difference between the two
[11:18:39] well systemd is too intrusive for an init system
[11:18:50] it wants DBUS for example ...
[11:19:10] it changes early boot logging as well
[11:20:26] addressed issue on https://gerrit.wikimedia.org/r/104932 meanwhile
[11:20:58] (03CR) 10Alexandros Kosiaris: [C: 032] beta: parsoid switch to jenkins deployed parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/104932 (owner: 10Hashar)
[11:21:32] Ubuntu is probably going to stick with upstart anyway so we are stuck with it as well
[11:21:43] though if we switch to Debian ...
[11:21:51] ahahaha
[11:23:54] hashar: merged
[11:27:18] thanks!
[11:29:42] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:30:33] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.578 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[11:35:11] (03PS1) 10Mark Bergsma: Update A/AAAA records of SLD project domain records [operations/dns] - 10https://gerrit.wikimedia.org/r/104937
[11:35:12] (03PS1) 10Hashar: beta: parsoid localsettings.js [operations/puppet] - 10https://gerrit.wikimedia.org/r/104938
[11:37:58] (03CR) 10Mark Bergsma: [C: 032] Update A/AAAA records of SLD project domain records [operations/dns] - 10https://gerrit.wikimedia.org/r/104937 (owner: 10Mark Bergsma)
[11:38:10] akosiaris1: and a last one to publish the parsoid configuration on beta https://gerrit.wikimedia.org/r/#/c/104938/ :d
[11:40:34] I am out to lunch be back in a few
[11:40:35] :-)
[11:46:02] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[11:55:13] (03PS1) 10Mark Bergsma: Swap new and old bits-lb LVS service IPs for esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/104940
[11:56:50] (03CR) 10Mark Bergsma: [C: 032] Swap new and old bits-lb LVS service IPs for esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/104940 (owner: 10Mark Bergsma)
[12:05:08] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[12:05:08] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[12:06:47] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 12:06:43 UTC 2014
[12:07:07] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 12:07:01 UTC 2014
[12:07:08] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 12:06:43 PM UTC
[12:07:21] (03PS1) 10Mark Bergsma: Update bits-lb.esams IP addresses to the new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/104941
[12:08:07] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 12:07:01 PM UTC
[12:13:31] (03CR) 10Mark Bergsma: [C: 032] Update bits-lb.esams IP addresses to the new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/104941 (owner: 10Mark Bergsma)
[12:25:46] (03PS1) 10Mark Bergsma: Swap old and new bits-lb.eqiad IPv6 LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104945
[12:27:10] (03CR) 10Mark Bergsma: [C: 032] Swap old and new bits-lb.eqiad IPv6 LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104945 (owner: 10Mark Bergsma)
[12:52:41] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[12:52:41] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[12:55:29] (03PS1) 10Mark Bergsma: Update AAAA record of bits-lb.eqiad to the new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/104947
[12:58:13] (03CR) 10Mark Bergsma: [C: 032] Update AAAA record of bits-lb.eqiad to the new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/104947 (owner: 10Mark Bergsma)
[13:07:14] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 13:06:36 UTC 2014
[13:07:22] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 13:07:12 UTC 2014
[13:07:41] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:06:36 PM UTC
[13:07:41] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:07:12 PM UTC
[13:11:18] (03PS1) 10Mark Bergsma: Add new mobile-lb LVS service IPs (Zero scheme) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104948
[13:12:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[13:12:50] (03CR) 10Mark Bergsma: [C: 032] Add new mobile-lb LVS service IPs (Zero scheme) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104948 (owner: 10Mark Bergsma)
[13:20:57] (03PS1) 10Mark Bergsma: Remove site pmtpa from the protoproxy configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/104949
[13:23:06] (03CR) 10Mark Bergsma: [C: 032] Remove site pmtpa from the protoproxy configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/104949 (owner: 10Mark Bergsma)
[13:24:18] akosiaris1: i have some more redirects if you're still in the merging mood
[13:27:02] akosiaris1: also, did you deploy what you already merged?
[13:30:49] (03PS1) 10Mark Bergsma: Add new mobile LVS service IPs to protoproxies (Zero scheme) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104950
[13:31:02] huh, 2 issues: 1 the redirect is broken and 2) it's cached in varnish as the older state
[13:31:14] (krwikimedia)
[13:32:22] (03CR) 10Mark Bergsma: [C: 032] Add new mobile LVS service IPs to protoproxies (Zero scheme) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104950 (owner: 10Mark Bergsma)
[13:32:50] jeremyb: sure and yes
[13:48:33] (03PS15) 10Hashar: Enable Wikidata build on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude)
[13:48:44] RECOVERY - Varnish HTTP text-backend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[13:48:44] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:19:11 PM UTC
[13:48:44] RECOVERY - SSH on cp1065 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[13:48:44] RECOVERY - puppet disabled on cp1065 is OK: OK
[13:48:44] RECOVERY - Disk space on cp1065 is OK: DISK OK
[13:48:45] RECOVERY - Varnish HTCP daemon on cp1065 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd
[13:48:45] RECOVERY - DPKG on cp1065 is OK: All packages OK
[13:48:46] RECOVERY - RAID on cp1065 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:48:58] icinga-wm: no flooding, kthx!
[13:49:24] RECOVERY - Puppet freshness on cp1065 is OK: puppet ran at Thu Jan 2 13:49:15 UTC 2014
[13:49:44] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:49:15 PM UTC
[13:49:51] akosiaris1: so, we have some (probably widespread) redirect problems with the new system uncovered by the latest kr.wikimedia.org change. https://bugzilla.wikimedia.org/54883#c5 (but definitely effecting more than just that new one)
[13:50:01] (03CR) 10Hashar: [C: 032] "with aude" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude)
[13:50:47] speaking which, Tim-away are you here? i guess nick indicates no
[13:50:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:49:15 PM UTC
[13:50:54] (03Merged) 10jenkins-bot: Enable Wikidata build on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude)
[13:51:12] hashar: re jenkins jobs, i should be trusted, no? :)
[13:51:49] jeremyb: probably, add your email in integration/zuul-config.git , there is a few examples in the history :)
[13:51:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:49:15 PM UTC
[13:51:57] jeremyb: yes
[13:52:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:49:15 PM UTC
[13:53:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:49:15 PM UTC
[13:54:01] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused
[13:54:40] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused
[13:54:50] RECOVERY - Puppet freshness on cp1065 is OK: puppet ran at Thu Jan 2 13:54:42 UTC 2014
[13:54:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:49:15 PM UTC
[13:55:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[13:56:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[13:57:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[13:58:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[13:59:30] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:59:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:00:20] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.149 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219
[14:00:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:01:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:02:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:03:50] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:04:35] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[14:04:46] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[14:04:46] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:04:58] 1970?
[14:05:00] (03PS1) 10Hashar: missing extension-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104953
[14:05:45] (03CR) 10Hashar: "extension-list-lab got removed by that change which caused messages to no more be updating :( Fixed by https://gerrit.wikimedia.org/r/#/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104741 (owner: 10Dan-nl)
[14:05:45] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:05:52] (03CR) 10Hashar: [C: 032] missing extension-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104953 (owner: 10Hashar)
[14:06:00] (03Merged) 10jenkins-bot: missing extension-list-labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104953 (owner: 10Hashar)
[14:06:45] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:07:15] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 14:07:14 UTC 2014
[14:07:25] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 14:07:19 UTC 2014
[14:07:35] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:07:14 PM UTC
[14:07:39] aude: that's the beginning of time
[14:07:45] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:07:19 PM UTC
[14:07:45] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:07:48] yep
[14:08:26] bahh
[14:09:06] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:09:32] * jeremyb fights with labs
[14:09:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:09:56] i almost feel like it would be worth it to take an apache out of rotation and test there
[14:10:42] RECOVERY - Puppet freshness on cp1065 is OK: puppet ran at Thu Jan 2 14:10:30 UTC 2014
[14:10:42] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 01:54:42 PM UTC
[14:10:47] (03PS1) 10Hashar: Wikibase: fix extension-list paths [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104954
[14:11:19] (03PS2) 10Hashar: Wikibase: fix extension-list paths [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104954
[14:11:29] (03CR) 10Hashar: [C: 032] Wikibase: fix extension-list paths [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104954 (owner: 10Hashar)
[14:11:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:10:30 PM UTC
[14:12:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:10:30 PM UTC
[14:12:44] (03Merged) 10jenkins-bot: Wikibase: fix extension-list paths [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104954 (owner: 10Hashar)
[14:13:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:10:30 PM UTC
[14:13:58] 3 minutes ago and it still is critical ???
[14:13:59] grrrrr
[14:14:37] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:10:30 PM UTC
[14:15:02] (03PS1) 10Mark Bergsma: Swap old and new mobile-lb.eqiad LVS service IPs (IPv6) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104955
[14:15:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:10:30 PM UTC
[14:15:45] RECOVERY - Puppet freshness on cp1065 is OK: OK
[14:16:07] akosiaris1: that's a box that had chronic problems a week or two ago. idk what happened since then
[14:16:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:15:35 PM UTC
[14:16:56] jeremyb: yeah I am aware. But this is not a problem of the box... crappy icinga has problems now
[14:17:16] (03CR) 10Mark Bergsma: [C: 032] Swap old and new mobile-lb.eqiad LVS service IPs (IPv6) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104955 (owner: 10Mark Bergsma)
[14:17:45] akosiaris1: or snmpt (sp?)
[14:17:53] or the thing it calls
[14:17:55] or something
[14:20:05] (03PS1) 10Hashar: beta: make mw-update-l10n verbose [operations/puppet] - 10https://gerrit.wikimedia.org/r/104956
[14:20:40] (03PS1) 10Faidon Liambotis: icinga: capitalize Faidon's name & remove Asher [operations/puppet] - 10https://gerrit.wikimedia.org/r/104957
[14:21:00] (03CR) 10Faidon Liambotis: [C: 032 V: 032] icinga: capitalize Faidon's name & remove Asher [operations/puppet] - 10https://gerrit.wikimedia.org/r/104957 (owner: 10Faidon Liambotis)
[14:25:33] what's up with virt100* and icinga?
[14:26:35] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:15] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [14:28:34] mark: virt1000 had its opendj crash due to "too many open files" [14:28:42] puppet restarted it [14:29:36] that one also affected DNS for labs... the crappy freshness checks ? I really don't know [14:29:50] no i mean, the virt* boxes seem to be breaking the icinga config dependencies [14:29:58] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [14:30:36] akosiaris1: when was the opendj problem? [14:30:49] sudo's been very slow since at least last night [14:30:49] some 39 minutes ago [14:30:56] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.426 second response time [14:31:00] still having problems now [14:31:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:19:14 PM UTC [14:32:35] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:19:14 PM UTC [14:33:37] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:19:14 PM UTC [14:34:01] mark: seems to be ok now... did you run puppet manually ? [14:34:19] yes [14:34:26] it was missing the virt1002 host definition... [14:34:26] seems like subsequent puppet runs fix it [14:34:30] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:19:14 PM UTC [14:34:33] race condition ? [14:35:51] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:19:14 PM UTC [14:35:51] RECOVERY - Puppet freshness on cp1065 is OK: GRRRRRR [14:36:28] akosiaris1: hey, where do we update puppet volatile now? palladium, or all workers? 
[14:36:30] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:36:40] andre__, *: see last 2 commits on https://git.wikimedia.org/commit/operations%2Fapache-config.git/d768fd64ea6594b01ddc6eb78a91507b846e1b22 ; tested and working in labs. now someone has to modify the php to generate the right conf. i'll be back in ~30 mins (anyone feel free to work on that php fix...) [14:37:30] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:38:30] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:39:30] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:40:30] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:40:53] !log swift: setting weight of ms-be5 sde1 to 0, pending RT 6555 [14:41:09] paravoid: all workers [14:41:18] RECOVERY - Disk space on ms-be5 is OK: DISK OK [14:41:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:42:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:42:43] (03PS1) 10Mark Bergsma: Update mobile-lb.eqiad AAAA record [operations/dns] - 10https://gerrit.wikimedia.org/r/104959 [14:43:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:43:34] (03CR) 10Mark Bergsma: [C: 032] Update mobile-lb.eqiad AAAA record [operations/dns] - 10https://gerrit.wikimedia.org/r/104959 (owner: 10Mark Bergsma) [14:44:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:45:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 
02:35:34 PM UTC [14:46:40] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:47:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:47:57] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:47:57] manybubbles: i'm working a bit today, should I merge that priority CirrusSearch jobs change? [14:48:08] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [14:48:22] ottomata: I'm happy with it. It can wait until you're working for reals though [14:48:26] when are you back? [14:48:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:49:01] monday 100% [14:49:18] but i'm kinda working a half day today, so I'm happy to do it now [14:49:24] !log hashar synchronized wmf-config 'Wikibase tweak for beta 976f2e9..7f80acb' [14:49:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:49:41] ottomata: cool. may as well then [14:49:54] no time like the present [14:50:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:51:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:52:28] (03PS2) 10Ottomata: Prioritize priority CirrusSearch jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104763 (owner: 10Chad) [14:52:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:35:34 PM UTC [14:52:34] (03CR) 10Ottomata: [C: 032 V: 032] Prioritize priority CirrusSearch jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104763 (owner: 10Chad) [14:54:25] ottomata: hello! [14:54:28] hiya! 
[14:54:29] happy new year :) [14:54:33] icinga is full of kafka errors [14:54:33] backaattcha! [14:54:43] ahahaha [14:54:46] yes yes, going to look at that in juuust a minute, i sent an email about that [14:54:51] happy new year :-) [14:54:54] they are not real [14:55:01] they are caused by ganglios problems [14:55:01] both varnishkafka & brokers [14:55:04] at least, they were on monday [14:56:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [14:57:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [14:58:18] thanks ottomata. got time to talk about elasticsearch plugins? [14:58:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [14:58:41] manybubbles: sorta, gonna try to fix this kafka icinga thing [14:58:48] we need to talk about that as a larger thing, right? [14:58:53] deploying jvm stuff? [14:59:06] been meaning to send an email to restart that discussion for about a week now [14:59:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [14:59:45] ottomata: at this point I just want my plugins. Larger thing or not. 
whatever it takes [15:00:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [15:00:35] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:00:35] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:01:04] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:01:05] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate is 2751.14198192 [15:01:14] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate is 2752.83059598 [15:01:14] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:01:26] ottomata: kafka still broken :-D [15:01:34] or no more hmm [15:01:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [15:01:51] naw, its ganglios that is broken [15:02:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [15:03:34] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [15:04:40] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [15:05:10] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [15:05:10] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [15:05:29] !log starting rolling upgrade of Elasticsearch servers. Going from 0.90.7 to 0.90.9. 
[15:05:39] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 02:53:10 PM UTC [15:07:09] could use two merges for beta, one to make mw-update-l10n verbose : https://gerrit.wikimedia.org/r/#/c/104956/ [15:07:10] next being to properly track beta parsoid config https://gerrit.wikimedia.org/r/#/c/104938/ [15:07:39] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: STALE [15:07:39] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: STALE [15:07:59] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: STALE [15:08:09] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: STALE [15:08:10] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: STALE [15:08:10] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: STALE [15:14:39] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:14:39] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:14:59] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:15:09] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate is 2709.66566769 [15:15:10] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate is 2708.55707436 [15:15:10] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [15:15:24] (03CR) 10Faidon Liambotis: "What's the deal we have with wikimedia.li & wikimedia.pl? We own the domains but use some commercial nameservers? 
What's legal's take on t" [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [15:18:14] (03CR) 10Alexandros Kosiaris: [C: 032] beta: make mw-update-l10n verbose [operations/puppet] - 10https://gerrit.wikimedia.org/r/104956 (owner: 10Hashar) [15:18:24] somehow I think ganglios is going to break again in a short while... [15:18:24] hmm [15:18:37] (03PS1) 10Hashar: beta: finish parsoid switching to jenkins job [operations/puppet] - 10https://gerrit.wikimedia.org/r/104961 [15:18:48] akosiaris: while you are it I got https://gerrit.wikimedia.org/r/104961 [15:19:06] akosiaris: which switch the beta parsoid to a new repository. I did a similar change earlier this morning [15:19:43] (03CR) 10Alexandros Kosiaris: [C: 032] beta: parsoid localsettings.js [operations/puppet] - 10https://gerrit.wikimedia.org/r/104938 (owner: 10Hashar) [15:21:12] Syslogs::Readable[messages]/File[/var/log/messages]/mode: mode changed '0640' to '0644' [15:21:12] \O/ [15:23:19] (03PS2) 10Alexandros Kosiaris: beta: finish parsoid switching to jenkins job [operations/puppet] - 10https://gerrit.wikimedia.org/r/104961 (owner: 10Hashar) [15:23:31] akosiaris: you will be praised :-] [15:24:39] hashar: I want a bard's song :-) [15:24:46] (03CR) 10Alexandros Kosiaris: [C: 032] beta: finish parsoid switching to jenkins job [operations/puppet] - 10https://gerrit.wikimedia.org/r/104961 (owner: 10Hashar) [15:34:51] (03PS1) 10Hashar: beta: parsoid: typo in file definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/104965 [15:35:06] akosiaris1: and a typo : file:// should be file:/// https://gerrit.wikimedia.org/r/104965 :-] [15:35:38] damn... 
I should have noticed that [15:35:49] well [15:35:59] I knew it it was too easy to be true [15:36:02] our puppet tests should have prevented that in the first place [15:36:06] (03CR) 10Alexandros Kosiaris: [C: 032] beta: parsoid: typo in file definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/104965 (owner: 10Hashar) [15:36:26] completed parsing of enwiki:0.8680641636207775 in 2701 ms [15:36:29] works like a charm [15:48:43] 'git' 'clone' '-q' 'ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Echo.git' 'Echo' [15:48:43] Permission denied (publickey). [15:48:43] fatal: The remote end hung up unexpectedly [15:48:43] git exit with status 128 [15:48:44] DAMN IT [15:50:37] hashar: are you back from holidays now? [15:50:48] chrismcmahon: yup [15:53:24] andre__: akosiaris1: i have a fix for redirects issue I was talking about before. [15:53:37] * jeremyb prepares commit [15:56:41] * andre__ wonders why he gets pinged here. [15:57:19] (03PS2) 10Andrew Bogott: Recomission virt1001, 1002, 1003. [operations/puppet] - 10https://gerrit.wikimedia.org/r/104676 [15:58:39] (03PS3) 10Tim Landscheidt: Recomission virt1001, 1002, 1003. [operations/puppet] - 10https://gerrit.wikimedia.org/r/104676 (owner: 10Andrew Bogott) [15:59:03] andre__: bug 54883 [15:59:28] ah. 
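The file:// vs file:/// fix merged above (https://gerrit.wikimedia.org/r/104965) is easy to reproduce: in a file URI the two slashes introduce an authority (host) component, so with only two slashes the first path segment gets swallowed as a hostname instead of staying part of the path. A quick sketch with Python's standard urllib — the `/data/localsettings.js` path here is made up for illustration, not the actual parsoid config path:

```python
from urllib.parse import urlparse

# With only two slashes, "data" is parsed as the authority (host),
# not as the first path segment of the file path.
broken = urlparse("file://data/localsettings.js")
print(repr(broken.netloc), broken.path)   # 'data' /localsettings.js

# Three slashes leave the authority empty and keep the full path.
fixed = urlparse("file:///data/localsettings.js")
print(repr(fixed.netloc), fixed.path)     # '' /data/localsettings.js
```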
[15:59:40] (03CR) 10Andrew Bogott: [C: 032] Recommission virt1001, 1002, 1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/104676 (owner: 10Andrew Bogott) [15:59:48] thanks :) [16:04:05] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:04:24] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:07:45] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 16:07:39 UTC 2014 [16:07:45] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 16:07:39 UTC 2014 [16:08:05] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 04:07:53 PM UTC [16:08:06] ^ me tinkering [16:08:15] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 04:07:53 PM UTC [16:08:28] wha? [16:09:08] happy new year andrewbogott :-D [16:09:15] same to you! [16:09:18] noticed some slowness when doing sudo , might be ldap borked [16:09:35] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 16:09:30 UTC 2014 [16:09:36] on the beta cluster? [16:09:40] yeah [16:09:59] 'some slowness' = seconds or minutes? [16:10:01] no more the case apparently, there is a slight delay but not as awful as earlier today [16:10:06] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 04:09:30 PM UTC [16:10:08] a few seconds, maybe 5 - 10 [16:10:19] just FYI, in case it happens again =D [16:10:25] RECOVERY - Puppet freshness on virt1003 is OK: puppet ran at Thu Jan 2 16:10:15 UTC 2014 [16:10:32] working fine right now [16:10:49] Hm, ok :( [16:12:14] good morning, comrades [16:12:20] 'morning! [16:12:49] jgage, you're working at the SF office, at least now and then, right? 
[16:14:10] yep, one of the few ;) [16:15:13] jgage: OK, I'm visiting the office next Friday, maybe we'll cross paths :) [16:16:04] andrewbogott cool, i'll keep an eye out for you. i sit next to leslie. [16:16:27] andrewbogott: woot :) [16:18:27] (03CR) 10BryanDavis: [C: 031] Revert "Configure Varnish not to cache scholarship app reqs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/103768 (owner: 10BryanDavis) [16:18:45] (03PS1) 10Mark Bergsma: Swap old and new LVS service IPs for ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/104972 [16:19:43] hopefully by next week ulsfo's cross connects will be ready.... [16:20:18] LeslieCarr: Haven't you been saying that for the last 11 weeks? [16:20:24] hehehe [16:20:25] (03CR) 10Mark Bergsma: [C: 032] Swap old and new LVS service IPs for ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/104972 (owner: 10Mark Bergsma) [16:20:40] well only for the past 2 weeks -- before that i was hoping the links wouldbe ready [16:21:04] Ah, maybe I don't know the difference between a link and a cross connect :( [16:23:24] (03PS1) 10Mark Bergsma: Update ulsfo service IP addresses according to new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/104976 [16:34:48] (03PS1) 10Hashar: beta: remove old parsoid updater [operations/puppet] - 10https://gerrit.wikimedia.org/r/104978 [16:35:15] off [16:37:08] RECOVERY - Puppet freshness on virt1001 is OK: puppet ran at Thu Jan 2 16:36:59 UTC 2014 [16:39:25] (03CR) 10Mark Bergsma: [C: 032] Update ulsfo service IP addresses according to new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/104976 (owner: 10Mark Bergsma) [16:42:48] (03PS1) 10Mark Bergsma: Remove old ulsfo LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104979 [16:44:03] (03CR) 10Mark Bergsma: [C: 032] Remove old ulsfo LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104979 (owner: 10Mark Bergsma) [16:54:46] (03PS3) 10BryanDavis: [WIP] 
Kibana puppet class [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 [16:55:03] (03CR) 10BryanDavis: [WIP] Kibana puppet class (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 (owner: 10BryanDavis) [17:08:28] (03PS1) 10Jeremyb: fix redirects for percent-encoded destinations [operations/apache-config] - 10https://gerrit.wikimedia.org/r/104984 [17:08:29] (03PS1) 10Jeremyb: write a \n at EOF after generating rest of conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/104985 [17:11:30] !log reedy synchronized php-1.23wmf9 'staging' [17:21:55] !log reedy updated /a/common to {{Gerrit|I49a405d8a}}: Wikibase: fix extension-list paths [17:22:00] (03PS1) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104988 [17:22:52] (03CR) 10jenkins-bot: [V: 04-1] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104988 (owner: 10Reedy) [17:26:19] ^d: Can you rm -rf /a/common/docroot/bits/WikipediaMobileFirefoxOS.bak2 please from tin? 
[17:27:38] <^d> done [17:28:02] (03PS2) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104988 [17:28:03] thanks [17:28:33] (03PS3) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104988 [17:28:41] (03CR) 10Reedy: [C: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104988 (owner: 10Reedy) [17:30:47] (03Merged) 10jenkins-bot: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104988 (owner: 10Reedy) [17:31:09] (03PS1) 10Reedy: Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 [17:32:01] (03CR) 10Reedy: "Git doesn't like empty files" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104953 (owner: 10Hashar) [17:32:23] (03PS2) 10Reedy: Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 [17:32:30] (03CR) 10Reedy: [C: 032] Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 (owner: 10Reedy) [17:32:43] (03CR) 10jenkins-bot: [V: 04-1] Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 (owner: 10Reedy) [17:33:56] (03CR) 10jenkins-bot: [V: 04-1] Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 (owner: 10Reedy) [17:34:02] (03PS3) 10Reedy: Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 [17:34:40] (03CR) 10Reedy: [C: 032] Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 (owner: 10Reedy) 
[17:37:44] (03Merged) 10jenkins-bot: Wrap inclusion of wmfConfigDir/extension-list-labs in file_exists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104989 (owner: 10Reedy) [17:39:12] !log reedy synchronized docroot and w [17:42:44] reedy@tin:/a/common$ scap 1.23wmf9 testwiki to 1.23wmf9 and build l10n cache [17:42:44] Syncing all versions. [17:42:44] Checking syntax of wmf-config and multiversion... done [17:42:44] Copying to tin from tin.eqiad.wmnet...ok [17:42:45] PHP Warning: opendir(/usr/local/apache/common-local/php-1.23wmf9/cache/l10n/upstream): failed to open dir: No such file or directory in /usr/local/bin/mergeCdbFileUpdates on line 76 [17:42:45] Could not open directory '/usr/local/apache/common-local/php-1.23wmf9/cache/l10n/upstream'. [17:42:46] /usr/local/bin/scap-2: line 48: die: command not found [17:43:32] AaronSchulz: ori ^ [17:43:44] Trying to build and deploy l10n cache for a new mw version [17:46:13] I notice it doesn't like 1.23wmf9 and syncs all anyway? [17:46:58] !log reedy started scap: 1.23wmf9 testwiki to 1.23wmf9 and build l10n cache [17:47:20] Updating LocalisationCache for 1.23wmf9... Updated 366 JSON file(s) in '/a/common/php-1.23wmf9/cache/l10n'. [17:47:22] looks sane at least [17:49:45] getting lots of errors [17:49:51] l10n related [17:50:13] (03CR) 10saper: [C: 031] "I hope that ns*.wikimedia.org ns servers are not used as resolvers anywhere (only as auth zone servers) so those changes don't really chan" [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [17:50:29] manybubbles: Where? 
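The scap failure at 17:42 is two stacked problems: mergeCdbFileUpdates opendir()s an l10n `upstream` directory that apparently does not exist yet (likely because 1.23wmf9 was freshly staged), and the `die` helper the wrapper falls back to is not defined at that point in scap-2's shell. A general defensive pattern for the first problem, sketched in Python — this is not the actual scap fix, just an illustration using the path from the log:

```python
import os
import sys

# Path taken verbatim from the PHP warning in the log.
upstream = "/usr/local/apache/common-local/php-1.23wmf9/cache/l10n/upstream"

def merge_upstream(path):
    # Fail with one clear message instead of an opendir() warning
    # cascade when the directory for a new MediaWiki version is
    # not there yet.
    if not os.path.isdir(path):
        sys.exit("Could not open directory '%s'." % path)
    return sorted(os.listdir(path))
```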
[17:50:36] Reedy: the logs [17:50:48] I tail -F fatal.log exception.log habitually now [17:51:08] when it moves so fast it is distracting then something is wrong [17:53:02] oh it is very bad now [17:53:44] That almost looks like scap is pushing out crap localisation files [17:54:15] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 [17:54:32] It looks like they are all noexternallanglinks [17:54:37] do those myeah [17:54:39] yeah [17:54:51] those don't stop page rendering, do they? [17:54:54] hmmmm [17:55:13] magic words are bad [17:55:14] We show crappy exceptions [17:55:24] But we haven't had a flood fo complaints either... [17:55:50] the entrypoint file thing for merge messages works fine, doesn't it? [17:56:36] Presumably [17:56:42] I didn't think there was any need for me to test it [17:56:49] well, you know [17:57:06] files in /a/common/php-1.23wmf8/cache/l10n/ look to be of sane size etc [17:57:10] seems beta is fine [17:57:20] Wait for scap to finish and see how things lay [17:57:31] wikivoyage works fine, purging etc [17:57:43] there's been some change in the way/order localisation files are pushed out [17:57:52] oh [17:57:53] But I have honestly no idea [17:58:02] Almost like they're all being wiped [17:58:07] then repulled/generated/whatever [17:58:28] * Reedy the still lack of complaints [17:58:31] magic words might no tlike it [17:58:31] *notes the [17:58:37] * aude nods [18:00:50] Reedy: I don't get that when I run sync-common...I do get some rename() errors though [18:01:01] don't get what? 
[18:01:53] Nemo_bis: I like blue [18:02:08] :D [18:02:18] the opendir() error [18:02:31] actually those errors are from rsync itself [18:02:37] it's nicely filled of blue [18:05:06] scap needs a progress bar [18:05:13] (i think ori might've already said that) [18:05:25] so, about treating all exceptions as important ;) [18:05:40] meh [18:05:47] Users still haven't apparently noticed [18:06:17] well, or-i of 2 months ago would be filling bugs right now :) [18:06:31] Reedy: I just got a report [18:06:36] of? [18:06:41] and one on #wikimedia-commons of a lot of errors [18:07:00] co localisation reportedly regressed to earlier state (missing) [18:07:07] Eww [18:07:08] https://commons.wikimedia.org/wiki/Main_Page [18:07:09] is broken [18:07:14] (corsican, an Italian dialect) [18:07:21] :( [18:07:25] if you hit the "right" server [18:07:34] AaronSchulz: Are all the localisation cache files being nuked first? [18:07:51] 1 in 5 or 6 seem to die on commons [18:10:40] Reedy: they are excluded from rsync and rebuilt afterwards, the existing ones aren't nuke though (they get renamed over with new ones) [18:11:19] Reedy: is scap still running? [18:11:23] yup [18:11:53] Killing it would've probably made things even worse [18:16:06] just curious if it was running [18:17:46] It'd be nice if we could do progress based on server count finished vs total servers [18:19:56] I'm getting a Mediawiki internal error on commons [18:20:05] "Exception caught inside exception handler." 
[18:23:04] Yup [18:23:17] localisation seems not right for wikidata :( [18:23:36] 40 minutes and counting [18:23:55] Reedy, the whole site is down, and I just used it [18:24:02] No it's not [18:24:29] It depends if you're "lucky" [18:24:43] Commons is working [18:24:49] For me [18:24:56] I see it break 1 in 5 or 6 [18:25:09] aude: As usual we'll have to wait and then assess the damage [18:25:15] It's working for me now too [18:25:22] And it's back down [18:25:35] Yup [18:25:41] Because all servers aren't in the same state [18:25:53] :( [18:26:06] i think it's not finding wikibase localisation [18:26:52] reedy@fluorine:/a/mw-log$ du --si exception.log [18:26:53] 269M exception.log [18:26:53] reedy@fluorine:/a/mw-log$ du --si exception.log [18:26:53] 1.1G exception.log [18:26:57] It seems to be getting worse [18:27:14] Oh my [18:27:35] * AaronSchulz is trying to find where that is even defined [18:27:41] Yeah, I can't access it at all now. [18:27:51] I guess everyone who accesses the site creates an exception log [18:27:53] gg MediaWiki [18:28:09] I blame Jamesofur [18:28:11] j/k [18:28:11] gzip will be able to compress it down nicely at least [18:28:12] :P [18:28:19] ah, WikibaseClient.i18n.magic.php [18:28:21] Hey James! [18:28:27] it's supposed to be included with $wgExtensionEntryPointListFiles [18:28:37] maybe that doesn't work or we did something wrong [18:28:50] NOT IT! [18:28:51] https://git.wikimedia.org/commitdiff/operations%2Fmediawiki-config.git/e6fc8d9db38947776523bfdcc330bdecbaadd034 [18:28:55] i wonder if you broke it aude... [18:29:00] * aude wonders [18:29:02] :p [18:29:19] Is it only commons that seems to be broken?
[18:29:33] We could disable wikibase till it's fixed [18:29:33] wikivoyage also [18:29:36] an dwikidata [18:29:39] Reedy, I can still get to commons files on other wikis [18:29:55] Yeah, you're not using localisation stuff from commons [18:30:17] And scap is still going [18:30:26] so...in eval.php, <> works fine with /home on tin but not in /usr [18:30:27] (03PS1) 10Aude: Revert "Enable Wikidata build on beta labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105001 [18:30:32] with the later I get the exception [18:30:43] i would merge that and then we can try to figure otu what/how to do [18:30:51] (03CR) 10jenkins-bot: [V: 04-1] Revert "Enable Wikidata build on beta labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105001 (owner: 10Aude) [18:30:56] grrrr [18:30:58] rage [18:31:13] sync tin is presumably synced with itself, I guess it makes sense for it to get worse as scap runs [18:31:25] reedy@fluorine:/a/mw-log$ du --si exception.log [18:31:25] 2.2G exception.log [18:31:37] 5 minutes and it doubled in side [18:31:39] *size [18:32:06] using text() works in both /home and /usr [18:32:15] so I guess anything that invokes parse in /usr is broken [18:32:20] Does scap have --quick yet? [18:32:21] (with that same exception) [18:32:50] Reedy: since newlines were added the .json files since last scap, it will definitely be slow now [18:33:02] :| [18:33:26] * AaronSchulz expects the usually 43min [18:33:32] *usual [18:33:40] next run will be more interesting [18:33:55] ugh, can't type today [18:34:29] pages' [18:34:43] (03PS2) 10Aude: Revert "Enable Wikidata build on beta labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105001 [18:34:47] for commons [18:34:56] Watchmouse? [18:35:12] AaronSchulz: it's done [18:35:17] yep [18:35:19] just doing sync-wikiverisons now [18:35:20] !log reedy finished scap: 1.23wmf9 testwiki to 1.23wmf9 and build l10n cache [18:35:26] scap completed in 51m 16s. [18:35:36] 51 minutes? 
[18:35:44] Reedy: well that part is like 3 sec ;) [18:36:04] (03CR) 10Reedy: [C: 032] Revert "Enable Wikidata build on beta labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105001 (owner: 10Aude) [18:36:06] Step 1 [18:36:49] (03Merged) 10jenkins-bot: Revert "Enable Wikidata build on beta labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105001 (owner: 10Aude) [18:36:50] so, l10n_cache-en.cdb is the same on mw1017 and /usr and /home on tin [18:37:04] and probably the same everywhere else then [18:37:56] I'll run scap again then with 105001 merged [18:38:41] recovery page just came in [18:38:46] I'm probably late to the party but I'm getting a ton of MWExceptions in the last few minutes accessing commons [18:38:53] Yes. [18:38:59] party's almost over [18:39:26] Reedy: if the json transporting was broken then I'd expect those md5sum values not to match /a/common [18:41:13] like this: [86f366d4] 2014-01-02 18:37:42: Fatal exception of type MWException [18:42:15] dschwen, yep [18:42:36] yep, late to the party? [18:42:40] yup [18:42:42] by about 50 minutes [18:42:45] :) [18:42:51] not for me [18:42:57] Updating LocalisationCache for 1.23wmf8... Updated 366 JSON file(s) in '/a/common/php-1.23wmf8/cache/l10n'. [18:43:00] still getting the raw exception messages [18:43:02] Hm [18:43:04] !log reedy started scap: active rebuild localisation cache with updated wikidata config [18:43:04] Actually [18:43:08] I'm gonna cheat here [18:43:12] dschwen: like everyone else in the world [18:43:16] [2ff0f682] 2014-01-02 18:43:04: Fatal exception of type MWException [18:43:24] Lets do it quicker [18:43:43] sync-dir > scap [18:43:52] Reedy: cheat? [18:44:04] I'm not letting scap run through [18:44:12] I don't want to wait another 50 minutes [18:44:19] sync-dir php-1.23wmf8/cache/l10n/ Sync [18:44:30] can sync wmf9 after as that doesn't matter [18:44:56] Are we still populating tampa?
[18:44:58] * Reedy grumbles [18:45:27] 4.3G exception.log [18:45:45] those Ctrl+R monkeys, eh. [18:46:23] https://ganglia.wikimedia.org/latest/graph.php?c=Analytics%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1388688358&g=network_report&z=medium [18:46:25] page: broken again [18:46:34] Reedy: scap shouldn't be as slow next run [18:46:37] Analytics have had a good amount traffic [18:47:06] so eval.php works with parsing again on tin [18:47:27] apergos: We just need to wait for the localisation cache to be sync'd... [18:47:42] aude: you say all Wikivoyage projects and Wikidata too? [18:48:01] and commons [18:48:02] wonder why it thought things were fixed [18:48:13] It might have got lucky and hit a fixed server [18:48:22] meh [18:48:23] load average: 35.80, 23.58, 10.78 [18:48:33] * apergos grits teeth [18:48:36] twkozlowski: yes [18:49:17] thanks Reedy, aude -- preparing a mention for Tech News [18:49:31] I see it started around 17:42 UTC, so it's been just about an hour now [18:49:35] * aude hides [18:49:53] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+eqiad&h=tin.eqiad.wmnet&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [18:49:55] wheee [18:49:55] Reedy: so you are doing sync-dir now? [18:49:57] http://status.wikimedia.org/8777/308948/https-services---commons thinks everything is still rather well, 90 % uptime last hour :D [18:50:22] AaronSchulz: Yup. 
Has been for about 7 minutes [18:50:33] aspergos, looks like there is still partying going on, and pretty hard [18:50:48] sorry, that was a typo, not an asperger joke :-( [18:50:55] We're having fuuuun [18:50:59] tab completion ftw [18:52:08] says 60 % uptime btw [18:52:17] * Nemo_bis wonders why not just shut it down [18:52:48] happy 2014; we've decided to shut down commons, it's too much of a bother :-P [18:53:08] I meant status.wikimedia.org :P [18:53:15] * apergos hopes we aren't really gong to wait another 48 minutes [18:53:37] it shouldn't be [18:54:39] if the cdb files changed, then it will be a while [18:55:03] it's a good job that tin is a reasonable spec server [18:55:08] oh looky, an en wikivoyage page now [18:55:15] that worked? [18:55:27] so "$wmfConfigDir/extension-list-wikidata"; resolves to somethign php-1.23wmf8 specific [18:55:32] that says it's broken [18:55:53] that's probably the problem [18:56:19] isn't it something stupid like php-1.23wmf8/../wmf-config ? [18:57:15] yes [18:57:16] /a/common/php-1.23wmf8/../wmf-config [18:57:19] Yeah [18:57:19] > echo $wmfConfigDir; [18:57:19] /a/common/php-1.23wmf7/../wmf-config [18:57:23] and i don't see the new extension list [18:57:29] maybe as it was reverted, though [18:57:39] idk if it was there [18:57:48] we can bring it locally onto tin when things are fixed to test it [18:58:01] seemed to be what was needed for beta to work [18:58:20] Reedy: https://meta.wikimedia.org/wiki/Wikimedia_Forum#Wikivoyage_is_down.3F - Assume you know something :p [18:58:37] We do [18:58:51] should be fixed soon as localisation stuff finishes [18:58:53] PATIENCE YOUNG PADAWAN [18:58:53] Reedy: Good. Just thought I'll like you to an onwiki report :) [18:59:21] Patience? Ha :p [18:59:24] Typical they made the edit on a wiki with no issues ;) [18:59:43] :D [19:00:15] recovery page for en wikivoyage... 
[19:00:40] JohnLewis: I took the liberty to post a short answer [19:00:47] (on Meta) [19:01:02] and for commons [19:01:15] twkozlowski: Good :p [19:09:52] 8.6G exception.log [19:11:13] another commons whine just came in [19:11:24] Reedy: is sync-dir done? [19:11:27] No [19:11:31] * aude super impatient [19:11:44] still 10s of rsync processes running [19:13:09] AaronSchulz: It'll !log when it's done ;) [19:13:11] aude: not just you [19:13:40] * aude can't imagine how the new config didn't work [19:14:12] no problems with beta [19:14:31] dsh -cM -g mediawiki-installation -o -oSetupTimeout=30 -F30 -- ???sudo -u mwdeploy rsync -a --delete-delay --delay-updates --compress --delete --exclude=**/.svn/lock --exclude=**/.git/objects --exclude=**/.git/**/objects --exclude=**/cache/l10n/*.cdb --no-perms --exclude=cache/l10n tin.eqiad.wmnet::common/php-1.23wmf8/cache/l10n// /usr/local/apache/common-local/php-1.23wmf8/cache/l10n/?? [19:14:37] Is that actually going to do anything? [19:15:19] Reedy: so, sync-common-file uses MW_RSYNC_ARGS for the directory case, which will include << --exclude=**/cache/l10n/*.cdb >> afaik [19:15:19] excluding and targetting in the same command? [19:16:31] yes, right, sycn-common-file already excludes that anyway, heh [19:17:07] -rw-r--r-- 1 mwdeploy mwdeploy 2734554 Jan 2 18:38 l10n_cache-ab.cdb [19:17:12] yeah it won't exclude anything then since it's relative [19:17:31] I thought you were doing cache/ for second there [19:18:08] which explains why it is taking forever as you'd expect copying all the stuff [19:18:19] morebots is dead btw [19:19:16] again? i'll restart it [19:20:16] load average: 30.74, 30.81, 28.34 [19:20:31] netsplit about 24h ago [19:23:01] AaronSchulz: hey, heads up, others will be deploying today, when this is over (ie: things no longer broken), can you send out that email to engineering@? 
Thanks :) [19:23:24] (03PS1) 10Reedy: Sync EQIAD before PMTPA [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 [19:23:28] test.wikidata looks better [19:23:30] greg-g: It's probably not AaronSchulzs fault [19:23:46] ^^ Why are we syncing PMTPA at all currently? (I know I've asked that before) [19:23:50] no, but this is just in reference to the change in the command args [19:24:02] Aha [19:24:29] ie: benny and any other random deployer who doesn't read all of -operations scrollback doesn't have a clue that anything has changed. [19:24:30] !log restarted morebots [19:25:03] Waiting for tampa first just adds more frustration in cases like this [19:25:16] yeah, that's lame [19:25:19] Logged the message, Master [19:25:30] !log stuff has been broken [19:25:39] :) [19:25:43] * Reedy kicks morebots [19:25:51] greg-g: the old style works fine now [19:26:01] Logged the message, Master [19:26:07] no one has to use it differently now [19:26:11] cool [19:26:30] nothing worth bragging about then? ;) [19:26:31] 31 seconds for morebots to reply? [19:26:44] the extension branch think is probably worth an email though [19:26:50] when it first starts up, yes, it has some initialization stuff to work through [19:27:11] Reedy: does that apply for wmf9? [19:27:18] *thing [19:27:26] apergos: it's reticulating splines [19:27:32] What thing? [19:27:35] wmf9 needs doing next [19:27:44] I just did wmf8 to be "quicker" [19:28:18] still getting sporadic exceptions, is there still an ongoing sync? [19:28:42] Yup [19:29:10] 40 minutes apparently [19:29:13] so it's mostly up? 
[19:29:19] Probably [19:29:31] yeah [19:29:43] looks to be in the high mw1[01]\d{2} range [19:29:49] [11:27] AaronSchulz the extension branch thing [19:29:49] Reedy: I saw that was merged [19:29:58] Yup [19:30:04] almost 2 hours on, now :/ [19:30:09] I made sure I did it before deploying [19:30:14] Then had to kick git a few times [19:30:26] around 1h45min [19:30:56] twkozlowski: right, hence the "almost" [19:31:07] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/104970/ git/gerrit sucks [19:31:21] is this one of those cases where deploy performance, including LU updates, is our main bottleneck to quick recovery? [19:31:54] :) [19:31:59] Yeah [19:32:29] Eloquence: yes! [19:32:41] just because of a missing magic word [19:32:42] Syncing hundreds of MB to hundreds of servers is never going to be quick in one way or another [19:32:43] mainly [19:32:57] Reedy, why would we have to sync hundreds of MB to hundreds of servers if only a few things have changed? [19:33:06] that ^ [19:33:08] for wikidata, it has no magic word so localized messages are just not shown [19:33:09] Because there's changes to all localisation files [19:33:15] Every language has localisations [19:33:16] but one can still edit wikidata [19:33:28] Reedy: you sentence is best completed "in the way we do things now." [19:33:31] your* [19:33:35] no, I understand that it's due to the localisation cache updates. but that seems to indicate that an incremental update mechanism for messages is badly called for [19:33:36] We support 369 languages [19:33:38] in theory scap probably would be faster than sync-dir on l10n...at least in theory [19:33:47] Incremental updates to binary files... :D [19:33:49] Eloquence: an incremental update mechanism has been in place for a few days [19:34:01] AaronSchulz: I was thinking that due to the seeding and mutliple sources [19:34:14] COMPUTERS I HATE YOU [19:34:15] ori: honest question: do you have performance numbers pre/post? 
[19:35:02] greg-g: there's the SAL, but the numbers are useless because of another unfortunate property of scap, which is that it is only as fast as the slowest target host [19:35:12] AaronSchulz: I guess sync-dir should probably get scap style deployment optimisations ;) [19:35:27] if a target host is saturated, it'll hang forever on that one host [19:35:33] Reedy: scap would exclude l10n/*.cdb and only sync the /upstream directory (MD5/json files) and then it would rebuild the cdbs per host from the json [19:35:36] * aude wish if the magic word was not used on a page, then it's missing ness shouldn't affect editing or viewing [19:35:49] magic words are brittle [19:35:50] ... [19:35:55] ori: I guess I'm just wondering how close you are to writing up the "We made scap awesomer, about this much" email [19:36:08] When can we start expensing holidays? [19:36:25] greg-g: i'm not going to write it [19:36:33] it's your sprint and aaron's code [19:36:46] ori: I like you niceness hack on search idx2001 ;) [19:36:47] s/you/aaron/ then [19:36:51] er, 1001 [19:36:51] aude: yeah, that is odd, just doing wfmessage( 'editing' )->parse() blew up in eval [19:37:04] ori: "my sprint" officially ended on the 13th :/ [19:37:16] Reedy, just request a personal day off if you're working on a holiday :) [19:37:22] ori: I don't think pointing fingers is the right thing here [19:37:57] so how about we do some test runs with small i18n changes and time it before announcing anything? [19:38:04] AaronSchulz: sounds good [19:38:29] I guess this could have been a test just now, but we played it "safe" with sync-dir [19:38:29] i don't thing nagging about e-mails is, either. if you want to help communicate this, you can look at the git commits messages, put them on an etherpad, start working them into clear prose, and optionally combine it with scap timing data from the SAL [19:38:43] so it will have to be next set of changes [19:39:43] ori: I was asking how close you were to it is all. 
I asked AaronSchulz about the "please send email about updated scap" and he said that old commands work now, so I bowed down on that (Benny needs to deploy a fix today, which is why I asked) [19:40:50] and i'm just pointing out how you could help that along [19:41:48] fair, I just thought the co-authors would be best to explain what is going on/changed :/ [19:41:59] AaronSchulz, is https://gerrit.wikimedia.org/r/#/c/103080/ the changeset that implements incremental updates? [19:42:04] (co-authors only because you were reviewing merges) [19:44:26] well that a random follow-ups [19:44:32] *and [19:44:38] * greg-g nods [19:44:46] *nod* that's pretty cool. look forward to seeing the effect. [19:44:48] Commons being down has been reported, correct? [19:45:00] StevenW: yep [19:45:01] * AaronSchulz is skipping letters like a bad CD drive today [19:45:11] maybe I should have went on vacation [19:45:16] StevenW: Almost feels like you're trolling ;) [19:45:21] AaronSchulz: :P [19:45:32] What, me? I would never. [19:45:34] Reedy, out of curiosity, were you aware of this work at all? [19:45:36] wikivoyage is back! [19:45:48] <\pi{r^2}> StevenW: https://meta.wikimedia.org/w/index.php?title=Tech/News/2014/02&curid=3243228&diff=6925038&oldid=6924574 [19:45:57] <\pi{r^2}> also see the SAL, etc. [19:46:20] wikidata is still missing localisation [19:47:19] \pi{r^2}: Now, this is only a draft, don't quote me on that yet! :-) [19:49:34] Yay [19:49:41] It's nearly stated syncing to the last server [19:49:53] 1217/1220 [19:50:05] (03CR) 10Chad: "I'd say just remove it entirely. Since Mark shut down the LVS & other services we basically can't fail over to it anymore even if we wante" [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [19:50:05] [fcd7c33c] 2014-01-02 19:49:48: Fatal exception of type MWException [19:50:23] * Reedy pets Bsadowski1 [19:51:03] wikidata is good now [19:51:27] yay! 
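[editor's note] The incremental scheme AaronSchulz describes above — exclude `l10n/*.cdb`, sync only the `upstream/` MD5 and json files, and rebuild each cdb per host when its json changed — can be sketched as follows. The `l10n_cache-*.cdb.json` / `.MD5` file names mirror the ones seen later in this log; the function name and boolean return are illustrative, not scap's actual API.

```python
import hashlib
import os


def needs_rebuild(upstream_dir, lang):
    """Decide whether the per-host cdb for `lang` must be rebuilt by
    comparing the recorded MD5 of the upstream json file against a fresh
    hash of its current contents. (Illustrative sketch, not scap's code.)"""
    json_path = os.path.join(upstream_dir, "l10n_cache-%s.cdb.json" % lang)
    md5_path = json_path.replace(".json", ".MD5")
    with open(json_path, "rb") as f:
        current = hashlib.md5(f.read()).hexdigest()
    with open(md5_path) as f:
        recorded = f.read().strip()
    return current != recorded
```

The payoff is that only the small json/MD5 pairs cross the network; the expensive cdb files are regenerated locally on each apache instead of being rsynced out from tin.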
[19:51:30] Now I'm getting spammed by 1000s of lines of errors Aa [19:51:34] DMAN IT [19:51:42] He left as I was trying to tab complete [19:51:48] mw1201: rsync: rename failed for "/usr/local/apache/common-local/php-1.23wmf8/cache/l10n/upstream/l10n_cache-gsw.cdb.MD5" (from upstream/.~tmp~/l10n_cache-gsw.cdb.MD5): No such file or directory (2) [19:51:49] mw1201: rsync: rename failed for "/usr/local/apache/common-local/php-1.23wmf8/cache/l10n/upstream/l10n_cache-gsw.cdb.json" (from upstream/.~tmp~/l10n_cache-gsw.cdb.json): No such file or directory (2) [19:51:49] mw1201: rsync: rename failed for "/usr/local/apache/common-local/php-1.23wmf8/cache/l10n/upstream/l10n_cache-gu.cdb.MD5" (from upstream/.~tmp~/l10n_cache-gu.cdb.MD5): No such file or directory (2) [19:52:37] !log aaron synchronized php-1.23wmf8/cache/l10n/upstream [19:52:53] !log reedy synchronized php-1.23wmf8/cache/l10n/ 'Sync' [19:52:53] "fixed" [19:53:04] \o/ [19:53:20] * aude sighs [19:53:27] exception.log is also quiet again (to confirm it's fixed) [19:54:12] (03CR) 10Reedy: "I guess in that case we should probably remove the tampa servers from any of the mw related dsh lists too..." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [19:54:27] AaronSchulz: I got tonnes of spam at the end [19:54:28] mw1201: rsync: rename failed for "/usr/local/apache/common-local/php-1.23wmf8/cache/l10n/upstream/l10n_cache-gu.cdb.MD5" (from upstream/.~tmp~/l10n_cache-gu.cdb.MD5): No such file or directory (2) [19:54:28] etc [19:56:27] !log reedy updated /a/common to {{Gerrit|Ic5919deee}}: Revert "Enable Wikidata build on beta labs" [19:56:34] (03PS1) 10Reedy: Wikipedias to 1.23wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105014 [19:57:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.23wmf8 [19:58:26] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 [19:58:27] yay [19:58:35] (03CR) 10Reedy: [C: 032] Wikipedias to 1.23wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105014 (owner: 10Reedy) [19:59:02] (03Merged) 10jenkins-bot: Wikipedias to 1.23wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105014 (owner: 10Reedy) [19:59:30] (03PS1) 10Reedy: testwiki to 1.23wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105015 [19:59:31] (03PS1) 10Reedy: Rest of phase1 to 1.23wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105016 [19:59:54] (03CR) 10Reedy: [C: 032] testwiki to 1.23wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105015 (owner: 10Reedy) [20:00:37] Reedy: probably the sync-dirs running at the same time (which doesn't really matter) [20:00:47] reedy@tin:/a/common$ scap 1.23wmf9 Re-build 1.23wmf9 localisation cache [20:00:47] Syncing all versions. 
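[editor's note] The `rsync: rename failed ... (from upstream/.~tmp~/...)` spam above is characteristic of two `--delay-updates` runs racing over the same staging directory: one sync's `--delete` sweeps away the other's `.~tmp~` dir before its renames complete — consistent with Reedy's "probably the sync-dirs running at the same time". One minimal guard is to serialize syncs behind an exclusive lock file; this is an illustrative sketch, not how scap or sync-dir actually coordinate.

```python
import errno
import os
import time


def with_sync_lock(lock_path, sync_fn, timeout=600):
    """Run sync_fn() while holding an exclusive lock file, so two deploy
    syncs cannot race over the same rsync --delay-updates staging dir."""
    deadline = time.time() + timeout
    while True:
        try:
            # O_CREAT|O_EXCL makes creation atomic: only one process wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except OSError as e:
            if e.errno != errno.EEXIST or time.time() >= deadline:
                raise
            time.sleep(1)  # another sync holds the lock; wait and retry
    try:
        os.write(fd, str(os.getpid()).encode())
        return sync_fn()
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A stale-lock check (is the recorded pid still alive?) would be needed in practice; it is omitted here to keep the race itself in focus.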
[20:00:49] * AaronSchulz was getting impatient [20:01:46] (03Merged) 10jenkins-bot: testwiki to 1.23wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105015 (owner: 10Reedy) [20:01:56] !log reedy started scap: 1.23wmf9 Re-build 1.23wmf9 localisation cache [20:01:57] if that kept taking forever I was going to run mergeCdbFileUpdates everywhere, but it finished soon after [20:03:09] Updated 366 CDB file(s) in '/usr/local/apache/common-local/php-1.23wmf9/cache/l10n'. [20:16:57] Reedy: so the script doesn't check out the branches automatically? [20:18:27] It does [20:18:43] But when you get a pub key error from gerrit when checking out, the whole script stops [20:19:21] But as we need them all branching... And it's very unlikely to be a big error (if it is, I'd know gerrit was down) [20:19:39] ah [20:21:01] so waiting a few seconds and trying again results in it working next time around [20:21:31] * AaronSchulz curious since an extension was on (no branch) [20:21:34] Reedy: heads up, bsitu has some fixes for Flow that he'd like in wmf9, when should he do that? [20:21:56] Scap is well on it's way with the 1.23wmf9 localisation cache [20:22:05] When that's done, wikis to swap, then I'm done... [20:22:10] * greg-g nods [20:22:11] Not debugging audes change tonight ;) [20:22:14] :) [20:22:16] though I suppose one could keep doing it that way and check out the HEAD hash of the branch as it changes [20:22:31] I was thinking about that too [20:22:44] as it rm -rf's the build dir every time and starts again [20:23:33] greg-g thanks, we'll prepare the version bump for Flow on 1.23wmf9 and wait for your OK. [20:24:41] * greg-g nods [20:25:13] spagewmf: I'm going to go get some food right now, I probably won't be back until after Reedy's done, just fyi. [20:25:22] greg-g: regarding https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_6 [20:25:40] StevenW: yessir? [20:25:50] will all merged patches to BetaFeatures go out during that update? 
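[editor's note] Reedy's point above — the branching script dies on a one-off gerrit pub-key error, yet "waiting a few seconds and trying again results in it working" — is the classic case for retrying transient failures instead of aborting the whole run. A generic retry wrapper sketch; which exception types count as transient is an assumption here, not what the branching script actually raises.

```python
import time


def retry(fn, attempts=3, delay=5, transient=(RuntimeError,)):
    """Call fn(), retrying up to `attempts` times on exceptions listed in
    `transient`, sleeping `delay` seconds between tries."""
    for i in range(attempts):
        try:
            return fn()
        except transient:
            if i == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(delay)
```

Wrapped around each per-extension checkout, a single flaky gerrit response no longer kills the branch cut, while a genuinely down gerrit still fails loudly after the last attempt.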
[20:26:17] or just the search additions for those listed languages? [20:26:24] pretty sure BetaFeatures is an autobranched one... [20:26:30] Yeah from master [20:26:47] thanks to you both [20:26:48] https://www.mediawiki.org/wiki/MediaWiki_1.23/wmf9/Changelog#BetaFeatures [20:26:48] Reedy: is this the first message build for wmf9? [20:26:53] AaronSchulz: Nope [20:27:06] It was done first time around when wmf8 broke [20:27:29] !log reedy finished scap: 1.23wmf9 Re-build 1.23wmf9 localisation cache [20:27:34] yay [20:27:38] scap completed in 26m 53s. [20:30:00] * greg-g foods [20:31:26] Reedy: greg-g fine with looking at the change another time [20:31:40] (03CR) 10Reedy: [C: 032] Rest of phase1 to 1.23wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105016 (owner: 10Reedy) [20:31:58] aude: Probably wants some doing it manually on testwiki/mw1017 [20:32:00] (03Merged) 10jenkins-bot: Rest of phase1 to 1.23wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105016 (owner: 10Reedy) [20:32:04] still sucks that it's so complicated to change what extensions are used on beta vs production [20:32:19] ok [20:34:12] (03PS1) 10Reedy: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 [20:35:11] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: rest of phase1 wikis to 1.23wmf9 [20:36:10] ori: did you get a chance to peek at https://gerrit.wikimedia.org/r/#/c/103619/ ? [20:36:46] * AaronSchulz will kill that 'refs/changes/66/22466/14' => 'core' line afterwards [20:36:46] AaronSchulz: i'm in the middle of that right now [20:36:47] nice [20:38:07] aude: Don't disagree with you. Just something like that (quite a big change) should've probably been deployed in a more supervised fashion to the cluster ;) [20:38:22] * aude nods [20:39:51] AaronSchulz: I had to rebase it again today! 
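[editor's note] The idea floated above — stop `rm -rf`-ing the build dir on every run and instead "check out the HEAD hash of the branch as it changes" — amounts to caching keyed on the remote hash. A sketch under that assumption; `remote_head` is a hypothetical callable standing in for something like `git ls-remote`, and the `.built-head` marker file is invented for illustration.

```python
import os


def needs_checkout(build_dir, branch, remote_head):
    """Return True when the build dir must be refreshed: it does not exist
    yet, or the HEAD it was built from differs from the branch's current
    remote HEAD. remote_head(branch) -> hash is a hypothetical helper."""
    marker = os.path.join(build_dir, ".built-head")
    if not os.path.isdir(build_dir) or not os.path.exists(marker):
        return True
    with open(marker) as f:
        built = f.read().strip()
    return built != remote_head(branch)


def record_checkout(build_dir, head):
    """After a successful build, remember which HEAD it was built from."""
    with open(os.path.join(build_dir, ".built-head"), "w") as f:
        f.write(head)
```

With this in place, re-running the build after a transient failure only re-fetches branches whose HEAD actually moved, instead of starting the whole rm-and-reclone cycle again.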
[20:40:01] i should've removed a couple of the other hacks then [20:40:05] The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. [20:40:51] spagewmf: [20:40:52] [02-Jan-2014 20:39:44] Fatal error: Class 'Container' not found at /usr/local/apache/common-local/php-1.23wmf9/extensions/Flow/FlowActions.php on line 101 [20:41:00] I'm guessing that's one of the things you're gonna fix? [20:41:49] spagewmf: Should be good to go... [20:41:52] manybubbles: where? :/ [20:42:21] [20:42:01] what happens to db1007, db1028 [20:42:36] Reedy: yes [20:42:36] Reedy: exception.log I believe wikidatawiki [20:42:45] s7 snapshot is 1007 [20:42:51] PROBLEM - MySQL Slave Running on db68 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 39425319 for key PRIMARY on query. Defaul [20:43:00] db1028 is extension1 master [20:43:01] PROBLEM - MySQL Slave Running on db1041 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 39425319 for key PRIMARY on query. Defaul [20:43:01] PROBLEM - MySQL Slave Running on db1024 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 39425319 for key PRIMARY on query. Defaul [20:43:11] PROBLEM - MySQL Slave Running on db1028 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 39425319 for key PRIMARY on query. Defaul [20:43:18] springle-away, ^^^ [20:43:41] PROBLEM - MySQL Slave Running on db1007 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 39425319 for key PRIMARY on query. 
Defaul [20:45:10] MaxSem: I think paging/ringing him might be better than irc ping ;) [20:45:11] Exception Caught: The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. The administrator who locked it offered this explanation: The database has been automatically locked while the slave database servers catch up to the master [20:45:21] on wikidata [20:45:42] Reedy, I assume icinga is doing this right now;) [20:45:49] * aude rage [20:46:01] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 320 seconds [20:46:11] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 327 seconds [20:46:21] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 333 seconds [20:46:21] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: CRIT replication delay 333 seconds [20:46:47] [21:38:07] aude: Don't disagree with you. Just something like that (quite a big change) should've probably been deployed in a more supervised fashion to the cluster ;) [20:46:56] Reedy: we were trying to get your review for a long time :( [20:47:00] I know [20:47:14] But just merging it and not testing it then causing other things to break [20:47:22] we'll have to schedule a time for it again [20:47:54] we didn't get feedback even on when it'd be reviewed. at some point we have to move forward - especially since this is blocking other things [20:48:00] i agree this could have been done better but... [20:48:05] we tried [20:48:12] i'm not blaming you [20:48:21] it was merged untested and left for me to fall over [20:48:28] fair enough :( [20:48:31] PROBLEM - Varnishkafka log producer on cp1047 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishkafka [20:48:32] sorry about that [20:49:11] (03PS1) 10Andrew Bogott: Add a global that turns off the ssh banner for a bastion. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/105069 [20:51:53] How much longer on the DB lock on enwp? [20:51:59] (03PS2) 10Andrew Bogott: Add a global that turns off the ssh banner for a bastion. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105069 [20:52:37] Till someone finds what's wrong and fixes it [20:52:51] why am i seeing things go by in recent changes (with current timestamps) if the database is locked? [20:53:57] Just to annoy you [20:55:48] <\pi{r^2}> Technical_13: impatient? [20:55:57] <\pi{r^2}> ;) [20:56:25] A little, it's past my nap time and I have two posts for talk pages waiting to go... [21:00:53] * andre__ updates the bug report in https://bugzilla.wikimedia.org/show_bug.cgi?id=59221 and comments on some Village Pumps ("Commons is down!") [21:02:14] now it's just read only :P [21:02:26] (03CR) 10Andrew Bogott: [C: 032] Add a global that turns off the ssh banner for a bastion. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105069 (owner: 10Andrew Bogott) [21:02:51] Who's running a schema change? [21:03:05] (03PS2) 10Ryan Lane: Add redis config for keystone in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/104322 [21:04:04] Erm why is stuff in lockdown? [21:04:19] Just to annoy you [21:04:29] Qcoder00: there is a replication issue ongoing; staff are on it already. [21:04:30] Reedy: OK we have the bump to extensions/Flow for 1.23wmf9 ready, is it OK for bsitu to sync-dir it? [21:04:39] No [21:04:49] Reedy: how many presses of your arrow key did that take? ;-D [21:05:18] lol twkozlowski it's a macro to save time... [21:05:31] RECOVERY - Varnishkafka log producer on cp1047 is OK: PROCS OK: 1 process with command name varnishkafka [21:07:21] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay -0 seconds [21:07:41] RECOVERY - MySQL Slave Running on db1007 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:08:22] !log All? 
#Wikimedia wikis read only since about 20.40 UTC, s7 database replication halted [21:08:33] not that Twitter posting works, but who knows [21:09:01] RECOVERY - MySQL Slave Running on db1041 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:09:01] RECOVERY - MySQL Slave Running on db1024 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:09:01] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay -1 seconds [21:09:11] RECOVERY - MySQL Slave Running on db1028 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:09:11] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 2 seconds [21:09:21] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay -1 seconds [21:09:44] grr -!- morebots [local-more@208.80.153.164] has quit [Ping timeout: 240 seconds] [21:09:45] !log someone's schema change hit http://bugs.mysql.com/bug.php?id=61548 and broke replication on s7, I dropped triggers and renamed target table into… well, you will see. [21:09:50] ah [21:10:08] DOMAS SAVE[D] US [21:10:13] <\pi{r^2}> Yay! [21:10:17] can't log [21:10:22] and wikitech needs some magic tokens to log in [21:10:23] :( [21:10:56] ori: morebots please [21:11:16] or Ryan_Lane [21:11:35] domas: do you have an account on wikitech? [21:11:35] people add crap features, morebots then doesn't work [21:11:42] Ryan_Lane: dunno, probably [21:11:42] and you don't need a token [21:12:03] unless you enable two factor authentication [21:12:31] waiting for wikitech.wikimedia.org [21:12:39] some day things will work [21:12:53] waiting for? [21:13:01] it loads fine for me [21:13:04] * andrewbogott will fix morebots [21:13:12] one day you'll stop being a douchbag too domas [21:13:14] good for you [21:13:21] Ryan_Lane: it doesn't work [21:13:32] I'll keep waiting for that day, and you can keep waiting for things to work [21:13:46] Ryan_Lane: what did I do? 
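[editor's note] The flood of `MySQL Slave Running ... CRITICAL` alerts above boils down to one condition: the SQL thread stopped on a duplicate-key error while the IO thread kept running — the signature of the schema-change/trigger interaction domas logs (mysql bug 61548). A simplified version of that health check, taking a dict shaped like one row of `SHOW SLAVE STATUS`; the field names are real MySQL status fields, but the function is an illustration, not the icinga plugin itself.

```python
def slave_ok(status):
    """Replication is healthy only if both threads run and no error is set.
    `status` is a simplified SHOW SLAVE STATUS row as a dict."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes"
            and not status.get("Last_Error"))


# The s7 breakage as icinga reported it: IO thread up, SQL thread stopped.
broken = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No",
          "Last_Error": "Duplicate entry 39425319 for key PRIMARY"}
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Last_Error": ""}
```

Since the IO thread was still "Yes" on every affected slave, the breakage was in applying events, not in receiving them — which is why dropping the offending triggers on the masters was enough to let all four slaves recover within a minute of each other.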
[21:14:31] domas: I saved your !log on wiki [21:14:37] merci [21:15:00] domas: what error do you receive if any? [21:15:20] thanks of course for fixing things, but you could be a little less annoying about things and if you have issues report them nicely [21:15:25] wikitech doesn't seem to accept some unicode characters in passwords [21:15:47] Nemo_bis: yeah, that seemed to occur when opendj was upgraded [21:15:58] Ryan_Lane: what the fuck is wrong with you [21:16:03] I've been meaning to take a look at it. likely due to some default password policy [21:16:04] I have 40s page loads [21:16:16] well Ryan_Lane it's understandable to be a bit frustrated/anxious when you're fixing all projects being read only :) [21:16:17] sure I can point out that 40s page execution is on the slow side [21:16:26] Ryan_Lane: so fuck you and your attitudes [21:16:28] 40s? [21:16:32] yes, 40s [21:16:42] * andrewbogott too :( [21:16:42] where? wikitech? [21:17:02] https://www.dropbox.com/s/adqxgll5hmscsii/Screenshot%202014-01-02%2013.15.01.png [21:17:14] Ryan_Lane: learn to read [21:17:20] [13:12:30] waiting for wikitech.wikimedia.org [21:17:41] I guess consultants and not FTEs can be assholes [21:18:03] well, if you'd report your problem nicely rather than just being a negative person about everything, maybe you'd get a nicer response [21:18:14] my contract is short term and it's not for this [21:18:19] maybe I was looking at tons of different things [21:18:25] and yes, I reported that it doesn't load fast enough [21:18:33] what do you want me to do [21:18:41] go file bugzillla incident with "please fix it" ? [21:18:51] oops, now it's fast for me again. [21:18:54] maybe say in which way it's loading slow? login loading slow is a different problem [21:19:08] the wiki is fast for me, over a mifi [21:19:19] (03CR) 10Mark Bergsma: [C: 031] "Actually I didn't remove LVS at pmtpa, that's still fully functional. 
But Squid and SSL have been removed, which doesn't prevent the eqiad" [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [21:20:37] so, what happened with the db lock? [21:20:40] Ryan_Lane: *shrug*, all I knew few minutes ago was "omg site readonly, replication broken", I had to make mirriads of assumptions and make it work, by shooting someone's work in the head [21:20:57] Ryan_Lane: I don't know why you expect me to be extremely eloquent about whatever other problem I hit on the way [21:21:08] spagewmf: Should be good to go now [21:21:10] domas: heya, thanks much, can you say what happened? more than the !log, or is that about it? [21:21:13] Ryan_Lane: calling someone a "douchebag" right there [21:21:15] was not welcome [21:21:22] it's not a matter of being eloquent. it's a matter of not being a dick about things like morebots [21:21:35] it wasn't on the channel. it happens. it's an irc bot that writes to a wiki [21:21:43] a dick? I wrote that bot, and it was always on the channel [21:21:58] yes yes, your code is perfect. [21:22:02] but now apparently it is not on the channel and fails constantly since all the fancy twitter code was added [21:22:05] https://meta.wikimedia.org/wiki/User_talk:Midom#Attitudes_barnstar [21:22:23] Many pages are taking me ~10s to load right now. I'm not attentive enough to know if that's exceptional or not. [21:22:36] domas: I'm not sure it's related to the twitter posting [21:22:42] domas: As far as I know morebots fails mostly due to netsplit, and always has? [21:22:43] andrewbogott: on wikitech? that would be exceptional [21:22:45] maybe to something else [21:22:48] whatever [21:22:49] I guess I haven't looked at the source lately. [21:22:50] yes. 
netsplit [21:22:55] it's usually just labs problems or maybe in this case the wiki being unresponsive [21:22:58] and the library that domas chose to use it at fault [21:23:02] *is [21:23:08] domas: It's a legit problem, but not really self-inflicted… needs a total rewrite :( [21:23:22] though in this case, if wikitech is taking too long to respond it could also be an issue [21:23:27] greg-g: so, what happened - schema change that adds auto-incs has to be in special order [21:23:34] Well, we just refactored it to use a different irc lib, and it has the same failure case. [21:23:42] So domas may be blameless at this point. [21:23:53] greg-g: otherwise there's a nasty behavior that corrupts data and what not [21:24:40] greg-g: and breaks replication [21:25:08] domas: gotcha, could you see who started the schema change? [21:25:17] greg-g: I did not see it running anywhere [21:25:22] it could be leftover triggers [21:25:30] but they were everywhere [21:25:32] which is odd [21:25:44] interesting [21:25:49] folks who reported slow loading pages (domas, andrewbogott), which sites were those? [21:26:01] domas: what would you tell springle-away if he was here? :) [21:26:02] greg-g: fairly sure it was springle-away for https://bugzilla.wikimedia.org/show_bug.cgi?id=49189 [21:26:03] greg-g, springle has been working on the externallinks schema changes according to SAL [21:26:13] thanks both :) [21:26:18] apergos: wikitech, nothing to do with problems from earlier. [21:26:26] he should be waking up soon :) [21:26:36] yep [21:27:08] greg-g: http://bugs.mysql.com/bug.php?id=61548 :) [21:27:12] greg-g: he worked at mysql support [21:27:12] ok, I'm going to consider that not an emergency then and be metaphorically afk [21:27:14] he should get it [21:27:15] :) [21:27:15] thanks [21:27:23] andrewbogott: check for ldap issues with logstat.py [21:27:26] Ryan_Lane: My first instinct is to restart ldap -- did you just now do that? 
[21:27:30] I did not [21:27:38] I always make sure it's responding poorly first [21:27:38] * andrewbogott looks before leaping [21:27:39] I wonder what the fix is [21:27:39] (03PS1) 10Edenhill: Add support for %{Varnish:xid}x (X-Varnish: ..) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105093 [21:27:54] apergos: definitely not an emergency [21:28:22] labs eswiki replication doesn't catch up, btw [21:28:51] Guess you'll need to ask Coren|Travel nicely [21:29:14] Reedy: I? [21:29:21] domas: thanks [21:30:27] for a labsdb replication issue springle is probably also the right person to investigate, will prod him when he's around. [21:30:29] someone who cares needs to ask? ;) [21:30:57] !log bsitu synchronized php-1.23wmf9/extensions/Flow 'Update Flow to master' [21:31:19] Logged the message, Master [21:32:29] Ryan_Lane: logstat.py -something? [21:32:30] Reedy: heya, can you write up a quick post-mortem for this mornings first outage? [21:32:47] andrewbogott: python logstat.py [21:32:59] Eloquence: Thanks :) [21:33:06] oh! That makes more sense :) [21:36:11] Ryan_Lane: Search: 150928 Avg: 1232.2 ms Max: 8365 ms >100ms: 37763 (25%) >1000ms: 37467 (24%) [21:36:16] Expen$ive! [21:36:30] yep. that's a problem [21:36:31] that's on virt0? [21:36:34] yeah [21:36:56] restart opendj. we'll need to track down what queries need to be optimized [21:37:30] Let's see… I can't remember if dns needs restarting after opendj? Or only the other way 'round [21:37:30] it's also possible that there's still just a slow memory leak [21:37:35] it usually does [21:37:41] and it's random in how that breaks [21:37:48] it could break on either virt0 or virt1000 [21:37:53] 'k [21:39:03] domas, how is wikitech treating you now? [21:39:08] Faster, if not fast enough? [21:39:10] I can't log in [21:39:23] I'm not sure which password I should use [21:40:00] When you say 'not sure which...' [21:40:09] do you have a labs password? 
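[editor's note] The `logstat.py` output quoted above (`Search: 150928 Avg: 1232.2 ms Max: 8365 ms >100ms: 37763 (25%) >1000ms: 37467 (24%)`) is a plain aggregation over per-query latencies. A sketch of producing that summary line; the real script's parsing of the OpenDJ access log is omitted and the input is assumed to already be a list of millisecond timings.

```python
def summarize(times_ms):
    """Produce a logstat.py-style summary line from query times in ms."""
    n = len(times_ms)
    over100 = sum(1 for t in times_ms if t > 100)
    over1000 = sum(1 for t in times_ms if t > 1000)
    return ("Search: %d Avg: %.1f ms Max: %d ms "
            ">100ms: %d (%d%%) >1000ms: %d (%d%%)" % (
                n, sum(times_ms) / float(n), max(times_ms),
                over100, 100 * over100 // n, over1000, 100 * over1000 // n))
```

A profile like 25% of searches over 100 ms is exactly the cue Ryan_Lane describes: go hunting for unindexed LDAP filters and add indexes, with the daemon restart only buying time.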
[21:40:14] tried it too [21:40:15] Or, did you have? [21:40:31] 2fa maybe? [21:40:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:40:43] well, see, exactly, I don't know [21:40:44] Sorry if you've been through all this already, I'm behind on my backscroll homework [21:40:53] I never had any 2FA set up [21:41:15] oh wait, I logged in [21:41:21] woo! [21:41:25] when it takes a minute for log in to work [21:41:28] I may not notice immediately [21:41:34] reasonable [21:41:40] oh snap, I should avoid stating the fact, cause then I will be called a douchebag [21:42:18] no. you should stop blaming other people's code for your own codes failures [21:42:30] simmer down, kids [21:42:41] lol [21:43:11] which of my code failed to log me in quick? [21:43:20] oh wait, opendj is leaking memory, let's restart it! [21:43:21] will fix it [21:43:23] I'm referencing the morebots comment you made [21:43:23] cron a restart [21:43:49] that's why I called you a douchebag. not because of wikitech [21:44:00] well, it still did not work [21:44:01] * andrewbogott admits that 'cron a restart' is seeming like a pretty good idea :/ [21:44:03] your wikitech issue report was just not helpful [21:44:03] because of some features added [21:44:23] :) [21:44:34] hah [21:44:36] andrewbogott: nah. the memory grows because some queries are using a large amount of memory and a large number of them are occuring [21:44:46] usually it's unindexed queries [21:45:04] I track down the queries, add indexes and the problem goes away [21:45:17] restarting the daemon is just a way to get it back into working condition. [21:45:18] Ryan_Lane: well, I have to admit, it took me multiple steps like "open another tab with console enabled, figure out which request is slow, get timing" [21:45:22] Ryan_Lane: why would a restart help then? Do you mean that all those queries are all /pending/? 
[21:45:25] the memory leak is slow [21:45:28] Ryan_Lane: about a problem that I did not really care much about at the time [21:46:03] and yes, I love vague reports, people can join in with "me toos", others can look at "is everything all right" [21:46:36] once you mentioned login was the issue, the issue was obvious ;) [21:47:06] well, unfortunately I don't always have your knowledge that "stuff leaks, just need to restart it once in a while" kind of obvious stuff [21:48:07] I haven't had to restart opendj in ages due to a leak [21:48:07] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 9.002 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [21:48:07] I saw similar issues recently, can see a pattern, things leaking memory around, and not much care generally [21:48:08] there must be new queries occurring that are causing the issue. [21:48:14] when I see that, I restart the daemon, track down the queries and fix them [21:48:14] greg-g, Reedy: we updated Flow in 1.23wmf9, seems OK thanks! [21:48:26] leaving opendj stable for another 6 or so months till people start doing new odd queries [21:49:01] I don't see how the situation is any different from databases. bad queries cause problems. [21:49:22] track them down, fix them, maybe restart the server if it's in a bad state [21:50:08] spagewmf: whew, one thing didn't break! [21:50:42] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:50:56] andrewbogott: pdns on virt1000? [21:51:12] andrewbogott: after restarting opendj it's necessary to check pdns on both [21:51:24] I did... [21:51:28] I can't wait till we can switch away from pdns backed ldap [21:51:32] err ldap backed pdns [21:51:53] * Damianz notes domas is even more obnoxious than he is and raises his aspirations [21:52:36] Damianz: yes, sure, I call people douchebags on a first chance. For example I have no idea who you are, but I'm sure you're a douchebag. 
Great to meet you, sir [21:52:46] Damianz: hah. not a good role model [21:52:55] !blame [21:53:46] :( [21:53:52] wrong channel or something [21:54:03] wm-bot: slacker [21:54:23] it's learning from the bad example of morebots the AWOL repeated offender [22:03:31] !log aaron started scap: active Timing test [22:03:31] PROBLEM - Varnishkafka log producer on cp1047 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishkafka [22:03:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 8.754 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [22:03:53] Logged the message, Master [22:05:04] * Eloquence pets morebots [22:05:10] <^d> ori: You around? [22:05:46] ^d: yep [22:05:54] <^d> Think we could get https://gerrit.wikimedia.org/r/#/c/103768/ in today? [22:06:42] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:07:56] the heck? [22:08:09] paravoid: around? [22:08:31] RECOVERY - Varnishkafka log producer on cp1047 is OK: PROCS OK: 1 process with command name varnishkafka [22:09:24] (03PS3) 10Ori.livneh: Revert "Configure Varnish not to cache scholarship app reqs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/103768 (owner: 10BryanDavis) [22:10:26] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Configure Varnish not to cache scholarship app reqs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/103768 (owner: 10BryanDavis) [22:11:31] PROBLEM - Varnishkafka log producer on cp1047 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishkafka [22:12:27] (03PS1) 10Aaron Schulz: Removed called to undefined "die" method in scap-2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105101 [22:12:36] ori: ^ [22:12:37] AaronSchulz: i'm still on that change, btw [22:12:39] heh [22:12:41] (03CR) 10Ottomata: [C: 032 V: 032] Add support for %{Varnish:xid}x (X-Varnish: ..) 
[operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105093 (owner: 10Edenhill) [22:13:57] !log loaded VCL change from Id54312fd5 (scholarship app) on cp1043 & cp1044 (misc-eqiad) [22:14:05] ^d: ^ [22:14:14] <^d> :) [22:14:20] Logged the message, Master [22:14:40] (03CR) 10Ori.livneh: [C: 032 V: 032] Removed called to undefined "die" method in scap-2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105101 (owner: 10Aaron Schulz) [22:18:31] <^d> ori: Getting cache hits, pages are varying on Cookie as they should :) [22:18:37] ^d, ori: Scholarships looks like it's behaving [22:18:44] Jinx [22:18:46] <^d> Yep, looks all good to me. [22:18:47] <^d> :) [22:18:57] sweet! [22:23:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 9.291 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [22:25:39] !log aaron finished scap: active Timing test [22:25:57] Logged the message, Master [22:26:22] !log reedy updated /a/common to {{Gerrit|I2c2836d40}}: Rest of phase1 to 1.23wmf9 [22:26:26] (03PS1) 10Reedy: Remove loginwiki from echowiki dblist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105102 [22:26:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:26:52] Logged the message, Master [22:27:01] (03CR) 10Reedy: [C: 032] Remove loginwiki from echowiki dblist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105102 (owner: 10Reedy) [22:27:30] hm, 24.5min [22:27:43] (03Merged) 10jenkins-bot: Remove loginwiki from echowiki dblist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105102 (owner: 10Reedy) [22:28:33] !log reedy synchronized echowikis.dblist 'Remove loginwiki' [22:28:56] Logged the message, Master [22:35:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:31:45 PM UTC [22:35:28] RECOVERY - Varnishkafka log producer on cp1047 is OK: PROCS 
OK: 1 process with command name varnishkafka [22:37:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:31:45 PM UTC [22:37:52] (03PS4) 10BryanDavis: [WIP] Kibana puppet class [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 [22:39:16] !log aaron started scap: active Timing test [22:39:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:31:45 PM UTC [22:39:53] Logged the message, Master [22:40:28] RECOVERY - Puppet freshness on wtp1003 is OK: puppet ran at Thu Jan 2 22:40:19 UTC 2014 [22:41:43] is ULSFO serving traffic right now? I remember emails saying it was turned on for people in oceania. (this is for the signpost's tech report) [22:42:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:42:18] RECOVERY - SSH on virt1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:43:00] !log aaron finished scap: active Timing test [22:43:03] oop, those extra varnishkafka and/or varnishncsa alerts on cp1047 are me, Snaps and I are testing somethin [22:43:08] 4min [22:43:24] Logged the message, Master [22:43:28] PROBLEM - Varnishkafka log producer on cp1047 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishkafka [22:43:33] * AaronSchulz goes to figure out how much time was rsync and how much cdb rebuilding... [22:44:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:46:19] (03PS1) 10BryanDavis: Add kibana.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/105105 [22:46:19] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:48:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:50:02] AaronSchulz: build CDB somewhere, torrent it out! 
[22:50:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:50:36] AaronSchulz: HHVM is interesting in the way that one can ship only bytecode changes [22:50:42] actually, I think I'll just do another dsh phase for that with full fanout since the work isn't shared at all [22:51:01] should go from 4min to 20sec or so [22:51:17] you could do parallel build! [22:51:40] fan out different parts of the tree to different hosts, then merge them into single DB built, then distribute the file [22:51:44] or multiple .db subfiles [22:51:49] there're so many ways! [22:51:49] ;-) [22:52:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:52:38] RECOVERY - Puppet freshness on virt1007 is OK: puppet ran at Thu Jan 2 22:52:33 UTC 2014 [22:54:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:54:38] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:56:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [22:58:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [23:00:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [23:02:18] PROBLEM - Puppet freshness on wtp1003 is CRITICAL: Last successful Puppet run was Thu 02 Jan 2014 10:40:19 PM UTC [23:02:38] RECOVERY - Puppet freshness on wtp1003 is OK: puppet ran at Thu Jan 2 23:02:36 UTC 2014 [23:03:42] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.118 seconds response time. 
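domas's "parallel build, then merge" suggestion above is the standard shard/merge pattern. The sketch below uses plain dicts in place of real CDB files (the on-disk CDB format and scap's actual helpers are not shown in the log, so everything here is illustrative): each worker builds records for a disjoint slice of the tree, then the shards are merged into one map.

```python
# Sketch of fanning a localisation-cache build out over workers and
# merging the results. build_shard is a hypothetical stand-in for
# whatever per-file work scap's CDB rebuild actually does.
from multiprocessing import Pool

def build_shard(paths):
    """Build one shard of the key->value map from a subset of the tree."""
    return {p: 'record-for-%s' % p for p in paths}

def parallel_build(all_paths, workers=4):
    # Round-robin the inputs so every worker gets a disjoint slice.
    chunks = [all_paths[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        shards = pool.map(build_shard, chunks)
    merged = {}
    for shard in shards:  # disjoint inputs, so no key collisions
        merged.update(shard)
    return merged

if __name__ == '__main__':
    files = ['messages/en.json', 'messages/de.json', 'messages/fr.json']
    print(len(parallel_build(files, workers=2)))
```

As AaronSchulz notes, the simpler win is just giving the (unshared) rebuild its own full-fanout dsh phase; sharding the build itself is the heavier-weight version of the same idea.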
nagiostest.beta.wmflabs.org returns 208.80.153.219 [23:05:45] (03CR) 10BryanDavis: [WIP] Kibana puppet class (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 (owner: 10BryanDavis) [23:06:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:11:46] (03CR) 10BryanDavis: [WIP] Kibana puppet class (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 (owner: 10BryanDavis) [23:12:14] greg-g: I've never seen https://www.mediawiki.org/wiki/MediaWiki_1.XX/wmfNN/Changelog before, I added links to Reedy's novellas to https://www.mediawiki.org/wiki/MediaWiki_1.23 [23:13:01] RECOVERY - NTP on virt1007 is OK: NTP OK: Offset -0.04505157471 secs [23:13:55] spagewmf: cool, I sometimes try to pull out the important changes and put them at the top. [23:15:29] what's the script that generates the Changelog wikitext ? [23:15:45] it's in make-release [23:15:48] I believe [23:15:50] * greg-g looks [23:17:08] greg-g: make-deploy-notes/make-deploy-notes I think [23:17:14] there [23:17:19] sorry, was multitasking with my bank [23:18:01] (03PS1) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [23:18:36] (03PS2) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [23:22:40] paravoid: hi, happy new year :) [23:26:14] average: i don't think he's about today [23:26:20] (03PS3) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [23:26:55] Reedy: ah ok [23:37:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 9.478 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [23:40:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:48:41] PROBLEM - DPKG on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:51] PROBLEM - swift-container-server on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:51] PROBLEM - puppet disabled on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:51] PROBLEM - swift-object-updater on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:01] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:01] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:01] PROBLEM - RAID on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:01] PROBLEM - Disk space on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:01] PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:11] PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:11] PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:11] PROBLEM - swift-object-server on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:21] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:21] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:31] PROBLEM - swift-container-updater on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:31] PROBLEM - swift-account-server on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:49:44] so, no one added the ipb_parent_block_id index [23:49:56] no wonder that DELETE query shows up high in ishmael [23:50:01] must be a table scan each query [23:50:23] :( [23:50:31] omg nested deletes [23:50:31] PROBLEM - Swift HTTP backend on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:50:45] nested blocks that is [23:51:21] RECOVERY - Swift HTTP backend on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.062 second response time [23:54:41] RECOVERY - DPKG on ms-be7 is OK: All packages OK [23:54:41] RECOVERY - puppet disabled on ms-be7 is OK: OK [23:54:41] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [23:54:42] RECOVERY - swift-object-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [23:54:51] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [23:54:51] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [23:54:51] RECOVERY - RAID on ms-be7 is OK: OK: no disks configured for RAID [23:54:51] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [23:55:01] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [23:55:01] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:55:01] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [23:55:11] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper 
[23:55:11] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [23:55:22] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [23:55:22] RECOVERY - swift-account-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [23:59:05] (03PS1) 10Ori.livneh: Set $wgULSFontRepositoryBasePath to protocol-relative URL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105115 [23:59:19] okay, lightning deploy time! [23:59:27] * MaxSem will be first