[00:02:51] (CR) Dzahn: noc.pp - various lint fixes (6 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/139462 (owner: Dzahn)
[00:41:53] (PS1) Gage: filter changes to support messages from Hadoop [operations/puppet] - https://gerrit.wikimedia.org/r/140623
[00:42:49] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 21:42:32 UTC
[01:13:49] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 22:13:08 UTC
[01:23:26] (PS2) Dzahn: rancid.pp - lint and tidy, quoting, arrows, retab [operations/puppet] - https://gerrit.wikimedia.org/r/139464
[01:23:44] (CR) Dzahn: rancid.pp - lint and tidy, quoting, arrows, retab (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/139464 (owner: Dzahn)
[01:27:59] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /data 1521057 MB (3% inode=99%):
[01:31:49] PROBLEM - Puppet freshness on virt1008 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 22:31:32 UTC
[01:37:09] RECOVERY - Puppet freshness on virt1008 is OK: puppet ran at Thu Jun 19 01:37:04 UTC 2014
[01:53:35] MaxSem: Do you know where the server that powers releases.wikimedia.org is?
[02:01:17] (PS1) Catrope: Add myself to releasers-mediawiki [operations/puppet] - https://gerrit.wikimedia.org/r/140634
[02:13:09] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Jun 19 02:13:04 UTC 2014
[02:24:45] !log LocalisationUpdate completed (1.24wmf8) at 2014-06-19 02:23:42+00:00
[02:24:52] Logged the message, Master
[02:28:30] hey rob do we have a ftp server
[02:35:37] RoanKattouw, James_F lol: http://git.wikimedia.org/blob/operations%2Fpuppet.git/1b2bf694e60a05dc219ba499a0e150b2e191b642/manifests%2Fgerrit.pp#L328
[02:46:55] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-19 02:45:51+00:00
[02:46:59] Logged the message, Master
[03:02:57] RoanKattouw: I had the same q a while back, documented it on https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org
[03:03:07] update if no longer accurate :)
[03:34:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 19 03:33:36 UTC 2014 (duration 33m 35s)
[03:34:48] Logged the message, Master
[03:43:49] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 21:42:32 UTC
[05:14:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:17:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53269 bytes in 0.205 second response time
[05:17:27] (PS1) Springle: Additional labsdb federated tables for commonswiki_f_p, each already accessible via direct view on slice s4 commonswiki_p.
[operations/software] - https://gerrit.wikimedia.org/r/140644
[05:22:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:28:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53290 bytes in 0.408 second response time
[05:31:12] (PS1) Yuvipanda: Grant bearND ability to upload mobile releases [operations/puppet] - https://gerrit.wikimedia.org/r/140646
[05:31:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:32:13] (CR) BearND: [C: 1] Grant bearND ability to upload mobile releases [operations/puppet] - https://gerrit.wikimedia.org/r/140646 (owner: Yuvipanda)
[05:41:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53269 bytes in 0.122 second response time
[05:45:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:50:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53269 bytes in 0.318 second response time
[05:54:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:55:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 4.582 second response time
[06:04:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:09:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 4.634 second response time
[06:11:25] (PS1) Matanya: swift: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140654
[06:14:02] (CR) Matanya: [C: 1] apt/pin.pp - retab and mini quoting fix [operations/puppet] - https://gerrit.wikimedia.org/r/139458 (owner: Dzahn)
[06:14:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:15:25] (CR) Matanya: rancid.pp -
lint and tidy, quoting, arrows, retab (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/139464 (owner: Dzahn)
[06:21:05] (PS1) Matanya: kafka: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140655
[06:21:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 0.283 second response time
[06:25:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:34] (PS1) Matanya: kafkatee: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140656
[06:32:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 0.467 second response time
[06:36:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:20] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 2.015 second response time
[06:44:49] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 21:42:32 UTC
[06:47:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:54:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 0.365 second response time
[06:58:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:03:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 0.744 second response time
[07:08:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:09:29] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 6.150 second response time
[07:12:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:14:19] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK:
HTTP/1.1 200 OK - 53293 bytes in 0.332 second response time
[07:15:24] _joe_: can you please ^ ?
[07:18:29] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:29] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 7.825 second response time
[07:50:29] (PS1) Matanya: lucene: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140665
[07:51:16] (PS1) PleaseStand: Remove use of deprecated wfGetIP() [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140666
[07:56:57] (PS1) Matanya: redisdb: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140667
[08:00:39] (PS2) PleaseStand: Remove use of deprecated wfGetIP() [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140666
[08:08:28] * YuviPanda waves at hashar
[08:09:21] YuviPanda: :-]
[08:31:23] (PS8) Hashar: beta: bring in mediawiki/skins.git [operations/puppet] - https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868)
[08:32:12] (CR) Hashar: [C: 1] "_joe_ this can be merged anytime. It does not impact production and is already deployed on the beta cluster puppetmaster.
Thanks for the " [operations/puppet] - https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868) (owner: Hashar)
[08:32:28] (PS3) Hashar: beta: Add mediawiki/core/vendor to beta [operations/puppet] - https://gerrit.wikimedia.org/r/137463 (owner: BryanDavis)
[08:33:10] (CR) Hashar: "_joe_ same there, can be merged since it does not impact prod and it is already on the beta cluster :)" [operations/puppet] - https://gerrit.wikimedia.org/r/137463 (owner: BryanDavis)
[08:40:44] (CR) Giuseppe Lavagetto: [C: 2] beta: bring in mediawiki/skins.git [operations/puppet] - https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868) (owner: Hashar)
[08:43:07] _joe_: Bryan Davis has a few more patches for beta which I already reviewed
[08:43:11] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+owner:%22BryanDavis+%253Cbdavis%2540wikimedia.org%253E%22,n,z
[08:43:18] topic messages are prefixed with 'beta: '
[08:45:29] <_joe_> hashar: I'll take a look later, I'm brewing the puppet 3 patches for the varnishes
[08:46:02] \O/
[08:46:12] _joe_: we have varnishes box for bits/mobile/text/upload on beta cluster
[08:46:18] they might already be running puppet 3
[08:46:22] I am not sure
[08:46:31] <_joe_> hashar: they most surely are
[08:46:33] maybe the local puppetmaster got migrated already.
I can't remember
[08:46:59] <_joe_> it should
[08:47:23] so if you want to give a try to your patch, you can cherry pick them on deployment-salt.eqiad.wmflabs ( /var/lib/git/operations/puppet ) and run them on the varnish instances
[08:47:33] they are prefixed deployment-cache-XX
[08:47:48] (PS14) KartikMistry: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[08:48:57] (CR) jenkins-bot: [V: -1] cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[08:57:03] (PS15) KartikMistry: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[09:03:30] (PS1) Odder: Raise account creation limit for Telugu Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140669 (https://bugzilla.wikimedia.org/66822)
[09:10:53] (PS1) Giuseppe Lavagetto: puppet3: caches 1 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140671
[09:10:55] (PS1) Giuseppe Lavagetto: puppet3: caches 2 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140672
[09:10:57] (PS1) Giuseppe Lavagetto: puppet3: caches 3 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140673
[09:10:59] (PS1) Giuseppe Lavagetto: puppet3: caches 4 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140674
[09:12:46] _joe_: you're being too conservative for my taste :P
[09:16:44] <_joe_> I know
[09:17:10] <_joe_> I went to brew a coffee because I expected you to comment :)
[09:17:25] :P
[09:17:55] <_joe_> seriously, at least for varnish, lvs, dbs and dns I'd prefer to play it safe
[09:18:07] <_joe_> no reason to screw things up when we can avoid it
[09:18:31] <_joe_> I plan to do everything today and tomorrow
[09:24:06] (PS16) KartikMistry: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095
(owner: Nikerabbit)
[09:25:09] (PS2) Filippo Giunchedi: report swift global statistics to statsd [operations/puppet] - https://gerrit.wikimedia.org/r/139394
[09:27:11] (PS2) Giuseppe Lavagetto: puppet3: caches 1 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140671
[09:30:39] (PS1) Gage: Support logging via GELF, for sending to Logstash [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/140676
[09:30:53] (CR) Giuseppe Lavagetto: [C: 2] puppet3: caches 1 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140671 (owner: Giuseppe Lavagetto)
[09:32:45] (CR) Nikerabbit: [C: -1] cxserver configuration for beta labs (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[09:37:22] (PS2) Giuseppe Lavagetto: puppet3: caches 2 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140672
[09:39:06] (CR) Giuseppe Lavagetto: [C: 2] puppet3: caches 2 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140672 (owner: Giuseppe Lavagetto)
[09:39:15] (CR) Giuseppe Lavagetto: [V: 2] puppet3: caches 2 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140672 (owner: Giuseppe Lavagetto)
[09:44:14] (PS2) Giuseppe Lavagetto: puppet3: caches 3 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140673
[09:45:35] (CR) Giuseppe Lavagetto: [C: 2 V: 2] puppet3: caches 3 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140673 (owner: Giuseppe Lavagetto)
[09:45:49] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 21:42:32 UTC
[09:47:32] <_joe_> mmmh does this have to do with our outage yesterday?
[09:51:40] (PS1) Gage: Hadoop: supply JARs for GELF output, pass parameter [operations/puppet] - https://gerrit.wikimedia.org/r/140677
[09:52:58] (PS1) Matanya: cache: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140678
[09:58:09] (CR) QChris: "While it totally would make sense, I am not too sure about" [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata)
[10:00:29] (PS14) Reedy: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 (owner: Nemo bis)
[10:00:49] (PS17) KartikMistry: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[10:01:12] (CR) Reedy: [C: 2] Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 (owner: Nemo bis)
[10:01:23] (Merged) jenkins-bot: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 (owner: Nemo bis)
[10:04:22] !log reedy Synchronized wmf-config/: I248fa7b98a8a0eea943c6643d1bf9c2ed36296b8 (duration: 00m 15s)
[10:04:27] Logged the message, Master
[10:06:06] (PS1) Reedy: Add commonsuploads.dblist to noc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140681
[10:06:47] (CR) Reedy: [C: 2] Add commonsuploads.dblist to noc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140681 (owner: Reedy)
[10:06:53] (Merged) jenkins-bot: Add commonsuploads.dblist to noc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140681 (owner: Reedy)
[10:07:12] !log reedy Synchronized docroot and w: (no message) (duration: 00m 14s)
[10:07:16] Logged the message, Master
[10:09:03] !log reedy Synchronized docroot/noc: (no message) (duration: 00m 15s)
[10:09:08] Logged the message, Master
[10:09:49] (PS2) Giuseppe
Lavagetto: puppet3: caches 4 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140674
[10:10:57] (CR) Giuseppe Lavagetto: [C: 2 V: 2] puppet3: caches 4 of 4 [operations/puppet] - https://gerrit.wikimedia.org/r/140674 (owner: Giuseppe Lavagetto)
[10:11:01] ok, wth won't that work
[10:12:45] (PS1) Nemo bis: Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140684 (https://bugzilla.wikimedia.org/65389)
[10:13:08] I see correct navigation url on it.wiktionary.org
[10:14:43] I was meaning getting it to display in noc
[10:15:23] Ah ok. I was worrying that you said the config didn't work. :)
[10:15:36] * that you meant
[10:16:00] Bleugh, sync-docroot doesn't sync dblists it seems
[10:20:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin
[10:34:25] (PS1) Filippo Giunchedi: add swift eqiad-prod cluster dashboard [operations/puppet] - https://gerrit.wikimedia.org/r/140685
[10:39:52] <_joe_> \o/
[10:40:10] \o/²
[10:40:46] \\o o//
[10:42:15] FWIW I've tried the latest upstream version of gdash as a test, it is kinda nicer
[10:42:17] <_joe_> apt-get install cowsay
[10:42:58] I'll have an accessible url in labs shortly
[10:45:43] (PS8) Nuria: [WIP] Add backup role and scripts to wikimetrics [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric)
[10:46:32] (PS3) Nuria: Enable the new backup role if set [operations/puppet] - https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric)
[10:52:02] (CR) Reedy: [C: 2] Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140684 (https://bugzilla.wikimedia.org/65389) (owner: Nemo bis)
[10:52:08] (Merged) jenkins-bot: Disable local uploads where unused, per local consensus
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140684 (https://bugzilla.wikimedia.org/65389) (owner: Nemo bis)
[10:52:55] !log reedy Synchronized docroot and w: (no message) (duration: 00m 15s)
[10:52:55] akosiaris: cxserver config puppet patch is ready for review (we believe :)) https://gerrit.wikimedia.org/r/#/c/139095/ (CC Nikerabbit)
[10:53:00] Logged the message, Master
[10:53:41] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 15s)
[10:53:46] Logged the message, Master
[10:53:58] (PS4) Nuria: Enable the new backup role in wikimetrics if set [operations/puppet] - https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric)
[10:58:56] Reedy: not sync'ed yet on the wikis afaics
[10:59:11] It is
[10:59:27] I'm guessing it didn't touch InitialiseSettings like it was supposed to
[11:00:01] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 14s)
[11:00:06] Logged the message, Master
[11:03:10] kart_: yes I 've seen it. I will be reviewing it today
[11:15:28] Reedy: config is correctly reflected on cs.wikiversity (first batch) but not on wuu.wiki (second), both have no entry in groupOverrides
[11:18:51] What am I doing wrong :/
[11:21:52] e.g. http://gdash-latest.wmflabs.org/dashboards/swift.eqiad-prod/
[11:25:39] (CR) Filippo Giunchedi: "sample running (possibly temporarily) at http://gdash-latest.wmflabs.org/dashboards/swift.eqiad-prod/" [operations/puppet] - https://gerrit.wikimedia.org/r/140685 (owner: Filippo Giunchedi)
[11:25:43] !log reedy Synchronized commonsuploads.dblist: (no message) (duration: 00m 15s)
[11:25:48] Logged the message, Master
[11:26:18] !log reedy Synchronized wmf-config/: touch (duration: 00m 15s)
[11:26:23] Logged the message, Master
[11:29:56] now it worked!
[11:29:57] thanks
[11:30:05] ....
[11:30:17] all I did was touch the lot and resync
[11:43:14] akosiaris: thanks!
[12:46:49] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 21:42:32 UTC
[12:51:41] (CR) Jgreen: [C: 1] exim: sign with DKIM on the mail routers [operations/puppet] - https://gerrit.wikimedia.org/r/140584 (owner: Faidon Liambotis)
[12:52:02] thanks :)
[13:00:47] (PS1) Matanya: statistics: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140702
[13:00:54] one more to go
[13:01:05] <_joe_> !log re-enable puppet on lvs1003
[13:01:10] Logged the message, Master
[13:05:39] (PS1) Matanya: solr: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140703
[13:07:05] i'm done, now needs ops to review, and then we can move puppet to start voting! (I hope)
[13:07:15] *jenkins voting
[13:08:27] matanya: wow!
[13:08:30] matanya: for puppet lint?
[13:08:32] nicee
[13:08:43] for tabs at first stage
[13:08:51] and later on more lint checks
[13:09:26] (PS2) Faidon Liambotis: exim: sign with DKIM on the mail routers [operations/puppet] - https://gerrit.wikimedia.org/r/140584
[13:09:28] (PS2) Faidon Liambotis: mail: move wiki-mail-eqiad IP stanzas to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/140585
[13:09:30] (PS2) Faidon Liambotis: exim: add all of our domains to wikimedia_domains [operations/puppet] - https://gerrit.wikimedia.org/r/140586
[13:09:32] (PS2) Faidon Liambotis: exim: get rid of the implicit secondary MX feature [operations/puppet] - https://gerrit.wikimedia.org/r/140587
[13:09:34] (PS2) Faidon Liambotis: mail: add a root system alias to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/140588
[13:09:43] need to verify submodules too
[13:09:51] (CR) Faidon Liambotis: [C: 2] exim: sign with DKIM on the mail routers [operations/puppet] - https://gerrit.wikimedia.org/r/140584 (owner: Faidon Liambotis)
[13:10:04] (CR) Faidon Liambotis: [C: 2] mail: move wiki-mail-eqiad IP stanzas to site.pp [operations/puppet]
- https://gerrit.wikimedia.org/r/140585 (owner: Faidon Liambotis)
[13:10:30] (CR) Faidon Liambotis: [C: 2] exim: add all of our domains to wikimedia_domains [operations/puppet] - https://gerrit.wikimedia.org/r/140586 (owner: Faidon Liambotis)
[13:10:36] (CR) Faidon Liambotis: [C: 2] exim: get rid of the implicit secondary MX feature [operations/puppet] - https://gerrit.wikimedia.org/r/140587 (owner: Faidon Liambotis)
[13:12:11] <_joe_> Am I the only one hating auto linting programs like puppet lint?
[13:12:26] <_joe_> I think they're an advice, not an enforcement
[13:12:36] (CR) Faidon Liambotis: [C: 2] mail: add a root system alias to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/140588 (owner: Faidon Liambotis)
[13:12:59] <_joe_> if you enforce puppet-lint, you will find people doing crazy things and spending time just to overcome minor linting problems
[13:13:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin
[13:13:25] I think if that's the case, the linter or its configuration needs tweaking
[13:13:25] <_joe_> I prefer auto-discipline to tech enforcement
[13:13:41] if the linting rules are sane, people will converge rapidly on writing in a style that works with it
[13:13:46] <_joe_> bblack: yes a very liberal linter could help
[13:14:04] _joe_: puppet-lint is configurable
[13:14:08] <_joe_> bblack: I hate the stupid 'line with more than X chars' rules
[13:14:09] and jenkins too
[13:14:17] me too, we can ignore it
[13:14:18] <_joe_> matanya: I kinda know that ;)
[13:14:40] <_joe_> matanya: mine was a cultural comment (about preferring auto-discipline)
[13:14:47] but tabs are annoying
[13:14:56] _joe_: I do too, my normal terminal setup is 212 characters wide, and I like to use that real-estate
[13:15:04] <_joe_> I do run puppet-lint usually on my code while I write it
[13:15:07] <_joe_> 212?
[13:15:09] and should get -1 same as trailing white spaces
[13:15:09] <_joe_> wow
[13:15:27] but others might put 12 terminal panes on their screen and hate people that have lines > 66 characters or whatever
[13:15:39] it's something you have to decide on as a sane policy for a shared codebase
[13:15:49] RECOVERY - Puppet freshness on lvs1003 is OK: puppet ran at Thu Jun 19 13:15:48 UTC 2014
[13:15:49] <_joe_> bblack: my point is, in general I keep lines short
[13:16:09] <_joe_> but you can't conceive an hard rule for that without causing pain
[13:16:18] while, we all agree on some
[13:16:23] <_joe_> (think of a long exec command in puppet)
[13:16:28] my point is, I don't. I think my code is clearer when I go ahead and make a line 140 characters long if it needs to be, instead of breaking it into some cascading/wrapping structure just to obey the 80-rule
[13:16:59] i wasn't referring to those kind of rules
[13:17:03] but that's a minority viewpoint, and most coding standards disagree for supposedly-good reasons
[13:17:07] <_joe_> bblack: I found myself doing that to pass pep8 when that exception was not added, and it's not funny
[13:17:32] <_joe_> bblack: I'm almost always part of the minority that gets things right I guess :)
[13:18:17] but hey, you took away the discussion to things i wasn't referring to
[13:18:34] i was talking about tabs
[13:18:39] only tabs
[13:18:53] yeah I donno, it's an area ripe for research I'm sure. But I think, if you start with the given that people have wider terminals these days (which is a big if, apparently!), you're better off preserving vertical space to make a function visually shorter when reading it.
[13:19:26] <_joe_> matanya: don't get me wrong, the linting job you're doing is great.
[13:19:28] not that I'd advocate crunching multiple statements into one line pointlessly; I just don't advocate breaking up what is a naturally-lengthy line into an artificial sequence of several lines
[13:19:41] <_joe_> bblack: +1
[13:20:07] <_joe_> maybe I still have fortran PTSD after all these years
[13:20:21] <_joe_> wow it's like 7 years I don't have to write fortran
[13:20:23] I have a great idea, maybe you two review my patches ?
[13:20:26] <_joe_> I should celebrate
[13:20:36] lol
[13:20:48] "-1 Your puppet-lint config has lines >80 chars long"
[13:21:05] (PS1) Giuseppe Lavagetto: puppet3: lvs 1 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140704
[13:21:06] (PS1) Giuseppe Lavagetto: puppet3: lvs 2 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140705
[13:21:22] bblack: meh, death to that rule
[13:21:53] (PS1) Faidon Liambotis: mail: brown paper bag fix, fix spurious include [operations/puppet] - https://gerrit.wikimedia.org/r/140706
[13:22:11] (PS2) Giuseppe Lavagetto: puppet3: lvs 1 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140704
[13:22:20] (PS2) Faidon Liambotis: mail: brown paper bag fix, fix spurious include [operations/puppet] - https://gerrit.wikimedia.org/r/140706
[13:22:34] (CR) Faidon Liambotis: [C: 2] mail: brown paper bag fix, fix spurious include [operations/puppet] - https://gerrit.wikimedia.org/r/140706 (owner: Faidon Liambotis)
[13:22:44] I love how we wait minutes for jenkins
[13:22:50] <_joe_> yes
[13:22:51] and then it doesn't even catch the most obvious of errors
[13:22:54] _joe_: just reminding you not to bring puppet3 to trusty
[13:22:59] (CR) Faidon Liambotis: [V: 2] mail: brown paper bag fix, fix spurious include [operations/puppet] - https://gerrit.wikimedia.org/r/140706 (owner: Faidon Liambotis)
[13:23:10] related tidbit since we're talking about commits that refactor whitespace to conform to a linting standard
[13:23:12] <_joe_> matanya: yes
that's work for today, actually
[13:23:26] <_joe_> (making things work in trusty)
[13:23:39] ah, nice
[13:23:40] <_joe_> matanya: wasn't just the puppetmaster having issues on trusty?
[13:23:50] <_joe_> clients should be fine AFAIR
[13:23:55] should
[13:23:56] a lot of people get annoyed at "fix up the formatting" commits because they screw up history. e.g. you run "git blame", and find a commit that just formatted whitespace, then you have to look at that commit's history to find the real commit, etc
[13:24:05] there's actually a fix for that in git blame: https://coderwall.com/p/x8xbnq
[13:24:13] but should always explode in the face
[13:24:32] bblack: just use -w
[13:24:47] * YuviPanda agrees with line char limits being un productive
[13:24:49] even worse is onevar in JS, which forces you to declare all your variables at the top of each function
[13:24:51] (CR) Mark Bergsma: "People might be using these hostnames, so I think this warrants an announcement." [operations/dns] - https://gerrit.wikimedia.org/r/140136 (owner: Faidon Liambotis)
[13:24:51] like it's C89
[13:24:52] yeah that's what the link shows: -w and -M (to ignore moving lines around too)
[13:24:54] * YuviPanda rages a little as well
[13:25:15] * matanya replied before clicking :/
[13:26:57] (CR) BBlack: "3.13.0-30 does fix XPS distribution for bnx2 (and other drivers that use kernel-default queue selection). Waiting for all lvs100x reboote" [operations/puppet] - https://gerrit.wikimedia.org/r/140376 (owner: BBlack)
[13:29:42] _joe_: hit higher-numbered LVS's first (e.g. 3004, 4004, 1006?)
[13:29:57] <_joe_> bblack: ok
[13:29:59] as a general rule they tend not to be the active ones with traffic
[13:30:08] <_joe_> ok, changing that
[13:31:09] (CR) Anomie: Raise account creation limit for Telugu Wikipedia (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140669 (https://bugzilla.wikimedia.org/66822) (owner: Odder)
[13:31:40] (PS1) Faidon Liambotis: check_smtp: /really/ send the FQDN on HELO [operations/puppet] - https://gerrit.wikimedia.org/r/140707
[13:31:59] (CR) Faidon Liambotis: [C: 2 V: 2] check_smtp: /really/ send the FQDN on HELO [operations/puppet] - https://gerrit.wikimedia.org/r/140707 (owner: Faidon Liambotis)
[13:35:11] (PS1) Jgreen: wikimedia.community got lost in the shuffle [operations/puppet] - https://gerrit.wikimedia.org/r/140708
[13:37:19] (CR) Odder: Raise account creation limit for Telugu Wikipedia (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140669 (https://bugzilla.wikimedia.org/66822) (owner: Odder)
[13:37:33] (PS3) Giuseppe Lavagetto: puppet3: lvs 1 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140704
[13:37:51] (CR) Anomie: [C: -1] "I'd really want to see consensus for this that isn't just a handful of people on Meta.
Let's not give people the opportunity to bring out " [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: Odder)
[13:39:36] (PS4) Giuseppe Lavagetto: puppet3: lvs 1 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140704
[13:39:49] (CR) Giuseppe Lavagetto: [C: 2] puppet3: lvs 1 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140704 (owner: Giuseppe Lavagetto)
[13:39:59] (CR) Giuseppe Lavagetto: [V: 2] puppet3: lvs 1 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140704 (owner: Giuseppe Lavagetto)
[13:41:28] (CR) Nemo bis: "Anomie, the bug is currently marked "shell", hence your -1 appears to be out of process." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: Odder)
[13:42:44] (PS2) Faidon Liambotis: exim: add wikimedia.community to wikimedia_domains [operations/puppet] - https://gerrit.wikimedia.org/r/140708 (owner: Jgreen)
[13:42:52] (PS3) Faidon Liambotis: exim: add wikimedia.community to wikimedia_domains [operations/puppet] - https://gerrit.wikimedia.org/r/140708 (owner: Jgreen)
[13:43:02] (CR) Faidon Liambotis: [C: 2 V: 2] exim: add wikimedia.community to wikimedia_domains [operations/puppet] - https://gerrit.wikimedia.org/r/140708 (owner: Jgreen)
[13:50:50] (PS4) Chad: Tools: Install cmake [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[13:51:04] (CR) Chad: "Someone mind looking at this? Should be pretty trivial since it's rebased cleanly."
[operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[13:52:55] (CR) Yuvipanda: [C: 1] Tools: Install cmake [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[13:57:51] (CR) Alexandros Kosiaris: [C: 2] Move mirror maker argument checking into start func [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/140209 (owner: Ottomata)
[14:02:54] (PS2) Giuseppe Lavagetto: puppet3: lvs 2 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140705
[14:05:27] (CR) Giuseppe Lavagetto: [C: 2] puppet3: lvs 2 of 2 [operations/puppet] - https://gerrit.wikimedia.org/r/140705 (owner: Giuseppe Lavagetto)
[14:08:52] (CR) Dzahn: [C: 2] Tools: Install cmake [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[14:12:01] PROBLEM - Host elastic1017 is DOWN: PING CRITICAL - Packet loss = 100%
[14:12:03] (CR) Dzahn: [C: 2] apt/pin.pp - retab and mini quoting fix [operations/puppet] - https://gerrit.wikimedia.org/r/139458 (owner: Dzahn)
[14:12:38] (CR) Anomie: "Just because a handful of people mark a bug as "shell" doesn't mean that there can't be objections." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: Odder)
[14:12:46] <_joe_> matanya: I think you did identify all the trusty-incompatible facts, right?
[14:13:14] (03PS9) 10Nuria: [WIP] Add backup role and scripts to wikimetrics [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [14:15:14] mutante: ty [14:15:44] mutante: https://gerrit.wikimedia.org/r/#/c/140646/ would also be nice :) [14:16:38] (03PS1) 10Faidon Liambotis: Add rt.wikimedia.org to wikimedia_domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/140713 [14:16:56] (03CR) 10jenkins-bot: [V: 04-1] Add rt.wikimedia.org to wikimedia_domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/140713 (owner: 10Faidon Liambotis) [14:17:00] (03CR) 10Dzahn: "what does "marked shell" mean nowadays? isn't that from SVN times before Gerrit when every change had to be deployed manually and ops and " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: 10Odder) [14:17:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Please, no binaries in the puppet tree. A .deb package is way better for this" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140677 (owner: 10Gage) [14:17:47] (03PS2) 10Faidon Liambotis: Add rt.wikimedia.org to wikimedia_domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/140713 [14:19:07] (03CR) 10Faidon Liambotis: [C: 032] Add rt.wikimedia.org to wikimedia_domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/140713 (owner: 10Faidon Liambotis) [14:19:31] (03CR) 10Nemo bis: "It means that the reporter wasn't told to get wider consensus or whatever. 
Anomie has bugzilla access so he surely can update the bug as h" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: 10Odder) [14:23:01] (03CR) 10Dzahn: "could you please link to a ticket because this is an access request" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140646 (owner: 10Yuvipanda) [14:23:22] (03PS3) 10Filippo Giunchedi: report swift global statistics to statsd [operations/puppet] - 10https://gerrit.wikimedia.org/r/139394 [14:23:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] report swift global statistics to statsd [operations/puppet] - 10https://gerrit.wikimedia.org/r/139394 (owner: 10Filippo Giunchedi) [14:23:45] \o/ [14:24:15] mutante: sure, let me do that once bernd wakes up [14:24:19] now on to testing it :)) [14:27:21] (03CR) 10Manybubbles: [C: 04-1] "I'm going to block this for a technical reason - it'll tons more load on Cirrus and cause weird suggestions. I'm also unsure what it'll d" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: 10Odder) [14:29:08] (03CR) 10QChris: [C: 04-1] [WIP] Add backup role and scripts to wikimetrics (032 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [14:29:42] (03PS2) 10Yuvipanda: Grant bearND ability to upload mobile releases [operations/puppet] - 10https://gerrit.wikimedia.org/r/140646 [14:29:46] (03PS2) 10Dzahn: misc/management.pp - retab and lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/139460 [14:30:21] !log replacing failed disk slot3 es1006 [14:30:26] Logged the message, Master [14:32:54] (03PS1) 10Filippo Giunchedi: do not reassing password variable :( [operations/puppet] - 10https://gerrit.wikimedia.org/r/140716 [14:33:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] do not reassing password variable :( [operations/puppet] - 
10https://gerrit.wikimedia.org/r/140716 (owner: 10Filippo Giunchedi) [14:33:33] (03CR) 10Dzahn: [C: 032] misc/management.pp - retab and lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/139460 (owner: 10Dzahn) [14:34:03] <_joe_> matanya: I can't find to_a anywhere in our custom facts [14:34:42] mutante: added RT, I'll wait for tfinc to respond, I guess [14:35:41] YuviPanda: cool,thanks [14:35:46] mutante: yw [14:36:01] <_joe_> so, trying to convert to puppet 3 rcs*, which are trustys [14:36:01] <_joe_> I only find that in the rsync module [14:37:46] (03CR) 10Dzahn: [C: 031] solr: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140703 (owner: 10Matanya) [14:40:31] (03PS1) 10Filippo Giunchedi: do not wrap cron lines [operations/puppet] - 10https://gerrit.wikimedia.org/r/140718 [14:40:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] do not wrap cron lines [operations/puppet] - 10https://gerrit.wikimedia.org/r/140718 (owner: 10Filippo Giunchedi) [14:42:28] (03PS3) 10Dzahn: rancid.pp - lint and tidy, quoting, arrows, retab [operations/puppet] - 10https://gerrit.wikimedia.org/r/139464 [14:43:23] ottomata: do you have everything backed up on analytics1021? [14:46:59] (03CR) 10Andrew Bogott: "Looks good, just a couple of additional lint bits" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139782 (owner: 10Matanya) [14:50:24] (03PS2) 10Dzahn: noc.pp - various lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/139462 [14:50:54] manybubbles: So which of us would like to SWAT today? [14:50:59] (03Abandoned) 10Dzahn: noc.pp - various lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/139462 (owner: 10Dzahn) [14:51:25] anomie: best if you do it - I'm going to have to go jump on a train at 11:45 and might want to get there early so it doesn't leave me. [14:51:32] ok [14:51:35] sorry - it feels like you do it 3/4 of the time [14:51:48] no problem. I haven't really kept track, actually. 
[14:52:09] twkozlowski: Ping for SWAT in about 8 minutes [14:52:56] damn submodules .. warning: Failed to merge submodule modules/kafka (commits don't follow merge-base) [14:56:33] (03PS1) 10Giuseppe Lavagetto: puppet3: move rcs, to test trusty clients [operations/puppet] - 10https://gerrit.wikimedia.org/r/140720 [14:56:49] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet3: move rcs, to test trusty clients [operations/puppet] - 10https://gerrit.wikimedia.org/r/140720 (owner: 10Giuseppe Lavagetto) [14:57:11] (03PS2) 10Dzahn: labsdebrepo: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139782 (owner: 10Matanya) [14:57:38] (03PS3) 10Dzahn: labsdebrepo: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139782 (owner: 10Matanya) [14:58:14] anomie: I'm here. [14:59:23] (03PS4) 10Dzahn: labsdebrepo: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139782 (owner: 10Matanya) [14:59:50] (03CR) 10Andrew Bogott: [C: 031] labsdebrepo: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139782 (owner: 10Matanya) [15:01:06] (03CR) 10Dzahn: [C: 032] labsdebrepo: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139782 (owner: 10Matanya) [15:01:17] hmm, no jouncebot today? 
[15:01:20] * anomie starts SWAT [15:01:40] (03CR) 10Anomie: [C: 032] Raise account creation limit for Telugu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140669 (https://bugzilla.wikimedia.org/66822) (owner: 10Odder) [15:02:05] (03Merged) 10jenkins-bot: Raise account creation limit for Telugu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140669 (https://bugzilla.wikimedia.org/66822) (owner: 10Odder) [15:02:40] !log anomie Synchronized wmf-config/throttle.php: SWAT: Raise account creation limit for Telugu Wikipedia workshop on June 23 [[gerrit:140669]] (duration: 00m 15s) [15:02:42] twkozlowski: Not that it's testable, but ^ [15:02:44] Logged the message, Master [15:03:05] (03PS2) 10Anomie: Put testwiki namespaces in the right place [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140261 (owner: 10TTO) [15:03:11] (03PS1) 10Filippo Giunchedi: brown paper bag fix: send all the swift metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/140721 [15:03:14] (03CR) 10Anomie: [C: 032] "SWAT deploy" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140261 (owner: 10TTO) [15:03:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] brown paper bag fix: send all the swift metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/140721 (owner: 10Filippo Giunchedi) [15:04:03] (03Merged) 10jenkins-bot: Put testwiki namespaces in the right place [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140261 (owner: 10TTO) [15:04:14] (03CR) 10Dzahn: [C: 031] statistics: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140702 (owner: 10Matanya) [15:04:26] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Put testwiki namespaces in the right place [[gerrit:140261]] (duration: 00m 15s) [15:04:31] Logged the message, Master [15:04:41] oops [15:04:52] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Put testwiki namespaces in the right 
place [[gerrit:140261]] (duration: 00m 14s) [15:05:02] twkozlowski: ^ Test please [15:05:16] I confirm this works now :-) [15:05:35] twkozlowski: Not going to do the Help namespace patch, too many objections. [15:05:43] (03CR) 10Dzahn: [C: 031] redisdb: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140667 (owner: 10Matanya) [15:05:44] * anomie is done with SWAT [15:06:02] anomie: Yes, sure. Another time maybe. [15:06:12] anomie: Thanks for the help, really appreciated :-) [15:06:16] twkozlowski: no problem [15:08:18] (03CR) 10Dzahn: kafkatee: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140656 (owner: 10Matanya) [15:08:59] (03CR) 10Dzahn: [C: 031] kafka: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140655 (owner: 10Matanya) [15:11:15] (03CR) 10Dzahn: [C: 031] swift: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140654 (owner: 10Matanya) [15:13:50] !log removed old pmtpa swift stats from graphite [15:13:55] Logged the message, Master [15:19:55] (03CR) 10Dzahn: [C: 032] add wsa (wikistats admin) basic shell script [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140601 (owner: 10Dzahn) [15:20:01] (03CR) 10BryanDavis: "It would be nice to see what the gelf records emitted by hadoop look like to understand what mappings this will end up creating in logstas" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage) [15:20:24] (03CR) 10Dzahn: "just a labs thing, i should likely not even have the bot output it in -operations" [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140601 (owner: 10Dzahn) [15:20:59] (03Abandoned) 10Odder: Add Help namespace to default search on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139819 (https://bugzilla.wikimedia.org/66066) (owner: 10Odder) [15:22:09] (03CR) 10Dzahn: [C: 032] add 'add' feature to wikistats admin script [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140604 
(owner: 10Dzahn) [15:26:03] (03PS4) 10Hashar: contint: reduce duplication with mediawiki::packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 [15:26:24] (03CR) 10jenkins-bot: [V: 04-1] contint: reduce duplication with mediawiki::packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 (owner: 10Hashar) [15:26:38] (03CR) 10Hashar: "Amended commit message with:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 (owner: 10Hashar) [15:27:24] (03PS1) 10Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 [15:27:33] (03CR) 10jenkins-bot: [V: 04-1] Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [15:29:32] <_joe_> hashar: it's time to upgrade jenkins tests on puppet to puppet 3 [15:29:53] <_joe_> we have most of our mission-critical functions on puppet 3 [15:30:08] <_joe_> and tomorrow it's going to be 100% of the mission critical things [15:30:34] woot :) [15:30:52] does that include our Romulan Cloaking Device too? :) [15:32:37] _joe_: iirc puppet parser validate is run only run on gallium [15:33:01] (03PS2) 10Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 [15:33:37] (03CR) 10Dzahn: [C: 032] add maintenance functions for wikistats admins [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140605 (owner: 10Dzahn) [15:34:14] _joe_: yup confirmed. 
So if you upgrade puppet on gallium , the jenkins job will be upgraded as a result :] [15:34:40] (03CR) 10Nikerabbit: Enable ContentTranslation extension on beta labs (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [15:37:11] RECOVERY - Host elastic1017 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:37:23] (03Abandoned) 10Dzahn: retab update.php and sync live hack with repo [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140609 (owner: 10Dzahn) [15:39:11] PROBLEM - puppet disabled on elastic1017 is CRITICAL: Connection refused by host [15:39:52] PROBLEM - check if dhclient is running on elastic1017 is CRITICAL: Connection refused by host [15:39:52] PROBLEM - SSH on elastic1017 is CRITICAL: Connection refused [15:39:52] PROBLEM - Disk space on elastic1017 is CRITICAL: Connection refused by host [15:39:52] PROBLEM - check configured eth on elastic1017 is CRITICAL: Connection refused by host [15:40:01] PROBLEM - RAID on elastic1017 is CRITICAL: Connection refused by host [15:40:01] PROBLEM - DPKG on elastic1017 is CRITICAL: Connection refused by host [15:45:48] (03CR) 10Alexandros Kosiaris: [C: 032] Minor lint base::monitoring::host [operations/puppet] - 10https://gerrit.wikimedia.org/r/139332 (owner: 10Alexandros Kosiaris) [15:46:37] what's up with elastic1017? anyone working on it? [15:46:51] RECOVERY - SSH on elastic1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [15:47:13] I believe cmjohnson1 was replacing a disk? [15:48:01] i think that was analytics1021 [15:48:04] cmjohnson1: [15:48:33] <_joe_> hashar: great! will do ASAP [15:48:35] ah yeah you are right mutante, nevermind [15:48:38] oh, "allocate old solr boxen as analytics "... [15:48:42] must be that [15:49:15] no, neither.. ehm [15:49:15] _joe_: hopefully nothing else will break. [15:49:36] _joe_: i am heading out to catch my daughter. 
Will be back later tonight though [15:49:53] <_joe_> hashar: we can do that tomorrow [15:50:17] sure thing [15:50:31] * hashar wave [15:50:33] z [15:51:41] !log powercycling elastic1017 (went down and no console output) [15:51:47] Logged the message, Master [15:52:01] PROBLEM - NTP on elastic1017 is CRITICAL: NTP CRITICAL: No response from NTP server [15:52:33] (03PS2) 10Filippo Giunchedi: add swift eqiad-prod cluster dashboard [operations/puppet] - 10https://gerrit.wikimedia.org/r/140685 [15:52:51] PROBLEM - Host elastic1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:39] ^ had to reset the iDRAC itself as well, since then console output would do "something" again [15:55:54] before "connect com2" / "console com2" just didnt do anything [15:57:41] * Starting load fallback graphics devices [fail] [15:57:44] ? [15:58:01] RECOVERY - Host elastic1017 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:11:17] (03CR) 10Matanya: kafkatee: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140656 (owner: 10Matanya) [16:14:03] (03CR) 10Dzahn: kafkatee: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140656 (owner: 10Matanya) [16:24:55] (03CR) 10Matanya: kafkatee: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140656 (owner: 10Matanya) [16:39:49] greg-g: Around? [16:39:52] yep [16:40:03] You can guess what I need, I guess :D [16:40:26] Can slip in maybe half an hour before the MediaWiki train deploy? [16:40:31] * I sneak [16:41:12] what's up? [16:41:32] (03CR) 10BBlack: [C: 031] "I agree, but otherwise +1" [operations/dns] - 10https://gerrit.wikimedia.org/r/140136 (owner: 10Faidon Liambotis) [16:42:42] greg-g: We're not going to deploy a new branch with wmf10 today (delayed that until wmf11), but we would like https://gerrit.wikimedia.org/r/140344 to be deployed [16:42:46] it's a small JS only fix [16:43:30] is that all you guys ever have? 
small js fixes :) [16:44:19] greg-g: nice :) [16:44:19] hoo: but ok :) [16:45:10] greg-g: We had a lot of JS troubles in the past week. Thing I've deployed 3 or 4 js fixes in the past 7 days [16:45:26] :) [16:45:45] that's one of hte reasons we delayed the next deploy: To let stuff settle a bit [16:46:01] * greg-g nods [16:46:02] good call [16:47:20] mutante: elasti1017 was a reinstall...swapped the disk back to ssds [16:53:42] hoo: you wanna do it 30 minutes before the train window? [16:53:54] yeah [16:54:14] but can also be done during the window if that's more comfortable for you [16:54:31] I'd like to separate, honestly [16:54:32] cmjohnson1: ok,the DRAC reset was needed anyways it seems [16:54:44] i was on it [16:55:07] apergos: do you recall what the delay is before we're supposed to get an icinga warning about puppet freshness? [16:56:32] cmjohnson1: it wasn't doing the "currently in use" thing or showing blank screen, it was doing litreally nothing when sending the command [16:57:26] andrewbogott: 36000s [16:57:28] weird. no big deal. [16:58:07] apergos: ten hours? [16:58:16] um… mutante: ? [16:58:25] I thought it was closer to one [16:59:27] andrewbogott: i just see this [16:59:29] nagios.pp: $freshness = 36000, [17:00:03] so 1 hour [17:00:34] 60 * 60 = 3600 = one hour? [17:00:46] * andrewbogott is maybe not so good with multiplying... [17:02:05] andrewbogott: right, one more 0, 10 hours [17:02:17] i expected it to be quicker than that too [17:02:42] I think it should be 1 hour. But maybe there's a good reason [17:02:47] vaguely recalls apergos lowering it [17:03:00] or suggesting to lower it [17:03:02] right, I thought from 2 to 1 or similar [17:04:09] 10 hours sound exactly like the person who merged is asleep [17:04:13] so, yea [17:04:28] mutante: need anything else tabs wise ? [17:04:43] andrewbogott: you saw the ticket i made? [17:05:10] on virt1008 the check worked just fine [17:05:48] mutante: oh, ok. ticket #? 
[17:06:03] #7716 , you are added as requestor [17:06:17] but i just said.. it's still weird [17:06:24] because it cant be a global issue. it works on other boxes [17:06:34] greg-g, Reedy - the media viewer release to all wikis is happening today as part of the normal MW train, it looks like - if you'd like us to do it in a separate window, I can [17:06:37] or it was temp [17:07:01] RECOVERY - RAID on es1006 is OK: OK: optimal, 1 logical, 2 physical [17:07:04] marktraceur: it's fairly low risk today, right? given it's already on en and de? [17:07:15] Yeah, the big ones were commons and enwiki [17:07:15] So [17:07:19] We should be OK [17:07:24] * marktraceur knocked on wood [17:07:25] that was my reasoning for not separating, but, as I said before, separation is good (don't quote me out of context there) [17:09:30] matanya: i don't think so? the script was for manifest in $(find $basedir -name *.pp); do ... [17:09:42] so everything *.pp should cover it [17:09:44] (03CR) 10Alexandros Kosiaris: "The general concepts are OK. Some minor issues here and there and the big one is that it needs modularization and roles." (0312 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [17:10:13] So only submoudles still need attention [17:11:07] (03CR) 10Dzahn: [C: 032] kafkatee: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140656 (owner: 10Matanya) [17:12:33] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [17:16:48] cmjohnson1, hey [17:17:07] hi [17:17:09] RobH, hey i send another configuration please see picture [17:17:54] i think rob is running an errand atm [17:18:02] cmjohnson1, you ok with the last config [17:18:13] papaul: does it work better for you? 
[17:18:25] Yes much better [17:18:36] the big thing is not have the cables intertwined as they leave the cable managers [17:18:47] then go for that configuration [17:19:07] I like that the best personally [17:19:15] we didn't even think about doing that last week [17:20:46] cmjohnson1, i know was thinking about it last night and said i will try it once on site today [17:21:38] good idea. it looks good and is functional. let's move forward with it [17:21:52] did you see the RT ticket for you? new shipment arrived yesterday [17:21:55] papaul ^ [17:22:17] cmjohnson1, yes i took the ticket [17:23:03] greg-g: I'm read for wmf9... I guess the newly made wmf10 will automatically pick up the last commit [17:23:09] (wmf10 is not there yet) [17:23:15] @James_F friendly reminder about deploying visual editor to wikimania2014 wiki [17:23:26] err: /Stage[main]/Apache::Mod::Perl/Apache::Mod_conf[perl2]/Exec[ensure_present_mod_perl2]/returns: change from notrun to 0 failed: /usr/sbin/a2enmod perl2 returned 1 instead of one of [0] at /etc/puppet/modules/apache/manifests/mod_conf.pp:27 [17:23:29] cmjohnson1, i send you a pic the green cable and blue are separate check email and let me know [17:23:30] ori: akosiaris ^^ [17:23:42] edsaperia: Did that not get done? [17:23:43] ori: also, the breakage from yesterday is still unfixed [17:23:44] * James_F sighs. [17:23:56] edsaperia: I'll go grumble at people, thanks for the flag. [17:23:58] paravoid: machine ? [17:24:04] akosiaris: that was from iodine [17:24:09] Unless VE looks a lot like wikitext, I don't think so [17:24:10] i saw it that is good..biggest concern is coming out of the cable managers since some go up and some go down. as long as they're seperated [17:24:17] hoo: as long as you beat Reedy in making the branch :) [17:24:27] hoo: go ahead and get started [17:24:31] also papaul. 
use the 5' cables for both servers up and servers below...then the 7' [17:24:54] cmjohnson1, yes i am doing that [17:24:57] On my way :) [17:25:01] ok..cool [17:26:02] Reedy: just a heads up, hoo is doing yet another "small js fix" right now, he's merging to master and backporting to wmf9. The heads up part is: we should make sure it made the cut to wmf10 [17:26:04] (03PS1) 10Ori.livneh: apache::mod: fix perl module config name [operations/puppet] - 10https://gerrit.wikimedia.org/r/140744 [17:26:06] akosiaris: ^ [17:26:09] cmjohnson1, thanks [17:26:23] paravoid: what is still broken? [17:26:27] yw [17:26:44] ori: https://gerrit.wikimedia.org/r/#/c/140218/2 [17:26:53] ori: alex's comment specifically [17:27:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] apache::mod: fix perl module config name [operations/puppet] - 10https://gerrit.wikimedia.org/r/140744 (owner: 10Ori.livneh) [17:27:28] (03CR) 10Gage: "Have been discussing this over email with MWalker; here's an example event from Hadoop without any mutate or prune:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage) [17:27:29] > I am wondering however why remove sites-available/sites-enabled from the equation? What was the driving factor ? [17:27:41] it was missed by reviewers that this is what apache::vhost was *always* doing [17:28:30] it's not just that [17:28:39] ori: This did not have the intended result. The ensure => present lines have kept the files intact, as in they are still symlinks to sites-available. [17:28:53] akosiaris: example host? 
[17:28:56] file { 'foo': ensure => present, content => 'bar' }, when "foo" is a symlink to "bar", does nothing [17:29:07] it won't delete the symlink and create a new file with content bar [17:29:18] zirconium for one [17:29:26] well all of them [17:29:28] (03CR) 10Gage: "Summary of this patch:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage) [17:29:31] !log hoo Synchronized php-1.24wmf9/extensions/Wikidata/: Update Wikidata to fix the entity selector (duration: 00m 09s) [17:29:34] Logged the message, Master [17:29:42] ok [17:29:46] i get it now [17:29:48] i'll fix it [17:29:58] but this is unfair: [17:30:00] 00:23 < mutante> "The approach of the new apache module is to provision files [17:30:00] 00:23 < mutante> directly in sites-enabled rather than symlinks to files in sites-available." [17:30:00] 00:23 < paravoid> ewwww [17:30:01] sorry, I thought it was clear yesterday [17:30:02] 00:23 < mutante> is that right? [17:30:04] 00:23 < paravoid> bad ori [17:30:08] it's not the new apache module [17:30:18] it's the apache module i opposed introducing in the first place [17:30:47] this behavior was not someething i introduced [17:30:52] that is a comment to changeset that changes a bunch of sites-available/sites-enabled sites to sites-enabled [17:30:58] (03PS1) 10Andrew Bogott: Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140745 [17:31:10] mutante, ^ would replicate the breakage that we saw before, right? [17:31:19] Maybe I'm thinking about this all wrong [17:31:23] there's a reason we haven't migrated to the apache module, and that's because we knew it's crap [17:31:34] to the pre-refactor apache module, I mean [17:31:46] greg-g: Thanks again :) Verified the fix, everything looks fine. 
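Editorial note on the symlink point raised above: if a `file` resource with `ensure => present` leaves an existing sites-enabled symlink in place, the leftover symlinks have to be found before they can be replaced. A minimal sketch in Python, assuming the usual Debian-style Apache layout; the function name `find_leftover_symlinks` and the example path are hypothetical and are not part of any tooling mentioned in the channel.

```python
import os


def find_leftover_symlinks(sites_enabled):
    """Return the sorted names of entries in `sites_enabled` that are symlinks.

    Regular files (the layout the refactored module is said to want) are
    skipped; only entries that are still symlinks are reported.
    """
    leftovers = []
    for name in sorted(os.listdir(sites_enabled)):
        path = os.path.join(sites_enabled, name)
        # islink() inspects the entry itself and does not follow the link.
        if os.path.islink(path):
            leftovers.append(name)
    return leftovers


if __name__ == "__main__":
    # Hypothetical default location; adjust for the host being checked.
    directory = "/etc/apache2/sites-enabled"
    if os.path.isdir(directory):
        for entry in find_leftover_symlinks(directory):
            print(entry)
```

This only reports candidates; whether to delete the symlink and write a concrete file, or to teach the module to manage both forms, is the open question in the discussion.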
[17:31:48] and that was one of the reasons, fwiw [17:31:51] hoo: thank you [17:32:01] Reedy: everything's all clear :) [17:33:48] paravoid: that's not persuasive, because if everything was using that module it'd be trivial to update it to manage both symlink and concrete file [17:33:58] which i'm still suggesting we do [17:34:00] andrewbogott: right, i would think so, yes [17:34:06] hm... [17:34:16] ok, I will not merge that now, though, since I plan to not be working in 10 hours :) [17:34:17] it's always darkest before the dawn? :) [17:34:24] (03CR) 10Dzahn: [C: 031] "this should break it, +1 :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140745 (owner: 10Andrew Bogott) [17:34:38] andrewbogott: had to, it's so rare that breakage is intentional [17:34:43] :) [17:34:44] ori: I think we had this discussion yesterday; this isn't what I was complaining about now [17:34:56] ori: I'm saying that this patch doesn't do what is intending to do [17:35:06] yeah, sadly that piece i just now groked [17:35:10] and didn't quite process before [17:35:16] it's okay [17:35:17] but i'll start working on a fix now [17:35:27] omfg [17:35:48] ? [17:36:03] still, i expected the death of generic-definitions.pp & co to be met with, i dunno, less lamentation and gnashing of teeth and public ewwwing :P [17:36:07] Reedy: scary [17:37:03] ori: I'm appreciative of the general effort, I don't like that particular change so much; your explanation makes it better, but it wasn't documented on the commit message, so how was I supposed to know that your intention was for it to be temporary? [17:37:26] ori: I did not publicly ewww. In fact I did not even privalety ewww. I got scared TBH, but well, it needed to be done :-) [17:37:39] (I did :) [17:37:45] i'm just being a drama queen [17:37:54] i vented at mutante too yesterday [17:38:01] i should stop sulking and fix the bug [17:38:03] bbiab [17:38:40] " ori: I did not publicly ewww. In fact I did not even privalety ewww." 
is surely quip-worthy btw [17:39:13] ori's self awareness is good. [17:39:27] if not a little self-deprecating ;) [17:39:33] ori: i just ewwed because of the labs users having their configs deleted [17:39:51] and you found a good solution, getting them out of the file bucket [17:40:17] i still haven't emailed the list about it [17:40:21] argh, too many things [17:40:21] oh ouch [17:40:24] I didn't see that coming [17:40:40] that's a bit nasty [17:40:46] why where labs users config files deleted ? [17:40:46] yes, several users had files in sites-enabled that were not in puppet [17:40:48] mutante: if i call a var in files and the var is calling another var in a .pp file everything will break ? [17:40:52] aaaah [17:40:53] ouch [17:41:08] akosiaris: because they didnt puppetize.. and yea. there are quite a few apache setups [17:41:25] it was a significant historic occasion, namely the first time puppet's filebucket functionality was useful for anyone [17:41:34] haha [17:41:42] hehe [17:41:46] matanya: a) vars are not calling vars b) pastebin ? [17:41:58] and we can answer [17:42:00] ori: lol [17:42:00] hm, but still, maybe the module needs to allow for the usecase of manual config [17:42:10] well I used it a couple of times before too [17:42:11] we don't do that in prod, but we do allow it in labs [17:42:21] and it's probably better for those users to use the apache module too [17:42:40] and yea, i would prefer to not put files directly in sites-enabled but that's less important [17:43:28] we also had this unfortunate combination.. puppet breakage on the master and no monitoring alarm (maybe it was less than 10 hours) [17:43:40] see andrewbogott's comments above as well about lowering that (again?) 
[17:44:04] so then when the master was fixed we got those other issues [17:44:15] that had not been applied before [17:44:20] imo it shouldn't be lowered; we should instead use puppet reporters to emit alert on any failed run [17:44:30] maybe warning for the first failure, crit for second [17:44:41] that's what I emailed about [17:44:42] to accommodate really ephemeral failures [17:44:48] i think 10 hours means that the person who changed something is almost guaranteed to be off when it triggers [17:45:15] mutante: heh, i didn't think about that but 10 hours does seem almost fiendish in that way [17:46:11] akosiaris: https://tools.wmflabs.org/paste/view/c130d789 [17:46:41] paravoid: oh, last thing [17:46:44] paravoid: speaking of ewwwing [17:46:56] https://github.com/facebook/hhvm/pull/2988 [17:47:00] is it bad that i'm proud of that? :P [17:47:39] greg-g: Mind if I do a quick sync to fix a botched SWAT from yesterday? [17:47:51] (forgot to run git submodule update for the submodule inside of VE) [17:48:08] ori, can you tell me more about 'instead use puppet reporters…'? [17:48:23] That's how I handle the puppet status field on wikitech; but i haven't thought much about how to integrate that with icinga [17:48:30] (03Abandoned) 10Chad: Remaining wikis other than enwiki and commonswiki to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [17:48:46] matanya: templates. Turn it into a template and have all ${variables} to <%= @variables %> style [17:48:46] (03Restored) 10Chad: Remaining wikis other than enwiki and commonswiki to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [17:49:16] I really don't know why it isn't already a template, it should be. Your variables are already in the class so you should be ok [17:49:41] akosiaris: I thought about that at first, but will i need to use lookup ? 
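Editorial note on the alerting idea above (warn on the first failed Puppet run, go critical on the second, so that one-off failures do not page anyone): the rule itself can be sketched in a few lines of Python. This is an illustration only. The function `alert_level` and the list-of-booleans run history are hypothetical, and the sketch does not use Puppet's report processor API or Icinga's check interface.

```python
def alert_level(run_results):
    """Map a host's Puppet run history to an alert level.

    `run_results` is a list of booleans, oldest first, where True means the
    run succeeded. Only the trailing streak of failures matters.
    """
    consecutive_failures = 0
    for succeeded in reversed(run_results):
        if succeeded:
            break
        consecutive_failures += 1

    if consecutive_failures == 0:
        return "OK"
    if consecutive_failures == 1:
        # A single failure may be ephemeral, so only warn.
        return "WARNING"
    return "CRITICAL"


if __name__ == "__main__":
    print(alert_level([True, True, False]))   # WARNING
    print(alert_level([True, False, False]))  # CRITICAL
    print(alert_level([False, False, True]))  # OK
```

Counting consecutive failures rather than elapsed time means the alert fires on the run that broke, not hours after the person who merged the change has gone offline.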
[17:50:02] nope [17:50:16] webrequest_log_directory is defined a couple of lines above [17:50:31] right, but it points to another var [17:50:42] that is not an issue? [17:50:52] andrewbogott: i think the answer to how long it takes might be that we have it defined in multiple places / remnants [17:50:56] $freshnessinterval = $interval * 60 * 6 [17:51:01] matanya: nope [17:51:08] the latter is in modules/base/manifests/init.pp [17:51:12] the former in misc/nagios.pp [17:51:39] thanks, will do. any preferred location in the template dir ? [17:52:11] templates/kafkatee ? [17:52:19] modules/kafka/templates/ ? [17:52:33] is it a separate thing from the kafka module? [17:52:34] mutante: it is the role class that has the resource [17:52:39] not the module class [17:52:43] oh, ok [17:52:52] gtg, c ya [17:53:08] thanks akosiaris bye [17:54:08] mutante: Ah, so probably apergos meant to change it to an hour [17:55:25] andrewbogott: oh wow.. a third place? [17:55:26] files/icinga/icinga.cfg:host_freshness_check_interval=60 [17:56:12] nevermind, that is unrelated "freshness" [17:57:23] (03PS3) 10Chad: Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 [17:58:14] andrewbogott: so, it is 20 minutes [17:58:23] 72 ## This is in mins. Do not set this to 0 or > 60 73 $interval = 20 [17:58:33] 76 $freshnessinterval = $interval * 60 * 6 [18:00:05] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140619T1800) [18:01:15] (03PS1) 10Matanya: kafkatee: convert logrotate script into a template [operations/puppet] - 10https://gerrit.wikimedia.org/r/140749 [18:01:26] (03PS2) 10MarkTraceur: Remove completed surveys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138634 [18:02:00] Reedy: Hey sorry I merged something into wmf/1.24wmf10 while you were checking it out on tin, so could you git pull after you're done? 
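Editorial note on the freshness arithmetic discussed above: the figures quoted in the channel can be checked directly. The sketch below uses only the values pasted in the log (`$freshness = 36000`, `$interval = 20`, `$freshnessinterval = $interval * 60 * 6`). It assumes both thresholds are expressed in seconds, which the log suggests but does not state for the second one; the helper name `seconds_to_hours` is hypothetical.

```python
def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600


# Value quoted from nagios.pp in the log.
nagios_freshness = 36000

# Values quoted from modules/base/manifests/init.pp in the log:
# $interval is in minutes, and the threshold is $interval * 60 * 6.
interval_minutes = 20
base_freshness = interval_minutes * 60 * 6

print(seconds_to_hours(nagios_freshness))  # 10.0
print(seconds_to_hours(base_freshness))    # 2.0
```

Under that assumption the two definitions disagree: ten hours in one place and two hours (six 20-minute run intervals) in the other, which fits the impression in the channel that the threshold is defined in more than one place.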
[18:02:21] Reedy: Also, wmf9's extensions/VisualEditor needs a git submodule update (was forgotten during yesterday's SWAT) [18:02:46] (03PS2) 10Matanya: kafkatee: convert logrotate script into a template [operations/puppet] - 10https://gerrit.wikimedia.org/r/140749 [18:02:57] (03PS1) 10Chad: Move commons over to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140752 [18:02:59] (03PS1) 10Chad: Move remaining pool 3 wikis to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140753 [18:03:01] (03PS1) 10Chad: Pool 2 wikis (dewiki, frwiki, jawiki) get Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140754 [18:03:03] (03PS1) 10Chad: enwiki gets Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140755 [18:03:23] No what's interesting, ^d! [18:03:26] (03CR) 10jenkins-bot: [V: 04-1] Remove completed surveys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138634 (owner: 10MarkTraceur) [18:03:26] that's* [18:03:57] <^d> I was just getting eager and prepping a ton of changes. Only one I'll get soon is the first one. [18:03:58] <^d> :) [18:04:15] ^d: When are you planning to merge this? [18:04:29] (03PS3) 10MarkTraceur: Remove completed surveys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138634 [18:04:38] Well, all of these changes; all are newsworthy for me. [18:05:12] papaul: i like it! [18:05:17] <^d> https://gerrit.wikimedia.org/r/#/c/136338/ will likely go sometime next week. [18:05:25] sorry, was getting my replacement moto x for the one i broke [18:05:30] <^d> Rest is dunno. 
[18:06:06] RobH: LOL i read it: was getting my replacement motd for the one i broke [18:06:09] * Nemo_bis happy [18:06:12] (03PS1) 10Chad: Remove "Cirrus as alternative" switches [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140756 [18:06:14] (03PS1) 10Faidon Liambotis: wikimedia_domains: add more domains, per OTRS db [operations/puppet] - 10https://gerrit.wikimedia.org/r/140757 [18:06:26] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140758 [18:06:28] was thinking, how can he replace a motd ... ? [18:06:28] (03PS1) 10Reedy: testwiki to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140759 [18:06:48] motd breaking would be less heartbreak [18:06:54] this will be the first phone i have EVER broken [18:07:01] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140758 (owner: 10Reedy) [18:07:04] and i never put my electronics in any kind of case. [18:07:07] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140758 (owner: 10Reedy) [18:07:18] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140759 (owner: 10Reedy) [18:07:25] (03Merged) 10jenkins-bot: testwiki to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140759 (owner: 10Reedy) [18:07:35] but, it turns out spending 25 USD extra on a custom wooden backing on the phone makes it 'customized' and thus covered under an additional one year warranty for 'breaking yer shit' [18:07:43] ^d: so what does https://gerrit.wikimedia.org/r/#/c/136338/3/wmf-config/InitialiseSettings.php do because it's not immediately clear to me? [18:07:47] or else i would be replacing the screen on my own. 
[18:07:50] matanya: #2895 #6857 :) [18:07:57] nice one RobH [18:08:00] !log reedy Started scap: testwiki to 1.24wmf10 and build l10n cache [18:08:00] looking mutante [18:08:04] Logged the message, Master [18:08:06] (03CR) 10Jgreen: [C: 031] wikimedia_domains: add more domains, per OTRS db [operations/puppet] - 10https://gerrit.wikimedia.org/r/140757 (owner: 10Faidon Liambotis) [18:08:51] oh, you can break motd :D [18:08:52] <^d> twkozlowski: It turns it on for default everywhere, except the wikis set to false. [18:09:00] matanya: yes:) [18:09:11] ^d: as primary? [18:09:15] <^d> Yep. [18:09:15] net spilt :/ [18:09:50] (03PS7) 10Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) [18:09:57] <^d> twkozlowski: Note that I'm deleting cirrus.dblist, which already are all set to primary. [18:10:26] ^d: But Cirrus was already primary on a lot of wikis, so that's a small change, I think? [18:10:40] whoops, comment conflict. [18:10:46] <^d> Yep. [18:10:59] <^d> So the change is (all.dblist - deleted cirrus.dblist - wikis set to false) [18:11:03] mutante: https://gerrit.wikimedia.org/r/#/c/140749/ [18:11:13] Well! 
\o/ anyway :-) [18:11:32] (03PS2) 10Faidon Liambotis: wikimedia_domains: add more domains, per OTRS db [operations/puppet] - 10https://gerrit.wikimedia.org/r/140757 [18:11:34] (03PS1) 10Faidon Liambotis: exim: only route to valid OTRS addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/140761 [18:12:21] (03CR) 10Faidon Liambotis: [C: 032 V: 032] wikimedia_domains: add more domains, per OTRS db [operations/puppet] - 10https://gerrit.wikimedia.org/r/140757 (owner: 10Faidon Liambotis) [18:13:16] (03CR) 10Jgreen: [C: 031] exim: only route to valid OTRS addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/140761 (owner: 10Faidon Liambotis) [18:13:23] (03CR) 10Faidon Liambotis: [C: 032] exim: only route to valid OTRS addresses [operations/puppet] - 10https://gerrit.wikimedia.org/r/140761 (owner: 10Faidon Liambotis) [18:13:54] _joe_: do you have time to untangle a puppet3-related apt problem? dynamicproxy-gateway.eqiad.wmflabs isn't coping well [18:14:05] Reedy: This is your friendly reminder that media viewer enablement on all wikis is part of your charge this week. :) [18:16:37] (03CR) 10Nemo bis: "Sent a notice: http://lists.wikimedia.org/pipermail/wikilovesmonuments/2014-June/007258.html" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140757 (owner: 10Faidon Liambotis) [18:16:53] Nemo_bis: you rock [18:17:13] this is just awesome, thanks [18:17:27] (03CR) 10BryanDavis: "If you and Matt are happy with the resulting log events, I don't see any reason to block these tweaks." [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage) [18:18:10] :) [18:19:57] ^d: Yay! 
comm says 70 new wikis [18:22:27] <^d> :D [18:23:46] ffs [18:25:59] if Reedy uses "ffs" or "omfg" that should be an icinga alarm [18:26:44] mutante: Also /no( no)+/ [18:26:54] Internet dropped out [18:27:00] Loads of processes still syncing files [18:27:19] Honestly, Reedy saying much of anything in this channel during a deploy is a bad sign [18:29:52] I'm the only one logged into bast1001? o_0 [18:37:20] !log reedy Started scap: scap 1.24wmf10 take 2... [18:37:25] Logged the message, Master [18:37:38] (03PS1) 10Nuria: Reports in prod should be stored on redis 30 days [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140764 (https://bugzilla.wikimedia.org/63664) [18:37:45] !log neon, logstash100x, zirconium, stat1001, netmon1001: replaced sites-enabled symlinks with their targets and forced puppet-run to clean up after Iddc778a28 [18:37:50] Logged the message, Master [18:37:55] ^ akosiaris, mutante, paravoid [18:47:00] Reedy: ok, well, I'm going to go to the closest place that sells SSDs of any quality, which is a 25 minute drive away, so, be back in a little over an hour :/ [18:47:15] maybe longer given lunch time traffic :/ [18:47:28] marktraceur: ^^ fyi to you too [18:47:34] greg-g: Central Computers [18:47:37] on Howard [18:47:39] mutante: I'm in petaluma [18:47:44] ah, i see [18:48:04] central computers would be a 2 hour round trip :) [18:48:05] and i'd say probably Intel SSDs .. [18:48:21] gotcha, yea [18:48:38] you would think S.F. had more than that one real computer store [18:48:50] was surprised by that [18:48:58] K [18:49:08] (03PS2) 10Nuria: Reports in prod should be stored on redis 30 days [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140764 (https://bugzilla.wikimedia.org/63664) [18:49:21] mutante: or a place to buy/browse computer books, for that matter [18:51:37] ori: maybe people just all have pirated O'Reilly PDFs anymore :p [18:52:51] ori: how is the public library ? IT section? 
[18:53:00] greg-g: may i request you to support my access request ? [18:53:14] matanya: sure, leave a message, be back in a little over an hour [18:53:17] mutante: you know, i'm embarrassed to say i haven't tried [18:53:20] relevant RT or what not [18:53:27] Hey folks. I'm getting some bad ping to bast1001.wikimedia.org (around 300 ms). Anyone else having issues? [18:53:31] ok, thanks greg-g [18:54:22] ori: i know they let you take ebooks from the library, but you can only "take" it if the current copy has been returned .. it's so funny :) [18:54:27] ori: i have about 1GB of IT books purchased interested ? [18:54:50] "returning" a PDF so somebody else can then have it :p [18:54:57] matanya: #wikimedia-warez [18:55:08] ori: purchased [18:55:10] !xdcc send matanya-bot #4 [18:55:24] heh [18:55:36] matanya: thanks (really) but i like browsing physical books sometimes [18:55:43] the barnes & noble at union sq in nyc was great for that [18:56:15] funny my friend who just moved out there has the same complaint [18:56:25] (from NY) [18:56:30] I'm like - people read books?! [18:56:37] sure [18:56:43] milimetric: *browse* [18:56:48] browse and make elaborate plans to read [18:56:57] but not _actually_ read, obviously [18:57:23] just spend some quality time imagining what it'd be like, to, say, know C# like the back of your hand [18:59:43] andrewbogott: i haven't replied to your question from earlier because it turns out it's a bit more complicated than i imagined [19:00:19] !log reedy Finished scap: scap 1.24wmf10 take 2... (duration: 22m 59s) [19:00:24] Logged the message, Master [19:00:32] ori: OK. I have in mind to look into it at some point as well… writing a reporter is easy, I'm just not sure what the next step is. [19:00:34] andrewbogott: i'm looking for a straightforward way to push an unscheduled alert to icinga [19:00:36] http://sfpl.org/index.php?pg=0000000301 [19:00:48] eLibrary [19:00:56] ori: that's a passive check, isn't it? 
[19:01:00] i'm not sure [19:01:20] ori: you can send a custom notification via the web ui [19:01:22] are passive checks only for submitting positive results? [19:01:25] (03PS1) 10Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140768 [19:01:29] like "i'm still ok"-type results? [19:01:45] oh, I see. I'm not sure either [19:01:50] if so, then it wouldn't work. but if a passive check can actively (ahem) report a failure then it could work [19:02:05] i'm seeing some evidence of people using some "ncsa" plugin [19:02:15] ncsa is not an acronym i've seen often in the past, say, 20 years [19:02:19] ori: yes, use nsca for that, we have it in use [19:02:26] puppetized some of that [19:02:32] ah, cool [19:02:36] so you have 2 kinds of passive checks [19:02:37] yeah, so it does look like ncsa can do that [19:02:39] snmp and nsca based [19:02:40] ori: I have a hard copy of the "Essential System Administration" 1st edition from 1991 [19:02:44] passive checks can submit any result :) [19:02:59] (03Abandoned) 10Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140768 (owner: 10Reedy) [19:03:00] they just have the case of staleness i.e. implied bad state after a certain amount of time [19:03:05] (03PS1) 10Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140769 [19:03:11] Reedy: Double-check that you're aware of the media viewer thing, I don't think I got an ACK from you [19:03:21] ori: it's a choice of snmptrap or nsca [19:03:25] so that could even be a post-run command [19:03:27] for puppet [19:03:30] mutante: why do we use both? [19:03:30] but besides that, what chase said [19:03:36] that receives the exit status and submits a check [19:03:42] mutante: 'custom notification via the web ui' meaning a rest api? 
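[Editor's note] A minimal sketch of the idea being discussed: a post-run command that receives the Puppet exit status and submits a passive check result, so a failed run is reported immediately instead of through staleness. It assumes Puppet is run with `--detailed-exitcodes` and that the resulting line is piped to `send_nsca`, which reads tab-separated host, service description, return code and plugin output on standard input. The service name "Puppet run" and both function names are hypothetical; this is not the existing freshness check.

```python
# Nagios/Icinga check states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def puppet_exit_to_state(exit_code):
    """Map a `puppet agent --detailed-exitcodes` status to a check state."""
    if exit_code in (0, 2):
        # 0: no changes, 2: changes applied successfully
        return OK, "puppet run succeeded"
    if exit_code in (1, 4, 6):
        # 1: run failed, 4: some resources failed, 6: changes and failures
        return CRITICAL, "puppet run failed (exit %d)" % exit_code
    return UNKNOWN, "unexpected puppet exit status %d" % exit_code


def nsca_line(host, service, exit_code):
    """Build one passive check result line in send_nsca's input format."""
    state, message = puppet_exit_to_state(exit_code)
    return "%s\t%s\t%d\t%s\n" % (host, service, state, message)


# Example: a run on db1006 where some resources failed.
print(nsca_line("db1006", "Puppet run", 4), end="")
```

Successful runs still submit an OK result, so the existing staleness threshold would keep covering hosts where Puppet never runs at all.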
[19:03:47] as opposed to only submitting an ok status on successful runs [19:03:47] ori: that's what the puppet freshness check is [19:03:51] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140769 (owner: 10Reedy) [19:03:53] because that's what the labs reporter does now, could be trivial to reuse in prod [19:03:55] mutante: ACK [19:04:17] You mean marktraceur I hope [19:04:20] ori: and now i see also HTML 4.01 transition guide, dated 1999 want one of those ? :D [19:04:28] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140769 (owner: 10Reedy) [19:04:31] * marktraceur assumes, stops worrying [19:04:32] mutante: but it doesn't execute on failed runs, right? [19:04:37] andrewbogott: no, just by clicking i meant, but that's just for sending a custom text [19:04:42] mutante: it signals failure via the absence of signal [19:04:45] marktraceur: aye [19:04:52] ori: that is correct [19:05:00] dang [19:05:04] say you have cron, and the result is submitted as a passive [19:05:09] ori: that's where the staleness comes into play, it becomes CRIT if for a certain time it does not get ACKs anymore [19:05:18] yea [19:05:21] so why don't we make a simple change to it [19:05:52] at my last job we had a generic cron wrapper that all puppet crons used [19:05:54] chasemp: nsca has encryption [19:06:00] so that it still behaves the way it does currently (submits an all-clear on successful runs, the absence of which for any length of time is a signal of failure) [19:06:09] chasemp: and we started using it for fundraising [19:06:13] but that also just immediately reports failed runs [19:06:20] Reedy: I'll be at lunch soon, tgr is your man if you need it [19:06:27] if the run just failed you don't need to wait for time to elapse to know that puppet is failing [19:06:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 
1.24wmf9 [19:06:29] that seems silly [19:06:33] Logged the message, Master [19:06:44] mutante: assuming that was about why both. I really meant why use snmptt at all? :) [19:07:29] marktraceur: when is soon? I was just thinking of hitting the curry track [19:07:43] I'm leaving in about 5 minutes [19:07:52] ori, mutante sorry to get in on the convo last minute, but yes this is the exact use case for passive checks :) [19:07:56] But probably Reedy won't get to the deploy for a bit? [19:08:00] chasemp: yes it was, and ..not sure.. historic reasons (tm) [19:08:25] ok, I'll stick around then [19:08:38] mutante: got it thanks. classic tm. [19:08:47] chasemp: makes sense [19:09:15] manifests/misc/icinga.pp: file { '/etc/icinga/nsca_frack.cfg': [19:09:15] (03PS1) 10Reedy: group0 to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140770 [19:09:51] PROBLEM - check configured eth on pc1002 is CRITICAL: Timeout while attempting connection [19:10:01] PROBLEM - Disk space on pc1002 is CRITICAL: Timeout while attempting connection [19:10:01] PROBLEM - check if dhclient is running on pc1002 is CRITICAL: Timeout while attempting connection [19:10:11] PROBLEM - MySQL disk space on pc1002 is CRITICAL: Timeout while attempting connection [19:10:11] PROBLEM - mysqld processes on pc1002 is CRITICAL: Timeout while attempting connection [19:10:11] PROBLEM - DPKG on pc1002 is CRITICAL: Timeout while attempting connection [19:10:11] PROBLEM - RAID on pc1002 is CRITICAL: Timeout while attempting connection [19:10:22] ori: chasemp source => 'puppet:///private/icinga/nsca.cfg', [19:10:26] Fyi: Just received several 503's on Enwiki. [19:10:36] same on Commons, peeps are reporting [19:11:11] root@palladium:~/private/files/icinga# grep crypt nsca.cfg [19:11:13] all wikis. 
[19:11:31] Sweet [19:11:41] PROBLEM - Host pc1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:48] And that'll be why [19:11:51] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:51] PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:51] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:51] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:51] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:52] PROBLEM - Apache HTTP on mw1113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:52] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:53] PROBLEM - Apache HTTP on mw1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:53] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:54] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:54] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:58] * Vito confirms the issue from Italy too [19:12:02] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:02] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:02] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:02] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:02] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:02] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:02] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:03] PROBLEM - Apache HTTP on 
mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:03] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:04] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:04] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:05] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:15] Oh, i suppose that is related as well. [19:12:20] oh well, I guess operations knows. [19:12:21] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:21] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:21] PROBLEM - Apache HTTP on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:21] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:21] PROBLEM - Apache HTTP on mw1214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:27] checking pc1002 [19:12:37] Great time for greg-g to leave [19:12:51] I don't like that one server brings the whole site down. (Just sayin'.) 
[19:12:56] seems its borked completely and unresponsive to ssh [19:12:59] checking mgmt [19:13:17] Not sure greg could help much [19:13:18] wat [19:13:28] oom death [19:13:32] powercycle [19:13:35] im rebooting it [19:13:36] Reedy: i'd revert [19:13:38] before it boots up [19:13:41] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.541 second response time [19:13:41] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.079 second response time [19:13:41] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.055 second response time [19:13:41] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [19:13:47] losing at one host takes us down? [19:13:53] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:53] s/at/that/ [19:14:01] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.165 second response time [19:14:01] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:02] yes my question after this as well [19:14:04] what is pc1002 [19:14:07] parser cache [19:14:08] parser cache [19:14:11] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.526 second response time [19:14:11] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.691 second response time [19:14:11] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.770 second response time [19:14:20] Reedy: it's going to boot with a cold, empty memcached [19:14:21] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:21] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:26] if 
it OOMd there's a good chance it's related to the deploy [19:14:31] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.630 second response time [19:14:31] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:31] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.686 second response time [19:14:37] isn't it mysql? [19:14:41] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:49] Kk [19:14:51] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.045 second response time [19:14:52] better start populating the cache using known-good mw code [19:15:01] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.676 second response time [19:15:08] Who blew it up? ori or Reedy? [19:15:11] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [19:15:11] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [19:15:11] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [19:15:11] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time [19:15:11] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [19:15:11] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [19:15:11] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.774 second response time [19:15:12] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.931 second response 
time [19:15:12] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [19:15:13] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time [19:15:13] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.015 second response time [19:15:16] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias back to 1.24wmf8 [19:15:20] T13|sleeps: neither, it looks like. incredible, i know. [19:15:21] Logged the message, Master [19:15:21] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.047 second response time [19:15:21] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:21] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:21] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:21] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [19:15:21] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.079 second response time [19:15:22] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [19:15:31] PROBLEM - Apache HTTP on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:31] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [19:15:31] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [19:15:48] !loh powercycled pc1002 [19:15:51] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.940 second response time [19:15:51] 
RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.051 second response time [19:15:51] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [19:16:00] coming back up [19:16:09] RobH: Ciscos are run, more than 1 user can be on console :) [19:16:11] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.995 second response time [19:16:11] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.526 second response time [19:16:16] also did the same just now [19:16:21] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.480 second response time [19:16:21] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:21] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:21] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:21] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [19:16:21] like in that second [19:16:30] its one of the cisco servers, that cannot skip memory check [19:16:31] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:35] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [19:16:35] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.187 second response time [19:16:36] it will be a few minutes [19:16:41] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.257 second response time [19:16:41] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:41] RECOVERY - 
Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [19:16:51] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.814 second response time [19:16:51] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.027 second response time [19:16:51] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [19:16:51] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [19:17:01] !log powercycled pc1002 [19:17:05] Logged the message, Master [19:17:11] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.064 second response time [19:17:11] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time [19:17:11] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [19:17:11] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [19:17:11] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [19:17:11] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [19:17:11] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [19:17:12] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.570 second response time [19:17:13] has anyone ever seen one of those bios memory checks fail? never in 20+ years have i. 
[19:17:20] heh [19:17:21] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [19:17:24] ori: Yeah, it is mysql [19:17:29] hah, typo, thanks MatmaRex [19:17:31] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.361 second response time [19:17:31] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [19:17:59] Close Network Connection to Exit [19:18:03] Reedy: what do you mean? [19:18:06] that's so convenient :p duh [19:18:14] $pcTemplate = array( 'type' => 'mysql', [19:18:14] 'dbname' => 'parsercache', [19:18:15] oh, the pc [19:18:16] yeah [19:18:19] jgage: once. DDR EEC memory about 10 years back gave up on me. [19:18:21] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:21] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [19:18:21] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [19:18:34] gj icinga-wm [19:18:41] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:43] notaspy, neat. yay for ecc. which i bet our servers don't use. [19:18:49] Reedy: should we comment it out? [19:18:54] “Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes.” ?? [19:18:59] AaronSchulz: does it use a hash ring? [19:19:01] PROBLEM - Apache HTTP on mw1197 is CRITICAL: Connection timed out [19:19:03] Rastus_Vernon: known [19:19:06] Rastus_Vernon: thanks [19:19:10] ori: what? [19:19:16] i guess we need more PC boxes? 
[19:19:18] mysql-multiwrite [19:19:21] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:21] SPOF [19:19:21] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68826 bytes in 0.459 second response time [19:19:22] no [19:19:30] just an array and the consistentHash sort method [19:19:41] I guess it's a sort of hash ring, not HashRing though [19:20:01] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:03] AaronSchulz: would commenting it out improve things or hurt more? [19:20:08] back here [19:20:15] lots of stuff will remap though, only 3 boxen [19:20:21] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:21] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:21] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:21] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:21] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:21] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:21] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:22] it = pc1002's entry [19:20:28] so it depends how how the memcached hit rate is [19:20:31] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:31] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:41] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:46] the memory check is done [19:20:46] i think it's worth a shot [19:20:47] matanya: pc1002 is still down [19:20:51] it's coming back now [19:20:51] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [19:20:54] ori: why? [19:20:55] oh, okay [19:21:01] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.148 second response time [19:21:01] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:05] <_joe_> hey [19:21:07] AaronSchulz: disregard, if it's coming back up it's not worth it [19:21:10] <_joe_> what happened? [19:21:11] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [19:21:11] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.486 second response time [19:21:11] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.608 second response time [19:21:11] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.749 second response time [19:21:15] <_joe_> just got paged [19:21:18] _joe_: shit broke ;) [19:21:21] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [19:21:21] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:23] you will get more misses without it for the long tail, and the new stuff still will be warm in memcached in front of it [19:21:31] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [19:21:34] _joe_: one of the parser cache DBs became unresponsive [19:21:48] 503 at https://en.wikipedia.org/wiki/Special:Search?search=Module%3AHtmlBuilder&sourceid=Mozilla-search [19:21:49] <_joe_> Reedy: yeah I got that :P [19:21:50] had to be rebooted [19:21:51] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.723 second response time [19:21:52] actually it would be good for us to have a general response strategy for that before hand 
come to think of it [19:22:06] <_joe_> jackmcbarn: we know, working on it [19:22:10] Good to see https://wikitech.wikimedia.org/wiki/Parser_cache is up to date [19:22:11] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.064 second response time [19:22:11] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.056 second response time [19:22:17] Reedy: :) [19:22:20] <_joe_> :the" db? [19:22:21] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [19:22:28] <_joe_> ori: "the" db? [19:22:28] Reedy: Obsolete:Parser cache expansion [19:22:32] i like the namespace [19:22:32] 146 Fatal error: Call to a member function real_escape_string() on a non-object in /usr/local/apache/common-local/php-1.24wmf9/includes/db/DatabaseMysqli.php on line 289 [19:22:35] Reedy: It's a stub, don't be so hard on it [19:22:37] :d [19:22:41] RECOVERY - Host pc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:22:41] pc1002 login: [19:22:43] re: ECC, I think it's pretty automatic these days on linux if the hardware supports. you can check an individual machine with e.g. 
"lsmod|grep edac" to see if the modules are loaded [19:22:54] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.691 second response time [19:22:54] RECOVERY - check configured eth on pc1002 is OK: NRPE: Unable to read output [19:22:54] RECOVERY - Disk space on pc1002 is OK: DISK OK [19:22:54] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.699 second response time [19:22:54] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.307 second response time [19:22:54] RECOVERY - check if dhclient is running on pc1002 is OK: PROCS OK: 0 processes with command name dhclient [19:22:54] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.103 second response time [19:22:54] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.991 second response time [19:22:54] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.585 second response time [19:22:54] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.641 second response time [19:22:54] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.060 second response time [19:22:54] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [19:22:54] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.118 second response time [19:22:55] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.423 second response time [19:23:01] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.913 second response time [19:23:01] RECOVERY - MySQL disk space on pc1002 is OK: DISK OK [19:23:01] RECOVERY - 
DPKG on pc1002 is OK: All packages OK [19:23:01] RECOVERY - RAID on pc1002 is OK: OK: no RAID installed [19:23:06] happy parser caching again [19:23:07] _joe_: there are three. the reason the failure of a single node was so devestating is exactly what we're perplexed about [19:23:11] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.051 second response time [19:23:11] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.053 second response time [19:23:11] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.049 second response time [19:23:11] RECOVERY - Apache HTTP on mw1165 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.049 second response time [19:23:11] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.055 second response time [19:23:11] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.052 second response time [19:23:11] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time [19:23:36] Boom! [19:23:39] icinga-wm floods are the best [19:23:40] <_joe_> ori: can you point me to the code using it? [19:23:47] it will come back by itself [19:24:06] _joe_: MediaWiki core [19:24:14] what was the nature of the outage? A bunch of exceptions? Threads stuck all waiting? [19:24:25] * revent hands out cookies. 
[19:24:39] I am wondering why pc1002 failed a couple minutes after all wikipedia got switched to 1.24wmf9 [19:24:45] AaronSchulz: caches showing 503s [19:24:50] hashar: yes, there was probably a stampede of some sort [19:24:52] mutante said it was OOMd [19:24:54] hashar: out of memory [19:25:06] peeps are saying site's back [19:25:06] yea, that's what i saw on console before powercycle [19:25:15] mutante: yeah pom explains the kill [19:25:19] twkozlowski: How dare they [19:25:20] so the mediamwiki push invalidated too much cache? [19:25:27] i think we should just call those "Dr. OOM"s [19:25:33] Victor von Oom. [19:25:39] * twkozlowski sets emergency Twitter mode to "off" [19:25:40] If Wikipedia got switched to 1.24wmf9, does that explain why other Wikimedia projects are having the problem as well? [19:25:41] :> [19:25:41] I am wondering what caused the oom [19:25:54] Rastus_Vernon: It's all the same server pool [19:25:58] A lack of availible memory? :P [19:26:09] Rastus_Vernon: all wikis regardless of their versions use the same parser caches (pc1001 pc1002 and pc1003) [19:26:10] Come on guys, it's donation time [19:26:17] lol [19:26:28] We obviously need more servers [19:26:31] @twkozlowski @JRTomlinAuthor No need to apologize! Glad to hear it's being fixed; hope it's not making y'all coocoo bananas. [19:26:33] Reedy: that's why we should have fr-banners on the varnish error page :p [19:26:44] Ah, I see… so a parser cache DB stopped working because of a MediaWiki update to Wikipedia that used too much memory and now all the sites are down. Is that it? [19:26:44] coocoo bananas :-) [19:26:56] Rastus_Vernon: Well, they're not down [19:27:03] Rastus_Vernon: They are back. [19:27:07] _joe_: operations/mediawiki-config, wmf-config/db-eqiad.php defines $wmgParserCacheDBs [19:27:07] Oh, right. 
[19:27:07] Rastus_Vernon: they are back now, but yea [19:27:16] And as it stands, we're not exactly sure what happened [19:27:27] <_joe_> ori: ok so, just REBOOTING solved the issue? [19:27:29] _joe_: next, in the same repo, wmf-config/CommonSettings.php L351-363 [19:27:33] _joe_: yep [19:27:35] <_joe_> I may have an explanation [19:27:40] _joe_: well, power cycling [19:27:43] computers suck [19:27:45] And WMF people laugh at the mortals running ancient code... [19:27:48] <_joe_> but just restarting the db or turning it off? [19:28:00] mutante: Putting a notice on error pages saying “Our servers are run by your donations. Donate to help make Wikimedia sites more reliable!” might work… [19:28:03] _joe_: specifies that they use MultiWriteBagOStuff [19:28:04] machine was network unresponsive [19:28:27] <_joe_> Reedy: define network unresponsive [19:28:36] Rastus_Vernon: yep:) i think i tmight [19:28:38] did not respond to pings. [19:28:41] <_joe_> tcp connections to port 3306 resulted in? [19:28:51] <_joe_> some opsen please :) [19:28:55] <_joe_> fill in the blanks [19:28:58] pin critical [19:28:58] [20:09:51] PROBLEM - check configured eth on pc1002 is CRITICAL: Timeout while attempting connection [19:28:59] _joe_: the implementation of that is in mediawiki, in includes/objectcache/MultiWriteBagOStuff.php [19:29:00] etc [19:29:03] _joe_: Robh rebooted via console [19:29:12] my understanding is it was unreachable by other means [19:29:19] _joe_: you couldn't ssh to it anymore, so we used mgmt to powercycle [19:29:25] is it back up? [19:29:29] and the rest all came back by itself [19:29:37] <_joe_> chasemp: yes, someone tried to do a tcp connection to 3306, and if so, was it hang up or rejected or what [19:29:41] i'm seeing response from pages, yes. [19:29:46] its back up [19:29:48] _joe_: ah yes, understood that I don't know :) [19:29:51] well, its responsive to ssh now [19:29:54] (im on it) [19:30:02] @twkozlowski That's the making of a bad day. 
No worries - just not used to seeing you guys go down. Ever. Nice work! [19:30:07] <_joe_> RobH: is mysql up? [19:30:10] RobH: ^^ [19:31:04] nope, wont fire [19:31:12] _joe_: if you know about the server, hop on ;] [19:31:48] <_joe_> I don't even know wich one it is [19:31:52] pc1002 [19:31:56] <_joe_> I can try to troubleshoot mysql not starting [19:31:58] !log started mysql on pc1002 [19:31:59] wtf [19:32:01] RECOVERY - mysqld processes on pc1002 is OK: PROCS OK: 1 process with command name mysqld [19:32:03] what happened [19:32:04] Logged the message, Master [19:32:09] mutante: what did ya run? [19:32:27] <_joe_> RobH: service mysql start? [19:32:28] springle: wmf10 to all wikis -> pcs1002 unresponsive -> apache busy workers -> 503s [19:32:29] RobH: /etc/init.d/mysql start [19:32:31] cuz normal upstart didnt do it [19:32:39] huh, wtf [19:32:43] springle: pcs1002 hard reboot, mysql on pcs1002 did not come up [19:32:48] it first said something about the pid file existing [19:32:49] springle: that's where we're at [19:32:50] but it's not set [19:33:15] ori springle then mutante did something not mysql is up there :) [19:33:19] springle: [19:33:21] now I mean [19:33:21] root@pc1002:~# /etc/init.d/mysql status * MySQL is not running, but PID file exists [19:33:22] <_joe_> ori: the real point we're at is - one cahce db died for whatever reason and we;ve gone down in flames [19:33:27] root@pc1002:~# /etc/init.d/mysql status * MySQL is not running [19:33:31] <_joe_> THAT is the problem [19:33:33] ori, springle: looks like it may have started with many "DBError: DB connection error: Too many connections (10.64.16.157)" events [19:33:35] mysql is recovering [19:33:35] root@pc1002:~# /etc/init.d/mysql status * MySQL running (6031) [19:33:59] <_joe_> bd808: that is a consequence [19:34:02] _joe_: yes, these systems were not designed by complete idiots you know [19:34:13] <_joe_> ori: exactly :) [19:34:15] but failover/redundancy is not as expected obviously [19:34:32] 
<_joe_> ori: no blame given [19:34:36] springle: I told you we needed to replace it with MongoDB [19:34:41] <_joe_> this happens all the time man :) [19:34:52] Log says 20 minutes downtime, more or less? [19:34:53] <_joe_> that failover systems do not work as expected [19:35:20] greg-g isn't allowed to buy ssd's anymore [19:35:23] Reedy: :) [19:35:24] irc probably isn't the best place to do the post-analysis of the architectural problems or whatever [19:35:24] the internet can't take it [19:35:33] just fix things and then we can go to the lists for that :) [19:35:34] twkozlowski: 14 based on icinga [19:35:44] so, DB connection error: Too many connections [19:36:07] Thanks Reedy [19:36:18] on rollout of new release [19:36:19] is consistent with a run on the parser cache DBs due to massive cache invalidation [19:36:31] First time we've seen this... [19:36:35] Or at least, in a long time anyway [19:36:37] <_joe_> looking at the logs... mysql server became unresponsive/slow, connection piled up, this blocked workers in apache [19:36:56] <_joe_> probably this has happened before any failover mechanism kicked in [19:37:06] Reedy: yeah....we've had at least one parser cache outage that almost went unnoticed, so this is a weird new development [19:37:08] sql-bagostuff events in logstash -- https://logstash.wikimedia.org/#dashboard/temp/S9cjGaUCRriwaPpk9xNMCw [19:38:22] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [19:38:52] https://www.mediawiki.org/wiki/MediaWiki_1.24/wmf10/Changelog [19:40:11] Guess it'd be worth looking at pc load from around when it was deployed to non Wikipedias on tuesday [19:40:28] for the postmortem [19:40:32] Thu Jun 19 19:07:32 UTC 2014 mw1074 enwiki Error connecting to 10.64.16.157: :real_connect(): (08004/1040): Too many connections [19:40:33] that's the first [19:40:35] Remove $wgDBClusterTimeout [19:40:38] 19:07:32 [19:40:53] is that suspicious? 
[19:41:05] it's in the change log and removes a db cluster timeout? [19:41:18] <_joe_> mutante: /win 20 [19:41:21] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [19:41:22] <_joe_> oh sorry [19:41:28] https://gerrit.wikimedia.org/r/#/c/139164/1/includes/DefaultSettings.php [19:41:51] mutante:That sounds.. [19:42:24] mutante: But based on the diff, it's unused (in core at least) [19:42:56] (03PS1) 10Reedy: Revert "Wikipedias to 1.24wmf9" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140804 [19:43:07] (03CR) 10Reedy: [C: 032] Revert "Wikipedias to 1.24wmf9" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140804 (owner: 10Reedy) [19:43:10] * Reedy cleans up for the moment [19:43:13] (03Merged) 10jenkins-bot: Revert "Wikipedias to 1.24wmf9" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140804 (owner: 10Reedy) [19:43:35] noting that's already technically deployed [19:43:36] brb [19:44:03] the commit linked in that one's commit msg is interesting too: https://gerrit.wikimedia.org/r/#/c/139066/ [19:44:48] ah, db/LoadBalancer.php, indeed [19:44:53] parser cache doesn't even go through load balancer anyway [19:45:13] someone is reporting lua errors in #wikimedia [19:45:14] well and the change appears to be a true no-op, but sometimes appearances can be deceiving [19:45:21] * bd808 adds type:dberror to https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [19:45:22] err, #mediawiki [19:45:32] Lua error: Cannot create process: proc_open(/dev/null): failed to open stream: Operation not permitted [19:46:00] jorm: they also said 'mywiki', so it's possible they're just setting up their own thing [19:46:12] <_joe_> bblack: the change should have enforced a 10 seconds timeout [19:46:23] <_joe_> to any connection [19:46:32] i misread that as "now that the wikis are back, this is happening..." 
[19:46:37] <_joe_> which in my understanding is the same thing that happened before [19:47:52] * AaronSchulz wishes we didn't use mysql for parser cache [19:49:04] #wikipedia #downA database query error has occurred. This may indicate a bug in the software. Function: LinkCache::addLinkObj Error: 0 [19:49:09] _joe_: I think with those patches most-recently linked, there was no functional change. it was just code cleanup. the LoadBalancer.php stuff in 139066 shows it was documented to use on paramter name (which the callers used), but actually checked another, so it always used 10 [19:49:20] someone posted that onto Twitter during the downtime ^^ [19:50:09] * AaronSchulz sighs at http://mysqlserverteam.com/server-side-select-statement-timeouts/ [19:50:25] pc1002 was compiling its puppet catalog when cpu wio shot up [19:50:32] could be a coincidence [19:50:51] <_joe_> ori: I'm not [19:50:53] <_joe_> sure [19:51:24] more data for postmortem: [19:51:24] ori: there was a run of puppet 20 minutes before and no changes merged in in between in operations/puppet [19:51:31] so I am assuming it is a coincidence [19:51:48] per http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=MySQL+eqiad&h=pc1002.eqiad.wmnet&jr=&js=&v=1.0&m=cpu_wio&vl=%25&ti=CPU+wio , cpu wio shot up [19:51:54] not even a very strange one, given how often puppet runs [19:51:54] 2014-06-19T19:07:30+00:00,1.9 [19:51:55] 2014-06-19T19:07:45+00:00,29.6 [19:52:31] somewhere in that 15-second interval [19:52:35] <_joe_> ori: exactly when puppet started to exec things on that server, it probably started computing file content hashes at that point [19:52:46] cmjohnson1, hey did we think about how the switches power cables routes [19:53:23] <_joe_> but the wio spike started some minutes earlier [19:53:33] they will have to go through the holes on the back side of the cable managers..i can't think of another way ..robh any suggestions [19:53:35] <_joe_> puppet simply added some load to the lot [19:53:57] 
the wio fits nicely with the cache invalidation theory [19:54:18] cmjohnson1: yep, the cable manager backside, they usually have a pass through [19:55:12] 19:06 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf9 [19:55:32] rc1002 died less than a minute after that [19:55:57] so something between 1.24wmf9 and 8 [19:56:20] ori: lame pdf of pc1002 ganglia view http://noc.wikimedia.org/~hashar/pc1002_2014-06-19_1900.pdf might help for the postmortem [19:56:28] https://www.mediawiki.org/wiki/MediaWiki_1.24/wmf9/Changelog [19:56:33] cmjohnson1, there are no pass through on any of the cable manager [19:56:36] and the network Bytes/sec go down that much when one box dies ? http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=MySQL+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [19:56:52] cmjohnson1, sending pic [19:57:07] <_joe_> mutante: yes but look at the spike just before [19:57:37] Hey Reedy: Hope you have a better connection now :) Do you have an ETA for when Media Viewer will be enabled for all wikis? Keegan and I are standing by … [19:58:09] <_joe_> btw, pc1001 had the same pattern [19:58:13] <_joe_> it just did not die [19:58:15] <_joe_> http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=bytes_in&mreg[]=%5Ebytes_in%24&hreg[]=%5Epc1001&aggregate=1&hl=pc1001.eqiad.wmnet|MySQL%20eqiad [19:58:19] mutante: yup because all MediaWiki boxes were dead/busy and no more sending any queries to the mysql servers [19:58:27] fabriceflorin: We're still in the aftermath of the outage.. Does it depend on any specific MW version? [19:58:30] <_joe_> we may want to look at its slow query log [19:58:52] bd808: does scap not log to a file by default? [19:58:53] it could be the case that the common pattern on pc100[12] is just nearly-fatal, and the puppet load pushed it over the brink. [19:58:55] I wonder what the server room looks like. [19:59:06] Donald_ET3: Ours? [19:59:12] Yea. 
:D [19:59:16] <_joe_> so the problem was with both databases [19:59:24] <_joe_> bblack: that's my hypothesis as well [19:59:28] Donald_ET3: https://commons.wikimedia.org/wiki/Category:Wikimedia_servers [19:59:34] <_joe_> bblack: we need conditiona puppet runs [19:59:35] ori: I sends udp2log messages to fluorine, but not to a local file, no. [19:59:43] bblack: so we need to let more free memory to let puppet run? :( [19:59:43] springle: should pc1001/1002 be added to "dbtree" (to see slow queries?) [19:59:46] https://meta.wikimedia.org/wiki/Wikimedia_photo_desk#WMF_Servers Donald_ET3 [19:59:55] <_joe_> mutante: they should, probably [19:59:56] also, noticed in hashar's pdf that we had a sliver of Swap showing up on the memory graph before oom. I wonder if the wio was from actual mysql i/o traffic, or from wio on disk-swap due to impending oom? [19:59:59] hashar: yep, makes sense [20:00:14] <_joe_> I was about to ask if we had the slow query log somewhere [20:00:33] <_joe_> hashar: not memory in general in this case we may have had a herd effect [20:00:35] _joe_: i tried here http://noc.wikimedia.org/dbtree/ [20:00:37] robh: there isn't a hole in the back of the cable mgr for papaul [20:00:51] <_joe_> connections piling up means more memory gets eaten up [20:00:54] do we need to have an open 1u? [20:00:57] <_joe_> for each connection [20:01:27] <_joe_> so, ori, after all the problem was *not* in the load balancer code :) [20:01:28] Reedy: Thanks for the prompt response. I believe the Media Viewer release on all wikis is just a config change, which marktraceur was working on. So it is not technically dependent on a new MW version. [20:01:44] <_joe_> but in something that nearly killed the pc100* cluster [20:01:55] I have not made the config patch, someone else might have though [20:02:04] bd808: it really should [20:02:12] mutante: If it's 'default' => true it's easy [20:02:19] Reedy: Robla just explained to me that things are delayed a bit due to the recent outage. 
But let us know when you think it will go live, so we can plan ahead. [20:02:27] cmjohnson1: .... yuck [20:02:33] well, arent the cable mangers plastic? [20:02:36] we could make a hole [20:02:46] or leave 1u, thats annoying [20:02:46] bblack: ori: another version with large graphs and only one columns. Makes it easier to compare http://noc.wikimedia.org/~hashar/pc1002_2014-06-19_1900-large.pdf [20:02:55] No it is not plastic [20:02:56] marktraceur: Do you think we have time to go grab lunch, so we can eat at our desk? [20:02:56] i don't like that either [20:03:06] fabriceflorin: Yeah, don't worry about it, I'll be around [20:03:16] OK, will be right back. [20:03:16] I suspect we won't go for a little bit [20:03:18] ori: It would be easy enough to add. Would /var/log/scap/scap.log be a reasonable location? [20:03:22] marktraceur: It's enabled everywhere? [20:03:29] Just wmgMediaViewerLoggedIn that's fale [20:03:30] *false [20:03:37] /var/log/scap.log imo [20:03:43] Reedy: It's not yet, the config change should be to default the beta config variable to false [20:04:07] <_joe_> everyone: any idea about what may have caused that db spike? changed queries that ran with full table scans or with filesorts? [20:04:10] ori: Works for me. I'll write a bug for it. [20:04:29] pc1003 also shows that tiny bit of Swap activity in the mem graph: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=pc1003.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=MySQL+eqiad [20:04:34] so they were all running low [20:04:57] ori: we don't set MYSQLI_OPT_CONNECT_TIMEOUT do we? 
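Note on the connect-timeout question above, and on the earlier question of whether TCP connections to port 3306 hung or were rejected: the difference can be illustrated with a small probe. This is a minimal sketch in Python, not MediaWiki's mysqli code path. The function name `probe_tcp` and the 2-second default are assumptions made for illustration.

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Classify a TCP connect attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host answered, but nothing is listening on the port
    except socket.timeout:
        return "timeout"   # no answer at all before the deadline
    except OSError as exc:
        return "error: " + exc.__class__.__name__
```

A refused attempt returns at once, while a timed-out attempt holds the caller for the full timeout. With no connect timeout set, a worker talking to an unresponsive host could plausibly block for much longer, which would fit the picture of Apache workers piling up; the log does not confirm which case applied to pc1002.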
[20:05:09] <_joe_> bblack: my hipothesis is that for some reason one query slowed down -> max connections -> swap -> death spiral [20:05:10] papaul: there should be room on top of the access switch and below the mgmt switch to route the power cable [20:05:38] <_joe_> or, simply, for some reasons we funneled 10x the queries to the db [20:05:40] also notable, in pc1003's live dmesg log: [20:05:41] never mind...looking at picture [20:05:41] [Mon Oct 7 15:44:22 2013] Killed process 8036 (mysqld) total-vm:214352716kB, anon-rss:193870716kB, file-rss:1140kB [20:05:44] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=bytes_in&mreg[]=^bytes_in%24&hreg[]=^pc10&aggregate=1&hl=pc1002.eqiad.wmnet|MySQL+eqiad%2Cpc1001.eqiad.wmnet|MySQL+eqiad%2Cpc1003.eqiad.wmnet|MySQL+eqiad [20:05:52] _joe_: ^ [20:05:58] we had an oom-kill of mysqld back in October, and the related pidlist in the log shows puppet running at the same time then as well :) [20:06:06] that spike in pc1003 when pupet ran [20:06:22] before we used the mysqli client I guess it didn't matter due to mysql.connect_timeout [20:06:26] * AaronSchulz wonders what the default is [20:06:26] would puppet account for bytes_in like that? [20:06:39] Chris that will work only if we keep yesterday configuration and not todays config [20:06:42] papaul: can you drill a hole in the back of the cable manager [20:06:43] <_joe_> springle: no [20:07:11] <_joe_> springle: our hipothesys is that the dbs got hammered, pc1002 died because the coincidence of a puppet run [20:07:20] that would be the best method...put a hole on the end under the display [20:07:26] <_joe_> but the outage was caused by the 3 server getting hogged down at the same time [20:08:36] <_joe_> springle: do we have slow queries logs for those servers? 
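Note on the two hypotheses above (one query slowing down, or roughly 10x the queries reaching the db): both can be put in rough numbers with Little's law, where average connections in flight equal arrival rate times time per query. This is a sketch only. Every figure below, including the connection limit, is hypothetical and is not a measurement from pc1002.

```python
def connections_in_flight(queries_per_sec, latency_ms):
    """Little's law: average concurrency = arrival rate * time in system."""
    return queries_per_sec * latency_ms / 1000.0


# All numbers are hypothetical, chosen only to show the shape of the effect.
MAX_CONNECTIONS = 500  # assumed server-side connection limit

scenarios = {
    "normal": connections_in_flight(2000, 5),
    "queries 100x slower": connections_in_flight(2000, 500),
    "10x the queries": connections_in_flight(20000, 5),
}

for name, conns in scenarios.items():
    status = "over limit" if conns > MAX_CONNECTIONS else "ok"
    print(f"{name}: ~{conns:.0f} connections ({status})")
```

Under these assumed figures a latency increase does more damage than a traffic increase of the same apparent size, and each extra open connection also costs server memory, which is the path toward swap described in the discussion.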
[20:08:40] MediaWiki.stats.pcache_hit.count plummeted: [20:08:41] http://graphite.wikimedia.org/render/?width=969&height=550&_salt=1403208491.613&target=MediaWiki.stats.pcache_hit.count&from=00%3A00_20140618&until=23%3A59_20140619 [20:08:58] _joe_: i agree with that. however something made pc1002 spike network before the others [20:09:11] Chris i can only if i have the right tools because thr cable managers are not plastic [20:09:20] <_joe_> springle: not puppet I'd say [20:09:30] no, not puppet [20:09:37] papaul: bblack seems like the sorta guy that'll have power tools ;D [20:09:37] mass cache invalidation [20:09:41] cache hit rate went down [20:09:58] <_joe_> ori: and that is calculated how? [20:10:04] <_joe_> the cache hit rate [20:10:10] <_joe_> or more properly, when? [20:10:17] <_joe_> once the request has been completed [20:10:29] <_joe_> or one signal is emitted as soon as the cache is not hit? [20:10:53] once req completed [20:11:07] <_joe_> ok so that may be a byproduct [20:11:09] _joe_: slow queries aren't relevant afaik. parsercache is simple lookups [20:11:18] <_joe_> or not, but we don't know [20:11:50] <_joe_> ok so, either we changed some code in the cache lookup , that changed queries, or ori is correct [20:11:59] ori: so how could that hit pc1002, in the form of a bytes_in spike, sooner than the others? [20:12:24] Aren't the queries just a select * from pc_foo where id = bar type of thing? 
[20:12:28] springle: just an array and the consistentHash sort method [20:12:42] <_joe_> a mass cache invalidation -> flood of queries -> servers hogged -> pc1002 dies because puppet runs --> the other two servers are unrsponsive because of the horde effect [20:12:42] maybe some key pattern that hashed to pc1002 [20:12:51] Reedy: yes, queries are just simple [20:12:53] I think it hit all 3, it's just pc1002 happened to die from it and pc100[13] lived through it (and perhaps puppet running at that moment was a contribution to the distinction) [20:12:56] Chris the plastic sheet that we have in between each cabinet i think we can put another small hole on the back and in the front [20:13:06] <_joe_> I agree with bblack [20:13:28] <_joe_> bblack: though, it could've been pc1002 dying and the other two succumbing to horde effects [20:13:40] <_joe_> so, we probably need more of those servers anyway [20:13:45] there's not much obvious lag in the effects on the graphs, though [20:13:57] <_joe_> bblack: look at the one springle pasted [20:14:01] <_joe_> there is a 1-minute lag [20:14:09] <_joe_> http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=bytes_in&mreg[]=^bytes_in%24&hreg[]=^pc10&aggregate=1&hl=pc1002.eqiad.wmnet|MySQL+eqiad%2Cpc1001.eqiad.wmnet|MySQL+eqiad%2Cpc1003.eqiad.wmnet|MySQL+eqiad [20:14:36] <_joe_> that may or may not put us in either of the two scenarios [20:16:03] Reedy/papaul: yeah but I'm not there right now :) [20:16:27] heh [20:16:35] cmjohnson1, better just another hole at the back since there is already one in the front [20:16:48] surely dc staff has a drill? [20:17:17] papaul: we have to get you the right bit [20:18:06] i think adding a hole to the back is the best option. 
you can either go to home depot and buy one and expense or robh can order you one [20:18:12] <_joe_> bblack: the lag is actually 2 minutes [20:18:20] I don't think there is one that is large enough onsite [20:18:45] how big of a whole are we talking in what material? [20:18:51] s/whole/hole/ :) [20:19:51] lighter? [20:20:25] bblack one large enough to fit the end of a power cable through [20:20:44] probably an inch in diameter at most [20:20:53] cmjohnson1, it has to be two holes [20:21:09] why? [20:21:25] Two swithces [20:21:36] can you make one large enough to run all 3 cables through? [20:21:44] 2 power cables for asw...1 for msw [20:22:01] you might want to pick up a small hole-saw bit then instead of a regular drill bit. it will cut it much cleaner and more-reliably. [20:22:19] cmjohnson1, no the cable managers are 1 u [20:22:37] fwiw, there was a change to https://en.wikipedia.org/wiki/Template:Navbox at 19:02 UTC. possible cause of cache invalidations? [20:22:57] robla: srsly? :| [20:22:59] maybe look for rubber grommet to match the hole size in the electrical section as well, so the edge doesn't wear into the cable over time [20:23:06] cmjohnson1, you can not put a big hole for two holes [20:23:08] what the [20:23:27] You'd presume editing Navbox isn't going to help matters [20:23:49] wb greg-g [20:23:50] yeah..so if you cut a hole out of just one you should be able to route all the power cables through it..especially if you do it the way bblack is suggesting. [20:23:54] papaul ^ [20:24:03] greg-g: Your wood-knocking was insufficient [20:24:08] so, cluster out? [20:24:11] How much blame are we placing on MediaWiki? Can we try deploying again? [20:24:18] can we take the cabling conversation somewhere else then? 
[20:24:20] (by hole-saw I mean like this: http://www.lowes.com/pd_348134-28303-1772779_4294607734__?productId=3361282&Ns=p_product_qty_sales_dollar|1&pl=1&currentURL=%3FNs%3Dp_product_qty_sales_dollar%7C1&facetInfo= ) [20:25:12] http://graphite.wikimedia.org/render/?width=828.59375&height=464.84375&_salt=1403209454.593&logBase=10&from=19%3A00_20140619&target=removeBelowValue(servers.pc1002.network.eth0.rx_bit.value%2C0.00001)&target=removeBelowValue(servers.pc1002.cpu.total.iowait.value%2C0.00001)&until=19%3A10_20140619 [20:25:12] Reedy: I think, given we have a plausible non wmf10 explanation for a major load event, yeah, seems worth trying. greg-g, what do you think? [20:25:26] sorry, just got back, think about what? [20:25:26] rx_bit goes up a little bit before iowait [20:25:33] ah, dpeloying again [20:25:41] if that's the working theory (non-wmf10) yeah [20:25:49] i wouldn't [20:25:51] imo [20:25:51] would a navbox change make it possible to hit pc1002 specifically, before the others? [20:25:57] we still don't have the full picture [20:26:06] so, what does pc stand for? [20:26:08] fair enough [20:26:09] and it'll contaminate logs with more data [20:26:10] parser cache [20:26:19] ahhhh [20:26:22] well crap [20:26:44] Thanks bblack [20:26:47] bblack I like that..i was even thinking a dremel [20:26:53] greg-g: do you know of any particular time sensitivity/general antsiness to get wmf10 out? [20:26:56] springle: navbox? 
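Note on the question above of whether one change could hit pc1002 before the others, and on the earlier remark that keys are placed with "an array and the consistentHash sort method": the general idea can be sketched as follows. This is an assumed rendezvous-style scheme written for illustration, not a copy of MediaWiki's implementation, and the key names are hypothetical.

```python
import hashlib

SERVERS = ["pc1001", "pc1002", "pc1003"]


def pick_server(key, servers):
    """Rank servers by a hash of (server, key) and take the top one."""
    def score(server):
        return hashlib.md5(f"{server}:{key}".encode()).hexdigest()
    return max(servers, key=score)


# Hypothetical key names, used only to exercise the placement function.
keys = [f"enwiki:pcache:idhash:{n}" for n in range(3000)]
before = {k: pick_server(k, SERVERS) for k in keys}
after = {k: pick_server(k, [s for s in SERVERS if s != "pc1002"]) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Only keys that lived on the removed server change owner.
assert all(before[k] == "pc1002" for k in moved)
print(f"{len(moved)} of {len(keys)} keys remapped")
```

With three servers, dropping one entry remaps roughly a third of the keys, which matches the "lots of stuff will remap though, only 3 boxen" concern. Placement is deterministic, so a very hot key always lands on the same host; whether that explains pc1002 spiking first is not established by the log.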
[20:27:03] If it was contributed to by the Template:Navbox edit, lame -- https://en.wikipedia.org/w/index.php?title=Template:Navbox&diff=613595756&oldid=579335568 [20:27:05] ori: enwp template, really highly used [20:27:06] ori: yes [20:27:11] heh [20:27:23] AaronSchulz would know better than me [20:27:44] seems like a good guess tho [20:27:53] robla: no [20:27:55] robla: greg-g I think there's only really manybubbles wanting wmf10 particularly for anything [20:28:17] we don't have to shelve wmf10 plans for good, just for a little while longer while there's uncertainty about the causes of the outage that just ended [20:28:22] I can wait [20:28:23] cmjohnson1: I think they make that style of hole saw down to ~3/4" size on the small side. Usually drilling into thin material with giant regular drill bits is touchy and error-prone and dangerous. drill jumps around and stuff [20:28:23] I'm all patience [20:28:26] Well, it's not wmf10 at fault [20:28:27] made of the stuff [20:28:27] It's wmf9 ;) [20:28:44] Reedy: yes, but we don't want to add more plot twists to the story atm [20:28:45] bblack: cmjohnson1 papaul sorry, but can the cabling conversation go into a nother channel, please [20:28:51] Did anyone get round to looking at the pc graphs around tuesday deploy? [20:29:12] i didn't see anything unusual but didn't look very carefully [20:29:24] greg-g i think we're done [20:29:29] cmjohnson1 / papaul : Just so I understand, to reiterate the plan is to get a drill bit that is 1 inch or so in diameter [20:29:31] Bblack please see link for cable manager https://www.google.com/shopping/product/14708664676568416323?sclient=tablet-gws&biw=1280&bih=752&q=srcableduct1uhd&oq=srcableduct1uhd&pbx=1&bav=on.2,or.&bvm=pv.xjs.s.en_US.SU4soCeLflY.O&tch=1&ech=1&psi=sEejU-L0AsiLqAaJo4CwBg.1403209659150.3&sa=X&ei=3UejU8PvDMWXqAb_qIDIAQ&ved=0CB4QuSQ [20:29:37] cmjohnson1: sure [20:29:37] and drill a single hole to route all the power cables through, correct? 
[20:29:38] or i lie [20:30:00] guys, there's a production issue going on, can the non-right-this-minute important conversation go somewhere else [20:30:25] i can't make any sense of backscroll :/ [20:30:29] 1U in a rack is 1.7" [20:30:59] :( [20:31:05] greg-g: but drilling holes is the best thing ever [20:31:06] the backscroll isn't very sensible even without it, fwiw [20:31:17] fine, tell me what happened then [20:31:25] greg-g and Reedy: my last comment - I'm using testwiki to test wmf10 and I'll wait until it is deployed everywhere else its going to do more [20:31:35] greg-g: if you'd like to give me a call, I can catch you up [20:31:54] robla: sure thing, it'll be a real cell phone one [20:32:01] greg-g: there was an edit to Template:Navbox , a current theory is that it caused a massive cache invalidation [20:32:37] greg-g: ^ but i don't buy that as the full explanation [20:32:38] do template edits still cap at only 200k invalidations no matter how many transclusions, or was that changed? [20:32:42] jfyi [20:32:49] (03PS1) 10Reedy: Set wmgMediaViewerBeta to false everywhere [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140811 [20:33:24] marktraceur: ^ ? [20:33:56] Reedy: That's the one [20:34:07] mediaviewer.dblist can disappear too, eventually [20:34:23] (or now? shrug) [20:34:58] marktraceur: It's still used for other settings [20:35:10] It is? [20:35:23] I don't know if it should be [20:36:02] logs look like something is up with wikidata [20:36:21] actually, not wikidata, just : Unable to allocate memory for pool. stuff [20:36:59] marktraceur: Sampling stuff. Ctrl + F 'mediaviewer' in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php [20:37:02] Reedy: I think those settings should just be set conditionally based on wgMediaViewerIsInBeta [20:37:02] Reedy: Any idea why we would still have hosts with APC thrashing? mw1193 had a big "unable to allocate memory for pool" spike in the last 15 minutes. 
[20:37:09] Or whatever it is [20:37:12] huh, funny how the network dips match with increases in "Swift esams"...though the later always spiked periodically without incident [20:37:55] Or just set to default: 1000 [20:38:58] bd808: Nope... For which version? [20:39:18] Reedy: Looks like both 8 and 9 [20:39:26] :/ [20:39:33] That just sounds stupid [20:40:30] springle: why don't you think *just* navbox change? [20:40:36] Reedy: It happened on a few other boxes too; check out https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor for the last hour [20:41:10] other question, for ori: there hasn't been a puppet3 change recently, right? [20:41:15] greg-g: puppet probably didn't help matters either [20:41:17] Yeah [20:41:19] oh [20:41:22] today? [20:41:24] Reedy: The mediaviewer.dblist thing can get done later, I think. [20:41:28] Not sure [20:41:39] In most cases, puppet3 is quicker/lighter/better [20:41:54] I know it's been done on the mw hosts, not sure about mysql boxen [20:42:02] https://gerrit.wikimedia.org/r/#/c/139164/ [20:42:52] Reedy: you mean there was a puppet run at the same time? [20:42:59] Indeed [20:43:04] * greg-g nods [20:43:24] heh...sorry, meant to pm that Greg :-) [20:43:46] we were talking about probable red herrings [20:44:22] * robla ducks out for lunch [20:44:23] ok, so, here's my understanding: [20:44:58] * hashar rolls the drum [20:45:02] 1) at aroud the same time as Reedy was deploying (deploying at 19:06), there was a navbox change (19:02) [20:45:13] that caused a huge invalidation, obvs [20:45:20] "somehow" killed pc1002 [20:45:29] pc1002 going down took other things down, too? [20:45:50] but, pc1002 has recovered, so, there's no reason to presume it was wmf10/wmf9 causing it [20:45:56] (after a reboot) [20:46:07] well, it recovered afer reboot and manual start of mysql [20:46:11] * greg-g nods [20:46:22] Open questions: why the heck did pc1002 fall over? 
[20:46:39] greg-g: because one of the three parsercache boxes spiked in network and disk earlier than the others. a simple navbox change might have caused invalidation, but something happened to pc1002 before its peers [20:46:43] mutante suggested it was oom error, since ssh was unresponsive it makes sense [20:46:46] question 2) why did that falling over cause so much damage? [20:46:47] greg-g: it ran out of memory [20:46:48] I think pc1002 went down in such a way that mysqli connection attempts to it hung and that killed the apaches with pending requests, but I have no real proof of that. [20:47:13] greg-g: run rates for htmlcacheupdate/refreshlinks are unremarkable in the last 24hrs [20:47:24] greg-g: i think we've only identified symptoms so far [20:47:31] right, so, point being, a single box going down should have done this, right? [20:47:36] shouldn't* [20:47:57] or: are we not able to handle any edits to template:nav box ever again [20:48:41] I think that edit didn't matter [20:48:48] AaronSchulz: what's your working theory? [20:49:39] greg-g: and puppet ran on pc1002 just at that moment, might have added a bit of memory stress on an already starved machine [20:50:15] greg-g: I saved a pdf of pc1002 ganglia view with large graphs on a single column. Might help http://noc.wikimedia.org/~hashar/pc1002_2014-06-19_1900-large.pdf [20:50:25] thanks [20:50:40] how many threads does puppet agent use by default? [20:51:01] _joe_: ^ [20:51:09] not sure, I'd also want to know the connect and query timeouts [20:51:34] the memory used by puppet might have been a catalyst, but is unlikely to have been the real cause [20:51:37] I don't know if the errors in sql-bagostuff.log correspond to the client waiting first or just failing fast [20:52:00] springle: do you know if there were lots of slow queries on pc1002? [20:52:13] <_joe_> ori: one [20:52:59] <_joe_> I'm pretty sure puppet just killed pc1002, but it was just the igniter [20:53:02] AaronSchulz: there were not. 
just max_connections, bytes_in spike (which seems to have been puppet), then swap [20:54:33] springle: notice iowait plateaus at ~58%: http://graphite.wikimedia.org/render/?target=servers.pc100*.cpu.total.iowait.value&from=19%3A04_20140619&until=19%3A30_20140619&height=600&width=1000 [20:54:35] <_joe_> so yes, tons of queries (from cache invalidation probably) -> more memory, more i/o, + puppet -> swap -> death [20:54:45] which is consistent with something utilizing 14 of 24 cores [20:55:18] greg-g: I am cleaning up the IRC log for ya :] [20:55:27] hashar: :) :) [20:55:34] something was wrong with that puppet run [20:55:47] other puppet runs on that host immediately before the last finished in <30 seconds [20:56:06] <_joe_> ori: that's because the database was running very hot [20:56:08] the last one was still running a minute and twelve seconds in [20:56:16] does that explain the parser cache cluster going down though? [20:56:22] wow, what happend ? :( [20:56:23] <_joe_> greg-g: yes [20:56:34] _joe_: one machine takes the entire pc cluster? [20:56:44] <_joe_> greg-g: slow databases -> php processses hanging -> max clients [20:56:50] parser cache? [20:56:54] <_joe_> greg-g: all three pc100* suffered [20:57:09] <_joe_> so... no place to hide [20:57:27] they all suffered the navbox thing, and puppet ran simultaneously on them all? [20:57:38] <_joe_> also, we have one issue probably with the way we manage faults of databases [20:57:48] it dit not run simultaneously, pc1002 maybe had bad luck with the timing [20:57:50] so a single edit killed the site for the interwebs? [20:57:52] and that's why only that died [20:57:53] <_joe_> greg-g: no, puppet *killed* that machine, the other two were just slow [20:58:04] yeah, I thought so :-) [20:58:10] _joe_: so, one machine out + spike in load == outage [20:58:35] perfect storm? [20:58:47] are we just chalking it up to that? 
[20:59:10] <_joe_> greg-g: not perfect storm, normal shit that happens in web scale [20:59:11] greg-g: no, there is more to figure out [20:59:24] <_joe_> and no, I agree with springle [20:59:34] <_joe_> we should be able to recover from one failing host [20:59:34] ok [20:59:42] it's not like the root cause was that machine going down for hardware failure or so [20:59:43] right, that's my question/worry :) [20:59:44] <_joe_> and it does not necessarily seem the case [20:59:47] it's more a symptom that it went down [20:59:55] * greg-g nods [21:00:17] _joe_: it was still reporting metrics to diamond at 19:20 [21:00:25] so puppet was still running [21:00:27] <_joe_> wow [21:00:43] <_joe_> ori: diamond wins, even ping did not work [21:00:49] ok, so, springle / _joe_ it seems you have the best working theory of "what else" caused this: I assume you want to hold on pushing out wmf10/wmf9 (ie: go back to deploying)? [21:00:50] but i saw the OOM message on mgmt console [21:00:57] unfortunately not a screenshot now [21:01:01] mutante: do you still have that? timestamp? [21:01:10] anything you can recover? [21:01:10] <_joe_> greg-g: mh no opinion here [21:01:16] ori: right before RobH powercycled [21:01:42] <_joe_> mutante: instagram or didn't happen :) [21:02:00] greg-g: i think continue deploying. i'll do the report [21:02:17] springle: _joe_ ok, thanks [21:02:22] Reedy: ^^ [21:02:47] Reedy: wanna try again? :) [21:02:51] What's scheduled from now on? [21:02:55] roll the dice, play the cards [21:02:58] <_joe_> springle: isn't it like 5 am there? [21:03:09] 12:14 < mutante> oom death [21:03:12] 7 apparently [21:03:23] Reedy: nothing [21:03:26] ori: no :/ not in history [21:03:36] i connected to mgmt and disconnected [21:03:37] [19:15:48] !loh powercycled pc1002 [21:03:41] not sure how i'd get it back [21:03:48] _joe_: isn't it like 11pm there? 
[21:03:51] <_joe_> oh ok, so less uncomfortable than my 23:00 [21:03:55] <_joe_> yes [21:04:05] (03PS1) 10Reedy: Revert "Revert "Wikipedias to 1.24wmf9"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140819 [21:04:09] <_joe_> I still get pages at this time of the evening [21:04:23] <_joe_> uhh we're deploying? [21:04:34] <_joe_> let me hop on the pc servers :) [21:04:37] yep, we're deploying [21:04:43] Are we disabling puppet? ;) [21:04:44] yeah, watch the logs closely :) [21:04:48] Reedy++ [21:04:52] (03PS2) 10Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140819 [21:05:19] _joe_: I've been still up at gone 4am after problems around this time before now... (along with various opsen) [21:05:33] greg-g: are you interested in the logs to recover mysql ? [21:05:41] greg-g: or just the time frame of the actual outage? [21:05:44] hashar: I love all logs equally [21:05:53] <_joe_> Reedy: I know the feeling [21:05:55] greg-g: it is a bit tedious to filter out the noise :] [21:06:10] hashar: yeah, :/ [21:06:32] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140819 (owner: 10Reedy) [21:06:38] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140819 (owner: 10Reedy) [21:06:39] so, i was planning on asking for a set of changes to Template:Infobox and others. I take it I should hold off on that? [21:06:42] why dont you edit the Template [21:06:44] and see what happens [21:06:45] jorm: shush you [21:06:51] :) [21:06:53] <_joe_> puppet running on pc1002 [21:06:54] No, I'm serious. [21:06:54] <_joe_> :P [21:06:59] <_joe_> not kidding [21:07:06] * Reedy waits [21:07:08] <_joe_> it ran [21:07:11] How long do we have till the next run? :P [21:07:17] <_joe_> 20 minutes [21:07:21] <_joe_> the clock is ticking [21:07:22] jorm: If everything settles down alright... 
[21:07:23] I talked with Ori and Aaron about it earlier, and was told to go ahead. [21:07:24] jorm: i figured, let's wait til after post-mortem to give you an opinion ;) [21:07:26] GOGOGOGOOG [21:07:39] Go go gadget mediawiki [21:07:44] jorm: I'd wait on springle's post-mortem to Ops [21:07:48] (03PS1) 10Yuvipanda: toollabs: Add GeoIP packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/140820 (https://bugzilla.wikimedia.org/62649) [21:07:51] kk [21:07:54] thanks [21:07:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf9 take 2 [21:08:01] Logged the message, Master [21:08:08] <_joe_> jorm: wait until deployment is settled, I'd say [21:08:12] to enwiki! [21:08:18] oh, it wouldn't be today. [21:08:28] _joe_: not today, please :) [21:08:45] _joe_: I'm waiting on sean before I "approve" any big template changes ;) [21:08:54] <_joe_> greg-g: eh :) [21:09:06] but it being a wiki...... [21:09:21] (To all you lurkers/evil doers out there: please don't) [21:09:32] I was about to cry [21:09:38] Then realised Chrome was complaining of network problems [21:11:27] (03PS2) 10Yuvipanda: toollabs: Add GeoIP packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/140820 (https://bugzilla.wikimedia.org/62649) [21:11:41] greg-g: stripped log http://noc.wikimedia.org/~hashar/20140619-pc1002-irc.log [21:11:47] * greg-g hugs hashar [21:11:53] greg-g: time is UTC obviously [21:11:56] (03CR) 10Andrew Bogott: [C: 032] toollabs: Add GeoIP packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/140820 (https://bugzilla.wikimedia.org/62649) (owner: 10Yuvipanda) [21:11:57] annotated, even [21:12:11] greg-g: I have let a few troll attempt in [21:12:13] springle: hashar has a great log for the post-mortem: http://noc.wikimedia.org/~hashar/20140619-pc1002-irc.log [21:12:18] greg-g: else postmortem are boring to read. 
[21:12:22] thanks hashar :) [21:12:43] springle: and the PDF showing pc1002 ganglia view is http://noc.wikimedia.org/~hashar/pc1002_2014-06-19_1900-large.pdf [21:13:31] springle: the upstart script probably need to be fixed. Might want to leave a bit more memory available to non mysql process. [21:13:48] springle: beside that. I have no clue how a database work. I guess you have some magic logs =) [21:14:07] oh magic; that would be nice [21:14:29] mysql-debug-oom --source mediawiki/core --time "1 hour ago" --blame [21:14:30] sudo apt-get install magic-log [21:14:38] (03PS2) 10Reedy: group0 to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140770 [21:15:15] (03CR) 10Reedy: [C: 032] group0 to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140770 (owner: 10Reedy) [21:15:21] (03Merged) 10jenkins-bot: group0 to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140770 (owner: 10Reedy) [21:16:06] springle: greg-g the logstash dashboard link is invalid . https://logstash.wikimedia.org/#dashboard/temp/Q35OBIU2RHiZdTRAhEXfKg would work [21:16:38] that dumps all the too many connections errors received [21:16:54] * greg-g thanks [21:16:55] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf10 [21:16:56] er [21:16:57] thanks [21:16:58] bd808: logstash is a breeze :] [21:17:00] Logged the message, Master [21:17:27] hashar: :) Much nicer than figuring out the right grep|awk incantation [21:18:06] Is gerrit slow for anyone else? [21:18:08] bd808: definitely [21:18:10] Or is it just my connnection fail? [21:18:13] (03PS2) 10Reedy: Set wmgMediaViewerBeta to false everywhere [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140811 [21:18:39] GOOD NEWS! 
BT have scheduled roadworks to install a FTTC cabinet [21:18:41] * Reedy dances [21:19:02] springle: https://gerrit.wikimedia.org/r/#/c/140809/ really should be configurable...I guess before we just relied on the ini [21:19:02] bad news, you will have to wait a few more months to have a service offered on that fiber [21:19:08] Yup [21:19:23] It's another step at least [21:19:46] greg-g, springle, mutante, _joe_, bblack, Reedy, RobH: https://etherpad.wikimedia.org/p/19-jun-2014-parsercache-outage [21:20:02] (03CR) 10Reedy: [C: 032] Set wmgMediaViewerBeta to false everywhere [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140811 (owner: 10Reedy) [21:20:11] (03Merged) 10jenkins-bot: Set wmgMediaViewerBeta to false everywhere [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140811 (owner: 10Reedy) [21:20:22] (03CR) 10QChris: [WIP] Add backup role and scripts to wikimetrics (0313 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [21:20:29] In logstash related news, I've added dberror events to the fatalmonitor dashboard and highlighted APC thrash fatals with their own color (blue for sadness) [21:20:42] springle: you're being washed in resources to write a report :) [21:20:50] springle: let me know if you need me to help in anyway [21:20:55] bd808: you as well if you have annotations [21:21:05] datapoints to add to that timeline i mean [21:21:07] bd808: you should announce it on the engineering list with nice screenshots (people are too lazy to click a link and figure out their password) [21:21:27] bd808: I got the cluster's down blues [21:21:29] !log reedy Synchronized wmf-config/InitialiseSettings.php: Set wmgMediaViewerBeta to false everywhere (duration: 00m 15s) [21:21:32] hashar: wikimediaoutage.fm [21:21:32] Logged the message, Master [21:21:58] greg-g: thanks [21:22:13] Ta Reedy [21:22:26] springle: are you going to be writing it? 
i'm going to take off if so [21:22:51] see the etherpad for the timeline tho [21:23:03] he offered, yeah [21:23:04] * ori is a jerk, assumes springle's answer is "yes", runs away. [21:23:07] :) [21:23:57] ori: yes [21:28:10] * hashar offers a continental breakfast to springle [21:28:40] and I am cowardly heading bed. Have a good afternoon/ day folks [21:31:30] good night hashar [21:32:54] right, lunch [21:35:44] (03CR) 10QChris: [WIP] Add backup role and scripts to wikimetrics (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [21:40:11] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 19 Jun 2014 18:39:48 UTC [21:40:11] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jun 19 21:40:06 UTC 2014 [21:44:01] (03CR) 10QChris: [C: 04-1] Enable the new backup role in wikimetrics if set (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [21:47:55] (03CR) 10Gage: "Hi, thanks for the feedback. Some references:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage) [21:51:02] <_joe_> http://aphyr.com/posts/317-call-me-maybe-elasticsearch sigh, ES [21:51:13] <_joe_> and we're using it [21:51:33] <_joe_> the test on single node network partitions is particularly frightening [21:51:51] manybubbles|away: ^d ^^^^ [21:52:34] <_joe_> I mean, not that this matters *that* much to us, since it's a search index after all [21:52:41] Regarding the pc1002 outage - is it fairly certain that the navbox update (at least partially) caused the outage? [21:54:21] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 72 data above and 9 below the confidence bounds [21:55:01] <^d> _joe_: This is why we spread shards and replicas across different racks. 
[21:55:10] <^d> :) [21:55:31] <_joe_> ^d: read the article, you may discover that makes the problem worse [21:55:43] <^d> I'm skimming. [21:56:27] <_joe_> he's clearly pissed at them btw. His reviews are usually less harsh [21:59:16] Excirial: no [22:01:39] <_joe_> ^d: basically, he advises to use ES as a seach index and not as a primary datastore, like we do [22:02:13] <^d> Well it's meant to be a search engine. [22:02:21] <_joe_> still it's an interesting read and it introduces pointers to interesting things like CRDT structures [22:02:45] <_joe_> if you're into distributed systems, it's an interesting read :) [22:03:11] <_joe_> (the whole call-me-maybe series is very interesting [22:04:27] "If you can, store your data in a safer database, and feed it into Elasticsearch gradually" [22:04:30] _joe_: woah, thats a lot of work [22:04:31] Well, that's what we do [22:04:57] Searching something on Wikimedia wikis has always been a bet. [22:05:22] At least now you can eventually sync the index with null edits or actual edits. :P [22:05:29] Nemo_bis: yeah, but it'd be unfortunate if we had to rebuild it due to some network partition [22:05:42] Nemo_bis: I actually have scripts to sync it too [22:05:54] I mean users [22:06:05] they were written to fix bugs like when we left ghosts in the index if the page was changed to a redirect [22:06:07] yeah [22:06:09] I get it [22:06:10] (03CR) 10Nikerabbit: [C: 04-1] "Thanks for the review. I added some comments which might clarify the intentions." (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [22:06:28] <_joe_> manybubbles: you will probably just need to rebuild all pages modified after a certain date [22:06:33] <_joe_> the date of the partition [22:06:51] _joe_: that was one of the first things I built [22:06:54] <^d> "Elasticsearch claims–and Wireshark traces confirm–that documents inserted without an ID will receive an auto-generated UUID for their document ID. 
How is it possible that an insert of a fresh document with a fresh UUID can fail because that document already exists? Something is seriously wrong here." [22:07:02] <^d> +1 for not using auto-ids and giving our own. [22:07:04] <^d> :) [22:07:17] ^d: logstash uses them [22:07:19] Anyone know why e.g. https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1397062971.274&from=-7days&target=MediaWiki.API.users.tp50 has a big ol' gap in the data? [22:08:12] <_joe_> ^d: happy to hear we got it right [22:08:18] <_joe_> marktraceur: I do [22:08:30] logstash isn't meant to be authoritative, it's just a "better grep" in my opinion. [22:08:49] bd808: sure [22:08:53] we loose god knows how many log events due to udp2log already :/ [22:08:54] <_joe_> marktraceur: the reason is mwprof hangs, and profiler-to-carbon waited forever [22:08:58] OK [22:09:09] <_joe_> until I notice and restart it [22:09:22] <_joe_> I submitted a patch that should resolve the issue [22:09:28] Thanks _joe_ [22:09:35] <_joe_> not sure if it's deployed though [22:10:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [22:11:51] <_joe_> mmmh this is not good [22:12:46] <^d> manybubbles, _joe_, bd808: https://github.com/elasticsearch/elasticsearch/tree/feature/improve_zen :) [22:13:04] ^d: I'm been watching that for a while [22:13:21] <^d> I followed a link from the blog post and ended up there. [22:14:33] elasticsearch is actually one of the best tested systems I've ever seen. [22:14:43] lucene is better tested but its scope is smaller [22:15:03] but zen really could use some work. [22:18:52] <_joe_> https://gdash.wikimedia.org/dashboards/reqerror/ <-- can you spot the outage maybe? [22:19:13] mwalker: i puppetized silverpop_export.yaml [22:20:40] mutante, awesome; thanks much [22:21:53] mutante, actually; if you want to fiddle around more; can you make that file owned root:jenkins 440? 
[22:22:21] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [22:22:32] that was implied knowledge for jeff that I didn't add to the ticket [22:22:34] mwalker: oh, jenkins runs there? [22:22:39] ok [22:22:40] *nods* [22:22:47] we have our own setup [22:22:50] yea, i'm a total fr noob :) [22:23:01] but i was also curious how it works [22:23:07] are you acting as our new backup ops person? [22:23:16] i randomly picked an RT ticket :) [22:23:20] heh [22:23:21] because now i always see them [22:23:57] ^d: manybubbles: You aware of [2014-06-19 22:23:28] Fatal error: Call to undefined method Elastica\Exception\Bulk\ResponseException::getResponse() at /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php on line 171 [22:24:03] in any case; ya -- we have a jenkins install on barium which runs our recurring jobs. we dont use cron because we need email notifications when they fail, and we need to be able to turn them on / off at will [22:24:07] <^d> Yeah, we've got a patch in master. [22:24:10] <^d> hoo: ^ [22:24:14] Ah ok :) [22:24:34] ^d: you wanna swat that this afternoon? its pretty safe to SWAT I think [22:24:45] <^d> Yeah we should. [22:27:05] ^d: I'm not going anywhere tonight so I can support it if need be [22:27:14] can you do the backport? because you are wonderful? [22:27:16] mwalker: group: group changed 'root' to 'jenkins' mode changed '0444' to '0440' [22:27:18] and I'm tired of it [22:27:28] <^d> I can be wonderful. [22:27:49] <^d> Just wmf10, right? [22:28:50] ^d: just, yeah [22:28:54] wmf9 doesn't have it [22:29:04] and wmf10 is only on testwiki right now [22:29:30] <^d> Yep. [22:34:57] <^d> manybubbles: submodule all ready to merge to core. [22:35:12] ^d: thanks! [22:35:36] can you put it on https://wikitech.wikimedia.org/wiki/Deployments send me a link to it? [22:48:47] greg-g: Can I sneak in a scap update before the deploy window? 
There are 2 pending changes: removal of the sync-*-old scripts and a fix for sync-common on fenari [22:49:02] bd808: While you do that [22:49:47] bd808: Could you check that 1) wmf10 is up-to-date and has its VisualEditor submodule updated? (I merged in a commit while Reedy was checking it out this morning) and 2) wmf9's VisualEditor's lib/ve submodule is up to date (in yesterday's SWAT 'git submodule update' was forgotten) [22:50:29] bd808: Hah also today's SWAT is empty [22:50:58] bd808: sure thing [22:51:05] RoanKattouw: I was just going to git-deploy updates for scap itself, but your task sounds like a good thing for the swat window :) [22:52:27] !log Updated scap to 792a572 [22:52:32] Logged the message, Master [22:53:46] (03CR) 10Manybubbles: [C: 031] "Looks like linting to me." [operations/puppet] - 10https://gerrit.wikimedia.org/r/140665 (owner: 10Matanya) [22:54:45] ^d: can you add that update to the window? I don't have a link for it [22:54:53] <^d> One moment. [22:55:31] thanks [22:55:46] RoanKattouw: VE is at 3052e6b for the wmf10 branch (Creating new wmf/1.24wmf10 branch) [22:56:36] https://github.com/wikimedia/mediawiki-extensions-VisualEditor/branches [22:56:41] 3 behind, 2 ahead [22:57:39] https://github.com/wikimedia/mediawiki-extensions-VisualEditor/compare/master...wmf;1.24wmf10 [22:58:40] bd808: I think that's one out of date. If you run git pull in MW wmf10, it should pull in a commit I merged just after Reedy set things up on tin [22:58:51] and that commit will change the submodule pointer for VE [22:59:11] sounds like a swat thing, in 1 minute ;) [22:59:20] <^d> manybubbles: It's there. [22:59:26] ^d: thanks! [22:59:44] * bd808 agrees with greg-g [23:00:04] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140619T2300) [23:00:10] I'll do it [23:00:11] oh hey, look at that! 
[23:00:13] ;) [23:00:24] * greg-g is coping with dead SSD with sarcasm everywhere [23:00:39] MaxSem: Did you see RoanKattouw's requests above? [23:00:59] if they aren't on wiki page... :P [23:01:12] greg-g, dead ssd? that's unusual to hear about nowadays [23:01:19] MaxSem: It was on the wiki page yesterday and you didn't do it correctly :P [23:01:43] * MaxSem bites RoanKattouw [23:01:49] and you didn't check me:P [23:01:50] greg-g, I'm sorry to hear that you caught the wrong end of the bathtub [23:01:52] mwalker: maybe, haven't had a chance to do a real diagnostics, but there's something wonky going on. [23:03:01] Hi [23:03:02] I'm here [23:03:15] Hey that's my line [23:03:21] * marktraceur is helping with the MMV patches [23:03:50] (03PS1) 10QChris: Unqualify local variables [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140860 [23:06:17] !log maxsem Synchronized php-1.24wmf9/extensions/MultimediaViewer/: (no message) (duration: 00m 05s) [23:06:22] Logged the message, Master [23:06:32] !log maxsem Synchronized php-1.24wmf10/extensions/MultimediaViewer/: (no message) (duration: 00m 05s) [23:06:37] Logged the message, Master [23:06:55] tgr, please verify ^^^ :) [23:07:35] Sweet. 
[23:07:53] (03CR) 10QChris: Unqualify local variables (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140860 (owner: 10QChris) [23:08:03] * bd808 doesn't understand why scap events from tin aren't making it into logstash [23:09:07] bd808: because apparently we aren't allowed to know when deploys happen [23:09:10] see also: graphite [23:09:13] bd808, who cares - they're sowwwww faaasttttt!:P [23:10:28] tgr: I thought we backported the attribution section patch [23:10:29] wmf10 works [23:10:42] Oh, no, I'm wrong [23:10:53] greg-g: I see the messages on fluorine and I see other scap messages in logstash, but I'm not seeing messages from tin and fenari in logstash :( [23:11:07] wmf9 on arwiki doesn't, or not updated yet [23:11:09] could be an iptables thing [23:11:11] file an rt ticket [23:11:28] also a bunch of i18n messages seem to be missing [23:11:31] MaxSem: you are a scap hero, thanks for doing them [23:11:39] i haven't done SWAT in like two weeks [23:11:43] bd808: sounds firewall related [23:11:48] ori: I don't think it can be iptables, the events all go to fenari and then are bounced to logstash [23:11:56] oh, ori beat me to it [23:12:05] s/fenari/fluorine/ [23:12:24] greg-g: fatals are high: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1 [23:12:26] still [23:12:28] So everything I see on fluorine should end up in logstash [23:12:54] [2014-06-19 23:12:31] Fatal error: Call to undefined method Elastica\Exception\Bulk\ResponseException::getResponse() at /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php on line 171 [23:12:59] wtf [23:12:59] ^ ^d, manybubbles [23:13:12] <^d> We have a patch going out in the swat for this.
[23:13:17] <^d> [23:13:17] cool [23:13:19] sorry [23:13:20] <^d> :) [23:13:44] ^d, just tell me what to deploy) [23:14:25] <^d> https://gerrit.wikimedia.org/r/#/c/140848/ is the submodule update to wmf10 (only affected branch). [23:14:27] !log Restarted logstash service on logstash1001 [23:14:32] Logged the message, Master [23:16:39] !log maxsem Synchronized php-1.24wmf9/extensions/VisualEditor/: (no message) (duration: 00m 04s) [23:16:44] Logged the message, Master [23:17:09] marktraceur: isCommons() now works correctly on arwiki, but the button still seems wrong [23:17:27] any clue what's going on? [23:17:39] also half the tooltip texts are missing [23:18:07] !log maxsem Synchronized php-1.24wmf10/extensions/VisualEditor/: (no message) (duration: 00m 04s) [23:18:13] Logged the message, Master [23:18:16] RoanKattouw, ^^^ [23:18:17] (and the tooltips don't have tips but I suppose that's some local CSS hack they did) [23:18:27] !log maxsem Synchronized php-1.24wmf10/extensions/CirrusSearch/: (no message) (duration: 00m 03s) [23:18:32] Logged the message, Master [23:18:35] ^d, ^^^^ [23:18:44] <^d> ty! [23:18:59] <^d> ori: Those fatals should disappear now [23:19:56] !log maxsem Synchronized php-1.24wmf9/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:20:01] Logged the message, Master [23:20:39] !log maxsem Synchronized php-1.24wmf10/extensions/MobileFrontend/: (no message) (duration: 00m 03s) [23:20:45] Logged the message, Master [23:22:57] MaxSem: Did you scap? [23:23:03] Because there were i18n additions [23:23:10] marktraceur, ehhh [23:23:37] !log maxsem Started scap: Mark Traceur made me do it! [23:23:42] Logged the message, Master [23:25:01] Grazie [23:25:03] tgr: ^^ [23:27:01] MaxSem: You should figure out how to make LocalisationCache rebuilds as fast as the sync is now ;) [23:28:52] bd808, procure a 700 core server and rebuild everything is super-parallel way? 
[23:29:45] I still think that most of the time is lost to stat calls against the filesystem, but I haven't been disciplined in verifying that. [23:30:41] srsly, wth is up with Flow atm [23:31:38] MaxSem: It's either fs stat or flushes for the cdb key inserts (or possibly both). If only we kept /a/common in a ramdisk it would be much faster. :) [23:32:49] duuuuude, I proposed that even before you were hired! they told me taking 60% of /tmp for that only was too much! [23:38:52] !log maxsem Finished scap: Mark Traceur made me do it! (duration: 15m 14s) [23:38:56] Logged the message, Master [23:38:59] Yay [23:39:09] marktraceur, tgr ^^^ [23:39:47] RobH: ping [23:39:51] the messages are still missing [23:40:01] I see them? [23:40:03] tgr: Which are missing? [23:40:36] popups for author, title, source, chevron [23:41:35] Ah, see it now [23:41:42] i18n cache probably [23:41:47] But we've had this problem before -.- [23:42:02] MaxSem: multimediaviewer-title-popup-text [23:42:18] thx [23:45:11] marktraceur, tgr, MaxSem: that sounds like https://bugzilla.wikimedia.org/show_bug.cgi?id=66543 [23:45:50] nah, even funnier: it's not present in json files [23:46:01] Urgh it does [23:46:10] MaxSem: https://commons.wikimedia.org/w/api.php?action=query&meta=allmessages&ammessages=multimediaviewer-title-popup-text works [23:46:38] dafuq [23:47:38] Dafuq indeed [23:48:38] greg-g: Restarting logstash seems to have at least temporarily fixed the issue with missing scap.announce events. https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor shows a marker for the last sync. [23:49:26] ^d: fatals haven't gone away [23:49:31] though maybe it's different ones now [23:49:32] i haven't looked [23:51:19] <^d> I'm not seeing it on the most-frequent list on logstash. 
[23:51:50] !log maxsem Synchronized php-1.24wmf9/extensions/MultimediaViewer/: (no message) (duration: 00m 04s) [23:51:55] Logged the message, Master [23:52:02] !log that was a touch [23:52:06] Logged the message, Master [23:52:11] RoanKattouw: do you remember what was needed last week to fix the issue with the missing RL i18n messages after SWAT? [23:52:26] <^d> GlobalVarConfig is yelling alot. [23:53:18] <^d> ori: I'm not seeing the Cirrus errors on logstash anymore. [23:55:41] greg-g: I've just finished reading the backscroll about the parser cache thing [23:56:04] did you need anything in particular from me about it? [23:57:44] I have a few comments about it... [23:58:20] I would like to know why pc1002 has 7GB of swap [23:59:23] I would like to know how swapdeath led to a lack of ping response -- usually in swapdeath, ping response is still OK, unless there is a kernel panic [23:59:55] but experience suggests that kernel panic is a pretty normal response to OOM