[00:01:38] JohnLewis: and no is supposed to be bokmal? [00:03:34] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Last successful Puppet run was Tue 19 Aug 2014 22:02:59 UTC [00:04:04] JohnLewis: https://meta.wikimedia.org/wiki/Mailing_lists/List_info [00:04:08] so many redlinks :S [00:07:52] (03CR) 10Ori.livneh: [C: 031] Separate HHVM app servers backend. [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma) [00:30:09] (03PS1) 10Bsitu: Enable job queue to process notification on mediawikiwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155169 [00:36:49] (03PS1) 10Dzahn: wikistats - install deb package locally [operations/puppet] - 10https://gerrit.wikimedia.org/r/155173 [00:37:33] bye for now, /away [01:12:24] springle_: Thanks for the quick work on that table creation bug [01:12:44] bd808: yw. easy one [02:04:34] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Last successful Puppet run was Tue 19 Aug 2014 22:02:59 UTC [02:17:34] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 00:16:58 UTC [02:37:05] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Aug 20 02:37:02 UTC 2014 [02:40:43] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-20 02:39:40+00:00 [02:40:50] Logged the message, Master [02:53:34] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 00:53:14 UTC [02:53:34] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 00:53:14 UTC [03:12:11] !log LocalisationUpdate completed (1.24wmf17) at 2014-08-20 03:11:08+00:00 [03:12:18] Logged the message, Master [03:32:44] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Aug 20 03:32:40 UTC 2014 [04:05:34] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Last successful Puppet run was Tue 19 Aug 2014 22:02:59 UTC [04:06:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 20 04:05:41 UTC 2014 (duration 5m 40s) [04:06:54] Logged the message, Master [04:10:14] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures [04:17:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:18:05] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:29:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:30:04] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:30:55] (03CR) 10Nemo bis: mailman: use a new default theme (prettier mailman) (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) (owner: 10John F. 
Lewis) [04:32:54] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Wed Aug 20 04:32:50 UTC 2014 [05:08:14] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [05:10:20] (03PS1) 10KartikMistry: Enable webfonts in English Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155206 (https://bugzilla.wikimedia.org/69655) [05:11:28] (03PS2) 10KartikMistry: Enable webfonts in English Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155206 (https://bugzilla.wikimedia.org/69655) [05:26:14] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:51:34] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.008 second response time [05:58:34] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [05:59:14] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [05:59:14] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [06:06:34] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Last successful Puppet run was Tue 19 Aug 2014 22:02:59 UTC [06:13:05] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet last ran 29366 seconds ago, expected 14400 [06:13:54] RECOVERY - Puppet freshness on amssq61 is OK: puppet ran at Wed Aug 20 06:13:49 UTC 2014 [06:15:05] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:26:55] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:05] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Epic puppet fail [06:28:35] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:45] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:04] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:14] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:15] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:35] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:43] springle: legoktm said I should ask you about tool labs's db giving me the following error: Host '10.68.17.174' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' [06:40:54] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [06:41:05] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [06:41:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [06:43:10] Earwig: which db is doing that? [06:43:31] .labsdb [06:43:48] it's tools-db [06:43:55] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:03] ok, I don't know what backend that translates to. Coren would. 
I'll check them all [06:45:15] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:45:35] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:45:35] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:45:44] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:14] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:15] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:47:04] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:51:25] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:05] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:53:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [06:55:15] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures [06:58:55] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:59:22] Earwig: I don't appear to have access to tools-db, only to labsdb boxes. You'll have to ask Coren, or maybe scfc_de [07:00:03] hmm... alright, well, thanks for your help [07:01:24] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53277 bytes in 0.704 second response time [07:02:40] (03PS1) 10Springle: Be kinder to future self and name files clearly. [operations/puppet] - 10https://gerrit.wikimedia.org/r/155211 [07:03:14] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: Epic puppet fail [07:12:59] (03CR) 10Springle: [C: 032] Be kinder to future self and name files clearly. [operations/puppet] - 10https://gerrit.wikimedia.org/r/155211 (owner: 10Springle) [07:13:15] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:22:14] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:03:10] (03PS1) 10Aude: Fix config for specialSiteLinkGroups in Wikibase [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155218 [08:03:20] (03PS1) 10Calak: Flagged Revisions configuration for uk.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155219 (https://bugzilla.wikimedia.org/67748) [08:46:57] (03CR) 10Hashar: "Thanks for the review. Should I keep 'critical => true' commented out or maybe explicitly define it as false?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) (owner: 10Hashar) [09:12:45] (03CR) 10Filippo Giunchedi: [C: 031] "we could just get rid of the critical => false line, it is simple enough that can be added if needed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) (owner: 10Hashar) [09:14:17] (03CR) 10Filippo Giunchedi: [C: 031] ssl_ciphersuite - change Header add to Header set [operations/puppet] - 10https://gerrit.wikimedia.org/r/155016 (owner: 10Chmarkine) [09:15:06] (03CR) 10Filippo Giunchedi: [C: 031] Turn on elasticsearch row awareness for shard allocation [operations/puppet] - 10https://gerrit.wikimedia.org/r/153805 (owner: 10Ottomata) [09:15:22] (03PS2) 10Hashar: contint: monitor tmpfs mount on production slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) [09:15:39] (03CR) 10Hashar: "PS2 remove the comment ;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) (owner: 10Hashar) [09:15:54] (03CR) 10Filippo Giunchedi: [C: 031] contint: monitor tmpfs mount on production slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) (owner: 10Hashar) [09:16:10] godog: I guess you can lend that monitoring change :) [09:17:03] hashar: heheh lend? [09:20:22] err [09:20:27] godog: s/lend/land/ :-D [09:20:34] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 07:19:53 UTC [09:21:16] AH [09:21:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: monitor tmpfs mount on production slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) (owner: 10Hashar) [09:21:27] zuul is broken so I can play with the monitoring system [09:22:05] hashar: done [09:27:33] !log restarted Jenkins Gearman plugin. 
[09:27:39] Logged the message, Master [09:28:34] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail [09:28:57] what the hell [09:29:01] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type nrpe::monitor::service at /etc/puppet/manifests/role/ci.pp:167 on node gallium.wikimedia.org [09:31:15] PROBLEM - puppet last run on lanthanum is CRITICAL: CRITICAL: Epic puppet fail [09:31:52] same deal [09:31:53] (03PS1) 10Hashar: nrpe::monitor::service -> nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/155227 [09:32:27] (03CR) 10Hashar: contint: monitor tmpfs mount on production slaves (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/155054 (https://bugzilla.wikimedia.org/69733) (owner: 10Hashar) [09:33:09] godog: sorry that broke puppet due to some lame typo : https://gerrit.wikimedia.org/r/155227 :-/ [09:33:23] hashar: no problem, I'll merge it [09:33:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] nrpe::monitor::service -> nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/155227 (owner: 10Hashar) [09:34:09] done [09:35:50] (03CR) 10Hashar: jenkins: use openjdk-7-jre-headless (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/153764 (owner: 10Hashar) [09:37:13] all good [09:37:15] RECOVERY - puppet last run on lanthanum is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:37:35] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:37:54] hey godog [09:39:59] greetings ori ! [09:40:23] * ori back in PDT [09:40:24] !log uploaded hhvm_3.3-dev+20140728+wmf5 to carbon [09:40:29] \o/ [09:40:30] Logged the message, Master [09:41:33] ori: jetlagged much? it is lateish [09:41:56] i just woke up :/ [09:43:05] "early start" [09:44:38] shall i update mw1017 & mw1053? [09:45:07] IIRC puppet has ensure => latest but worth double checking [09:45:52] I got as far as sync-common on mw1019 yesterday then didn't want to run scap on tin not knowing what I was doing [09:45:54] yes but as you know in the mediawiki community we prefer fallible manual work to perfect automation [09:47:18] requires some balancing :) ensure => latest is IMO a loaded shotgun pointed at everyone's feet, but makes sense in this case for example [09:49:18] i think ensure => present is fine [09:49:47] we should just set an 'hhvm' grain via puppet on all hhvm machines [09:50:39] I like ensure => "specific_version" [09:50:44] then we can do salt -g 'cluster:hhvm' cmd.run 'apt-get update hhvm' [09:50:47] that works too [09:50:59] s/update/upgrade [09:51:37] also, should we deploy scap via git? 
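(A rough sketch of the grain-targeting idea floated just above. The 'cluster: hhvm' grain name, the grains-file location and the exact apt invocation are illustrative assumptions, not what was deployed; note that the salt CLI matches grains with uppercase -G, and that upgrading a single installed package is done with apt-get install rather than apt-get upgrade.)

    # On each HHVM app server, drop a custom grain; in practice puppet would manage this file.
    echo 'cluster: hhvm' >> /etc/salt/grains

    # From the salt master, target only those minions and pull in the newer hhvm package.
    salt -G 'cluster:hhvm' cmd.run 'apt-get -q update && apt-get -y install hhvm'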
[09:52:00] it was formerly deployed via git; the primary reason to move to trebuchet was to allow bryan davis (a non-root) some finer control over deployments [09:52:16] but he has explicitly had to set it aside and move on to other things [09:52:28] and my expectation is that future scap work will come from me or ops [09:52:46] on the one hand, i think we should refine trebuchet so that it's not brittle, ever [09:53:00] on the other, i think deploying scap via git exec is one moving part fewer [09:53:20] so first make trebuchet not brittle, THEN abandon it for git ;) [09:53:51] sound advice is such a party pooper [09:54:00] ok, fair :P [09:55:23] hhvm fatal / error log aggregation locally and on fluorine via syslog looks like it's working well [09:59:44] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Aug 20 09:59:40 UTC 2014 [10:09:10] (03PS1) 10QChris: Remove udp2log stream to University of Minnesota [operations/puppet] - 10https://gerrit.wikimedia.org/r/155230 [10:11:04] qchris: fyi, last time you guys removed one a few days ago an icinga alert freaked out about stale logs [10:11:15] may want to ack that (or remove the alert) if it's likely to happen again [10:11:32] ori: Thanks for the heads up. [10:11:53] Should not happen here, as the stream has been disabled a few days ago, and [10:12:06] we did not see icinga alert us about it. [10:12:43] But I'll double-check :-) [10:13:59] (03PS1) 10Ori.livneh: Canonicalize location of $wgSiteMatrixFile [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155232 [10:14:29] (03CR) 10Ori.livneh: [C: 032] Canonicalize location of $wgSiteMatrixFile [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155232 (owner: 10Ori.livneh) [10:14:35] (03Merged) 10jenkins-bot: Canonicalize location of $wgSiteMatrixFile [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155232 (owner: 10Ori.livneh) [10:14:55] ori: Sorry. I misread your comment. You were explicitly referring to the event a few days ago. [10:15:02] There the root partition filled up. [10:15:13] That was not related to the disabling of the filter, [10:15:14] ah, ok, so not related [10:15:17] disregard, then [10:16:25] !log ori Synchronized wmf-config/CommonSettings.php: Id2d5cfa4c: Canonicalize path to $wgSiteMatrixFile (duration: 00m 06s) [10:16:31] Logged the message, Master [10:22:56] (03PS1) 10Ori.livneh: Canonicalize some remaining references to /apache symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155233 [10:24:24] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /a/common/). [10:25:12] (03CR) 10Ori.livneh: [C: 032] Canonicalize some remaining references to /apache symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155233 (owner: 10Ori.livneh) [10:25:16] (03Merged) 10jenkins-bot: Canonicalize some remaining references to /apache symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155233 (owner: 10Ori.livneh) [10:26:10] godog: you know, for reimaging [10:26:24] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[10:26:26] we could just have salt declare the minion's public key for the host as a file [10:26:36] !log ori updated /a/common to {{Gerrit|Ic9d8837b1}}: Canonicalize some remaining references to /apache symlink [10:26:37] and notify the minion service [10:26:42] Logged the message, Master [10:27:02] we could do that by mounting the salt public key dir as a puppet file path [10:27:34] or use generate() or something [10:27:59] is it always the case (and will always be the case) that puppet masters are salt masters and vice versa? [10:28:31] I wouldn't assume that, but perhaps that's already the case (that there is this assumption) [10:30:46] i guess the salt master doesn't have the private key; that'd be silly [10:32:37] I think I'm not getting what problem you are trying to solve [10:33:31] !log ori Synchronized w/mobilelanding.php: Ic9d8837b1: Canonicalize some remaining references to /apache symlink (duration: 00m 05s) [10:33:36] Logged the message, Master [10:33:43] !log ori Synchronized w/touch.php: Ic9d8837b1: Canonicalize some remaining references to /apache symlink (duration: 00m 05s) [10:33:49] Logged the message, Master [10:34:08] godog: when the machine is reimaged, it generates new salt keys [10:34:18] those are rejected by the salt master, at least initially [10:34:32] those _joe_ wrote a script to make accepting the new puppet key and new salt key a single operation [10:34:36] so if that works well the point is moot [10:35:27] yep I used it yesterday and got it working [10:36:43] godog: none of the syncs i just did made it to mw1019 because it's still outside the mediawiki-installation dsh group [10:36:54] should we should un-comment it and re-sync-common on that host [10:37:24] ack, uncommenting now [10:39:47] (03PS1) 10Filippo Giunchedi: hhvm: restore mw1019 in dsh mediawiki-installation [operations/puppet] - 10https://gerrit.wikimedia.org/r/155237 [10:40:00] (03CR) 10Ori.livneh: [C: 031] hhvm: restore mw1019 in dsh mediawiki-installation [operations/puppet] - 10https://gerrit.wikimedia.org/r/155237 (owner: 10Filippo Giunchedi) [10:40:06] i'll run sync-common meanwhile [10:43:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] hhvm: restore mw1019 in dsh mediawiki-installation [operations/puppet] - 10https://gerrit.wikimedia.org/r/155237 (owner: 10Filippo Giunchedi) [10:44:10] ggrr ! b868d71..595398e production -> origin/production (unable to update local ref) [10:50:43] ori: should be working [10:54:28] ori: I wanted also to retry a reimage asap and see how it does if that works [10:54:54] nod, i'd like to try and add an auto i18n update step too [10:54:59] testing it now [10:58:57] (03PS14) 10Hashar: sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 [11:02:29] (03PS1) 10Steinsplitter: Adding new domain to wgCopyUploadsDomains whitelist. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155239 (https://bugzilla.wikimedia.org/69777) [11:03:25] (03PS2) 10Hashar: Adding new domain to wgCopyUploadsDomains whitelist. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155239 (https://bugzilla.wikimedia.org/69777) (owner: 10Steinsplitter) [11:04:22] (03CR) 10Hashar: [C: 032] "deploying :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155239 (https://bugzilla.wikimedia.org/69777) (owner: 10Steinsplitter) [11:04:26] (03Merged) 10jenkins-bot: Adding new domain to wgCopyUploadsDomains whitelist. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155239 (https://bugzilla.wikimedia.org/69777) (owner: 10Steinsplitter) [11:05:37] !log hashar Synchronized wmf-config/InitialiseSettings.php: new domain www.veikkos-archiv.com to wgCopyUploadsDomains {{gerrut|155239}} {{bug|69777}} (duration: 00m 03s) [11:05:43] Logged the message, Master [11:05:49] 11:05:36 ['sync-common', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings.php', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1019 returned [127]: bash: sync-common: command not found [11:05:49] :-((( [11:06:08] seems some hosts are filled in dsh but not yet fully active [11:06:24] !log hashar Synchronized wmf-config/InitialiseSettings.php: new domain www.veikkos-archiv.com to wgCopyUploadsDomains {{gerrut|155239}} {{bug|69777}} (duration: 00m 03s) [11:06:55] !log mw1019 is missing sync-common causing sync issues. [11:07:02] Logged the message, Master [11:07:46] that was expected to be working, mh and scap is on the machine, though sync-common might not in the PATH that the command above is using hashar_ ori [11:08:24] and I am wondering whether it serves as a rsync proxy [11:08:45] yep looks like the /usr/local/bin symlinks are not in place [11:09:40] they're not supposed to be [11:09:57] echo /etc/profile.d/add_scap_to_path.sh [11:10:09] # Add scap to $PATH for non-root users [11:10:09] if [ "$(id -u)" -ne "0" ]; then [11:10:09] export PATH="$PATH:/srv/deployment/scap/scap/bin" [11:10:11] fi [11:10:11] hashar_ :) [11:10:20] here is my sync-file output http://paste.openstack.org/show/97730/ [11:10:47] Steinsplitter: I guess you can close the bug report :-] [11:11:01] hashar, just wait pelase. [11:11:04] please. [11:11:13] ori: sure I am not doing anything :] [11:11:18] just giving you guys some trace hehe [11:31:33] apergos: hi there, would you be the person to ask about labs not getting up to date wikipedia dumps or labs folks ? [11:31:33] Coren: filed RT #8163 about network config [11:31:46] not yet [11:32:16] there's likely a backlog, if you don't see new files showing up later today then you should ask Coren I guess [11:32:30] thanks [11:33:22] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [11:33:22] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [11:35:52] actually I would just ask him as soon as he's on [11:36:17] Coren: from dataset1001 I have: ls: cannot access /mnt/dumps: Input/output error [11:36:23] though it shows it mounted, so back to you... 
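(Context for the 'bash: sync-common: command not found' failures above, and the same error later in the log: files under /etc/profile.d, like the add_scap_to_path.sh snippet quoted above, are only sourced by login shells, so a command run non-interactively over ssh/dsh never gets the extra PATH entry. A minimal illustration, with the hostname as a placeholder and the behaviour sketched rather than taken from the incident itself.)

    # A forced login shell sources /etc/profile and /etc/profile.d/*.sh,
    # so the scap bin directory is on PATH and sync-common resolves.
    ssh mw1019.eqiad.wmnet 'bash -l -c "command -v sync-common"'

    # A plain remote command runs a non-login shell, skips /etc/profile.d,
    # and fails the same way the scap run did.
    ssh mw1019.eqiad.wmnet 'command -v sync-common || echo "not in PATH"'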
[11:36:24] !log Updating Jenkins Job Builder fork 666e953..0268581 [11:36:31] Logged the message, Master [11:37:22] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [11:37:22] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [11:41:49] (03PS1) 10Bartosz Dziewoński: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) [11:48:36] (03PS1) 10Ori.livneh: Update all symlinks to /apache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155242 [11:48:41] (03CR) 10jenkins-bot: [V: 04-1] Update all symlinks to /apache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155242 (owner: 10Ori.livneh) [11:50:02] hashar: let me guess [11:50:06] jenkins references /apache ? [11:52:27] Could not open input file: docroot/bits/w/404.php [11:52:46] but it's there [11:52:55] and it points to the right location [11:56:47] hashar: hey? [11:57:02] no clue :-D [11:57:05] ah the job fails [11:57:37] yeah the phplint follow symlink [11:57:45] seems some point to non existing files [11:57:48] right, but only of files that have changed [11:58:00] they were pointing to nonexistent files before [11:58:10] but they haven't changed since forever [11:58:14] so they never got checked [11:59:04] ori: and if they ever got changed, I guess jenkins job result got ignored / bypassed [11:59:11] phplint only check files being changed by the patchset [11:59:22] so I guess you can override it [11:59:38] yes, sounds right [11:59:48] (03Draft1) 10Filippo Giunchedi: swift-drive-audit: import icehouse version [operations/puppet] - 10https://gerrit.wikimedia.org/r/155244 [11:59:51] (03CR) 10Ori.livneh: [C: 032 V: 032] Update all symlinks to /apache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155242 (owner: 10Ori.livneh) [12:00:00] (03Draft1) 10Filippo Giunchedi: drive-audit: clear up exit status [operations/puppet] - 10https://gerrit.wikimedia.org/r/155245 [12:00:48] (03CR) 10Hashar: "Per IRC discussion, this is fine. The Jenkins phplint job only check jobs that are being changed in HEAD. I guess they never got checke" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155242 (owner: 10Ori.livneh) [12:03:28] !log ori updated /a/common to {{Gerrit|Ic3fe1ef83}}: Update all symlinks to /apache [12:03:35] Logged the message, Master [12:05:12] holy shit [12:05:17] i think /apache may finally be dead [12:05:21] it's working on mw1019 [12:05:31] going to try a couple of other hosts cautiously [12:06:45] \o/ [12:07:18] next thing: kill /a :-D [12:07:25] would stuff outside mw be using it? e.g. extensions? [12:07:36] hashar: while we are at it, what do you think re: https://bugzilla.wikimedia.org/show_bug.cgi?id=68255 ? [12:07:55] ah [12:08:03] godog: I should have it raised to the mediawiki/core team :-D [12:08:13] mind if i run scap? [12:08:20] i updated a few hosts by hand, everything looks good [12:08:54] the payload consists of nothing but symlink updates [12:10:07] *crickets* [12:11:08] i'll go for it in a minute unless someone objects [12:12:03] actually, sync-dir of docroot is sufficient [12:12:58] hashar: heheh is it on your radar or should I? 
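(One way to hunt for the dangling symlinks behind the 'Could not open input file: docroot/bits/w/404.php' lint failures above: in a bare checkout, where the /apache tree the old links pointed at does not exist, they show up as broken links. The paths are illustrative and this is not the command that was actually run.)

    # -xtype l matches symlinks whose target does not resolve, i.e. links that are
    # broken in this checkout even if they happen to work on the production hosts.
    find docroot/ w/ -xtype l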
[12:13:02] (03PS1) 10Yuvipanda: stats: Install sqlite3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/155251 [12:13:08] godog: please do :] [12:13:48] maybe we could use hhvm to handle linting [12:14:36] hashar: why core and not release/qa? [12:15:06] cause on core there will be several folks able to handle it [12:15:14] whereas in release/qa that will be assigned to me :D [12:15:21] (or maybe reedy) [12:16:16] anyway, godog proposed to use PECL extension runkit instead. that might be fine [12:16:48] that was frowned upon too, see my last comment [12:16:58] anyways I'll poke core [12:17:44] if I knew C, I would just add a recursive option to Zend PHP [12:17:55] php -l directory [directory ..] [12:19:33] i think it's a release/qa issue, tbh [12:21:46] I honestly dont want to deal with it [12:21:49] just merely pointed the issue [12:22:04] ??? [12:22:05] maybe we can have scap just xargs | php -l [12:26:21] there's https://github.com/nikic/PHP-Parser [12:26:26] and a bunch of other parsers [12:26:44] how much they match php or hhvm in behavior is another matter [12:26:54] but a parser that barfs at edge cases that php tolerates it not necessarily a bad thing [12:27:10] only the reverse is true (a parser that tolerates edge cases php won't) [12:27:14] yeah or http://github.com/facebook/pfff/ :D [12:27:33] sorry, but this sounds like a great job for release/qa [12:27:39] or we could run PHP internal tokenizer [12:27:58] otherwise i'm a bit confused about what qa means [12:30:44] godog: i synced the docroot change [12:31:08] i'm leaving /apache in place for now so we can wait a couple of days and see if any oddities crop up [12:32:14] but technically it should be safe to remove; it's gone on mw1019 and mw1017 [12:33:31] i'll email ops@ [12:34:28] ori: cool, removing stuff always feels nice [12:38:13] * mark lobotomizes godog [12:39:04] godog: email sent, see if there's anything i should add [12:39:44] * godog stares at mark with a vacuous look now [12:40:24] * ori places a pen in godog's hand and authorizes signs a few documents [12:42:46] hehe I'm not sure I like where this is going [12:44:20] godogestein? [12:44:27] frankendog? [12:45:04] frankendog! [12:46:48] (03PS1) 10Filippo Giunchedi: filippo: /home away from home [operations/puppet] - 10https://gerrit.wikimedia.org/r/155252 [12:47:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] filippo: /home away from home [operations/puppet] - 10https://gerrit.wikimedia.org/r/155252 (owner: 10Filippo Giunchedi) [13:07:50] is swift equiad overloaded https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Swift+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report ? 
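(Back to the lint thread above: a minimal sketch of the 'php -l directory' / 'xargs | php -l' idea. Directory names are illustrative and this is not the Jenkins job's actual command.)

    # Lint every regular .php file in parallel; php -l exits non-zero on a parse
    # error, so xargs returns a failing status a CI job can act on.
    # -type f (without -L) skips symlinks entirely, sidestepping the dangling-link
    # problem that bit the phplint job earlier.
    find docroot/ w/ -name '*.php' -type f -print0 | xargs -0 -n1 -P4 php -l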
[13:10:18] Steinsplitter: not really no, all that load comes artificially from ms-be1003 having xfs in a funny state [13:10:55] !log reboot ms-be1003, xfs errors/panics [13:11:01] Logged the message, Master [13:13:01] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [13:13:01] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: Connection refused by host [13:13:12] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: Connection refused by host [13:13:12] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: Connection refused by host [13:13:22] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: Connection refused by host [13:13:32] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: Connection refused by host [13:14:02] PROBLEM - puppet last run on ms-be1003 is CRITICAL: Connection refused by host [13:14:11] PROBLEM - Disk space on ms-be1003 is CRITICAL: Connection refused by host [13:14:21] PROBLEM - DPKG on ms-be1003 is CRITICAL: Connection refused by host [13:14:22] PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: Connection refused by host [13:14:22] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: Connection refused by host [13:14:22] PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: Connection refused by host [13:14:22] PROBLEM - swift-container-server on ms-be1003 is CRITICAL: Connection refused by host [13:14:22] PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: Connection refused by host [13:14:31] PROBLEM - check configured eth on ms-be1003 is CRITICAL: Connection refused by host [13:14:32] PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: Connection refused by host [13:14:32] PROBLEM - RAID on ms-be1003 is CRITICAL: Connection refused by host [13:14:51] PROBLEM - swift-object-server on ms-be1003 is CRITICAL: Connection refused by host [13:14:51] PROBLEM - check if dhclient is running on ms-be1003 is CRITICAL: Connection refused by host [13:15:20] godog: will probably resolve https://bugzilla.wikimedia.org/show_bug.cgi?id=69760 ""backend-fail-internal error while [13:15:20] deleting files at Commons""... andre__ commented about it on wikitech-l [13:15:28] I guess that is the mail that triggered your investigation [13:15:37] wikitech-l -> engineering list [13:16:36] apergos: On it. [13:17:57] apergos: Eff. Yep. same boo-boo. At least one of the shelves is dead. [13:18:02] hashar: heh I'm not sure what mw does when it emits that message [13:18:09] * Coren attempts to figure out which and take it out. 
[13:18:45] includes/filebackend/FileBackendMultiWrite.php:328: $status->fatal( 'backend-fail-internal', $cBackend->getName() ); [13:18:45] includes/filebackend/FileBackendStore.php:1120: $subStatus = Status::newFatal( 'backend-fail-internal', $this->name ); [13:18:46] includes/filebackend/SwiftFileBackend.php:539: $status->fatal( 'backend-fail-internal', $this->name ); [13:19:28] see also /a/mw-log/swift-backend.log [13:19:45] i see a bunch of 401 unauthorized [13:20:48] possibly related, the exception log has a bunch of CentralAuth things saying the user doesn't exist [13:24:36] ori: though all from imagescalers [13:25:53] there's a nice spike of fatals referencing http://zh.wikipedia.org/zh/Template:Recent_changes_article_requests/full [13:26:19] the title doesn't sound full of win [13:26:42] i don't know the etiquette for nuking something like that into space so i usually rely on Carmela or Nemo_bis (hello) [13:27:35] i usually rely on domas [13:29:41] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Puppet has 1 failures [13:29:51] i asked on -stewards [13:29:59] matanya's on it [13:30:24] apergos: Just to make things fun, according to the controller there's nothing wrong with any shelf. The kernel just is no longer able to write to at least one of them. [13:30:32] if it will every let me do it [13:32:08] apergos: ... aaaand because it doesn't actually /fail/ (just hangs there indefinitely) I have no diagnostic message nor even hints at which drive/shelf/etc is actually the issue. [13:34:40] (03PS1) 10Aude: Add item-redirect to OAuth permissions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155257 [13:34:46] (03CR) 10Manybubbles: [C: 031] Fix config for specialSiteLinkGroups in Wikibase [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155218 (owner: 10Aude) [13:34:51] ori: done [13:36:12] thanks! :) [13:36:56] !log disabling puppet on analytics1027 temporarily [13:37:04] Logged the message, Master [13:37:07] mark: ze hardware, she is fail. :-( [13:41:21] ottomata: can I interest you in merging https://gerrit.wikimedia.org/r/#/c/155251/ when you have a few moments? Thanks :) [13:42:06] (03PS2) 10Ottomata: stats: Install sqlite3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/155251 (owner: 10Yuvipanda) [13:42:15] (03CR) 10Ottomata: [C: 032 V: 032] stats: Install sqlite3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/155251 (owner: 10Yuvipanda) [13:44:45] (03PS2) 10Ottomata: Remove udp2log stream to University of Minnesota [operations/puppet] - 10https://gerrit.wikimedia.org/r/155230 (owner: 10QChris) [13:44:53] (03CR) 10Ottomata: [C: 032 V: 032] Remove udp2log stream to University of Minnesota [operations/puppet] - 10https://gerrit.wikimedia.org/r/155230 (owner: 10QChris) [13:47:02] !log experimenting with lowering merge factor on enwiki's Cirrus index - should improve query performance at the cost of more background tasks in the Elasticserach cluster [13:47:09] Logged the message, Master [13:47:41] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:48:08] Coren: reading the backlog now, that sux [13:48:16] ottomata: ty [13:48:35] apergos, Coren : is that related to my question? [13:48:49] mutante: any reason I shouldn't do this? [13:48:49] https://rt.wikimedia.org/Ticket/Display.html?id=8140 [13:50:59] matanya: yep. 
no filesystem = no copies [13:51:14] ah, that is clear [13:51:16] thanks [13:57:54] apergos: I'm fiddling with raw devices right now trying to figure out if it's a single failed shelf. [13:58:08] "Error deleting file: An unknown error occurred in storage backend "local-swift-eqiad" [13:58:48] is there a ticket for todays swift errors ? [14:00:20] thedj: yes, https://bugzilla.wikimedia.org/show_bug.cgi?id=69760 [14:02:01] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:02:02] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 3467 seconds ago with 0 failures [14:02:11] RECOVERY - Disk space on ms-be1003 is OK: DISK OK [14:02:12] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:02:12] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:02:21] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [14:02:22] RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:02:22] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:02:22] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:02:22] RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:02:22] RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:02:22] RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:02:31] RECOVERY - check configured eth on ms-be1003 is OK: NRPE: Unable to read output [14:02:32] RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:02:32] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:02:32] RECOVERY - RAID on ms-be1003 is OK: OK: optimal, 14 logical, 14 physical [14:02:51] RECOVERY - check if dhclient is running on ms-be1003 is OK: PROCS OK: 0 processes with command name dhclient [14:02:51] RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:03:01] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [14:10:58] !log changing group ownership and permissions on raw webrequest data in hdfs.  Users now must be in the analytics-privatedata-users group to access. [14:11:03] Logged the message, Master [14:12:02] PROBLEM - Host labstore1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:42] RECOVERY - Host labstore1003 is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [14:13:59] (03CR) 10Andrew Bogott: [C: 032] mailman: new languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/155164 (owner: 10John F. Lewis) [14:16:01] (03CR) 10Andrew Bogott: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155164 (owner: 10John F. Lewis) [14:19:54] apergos: ... 
AAUGH! And, of course, being unable to even test what was failed because the filesystem were mounted/md active, I had to reboot. And after reboot, everything works perfectly fine. [14:20:14] * Coren runs write load tests on the individual arrays. [14:23:11] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [14:23:22] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [14:25:12] of coure it does (work after reboot) [14:26:18] I'm writing stuff to each array right now, and I'll do so repeatedly until one of them goes boom. [14:28:21] PROBLEM - NTP on labstore1003 is CRITICAL: NTP CRITICAL: Offset unknown [14:32:21] RECOVERY - NTP on labstore1003 is OK: NTP OK: Offset -0.001352787018 secs [14:33:39] ottomata: can you install elasticsearch1.3.2 to apt? [14:33:55] its update time [14:35:01] <^d> wheee! [14:35:23] oh, can doo [14:35:27] thanks! [14:35:32] I should have asked earlier [14:38:23] (03PS2) 10Manybubbles: Update plugins for 1.3 [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/154827 [14:39:04] (03PS1) 10Ottomata: Update elasticsearch debs to 1.3 versions [operations/puppet] - 10https://gerrit.wikimedia.org/r/155265 [14:39:34] (03CR) 10Chad: "We talked about this yesterday. Everyone pretty much agreed "let's try it, but next week when things are quiet after the upgrade"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153805 (owner: 10Ottomata) [14:39:42] (03CR) 10Ottomata: [C: 032 V: 032] Update elasticsearch debs to 1.3 versions [operations/puppet] - 10https://gerrit.wikimedia.org/r/155265 (owner: 10Ottomata) [14:41:46] (03PS1) 10Mark Bergsma: Allocate codfw public/private subnets for rows A-D [operations/dns] - 10https://gerrit.wikimedia.org/r/155266 [14:42:09] (03CR) 10Hashar: "Giuseppe proposed to use 'apt-get build-dep hhvm' instead. I have no idea how to ingrate it in puppet though. Maybe it is as simple as a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [14:45:00] hi robh [14:45:27] manybubbles: there it is! [14:45:28] http://apt.wikimedia.org/wikimedia/pool/main/e/elasticsearch/ [14:45:36] thanks! [14:45:56] (03CR) 10Manybubbles: [C: 032 V: 032] Update plugins for 1.3 [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/154827 (owner: 10Manybubbles) [14:47:00] !log upgrading elasticsearch plugins on all elasticsearch servers in preparation to upgrade to elasticsearch 1.3 - if we roll back we'll have to redeploy the plugins [14:47:06] Logged the message, Master [14:48:19] ottomata: got some more spare cycles to mess with Gerrit ? :-] [14:48:44] ottomata: I got a configuration change pending to tweak the test results reported by Jenkins (Qchris reviewed it on monday) [14:49:06] !log installing elasticsearch 1.3.2 on elasticsearch1001 only right now as a test [14:49:12] Logged the message, Master [14:49:52] manybubbles: Which of us wants to SWAT today? [14:50:11] anomie: either way is fine with me [14:50:19] I'm doing an elasticsearch upgrade already so I'm distracted [14:50:23] manybubbles: I'll do it [14:50:26] cool [14:50:49] <^d> elastic1001? [14:50:53] <^d> Or elasticsearch1001? [14:51:52] aude: Sanity check on 155218: it doesn't depend on anything not already in wmf16? [14:52:42] (or it only applies to wikidata.org?) 
[14:53:22] looking [14:54:37] it's mainly for wikidata but should also be ok om wmf16 [14:54:44] then some things need touching [14:54:56] extensions/Wikidata/extensions/lib [14:54:59] ah [14:55:09] extensions/Wikidata/extensions/Wikibase/lib/resources/wikibase.sites.js [14:55:15] extensions/Wikidata/extensions/Wikibase/lib/resources/wikibase.Site.js [14:55:19] and [14:55:36] extensions/Wikidata/extensions/Wikibase/lib/includes/modules/SitesModule.php [14:57:05] only in wmf17 [14:58:17] aude: All those only need touching in wmf17? [14:59:04] yes [15:00:39] * anomie sees jouncebot still isn't jouncing [15:00:51] (03PS2) 10Anomie: Fix config for specialSiteLinkGroups in Wikibase [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155218 (owner: 10Aude) [15:00:58] (03CR) 10Anomie: [C: 032] "SWAT" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155218 (owner: 10Aude) [15:01:02] (03Merged) 10jenkins-bot: Fix config for specialSiteLinkGroups in Wikibase [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155218 (owner: 10Aude) [15:01:32] !log anomie Synchronized wmf-config/Wikibase.php: SWAT: Fix config for specialSiteLinkGroups in Wikibase [[gerrit:155218]] (duration: 00m 09s) [15:01:37] Logged the message, Master [15:01:42] !log anomie Synchronized php-1.24wmf17/extensions/Wikidata/extensions/Wikibase/lib/: SWAT: Touch files on advice of Wikidata folks (duration: 00m 09s) [15:01:43] aude: ^ test please [15:01:53] Logged the message, Master [15:02:00] godog: Errors scapping: "mw1019 returned [127]: bash: sync-common: command not found" [15:02:27] checking [15:03:30] i had to clear local storage but works now [15:04:04] Anything else needed to avoid the local storage problem? Or should we move on to the second patch? [15:04:06] anomie: yep newly provisioned machine, in theory should have scap in PATH but I guess it isn't looking at /etc/profile.d [15:04:16] would need to look if there is anything else to touch and can do follow up after swat, if needed [15:04:40] (03PS2) 10Anomie: Add item-redirect to OAuth permissions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155257 (owner: 10Aude) [15:04:47] (03CR) 10Anomie: [C: 032] "SWAT" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155257 (owner: 10Aude) [15:04:51] (03Merged) 10jenkins-bot: Add item-redirect to OAuth permissions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155257 (owner: 10Aude) [15:04:54] anomie: does it otherwise cause problems to the deployment? 
the machine isn't getting traffic [15:05:15] !log anomie Synchronized wmf-config/CommonSettings.php: SWAT: Add item-redirect to OAuth permissions [[gerrit:155257]] (duration: 00m 09s) [15:05:21] Logged the message, Master [15:05:32] magnus will verify that one, but should be ok [15:05:35] godog: Just that the machine may not be up-to-date when it does start receiving traffic, without manual resyncing [15:06:39] anomie: ack, will probably reimage it anyway [15:07:05] aude: Let me know if there's anything else needing doing for that local storage issue, otherwise SWAT is done [15:08:06] could be it just takes a few minutes to refresh [15:08:12] i will ask others to test [15:14:16] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:14:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:17:16] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [15:17:18] !log manually lowered elasticsearch recovery speeds to stem off high load caused by healing the restart of elastic1001 - we were slowing down enough that we were filling the pool counter [15:17:24] Logged the message, Master [15:17:25] ^d: ^^^^^ [15:17:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [15:17:40] <^d> Boo :( [15:17:41] ^d: dropped it to 2 streams at 30mb [15:17:55] pretty drastic decrease - just because I wanted the pain to stop [15:18:55] we could probably raise it some more - but I want to be somewhat careful with spikes from this shit [15:19:04] I should go poke the bug they have about restarts taking forever again [15:20:29] * Coren is annoyed. [15:20:38] apergos: I am failing to make it fail. :-( [15:20:45] * Coren writes moar stuff. [15:21:34] The only thing I see atm is that one of the shelves is ~10% slower than the other two. [15:22:18] oh joy [15:23:12] Possible things to do if I can't manage to make it fail: try again with just the one shelf. [15:23:44] Right now, I was striping over all three. 
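(A sketch of the kind of raw write exercise being described. The device names, sizes and md layout are invented for illustration, and writes like this destroy data, so they only make sense against arrays that are already out of service.)

    # Hammer each shelf's array with direct (uncached) writes in parallel and watch
    # for the kernel to stall or error out, which is the symptom being chased here.
    for dev in /dev/md/shelf1 /dev/md/shelf2 /dev/md/shelf3; do
        dd if=/dev/zero of="$dev" bs=1M count=4096 oflag=direct conv=fsync &
    done
    wait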
[15:33:56] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 13:32:55 UTC [15:44:28] (03PS1) 10Hashar: Jenkins test for tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/155279 (https://bugzilla.wikimedia.org/69478) [15:44:38] (03CR) 10jenkins-bot: [V: 04-1] Jenkins test for tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/155279 (https://bugzilla.wikimedia.org/69478) (owner: 10Hashar) [15:45:53] (03PS2) 10Hashar: Jenkins test for tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/155279 (https://bugzilla.wikimedia.org/69478) [15:46:16] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:46:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:46:34] (03Abandoned) 10Hashar: Jenkins test for tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/155279 (https://bugzilla.wikimedia.org/69478) (owner: 10Hashar) [15:48:00] !log dns: Jenkins will now complain whenever you attempt to send tabs in any file of operations/dns.git {{bug|69478}} [15:48:06] mutante: ^^^ :-D [15:48:07] Logged the message, Master [15:49:16] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [15:49:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [15:51:18] hey godog, i see you recently merged some .bashrc .bash_profile stuff for your home dirs [15:51:31] do you have the same issue I do, where most of the time the .bashrc isn't sourced on login? [15:54:45] ottomata: .bashrc actually isn't sourced on login; this is normal bash behaviour. (It sources .profile instead on login shells). What most people do for consistency's sake is to explicitly source .bashrc at the end of their .profile [15:55:36] (or .bash_profile, which is also sourced on login shells (and only login shells)) [15:59:06] hm, interesting, Coren, sometimes it is sourced though [16:01:16] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:01:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:01:27] hm, maybe I just happen to have it .bashrc sourced via somethign on those nodes where it is sourced [16:01:28] k [16:01:29] thanks [16:04:55] ottomata: yep what Coren said [16:05:49] (03PS1) 10Ottomata: Add /home/otto/.bash_profile to source /home/otto/.bashrc [operations/puppet] - 10https://gerrit.wikimedia.org/r/155285 [16:06:08] (03CR) 10Ottomata: [C: 032 V: 032] Add /home/otto/.bash_profile to source /home/otto/.bashrc [operations/puppet] - 10https://gerrit.wikimedia.org/r/155285 (owner: 10Ottomata) [16:07:17] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Wed Aug 20 16:07:13 UTC 2014 [16:08:36] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Puppet has 1 failures [16:09:06] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [16:09:16] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [16:11:45] !log elastic1001 upgrade went well - upgrading elastic1002 now [16:11:56] Logged the message, Master [16:26:06] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:26:52] greg-g: Good morning!!! 
I don't suppose I can have a deploy window today to push the commons config change to turn MMV off by default for logged-in folk? [16:27:16] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:27:18] i think he's off this week [16:27:24] i'm not sure who his delegate is; probably robla [16:27:25] Oh! [16:27:29] Wellp [16:27:36] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:27:37] <^d> We've just been kind of doing it ourselves. [16:27:41] <^d> Seat of the pants style. [16:27:41] I think Deskana|Away is usually greg-g replacement [16:27:47] marktraceur: can't we just swat that? [16:27:49] ^d: My favourite kind of style [16:27:51] ^d: amazing how that works [16:28:00] tgr: Eh, no reason not to get it out a little faster and on our own [16:28:11] the config change is dependent on a code change anyway [16:28:24] ori: But we obviously need a committee, and community oversight, and a second and third level of community oversight [16:28:27] And SUPERPROTECT [16:28:38] <^d> ori, marktraceur: So yeah, I'd say just go ahead, make sure you're open on the calendar and use good judgement. [16:28:43] Coolio [16:28:50] calendars and good judgement! [16:28:56] <^d> And add to calendar so greg knows what we did later :p [16:28:57] ^d, you know those are my two weaknesses [16:30:05] <^d> I thought it was whiskey and shiny objects. [16:30:13] :) [16:30:23] ^d, you know i have a deep man-crush on you, right? [16:30:32] if not, keep it a secret [16:30:35] <^d> s/shiny objects/performance metrics/ [16:30:36] <^d> :) [16:30:40] <^d> I won't tell [16:31:41] K, we signed up [16:32:08] (03CR) 10MarkTraceur: "I'm deploying this today at 14:00 PDT. Thanks! :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154396 (https://bugzilla.wikimedia.org/69363) (owner: 10Gergő Tisza) [16:35:46] PROBLEM - Varnish HTTP mobile-backend on cp1059 is CRITICAL: Connection refused [16:36:01] ^ that's still me [16:36:39] Don't worry bblack, we blame /every/ problem on you by default. :-) [16:36:46] RECOVERY - Varnish HTTP mobile-backend on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.011 second response time [16:37:20] I'm just sayin' :) I logged that I'm messing with varnishes for days and expect problems, but that was a while back :) I don't want someone logging in to "fix" it in the middle of my fixes :) [16:38:38] _joe_: so one swift box failing causes that many problems...seems kind of spof-y :) [16:50:15] (03PS1) 10Manybubbles: Turn down Elasticsearch recovery speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/155296 [16:50:39] (03CR) 10Manybubbles: "Note: this only makes permanent a change that I made directly to the running cluster an hour ago." [operations/puppet] - 10https://gerrit.wikimedia.org/r/155296 (owner: 10Manybubbles) [17:28:17] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [17:28:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [17:33:39] awjr: i'm having trouble joining the hangout... 
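(The recovery turn-down discussed earlier, '2 streams at 30mb', and made permanent by the gerrit 155296 change above, maps onto Elasticsearch's dynamically updatable cluster settings. A sketch of the equivalent API call on a 1.x cluster; the host is a placeholder.)

    # Apply the throttled recovery settings live, without a restart. Transient
    # settings are lost on a full cluster restart, hence also persisting them in puppet.
    curl -s -XPUT 'http://elastic1001.eqiad.wmnet:9200/_cluster/settings' -d '{
      "transient": {
        "indices.recovery.max_bytes_per_sec": "30mb",
        "indices.recovery.concurrent_streams": 2
      }
    }'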
[17:34:21] awjr: milimetric: can someone invite aotto@wikimedia.org to the hangout [17:34:23] it won't let me join [17:40:17] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:40:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:43:26] PROBLEM - Varnish HTTP mobile-backend on cp1060 is CRITICAL: Connection refused [17:44:26] RECOVERY - Varnish HTTP mobile-backend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.003 second response time [17:49:55] ottomata, are you in sos? its not letting me join :( [17:50:34] bblack, i think i found a weird ip issue - a set of ips is not being detected as opera :( [17:50:47] ? [17:51:01] can you give me data to debug with? [17:51:50] bblack, zgrep 'opera="slot-no-opera"' /a/mw-log/zero.log | cut -d ' ' -f 5- | cut -d '/' -f 1 | [17:51:50] awk -F \\t '{print $2 FS $4 FS $3}' | [17:51:50] awk '{gsub(/(xff=)?"/, ""); gsub(/[, \t]+/, "\t"); gsub(/\t10\.[0-9.]+/, "\tprivate"); gsub(/\t91\.198\.174\.[0-9.]+/, "\tWMF"); gsub(/\t2620:0:862:[:0-9a-fA-F]+/, "\tWMF"); print}' | [17:51:51] awk -F \\t '{printf("%s\t%s\t",$1,$2); for (i=NF; i>2; i--) printf("%s\t",$i);print ""}' | [17:51:53] awk '{gsub(/\thttps:\t(private|WMF)\t(private|WMF)\t(private|WMF)/, "\thttps:\tvalidated"); gsub(/\thttp:\t(private|WMF)\t(private|WMF)/, "\thttp:\tvalidated"); gsub(/\t+$/, ""); print}' | rev | cut -f 2- | rev | sort -k 3 -k 1,2 | uniq -c | sort -n -r | less [17:52:00] hmm, need a pasting thingy, sec [17:52:05] lakjwo4ihgodljfhlskdjflWK3JEGL [17:52:27] just an example IP that should match the current prod opera set of networks, but doesn't would be fine :) [17:52:29] http://pastebin.com/N8FL24hP [17:52:47] on flourine [17:52:55] 82.145.222.240 [17:53:35] should match opera, but always doesn't? [17:54:10] manybubbles: https://gerrit.wikimedia.org/r/#/c/155296 is safe to merge now correct? won't touch ES in any way, if so I'll just merge it [17:54:32] godog: safe - yeah [17:54:41] that script takes zero log and pulls all cases when there is an opera slot (opera framework zero-rates us), but we don't detect it as opera (varnish does not set X-FORWARDED-BY=="Opera") [17:54:48] it really only takes effect when the entire cluster is restarted from scratch [17:54:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Turn down Elasticsearch recovery speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/155296 (owner: 10Manybubbles) [17:54:52] bblack, ^^ [17:55:08] i don't know if its *always* [17:55:20] what does "framework zero-rates us" mean? [17:55:30] opera mini servers [17:55:37] manybubbles: yup, done [17:55:46] thanks! [17:55:59] bblack, btw, i didn't see that ip yesterday [17:56:04] in the logs [17:56:27] when opera mini servers "framework zero-rate us" what does that mean? [17:57:03] opera mini servers add a X-OPERAMINI-ROUTE [17:57:11] ok [17:57:28] so all of the output of that awk stuff is IPs that are sending that header, but we don't set XFB and should? 
[17:57:56] correct [17:58:32] btw, my PHP logic is X-FORWARDED-BY ?: X-FORWARDED-BY2 -- which means that if the XFB is not set, it will use the XFB2 header [17:58:57] because we do some varnish magic for it [18:03:09] bblack, you can see the actual request with grep -A 20 -e 'xff.*82.145.222.240' /a/mw-log/zero.log |less [18:04:30] yurikR: I'm doing a unit test for netmapper first, against our current prod proxies dataset and some of those IPs, just to make sure whether it's a bug there [18:04:45] (a temporary unit test in my dev box that is, not committing the prod data) [18:04:56] (03PS11) 10John F. Lewis: mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) [18:05:00] ok, thx [18:05:19] bblack, btw, just verified - there were other cases of that same ip, but it was properly detected [18:05:22] so its not always [18:05:23] (03PS12) 10John F. Lewis: mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) [18:05:33] andrewbogott ^^ rebased for you :) [18:05:51] given the netmapper data is static while in use, it would be unlikely an intermittent issue lies there [18:06:16] it might be interesting to see if something else differs critically between a good and bad result for the same opera IP (maybe some other header messing with our varnish logic?) [18:06:23] bblack, compare this: these are the failing xfb detections: grep -A 20 -e 'slot-no-opera.*82.145.222.240' /a/mw-log/zero.log |less [18:07:57] bblack, and these are the ones that detected XFB, but had other (unrelated) issues: grep -v 'slot-no-opera' | grep -A 20 -e 'xff.*82.145.222.240' /a/mw-log/zero.log |less [18:08:55] bblack, could it be that one of the varnish servers hadn't pulled up-to-date info? [18:09:02] is the IP range new? [18:09:10] i changed it 2 days ago [18:09:35] but interestingly enough, this ip was NOT showing in the logs 2 days ago [18:10:50] opera's range 82.145.220.0/22 should include that ip [18:11:35] yurikR: all the ones that worked (the second grep above) seem to be the same opera? [18:11:49] [USER-AGENT] => Opera/9.80 (J2ME/MIDP; Opera Mini/6.0.24744/35.3956; U; ru) Presto/2.8.119 Version/11.10 [18:11:55] the other grep is varied [18:12:49] bblack, that would be strange if the useragent had any affect on this :) i just look at the known headers [18:12:52] err sorry, other way around, that user-agent is the one that always fails [18:13:26] remember, this log is only for various errors - so there could be tons of correctly processed requests that went through that proxy [18:15:22] yurikR: the json data on-disk on all the mobile caches matches md5sums [18:15:53] could be a json loading issue... can you touch them? 
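(A quick way to double-check the claim above that 82.145.222.240 sits inside Opera's 82.145.220.0/22 range, sketched in plain bash so no extra tooling is needed; this is illustrative arithmetic, not part of any deployed check.)

    ip_to_int() { local IFS=.; set -- $1; echo $(( ($1<<24) + ($2<<16) + ($3<<8) + $4 )); }
    ip=$(ip_to_int 82.145.222.240)
    net=$(ip_to_int 82.145.220.0)
    mask=$(( (0xffffffff << (32 - 22)) & 0xffffffff ))
    # 82.145.220.0/22 spans 82.145.220.0 - 82.145.223.255, so this prints "in range".
    (( (ip & mask) == (net & mask) )) && echo "in range" || echo "outside range"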
[18:16:25] I can just check logs for that [18:16:50] for future reference, what would be really helpful in these logs is the header that shows which caches it went through [18:17:08] X-Cache [18:18:12] bblack, i don't see them [18:18:18] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [18:18:19] i only see X-VARNISH [18:18:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [18:19:50] the real requests have X-Cache in them [18:20:02] I don't know what's missing to get that over to the req logs [18:24:18] bblack, that log has every header that hits the backend [18:24:56] X-Cache is a response header, not a request one [18:25:01] it's set on the way out to the client [18:25:19] so how would i have that :) [18:25:25] beats me! [18:25:30] I'm just saying, it would be handy [18:25:45] we *could* set something like X-Cache in the bereq headers [18:25:48] bblack, if you want, simply add it in the vcl_recv [18:26:01] if it is already set :) [18:26:10] XFF is that [18:26:35] well, true [18:26:35] hmm.. good point :) [18:29:08] bblack, in that big awk, i replace our internal ips with 'private' or 'WMF', and later if the request is http, replace two (private|wmf)s with the word 'validated' (3 in case of https)- this way in the result you see validated followed by external ips only [18:29:34] the order of XFF is reversed in the awk [18:31:38] bblack, which by the way also shows that we have a number of other issues - like 25 slot=1 http: private 107.167.107.129 -- meaning that there was only 1 internal IP in the XFF [18:32:02] heh I shouldn't paste so blindly [18:32:15] grep -v 'slot-no-opera' | grep -A 20 -e 'xff.*82.145.222.240' /a/mw-log/zero.log |less <- does not do the right thing [18:32:16] found a bug? [18:32:56] that grep shows cases when 82.... was identified correctly by opera [18:32:58] as opera [18:33:10] see the other grep for the list of issues [18:33:32] grep -A 20 -e 'slot-no-opera.*82.145.222.240' /a/mw-log/zero.log |less [18:33:38] I mean the commandline is wrong [18:33:54] "grep -v foo | bar" has no input to the first grep ... [18:34:17] oh, hmm [18:34:19] right [18:34:38] sorry, move the file to the first one :) [18:35:11] won't that screw the -A 20? [18:35:23] no, will be fine [18:35:41] because it skips the start of the frame lines, which means that the second one will only pick up the ones we want [18:35:50] +20 lines afterwards [18:35:57] which are the frame [18:36:16] (aprox, sometimes it captures slightly more, so look for the '----' ) [18:41:00] (03PS1) 10Chad: New release of swift plugin, 0.6 [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/155313 [18:41:55] so the order of xff should be .... varnish-frontend, varnish-backend right? [18:42:39] all 4x esams frontends are present in both lists, but only 3/4 backends are [18:42:54] (cp3012 missing as a backend in the successful ones) [18:43:13] but I'm not sure what that really means yet [18:43:52] if it were just a failed data load, I'd think it would be a frontend difference where tag_carrier runs, and I'd think we'd see that frontend only in the bad list and not the good list [18:44:18] (03PS1) 10Rush: phabricator task priorities [operations/puppet] - 10https://gerrit.wikimedia.org/r/155314 [18:44:31] oh it's just a lack of input data. 
the successful ones with other errors is a very small dataset [18:44:44] so, no, this isn't localized to one cache [18:44:56] (03CR) 10Rush: [C: 032 V: 032] phabricator task priorities [operations/puppet] - 10https://gerrit.wikimedia.org/r/155314 (owner: 10Rush) [18:45:26] yurikR: http://paste.debian.net/hidden/988b5bf5/ [18:45:38] ^ all possible unique combinations of varnish fe->be from esams there [18:46:30] bblack, you didn't have to do the -A 20 for that - xff is in the main line :) [18:46:50] it's easier to keep it all together as I keep editing commandlines [18:47:16] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:47:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:47:34] but I'm grepping down in the uppercase XFF [18:50:16] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [18:50:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [18:52:29] every single one in the problem dataset is: [18:52:30] [USER-AGENT] => Opera/9.80 (Android; Opera Mini/7.5.35702/35.3956; U; en) Presto/2.8.119 Version/11.10 [18:52:49] I'm not saying the user-agent is directly causing it, but it's a good clue for looking into what's unique about these [18:53:13] mutante: yt? [18:53:32] for that matter, they're all the same client IP 197.79.31.237 [18:55:29] (oh that's because we're looking at one operamini proxy IP, though) [18:59:31] DB ops, I'd like to deploy these schema changes in the next hour, can someone remind me of the protocol for migrations?: https://bugzilla.wikimedia.org/show_bug.cgi?id=69654 [18:59:41] Should I wait for ops review? [19:00:45] yurikR: all the failures also have Host set to something.wikimedia.org, and the others have .wikipedia.org? odd difference, but consistent [19:00:48] you should definitely talk to Sean if you haven't already [19:01:37] [HOST] => commons.wikimedia.org [19:01:38] [REFERER] => http://commons.m.wikimedia.org/w/index.php?search=Gay+&fulltext=Search [19:01:55] mark: great, thx! [19:02:07] bblack, we wouldn't want to be posting that on the open chanel now, would we :)))) [19:02:13] if (req.http.host ~ "^([a-zA-Z0-9-]+\.)?(m|zero)\.wikipedia\.") { [19:02:15] springle: I'm hoping to get ops review of these schema changes, https://bugzilla.wikimedia.org/show_bug.cgi?id=69654 [19:02:16] call tag_carrier; [19:02:17] } [19:02:29] ^ there's your problem [19:02:36] awight: well he's in australia, not sure he's around now [19:02:53] the ones without XFB are in wikimedia.org, which doesn't match the regex to call tag_carrier [19:03:05] argh, ok no problem, luckily the author AndyRussG wrote the code to be safe w/o its schema [19:03:55] bblack, something tells me we soon will have to change that :( [19:04:08] well, regardless, that's the bug if there's a bug here [19:04:38] true, thx! what do you think of the other issues in that log - e.g. just one private ip in XFF instead of 2? [19:05:13] we have a mix of private and public -IP'd varnishes [19:05:19] they all had 2x caches in them [19:06:57] although what I don't understand (although it appears to be normal)... shouldn't fetches through remote caching centers list 3x varnishes? 
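(The VCL condition quoted above is easy to sanity-check outside varnish. A small sketch, using Python's re rather than varnish's regex engine, though the pattern is simple enough that the behaviour should not differ, shows why the failing requests -- all on *.wikimedia.org hosts -- never reach tag_carrier and therefore never get X-FORWARDED-BY set.)

import re

# host pattern copied from the VCL snippet above
vcl_host_re = re.compile(r"^([a-zA-Z0-9-]+\.)?(m|zero)\.wikipedia\.")

for host in ("en.m.wikipedia.org",        # matches -> tag_carrier runs
             "zero.wikipedia.org",        # matches
             "commons.m.wikimedia.org"):  # no match -> no XFB, "slot-no-opera"
    print(host, bool(vcl_host_re.search(host)))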
[19:07:32] oh right the last one is the literal client IP as the backend apaches see it [19:07:38] so it's all correct [19:09:14] bblack, 2014-08-20 13:46:50 mw1072 commonswiki: opera="slot-no-opera" slot="1" xff=", 107.167.107.129, 10.64.32.98" http://commons.m.wikipedia.org search=boobs&fulltext=Search [19:09:44] that's correct [19:09:56] but it only has 1 private ip, right? [19:10:04] yeah, that's the frontend cp1046.eqiad.wmnet [19:10:20] the backend doesn't go in XFF, it is the literal client IP when the request reaches the appserver [19:10:21] 107.167.107.129 is opera [19:10:51] i mean - we get only 1 ip here because itsa search request? [19:10:56] no [19:11:10] because all other requests have 2 wmf ips at the end [19:11:23] with remote cachign centers there are 3 varnishes, e.g. ulsfo-front, ulsfo-back, eqiad-back [19:11:34] if the request hit eqiad directly, it's just eqiad-front, eqiad-back [19:11:42] and the last cache in that list doesn't go in XFF [19:12:10] (it is the client IP, from the appserver perspective) [19:12:13] do we handle these cases when taging traffic? Because i thought we always assume 2 privates [19:12:25] I have no idea [19:12:46] I assume by "private" you mean a wmf address in general, private or public [19:13:49] if at the app layer you wanted a list of every forwarder, inclusive of the final one, you'd want to append the literal client IP in all cases (as in, the source address of the http connection to the appserver) [19:13:56] XFF doesn't carry that last one [19:14:49] yes, i understand that there is no local ip in xff, i just thought for some reason that xff always has 2 wmf ips at the end [19:14:55] plus the backend ip [19:15:07] no [19:15:37] thx for clarifying. I will put some filtering into the error detection to reduce these cases [19:15:37] in the case that the user hits eqiad's front IPs directly and doesn't use https, there will only be one wmf IP in the list, that of the eqiad frontend cache they hit [19:15:49] no no, i do account for https proxy [19:16:06] yes [19:16:13] I'm just saying, if https were involved it would make it two [19:16:28] the case where there's exactly one wmf IP in the XFF header is user->eqiad, non-https [19:16:30] that's why in my awk i have (https + 3 wmfs) OR (http + 2 wmfs) - replace with validated [19:16:44] gotcha, thx! [19:16:58] will sort through the logs and hopefully will come up with more errors :) [19:17:07] bu tnote for your parsing that direct-to-eqiad with https is (http + 1 wmf) too [19:17:17] yes, understood [19:17:26] err https + 1wmf :) [19:17:41] right. now i need to figure out how to tag all traffic without breaking everything :) [19:18:00] because with carriers switch to ips, we should at least mark the traffic properly [19:18:12] as being zero rated [19:19:17] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:19:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:20:42] any +2 er feeling like deploying a change to Gerrit configuration? It is to make the Jenkins message nicer (and with colors!) 
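(The XFF layout bblack walks through above -- the last cache never shows up in XFF because it becomes the literal client IP the appserver sees, a remote caching centre contributes an extra hop, and https termination adds one more -- can be captured in a small helper. This is a hedged sketch of the kind of check yurikR's "validated" rewrite performs, not the actual awk; the hop labels match the masking step sketched earlier.)

WMF_HOPS = {'private', 'WMF'}   # labels produced by the masking step earlier

def trailing_wmf_hops(xff_hops):
    """Count how many WMF/private hops end the X-Forwarded-For chain."""
    n = 0
    for hop in reversed(xff_hops):
        if hop not in WMF_HOPS:
            break
        n += 1
    return n

# Per the discussion: plain http straight to eqiad leaves one trailing WMF hop,
# a remote caching centre adds one, and the https terminator adds one more.
print(trailing_wmf_hops(['107.167.107.129', 'private']))                    # 1
print(trailing_wmf_hops(['107.167.107.129', 'WMF', 'private', 'private']))  # 3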
[19:21:00] colorrrrrrrrrrs [19:21:17] teaser: look at zuul comment on http://integration.wmflabs.org/gerrit/#/c/12/ :D [19:22:17] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [19:22:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [19:27:26] RECOVERY - RAID on ms-be1005 is OK: OK: optimal, 13 logical, 13 physical [19:28:12] hashar: you're making it pink ? [19:28:13] ^ that's me swapping a disk [19:28:29] mutante: na look at the SUCCESS messages at the bottom [19:28:49] they are green, and instead of showing the whole url only the job name is displayed :D [19:29:23] the bg is supposed to be "Lavender blush"? ok [19:29:59] http://www.colorhexa.com/fff0f5 [19:30:57] hashar: shorter message is cool! [19:32:00] the bg is just on that labs instance :] [19:32:08] so I immediately knows it is not the prod gerrit hehe [19:32:11] (03CR) 10Dzahn: [C: 032] gerrit: prettify Zuul build results [operations/puppet] - 10https://gerrit.wikimedia.org/r/154524 (https://bugzilla.wikimedia.org/66095) (owner: 10Hashar) [19:33:21] hashar: eh.. submitted, merge pending [19:33:23] mutante: awesome. Trick: there are two dependent changes which are for labs [19:33:31] noop on prod [19:33:51] i see, looking [19:35:56] (03CR) 10Dzahn: [C: 032] contint: proxy the colocated Gerrit instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/154488 (owner: 10Hashar) [19:36:23] (03CR) 10Chad: [C: 032 V: 032] New release of swift plugin, 0.6 [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/155313 (owner: 10Chad) [19:37:21] (03CR) 10Dzahn: [C: 032] contint: website config on labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/154484 (owner: 10Hashar) [19:38:00] hashar: i'm about to merge them together now [19:38:15] mutante nice [19:38:17] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:38:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:38:27] puppet will take care of updating gerrit config and restart it automatically (notify => ) [19:38:30] hashar: done.. are you doing the restarts? [19:38:35] ah, cool [19:39:25] you can force puppet run on ytterbium :D [19:40:04] looks at log, it applies right now [19:40:25] (/Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/gerrit.config]/content) content changed [19:41:06] finished catalog run [19:41:07] sounds good [19:41:23] gerrit apparently restarting [19:41:38] yes. 
Service[gerrit]) Triggered 'refresh' from 1 events [19:42:32] hashar: that reminds me of https://gerrit.wikimedia.org/r/#/c/153849/ [19:43:03] mutante: and there is one for the contint website which I havent reviewed yet :( [19:44:03] true, one by one [19:44:34] meanwhile, the stupid Gerrit config regex doesn't seem to work hehe [19:46:49] !log awight Synchronized php-1.24wmf16/extensions/CentralNotice: push CentralNotice updates, including new hide cookie format (duration: 00m 07s) [19:46:55] Logged the message, Master [19:47:12] deploy fail: [19:47:12] 19:46:45 ['sync-common', '--include', 'php-1.24wmf16', '--include', 'php-1.24wmf16/extensions', '--include', 'php-1.24wmf16/extensions/CentralNotice', '--include', 'php-1.24wmf16/extensions/CentralNotice/***', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1019 returned [127]: bash: sync-common: command not found [19:47:21] !log awight Synchronized php-1.24wmf16/extensions/CentralNotice: push CentralNotice updates, including new hide cookie format (duration: 00m 04s) [19:47:30] hashar: :p but it worked ...on the preview? [19:47:41] yeah and christian double checked it [19:47:46] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.113333333333 [19:47:50] will investigate :] thanks mutante ! [19:48:11] hashar: ok, yw [19:48:21] (03CR) 10Andrew Bogott: [C: 04-1] mailman: use a new default theme (prettier mailman) (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) (owner: 10John F. Lewis) [19:48:28] !log awight Synchronized php-1.24wmf17/extensions/CentralNotice: push CentralNotice updates, including new hide cookie format (duration: 00m 05s) [19:48:30] mutante: ah I know. I forgot to restart Zuul hehe [19:48:34] Logged the message, Master [19:49:05] * hashar whistles [19:49:12] :) [19:49:22] if that works [19:49:30] that is going to be much nicer to the eye [19:50:39] !log Restarting Zuul to prettify build results {{bug|66095}} [19:50:44] Logged the message, Master [19:50:50] (03PS1) 10Dzahn: retab gerrit config template [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 [19:51:26] mutante: speaking of tabs, I have enabled the Jenkins job tab killer for operations/dns.git [19:51:59] hashar: ah:) thanks much! also the templates are done :) [19:52:13] (03CR) 10Hashar: [C: 04-1] "commentlink "cve" !" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 (owner: 10Dzahn) [19:52:22] OH MY GOD !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [19:52:37] teh works? [19:52:43] yeah look at https://gerrit.wikimedia.org/r/#/c/155341/ [19:52:46] looks nice ! [19:52:47] green / red links [19:52:50] yuuuhhhh [19:53:03] cool :) [19:54:22] (03PS2) 10Dzahn: retab gerrit config template [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 [19:55:09] we gotta add phab to that some time ? 
[19:58:39] mutante: we haven't even started to look at migrating gerrit to phab [19:58:48] I guess next year :D [19:59:12] ok, yes [19:59:56] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 17:59:44 UTC [20:00:07] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Wed Aug 20 20:00:05 UTC 2014 [20:01:01] (03CR) 10Dzahn: "variables need some @'s for puppet3 i suppose" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 (owner: 10Dzahn) [20:02:06] (03CR) 10Dzahn: [C: 032] wikistats - install deb package locally [operations/puppet] - 10https://gerrit.wikimedia.org/r/155173 (owner: 10Dzahn) [20:02:24] mutante: thank you very very much [20:02:57] hashar: no problem, new design much better:) [20:03:20] mutante: yeah that comes from a clever guy at OpenStack [20:03:27] he always comes up with nice solutions [20:04:20] now 3 colors? red for voting failure, orange for non-voting failure [20:05:00] ahh [20:05:07] yeah non-voting should probably be tweaked [20:05:35] or rename to "epic failure" to match icinga :p [20:06:59] (03CR) 10Hashar: [C: 031] "> variables need some @'s for puppet3 i suppose" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 (owner: 10Dzahn) [20:07:18] hah, yea, true, fixing this would also get rid of a bunch of noise ./manifests/passwords.pp:6 WARNING class not documented (documentation) [20:07:33] you can look at syslog maybe [20:08:05] maybe doc sprint [20:08:49] (03CR) 10Dzahn: [C: 032] retab gerrit config template [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 (owner: 10Dzahn) [20:10:17] mutante: when one day we end up with all servers syslog in logstash, that will be trivial to find out :] [20:11:01] oh, i can also look at puppet compiler [20:11:30] hashar: it has that "transition" mode [20:12:17] mutante: what does it do? [20:12:42] (03CR) 10Dzahn: "it does: http://puppet-compiler.wmflabs.org/239/change/155341/compiled/puppet_catalogs_3_production/ytterbium.wikimedia.org.warnings" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155341 (owner: 10Dzahn) [20:12:56] it does complain about the variables [20:13:07] see all the "is deprecated" in there [20:13:44] hashar: it takes a gerrit change number , list of nodes to compile on and it has TRANSTION true or false, which is [20:13:47] "Set to 1 if you want to test a change AND compare different puppet versions for that change." [20:14:01] but i dont even need to turn that on to get the warnings [20:14:14] well, i suppose they will turn from warnins to actual errors [20:14:16] on upgrade [20:17:46] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [20:18:20] (03CR) 10Andrew Bogott: [C: 031] "So, after this patch we have exactly the upstream icehouse file, right?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155244 (owner: 10Filippo Giunchedi) [20:19:49] (03CR) 10Andrew Bogott: [C: 031] drive-audit: clear up exit status [operations/puppet] - 10https://gerrit.wikimedia.org/r/155245 (owner: 10Filippo Giunchedi) [20:25:25] mutante: thanks :] [20:34:31] (03PS6) 10Dzahn: stats.wm.org - use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/153977 [20:41:30] manybubbles: for zinc/ttm, did you manage to look into the icinga alert and see if any meaningful history can be extracted from it? 
[20:41:52] I remember it was set at 10 seconds at some point because it got spammy, but not what happened after that [20:42:28] * marktraceur waves [20:42:36] I'll be taking the slot in 18 minutes because of anarchy [20:42:50] Anyone still working, or will I be clear? [20:43:55] (03CR) 10Dzahn: [C: 032] stats.wm.org - use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/153977 (owner: 10Dzahn) [20:44:26] I guess subbu and gwicke are deploying parsoid changes? [20:44:37] I doubt that will interfere with my config change. [20:44:47] marktraceur, no parsoid deploy today. [20:44:52] Oh, even easier. :) [20:45:00] found regressions and cancelled it. [20:47:39] (03PS1) 10Dzahn: Revert "stats.wm.org - use ssl_ciphersuite" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155430 [20:47:41] grrrr [20:48:48] (03CR) 10Dzahn: [C: 032] "undefined method `join' for nil:NilClass" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155430 (owner: 10Dzahn) [20:50:47] (03CR) 10Dzahn: "this was no good - undefined method `join' for nil:NilClass" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153977 (owner: 10Dzahn) [20:55:03] (03PS1) 10Dzahn: gerrit secure.config.erb - variable access [operations/puppet] - 10https://gerrit.wikimedia.org/r/155433 [21:03:11] OK, I'm sorting out a few final things about the SQL, but I'll push the config change first [21:03:13] hashar: re: your comments on https://gerrit.wikimedia.org/r/#/c/155207/, I'm in the wmf ldap group so I should be able to deploy new jobs, but can I also deploy the zuul triggers? [21:03:32] (03CR) 10MarkTraceur: [C: 032] Disable MediaViewer by default for logged-in users on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154396 (https://bugzilla.wikimedia.org/69363) (owner: 10Gergő Tisza) [21:03:43] (03Merged) 10jenkins-bot: Disable MediaViewer by default for logged-in users on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154396 (https://bugzilla.wikimedia.org/69363) (owner: 10Gergő Tisza) [21:04:06] legoktm: Jenkins is configured to let people in the wmf LDAP group configure jobs [21:04:12] which is needed by Jenkins Job Builder [21:04:28] we should pair together to get you on par [21:04:35] for Zuul, you would need a shell access on the cluster [21:04:43] and specially on gallium.wikimedia.org which host zuul [21:05:41] hm ok. I think I'm fine with having you review my changes right now if you don't mind :) [21:06:14] !log marktraceur updated /a/common to {{Gerrit|I226bd1468}}: Add item-redirect to OAuth permissions [21:06:20] Logged the message, Master [21:08:01] Should I just sync-dir rather than try to sync-file four files at once? [21:08:55] You would have to sync-file 4 times [21:09:06] Right [21:09:07] it's not sync-files :) [21:09:20] sync-dir should be fine [21:09:24] * marktraceur does [21:09:25] and still fast [21:09:52] Going! [21:09:56] !log marktraceur Synchronized wmf-config: Turn off Media Viewer for logged-in users at Commons. (duration: 00m 07s) [21:10:02] Logged the message, Master [21:10:34] No change in behaviour... [21:11:13] ottomata, can you take a look at https://rt.wikimedia.org/Ticket/Display.html?id=6981 ? (just picked you because it's your week :) ) [21:11:21] unless someone else wants to :) [21:12:31] Thehelpfulone, ^ [21:14:24] jeremyb: happy to do so, but I am about to sign off for the day, i think I have some questions, will ping you about them tomorrow? 
[21:16:14] ottomata, yeah, sure [21:16:54] danke [21:19:42] I'm going to run a manual query to opt-in people who opted out of the HideMediaViewer gadget [21:19:58] I blame legoktm for everything, but the query should work fine [21:20:06] :> [21:21:58] Nothing appears to have died. [21:23:11] Doesn't seem to be working either, though [21:24:02] Argh, the logged in preference didn't make it into wmf17 [21:24:52] Argh, it's not even merged [21:24:54] * marktraceur fails [21:37:41] Sorry for this slowness y'all [21:43:12] what's wmf17? [21:43:42] the branch name [21:46:00] !log marktraceur Synchronized php-1.24wmf17/extensions/MultimediaViewer/: Add disable-by-default option to MultimediaViewer (duration: 00m 07s) [21:46:06] Logged the message, Master [21:47:35] Done! [21:47:37] In time, too [21:47:46] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [21:48:26] jeremyb: We make a new release branch each week (mostly) and name them after the version that will eventually be released. Next official release is 1.24 so for the last 17 weeks we have been making branches like 1.24wmf1, 1.24wmf2 ... 1.24wmf17. Tomorrow Sam will create 1.24wmf18 and push it to the "group0" wikis (test.wp.o, test2.wp.o and wikimedia.org) [21:49:27] At the same time, 1.24wmf17 will go out to the wikipedia wikis which are running 1.24wmf16 at the moment [21:49:33] bd808, ohhhh. i was thinking a machine with that name [21:49:38] Naw [21:49:44] and i'm like, why tampa? [21:49:48] "release", it should be said, in the WMF cluster sense, not the tarball sense [21:50:29] right :) [21:51:26] The generic apache servers are named things like mw1017 [21:51:36] at least in eqiad [21:56:59] bd808, hence i said "tampa" [21:57:01] :) [21:57:46] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [22:00:42] I'd be thankful if somebody could give "backend-fail-internal error while deleting files" in https://bugzilla.wikimedia.org/show_bug.cgi?id=69760 another shot/attention - mw1119 and mw1132 were mentioned in latest comments but not sure if that's helpful at all [22:07:46] (03PS1) 10Dzahn: add halfak to researchers admin group [operations/puppet] - 10https://gerrit.wikimedia.org/r/155452 [22:08:34] (03CR) 10Dzahn: "for ottomata (analytics access)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155452 (owner: 10Dzahn) [22:08:51] (03PS2) 10Dzahn: add halfak to researchers admin group [operations/puppet] - 10https://gerrit.wikimedia.org/r/155452 [22:31:46] PROBLEM - MySQL Processlist on db1068 is CRITICAL: CRIT 110 unauthenticated, 0 locked, 0 copy to table, 0 statistics [22:33:46] RECOVERY - MySQL Processlist on db1068 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [22:37:56] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 20:36:48 UTC [22:37:59] (03PS13) 10John F. Lewis: mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) [22:43:57] (03PS14) 10John F. 
Lewis: mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) [22:50:56] !log disabled puppet on osmium to debug memory leak [22:51:01] Logged the message, Master [22:57:06] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Aug 20 22:56:57 UTC 2014 [23:02:32] (03PS1) 10Dzahn: wikistats - ensure php5-mysql is installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/155461 [23:03:04] (03PS2) 10Dzahn: wikistats - ensure php5-mysql is installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/155461 [23:03:19] (03PS3) 10Dzahn: wikistats - ensure php5-mysql is installed [operations/puppet] - 10https://gerrit.wikimedia.org/r/155461 [23:04:09] (03CR) 10Dzahn: [C: 032] "wonder if one day this will also be renamed to "php-mariadb" :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155461 (owner: 10Dzahn) [23:07:02] (03PS15) 10John F. Lewis: mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) [23:34:44] apergos: Much fun. I've literally written 3T to the raw devices without causing so much as a vague hiccup. @&#^%@* [23:41:44] springle: i wish i had this https://forge.puppetlabs.com/puppetlabs/mysql#mysql_grant [23:41:53] to setup a db in labs, incl. the grants [23:45:30] bblack, when you have a sec - https://gerrit.wikimedia.org/r/#/c/155468/ [23:45:50] btw, is grrrit-wm1 down? didn't show new patch [23:45:50] (03PS1) 10Yurik: Zero: 401-01 is now ip-based, all langs, https [operations/puppet] - 10https://gerrit.wikimedia.org/r/155468 [23:46:02] hmm.. i guess it got scared [23:46:03] :) [23:47:03] (03CR) 10BBlack: [C: 032] Zero: 401-01 is now ip-based, all langs, https [operations/puppet] - 10https://gerrit.wikimedia.org/r/155468 (owner: 10Yurik) [23:47:33] how do you like the new output format from jenkins? [23:48:06] it's nice :) [23:56:03] Nemo_bis: regarding zinc/ttm - I'll check icinga now. I didn't see anything before but I wasn't being super thorough because I wanted to look at logs on the box [23:56:34] Nemo_bis: icinga alerta has been OK for 21 days - not useful, I think [23:59:34] Nemo_bis: not failing because the everage it reports is 3ish seconds. which is within tolerance. maybe the slow requests aren't as frequent as we thought - or maybe there are enough fast request to offset them