[00:00:14] jdlrobson: still seeing script tags on my network
[00:00:31] jdlrobson: but not when i add an extra param
[00:00:39] Krenair, if only this was just for one wiki...
[00:00:57] sadly i think this is the best we're gonna get right now. We should really look into getting Zero using ResourceLoader for this kinda thing
[00:01:04] should be scriptable
[00:02:16] okay we worked it out - it's in the html
[00:02:28] hahahah
[00:02:37] in every page's HTML?
[00:02:38] action=purge did the trick
[00:02:48] https://github.com/wikimedia/mediawiki/blob/master/maintenance/purgeList.php
[00:03:09] echo "http://en.wikipedia.org/wiki/Foo" | mwscript purgeList.php --wiki=aawiki
[00:03:47] thanks guys for the help. I'm gonna mail Yuri and Jeff and try and get a more long term solution
[00:05:32] guys, i think that js file expiration is 15 minutes https://github.com/wikimedia/mediawiki-extensions-ZeroBanner/blob/937fcb19cc8fb0ba8d0745a6b0c3fae3f56620f7/includes/ZeroSpecialPage.php
[00:05:35] ("spin down" might be the wrong phrase, just trying to agree with you)
[00:06:29] dr0ptp4kt: thanks. It looks like we're in a somewhat better place now.
[00:06:39] good. thanks
[00:06:49] Krenair: thanks for all your help today and sorrry for the bug earlier ;-)
[00:07:09] (hope you got the 'r' joke ;-))
[00:07:45] :D
[00:54:54] What's the HTTP 500 spike from last our caused by?
[00:55:02] https://grafana.wikimedia.org/#/dashboard/db/varnish-http-errors
[00:55:07] hour*
[00:55:42] site outage
[00:55:56] probably
[00:56:02] arrray
[00:56:04] Right
[00:56:06] That would cause it
[00:56:11] https://gerrit.wikimedia.org/r/#/c/243836/1/wmf-config/InitialiseSettings.php
[00:56:18] We don't php-lint in sync-file?
[00:56:27] We do
[00:56:34] but arrray can be a valid function name
[00:57:01] *facepalm*
[00:57:07] indeed
[00:57:28] And phpcs can help?
[01:15:58] Krinkle: MaxSem filed a task about the general issue.
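The point above, that a lint step passed even though `arrray` was a typo, comes down to a syntax-only check not being able to tell a misspelled construct from a call to a function that simply is not defined yet. A minimal sketch of the same effect in Python follows. It is an analogy only: the incident involved PHP, and the helper names and the `dictt` typo below are hypothetical, not taken from the actual config.

```python
import ast


def passes_syntax_check(source: str) -> bool:
    """Return True if the source parses, which is all a syntax-only lint verifies."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def fails_at_runtime(source: str) -> bool:
    """Return True if executing the source raises NameError (an undefined name)."""
    try:
        exec(compile(source, "<settings>", "exec"), {})
    except NameError:
        return True
    return False


# Hypothetical config line with a misspelled call, analogous to arrray(...) in PHP.
BROKEN = "settings = dictt(default=True)"

print(passes_syntax_check(BROKEN))  # the parser accepts the misspelled call
print(fails_at_runtime(BROKEN))     # the name only fails to resolve when executed
```

Catching this class of error before deploy generally needs something that resolves names (a static analyser or a test that actually loads the file), not just a parser.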
[01:16:08] https://phabricator.wikimedia.org/T114725 [01:19:01] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail [01:22:20] (03PS2) 10Jforrester: VisualEditor: Switch to opt-out for English Wikipedia logged-in users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348) [01:32:53] (03CR) 10Krinkle: [C: 031] Fix varnishmedia comment [puppet] - 10https://gerrit.wikimedia.org/r/243838 (owner: 10Gilles) [01:47:30] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [01:48:09] (03PS1) 10Yuvipanda: puppetmaster: Remove an ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/243851 [02:11:05] (03CR) 10Yuvipanda: "'asherman'," [puppet] - 10https://gerrit.wikimedia.org/r/242779 (owner: 10Yuvipanda) [02:20:10] (03PS1) 10Dzahn: lint: double quoted strings pt.1 [puppet] - 10https://gerrit.wikimedia.org/r/243852 [02:20:12] (03PS1) 10Dzahn: lint: double quoted strings pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/243853 [02:20:14] (03PS1) 10Dzahn: varnish: minor lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243854 [02:20:16] (03PS1) 10Dzahn: lint: double quoted strings pt.3 [puppet] - 10https://gerrit.wikimedia.org/r/243855 [02:20:18] (03PS1) 10Dzahn: lvs: double quoted string and other lint [puppet] - 10https://gerrit.wikimedia.org/r/243856 [02:20:20] (03PS1) 10Dzahn: toollabs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243857 [02:20:22] (03PS1) 10Dzahn: lint: double quoted strings pt.4 [puppet] - 10https://gerrit.wikimedia.org/r/243858 [02:20:24] (03PS1) 10Dzahn: lint: re-enable double quoted strings check [puppet] - 10https://gerrit.wikimedia.org/r/243859 [02:21:04] (03CR) 10jenkins-bot: [V: 04-1] lint: double quoted strings pt.3 [puppet] - 10https://gerrit.wikimedia.org/r/243855 (owner: 10Dzahn) [02:21:12] (03CR) 10jenkins-bot: [V: 04-1] varnish: minor lint fixes [puppet] - 
10https://gerrit.wikimedia.org/r/243854 (owner: 10Dzahn) [02:21:14] (03CR) 10jenkins-bot: [V: 04-1] lint: double quoted strings pt.4 [puppet] - 10https://gerrit.wikimedia.org/r/243858 (owner: 10Dzahn) [02:21:16] (03CR) 10jenkins-bot: [V: 04-1] lvs: double quoted string and other lint [puppet] - 10https://gerrit.wikimedia.org/r/243856 (owner: 10Dzahn) [02:21:18] (03CR) 10jenkins-bot: [V: 04-1] toollabs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243857 (owner: 10Dzahn) [02:21:34] (03CR) 10jenkins-bot: [V: 04-1] lint: re-enable double quoted strings check [puppet] - 10https://gerrit.wikimedia.org/r/243859 (owner: 10Dzahn) [02:25:35] (03CR) 10Dzahn: [C: 031] Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243651 (owner: 10Muehlenhoff) [02:26:24] "fun", eh 'fun' re: lint fixes being downvoted by the new .. stricter checks for other stuff [02:26:36] out now though [02:29:57] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 53s) [02:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:34:32] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-06 02:34:31+00:00 [02:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:39:22] (03PS3) 10Yuvipanda: admin: Provision all stat** users on bastion too [puppet] - 10https://gerrit.wikimedia.org/r/242779 [02:39:27] (03CR) 10jenkins-bot: [V: 04-1] admin: Provision all stat** users on bastion too [puppet] - 10https://gerrit.wikimedia.org/r/242779 (owner: 10Yuvipanda) [02:40:45] (03CR) 10Yuvipanda: "Ok, that had missed a bunch of users, corrected now to include accurate users and edited commit message with raitonale. 
https://phabricato" [puppet] - 10https://gerrit.wikimedia.org/r/242779 (owner: 10Yuvipanda) [02:44:18] (03PS4) 10Yuvipanda: admin: Provision all stat** users on bastion too [puppet] - 10https://gerrit.wikimedia.org/r/242779 [02:45:03] Krenair: also are there bugs for the two people we found who didn't have acccess? [02:48:40] (03CR) 10Yuvipanda: Add all groups to general bastions, mostly empty bastiononly group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [02:52:11] (03CR) 10Yuvipanda: "require_package should make sure that python-novaclient is installed." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [02:52:26] (03CR) 10Yuvipanda: "And thank you for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [02:53:24] (03PS2) 10Yuvipanda: toollabs: install hugin-tools [puppet] - 10https://gerrit.wikimedia.org/r/243500 (https://phabricator.wikimedia.org/T108210) (owner: 10Merlijn van Deen) [02:53:35] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: install hugin-tools [puppet] - 10https://gerrit.wikimedia.org/r/243500 (https://phabricator.wikimedia.org/T108210) (owner: 10Merlijn van Deen) [02:54:01] (03PS3) 10Yuvipanda: toollabs-genpp: add simple tool to check package availability [puppet] - 10https://gerrit.wikimedia.org/r/243498 (owner: 10Merlijn van Deen) [02:54:08] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs-genpp: add simple tool to check package availability [puppet] - 10https://gerrit.wikimedia.org/r/243498 (owner: 10Merlijn van Deen) [02:55:37] ori: that was a very short vacation [02:56:36] well, i got three voicemail messages about not having submitted quarterly review notes for the quarterly review, which is on thursday [02:56:57] apparently that counts as an emergency [02:57:50] because people would like to read your notes ahead of time before you recite them in person [03:00:06] ori: just in time to 
be able to forget them a few days later! [03:00:35] ori: did you actually come online in the middle of vacation to do quarterly review notes/ [03:00:37] ? [03:04:38] (03PS5) 10Yuvipanda: admin: Provision all stat** users on bastion too [puppet] - 10https://gerrit.wikimedia.org/r/242779 [03:05:19] (03CR) 10Yuvipanda: "Two users who we found had stat* but not bastion access were also notified, I believe there are tickets for these but I can't find them at" [puppet] - 10https://gerrit.wikimedia.org/r/242779 (owner: 10Yuvipanda) [03:13:53] (03PS5) 10Yuvipanda: elasticsearch: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 (owner: 10EBernhardson) [03:21:21] (03PS2) 10Yuvipanda: puppetmaster: Remove an ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/243851 [03:21:47] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Remove an ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/243851 (owner: 10Yuvipanda) [03:22:07] (03PS1) 10Yuvipanda: Remove unused template in root templates/ [puppet] - 10https://gerrit.wikimedia.org/r/243861 [03:23:11] (03PS2) 10Yuvipanda: Remove unused template in root templates/ [puppet] - 10https://gerrit.wikimedia.org/r/243861 [03:23:26] (03PS3) 10Yuvipanda: Remove unused template in root templates/ [puppet] - 10https://gerrit.wikimedia.org/r/243861 [03:24:02] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove unused template in root templates/ [puppet] - 10https://gerrit.wikimedia.org/r/243861 (owner: 10Yuvipanda) [03:26:41] (03PS1) 10Yuvipanda: tools: Remove ensure => absents [puppet] - 10https://gerrit.wikimedia.org/r/243862 [03:26:55] (03PS2) 10Yuvipanda: tools: Remove ensure => absents [puppet] - 10https://gerrit.wikimedia.org/r/243862 [03:32:04] (03CR) 10Yuvipanda: [C: 032] tools: Remove ensure => absents [puppet] - 10https://gerrit.wikimedia.org/r/243862 (owner: 10Yuvipanda) [03:40:51] PROBLEM - Host cr1-eqord is DOWN: CRITICAL - Network Unreachable (208.80.154.198) [03:42:20] PROBLEM 
- Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps DWDM]BR [03:43:11] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]BR [03:43:50] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps DWDM]BR [03:54:41] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]BR [03:58:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [03:59:01] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures [04:12:40] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [04:14:20] ermmm [04:14:31] I guess I should page someone [04:14:35] * yuvipanda checks time in Europe [04:14:38] yes, fatals are up [04:15:43] ok I'm going to try paging bblack before paravoid [04:17:04] hey [04:17:18] heh I guess I won't page bblack than. [04:17:21] hi paravoid [04:17:49] (03PS1) 10Faidon Liambotis: Depool ulsfo, outage [dns] - 10https://gerrit.wikimedia.org/r/243871 [04:18:17] hrm [04:18:23] it's reachable though [04:18:29] sorry, it's way too early [04:18:33] oh I guess we lost eqord... 
[04:19:24] (Abandoned) Faidon Liambotis: Depool ulsfo, outage [dns] - https://gerrit.wikimedia.org/r/243871 (owner: Faidon Liambotis)
[04:20:50] paravoid: do you want me to send an sms / call bblack?
[04:20:55] no
[04:20:59] all is fine
[04:21:01] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[04:21:04] paravoid: ok
[04:21:05] besides having lost a pop
[04:24:46] !log all waves to eqord down, probably related to RT#9619
[04:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:25:00] time to call them
[04:25:50] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[04:34:32] yeah, related
[04:34:32] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193
[04:35:16] that maintenance announce was sent on Sept 10th, which is before we migrated these circuits, which is why the announcement doesn't mention them
[04:35:47] ah
[04:36:36] window is 03:00-07:00 PST
[04:36:40] er
[04:36:41] UTC :)
[04:37:28] so it could be down for 2 and something hours?
[04:37:43] yes
[04:37:56] ok, and nothing much for us to do I gess
[04:37:57] *guess
[04:38:07] nope
[04:38:18] alright!
[04:38:23] I shall head home then :)
[04:38:33] well, the ulsfo-eqiad MPLS is now trying to push more traffic than it can which results in some packet loss
[04:39:02] because we don't have the backup wave in place yet...
[04:39:10] the one I was asking about yesterday :)
[04:39:39] ah!
[04:39:47] are we planning on buying one?
[04:40:09] yes, we already have a quote and have settled on it
[04:40:12] nice
[04:40:13] waiting for some stupid patch panel atm
[04:42:26] paravoid: are you not going back to sleep?
[04:42:33] no, I woke up
[04:42:45] hah ok :D
[04:42:52] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193
[04:42:53] he tweeted 50 minutes ago :)
[04:42:56] heh
[04:42:57] #outing
[04:43:04] Clearly I'm not on twitter enouh
[04:43:06] *enough
[04:43:07] don't try to make sense of my sleep schedule
[04:43:09] paravoid: in that case https://phabricator.wikimedia.org/T113979
[04:43:20] indeed, my sleep cycle is far more regular than yours atm.
[04:43:30] what did I tweet about though
[04:43:33] I don't remember saying that to many people over the last few years.
[04:43:36] I was on a bus for 2 hours, not much else to do when it's one of the city-style buses going 40 miles :/
[04:43:45] oh, https://www.aclu.org/blog/free-future/chinas-nightmarish-citizen-scores-are-warning-americans
[04:43:48] yeah
[04:43:48] scary shit
[04:43:51] just reading it now
[04:44:14] yuvipanda: what about it?
[04:44:16] "It will hurt your score not only if you do these things, but if any of your friends do them. Imagine the social pressure against disobedience or dissent that this will create."
[04:45:18] yuvipanda: I don't think you guys need me to tell you about unix permissions and umask and all that :) I was just throwing ideas on how to fix this issue in a more unix way
[04:45:24] paravoid: Tim L just pointed out we can't set umask to what we wanted to because all users share wikidev and any sshing in will be done as them and not as the tool itself. Although, I'm not sure how that interacts with the s bit on the tool's homedirs.
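For context on the umask point: the umask belongs to the process that creates a file and only removes permission bits at creation time, which is why it follows the account that ssh'd in rather than the tool. A minimal sketch of that effect, assuming a POSIX system; the helper name and the two umask values are illustrative assumptions, not the settings used on the tool hosts:

```python
import os
import stat
import tempfile


def mode_created_under(umask: int) -> int:
    """Create a file requesting mode 0o666 under the given umask and return its permission bits."""
    old = os.umask(umask)  # os.umask returns the previous mask so it can be restored
    try:
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, "example")
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)


print(oct(mode_created_under(0o022)))  # group write bit is masked out
print(oct(mode_created_under(0o002)))  # group write bit survives
```

The setgid ("s") bit on a directory is a separate mechanism: it makes new files inherit the directory's group, but it does not change which permission bits the creating process's umask strips.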
[04:45:37] yuvipanda: other than pitching ideas, I don't have a strong opinion on that
[04:45:46] paravoid: oh yeah, I'm not 'telling' you as much as 'asking' you :D
[04:46:34] everyone belonging to group wikidev is an unfortunate piece of legacy
[04:46:45] or rather, having their primary gid set to wikidev
[04:47:21] but too late for that I'm guessing
[04:47:25] yeah
[04:47:41] that yak has too much hair, I'd think...
[04:48:17] that's /a/common misconception
[04:49:00] ori the barber
[04:49:05] ori the barberian?
[04:50:31] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[04:53:00] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[04:55:49] yuvipanda: (responded)
[04:56:35] paravoid: thanks
[04:56:43] paravoid: heh, yeah, tool accounts have shell set to sillyshell :D
[04:56:48] need to 'fix' that if we want to do this the ssh way
[04:57:03] root@tools-bastion-01:~# getent passwd tools.wikidata-todo
[04:57:03] tools.wikidata-todo:x:51211:51211:tools.wikidata-todo:/data/project/wikidata-todo:/bin/bash
[04:57:16] doesn't look like it?
[04:58:09] hmm
[04:58:12] dn: uid=tools.admin,ou=people,ou=servicegroups,dc=wikimedia,dc=org
[04:58:16] loginShell: /usr/local/bin/sillyshell
[04:58:49] we override it in /etc/nslcd.conf
[04:59:04] user authentication in labs is too fucking complicated
[04:59:12] I thought I made it not override it
[04:59:14] at least a year ago
[04:59:21] root@tools-bastion-01:~# grep map /etc/nslcd.conf
[04:59:22] map passwd loginshell "/bin/bash"
[05:03:33] uhm what?
[05:03:54] we lost icinga-wm, logmsgbot and stashbot?
[05:04:15] operations, Beta-Cluster-Infrastructure, Patch-For-Review: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#1704526 (yuvipanda)
[05:04:17] operations, Beta-Cluster-Infrastructure, Patch-For-Review: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#1704524 (yuvipanda) Resolved>Open lol, apparently there is: ``` <% if @realm == "labs" %>map passwd loginshell "/bin/bash"<% end %> ``` in `nslc...
[05:07:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:08:13] yuvipanda: the labs ldap/nss/nslcd/etc. stack is way too chaotic/complicated
[05:08:46] and things such as the labstore/ldap mess aren't helping :/
[05:08:50] or in fact the ssh-as-tool idea :)
[05:21:10] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[05:21:13] counting the days till that mpls link dies heh
[05:24:46] (PS12) BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[05:27:50] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[05:27:58] bblack: yes.
[05:40:30] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]BR [05:40:47] (03PS11) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) [05:41:21] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [05:45:25] (03CR) 10BBlack: [C: 032] move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [05:49:18] (03PS1) 10BBlack: post-merge syntax fixup for f092ee39 [puppet] - 10https://gerrit.wikimedia.org/r/243880 [05:49:37] grrr fuck submodules [05:49:50] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [05:50:55] (03PS2) 10BBlack: post-merge syntax fixup for f092ee39 [puppet] - 10https://gerrit.wikimedia.org/r/243880 [05:51:10] (03CR) 10BBlack: [C: 032 V: 032] post-merge syntax fixup for f092ee39 [puppet] - 10https://gerrit.wikimedia.org/r/243880 (owner: 10BBlack) [05:54:32] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [05:56:02] (03PS9) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [05:56:38] (03PS13) 10BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) [05:57:09] <_joe_> what's happening with cr1-ulsfo? [05:57:50] there's a maintenance outage on a link from ulsfo<->eqord [05:57:57] <_joe_> oh, ok [05:58:05] we have a backup MPLS link to carry the traffic, it just sucks more [05:58:07] <_joe_> also, good night brandon :) [06:00:03] :P [06:01:50] it's actually three waves that are down... 
[06:01:59] ulsfo<->eqord, eqord<->eqiad, eqord<->codfw [06:02:17] (03CR) 10Giuseppe Lavagetto: [C: 032] Minor fixes to instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/243413 (owner: 10Giuseppe Lavagetto) [06:02:49] oh nice [06:02:51] (03Merged) 10jenkins-bot: Minor fixes to instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/243413 (owner: 10Giuseppe Lavagetto) [06:03:05] (03PS2) 10Giuseppe Lavagetto: Fix signal handling, some cleanup [debs/pybal] - 10https://gerrit.wikimedia.org/r/243414 [06:07:38] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix signal handling, some cleanup [debs/pybal] - 10https://gerrit.wikimedia.org/r/243414 (owner: 10Giuseppe Lavagetto) [06:07:46] (03PS1) 10BBlack: X-Client-IP regex fixup: whole string must match IP chars [puppet] - 10https://gerrit.wikimedia.org/r/243881 [06:08:08] (03Merged) 10jenkins-bot: Fix signal handling, some cleanup [debs/pybal] - 10https://gerrit.wikimedia.org/r/243414 (owner: 10Giuseppe Lavagetto) [06:08:10] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP regex fixup: whole string must match IP chars [puppet] - 10https://gerrit.wikimedia.org/r/243881 (owner: 10BBlack) [06:08:43] _joe_: since you're working on pybal... 
https://gerrit.wikimedia.org/r/#/c/187346/ [06:10:52] (03PS14) 10BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) [06:12:03] <_joe_> paravoid: yeah we discussed that with mark [06:12:20] <_joe_> it needs substantial work and it's in the "roadmap" [06:12:43] <_joe_> mark suggested we should reorg some of pybal's code and use a true state machine [06:12:52] (03CR) 10BBlack: [C: 032] Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [06:13:11] \o/ [06:14:11] <_joe_> paravoid: so my current plans, in order of priority, after I cut a release today would be 1) etcd integration 2) stop shelling out to ipvsadm 3) introduce a state machine for pybal [06:14:31] can we add better logging to the list? [06:14:36] <_joe_> yes [06:14:39] great :) [06:14:47] <_joe_> I should in fact create some tickets [06:14:50] yeah [06:14:54] let's make Pybal a proper project tag [06:14:56] <_joe_> I'm being my usual lazy ass [06:15:53] "better logging" = less spam mostly, or at least split the "All OK" spam and the error notices to different loglevels so that we could make a syslog output file that lacked the spam [06:16:23] <_joe_> bblack: we could also use the logging module instead of "print" [06:17:30] yeah that's what I meant :) [06:17:56] <_joe_> on loglevels, that would mean re-managing logfiles, or making syslog rules to send pybal logs to different files based on severity of the log message [06:18:03] (03PS1) 10BBlack: post-merge syntax fixup for d200332d [puppet] - 10https://gerrit.wikimedia.org/r/243882 [06:18:04] <_joe_> I think that could work [06:18:16] and ideally allow for some kind of introspection into pybal's state, but that's separate than logging [06:18:16] (03CR) 10BBlack: [C: 032 V: 032] post-merge syntax fixup for d200332d 
[puppet] - 10https://gerrit.wikimedia.org/r/243882 (owner: 10BBlack) [06:18:26] as in, it'd be nice to raise icinga alerts when pybal thinks something is down [06:18:30] <_joe_> so, there is no way for a puppet manifest to know the puppet agent config [06:18:31] that kind of thing [06:18:41] especially situations where e.g. depool threshold has been reached [06:18:42] <_joe_> paravoid: for that, we have instrumentation! [06:19:10] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 [06:19:15] <_joe_> paravoid: if you launch the pybal I'm building today with "instrumentation = true" in the config, it starts a web server on a port of your choice [06:19:20] RECOVERY - Host cr1-eqord is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [06:19:28] <_joe_> where you can query pools and single hosts for their state in pybal [06:19:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [06:19:40] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 12, down: 0, shutdown: 0 [06:19:40] !log eqord is back up [06:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:20:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 [06:20:17] <_joe_> and I plan on building the first alerts as soon as this is available everywhere [06:20:22] _joe_: oh that's nice! [06:20:31] individual servers being down is probably less interesting [06:20:35] it's going to be a lot of spam [06:20:49] but e.g. 
depool threshold reached more so I guess [06:20:51] <_joe_> yeah that's interesting for example if you're a server and you depooled yourself [06:21:05] yeah [06:21:20] <_joe_> so while for monitoring just the state of a pool is interesting, single servers are interesting as well [06:21:40] <_joe_> for "programmatic" purposes [06:23:06] definitely some kind of alert on depool-threshold trigger [06:23:21] (possibly via icinga checking intstrumentation?) [06:23:49] (03PS10) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [06:24:14] (03PS1) 10Smalyshev: Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) [06:26:40] (03PS2) 10Smalyshev: Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) [06:30:20] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:30] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] (03PS1) 10Yuvipanda: puppetception: Remove module [puppet] - 10https://gerrit.wikimedia.org/r/243884 [06:31:01] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:41] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:11] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:22] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:04] (03PS2) 10Yuvipanda: puppetception: Remove module [puppet] - 10https://gerrit.wikimedia.org/r/243884 [06:37:19] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetception: Remove module [puppet] - 10https://gerrit.wikimedia.org/r/243884 
(owner: 10Yuvipanda) [06:56:11] PROBLEM - HHVM rendering on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:41] PROBLEM - HHVM rendering on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:41] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:01] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:57:10] PROBLEM - HHVM rendering on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:11] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:12] PROBLEM - HHVM rendering on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:15] um [06:57:17] _joe_: ^ [06:57:21] PROBLEM - HHVM rendering on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:22] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:32] PROBLEM - HHVM rendering on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:41] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:41] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] paravoid: ^ [06:57:42] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:58:21] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:22] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:24] <_joe_> ugh [06:58:30] PROBLEM - Apache HTTP on mw1155 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [06:58:31] PROBLEM - HHVM rendering on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:58:32] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:58:41] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:41] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:47] are these all imagescalers? [06:58:51] <_joe_> yes [06:58:52] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:01] <_joe_> all of them [06:59:11] PROBLEM - HHVM rendering on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:22] tailing other_vhosts_access.log on mw1158 it seems ok [06:59:32] as in still serving requests [07:00:26] <_joe_> it's not [07:00:39] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1704605 (10santhosh) [07:00:46] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1704606 (10Yurik) Brandon, awesome work on this! In theory, zero should be able to handle x-cs for non WP or desktop sites, but has not been extensively tested. We will want that... [07:00:59] hmm indeed. 
I'm not sure where those log entries are coming from and why they are all 200s [07:01:06] <_joe_> we have a ton of connections in time_wait [07:01:16] <_joe_> they are ok, those log entires seem normal [07:06:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 36.36% of data above the critical threshold [500.0] [07:08:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [07:13:52] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [07:18:11] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.754 second response time [07:18:20] RECOVERY - HHVM rendering on mw1158 is OK: HTTP OK: HTTP/1.1 200 OK - 65265 bytes in 9.048 second response time [07:18:50] RECOVERY - HHVM rendering on mw1157 is OK: HTTP OK: HTTP/1.1 200 OK - 65265 bytes in 9.627 second response time [07:21:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:21:41] PROBLEM - HHVM rendering on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:10] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.789 second response time [07:22:10] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.767 second response time [07:22:10] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.098 second response time [07:22:20] RECOVERY - HHVM rendering on mw1155 is OK: HTTP OK: HTTP/1.1 200 OK - 65265 bytes in 7.465 second response time [07:22:20] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14753 bytes in 7.988 second response time [07:22:41] RECOVERY - HHVM rendering on mw1159 is OK: HTTP OK: HTTP/1.1 200 OK - 65273 bytes in 6.859 second response time [07:22:41] RECOVERY - Apache HTTP on 
mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.889 second response time [07:22:42] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 65265 bytes in 5.688 second response time [07:22:50] RECOVERY - HHVM rendering on mw1160 is OK: HTTP OK: HTTP/1.1 200 OK - 65264 bytes in 8.739 second response time [07:23:01] RECOVERY - HHVM rendering on mw1156 is OK: HTTP OK: HTTP/1.1 200 OK - 65264 bytes in 8.437 second response time [07:23:01] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.726 second response time [07:23:01] RECOVERY - HHVM rendering on mw1154 is OK: HTTP OK: HTTP/1.1 200 OK - 65265 bytes in 9.266 second response time [07:23:20] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.851 second response time [07:23:20] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.563 second response time [07:23:22] RECOVERY - HHVM rendering on mw1158 is OK: HTTP OK: HTTP/1.1 200 OK - 65264 bytes in 9.326 second response time [07:23:44] back to normal ? 
[07:24:40] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.015 second response time [07:26:37] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 2 others: Test CXServer in Jessie - https://phabricator.wikimedia.org/T107307#1704629 (10Pginer-WMF) [07:27:10] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1704631 (10Pginer-WMF) [07:27:41] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:30:50] <_joe_> akosiaris: nope, traffic to backends is unusually high [07:32:01] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:33:11] PROBLEM - HHVM rendering on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:34:42] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 65273 bytes in 3.039 second response time [07:38:04] (03PS1) 10Faidon Liambotis: Revert Varnish X-Analytics and netmapper changes [puppet] - 10https://gerrit.wikimedia.org/r/243885 [07:38:50] (03CR) 10Faidon Liambotis: [C: 032] Revert Varnish X-Analytics and netmapper changes [puppet] - 10https://gerrit.wikimedia.org/r/243885 (owner: 10Faidon Liambotis) [07:39:56] (03PS1) 10Giuseppe Lavagetto: imagescalers: double the number of light processes [puppet] - 10https://gerrit.wikimedia.org/r/243886 [07:40:00] PROBLEM - HHVM rendering on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:40:42] <_joe_> I'll test this change on mw1153 first [07:41:30] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 65265 bytes in 4.357 second response time [07:42:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 623 [07:43:58] (03PS2) 
10Alexandros Kosiaris: mariadb: update submodule in production repo [puppet] - 10https://gerrit.wikimedia.org/r/243148 [07:46:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds [08:04:30] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100 [08:04:46] (03CR) 10Alexandros Kosiaris: [C: 032] mariadb: update submodule in production repo [puppet] - 10https://gerrit.wikimedia.org/r/243148 (owner: 10Alexandros Kosiaris) [08:06:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Seconds_Behind_Master: 248 [08:06:30] ^I am not sure what you just committed? [08:06:40] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: puppet fail [08:16:44] 6operations, 10ops-eqiad: db1026 degraded RAID - https://phabricator.wikimedia.org/T114738#1704680 (10jcrespo) 3NEW [08:17:17] ACKNOWLEDGEMENT - RAID on db1026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T114738 [08:18:20] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100 [08:20:19] (03PS1) 10Alexandros Kosiaris: icinga: remove unused notify-by-epager commands [puppet] - 10https://gerrit.wikimedia.org/r/243888 [08:26:00] RECOVERY - Freshness of OCSP Stapling files on cp1043 is OK: OK [08:27:02] RECOVERY - Freshness of OCSP Stapling files on cp1044 is OK: OK [08:28:41] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 75, down: 0, shutdown: 0 [08:33:38] (03PS1) 10Jcrespo: Set dbstore's MariaDB to not page to everyone when lagged [puppet] - 10https://gerrit.wikimedia.org/r/243892 [08:34:13] (03CR) 10jenkins-bot: [V: 04-1] Set dbstore's MariaDB to not page to everyone when lagged [puppet] - 10https://gerrit.wikimedia.org/r/243892 (owner: 10Jcrespo) [08:36:34] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [08:36:49] (03Abandoned) 10Jcrespo: Set dbstore's MariaDB to not page to everyone when lagged [puppet] - 10https://gerrit.wikimedia.org/r/243892 (owner: 10Jcrespo) [08:40:50] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1704702 (10fgiunchedi) in eqiad we're doing three machines in a rack, ATM row-wise we have 6x in A, 3x in B, 6x in C and 3x in D. so a rack in B or D will do [08:41:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:52:04] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 14.81% of data above the critical threshold [100000000.0] [08:59:24] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1704746 (10fgiunchedi) we have at least two options for expansion: 1. grow the current allocation of three swift zones (i.e. a row is a zone) by allocating 2x machines in each of A/B/C 2. create a ne... 
[09:00:18] (03CR) 10Giuseppe Lavagetto: "While the code is correct, I would have used virtual resources:" [puppet] - 10https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [09:11:04] (03PS2) 10Alexandros Kosiaris: aqs: Allow CQL access from analytics [puppet] - 10https://gerrit.wikimedia.org/r/243635 (https://phabricator.wikimedia.org/T107056) [09:11:16] (03CR) 10Alexandros Kosiaris: [C: 032] aqs: Allow CQL access from analytics [puppet] - 10https://gerrit.wikimedia.org/r/243635 (https://phabricator.wikimedia.org/T107056) (owner: 10Alexandros Kosiaris) [09:13:22] (03CR) 10Hashar: [C: 031] lint: double quoted strings pt.1 [puppet] - 10https://gerrit.wikimedia.org/r/243852 (owner: 10Dzahn) [09:14:12] (03CR) 10Hashar: [C: 031] lint: double quoted strings pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/243853 (owner: 10Dzahn) [09:14:35] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/243854 (owner: 10Dzahn) [09:15:13] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [09:15:16] 6operations, 6Analytics-Kanban, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1704774 (10akosiaris) [09:15:38] ACKNOWLEDGEMENT - Restbase endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) alexandros kosiaris known. https://phabricator.wikimedia.org/T114742 [09:15:38] ACKNOWLEDGEMENT - Restbase root url on aqs1001 is CRITICAL: Connection refused alexandros kosiaris known. 
https://phabricator.wikimedia.org/T114742 [09:15:38] ACKNOWLEDGEMENT - Restbase endpoints health on aqs1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) alexandros kosiaris known. https://phabricator.wikimedia.org/T114742 [09:15:38] ACKNOWLEDGEMENT - Restbase root url on aqs1002 is CRITICAL: Connection refused alexandros kosiaris known. https://phabricator.wikimedia.org/T114742 [09:15:38] ACKNOWLEDGEMENT - Restbase endpoints health on aqs1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) alexandros kosiaris known. https://phabricator.wikimedia.org/T114742 [09:15:38] ACKNOWLEDGEMENT - Restbase root url on aqs1003 is CRITICAL: Connection refused alexandros kosiaris known. 
https://phabricator.wikimedia.org/T114742 [09:16:13] (03PS2) 10Giuseppe Lavagetto: imagescalers: double the number of light processes [puppet] - 10https://gerrit.wikimedia.org/r/243886 [09:17:11] (03CR) 10Giuseppe Lavagetto: [C: 032] imagescalers: double the number of light processes [puppet] - 10https://gerrit.wikimedia.org/r/243886 (owner: 10Giuseppe Lavagetto) [09:17:18] (03PS3) 10Giuseppe Lavagetto: imagescalers: double the number of light processes [puppet] - 10https://gerrit.wikimedia.org/r/243886 [09:17:34] (03CR) 10Hashar: [C: 04-1] "Trivial indent oddity in modules/varnish/manifests/instance.pp" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243854 (owner: 10Dzahn) [09:18:26] (03CR) 10Hashar: [C: 031] lint: double quoted strings pt.3 [puppet] - 10https://gerrit.wikimedia.org/r/243855 (owner: 10Dzahn) [09:18:29] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10mmodell) [09:19:12] (03CR) 10Hashar: [C: 031] lvs: double quoted string and other lint [puppet] - 10https://gerrit.wikimedia.org/r/243856 (owner: 10Dzahn) [09:19:24] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10mmodell) [09:19:26] (03PS1) 10Alexandros Kosiaris: aqs: Fix typo introduced in 22ead0c [puppet] - 10https://gerrit.wikimedia.org/r/243898 [09:19:45] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1704787 (10mmodell) 5Open>3stalled [09:20:04] (03CR) 10Hashar: [C: 031] toollabs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243857 (owner: 10Dzahn) [09:20:43] (03PS2) 10Alexandros Kosiaris: aqs: Fix typo introduced in 22ead0c [puppet] - 
10https://gerrit.wikimedia.org/r/243898 [09:20:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] aqs: Fix typo introduced in 22ead0c [puppet] - 10https://gerrit.wikimedia.org/r/243898 (owner: 10Alexandros Kosiaris) [09:21:31] (03PS2) 10Giuseppe Lavagetto: New package version [debs/pybal] - 10https://gerrit.wikimedia.org/r/243415 [09:21:35] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: puppet fail [09:22:00] (03CR) 10Giuseppe Lavagetto: [C: 032] New package version [debs/pybal] - 10https://gerrit.wikimedia.org/r/243415 (owner: 10Giuseppe Lavagetto) [09:22:28] (03Merged) 10jenkins-bot: New package version [debs/pybal] - 10https://gerrit.wikimedia.org/r/243415 (owner: 10Giuseppe Lavagetto) [09:23:24] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:24:19] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1704789 (10fgiunchedi) another related case for UNKNOWN is when datapoints are not being pushed at all, for example "mediawiki memcached error... 
[09:27:54] (03PS1) 10Alexandros Kosiaris: aqs: join $analytics_networks on a whitespace [puppet] - 10https://gerrit.wikimedia.org/r/243899 [09:31:33] (03PS2) 10Alexandros Kosiaris: aqs: join $analytics_networks on a whitespace [puppet] - 10https://gerrit.wikimedia.org/r/243899 [09:31:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] aqs: join $analytics_networks on a whitespace [puppet] - 10https://gerrit.wikimedia.org/r/243899 (owner: 10Alexandros Kosiaris) [09:41:33] (03PS1) 10Alexandros Kosiaris: aqs: analytics_networks is IP subnets, no resolve [puppet] - 10https://gerrit.wikimedia.org/r/243902 [09:42:34] (03PS2) 10Filippo Giunchedi: cassandra: new metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) [09:43:19] (03CR) 10Alexandros Kosiaris: [C: 032] aqs: analytics_networks is IP subnets, no resolve [puppet] - 10https://gerrit.wikimedia.org/r/243902 (owner: 10Alexandros Kosiaris) [09:46:31] (03PS2) 10Filippo Giunchedi: cassandra: enable multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) [09:50:06] <_joe_> !log uploaded a new pybal package for jessie [09:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:12] (03CR) 1020after4: [C: 031] Make deployment rev represent config state [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 (owner: 10Thcipriani) [09:57:39] (03Restored) 10Alexandros Kosiaris: otrs: disable SessionCheckRemoteIP [puppet] - 10https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: 10Faidon Liambotis) [09:58:00] (03CR) 10Alexandros Kosiaris: "Not to be merge before we move to OTRS 4" [puppet] - 10https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: 10Faidon Liambotis) [10:02:14] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold 
[100000000.0] [10:12:02] (03PS1) 10Alexandros Kosiaris: akosiaris: Update dot files [puppet] - 10https://gerrit.wikimedia.org/r/243905 [10:14:14] (03PS1) 10Filippo Giunchedi: graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 [10:14:19] (03CR) 10Alexandros Kosiaris: [C: 032] akosiaris: Update dot files [puppet] - 10https://gerrit.wikimedia.org/r/243905 (owner: 10Alexandros Kosiaris) [10:14:53] (03CR) 10jenkins-bot: [V: 04-1] graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 (owner: 10Filippo Giunchedi) [10:22:55] !log dropping temp recovered tables from db1051 to prepare for repool [10:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:12] (03PS2) 10Filippo Giunchedi: graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 [10:24:49] (03CR) 10jenkins-bot: [V: 04-1] graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 (owner: 10Filippo Giunchedi) [10:26:23] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [10:30:04] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [10:31:15] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:31:23] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:31:41] ^me. fixed [10:33:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:33:13] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[10:36:29] (03PS1) 10Jcrespo: Repooling db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243910 [10:36:37] (03CR) 10jenkins-bot: [V: 04-1] Repooling db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243910 (owner: 10Jcrespo) [10:36:39] (03PS1) 10Alexandros Kosiaris: freshclam: populate proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/243911 [10:38:31] (03PS2) 10Jcrespo: Repooling db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243910 [10:38:35] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [100000000.0] [10:41:20] !log potential extra load on mediawiki recent changes and watchlist on enwiki, please report any slowdown [10:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:42:52] (03CR) 10Alexandros Kosiaris: [C: 032] freshclam: populate proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/243911 (owner: 10Alexandros Kosiaris) [10:43:19] (03CR) 10Jcrespo: [C: 032] Repooling db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243910 (owner: 10Jcrespo) [10:43:54] preparing a quick revert... 
[10:45:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1051 after maintenance (duration: 00m 17s) [10:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:55] seeing some slow queries, but nothing too problematic for now [10:49:28] (03CR) 10Alex Monk: [C: 04-1] "this still has the typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [10:52:18] jdlrobson, hey [10:52:40] (03PS2) 10Alex Monk: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [10:58:47] just writing this incident report [11:03:04] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [11:07:32] (03PS1) 10Jcrespo: Revert "Repooling db1051 with db1055's roles, to fix SPOF" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243912 [11:08:58] (03CR) 10Jcrespo: [C: 032] Revert "Repooling db1051 with db1055's roles, to fix SPOF" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243912 (owner: 10Jcrespo) [11:09:04] (03Merged) 10jenkins-bot: Revert "Repooling db1051 with db1055's roles, to fix SPOF" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243912 (owner: 10Jcrespo) [11:10:08] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 for more maintenance (duration: 00m 17s) [11:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:58] Bug detected after only 18 API request errors, not that bad, is it? [11:11:39] now, to wait for another 5-day ALTER TABLE :-( [11:12:41] greg-g, jdlrobson: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151005-MediaWiki [11:14:11] arrray is a valid function name, WTF? [11:14:43] ah, a user defined function, I suppose [11:17:52] this wouldn't happen in Java! 
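Editor's note on the "arrray" exchange above: a syntax-only lint accepts a misspelled constructor when it still parses as an ordinary call to a (possibly user-defined) function, so the failure only shows up once the code runs. The sketch below is a rough Python analogy, not the PHP that was deployed. `lisst` is a made-up typo for `list`, and `lint` and `load` are hypothetical helpers standing in for a `php -l` style check and for actually loading the config.

```python
# Analogy only: Python stands in for PHP, and `lisst` is an invented typo.
CONFIG_SOURCE = "namespaces = lisst((0, 1, 2))\n"


def lint(source):
    """Syntax-only check: parses the source but never runs it."""
    try:
        compile(source, "<config>", "exec")
        return True
    except SyntaxError:
        return False


def load(source):
    """Execute the config; return the error message, or None on success."""
    try:
        exec(compile(source, "<config>", "exec"), {})
        return None
    except NameError as err:
        return str(err)


lint_ok = lint(CONFIG_SOURCE)      # the typo is still a well-formed call
load_error = load(CONFIG_SOURCE)   # the undefined name only fails at run time

print("lint passed:", lint_ok)
print("load error:", load_error)
```

Catching this class of mistake before deployment seems to need a check that resolves names or executes the file, which a parse-only lint does not do.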
[11:18:01] * jynus goes to rewrite mediawiki in Java [11:18:14] jynus: that was proposed years ago ;) [11:18:23] Actionables: * Jaime to rewrite MediaWiki in Java, where this would not happen [11:18:32] is there anything that hasn't been proposed? [11:18:55] rewrite it in FORTRAN? [11:19:02] Now you're just being silly [11:19:24] A lisp version would be cool, with thousands of ) at the end [11:19:30] :-) [11:25:18] <_joe_> jynus: if you want to write it in FORTRAN (which you properly wrote all capitalized) there was a cgi library somewhere [11:32:34] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail [11:33:09] !log performing schema change on db1051 enwiki revision [11:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:25] this time is a "fast ALTER TABLE", but we will see how fast it is [11:35:36] (03PS3) 10Alex Monk: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [11:39:47] (03PS1) 10Hashar: contint: remove pylint/pyflakes packages [puppet] - 10https://gerrit.wikimedia.org/r/243915 (https://phabricator.wikimedia.org/T114360) [11:45:43] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1705063 (10hashar) [11:49:59] (03PS3) 10Filippo Giunchedi: graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 [11:55:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [11:55:42] (03CR) 10Jcrespo: "I've started pt-heartbeat on es2 and es3 (es1 is ro), and provided extra grants needed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [11:56:32] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1705091 (10fgiunchedi) ok so I've updated the patch and uploaded `0.29.0~git+20150813-2`, also the package has now a corresponding `operations... [11:57:32] <_joe_> godog: the videoscalers are mw1152 and mw1259-60 [11:57:38] <_joe_> just FYI [12:00:14] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:52] (03PS1) 10KartikMistry: Enable Suggestion in af, gl, gu, mk, oc, sh and simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243919 (https://phabricator.wikimedia.org/T112848) [12:03:05] (03PS1) 10Glaisher: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) [12:04:09] 6operations, 7Database: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1705156 (10jcrespo) 3NEW [12:05:13] 6operations, 7Database: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1705167 (10jcrespo) CCing Aaron so that he know about its progress. 
[12:05:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:06:59] 6operations, 7Database: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1705170 (10jcrespo) [12:09:52] 6operations, 7Database: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1705178 (10jcrespo) [12:11:12] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1705183 (10Paladox) When will this be deployed on production since commons still shows the error. [12:14:16] (03PS1) 10Glaisher: Remove $wgLanguageCode for special wikis in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243921 [12:31:07] 6operations, 10Continuous-Integration-Config, 10Dumps-Generation, 5Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#1705244 (10hashar) [12:54:13] _joe_: ack, thanks! [12:55:09] (03CR) 10Florianschmidtwelzow: "Notice: Currently all wmf wikis uses the new 1.27 release cycle, so this could be merged now. It will work without this change at least fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [12:58:59] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/243915 (https://phabricator.wikimedia.org/T114360) (owner: 10Hashar) [13:04:35] (03PS1) 10Hashar: contint: restore unattended upgrade on slaves [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) [13:04:55] (03CR) 10Hashar: "Added back with https://gerrit.wikimedia.org/r/243925" [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) (owner: 10Hashar) [13:05:54] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [13:06:35] (03PS4) 10Filippo Giunchedi: graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 [13:09:47] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1705369 (10NahidSultan) >>! In T109810#1572602, @Jalexander wrote: > Yeah I'm not sure we have a formal set of "whose responsible for what" right now with Google Webmaster... Personally I'm happy to help confir... [13:21:11] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Clarify the salt version to use on beta cluster - https://phabricator.wikimedia.org/T114755#1705376 (10hashar) 3NEW a:3ArielGlenn [13:23:19] 6operations, 10ops-eqiad: db1026 degraded RAID - https://phabricator.wikimedia.org/T114738#1705389 (10Cmjohnson) a:3Cmjohnson [13:24:31] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1705402 (10Cmjohnson) We have 10G options in row D...racks 6-8 are all 10G. 
[13:25:51] 6operations, 7Monitoring: I do not receive pages, ever - https://phabricator.wikimedia.org/T114653#1705405 (10Cmjohnson) a:3RobH assigning to Rob [13:27:41] 6operations, 10ops-eqiad: Return polonium/lead to spares - https://phabricator.wikimedia.org/T113962#1705413 (10Cmjohnson) a:3Cmjohnson [13:30:18] 6operations, 10Beta-Cluster-Infrastructure: Beta Cluster no longer listens for HTTPS - https://phabricator.wikimedia.org/T70387#1705417 (10hashar) p:5High>3Low [13:32:33] 6operations, 10Beta-Cluster-Infrastructure: setup a DB backed parser cache - https://phabricator.wikimedia.org/T55457#1705423 (10hashar) p:5Normal>3Low [13:49:38] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1705450 (10cscott) Note that `MediaWiki:Coll-attribution-page` is just the arbitrary title OCG provides for a `/tr... [13:53:56] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1705457 (10mark) @ssastry, @gwicke: This task is great, but could you guys please also turn this info into an inci... [14:15:17] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1705505 (10fgiunchedi) more context, in eqiad we are doing a group of 3x machines in a rack as a single swift 'zone' which means that replicas won't be placed in a single zone. this makes us tolerant... [14:24:18] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705509 (10Ottomata) @Joe, there are two parts to this MVP: - Centralized (and CI controlled) schema sharing - An easy way to get valid data into Kafka. With eventlogging right now, w... 
[14:28:55] (03PS6) 10Filippo Giunchedi: swift: aggregate and report container object/byte stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 (https://phabricator.wikimedia.org/T92322) [14:29:31] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705514 (10Ottomata) @ori I just edited the [[ https://phabricator.wikimedia.org/project/sprint/profile/1474/ | EventBus project description ]] to include a version of the problem statem... [14:39:17] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705528 (10Joe) >>! In T114443#1703097, @Eevans wrote: >>>! In T114443#1701296, @Joe wrote: >> Apart from the concerns on a practical use case which I agree with, I have a big doubt abou... [14:41:53] (03PS7) 10Filippo Giunchedi: swift: aggregate and report container object/byte stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 (https://phabricator.wikimedia.org/T92322) [14:42:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: aggregate and report container object/byte stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 (https://phabricator.wikimedia.org/T92322) (owner: 10Filippo Giunchedi) [14:45:57] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705533 (10Joe) >>! In T114443#1705509, @Ottomata wrote: > @Joe, there are two parts to this MVP: > > - Centralized (and CI controlled) schema sharing > - An easy way to get valid data... [14:51:24] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1705543 (10cscott) I could use some help translating the time period in question into a unix timestamp value appro... [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151006T1500). [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:09] here [15:00:19] Who is SWAT'ng? [15:00:44] kart_: I can SWAT [15:01:35] cool [15:01:36] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705568 (10Joe) So, doing some research on the topic, there is already a kafka rest proxy builtin into Kafka: http://docs.confluent.io/1.0/kafka-rest/docs/intro.html did you take a loo... [15:02:06] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Clarify the salt version to use on beta cluster - https://phabricator.wikimedia.org/T114755#1705569 (10ArielGlenn) The jessie package that should run in labs and on production is the one that the wikimedia cluster provides. Is it possible t... [15:02:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243919 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:02:23] (03Merged) 10jenkins-bot: Enable Suggestion in af, gl, gu, mk, oc, sh and simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243919 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:02:27] 6operations, 5Patch-For-Review: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1705571 (10fgiunchedi) as a proxy metric from swift and not mw we can now use `swift.eqiad-prod.containers.mw-media.originals.objects` (also `.bytes` is available) to keep track... 
[15:02:30] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705572 (10Ottomata) @joe, yes, please see this ticket: https://phabricator.wikimedia.org/T88459 [15:03:35] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:58] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Suggestion in af, gl, gu, mk, oc, sh and simplewiki [[gerrit:243919]] (duration: 00m 18s) [15:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:09] ^ kart_ check please [15:05:18] checking [15:05:56] looks good thcipriani [15:05:58] thanks [15:06:04] kart_: awesome. Thanks! [15:07:01] (03CR) 10DCausse: [C: 031] Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [15:09:08] (03CR) 10Filippo Giunchedi: "@gwicke thoughts on this? I'd like to go ahead with prefix tbh, cluster seems out of scope for this change anyway which is meant to separa" [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [15:14:26] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1705578 (10Papaul) @Robh the spares sheet is up to date with the information you requested [15:19:37] (03PS5) 10Filippo Giunchedi: graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 [15:19:48] Krenair: thanks [15:23:56] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705598 (10Joe) Speaking with @ottomata, it seems that, contrarily to what I understood, this service would proxy connection to kafka only from producers and not from consumers. If this... 
[15:48:49] (03PS1) 10Faidon Liambotis: clamav: fix proxy support [puppet] - 10https://gerrit.wikimedia.org/r/243943 [15:49:23] PROBLEM - NTP on bromine is CRITICAL: NTP CRITICAL: No response from NTP server [15:53:13] (03PS2) 10Faidon Liambotis: clamav: fix proxy support [puppet] - 10https://gerrit.wikimedia.org/r/243943 [15:54:54] (03CR) 10Ori.livneh: [C: 031] graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 (owner: 10Filippo Giunchedi) [15:58:01] 6operations, 7Monitoring: I do not receive pages, ever - https://phabricator.wikimedia.org/T114653#1705695 (10RobH) 5Open>3Resolved So, this was an issue in our icinga configuration, not our SMS provider. We first sent a test sms via email gateway and it was successful. then I dug into the icinga configur... [15:58:57] (03PS1) 10Filippo Giunchedi: cassandra: add restbase-test2001 instances [puppet] - 10https://gerrit.wikimedia.org/r/243944 (https://phabricator.wikimedia.org/T95253) [16:00:04] _joe_ andrewbogott: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151006T1600). Please do the needful. 
[16:00:33] <_joe_> there is nothing in puppetSWAT, so, skipping for today :) [16:00:37] _joe_: I haven’t done this before… do I understand correctly that [16:00:42] ah, yeah, that’s what I was going to ask :) [16:01:44] <_joe_> andrewbogott: yep, nothing to do today :) [16:03:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [16:06:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] cassandra: enable multi-instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [16:07:19] (03CR) 10Alexandros Kosiaris: [C: 032] clamav: fix proxy support [puppet] - 10https://gerrit.wikimedia.org/r/243943 (owner: 10Faidon Liambotis) [16:14:21] (03CR) 10Eevans: cassandra: add restbase-test2001 instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243944 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [16:15:39] (03CR) 10Filippo Giunchedi: cassandra: new metrics-collector version (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [16:15:48] (03PS3) 10Filippo Giunchedi: cassandra: new metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) [16:21:43] (03CR) 10Filippo Giunchedi: cassandra: enable multi-instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [16:23:58] RECOVERY - Restbase root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 690 bytes in 0.014 second response time [16:25:38] (03PS2) 10Dzahn: apache: remove visualwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243340 [16:25:42] (03CR) 10Filippo Giunchedi: cassandra: add restbase-test2001 instances 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243944 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [16:28:40] (03PS1) 10EBernhardson: Set labsearch ES cluster size to 1 node [puppet] - 10https://gerrit.wikimedia.org/r/243947 [16:28:48] PROBLEM - dhclient process on bromine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:29:00] PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused [16:29:06] yuvipanda: turns out i lied, one more nobelium patch: https://gerrit.wikimedia.org/r/243947 [16:29:45] dhclient on bromine? so random, looks like the neon side of things [16:30:11] PROBLEM - salt-minion processes on bromine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:34] ebernhardson: I can merge that, I know the deal [16:31:38] hrmm,,ok, i'll check bromine [16:31:39] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1705780 (10Paladox) Could it be to do with samplerate sinve 720p and 1080p for ogv doint do that instead they do videoQuality and audioQuality... [16:31:47] (03CR) 10Rush: [C: 032] Set labsearch ES cluster size to 1 node [puppet] - 10https://gerrit.wikimedia.org/r/243947 (owner: 10EBernhardson) [16:31:58] RECOVERY - NTP on bromine is OK: NTP OK: Offset 0.002529740334 secs [16:32:12] chasemp: thanks [16:32:13] is somebody using bromine to test? 
[16:32:19] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.007 second response time [16:32:29] this is just a VM for static HTML sites [16:32:37] there is no reason for it to be thaat busy [16:32:38] RECOVERY - dhclient process on bromine is OK: PROCS OK: 0 processes with command name dhclient [16:32:58] RECOVERY - salt-minion processes on bromine is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:33:39] hmm, puppet and apt [16:33:51] and it's over [16:34:03] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1705795 (10Cmjohnson) @fgiunchedi Seems reasonable to me. Space in row A may be tight but will find a place. Do we want to consider this the official plan? [16:34:48] chasemp: is the codfw cluster available for writes [16:34:55] Argh, Etherpad down? [16:35:00] +1 ^ [16:35:11] In the middle of a quarterly review. :-) [16:35:13] root@bromine:~# tail -f /var/log/puppet.log [16:35:13] E: Unable to lock directory /var/lib/apt/lists/ [16:35:13] E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable) [16:35:18] chasemp: the code goes out today, i'll be turning on testwiki mirroring to nobelium in this afternoon swat, and if all goes well turning on codfw [16:35:36] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1705800 (10Paladox) Seems that the video is a vp9 video using vorbus not opus. [16:35:38] ebernhardson: should be yes, do you mind if I buddy up in a screen session to see? 
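Note on the puppet.log excerpt above (16:35:13): "11: Resource temporarily unavailable" is errno 11 (EAGAIN on Linux), which is what a non-blocking lock attempt returns when another process already holds the lock. The sketch below only illustrates that failure mode. It is an assumption-laden demo, not apt's code: it uses `flock` so the conflict can be reproduced inside one process, and the helper name `try_lock` is hypothetical.

```python
import errno
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file descriptor on success, or None if the lock
    is already held elsewhere (the EAGAIN case seen in the log).
    Illustrative helper only; apt's real locking code differs.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        # A held lock surfaces as EAGAIN (or EACCES on some systems).
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return None
        raise
    return fd
```

A caller would keep the returned descriptor open for as long as it needs the lock and close it to release; a `None` result matches the situation on bromine, where a second apt run started while the first still held the lists lock.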
[16:35:49] chasemp: sure, although i use tmux ;) [16:36:15] (but that's not on half the servers and it's just in ~/bin :( [16:36:19] s/half/almost all/ [16:37:31] 95% chance of some whacky surprise especially fw wise so I would like to step through the initial with you, i can deal w/ tmux :) [16:38:28] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:39] James_F: Etherpad back [16:38:46] mutante: Thanks. [16:38:47] eh, lol icinga thinks no [16:38:54] but i see it [16:39:05] Seems to be working for me now too. [16:39:15] it's always in the middle of meetings because it tends to happen when it's used the most [16:39:39] fwiw, i didn't even restart it or anything [16:39:48] * James_F nods. [16:40:03] It comes and goes for me. I wonder if we're bumping against a max connections limit or something. [16:41:27] the backend is called "ueberDB" :p [16:41:33] Oh dear. [16:41:40] Not even Unicoded? :-) [16:41:50] heh [16:41:58] RECOVERY - Restbase root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 690 bytes in 0.018 second response time [16:42:16] [INFO] access - [CREATE] Pad "parsoidpower": hehe [16:42:57] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1705810 (10ssastry) >>! In T114558#1705457, @mark wrote: > @ssastry, @gwicke: This task is great, but could you gu... [16:43:47] yea, so the etherpad log gives me info like that, pads get created, users enter and leave pads... [16:44:07] but no obvious error or limit [16:45:08] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.011 second response time [16:46:11] some pads have names that suggest they might not be exactly wiki related ..
[16:46:53] If only MW has RTC functionality so we didn't need Etherpad… ;-) [16:48:48] PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused [16:49:30] James_F: did you know https://www.mediawiki.org/wiki/Extension:EtherEditor [16:51:38] mutante: Yeah, it's terrible. [16:52:49] heh :) ok [16:53:48] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Puppet has 1 failures [16:55:49] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1705859 (10Eevans) > A message queue is not a database, it's a router. ... Of course, but I drew the analogy because in both cases you have readers and writers of structured data. That... [16:58:46] (03PS1) 10Mobrovac: AQS RESTBase: Use full path to the server module script [puppet] - 10https://gerrit.wikimedia.org/r/243955 [17:01:19] (03PS2) 10Mobrovac: AQS RESTBase: Use full path to the server module script [puppet] - 10https://gerrit.wikimedia.org/r/243955 (https://phabricator.wikimedia.org/T114742) [17:03:49] (03CR) 10Ottomata: [C: 032] AQS RESTBase: Use full path to the server module script [puppet] - 10https://gerrit.wikimedia.org/r/243955 (https://phabricator.wikimedia.org/T114742) (owner: 10Mobrovac) [17:11:08] RECOVERY - Restbase root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 690 bytes in 0.009 second response time [17:19:37] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:48:00] (03PS2) 10Dzahn: apache: remove softwarewikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243342 [17:53:26] 6operations, 10ops-eqiad: db1026 degraded RAID - https://phabricator.wikimedia.org/T114738#1706088 (10Cmjohnson) Replaced disk Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun U... 
[17:55:48] RECOVERY - RAID on db1026 is OK: OK: optimal, 1 logical, 2 physical [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151006T1800). Please do the needful. [18:00:42] (03PS1) 10Dzahn: deactivate webhostingwikipedia.com [dns] - 10https://gerrit.wikimedia.org/r/243970 [18:01:25] (03PS2) 10Dzahn: deactivate webhostingwikipedia.com [dns] - 10https://gerrit.wikimedia.org/r/243970 [18:01:41] (03PS1) 10Andrew Bogott: openstack: Turn off verbose logging for designate [puppet] - 10https://gerrit.wikimedia.org/r/243971 (https://phabricator.wikimedia.org/T114544) [18:02:51] (03PS2) 10Dzahn: apache: remove webhostingwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243344 [18:03:53] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706171 (10brion) @fgiunchedi woohoo looks good! Is there any way to force the video scalers to install the update? It doesn't seem to have go... 
[18:04:34] (03CR) 10Andrew Bogott: [C: 032] openstack: Turn off verbose logging for designate [puppet] - 10https://gerrit.wikimedia.org/r/243971 (https://phabricator.wikimedia.org/T114544) (owner: 10Andrew Bogott) [18:05:31] (03PS1) 10Dzahn: deactive wikifamily.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243972 [18:05:45] (03PS2) 10Dzahn: deactivate wikifamily.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243972 [18:06:44] (03PS2) 10Dzahn: apache: remove wikifamily redirects [puppet] - 10https://gerrit.wikimedia.org/r/243345 [18:08:25] brion: *nod* I can force the videoscalers to upgrade if it looks good, I tried one transcode and it seemed to work [18:08:48] godog: looks ok locally, go for it :D [18:09:01] (03PS1) 10Dzahn: deactivate wikidisclosure.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243973 [18:09:45] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706198 (10Paladox) Fixed problem. Its to do with that video was using vorbis. Currently our setting in timedmediahandler only supports using... [18:09:49] (03PS2) 10Dzahn: apache: remove wikidisclosure redirects [puppet] - 10https://gerrit.wikimedia.org/r/243347 [18:10:11] !log upgrade videoscalers to ffmpeg2theora 0.29.0~git+20150813-2 [18:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:14] (03PS2) 10Dzahn: apache: remove wikiartpedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243341 [18:11:17] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706202 (10brion) No @paladox, input files can use whatever combination they like, that doesn't matter at all here. That particular file was a... [18:11:25] (03PS1) 10Andrew Bogott: Openstack: notify mdns and pool-manager of designate.conf changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/243975 [18:12:04] brion: sweet, it is done [18:12:12] thanks :D [18:12:41] confirmed updated on mw1152, lemme run some files through the ui and make sure they work \o/ [18:13:13] (03PS1) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243976 (https://phabricator.wikimedia.org/T89177) [18:13:15] (03PS1) 10BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243977 (https://phabricator.wikimedia.org/T89177) [18:13:23] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706215 (10fgiunchedi) @brion looks good, videoscalers are running the latest version ``` salt --out=raw -b 1 -t 300 -v -C 'G@cluster:videosc... [18:13:35] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706217 (10Paladox) Oh ok. I updated the file anyways with lower memory usage and corrected audio format. [18:13:38] brion: no problem! [18:14:14] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706219 (10brion) @Paladox PLEASE STOP interfering with bug testing. Leave the files alone. [18:14:48] now we have to rerun all the webms too [18:14:50] fucking paladox [18:15:28] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706226 (10Paladox) @brion could we use the version I uploaded which used less memory and used correct audio format. 
[18:15:51] I presume he means disk space, not memory [18:16:08] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706233 (10brion) @Paladox please just leave it alone. Stop messing with the file. Stop messing with this bug. [18:16:42] brion: you could chat with him as he is on IRC (e.g. #wikimedia-releng ), but you might want to calm down first :) [18:16:53] andre__: he's just as useless without a record [18:16:53] "Oh ok" [18:17:24] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706240 (10Paladox) Ok sorry. I wasent aware at the time that you were testing it only that you were using it as an example showing there was... [18:17:46] * valhallasw`cloud offers brion a hug [18:18:17] I think he needs something stronger [18:18:22] awwwww [18:18:23] (03PS1) 10Andrew Bogott: openstack: Add log-rotation for designate-mdns and designate-pool-manager logs [puppet] - 10https://gerrit.wikimedia.org/r/243978 (https://phabricator.wikimedia.org/T114544) [18:18:58] (03CR) 10Andrew Bogott: [C: 032] Openstack: notify mdns and pool-manager of designate.conf changes. [puppet] - 10https://gerrit.wikimedia.org/r/243975 (owner: 10Andrew Bogott) [18:19:51] ok where was i ..... coffee \o/ :D [18:22:21] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706257 (10brion) Ok, the 360p/240p/160p ogvs rendered correctly! Yay! The 480p ogg and the webms were damaged [18:22:46] mutante: The mailman upgrade appears to have broken Josie's mailman shell scripts -- can you help me debug? 
[18:23:05] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706261 (10brion) Have to wait another 48 minutes to re-run and fix the webm & 480p ogv transcodes. :( [18:23:26] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review, 3labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1706263 (10Andrew) well, setting debug=False and verbose=False didn't actually stop me from getting a verbose log. So that nee... [18:25:50] mutante: seems to be related to csrf cross site tokens.. [18:27:29] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail [18:28:02] cajoel: can you paste errors on a pastebin [18:28:18] mutante: 'The form lifetime has expired. (request forgery check)' [18:28:39] looks like session cookies aren't enough anymore -- need to pull this token [18:28:39] name="csrf_token" [18:28:45] and submit that too [18:30:12] cajoel: ack, i can confirm this from release notes [18:30:16] - The web admin interface has been hardened against CSRF attacks by adding a hidden, encrypted token with a time stamp to form submissions and not accepting authentication by cookie if the token is missing, invalid or older than the new mm_cfg.py setting FORM_LIFETIME which defaults to one hour. [18:30:23] was added in 2.1.15 [18:30:47] I'll work out a way to parse it out [18:30:53] 6operations, 10ops-eqiad: db1026 degraded RAID - https://phabricator.wikimedia.org/T114738#1706290 (10Cmjohnson) 5Open>3Resolved Fixed cmjohnson@db1026:~$ sudo megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firm... [18:30:56] and resend [18:31:21] mutante: I'd really rather rewrite as python+requests, but meh.. 
hammer first [18:31:32] mutante: thanks for the code confirmation [18:31:46] cajoel: np, alright [18:36:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [18:43:15] (03PS8) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [18:46:59] (03PS1) 10Rush: Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 [18:48:29] !log fixing https://phabricator.wikimedia.org/T109216 on labstore1002 [18:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:40] ebernhardson: chasemp thanks for merging the other ES patch! [18:49:02] (03CR) 10Rush: [C: 04-1] "I am sitting on this until I can coordinate with brandon." [puppet] - 10https://gerrit.wikimedia.org/r/243982 (owner: 10Rush) [18:51:33] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1706362 (10brion) 5Open>3Resolved Ok it all seems to have worked out. :) Thanks for the package update @fgiunchedi! [18:53:14] (03CR) 10Rush: [C: 04-1] "I am only -1'ing as a formality so I don't forget why this is held up. We noticed these failures are already hitting graphite now (which " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [19:01:00] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1706391 (10chasemp) We are working through this slowly. Brandon yesterday outlined our existing options: ```bblack: the two best option... 
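Note on the mailman CSRF exchange above (18:28 to 18:31): the fix described there is to pull the hidden `csrf_token` field out of the admin form and send it back with the submission, since the session cookie alone is no longer accepted. A minimal sketch of that parsing step follows, using only the Python standard library. The field name `csrf_token` comes from the chat; the sample form, its URL path, the `subscribees` field and all function names are hypothetical placeholders, and fetching the page and posting the result are left out.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlencode


class CsrfTokenParser(HTMLParser):
    """Collect the value of the first <input name="csrf_token"> seen."""

    def __init__(self) -> None:
        super().__init__()
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrf_token":
            self.token = attributes.get("value")


def extract_csrf_token(html: str) -> Optional[str]:
    """Return the hidden token from a form page, or None if absent."""
    parser = CsrfTokenParser()
    parser.feed(html)
    return parser.token


def build_form_body(html: str, fields: dict) -> str:
    """URL-encode `fields` plus the token found in `html`."""
    token = extract_csrf_token(html)
    if token is None:
        raise ValueError("no csrf_token field found in the form")
    payload = dict(fields)
    payload["csrf_token"] = token
    return urlencode(payload)


# Hypothetical form markup, only for trying the functions above.
SAMPLE_FORM = """
<form action="/mailman/admin/example-list" method="POST">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="subscribees">
</form>
"""
```

Because the token carries a time stamp and expires after FORM_LIFETIME (one hour by default, per the release note quoted above), a script would presumably need to re-fetch the form shortly before each submission rather than cache the token.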
[19:07:55] (03PS1) 10Hashar: beta: point parsoid back to source code [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) [19:08:56] (03CR) 10Hashar: "The previous commit I refer to is https://gerrit.wikimedia.org/r/#/c/169622/2/manifests/role/parsoid.pp,unified" [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:09:23] (03CR) 10Dduvall: [C: 04-1] "I have one tiny concern regarding cleanup/rollback/checks but barring that this looks good to go." (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 (owner: 10Thcipriani) [19:10:47] (03CR) 10Subramanya Sastry: "Even this fix should be sufficient for now. we don't update modules that often and when we do, we usually followup with a patch to the dep" [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:14:41] (03PS2) 10Dzahn: ishmael: remove module, decom service [puppet] - 10https://gerrit.wikimedia.org/r/243714 (https://phabricator.wikimedia.org/T109777) [19:14:55] (03CR) 10Hashar: [C: 031 V: 032] "I have cherry picked it on the beta cluster puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:16:59] (03PS3) 10Dzahn: ishmael: remove module, decom service [puppet] - 10https://gerrit.wikimedia.org/r/243714 (https://phabricator.wikimedia.org/T109777) [19:17:57] 6operations, 10Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1706499 (10Kelson) @Ariel Great, please let me know if you need something from my side. [19:18:42] (03PS2) 10Hashar: beta: point parsoid back to source code [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) [19:19:22] (03CR) 10Hashar: [V: 032] "PS2 removed the bit in the commit message about moving the beta setting files out of deploy. 
Turns out it is preferred there." [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:33:40] (03PS1) 1020after4: symlinks for wmf/1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243991 [19:36:47] (03CR) 10Dzahn: [C: 032] ishmael: remove module, decom service [puppet] - 10https://gerrit.wikimedia.org/r/243714 (https://phabricator.wikimedia.org/T109777) (owner: 10Dzahn) [19:37:21] (03PS1) 10Hashar: beta: parsoid now uses modules defined in source [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) [19:37:36] (03PS2) 10Hashar: beta: parsoid now uses modules defined in source [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) [19:38:46] (03CR) 10Hashar: [C: 04-1] "Requires the Jenkins job to do the npm install. It is probably better to have that step executed directly on the parsoid instance." [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:51:07] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1706630 (10VictorGrigas) These images were just approved by our legal team. Sorry for the wait: https://commons.wikimedia.org/w/index.php?title=Category:Wikimedia_Foundation_servers_2015&action=edit&redlink=1 If this doesn't work for som... [19:51:50] (03CR) 10Hashar: "Depends on CI change https://gerrit.wikimedia.org/r/243997" [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:56:17] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1706639 (10JanZerebecki) This is not yet doing any apache linting / parse checking, which it did before. Does this need to... 
[19:59:47] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1706657 (10hashar) Well done @VictorGrigas :) [20:00:23] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1706658 (10Dzahn) @JanZerebecki new task that is linked there though please? [20:00:46] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1706659 (10VictorGrigas) Thanks. This one is DEF my favorite: https://commons.wikimedia.org/wiki/File:Wikimedia_Foundation_Servers_2015-64.jpg [20:01:33] papaul: ^ I think I see you're future staff photo being photo #3 ;) [20:03:58] JohnFLewis: your [20:04:10] Reedy: bah sorry :( [20:11:22] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [20:18:49] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1706686 (10Florian) @VictorGrigas Great work, thanks! :) Would you like to add your photos to the appropriate categories of the server's //location//? Would be nice to know, where you did what photo (I assume the photos in the WMF Founrdat... 
[20:19:27] (03CR) 1020after4: [C: 032] symlinks for wmf/1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243991 (owner: 1020after4) [20:19:33] (03Merged) 10jenkins-bot: symlinks for wmf/1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243991 (owner: 1020after4) [20:20:54] !log twentyafterfour@tin Started scap: sync wmf/1.27.0-wmf.1 [20:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:54] (03CR) 10Dduvall: [C: 032] Make deployment rev represent config state (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 (owner: 10Thcipriani) [20:25:03] (03PS3) 10Dduvall: Make deployment rev represent config state [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 (owner: 10Thcipriani) [20:26:18] (03PS1) 10Smalyshev: Production logstash uses port 10514, fix the configuration for WDQS. [puppet] - 10https://gerrit.wikimedia.org/r/244045 [20:26:54] (03CR) 10Dduvall: [C: 032] Make deployment rev represent config state [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 (owner: 10Thcipriani) [20:27:59] (03PS2) 10Smalyshev: Production logstash uses port 10514, fix the configuration for WDQS. [puppet] - 10https://gerrit.wikimedia.org/r/244045 [20:29:07] !log service "ishmael" has been removed (T109777) - removed docroot on neon. tarball exists in /root just in case. code is on https://github.com/asher/ishmael [20:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:23] (03CR) 10EBernhardson: [C: 031] "Matches the mediawiki-config repo, which uses port 10514 for logstash syslog." [puppet] - 10https://gerrit.wikimedia.org/r/244045 (owner: 10Smalyshev) [20:31:44] (03PS3) 10EBernhardson: Production logstash uses port 10514, fix the configuration for WDQS. 
[puppet] - 10https://gerrit.wikimedia.org/r/244045 (owner: 10Smalyshev) [20:33:33] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:12] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [20:35:31] (03Merged) 10jenkins-bot: Make deployment rev represent config state [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 (owner: 10Thcipriani) [20:37:03] PROBLEM - Apache HTTP on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.087 second response time [20:37:23] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.495 second response time [20:37:53] 6operations, 10Continuous-Integration-Config, 7Regression: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#1706729 (10JanZerebecki) 3NEW [20:42:43] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review, 3labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1706745 (10Andrew) best I can tell, loglevels can't be changed. Nonetheless, the attached rotation patch should save us. 
[20:48:12] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1706748 (10Andrew) [20:53:05] (03PS2) 10Dzahn: remove ishmael service [dns] - 10https://gerrit.wikimedia.org/r/243727 (https://phabricator.wikimedia.org/T109777) [20:53:07] !log twentyafterfour@tin Finished scap: sync wmf/1.27.0-wmf.1 (duration: 32m 13s) [20:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:52] (03CR) 10Dzahn: [C: 032] remove ishmael service [dns] - 10https://gerrit.wikimedia.org/r/243727 (https://phabricator.wikimedia.org/T109777) (owner: 10Dzahn) [20:55:48] Farewell, Ishmael. [20:56:48] 6operations, 7Database, 5Patch-For-Review: decom ishmael - https://phabricator.wikimedia.org/T109777#1706777 (10Dzahn) 13:31 < mutante> !log service "ishmael" has been removed (T109777) - removed docroot on neon. tarball exists in /root just in case. code is on https://github.com/asher/ishmael removed from... 
[20:57:17] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244052 [20:57:45] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244052 (owner: 1020after4) [20:57:47] (03PS2) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243976 (https://phabricator.wikimedia.org/T89177) [20:57:51] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244052 (owner: 1020after4) [20:58:09] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.2 [20:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:33] !log disabling puppet on caches for VCL testing [20:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:49] (03CR) 10BBlack: [C: 032] move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243976 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [20:59:42] train: deployed. [21:05:40] 6operations, 10Wikimedia-General-or-Unknown, 7Database, 7Performance: ishmael shows blank graphs - https://phabricator.wikimedia.org/T66581#1706817 (10Dzahn) [21:07:17] 6operations, 7Monitoring: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#1706821 (10Dzahn) the ishmael part of this is gone now, T109777 [21:10:11] 6operations, 7Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1706863 (10Dzahn) Is T56713 just a duplicate of this? [21:10:15] 6operations: Use a clearer realm for logstash.wikimedia.org indicating who it's restricted to - https://phabricator.wikimedia.org/T67480#1706870 (10Krenair) a:5Dzahn>3Krenair [21:10:41] (03CR) 10Hashar: [C: 04-1] "A few random suggestions inline. I haven't really reviewed the code though." 
(036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [21:13:04] (03CR) 10Dzahn: "25 # Default connection port is 11000 ( Wikimedia specific, general default is 11211 )" [puppet] - 10https://gerrit.wikimedia.org/r/243651 (owner: 10Muehlenhoff) [21:13:12] (03PS2) 10Dzahn: Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243651 (owner: 10Muehlenhoff) [21:13:27] (03PS3) 10Dzahn: memcached: move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243651 (owner: 10Muehlenhoff) [21:14:11] (03CR) 10Dzahn: [C: 032] memcached: move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243651 (owner: 10Muehlenhoff) [21:14:58] greg-g, twentyafterfour: could https://phabricator.wikimedia.org/T114810 be an issue introduced in wmf.2? [21:15:03] !log restarted phd in response to a phabricator setup issue [21:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:41] Krenair: maybe.. [21:16:50] (03CR) 10Dzahn: "confirmed on mc1001. no issues here. 
this is just adding the already existing rule" [puppet] - https://gerrit.wikimedia.org/r/243651 (owner: Muehlenhoff)
[21:17:03] Krenair: indeed, looks like it's related
[21:18:31] (PS2) Dzahn: openstack: Add log-rotation for designate-mdns and designate-pool-manager logs [puppet] - https://gerrit.wikimedia.org/r/243978 (https://phabricator.wikimedia.org/T114544) (owner: Andrew Bogott)
[21:18:42] twentyafterfour: for the record, feel free to ignore the meeting you're in
[21:20:02] operations, Labs, Labs-Infrastructure, Patch-For-Review, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1706937 (Dzahn) p:Triage>High
[21:20:18] (CR) Dzahn: [C: 2] openstack: Add log-rotation for designate-mdns and designate-pool-manager logs [puppet] - https://gerrit.wikimedia.org/r/243978 (https://phabricator.wikimedia.org/T114544) (owner: Andrew Bogott)
[21:21:15] operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1706954 (VictorGrigas) Yes they are all in Dallas, would you like to label them as such?
[21:23:36] greg-g: thanks ;)
[21:24:00] Krenair: should we roll back? I'm thinking so...
[21:24:35] twentyafterfour: +1
[21:25:17] I'm not sure how many pages are affected
[21:25:20] operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1707006 (Florian) @VictorGrigas Already done :) -> https://commons.wikimedia.org/wiki/Category:Wikimedia_servers_in_Carrollton
[21:25:29] operations, Labs, Labs-Infrastructure, Patch-For-Review, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1707015 (Dzahn) merged and config snippets got added in holmium.
we should confirm tomorrow or so it got rotated
[21:25:51] the error is happening enough to rack up a few hundred log hits
[21:26:07] operations, Labs, Labs-Infrastructure, Patch-For-Review, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1707021 (Dzahn) p:High>Normal
[21:26:10] Krenair: if merging https://phabricator.wikimedia.org/T114808 was correct, then I can't save _any_ page :)
[21:26:14] that's a lot :P
[21:26:21] operations, Labs, Labs-Infrastructure, Patch-For-Review, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1707025 (Dzahn) a:Dzahn
[21:26:36] operations, Labs, Labs-Infrastructure, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1698955 (Dzahn)
[21:27:45] operations, Labs, Labs-Infrastructure, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1707045 (Dzahn) a:Dzahn>Andrew @Andrew here, you uploaded the fix. wanna close it tomorrow?
[21:30:25] Ops-Access-Requests, operations, Patch-For-Review: adding dcausse to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T114642#1707051 (RobH) Summary: All manager approvals are in for this patchset: https://gerrit.wikimedia.org/r/#/c/243686/ The 3 day wait expires...
[21:30:43] Ops-Access-Requests, operations, Patch-For-Review: adding dcausse to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T114642#1707053 (RobH) p:Triage>Normal
[21:32:09] Ops-Access-Requests, operations, Discovery-Wikidata-Query-Service-Sprint, Icinga, Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1707062 (Smalyshev) @DZahn any progress on this?
[21:32:47] Ops-Access-Requests, operations: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1707064 (RobH) Summary: All approvals have been granted for the access request to give @VBaranetsky shell access to bastion group + analytics-privatedata-users. Unfortunately, I did...
[21:32:49] Ops-Access-Requests, operations, Discovery-Wikidata-Query-Service-Sprint, Icinga, Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1707066 (Smalyshev)
[21:34:53] (PS1) 20after4: group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/244056
[21:35:37] (CR) 20after4: "rollback due to T114810" [mediawiki-config] - https://gerrit.wikimedia.org/r/244056 (owner: 20after4)
[21:35:57] (PS2) 20after4: group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/244056 (https://phabricator.wikimedia.org/T114810)
[21:36:10] Ops-Access-Requests, operations: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1707076 (RobH) p:Triage>Normal
[21:36:11] (CR) 20after4: [C: 2] group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/244056 (https://phabricator.wikimedia.org/T114810) (owner: 20after4)
[21:36:17] (Merged) jenkins-bot: group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/244056 (https://phabricator.wikimedia.org/T114810) (owner: 20after4)
[21:36:35] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.1
[21:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:37:20] FlorianSW: Krenair: rolled back. Now I'm still confused about what went wrong. I'm reading the stack traces and the code...
[21:38:41] twentyafterfour: that looks good, thanks :)
[21:41:46] (PS1) EBernhardson: Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060
[21:42:14] (CR) jenkins-bot: [V: -1] Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060 (owner: EBernhardson)
[21:43:12] (PS2) EBernhardson: Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060
[21:43:37] (CR) jenkins-bot: [V: -1] Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060 (owner: EBernhardson)
[21:43:39] twentyafterfour: https://gerrit.wikimedia.org/r/244061
[21:46:54] (PS3) EBernhardson: Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060
[21:50:17] tgr: very cool thanks
[21:52:50] tgr: I commented on the change. I haven't tested it, but I think this wouldn't solve the problem?
[21:53:06] twentyafterfour: you'll want https://gerrit.wikimedia.org/r/#/c/244062/
[21:53:08] $wgFileBackends is not keyed by name
[21:57:17] uh, yeah, sorry
[21:57:22] (CR) Dzahn: "i changed my mind about combining unrelated things into one maintenance window. this one doesn't need a server or service restart" [puppet] - https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[21:57:32] I should have tested it
[21:57:32] (PS2) Dzahn: argon: add base::firewall [puppet] - https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943)
[22:06:38] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/244063/ revert? :)
[22:08:11] (CR) Dzahn: "what about 9390? udpsock.bind(('', 9390)) in udpxircecho.py !"
[puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[22:10:09] twentyafterfour: https://twitter.com/irccloud was inconvenient
[22:12:02] (CR) Dzahn: [C: -1] "we still need a rule for 9390 udp from all appservers to argon" [puppet] - https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[22:18:27] (PS1) Dzahn: mw-rc-irc: firewalling hole for RC bot [puppet] - https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943)
[22:18:46] (PS2) Dzahn: mw-rc-irc: firewall hole for RC IRC bot [puppet] - https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943)
[22:20:42] (PS3) Dzahn: mw-rc-irc: firewall hole for RC IRC bot [puppet] - https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943)
[22:23:08] so that's what it looks like when irccloud dies?
[22:24:44] (PS4) EBernhardson: Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060
[22:26:35] operations, user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1707218 (Dzahn) changed my mind about putting unrelated things (IPv6, firewalling, change motd) into one maintenance window. the firewalling thing doesnt need a server nor service restart. will do that...
[22:27:50] (CR) MaxSem: [C: 1] Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060 (owner: EBernhardson)
[22:27:56] (CR) Alex Monk: [C: 1] mw-rc-irc: firewall hole for RC IRC bot [puppet] - https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[22:32:54] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1707234 (Dzahn) is this ticket really only done once we used mira to deploy something? It seems almost all ops things have been done. (except i'm thinking of the...
[22:34:03] (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/244068/" [puppet] - https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[22:35:42] I'm going to deploy those fixes to wmf.2 and try syncing group0 again
[22:35:55] Blocked-on-Operations, operations, Phabricator, Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1707242 (Legoktm) >>! In T100519#1706391, @chasemp wrote: > @demon explained some of the historical difficulties in having SSH be on a no...
[22:36:53] operations, Discovery, Maps, Traffic: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1707246 (Dzahn)
[22:37:11] operations, Discovery, Maps, Traffic: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1655482 (Dzahn) added project Traffic per "need to update the referrer in Varnish ERB"
[22:40:26] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.2: Deploy https://gerrit.wikimedia.org/r/#/c/244066/ (duration: 01m 40s)
[22:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:40:56] (PS1) Dzahn: varnish: add 'incubator' to maps-frontend regex [puppet] - https://gerrit.wikimedia.org/r/244070 (https://phabricator.wikimedia.org/T113122)
[22:43:12] (PS1) 20after4: group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/244071
[22:43:33] (CR) 20after4: [C: 2] group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/244071 (owner: 20after4)
[22:43:39] (Merged) jenkins-bot: group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/244071 (owner: 20after4)
[22:44:06] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.2
[22:44:55] (PS1) Yurik: Added incubator.wikimedia.org to maps referer test [puppet] - https://gerrit.wikimedia.org/r/244072 (https://phabricator.wikimedia.org/T113122)
[22:46:49] operations, Discovery, Maps, Traffic, Patch-For-Review: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1707305 (Yurik) Lol, @dzahn, I added exactly the same patch as you 4 minutes after :)) Closing mine.
[22:47:34] bblack, funny, two ppl added identical patches for this bug ^
[22:48:34] (Abandoned) Yurik: Added incubator.wikimedia.org to maps referer test [puppet] - https://gerrit.wikimedia.org/r/244072 (https://phabricator.wikimedia.org/T113122) (owner: Yurik)
[22:49:50] heh, ok :)
[22:51:56] (PS1) BBlack: Revert "move netmapper processing to common VCL" [puppet] - https://gerrit.wikimedia.org/r/244074
[22:51:58] (PS2) BBlack: Revert "move netmapper processing to common VCL" [puppet] - https://gerrit.wikimedia.org/r/244074
[22:52:00] (CR) BBlack: [C: 2 V: 2] Revert "move netmapper processing to common VCL" [puppet] - https://gerrit.wikimedia.org/r/244074 (owner: BBlack)
[22:55:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]
[22:55:33] operations, Discovery, Maps, Traffic, Patch-For-Review: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1707329 (Dzahn) p:Triage>Normal
[22:56:47] Ops-Access-Requests, operations, Patch-For-Review: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1707332 (Dzahn)
[22:57:53] PROBLEM - very high load average likely xfs on ms-be1001 is CRITICAL: CRITICAL - load average: 328.00, 202.99, 95.47
[22:58:21] ^ that's all me, and it's already reverted/undo (the 5xx, the ms-be load, etc)
[22:59:20] ok, i was starting to wonder about the ms-be part
[23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151006T2300).
[23:00:25] James_F: what is "ee" actually used for?
[23:00:33] as compared to last time around, I only applied the first half of the VCL changes.
I put them on the cache_upload cluster only (which is where the big spikes in cache_miss and struct sess and backend retries, etc happened last night)
[23:01:02] with the changes just on cache_upload things were fine for quite a while. When I deployed it to the other clusters (text/mobile), *that's* when upload started misbehaving and spiking out into ms-be as well, etc
[23:01:22] so the problem it's really on the upload cluster, it's just some kind of secondary victim.
[23:02:14] err
[23:02:23] so the problem isn't really on the upload cluster, it's just some kind of secondary victim.
[23:02:31] (PS2) Dzahn: varnish: minor lint fixes [puppet] - https://gerrit.wikimedia.org/r/243854
[23:02:32] *nod*
[23:03:17] (CR) jenkins-bot: [V: -1] varnish: minor lint fixes [puppet] - https://gerrit.wikimedia.org/r/243854 (owner: Dzahn)
[23:04:11] eh. No file(s) found for import of '../../../manifests/nagios.pp' heh
[23:05:05] (PS3) Dzahn: varnish: minor lint fixes [puppet] - https://gerrit.wikimedia.org/r/243854
[23:05:22] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:05:41] looks like i'm the only one in SWAT today, will be pushing it out in ~30 min
[23:05:43] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:05:45] (currently in meeting)
[23:06:35] JohnFLewis, Editor Engagement -> Collaboration
[23:06:46] I think I missed a step in the middle there
[23:06:53] Core Features or Growth or something
[23:07:09] (CR) Dzahn: varnish: minor lint fixes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/243854 (owner: Dzahn)
[23:07:25] Krenair: mostly meant for "does it serve any actual purpose" because re. https://phabricator.wikimedia.org/T114829
[23:07:31] (PS2) Dzahn: lint: double quoted strings pt.3 [puppet] - https://gerrit.wikimedia.org/r/243855
[23:09:31] JohnFLewis: Nothing.
[23:09:38] JohnFLewis: Hence my request.
[23:10:11] James_F: mind emailing the list telling people to use wikitech-l and then I'll disable it? :)
[23:10:17] JohnFLewis: It gets on average one, minor announcement e-mail a week.
[23:10:42] JohnFLewis: RoanKattouw is doing so now. :-)
[23:11:11] Awesome. I'll keep an eye for the email to appear then I'll disable it :)
[23:11:41] JohnFLewis: Sent
[23:15:02] (CR) EBernhardson: [C: 2] Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060 (owner: EBernhardson)
[23:15:29] (Merged) jenkins-bot: Update CirrusSearch config for testwiki to talk to second cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/244060 (owner: EBernhardson)
[23:15:45] JohnFLewis: https://lists.wikimedia.org/pipermail/ee/2015-October/001549.html
[23:15:49] James_F RoanKattouw: and it's gone! :)
[23:16:12] Whee.
[23:16:13] Thank you.
[23:18:24] !log ebernhardson@tin Synchronized wmf-config/: Enable multicluster ES on testwiki (duration: 00m 17s)
[23:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:19:02] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail
[23:19:28] (PS1) EBernhardson: Revert "Update CirrusSearch config for testwiki to talk to second cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/244076
[23:19:41] (CR) EBernhardson: [C: 2] Revert "Update CirrusSearch config for testwiki to talk to second cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/244076 (owner: EBernhardson)
[23:19:47] (Merged) jenkins-bot: Revert "Update CirrusSearch config for testwiki to talk to second cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/244076 (owner: EBernhardson)
[23:19:58] sigh ...
search still works but the job runners are complaining :( will try again tomorrow after figuring out what went wrong
[23:20:40] !log ebernhardson@tin Synchronized wmf-config: Revert multicluster config for testwiki (duration: 00m 18s)
[23:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:21:57] (PS1) EBernhardson: Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"" [mediawiki-config] - https://gerrit.wikimedia.org/r/244077
[23:22:01] RoanKattouw: your up
[23:26:42] (PS2) Dzahn: lvs: double quoted string and other lint [puppet] - https://gerrit.wikimedia.org/r/243856
[23:26:51] (PS2) Dzahn: toollabs: lint fixes [puppet] - https://gerrit.wikimedia.org/r/243857
[23:26:53] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[23:26:58] (PS2) Dzahn: lint: double quoted strings pt.4 [puppet] - https://gerrit.wikimedia.org/r/243858
[23:27:06] (PS2) Dzahn: lint: re-enable double quoted strings check [puppet] - https://gerrit.wikimedia.org/r/243859
[23:27:46] (CR) jenkins-bot: [V: -1] lint: re-enable double quoted strings check [puppet] - https://gerrit.wikimedia.org/r/243859 (owner: Dzahn)
[23:28:03] RECOVERY - very high load average likely xfs on ms-be1001 is OK: OK - load average: 15.33, 5.58, 2.02
[23:31:47] (PS1) Dzahn: deactivate wikimaps.[com|net|org] domains [dns] - https://gerrit.wikimedia.org/r/244078
[23:32:26] (CR) Jforrester: "Now due to go out tomorrow."
[mediawiki-config] - https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348) (owner: Jforrester)
[23:33:03] (PS2) Dzahn: apache: remove wikimaps redirects [puppet] - https://gerrit.wikimedia.org/r/243348
[23:33:20] greg-g: FYI, https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=190326&oldid=190323 /should/ be a no-op, but just in case RoanKattouw and Krenair will both be on duty to make sure it doesn't blow up (and revert immediately if it does).
[23:33:25] (PS2) Dzahn: deactivate wikimaps.[com|net|org] domains [dns] - https://gerrit.wikimedia.org/r/244078
[23:35:01] (PS1) Dzahn: deactivate indiawikipedia.com [dns] - https://gerrit.wikimedia.org/r/244081
[23:36:03] (PS1) Dzahn: deactivate vikipedia.com.tr [dns] - https://gerrit.wikimedia.org/r/244082
[23:37:05] (PS1) Dzahn: deactivate wikimedia.biz [dns] - https://gerrit.wikimedia.org/r/244084
[23:38:26] (PS1) Dzahn: deactivate wekipedia.com [dns] - https://gerrit.wikimedia.org/r/244085
[23:43:26] (PS1) Dzahn: deactivate wiki[p|m]ediastories.[com|net|org] [dns] - https://gerrit.wikimedia.org/r/244086
[23:44:13] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[23:44:14] (CR) Dzahn: "note how these are redirects to https://wikimediafoundation.org/wiki/Thank_You_All .
this needs to be brought to the attention of Victor " [dns] - https://gerrit.wikimedia.org/r/244086 (owner: Dzahn)
[23:47:07] (PS1) Dzahn: deactivate wikimediacommons.info [dns] - https://gerrit.wikimedia.org/r/244089
[23:49:19] (PS1) Dzahn: deactivate wikipaedia.net [dns] - https://gerrit.wikimedia.org/r/244090
[23:52:59] (PS1) Dzahn: deactivate wikimediacommons.[co.uk|eu|info|jp.net|mobi|net|org] [dns] - https://gerrit.wikimedia.org/r/244092
[23:54:21] (Abandoned) Dzahn: deactivate wikimediacommons.info [dns] - https://gerrit.wikimedia.org/r/244089 (owner: Dzahn)
[23:55:10] (PS2) Dzahn: memcached: move ferm rules out of the module [puppet] - https://gerrit.wikimedia.org/r/243652 (owner: Muehlenhoff)