[00:00:04] should I be worried pep8 and pyflakes are failing? =/
[00:00:14] Krenair: they are
[00:00:38] (PS1) Dzahn: allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469
[00:00:53] elee: I would guess that since they are non-voting nobody has ever made them work there
[00:01:00] hah yeah
[00:01:06] uh its just trivial stuff looks like
[00:01:14] I don't think I'm bothered enough to hit them
[00:01:31] theyre not complaining about anything I did for sure though, that I'm happy about
[00:02:14] elee: if you wanted to be a superstar you could submit a follow up patch that makes them pass :)
[00:02:25] (PS2) Dzahn: allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436)
[00:02:29] * bd808 loves pep8
[00:03:28] bd808: boohoo
[00:03:42] I've actually never dealt with these sort of errors with python
[00:03:51] there is a step in between ignoring it and fixing it
[00:03:53] I'm a sysadmin, what is software development?
[00:03:57] tell jenkins officially to not check on them
[00:04:01] by adding an exception
[00:04:21] regarding lines that are > 80 chars long
[00:04:30] whats the accepted python "design" standard?
[00:06:13] (CR) Dzahn: [C: 1] "mira.codfw.wmnet has address 10.192.16.132" [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[00:08:05] (PS3) Dzahn: allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436)
[00:08:37] elee: https://www.python.org/dev/peps/pep-0008/#maximum-line-length
[00:09:16] Some teams strongly prefer a longer line length.
For code maintained exclusively or primarily by a team that can reach agreement on this issue, it is okay to increase the nominal line length from 80 to 100 characters
[00:09:41] 78 chars should be etched in stone
[00:09:48] Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.
[00:10:10] roger lets see what I can do
[00:10:49] !log updated scap to 303e72e (Increment deployment stats after sync-wikiversions)
[00:11:21] Krenair: bd808: https://gerrit.wikimedia.org/r/#/c/223469/3/manifests/role/tcpircbot.pp
[00:12:18] (PS4) Dzahn: allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436)
[00:13:04] I've got this line that violates length
[00:13:04] return 'Wikimedia Server Admin Log bot -- https://wikitech.wikimedia.org/wiki/Morebots'
[00:13:08] can I...
[00:13:09] oh wait
[00:14:19] return ("foo"\n"barf")
[00:14:36] python magically concats strings inside parens
[00:14:40] or put the whole URL into a variable?
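The "strings inside parens" suggestion can be sketched as follows. This is a minimal illustration, not the code in adminlogbot.py: the function names `version_line` and `version_line_backslash` are hypothetical, and only the string itself comes from the log. Python joins adjacent string literals at compile time, so both forms return the same value; PEP 8 prefers the parenthesized one.

```python
def version_line():
    # Hypothetical helper name. Adjacent string literals inside the
    # parentheses are concatenated into one string by the compiler.
    return ('Wikimedia Server Admin Log bot -- '
            'https://wikitech.wikimedia.org/wiki/Morebots')


def version_line_backslash():
    # Same result using a backslash continuation, which PEP 8 discourages
    # when parentheses are available.
    return 'Wikimedia Server Admin Log bot -- ' \
           'https://wikitech.wikimedia.org/wiki/Morebots'


if __name__ == '__main__':
    print(version_line())
```

Neither source line exceeds 79 characters, and no runtime `+` is involved.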
[00:14:50] (PS6) Elee: added year into logging, made pep8 happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:15:01] (CR) Alex Monk: [C: 1] allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[00:15:18] i don't
[00:15:19] NO
[00:16:17] missing whitespace around operator
[00:17:07] (CR) BryanDavis: [C: 1] "The only thing more awesome would be if these addrs came from hiera or another config source" [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[00:17:49] (PS7) Elee: added year into logging, made pep8 happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:18:12] okay I had to use \ but whatever
[00:18:20] need to fix adminlogbot though
[00:18:30] elee: return ( 'Wikimedia ... ?
[00:18:38] not me
[00:18:41] I'm just making pep8 happy
[00:18:53] yes, it is pep8
[00:19:11] 00:14:51 ./adminlogbot.py:36:50: E225 missing whitespace around operator
[00:19:14] that
[00:19:39] yeah
[00:19:47] uh trying to think
[00:19:56] would it be reasonable to return 'Wikimedia Server Admin Log bot -- ' \ 'https://wikitech.wikimedia.org/wiki/Morebots'
[00:20:03] and \n after the \
[00:20:14] so we get:
[00:20:14] return 'Wikimedia Server Admin Log bot -- ' \
[00:20:14] 'https://wikitech.wikimedia.org/wiki/Morebots'
[00:20:24] oh thats what you meant by (
[00:20:39] mutante: you were thinking:
[00:20:40] return ('Wikimedia Server Admin Log bot -- '
[00:20:40] 'https://wikitech.wikimedia.org/wiki/Morebots')
[00:21:28] elee: i was thinking it wants "return ( 'Wikimedia Server Admin Log bot -- https://wikitech.wikimedia.org/wiki/Morebots' )
[00:21:45] right sorry forgot to remove the '
[00:22:14] (PS8) Elee: added year into logging, made pep8 happy [debs/adminbot] -
https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:22:30] sigh
[00:22:35] (PS5) Dzahn: allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436)
[00:22:55] (CR) Dzahn: "agree on hiera / config source - but separately" [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[00:23:50] wait no I asn't supposed to do that
[00:23:50] standby
[00:24:05] (PS9) Elee: added year into logging, made pep8 happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:24:32] finally. okay what's pyflakes complaining about...
[00:24:33] elee: works :)
[00:24:35] =]
[00:26:23] (CR) Dzahn: [C: 2] allow mira to connect to tcpircbot on neon [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[00:26:38] mutante: wtf where did these variables come from
[00:27:22] how do I deal with 00:24:07 ./statusnet.py:11: redefinition of unused 'urlencode' from line 9
[00:27:33] when the purpose of the redefinition is because of different versions?
[00:27:53] (as in the script checks what version it is and then defines the appropriate one)
[00:29:24] bd808, but it should still have synced the file, right?
[00:29:42] Krenair: ummm... yeah
[00:30:08] elee: i dunno :/ for PEP8 there are .pep8 files where you can specify one specific check in one specific file to be skipped
[00:30:12] krenair@mw1001:~$ ls -al /srv/mediawiki/wmf-config/testmiradeploy
[00:30:12] ls: cannot access /srv/mediawiki/wmf-config/testmiradeploy: No such file or directory
[00:30:16] not sure about pyflakes
[00:31:16] Krenair: hmmm... I bet everything synced with tin not mira
[00:31:20] perhaps `assert`?
[00:31:34] bd808, .. what?
[00:32:54] Krenair: yeah.
:( we are going to tweak scap for this
[00:33:10] Oh
[00:33:18] So you think they all pulled the file from tin?
[00:33:30] (PS10) Elee: added year into logging, made pep8 and pyflakes happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:33:33] tin is hardcoded in scap?
[00:33:36] the sync command sent to the proxies is a bard `sync-common` so it falls back to syncing with the master from the config file which is tin
[00:33:47] s/bard/bare/
[00:34:12] https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/main.py#L85-L90
[00:34:23] (PS11) Elee: added year into logging, made pep8 and pyflakes happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:34:47] that will need to add in the sync origin server as an argument to work with multi-master
[00:35:34] !log zirconium - temp puppet disable for role switch
[00:35:34] so that it ends up running a command like `sync-common mira.codwf.wmnet` on each proxy server
[00:37:06] (PS12) Elee: added year into logging, made pep8 and pyflakes happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803)
[00:37:08] alright buddy
[00:37:21] bah, puppet issue with that ferm rule change.. wtf
[00:37:23] SUCCESS
[00:37:25] SUCCESS EVERYWHERE
[00:37:26] checks neon
[00:37:33] elee: :) nicely done
[00:37:58] ugh
[00:37:59] dear god
[00:38:06] okay what other "Easy" tickets are there
[00:38:20] Execution of '/etc/init.d/ferm stop' returned 25:
[00:38:24] operations, Easy, Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1436510 (Elee) Sorry for the delay, everything should be happy now.
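For the earlier pyflakes complaint about a version-dependent `urlencode` being redefined, one common pattern is a guarded import. This is only a sketch: the log does not show what statusnet.py actually contains or how the patch resolved the warning, `build_query` is a hypothetical helper, and whether a particular pyflakes release accepts this form without complaint is an assumption.

```python
try:
    # Python 3 location of urlencode.
    from urllib.parse import urlencode
except ImportError:
    # Python 2 fallback; only reached when the import above fails.
    from urllib import urlencode


def build_query(params):
    # Hypothetical helper: sort the items so the output order is stable.
    return urlencode(sorted(params.items()))


if __name__ == '__main__':
    print(build_query({'status': 'hello world', 'source': 'adminbot'}))
```

Binding the name in exactly one branch of a try/except avoids checking the interpreter version by hand.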
[00:38:29] Could not stop Service[ferm]:
[00:38:32] operations, Easy, Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1436511 (Elee) a:Elee
[00:39:03] elee: https://phabricator.wikimedia.org/T56763 :)
[00:39:28] fine mutante, let me go email bashing first
[00:39:45] it's just the task where voting got enabled
[00:39:52] ah
[00:39:58] i was thinking it was the one to make all files pass it :)
[00:40:07] wait, the jobs didn't vote though
[00:40:38] oh wait
[00:40:41] jenkins does
[00:40:41] ffs
[00:41:02] wait but jenkins only does +2 for build tests
[00:41:07] not for pep8 or pyflakes
[00:41:22] Error in /etc/ferm/conf.d/10_ncsa_allowed line 8:
[00:41:23] ffs
[00:42:02] no such variable: $ESAMS_PUBLIC_PUBLIC_SERVICES
[00:42:15] how did that not happen earlier.. mann
[00:50:40] mutante: know of any easy tickets?
[00:53:12] (CR) Dzahn: "the renaming of the esams network caused this on neon:" [puppet] - https://gerrit.wikimedia.org/r/199293 (owner: Faidon Liambotis)
[00:56:14] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1436515 (Krenair) We also need to make other changes to scap - to be able to deploy at all from mira, we need it to be able to tell each host which master to pull...
[00:58:17] (PS1) Dzahn: icinga: fix ferm rules on neon [puppet] - https://gerrit.wikimedia.org/r/223476
[00:58:23] (CR) Negative24: [C: 1] "Good work!"
[debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: Elee)
[00:59:12] (PS2) Dzahn: icinga: fix ferm rules on neon [puppet] - https://gerrit.wikimedia.org/r/223476
[01:01:06] (PS3) Dzahn: icinga: fix ferm rules on neon [puppet] - https://gerrit.wikimedia.org/r/223476
[01:01:41] (CR) BBlack: [C: 1] icinga: fix ferm rules on neon [puppet] - https://gerrit.wikimedia.org/r/223476 (owner: Dzahn)
[01:01:53] (CR) Dzahn: [C: 2] icinga: fix ferm rules on neon [puppet] - https://gerrit.wikimedia.org/r/223476 (owner: Dzahn)
[01:03:12] (CR) Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/223476/" [puppet] - https://gerrit.wikimedia.org/r/199293 (owner: Faidon Liambotis)
[01:04:40] elee: technically,.. https://phabricator.wikimedia.org/tag/easy/ but i'm not sure how accurate that is and probably only Mediawiki
[01:05:27] it seems everything is easy (if you have done it before) and nothing is (it never takes 5 minutes, especially when you think it's easy)
[01:06:41] rule doesn't apply when tagged as "epic" :p
[01:07:17] also: https://www.mediawiki.org/wiki/Annoying_little_bugs
[01:08:40] (CR) Dzahn: "nice work making all the jenkins check happy:)" [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: Elee)
[01:09:23] (PS2) Dzahn: annualreport: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/223223 (https://phabricator.wikimedia.org/T104936)
[01:11:04] (CR) Dzahn: "now applied after follow-up fix https://gerrit.wikimedia.org/r/#/c/223476/" [puppet] - https://gerrit.wikimedia.org/r/223469 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[01:11:25] Krenair: bd808 ^ now the bot announcement should work
[01:13:48] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1436534 (Dzahn) ```
root@neon:~# iptables -L | grep 9200 ACCEPT tcp -- eventlog1001.eqiad.wmnet anywhere tcp dpt:9200 ACCEPT tcp -- tin.e...
[01:17:00] mutante, bd808: no luck
[01:17:08] 01:16:51 Synchronized wmf-config/testmiradeploy: Test sync logging from mira (duration: 00m 10s)
[01:17:11] nothing here though
[01:17:17] everything looks normal in console
[01:19:09] "If the configuration specifies a CIDR range, only 10 clients within that range are allowed to connect."
[01:19:15] suspects tcpircbot config
[01:19:18] on top of iptables
[01:26:31] (PS1) Dzahn: tcpircbot: allow connections from mira [puppet] - https://gerrit.wikimedia.org/r/223478
[01:28:12] (PS2) Dzahn: tcpircbot: allow connections from mira [puppet] - https://gerrit.wikimedia.org/r/223478 (https://phabricator.wikimedia.org/T95436)
[01:28:46] (PS3) Dzahn: tcpircbot: allow connections from mira [puppet] - https://gerrit.wikimedia.org/r/223478 (https://phabricator.wikimedia.org/T95436)
[01:29:36] alex@alex-laptop:~$ host 2620:0:860:102:10:192:16:132
[01:29:37] 2.3.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer mira.codfw.wmnet.
[01:29:39] interesting
[01:29:48] (CR) Dzahn: [C: 2] tcpircbot: allow connections from mira [puppet] - https://gerrit.wikimedia.org/r/223478 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[01:29:50] wouldn't've expected that to work from my local machine
[01:29:56] :)
[01:30:27] sometimes PTR is really there
[01:31:07] Could not find class passwords::puppet::database for neon.wikimedia.org on node neon.wikimedia.org
[01:31:12] not fun anymore
[01:31:22] PROBLEM - puppet last run on mw2149 is CRITICAL puppet fail
[01:31:24] ip6.arpa?
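On the "ip6.arpa?" question: the name in the `host` output is the reverse-lookup (PTR) name for the IPv6 address, formed by expanding the address to 32 hex digits, reversing them, separating them with dots, and appending `ip6.arpa`. A small sketch using Python's standard `ipaddress` module reproduces the name from the log; the helper name `ptr_name` is hypothetical, and no DNS query is made.

```python
import ipaddress


def ptr_name(address):
    # Hypothetical helper: reverse_pointer builds the PTR query name
    # locally from the address; it does not contact a DNS server.
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == '__main__':
    # Address taken from the `host` command in the log above.
    print(ptr_name('2620:0:860:102:10:192:16:132'))
```

The same property works for IPv4 addresses, where the result ends in `in-addr.arpa`.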
[01:31:29] oh that's probably me :P
[01:31:34] PROBLEM - puppet last run on db2043 is CRITICAL puppet fail
[01:32:05] the puppetfails I mean
[01:32:12] PROBLEM - puppet last run on db2057 is CRITICAL puppet fail
[01:32:13] PROBLEM - puppet last run on db1043 is CRITICAL Puppet has 1 failures
[01:32:13] PROBLEM - puppet last run on mw1125 is CRITICAL puppet fail
[01:32:13] bblack: the passwords::puppet thing?
[01:32:13] PROBLEM - puppet last run on plutonium is CRITICAL puppet fail
[01:32:14] I fixed it I think, but there will be more trailing ones
[01:32:18] mutante: yes
[01:32:22] PROBLEM - puppet last run on virt1001 is CRITICAL Puppet has 4 failures
[01:32:22] PROBLEM - puppet last run on cp1050 is CRITICAL Puppet has 1 failures
[01:32:23] PROBLEM - puppet last run on polonium is CRITICAL Puppet has 1 failures
[01:32:24] PROBLEM - puppet last run on db2007 is CRITICAL Puppet has 1 failures
[01:32:24] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail
[01:32:24] PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 2 failures
[01:32:39] ok, let me kill that
[01:33:08] kill what?
[01:33:17] the bot
[01:33:45] icinga-wm
[01:33:56] ah
[01:34:41] still working on the the passwords::puppet::database thing
[01:36:06] (CR) Dzahn: [C: 1] added year into logging, made pep8 and pyflakes happy [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: Elee)
[01:37:58] er...
[01:38:01] (CR) Dzahn: varnish: Update default varnish error page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223012 (owner: Krinkle)
[01:38:04] do I review my own code? no right?
[01:38:05] =p
[01:38:42] (CR) Dzahn: "how about loading the image from wikitech-static?"
[puppet] - https://gerrit.wikimedia.org/r/223012 (owner: Krinkle)
[01:38:45] mutante: ok should be fixed now, for new puppet runs
[01:39:04] bblack: ok, thanks
[01:39:52] elee: sometimes we'd put a +1 on our own stuff to indicate it can be merged anytime (vs. something that we want to merge ourselves)
[01:39:59] heh
[01:41:00] (PS3) Dzahn: annualreport: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/223223 (https://phabricator.wikimedia.org/T104936)
[01:42:43] PROBLEM - puppet last run on cp2022 is CRITICAL Puppet has 2 failures
[01:43:21] re-stopped the bot, way too early spam-wise :)
[01:48:03] Krenair: tcpircbot .. now ?
[01:49:09] no luck
[01:49:14] 01:48:57 Synchronized wmf-config/testmiradeploy: Test sync logging from mira (duration: 00m 10s)
[01:50:04] hrmm..it even got restarted by pupppet
[01:50:09] and the config changed to allow it
[01:51:21] (CR) Dzahn: [C: 2] annualreport: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/223223 (https://phabricator.wikimedia.org/T104936) (owner: Dzahn)
[01:54:30] (PS1) Dzahn: tcpircbot: allow v4 connections from mira [puppet] - https://gerrit.wikimedia.org/r/223484
[01:55:08] (PS2) Dzahn: tcpircbot: allow v4 connections from mira [puppet] - https://gerrit.wikimedia.org/r/223484
[01:56:56] (CR) Dzahn: [C: 2] "would have thought v6 is preferred anyways.. but nevertheless.."
[puppet] - https://gerrit.wikimedia.org/r/223484 (owner: Dzahn)
[02:02:19] operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1436558 (RobH)
[02:02:38] RECOVERY - puppet last run on mw1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:42] RECOVERY - puppet last run on mw1185 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:02:42] RECOVERY - puppet last run on mw1167 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:02:43] RECOVERY - puppet last run on mw1155 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[02:02:43] RECOVERY - puppet last run on mw1223 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:44] RECOVERY - puppet last run on mw1193 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:44] RECOVERY - puppet last run on mw1152 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:44] RECOVERY - puppet last run on mw1219 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:52] RECOVERY - puppet last run on mw1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:52] RECOVERY - puppet last run on mw1241 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:53] RECOVERY - puppet last run on mw1090 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[02:02:53] RECOVERY - puppet last run on mw1037 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:02:53] RECOVERY - puppet last run on mw1209 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:54] RECOVERY - puppet last run on mw1229 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:54] RECOVERY - puppet last run on mw1022 is OK Puppet is currently enabled, last run 1
minute ago with 0 failures
[02:03:02] (CR) Alex Monk: [C: 1] deployment::server: move releases::upload into role [puppet] - https://gerrit.wikimedia.org/r/223464 (owner: Dzahn)
[02:03:03] RECOVERY - puppet last run on mw2148 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:03:03] RECOVERY - puppet last run on mw2156 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[02:03:03] RECOVERY - puppet last run on mw2137 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:03:03] RECOVERY - puppet last run on mw2153 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:03] RECOVERY - puppet last run on mw2185 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:03] RECOVERY - puppet last run on mw2209 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:03] RECOVERY - puppet last run on mw2211 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:03:04] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:04] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[02:03:05] RECOVERY - puppet last run on mw2201 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:05] RECOVERY - puppet last run on mw2175 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:03:06] RECOVERY - puppet last run on mw2087 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[02:03:06] RECOVERY - puppet last run on mw2195 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[02:03:26] operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013052 (RobH) I still need to investigate/replace use of the ldap-mirror certificate (need to coordinate, as
last time this happened it was tricky getting the ldap service to recognize the update correctly....
[02:04:06] operations, OTRS, HTTPS, notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1436561 (RobH)
[02:04:09] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1436562 (RobH)
[02:04:11] operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1436560 (RobH)
[02:09:54] (PS2) Dzahn: misc-web varnish: switch annualreport to bromine [puppet] - https://gerrit.wikimedia.org/r/223222 (https://phabricator.wikimedia.org/T104936)
[02:11:53] RECOVERY - puppet last run on wtp1004 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:13:00] (CR) Dzahn: [C: 2] misc-web varnish: switch annualreport to bromine [puppet] - https://gerrit.wikimedia.org/r/223222 (https://phabricator.wikimedia.org/T104936) (owner: Dzahn)
[02:13:02] RECOVERY - puppet last run on mw1114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:13:23] RECOVERY - puppet last run on mw2168 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[02:13:23] RECOVERY - puppet last run on labcontrol1002 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[02:13:33] RECOVERY - puppet last run on mw1098 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[02:14:43] RECOVERY - puppet last run on cp1044 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[02:16:02] RECOVERY - puppet last run on ganeti1003 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[02:16:41] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 00m 48s)
[02:16:47] Logged the message, Master
[02:16:50] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-08
02:16:50+00:00
[02:16:55] Logged the message, Master
[02:17:13] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[02:17:16] operations, Traffic, Patch-For-Review: move annual report from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1436565 (Dzahn) [terbium:~] $ apache-fast-test annual.purls bromine.eqiad.wmnet testing 3 urls on 1 servers, totalling 3 requests spawning threads.. https://annual.wikimedia...
[02:17:24] operations, Traffic, Patch-For-Review: move annual report from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1436566 (Dzahn) Open>Resolved
[02:17:25] operations, Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1436567 (Dzahn)
[02:18:48] Krenair: last attempt for today?
[02:19:27] ok
[02:19:47] !log krenair Synchronized wmf-config/testmiradeploy: Test sync logging from mira (duration: 00m 13s)
[02:19:50] yay
[02:19:53] nice :)
[02:20:38] Now we have to get it actually syncing files :p
[02:21:02] RECOVERY - puppet last run on mw1131 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:21:15] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1436571 (Dzahn) after https://gerrit.wikimedia.org/r/#/c/223484/ the bot works now when using mira 19:21 < logmsgbot> !log krenair Synchronized wmf-config/testmi...
[02:21:24] yes :)
[02:23:41] (PS4) Dzahn: annualreport: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/223220 (https://phabricator.wikimedia.org/T104936)
[02:24:51] (CR) Dzahn: [C: 2] annualreport: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/223220 (https://phabricator.wikimedia.org/T104936) (owner: Dzahn)
[02:26:34] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 06m 30s)
[02:29:47] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-08 02:29:46+00:00
[02:31:15] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 00m 50s)
[02:31:24] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-08 02:31:24+00:00
[02:31:31] Logged the message, Master
[02:46:43] (PS1) BBlack: test secret() [puppet] - https://gerrit.wikimedia.org/r/223486
[02:48:14] (CR) BBlack: [C: 2] test secret() [puppet] - https://gerrit.wikimedia.org/r/223486 (owner: BBlack)
[02:57:16] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 10m 22s)
[03:03:09] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-08 03:03:09+00:00
[03:14:24] operations, RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1436608 (GWicke) I don't have any conclusive evidence either way. Most benchmarks others have conducted with cassandra and G1GC show better throughput using jdk8.
[03:26:55] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1436614 (bd808) >>! In T95436#1436515, @Krenair wrote: > We also need to make other changes to scap - to be able to deploy at all from mira, we need it to be able...
[03:45:23] (PS1) BBlack: Revert "test secret()" [puppet] - https://gerrit.wikimedia.org/r/223489
[03:45:25] (PS1) BBlack: sslcert::certificate: no backup/show_diff on private [puppet] - https://gerrit.wikimedia.org/r/223490
[03:45:27] (PS1) BBlack: sslcert::std_cert explicit deps [puppet] - https://gerrit.wikimedia.org/r/223491
[03:45:29] (PS1) BBlack: sslcert: refactor std_cert [puppet] - https://gerrit.wikimedia.org/r/223492
[03:45:41] (CR) BBlack: [C: 2 V: 2] Revert "test secret()" [puppet] - https://gerrit.wikimedia.org/r/223489 (owner: BBlack)
[03:46:49] (CR) BBlack: [C: 2] sslcert::certificate: no backup/show_diff on private [puppet] - https://gerrit.wikimedia.org/r/223490 (owner: BBlack)
[03:47:24] (CR) BBlack: [C: 2] sslcert::std_cert explicit deps [puppet] - https://gerrit.wikimedia.org/r/223491 (owner: BBlack)
[03:54:02] PROBLEM - puppet last run on sodium is CRITICAL puppet fail
[03:58:22] (PS2) BBlack: sslcert: refactor std_cert [puppet] - https://gerrit.wikimedia.org/r/223492
[03:58:24] (PS1) BBlack: add secret parser func [puppet] - https://gerrit.wikimedia.org/r/223494
[03:58:29] of course, always sodium
[03:59:53] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 17.24% of data above the critical threshold [100000000.0]
[04:00:38] which is puppet 2.7.7 :/
[04:01:12] Does anyone know how to make `webservice uwsgi-python` use Python 3?
[04:03:19] bblack, is sodium the only <12.04 host sitting around?
[04:04:16] https://phabricator.wikimedia.org/T80945 makes it look like that
[04:05:14] PROBLEM - puppet last run on mw2174 is CRITICAL Puppet has 1 failures
[04:14:03] PROBLEM - puppet last run on mw1051 is CRITICAL Puppet has 1 failures
[04:21:53] RECOVERY - puppet last run on mw2174 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:27:27] (PS1) GWicke: Increase read parallelism to 96 [puppet] - https://gerrit.wikimedia.org/r/223495
[04:27:29] (PS1) GWicke: Increase the write request timeout to 5s [puppet] - https://gerrit.wikimedia.org/r/223496
[04:31:33] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[04:32:33] RECOVERY - puppet last run on mw1051 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[04:32:40] Krenair: yes, pretty sure just sodium.. /afk...
[04:33:23] the last unicorn
[04:33:34] (PS1) Ebrahim: Enable anon user page creation on draft ns [mediawiki-config] - https://gerrit.wikimedia.org/r/223497
[04:34:07] (PS2) Ebrahim: Enable anon user page creation on draft ns [mediawiki-config] - https://gerrit.wikimedia.org/r/223497 (https://phabricator.wikimedia.org/T105118)
[04:41:24] (CR) Dzahn: "how to line wrap properly within the ferm rule to make it readable?"
[puppet] - https://gerrit.wikimedia.org/r/223476 (owner: Dzahn)
[04:42:59] (PS1) GWicke: Reduce compaction throughput to 100mb/s [puppet] - https://gerrit.wikimedia.org/r/223499
[04:44:37] (PS2) GWicke: Reduce compaction throughput to 110mb/s [puppet] - https://gerrit.wikimedia.org/r/223499
[04:46:32] (PS1) Dzahn: (WIP) - make icinga firewall more readable [puppet] - https://gerrit.wikimedia.org/r/223500
[04:47:40] (PS2) Dzahn: (WIP) - make icinga firewall more readable [puppet] - https://gerrit.wikimedia.org/r/223500
[04:55:38] (CR) Dzahn: varnish: Update default varnish error page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223012 (owner: Krinkle)
[04:59:22] (CR) Dzahn: varnish: Update default varnish error page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223012 (owner: Krinkle)
[05:26:08] (CR) Alex Monk: [C: -1] "Needs to update comment below, mention which wiki in the commit message" [mediawiki-config] - https://gerrit.wikimedia.org/r/223497 (https://phabricator.wikimedia.org/T105118) (owner: Ebrahim)
[05:44:16] (PS1) Springle: repool db1041 [mediawiki-config] - https://gerrit.wikimedia.org/r/223505
[05:44:52] (CR) Springle: [C: 2] repool db1041 [mediawiki-config] - https://gerrit.wikimedia.org/r/223505 (owner: Springle)
[05:44:57] (Merged) jenkins-bot: repool db1041 [mediawiki-config] - https://gerrit.wikimedia.org/r/223505 (owner: Springle)
[05:46:05] !log springle Synchronized wmf-config/db-eqiad.php: repool db1041, warm up (duration: 00m 13s)
[05:46:11] Logged the message, Master
[05:48:20] (PS2) Krinkle: varnish: Update default varnish error page [puppet] - https://gerrit.wikimedia.org/r/223012
[05:53:37] mutante: Hm.. could you help me for a sec with a puppet syntax? How can I in the errorpage.inc.vcl.erb have a variable and then embed that in the url.
[05:53:39] in the html
[05:54:16] maybe with something like snprintf() or just directly.
Not sure how to combine that with "synthetic"
[05:57:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 8 05:57:10 UTC 2015 (duration 57m 9s)
[05:57:14] Logged the message, Master
[06:01:55] <_joe_> Krinkle: gimme 20 minutes to have a coffee and I can help you :)
[06:02:27] _joe_: Thanks :)
[06:03:37] (PS3) Krinkle: varnish: Update default varnish error page [puppet] - https://gerrit.wikimedia.org/r/223012
[06:03:42] (CR) Krinkle: varnish: Update default varnish error page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223012 (owner: Krinkle)
[06:07:55] <_joe_> Krinkle: you need to print a puppet variable into the template?
[06:09:23] _joe_: https://gerrit.wikimedia.org/r/#/c/223012/3/templates/varnish/errorpage.inc.vcl.erb
[06:09:37] _joe_: I want to create a local variable there before the synthetic literal
[06:09:40] and then use it in the literal
[06:10:01] e.g. <% myvar = "something %> at the top of the doc
[06:10:30] <_joe_> and then cool
[06:10:57] <_joe_> note that since you define the variable in the template, you don't use "@myvar"
[06:12:03] <_joe_> you probably want <%- myvar = "..." -%>
[06:12:32] <_joe_> the dash means "discard any trailing newline in the output"
[06:13:12] It's a very large value (like 5K)
[06:13:33] <_joe_> well an erb file is basically ruby
[06:13:37] Does <%- work if the is a line break before "myvar = " ?
[06:13:45] or does it have to be one line [06:14:00] <_joe_> no it can be as many lines as you want [06:15:14] <_joe_> whatever you put within <% %> tags is runy [06:15:16] <_joe_> *ruby [06:15:19] <_joe_> see https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia.vcl.erb#L11-42 [06:15:24] (03PS4) 10Krinkle: varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 [06:15:40] Oh, it starts without - but ends with - [06:16:18] (03PS5) 10Krinkle: varnish: Update default varnish error page [puppet] - 10https://gerrit.wikimedia.org/r/223012 [06:16:23] I guess both work [06:16:27] but there's nothing leading in this case [06:16:40] <_joe_> lemme see :) [06:17:18] <_joe_> it's good [06:18:10] (03CR) 10Krinkle: varnish: Update default varnish error page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223012 (owner: 10Krinkle) [06:20:52] <_joe_> also, thanks for doing this, the old error page was a bit weird [06:24:27] (03CR) 10Matanya: [C: 031] enable ferm for neptunium [puppet] - 10https://gerrit.wikimedia.org/r/223355 (owner: 10Muehlenhoff) [06:31:52] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet has 1 failures [06:34:02] PROBLEM - puppet last run on labsdb1003 is CRITICAL Puppet has 1 failures [06:34:23] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures [06:35:03] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [06:35:12] PROBLEM - puppet last run on db2018 is CRITICAL Puppet has 1 failures [06:36:13] PROBLEM - puppet last run on rhodium is CRITICAL Puppet has 1 failures [06:38:03] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:38:13] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:38:53] PROBLEM - puppet last run on rdb2003 is CRITICAL puppet fail [06:40:03] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:40:04] PROBLEM - puppet last run on 
mw2090 is CRITICAL Puppet has 1 failures [06:45:54] PROBLEM - puppet last run on db1007 is CRITICAL Puppet has 1 failures [06:46:43] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:23] RECOVERY - puppet last run on rhodium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:21] (03CR) 10Jcrespo: [C: 032] Adding an updated versions of the redact.sh scrip [software/redactatron] - 10https://gerrit.wikimedia.org/r/223344 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [06:48:51] (03CR) 10Jcrespo: [V: 032] Adding an updated versions of the redact.sh scrip [software/redactatron] - 10https://gerrit.wikimedia.org/r/223344 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [06:53:37] (03PS1) 10Matanya: add soundcloud to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 [06:54:42] (03PS2) 10Matanya: add soundcloud to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 [06:55:24] RECOVERY - puppet last run on rdb2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:47] (03PS2) 10Muehlenhoff: Convert firewall resource declarations to an include for consistency [puppet] - 10https://gerrit.wikimedia.org/r/222314 [07:02:43] RECOVERY - puppet last run on db1007 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [07:03:53] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [07:05:14] RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 43.66 ms [07:05:33] RECOVERY - puppet last run on labsdb1003 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:06:02] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:06:43] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures 
[07:06:43] RECOVERY - puppet last run on db2018 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:07:28] (03CR) 10Mobrovac: [C: 031] restbase - fix => alignment (lint) [puppet] - 10https://gerrit.wikimedia.org/r/222535 (owner: 10Dzahn) [07:07:53] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:07:53] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [07:07:54] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:07:54] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:09:02] PROBLEM - Restbase root url on restbase1006 is CRITICAL - Socket timeout after 10 seconds [07:09:43] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.005 second response time on port 9042 [07:09:52] RECOVERY - puppet last run on mw2090 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:10:30] (03PS3) 10Muehlenhoff: Convert firewall resource declarations to an include for consistency [puppet] - 10https://gerrit.wikimedia.org/r/222314 [07:10:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Convert firewall resource declarations to an include for consistency [puppet] - 10https://gerrit.wikimedia.org/r/222314 (owner: 10Muehlenhoff) [07:15:39] (03Abandoned) 10Giuseppe Lavagetto: varnish: activate dynamic lookup on one esams host [puppet] - 10https://gerrit.wikimedia.org/r/222091 (owner: 10Giuseppe Lavagetto) [07:22:51] 6operations, 10RESTBase, 6Services, 7RESTBase-API: Expose RESTBase monitoring examples in Swagger spec - https://phabricator.wikimedia.org/T104850#1436896 (10Pchelolo) The PR for this was merged: https://github.com/wikimedia/restbase/pull/276 [07:27:23] RECOVERY - Restbase root url on restbase1006 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.004 second response time [07:42:38] hey 
we are going to upgrade Jenkins in 20 minutes. if anything urgent to merge-in feel free to force merge [07:45:46] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1436914 (10Matanya) 3NEW [07:46:22] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1436925 (10Matanya) [07:46:24] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1436924 (10Matanya) [07:46:39] 6operations: Ferm rules for video scalers - https://phabricator.wikimedia.org/T104970#1436926 (10MoritzMuehlenhoff) [07:47:35] 6operations: Ferm rules for video scalers - https://phabricator.wikimedia.org/T104970#1433459 (10MoritzMuehlenhoff) [07:52:45] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1436937 (10mmodell) @bblack, @dduvall and @krinkle, I think you are all suggesting essentially the s... [08:00:04] hashar zeljkof: Dear anthropoid, the time has come. Please deploy CI infrastructure (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150708T0800). [08:00:11] nice [08:04:20] 6operations: Ferm rules for swift - https://phabricator.wikimedia.org/T104965#1436940 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [08:11:53] !login shutdowning Jenkins for upgrade. [08:11:58] !log shutdowning Jenkins for upgrade. [08:12:03] Logged the message, Master [08:27:05] !log Jenkins is migrating old build histories. 
Lot of disk I/O [08:36:03] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [08:37:23] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [08:41:10] !log Jenkins is migrating old build histories. Lot of disk I/O [08:41:18] bah [08:41:19] morebots: ping [08:41:20] I am a logbot running on tools-exec-1217. [08:41:20] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [08:41:20] To log a message, type !log . [08:41:30] !log Jenkins is migrating old build histories. Lot of disk IO happening [08:41:33] ... [08:41:34] Logged the message, Master [08:41:44] can't log a sentence having a / [08:43:54] !log bounce cassandra on restbase1004, death by compaction [08:44:22] morebots: hop hop [08:44:22] I am a logbot running on tools-exec-1217. [08:44:22] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [08:44:22] To log a message, type !log . [08:44:41] !log bounce cassandra on restbase1004, death by compaction [08:45:03] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [08:45:06] !log bounce cassandra on restbase1004 death by compaction [08:45:22] I think wikitech might be in trouble [08:45:32] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.004 second response time on port 9042 [08:49:26] godog: wikitech is working for me? [08:52:57] moritzm: mhh logout and log back in isn't working for me ATM [08:53:08] well, sometimes [08:54:39] hmm, you're right, it accepts my login, but then I silently remain not-logged-in... [09:00:33] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1436987 (10MoritzMuehlenhoff) >>! 
In T103335#1429015, @MoritzMuehlenhoff wrote: > In Debian the De... [09:01:39] I _think_ it might be nutcracker, looking [09:06:20] !log Jenkins registering jobs with Zuul [09:06:26] Logged the message, Master [09:08:41] (03PS1) 10Ori.livneh: Fix request-matching logic in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/223522 [09:08:54] godog: we're in the same timezone again! :) [09:12:27] ori: hahaha very true [09:12:56] (03PS2) 10Ori.livneh: Fix request-matching logic in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/223522 [09:13:10] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix request-matching logic in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/223522 (owner: 10Ori.livneh) [09:18:43] (03PS1) 10Giuseppe Lavagetto: pybal: refactor pybal::pool, print pools in pybal::web [puppet] - 10https://gerrit.wikimedia.org/r/223523 [09:19:44] (03PS3) 10Giuseppe Lavagetto: imagescalers: reimage mw1153 with HAT [puppet] - 10https://gerrit.wikimedia.org/r/223331 (https://phabricator.wikimedia.org/T84842) [09:20:10] <_joe_> (this is just a cosmetic change, btw) [09:20:38] (03CR) 10Giuseppe Lavagetto: [C: 032] "Just a cosmetic change as I'm reimaging mw1153 today" [puppet] - 10https://gerrit.wikimedia.org/r/223331 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [09:21:03] <_joe_> !log starting reimaging of mw1153, depooling it and scheduling downtime [09:26:19] !log upgraded plugins on jenkins and restarting it [09:26:26] Logged the message, Master [09:28:22] PROBLEM - Host mw1153 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:12] (03PS3) 10Ebrahim: Enable IP user page creation on fawiki's Draft ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223497 (https://phabricator.wikimedia.org/T105118) [09:31:20] !log Jenkins is fully back and operational. Or so should be. 
[09:33:40] <_joe_> !log starting reimaging of mw1153, depooling it and scheduling downtime (at 9:21 UTC) [09:33:44] Logged the message, Master [09:34:01] <_joe_> grr why my schedule-downtime command doesn't work anymore [09:34:45] (03CR) 10Steinsplitter: [C: 031] add soundcloud to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (owner: 10Matanya) [09:35:42] !log Nuking /var/lib/carbon/whisper/ResourceLoader on graphite[12]001. Data prior to rollout of I55f0c44cd considered bogus. [09:40:55] 6operations, 6Labs, 10wikitech.wikimedia.org: intermittent wikitech failures - https://phabricator.wikimedia.org/T105131#1437053 (10fgiunchedi) 3NEW [09:41:13] !log bounce nutcracker on silver [09:41:17] Logged the message, Master [09:41:41] moritzm: can you try again? [09:41:58] ori _joe_ that didn't get logged btw [09:42:14] morebots: hi [09:42:14] I am a logbot running on tools-exec-1217. [09:42:14] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [09:42:14] To log a message, type !log . [09:42:31] !log morebots, are you OK? [09:42:35] Logged the message, Master [09:42:47] yeah it is okay, wikitech was having troubles, see https://phabricator.wikimedia.org/T105131 [09:42:49] !log Nuked /var/lib/carbon/whisper/ResourceLoader on graphite[12]001. Data prior to rollout of I55f0c44cd considered bogus. [09:42:53] Logged the message, Master [09:43:06] !log _joe_: starting reimaging of mw1153, depooling it and scheduling downtime (at 9:21 UTC) [09:43:10] Logged the message, Master [09:43:20] godog: thanks [09:43:38] np, I think that might be the same nutcracker issue we've seen in production [09:43:53] godog: it's working now! [09:44:05] silver using nutcracker is silly [09:44:14] does it not use its own memcached instance on localhost? 
[09:44:15] <_joe_> it is [09:44:18] <_joe_> yes [09:44:35] there's a ticket about that too [09:44:37] at one point silver was meaningfully separate from production [09:44:54] now the whole "oh, it's by design" thing is cited as justification for it being a screwball host [09:45:26] there's no justification for it, the notion that it's more resilient somehow is goofy [09:45:58] what's more resilient? [09:46:09] silver, by virtue of being separate from the app server cluster [09:47:35] ah, indeed [09:47:54] besides (I'm ranting, I know) the useful content is so thoroughly buried under a layer of semantic mediawiki / openstack cruft that no one is crazy enough to rely on it as a source of wisdom for debugging outages [09:48:44] <_joe_> you mean wikitech? [09:48:54] <_joe_> yeah we admitted yesterday we're not good at this [09:49:10] this == ? [09:49:13] <_joe_> (keeping wikitech relevant) [09:49:31] <_joe_> I'm personally trying to be a bit better and update things I touch [09:50:00] I'm confused, wikitech the interface for labs or wikitech the documentation repository ori? 
[09:50:09] godog: exactly [09:50:20] <_joe_> eheh [09:51:46] heh, well the documentation part itself works, but yes it isn't always up to date [09:51:47] there is wikitech-static on a separate stack already [09:53:15] i think it will take something more drastic than good will to fix it [09:53:43] my suggestion was & is automatic deletion of pages that have not been touched in a year [09:54:06] and markdown support, which would be easy enough to do [09:54:17] but primarily the former [09:54:20] <_joe_> that might work, unless something is really untouched for multiple years [09:54:47] <_joe_> Keegan: you might be delighted to know we will be upgrading the imagescalers in the coming weeks [09:56:10] ori: might work, perhaps not deletion but a note certainly, on the basis that we don't necessarily cycle through every subsystem once a year [09:56:12] <_joe_> Actually, I might do it during wikimania so that people can directly complain to you guys in person :) [09:56:24] godog: is https://gerrit.wikimedia.org/r/218905 moot ? [09:56:41] <_joe_> godog: moving to a dedicated namespace could be good [09:56:51] DELETE [09:57:32] <_joe_> PUT [09:57:34] godog: take the wikitech challenge: hit Special:Random and find me a page that (a) hasn't been updated in a year, (b) is better than nothing [09:57:54] <_joe_> ori: that's a drinking game waiting to happen :P [09:58:01] haha [09:58:05] matanya: no, sorry, I'll +1 but unlikely I can deploy/babysit this week [09:58:14] (03CR) 10Filippo Giunchedi: [C: 031] jobchron: log rotate [puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [09:58:24] ori: Oooo markdown support for wikitech? [09:58:39] thanks godog [09:58:39] YuviPanda: I'm suggesting, not volunteering :P [09:58:46] godog: hi! Can you take a deeper look at the uwsgi patch? 
:) [09:58:52] <_joe_> you say markdown and the young hipsters with magenta hair get excited [09:59:05] godog: we would lose such gems as https://wikitech.wikimedia.org/wiki/Adding_a_file_in_innodb [09:59:06] ori: its something I have wanted to build for a while except not sure at all how it will do templates [09:59:23] ori, WTF! [09:59:38] <_joe_> ori: ahahahah [10:00:26] hahaha [10:00:52] I'm sure there is someone on labs who would find that useful [10:00:58] perhaps you prefer https://wikitech.wikimedia.org/wiki/Add_a_server [10:01:16] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 676.015031008 [10:01:19] <_joe_> wow I'm finding only great pages that make me ashamed of myself [10:01:34] <_joe_> like https://wikitech.wikimedia.org/wiki/Search or https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hardware [10:02:02] oh i'm not denying that there's good content [10:02:09] i'm denying that there's good content that hasn't been touched in a year [10:02:19] <_joe_> "Apache not in nagios: Tim gets upset." [10:02:20] <_joe_> ahah [10:02:29] <_joe_> we ought to archive these pages :) [10:03:06] <_joe_> "No upload NFS mounts" [10:04:22] ori: I actually hope that we would be able to move it to the main cluster once we start using horizon [10:04:53] Horizon is on the horizon... [10:05:21] boah. so annoying. why i need to revfresh every 5 minutes my token -.- [10:05:29] (03CR) 10Filippo Giunchedi: [C: 031] "couple of nitpicks (feel free to ignore things prefixed with ~) but overall looks good!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [10:05:30] YuviPanda: sure [10:05:57] paravoid: if it can function without Extension:OpenStack & friends [10:06:18] why wouldn't it be? [10:06:23] Hmm I can kill all the semantic stuff in a week probably. [10:06:44] Only the tools access request depends on it [10:06:56] godog: thanks! 
I'll update and merge shortly. [10:07:29] dunno, maybe the content namespaces can be rescued easily [10:07:52] YuviPanda: cool! [10:24:26] moritzm: does every ferm::service have to have srange [10:24:47] PROBLEM - RAID on mw1153 is CRITICAL: Connection refused by host [10:25:27] PROBLEM - configured eth on mw1153 is CRITICAL: Connection refused by host [10:25:46] PROBLEM - dhclient process on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:25:57] PROBLEM - nutcracker port on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:26:08] PROBLEM - nutcracker process on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:26:18] PROBLEM - DPKG on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:26:18] PROBLEM - puppet last run on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:26:36] PROBLEM - salt-minion processes on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:26:36] PROBLEM - Disk space on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [10:26:57] PROBLEM - HHVM processes on mw1153 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. 
[10:28:33] <_joe_> sigh, again [10:29:57] PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail [10:30:46] PROBLEM - puppet last run on mw1092 is CRITICAL Puppet has 1 failures [10:37:38] (03PS5) 10Yuvipanda: uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 [10:37:45] (03CR) 10jenkins-bot: [V: 04-1] uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [10:37:53] godog: ^ updated, and yes, the uwsgi-* trick does indeed work (I'm using it elsewhere) [10:39:42] matanya: not necessarily, that pretty much depends on the service [10:39:55] (03PS6) 10Yuvipanda: uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 [10:39:59] '#' are comments in systemd unit files too, right? [10:40:25] indeed [10:40:37] godog: alright, going to merge now! I'll keep an eye on graphite [10:41:34] (03CR) 10Yuvipanda: [C: 032] uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [10:41:41] YuviPanda: cool! thanks [10:42:20] afaict upstart doesn't support the same globbing trick heh [10:43:24] YuviPanda: yeah, "#" are comments in systemd units [10:43:53] godog: hmm, Service['uwsgi'] failures... 
[10:44:13] godog: so to be technically correct they should all be Service['uwsgi-graphite-app'] I guess [10:44:26] RECOVERY - configured eth on mw1153 is OK - interfaces up [10:44:37] RECOVERY - dhclient process on mw1153 is OK: PROCS OK: 0 processes with command name dhclient [10:44:48] (03CR) 10JanZerebecki: [C: 031] "After this is merged, to make jenkins voting one now can edit the file zuul/layout.yaml in the git repository integration/config and remov" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: 10Elee) [10:44:57] RECOVERY - nutcracker port on mw1153 is OK: TCP OK - 0.000 second response time on port 11212 [10:45:06] YuviPanda: yup, also I guess the old upstart files won't get removed [10:45:07] RECOVERY - nutcracker process on mw1153 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [10:45:17] RECOVERY - DPKG on mw1153 is OK: All packages OK [10:45:27] RECOVERY - salt-minion processes on mw1153 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:45:27] RECOVERY - Disk space on mw1153 is OK: DISK OK [10:45:37] RECOVERY - RAID on mw1153 is OK no RAID installed [10:45:57] RECOVERY - HHVM processes on mw1153 is OK: PROCS OK: 6 processes with command name hhvm [10:46:50] (03PS1) 10Yuvipanda: uwsgi: Update Service['uwsgi'] to more accurate Service defs [puppet] - 10https://gerrit.wikimedia.org/r/223529 [10:46:58] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:07] PROBLEM - puppet last run on graphite1001 is CRITICAL puppet fail [10:47:07] PROBLEM - puppet last run on mw1153 is CRITICAL Puppet has 6 failures [10:47:13] godog: ^ do you think I should shut down the old upstart based services by hand or by puppet? 
[10:47:16] * YuviPanda prefers by hand [10:47:28] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [10:47:48] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:52] YuviPanda: yeah by hand is fine, I don't think it is widely used anyway now [10:48:03] I use it in two of my things :) [10:48:05] <_joe_> YuviPanda: I've personally given up on auto-decommissioning puppet modules [10:48:18] <_joe_> it takes a lot of effort and it's pointless in most cases [10:48:19] _joe_: me too :) [10:48:24] (03CR) 10Yuvipanda: [C: 032] uwsgi: Update Service['uwsgi'] to more accurate Service defs [puppet] - 10https://gerrit.wikimedia.org/r/223529 (owner: 10Yuvipanda) [10:48:36] <_joe_> you just reimage the thing, or if it's stateful you clean up by hand [10:49:06] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [10:49:07] RECOVERY - puppet last run on mw1153 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:50:56] !log bounce cassandra on restbase1004, death by compaction [10:50:56] RECOVERY - puppet last run on graphite1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:51:00] Logged the message, Master [10:52:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [10:52:27] godog: everything seems ok on graphite! [10:52:47] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [10:52:57] ori: did you provision coal anywhere? 
[10:53:07] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.003 second response time on port 9042 [10:53:33] YuviPanda: graphite1001 [10:53:45] ori: did you give it a hostname anywhere?: [10:53:47] PROBLEM - puppet last run on labmon1001 is CRITICAL puppet fail [10:53:51] I see the service is up and running, need to check.. [10:54:24] _joe_: uhm... [10:54:24] YuviPanda: sweet! I guess the same on labmon1001 too for graphite labs? [10:54:32] > Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key. [10:54:34] godog: nope ^ [10:54:44] Is this from the private key rotation? [10:55:00] you mean the CA? possible yeah [10:55:27] <_joe_> YuviPanda: whassup? [10:55:42] _joe_: Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key. [10:55:44] on labmon1001 [10:56:12] <_joe_> YuviPanda: when did puppet last run successfully there? [10:56:20] > The last Puppet run was at Sat Jun 20 00:48:13 UTC 2015 (13 minutes ago). [10:56:25] although I'm not sure if that was successful [10:56:27] * YuviPanda checks [10:56:37] <_joe_> YuviPanda: sudo [10:56:40] <_joe_> ;) [10:56:48] aaaarrgghh [10:56:50] <_joe_> PEBKAC! [10:56:52] <_joe_> :) [10:56:53] clearly way too early in the morning [10:56:55] Jun 20 isn't 13 minutes ago [10:56:57] or I'm an idiot. [10:57:03] <_joe_> nope [10:57:06] hahaha wat [10:57:13] <_joe_> it's before the migration [10:57:34] * YuviPanda is forcing a puppet run now [10:57:35] <_joe_> actually, well before that [10:57:58] <_joe_> sorry, brb [10:59:28] RECOVERY - puppet last run on labmon1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:33] wheee [11:00:03] <_joe_> YuviPanda: so, what happened? 
[11:00:27] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:00:40] _joe_: it was the sudo issue... [11:00:45] I ran puppet with sudo and everything is ok [11:00:47] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [11:01:25] <_joe_> godog: can we do something about cassandra dying every 10 minutes? or just restart it? [11:01:39] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:01:45] <_joe_> ouch [11:01:48] <_joe_> two in a row [11:01:58] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [11:03:37] we can make the systemd unit respawn...? [11:04:05] <_joe_> godog: please ack if you're looking into those nodes [11:04:12] _joe_: heh, there's several mitigation attempts going on, but afaict for now restart is the thing to do [11:04:17] yes I am looking [11:04:30] <_joe_> ok I didn't want to overlap :) [11:04:49] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:04:51] <_joe_> godog: the main mitigation would be having a separate cassandra/rb cluster for enwiki [11:04:58] <_joe_> as I suggested early on :) [11:08:18] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:08:35] !log bounce cassandra on restbase1004 and restbase1005 'cannot achieve consistency level quorum' [11:08:39] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:08:39] Logged the message, Master [11:09:08] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.002 second response time on port 9042 [11:10:49] RECOVERY - Cassanda CQL query interface on restbase1005 
is OK: TCP OK - 0.000 second response time on port 9042 [11:12:27] <_joe_> !log mw1153 passed the smoke tests, repooling [11:12:31] Logged the message, Master [11:17:38] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:18:28] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [11:21:29] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:22:19] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042 [11:23:28] !log bounce cassandra on restbase1004, heap space [11:23:33] Logged the message, Master [11:31:46] (03PS3) 10Lokal Profil: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) [11:33:42] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Define the details of the hardware we need to run WDQS - https://phabricator.wikimedia.org/T104879#1437296 (10Joe) Just as a note - labs instances are incredibly slower than production hardware. IOPS... 
[11:44:13] (03PS3) 10Aklapper: add soundcloud to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (https://phabricator.wikimedia.org/T105052) (owner: 10Matanya) [11:52:50] (03PS2) 10Giuseppe Lavagetto: pybal: refactor pybal::pool, print pools in pybal::web [puppet] - 10https://gerrit.wikimedia.org/r/223523 [11:58:21] (03PS2) 10Yuvipanda: beta: include deployment-mediawiki03 in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/223391 (https://phabricator.wikimedia.org/T72181) (owner: 10BryanDavis) [11:58:27] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: include deployment-mediawiki03 in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/223391 (https://phabricator.wikimedia.org/T72181) (owner: 10BryanDavis) [11:58:44] (03PS1) 10Matanya: firewall: add ferm rule for kafka [puppet] - 10https://gerrit.wikimedia.org/r/223534 [12:08:45] (03PS2) 10Muehlenhoff: Enable packet filter for potassium [puppet] - 10https://gerrit.wikimedia.org/r/223282 [12:09:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable packet filter for potassium [puppet] - 10https://gerrit.wikimedia.org/r/223282 (owner: 10Muehlenhoff) [12:11:00] YuviPanda: I'm about to run puppet-merge, I'll merge your deployment-mediawiki03 change along [12:11:12] bah, forgot to hit yes again [12:11:14] thanks moritzm [12:12:16] YuviPanda: ok, done [12:14:50] (03PS2) 10Matanya: firewall: add ferm rule for kafka [puppet] - 10https://gerrit.wikimedia.org/r/223534 [12:17:59] (03CR) 10Hashar: "The related labs/private change is https://gerrit.wikimedia.org/r/223536" [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [12:22:02] 6operations: Ferm rules for parsoid / wtp* hosts - https://phabricator.wikimedia.org/T104966#1437382 (10Matanya) [12:24:52] 6operations, 7Icinga, 5Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1437383 (10Matanya) do we need to monitor http or https is 
enough? [12:27:18] (03PS1) 10Muehlenhoff: Add ferm rules for swift proxies [puppet] - 10https://gerrit.wikimedia.org/r/223537 [12:28:18] PROBLEM - dhclient process on potassium is CRITICAL: Timeout while attempting connection [12:28:28] PROBLEM - RAID on potassium is CRITICAL: Timeout while attempting connection [12:28:29] PROBLEM - SSH on potassium is CRITICAL: Connection timed out [12:30:00] RECOVERY - dhclient process on potassium is OK: PROCS OK: 0 processes with command name dhclient [12:30:28] PROBLEM - poolcounter on potassium is CRITICAL: Timeout while attempting connection [12:30:39] PROBLEM - puppet last run on potassium is CRITICAL: Timeout while attempting connection [12:31:20] (03CR) 10Matanya: Add ferm rules for swift proxies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223537 (owner: 10Muehlenhoff) [12:31:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 42.86% of data above the critical threshold [500.0] [12:33:39] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [12:33:59] RECOVERY - RAID on potassium is OK Active: 2, Working: 2, Failed: 0, Spare: 0 [12:35:10] PROBLEM - HHVM busy threads on mw1233 is CRITICAL 33.33% of data above the critical threshold [115.2] [12:35:40] PROBLEM - HHVM busy threads on mw1134 is CRITICAL 44.44% of data above the critical threshold [86.4] [12:36:00] RECOVERY - SSH on potassium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [12:36:08] RECOVERY - poolcounter on potassium is OK: PROCS OK: 1 process with command name poolcounterd [12:36:18] PROBLEM - HHVM busy threads on mw1148 is CRITICAL 42.86% of data above the critical threshold [86.4] [12:39:02] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1437397 (10DaBPunkt) >>! In T91504#1435161, @BBlack wrote: > DNSSEC and DANE are not things we currently do. The question is: Why? A non-public-system like the OTRS is the ideal enviroment... 
[12:40:08] RECOVERY - puppet last run on potassium is OK Puppet is currently enabled, last run 17 minutes ago with 0 failures [12:41:08] PROBLEM - HHVM busy threads on mw1207 is CRITICAL 42.86% of data above the critical threshold [115.2] [12:41:29] PROBLEM - HHVM busy threads on mw1199 is CRITICAL 33.33% of data above the critical threshold [115.2] [12:42:09] PROBLEM - HHVM busy threads on mw1129 is CRITICAL 37.50% of data above the critical threshold [86.4] [12:43:00] Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. [12:43:13] Error: 503, Service Unavailable at Wed, 08 Jul 2015 12:42:48 GMT # [12:43:29] PROBLEM - HHVM busy threads on mw1146 is CRITICAL 33.33% of data above the critical threshold [86.4] [12:43:39] PROBLEM - HHVM busy threads on mw1142 is CRITICAL 33.33% of data above the critical threshold [86.4] [12:43:39] PROBLEM - HHVM busy threads on mw1224 is CRITICAL 33.33% of data above the critical threshold [115.2] [12:43:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [12:45:04] _joe_: are you aware ? 
^ [12:46:09] PROBLEM - puppet last run on potassium is CRITICAL puppet fail [12:47:18] PROBLEM - HHVM busy threads on mw1146 is CRITICAL 37.50% of data above the critical threshold [86.4] [12:47:28] PROBLEM - HHVM busy threads on mw1258 is CRITICAL 37.50% of data above the critical threshold [115.2] [12:47:58] PROBLEM - HHVM busy threads on mw1204 is CRITICAL 44.44% of data above the critical threshold [115.2] [12:47:59] PROBLEM - HHVM busy threads on mw1148 is CRITICAL 33.33% of data above the critical threshold [86.4] [12:48:16] looking as well, seems from api [12:48:48] PROBLEM - Poolcounter connection on potassium is CRITICAL: Connection timed out [12:49:18] PROBLEM - HHVM busy threads on mw1251 is CRITICAL 33.33% of data above the critical threshold [115.2] [12:49:18] PROBLEM - HHVM busy threads on mw1254 is CRITICAL 42.86% of data above the critical threshold [115.2] [12:49:19] PROBLEM - HHVM busy threads on mw1126 is CRITICAL 33.33% of data above the critical threshold [86.4] [12:49:48] it's the potassium changeset [12:50:09] I've stopped ferm on potassium and will revert the change [12:50:28] PROBLEM - HHVM busy threads on mw1117 is CRITICAL 33.33% of data above the critical threshold [86.4] [12:50:34] poolcounter lost it ? 
[12:50:35] it's not it [12:50:39] PROBLEM - salt-minion processes on potassium is CRITICAL: Timeout while attempting connection [12:50:41] it's nf_conntrack being saturated [12:50:48] PROBLEM - HHVM busy threads on mw1233 is CRITICAL 44.44% of data above the critical threshold [115.2] [12:50:58] PROBLEM - HHVM busy threads on mw1221 is CRITICAL 50.00% of data above the critical threshold [115.2] [12:51:09] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:51:09] PROBLEM - HHVM busy threads on mw1199 is CRITICAL 33.33% of data above the critical threshold [115.2] [12:51:18] PROBLEM - HHVM busy threads on mw1208 is CRITICAL 42.86% of data above the critical threshold [115.2] [12:52:07] !log rmmod all iptables/netfilter-related modules from potassium [12:52:11] Logged the message, Master [12:52:20] RECOVERY - Poolcounter connection on potassium is OK: TCP OK - 0.000 second response time on port 7531 [12:52:29] RECOVERY - salt-minion processes on potassium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:52:39] PROBLEM - HHVM busy threads on mw1207 is CRITICAL 37.50% of data above the critical threshold [115.2] [12:52:52] (03PS1) 10Muehlenhoff: deactive packet filter on potassium [puppet] - 10https://gerrit.wikimedia.org/r/223538 [12:53:46] (03CR) 10Matanya: [C: 04-1] "better to set a rule such as:" [puppet] - 10https://gerrit.wikimedia.org/r/223538 (owner: 10Muehlenhoff) [12:53:53] (03PS2) 10Faidon Liambotis: Deactivate packet filter on potassium [puppet] - 10https://gerrit.wikimedia.org/r/223538 (owner: 10Muehlenhoff) [12:54:11] (03CR) 10Faidon Liambotis: [C: 032] Deactivate packet filter on potassium [puppet] - 10https://gerrit.wikimedia.org/r/223538 (owner: 10Muehlenhoff) [12:55:43] matanya: you are correct, but revert first, test and fix later [12:55:51] ok [12:56:08] yeah, I'll merge this [12:56:34] already did :) [12:56:48] ok :-) [12:56:50] RECOVERY - HHVM busy 
threads on mw1251 is OK Less than 30.00% above the threshold [76.8] [12:56:50] RECOVERY - HHVM busy threads on mw1254 is OK Less than 30.00% above the threshold [76.8] [12:57:28] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1437478 (10Cmjohnson) I’ve run into some problems with the 10g NIC. I initially installed on a R420 and ran into the below issues and then changed to a spare R610 that is an exact match for labnet1001. I a... [12:57:38] PROBLEM - puppet last run on potassium is CRITICAL puppet fail [12:57:38] RECOVERY - HHVM busy threads on mw1129 is OK Less than 30.00% above the threshold [57.6] [12:58:27] (03PS1) 10Matanya: poolcounter: don't track connections on the firewall [puppet] - 10https://gerrit.wikimedia.org/r/223540 [12:58:28] RECOVERY - HHVM busy threads on mw1233 is OK Less than 30.00% above the threshold [76.8] [12:58:32] paravoid: moritzm ^^ [12:58:36] !log manually dpkg -P ferm on potassium [12:58:39] RECOVERY - HHVM busy threads on mw1221 is OK Less than 30.00% above the threshold [76.8] [12:58:40] Logged the message, Master [12:58:58] RECOVERY - HHVM busy threads on mw1224 is OK Less than 30.00% above the threshold [76.8] [12:59:29] RECOVERY - HHVM busy threads on mw1204 is OK Less than 30.00% above the threshold [76.8] [12:59:39] RECOVERY - HHVM busy threads on mw1148 is OK Less than 30.00% above the threshold [57.6] [13:00:19] RECOVERY - HHVM busy threads on mw1117 is OK Less than 30.00% above the threshold [57.6] [13:00:29] RECOVERY - HHVM busy threads on mw1207 is OK Less than 30.00% above the threshold [76.8] [13:00:49] RECOVERY - HHVM busy threads on mw1146 is OK Less than 30.00% above the threshold [57.6] [13:00:49] RECOVERY - HHVM busy threads on mw1199 is OK Less than 30.00% above the threshold [76.8] [13:00:58] RECOVERY - HHVM busy threads on mw1134 is OK Less than 30.00% above the threshold [57.6] [13:00:59] RECOVERY - HHVM busy threads on mw1208 is OK Less than 30.00% 
above the threshold [76.8] [13:00:59] RECOVERY - HHVM busy threads on mw1258 is OK Less than 30.00% above the threshold [76.8] [13:00:59] RECOVERY - HHVM busy threads on mw1126 is OK Less than 30.00% above the threshold [57.6] [13:00:59] RECOVERY - HHVM busy threads on mw1142 is OK Less than 30.00% above the threshold [57.6] [13:01:38] RECOVERY - puppet last run on potassium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:08:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:19:22] 6operations: Icinga check to detect saturation of nf_conntrack - https://phabricator.wikimedia.org/T105154#1437507 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [13:20:30] (03PS1) 10Hashar: nodepool: add guest disk image utilities [puppet] - 10https://gerrit.wikimedia.org/r/223543 [13:21:28] (03CR) 10Muehlenhoff: [C: 04-1] "swift also experiences many connections, connection needs to be dealt with." [puppet] - 10https://gerrit.wikimedia.org/r/223537 (owner: 10Muehlenhoff) [13:21:54] 6operations: Icinga check to detect saturation of nf_conntrack - https://phabricator.wikimedia.org/T105154#1437519 (10yuvipanda) We have a diamond collector (modules/diamond/files/collector/conntrack.py) and associated check that we used for labnet1001 earlier. [13:22:59] 6operations: Icinga check to detect saturation of nf_conntrack - https://phabricator.wikimedia.org/T105154#1437521 (10fgiunchedi) FWIW there's something similar not via icinga but via graphite for labs in nova.pp ``` monitoring::graphite_threshold { 'conntrack_saturated': description => 'Connection... [13:23:06] YuviPanda: haha [13:23:19] hahaha :P [13:23:22] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1437522 (10BBlack) >>! In T91504#1437397, @DaBPunkt wrote: >>>! In T91504#1435161, @BBlack wrote: >> DNSSEC and DANE are not things we currently do. > > The question is: Why? 
A non-public-... [13:23:39] great minds think alike [13:23:44] :D [13:25:47] (03PS1) 10Yuvipanda: ores: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/223546 [13:26:00] (03PS2) 10Yuvipanda: ores: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/223546 [13:26:06] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/223546 (owner: 10Yuvipanda) [13:26:39] (03CR) 10Hashar: "I have manually installed it on labnodepool1001.eqiad.Wmnet and it offers a bunch of very useful utilities such as guestmount, which let y" [puppet] - 10https://gerrit.wikimedia.org/r/223543 (owner: 10Hashar) [13:33:39] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /home/hashar/mount is not accessible: Permission denied [13:34:19] ... [13:34:43] /dev/fuse 1.5G 854M 529M 62% /home/hashar/mount [13:34:50] there is plenty of disk space!!!! [13:36:46] !log springle Synchronized wmf-config/db-eqiad.php: raise db1041 load (duration: 00m 13s) [13:36:50] Logged the message, Master [13:38:53] (03CR) 10Faidon Liambotis: [C: 04-1] add secret parser func (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [13:39:25] (03CR) 10Faidon Liambotis: add secret parser func (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [13:40:00] (03PS2) 10Hashar: nodepool: add guest disk image utilities [puppet] - 10https://gerrit.wikimedia.org/r/223543 [13:41:17] (03CR) 10Faidon Liambotis: [C: 04-1] sslcert: refactor std_cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223492 (owner: 10BBlack) [13:43:49] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1437558 (10hashar) [13:44:33] (03CR) 10BBlack: add secret parser func (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [13:45:12] 6operations, 
5Continuous-Integration-Isolation, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1433508 (10hashar) Added another use case triggered by `guestmount -a image.qcow2 /home/hashar/mount` which produces a /d... [13:47:03] (03CR) 10BBlack: sslcert: refactor std_cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223492 (owner: 10BBlack) [13:48:39] (03CR) 10Faidon Liambotis: add secret parser func (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [13:49:35] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 3 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1437568 (10coren) [13:49:58] (03CR) 10BBlack: add secret parser func (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [13:50:46] 6operations, 7Icinga: Icinga check to detect saturation of nf_conntrack - https://phabricator.wikimedia.org/T105154#1437571 (10Krenair) [13:52:46] (03PS1) 10Filippo Giunchedi: update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/223550 [13:52:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/223550 (owner: 10Filippo Giunchedi) [13:57:58] (03PS4) 10Filippo Giunchedi: cassandra: alternative metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) [14:05:42] (03PS5) 10Filippo Giunchedi: cassandra: alternative metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) [14:07:49] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) (owner: 10Filippo Giunchedi) [14:10:20] (03PS6) 10Filippo Giunchedi: cassandra: alternative metrics collector 
[puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) [14:14:59] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:17:12] (03CR) 10Alexandros Kosiaris: [C: 031] "Comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) (owner: 10Filippo Giunchedi) [14:19:24] (03CR) 10Alex Monk: "Steinsplitter - did you determine that this would work OK? what about https://phabricator.wikimedia.org/T96556#1230448 ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (https://phabricator.wikimedia.org/T105052) (owner: 10Matanya) [14:19:42] (03PS1) 10BBlack: test secret() again on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/223553 [14:20:11] (03CR) 10BBlack: [C: 032 V: 032] test secret() again on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/223553 (owner: 10BBlack) [14:22:30] (03CR) 10Steinsplitter: "i (personally) see no problem at all, especially because the uploadbyurl function can only be used by experienced users. If the files are " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (https://phabricator.wikimedia.org/T105052) (owner: 10Matanya) [14:29:02] (03CR) 10Alex Monk: "But what about @85jesse's note on T96556 around the 23rd of April?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (https://phabricator.wikimedia.org/T105052) (owner: 10Matanya) [14:39:07] bblack, mutante: are you guys able to get stats on the usage of noc.wikimedia.org/conf ? 
[14:41:42] (03CR) 10Eevans: [C: 031] cassandra: alternative metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) (owner: 10Filippo Giunchedi) [14:47:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: alternative metrics collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) (owner: 10Filippo Giunchedi) [14:47:18] (03PS7) 10Filippo Giunchedi: cassandra: alternative metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) [14:47:26] (03CR) 10Filippo Giunchedi: [V: 032] cassandra: alternative metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) (owner: 10Filippo Giunchedi) [14:49:27] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [14:50:15] (03CR) 10Dzahn: [C: 032] added year into logging, made pep8 and pyflakes happy [debs/adminbot] - 10https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: 10Elee) [14:50:48] godog, urandom: are you aware of jmxtrans that we already use elsewhere? [14:50:55] (and is puppetized and everything) [14:53:16] (03PS2) 10BBlack: add secret parser func [puppet] - 10https://gerrit.wikimedia.org/r/223494 [14:53:18] paravoid: indeed, not sure how much would it be to match the metrics exactly to what we already have [14:53:18] (03PS3) 10BBlack: sslcert: refactor std_cert [puppet] - 10https://gerrit.wikimedia.org/r/223492 [14:53:43] https://gerrit.wikimedia.org/r/#/c/223382/1 - while trying to cherrypick to wmf12, Gerrit says, Could not create a merge commit during the cherry pick [14:53:57] Krenair: ostriches ^^ any idea? 
[14:54:09] paravoid: nope [14:54:44] (03PS1) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [14:55:02] what cherry-pick kart_? is that the right link? [14:55:24] oh I see [14:55:25] (03CR) 10jenkins-bot: [V: 04-1] monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 (owner: 10Matanya) [14:55:29] (03PS3) 10BBlack: Add parser function secret() to get secret data [puppet] - 10https://gerrit.wikimedia.org/r/223494 [14:55:31] (03PS4) 10BBlack: sslcert: refactor std_cert [puppet] - 10https://gerrit.wikimedia.org/r/223492 [14:55:33] you can't actually run the cherry-pick from there [14:55:34] Krenair: Try to cherry-pick 223382 -> wmf12 [14:55:44] try doing it manually [14:55:52] jouncebot, next [14:55:53] In 0 hour(s) and 4 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150708T1500) [14:56:02] (03CR) 10BBlack: "(This version manually tested on the real puppetmaster - functions correctly!)" [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [14:56:09] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [14:56:10] Krenair: I always used Gerrit :) [14:56:50] urandom: sorry, I guess I was too late with that comment :) [14:56:54] (03PS2) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [14:57:59] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.532 second response time [14:58:51] kart_, yeah, if you cherry-pick this locally there's a merge conflict [14:58:54] so no wonder gerrit couldn't do it [14:59:15] Krenair: oops. 
[14:59:19] PROBLEM - puppet last run on restbase1005 is CRITICAL Puppet has 1 failures [15:00:00] kart_, there is no extension.json on this branch [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150708T1500). Please do the needful. [15:00:08] expect some restbase puppet failures, related to the cassandra metrics [15:00:39] PROBLEM - puppet last run on restbase1001 is CRITICAL Puppet has 1 failures [15:01:09] PROBLEM - puppet last run on restbase1002 is CRITICAL Puppet has 1 failures [15:01:58] kart_, so I'm not sure this is something that needs backporting? [15:02:39] PROBLEM - puppet last run on restbase1003 is CRITICAL Puppet has 1 failures [15:03:29] RECOVERY - puppet last run on restbase1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:33] Krenair: Oops. [15:04:05] (I'm on too slow connection now, getting all msgs late) [15:05:15] Krenair: thanks. it was wmf13 correctly. 
[15:05:33] (cherry-pick is okay there) [15:05:43] that makes more sense :) [15:08:29] RECOVERY - puppet last run on restbase1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:08:37] (03PS2) 10Manybubbles: install_server: switch to elasticsearch 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/223251 (https://phabricator.wikimedia.org/T102008) (owner: 10Filippo Giunchedi) [15:08:52] (03CR) 10Manybubbles: [C: 031] install_server: switch to elasticsearch 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/223251 (https://phabricator.wikimedia.org/T102008) (owner: 10Filippo Giunchedi) [15:08:56] (03PS1) 10Yuvipanda: [WIP] labstore: Rewrite of replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/223564 [15:09:03] Coren: jynus ^ WIP patch [15:09:42] will take a look [15:11:49] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [15:13:13] Coren the idea now is to do: T104476#1433541, but feel free to disagree [15:13:21] "Currently the file is regenerated [15:13:21] but not sure what the use case is / why we support it" <-- because users are known to have accidentally deleted it in the past. Often. [15:13:40] well.... [15:13:50] :) [15:13:59] Coren: in that case, I think it's ok for us to give them a new password. [15:14:16] YuviPanda: And just redo the grants? [15:14:27] Coren: so the current replica-addusers depends on /var/cache (I think?) for credentials and that's empty since that was on labstore1001 and hence wiped. [15:14:33] Coren: yes. [15:14:49] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [15:15:24] uh, should I restart apache on puppetmaster before we get a flood of fails? [15:15:48] YuviPanda: Right, in practice that only means we lost the backups of the credentials, which is annoying but not catastrophic (since it defaults to "make a new password"). 
The only issue with it is users who copy the credentials away from the replica.my.cnf into their own config files; if the cnf files goes away and we regenerate the password theirs no longer work. But I suppose that's enough of [15:15:56] an edge case that we don't really care. [15:16:24] yes, and it's ok - they shoot themselves in the foot, we can provide bandaid but that does not change the fact that they did shoot themselves in the foot... [15:16:29] RECOVERY - puppet last run on restbase1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:47] Coren: but I think that is ok to leave to the second iteration. [15:17:38] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 5.156 second response time [15:19:43] (03CR) 10Dzahn: "now we need to build the .deb: https://phabricator.wikimedia.org/T105169" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: 10Elee) [15:20:09] RECOVERY - puppet last run on restbase1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:10] PROBLEM - puppet last run on cp1060 is CRITICAL Puppet has 2 failures [15:20:39] PROBLEM - puppet last run on mw2156 is CRITICAL puppet fail [15:20:40] (03PS2) 10Yuvipanda: [WIP] labstore: Rewrite of replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/223564 [15:20:49] PROBLEM - puppet last run on wtp1010 is CRITICAL puppet fail [15:21:10] PROBLEM - puppet last run on cp4002 is CRITICAL Puppet has 2 failures [15:21:18] PROBLEM - puppet last run on elastic1023 is CRITICAL puppet fail [15:21:30] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures [15:22:18] PROBLEM - puppet last run on ms-be2004 is CRITICAL Puppet has 1 failures [15:22:29] 6operations: list xfp/sfp+ inventory @ codfw - https://phabricator.wikimedia.org/T105170#1437873 (10RobH) 3NEW a:3Papaul [15:23:00] 6operations: list xfp/sfp+ inventory @ 
codfw - https://phabricator.wikimedia.org/T105170#1437889 (10RobH) [15:23:19] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 2 failures [15:23:25] 6operations: list xfp/sfp+ inventory @ codfw - https://phabricator.wikimedia.org/T105170#1437873 (10RobH) You can resolve this task once its updated with the data, as its linked to the overall tracking task for preparing these shipments/sites. [15:23:28] PROBLEM - puppet last run on ms-fe1001 is CRITICAL Puppet has 1 failures [15:23:28] PROBLEM - puppet last run on mc1012 is CRITICAL Puppet has 3 failures [15:23:40] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 3 failures [15:23:58] PROBLEM - puppet last run on labnet1001 is CRITICAL Puppet has 1 failures [15:23:59] PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 1 failures [15:24:08] PROBLEM - puppet last run on db1023 is CRITICAL Puppet has 1 failures [15:24:28] PROBLEM - puppet last run on db1028 is CRITICAL Puppet has 1 failures [15:24:38] PROBLEM - puppet last run on ms-be3002 is CRITICAL Puppet has 3 failures [15:24:38] PROBLEM - puppet last run on virt1004 is CRITICAL Puppet has 1 failures [15:24:48] PROBLEM - puppet last run on db1042 is CRITICAL Puppet has 1 failures [15:24:49] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 1 failures [15:24:59] PROBLEM - puppet last run on db2016 is CRITICAL Puppet has 1 failures [15:25:09] PROBLEM - puppet last run on mc1014 is CRITICAL Puppet has 1 failures [15:25:10] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures [15:25:10] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 4 failures [15:25:19] PROBLEM - puppet last run on mc1001 is CRITICAL Puppet has 1 failures [15:25:19] PROBLEM - puppet last run on ms-be1008 is CRITICAL Puppet has 5 failures [15:25:20] !log handing over adminship of the "test" mailman list to John F. Lewis (was: Thehelpfulone) due to inactivity [15:25:24] Logged the message, Master [15:25:25] ^ puppetmaster? 
[15:25:29] PROBLEM - puppet last run on bast4001 is CRITICAL Puppet has 1 failures [15:25:30] PROBLEM - puppet last run on mc1005 is CRITICAL Puppet has 1 failures [15:25:39] PROBLEM - puppet last run on ms-be1007 is CRITICAL Puppet has 2 failures [15:25:40] PROBLEM - puppet last run on db1001 is CRITICAL Puppet has 1 failures [15:25:48] PROBLEM - puppet last run on iodine is CRITICAL Puppet has 1 failures [15:25:49] PROBLEM - puppet last run on ms-be2008 is CRITICAL Puppet has 1 failures [15:25:59] PROBLEM - puppet last run on ms-be1010 is CRITICAL Puppet has 1 failures [15:26:00] PROBLEM - puppet last run on ms-be1011 is CRITICAL Puppet has 1 failures [15:26:09] PROBLEM - puppet last run on ms-be2005 is CRITICAL Puppet has 2 failures [15:26:19] PROBLEM - puppet last run on ms-be2001 is CRITICAL Puppet has 1 failures [15:26:20] PROBLEM - puppet last run on ms-fe2002 is CRITICAL Puppet has 1 failures [15:26:27] 6operations, 7Easy, 5Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1437893 (10JanZerebecki) [15:26:29] PROBLEM - puppet last run on ms-fe3002 is CRITICAL Puppet has 2 failures [15:26:29] PROBLEM - puppet last run on ms-be1009 is CRITICAL Puppet has 2 failures [15:26:29] PROBLEM - puppet last run on gallium is CRITICAL Puppet has 2 failures [15:26:29] PROBLEM - puppet last run on ms-be1014 is CRITICAL Puppet has 1 failures [15:26:30] PROBLEM - puppet last run on wtp2012 is CRITICAL Puppet has 1 failures [15:26:30] PROBLEM - puppet last run on ms-be2012 is CRITICAL Puppet has 2 failures [15:26:30] PROBLEM - puppet last run on ms-be2007 is CRITICAL Puppet has 2 failures [15:26:30] PROBLEM - puppet last run on ms-be3001 is CRITICAL Puppet has 1 failures [15:26:38] PROBLEM - puppet last run on db1004 is CRITICAL Puppet has 1 failures [15:26:39] PROBLEM - puppet last run on db1038 is CRITICAL Puppet has 1 failures [15:26:49] PROBLEM - puppet last run on mc1015 is CRITICAL Puppet has 1 failures 
[15:27:09] PROBLEM - puppet last run on titanium is CRITICAL Puppet has 1 failures [15:27:09] PROBLEM - puppet last run on ms-fe3001 is CRITICAL Puppet has 1 failures [15:27:09] PROBLEM - puppet last run on nitrogen is CRITICAL Puppet has 1 failures [15:27:19] PROBLEM - puppet last run on ms-be1015 is CRITICAL Puppet has 1 failures [15:27:19] PROBLEM - puppet last run on ms-be1002 is CRITICAL Puppet has 3 failures [15:27:19] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:27:29] PROBLEM - puppet last run on tin is CRITICAL puppet fail [15:27:38] PROBLEM - puppet last run on db2066 is CRITICAL Puppet has 1 failures [15:27:39] PROBLEM - puppet last run on ms-be2009 is CRITICAL Puppet has 1 failures [15:27:40] PROBLEM - puppet last run on zirconium is CRITICAL Puppet has 1 failures [15:27:49] PROBLEM - puppet last run on ms-be2010 is CRITICAL Puppet has 2 failures [15:27:49] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:27:58] RECOVERY - puppet last run on db1023 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:27:59] PROBLEM - puppet last run on lanthanum is CRITICAL Puppet has 1 failures [15:28:08] RECOVERY - puppet last run on ms-be2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:19] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:28:29] RECOVERY - puppet last run on wtp2012 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:28:29] PROBLEM - puppet last run on mw1082 is CRITICAL Puppet has 1 failures [15:28:29] RECOVERY - puppet last run on virt1004 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:28:39] RECOVERY - puppet last run on db1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:49] PROBLEM - puppet 
last run on mw2082 is CRITICAL Puppet has 1 failures [15:29:00] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:09] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:18] RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:29:19] RECOVERY - puppet last run on mc1005 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:29:29] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:29:29] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:39] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:59] PROBLEM - puppet last run on mw2145 is CRITICAL Puppet has 1 failures [15:30:09] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures [15:30:09] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:30:19] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:30:19] RECOVERY - puppet last run on ms-be1009 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:30:20] RECOVERY - puppet last run on gallium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:28] PROBLEM - puppet last run on mw2109 is CRITICAL Puppet has 1 failures [15:30:30] RECOVERY - puppet last run on ms-be2012 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:30:30] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures [15:30:30] RECOVERY - puppet last run on ms-be3002 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures 
[15:30:30] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:30:30] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:30:30] RECOVERY - puppet last run on db1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:30] PROBLEM - puppet last run on mw1034 is CRITICAL Puppet has 1 failures [15:30:30] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures [15:30:49] RECOVERY - puppet last run on db2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:49] PROBLEM - puppet last run on mw2191 is CRITICAL Puppet has 1 failures [15:30:49] PROBLEM - puppet last run on mw2048 is CRITICAL Puppet has 1 failures [15:30:50] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [15:30:50] RECOVERY - puppet last run on mc1014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:59] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 1 failures [15:31:08] PROBLEM - puppet last run on mw1039 is CRITICAL Puppet has 1 failures [15:31:09] RECOVERY - puppet last run on mc1001 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:31:09] RECOVERY - puppet last run on ms-be1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:09] RECOVERY - puppet last run on bast4001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:19] PROBLEM - puppet last run on mw1198 is CRITICAL Puppet has 1 failures [15:31:20] RECOVERY - puppet last run on db2066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:28] PROBLEM - puppet last run on mw1229 is CRITICAL Puppet has 1 failures [15:31:29] RECOVERY - puppet last run on ms-be1007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:39] PROBLEM - puppet last run on 
mw2091 is CRITICAL Puppet has 1 failures [15:31:40] RECOVERY - puppet last run on ms-be2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:49] RECOVERY - puppet last run on cp1060 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:31:49] PROBLEM - puppet last run on mw2166 is CRITICAL Puppet has 1 failures [15:31:49] PROBLEM - puppet last run on mw1210 is CRITICAL Puppet has 1 failures [15:31:50] RECOVERY - puppet last run on ms-be2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:59] PROBLEM - puppet last run on mw2159 is CRITICAL Puppet has 1 failures [15:31:59] PROBLEM - puppet last run on neon is CRITICAL Puppet has 1 failures [15:32:19] RECOVERY - puppet last run on ms-be2007 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:32:20] RECOVERY - puppet last run on mw1034 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:32:28] PROBLEM - puppet last run on mw1225 is CRITICAL Puppet has 1 failures [15:32:30] paravoid: jmxtrans looks interesting, i wish i had known about it, but i wonder if it would have changed anything [15:32:48] RECOVERY - puppet last run on mw2191 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:48] RECOVERY - puppet last run on mw2048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:49] RECOVERY - puppet last run on titanium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:49] PROBLEM - puppet last run on mw2059 is CRITICAL Puppet has 1 failures [15:32:49] RECOVERY - puppet last run on elastic1023 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:32:49] RECOVERY - puppet last run on nitrogen is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:32:49] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago 
with 0 failures [15:33:09] urandom: sorry! [15:33:14] paravoid: we were looking at diamond + jolokia, too, but the current metrics are pretty specific to the dropwizard graphite reporter [15:33:19] RECOVERY - puppet last run on mw1198 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:19] RECOVERY - puppet last run on mw1229 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:33:20] RECOVERY - puppet last run on db1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:29] RECOVERY - puppet last run on iodine is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:33:37] so it would have meant redoing all graphs and thresholds, and we just put a bunch of work into them [15:33:40] RECOVERY - puppet last run on ms-be1011 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:33:48] RECOVERY - puppet last run on mw1210 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:34:00] RECOVERY - puppet last run on ms-fe2002 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:34:18] this thing godog is working to deploy is a couple hours work, and is compatible with the existing metrics [15:34:19] RECOVERY - puppet last run on mw1225 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:34:19] RECOVERY - puppet last run on db1038 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:34:28] RECOVERY - puppet last run on wtp1010 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:34:33] kart_, do you want that backport deployed? 
[15:34:36] urandom: wfm :) [15:34:36] paravoid: so even if it's just a short-term fix, it's probably worthwhile [15:34:48] RECOVERY - puppet last run on cp4002 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:34:48] RECOVERY - puppet last run on ms-fe3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:35:00] RECOVERY - puppet last run on ms-be1002 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:35:28] RECOVERY - puppet last run on mw2091 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:35:59] RECOVERY - puppet last run on mw2159 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:09] RECOVERY - puppet last run on mw2156 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:36:19] RECOVERY - puppet last run on mc1015 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:36:19] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:36:29] Does dzahn live on IRC? [15:36:32] nope it looks like. 
[15:36:58] RECOVERY - puppet last run on ms-be1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:59] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:37:13] elee: dzahn is "mutante" on irc [15:37:16] (PS3) Merlijn van Deen: Add url to adminlogbot output [debs/adminbot] - https://gerrit.wikimedia.org/r/180890 [15:37:30] RECOVERY - puppet last run on ms-be1010 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:38:00] RECOVERY - puppet last run on ms-be1014 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:38:10] elee: he does [15:38:18] kek [15:38:21] mutante: <3 [15:38:27] though he is away right now [15:38:35] roger [15:38:44] I'm considering taking on https://phabricator.wikimedia.org/T105169 [15:38:50] but don't know my head from my arse with this [15:39:09] RECOVERY - puppet last run on ms-be2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:12] elee: probably best to focus elsewhere :) [15:39:19] RECOVERY - puppet last run on ms-be2010 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:39:31] RECOVERY - puppet last run on lanthanum is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:18] RECOVERY - puppet last run on zirconium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:49] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:45:58] RECOVERY - puppet last run on mw1082 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:46:46] (PS3) Yuvipanda: [WIP] labstore: Rewrite of replica-addusers.pl [puppet] - https://gerrit.wikimedia.org/r/223564 [15:47:29] RECOVERY - puppet last run on mw2145 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:47:50] RECOVERY - puppet 
last run on mw2109 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:47:50] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:48:19] RECOVERY - puppet last run on mw2082 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:48:28] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:49:31] operations: list xfp/sfp+ inventory @ codfw - https://phabricator.wikimedia.org/T105170#1437947 (RobH) Actually, please update the google doc we made for spare tracking. It has been shared with you in Google Drive, under shared with me, named WMF Datacenter - On-site Spares [15:49:58] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:19] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:20] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:50:28] RECOVERY - puppet last run on mw1039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:18] RECOVERY - puppet last run on mw2166 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:29] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:40] (PS3) Giuseppe Lavagetto: pybal: refactor pybal::pool, print pools in pybal::web [puppet] - https://gerrit.wikimedia.org/r/223523 [15:51:48] (PS1) Filippo Giunchedi: cassandra: use lock file with flock [puppet] - https://gerrit.wikimedia.org/r/223570 (https://phabricator.wikimedia.org/T104208) [15:52:29] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: use lock file with flock [puppet] - https://gerrit.wikimedia.org/r/223570 (https://phabricator.wikimedia.org/T104208) (owner: Filippo 
Giunchedi) [15:58:16] operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1437997 (Aklapper) [16:06:01] (PS4) Yuvipanda: [WIP] labstore: Rewrite of replica-addusers.pl [puppet] - https://gerrit.wikimedia.org/r/223564 [16:07:14] operations, RESTBase-Cassandra, Services, Patch-For-Review, RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1438036 (fgiunchedi) [16:07:17] operations, RESTBase-Cassandra, Services, Patch-For-Review, RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1438034 (fgiunchedi) Open>Resolved this is complete, metrics are being pushed each minute from cron [16:07:19] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [16:08:02] _joe_: ping? [16:09:06] (CR) Filippo Giunchedi: [C: 2 V: 2] install_server: switch to elasticsearch 1.6 [puppet] - https://gerrit.wikimedia.org/r/223251 (https://phabricator.wikimedia.org/T102008) (owner: Filippo Giunchedi) [16:09:46] (PS3) Filippo Giunchedi: install_server: switch to elasticsearch 1.6 [puppet] - https://gerrit.wikimedia.org/r/223251 (https://phabricator.wikimedia.org/T102008) [16:10:14] (CR) Filippo Giunchedi: [V: 2] install_server: switch to elasticsearch 1.6 [puppet] - https://gerrit.wikimedia.org/r/223251 (https://phabricator.wikimedia.org/T102008) (owner: Filippo Giunchedi) [16:10:19] interesting, that showed up as [C: 2 V: 2] but gerrit barfed on needing a rebase [16:11:19] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:57] operations: list xfp/sfp+ inventory @ codfw - https://phabricator.wikimedia.org/T105170#1438051 (Papaul) The google doc, update with SEP+ information. I have to check all the old items that came from Tampa to see how many XFP I have on site. it was easy to get the SEP+ information because we order 50 in 2014... 
[16:12:01] <_joe_> SMalyshev: oh man sorry, I didn't get notified for some reason [16:12:05] <_joe_> and I have 2 alarms [16:15:22] operations, Discovery-Cirrus-Sprint, Patch-For-Review: Import Elasticsearch 1.6.0 deb into wmf apt - https://phabricator.wikimedia.org/T102008#1438062 (fgiunchedi) Open>Resolved you should be all set! ``` root@carbon:~# reprepro --noskipold checkupdate aptmethod 'http' seems to have a obsoleted r... [16:16:00] (PS1) Mforns: Add flag --all-projects to projectviews aggregator [puppet] - https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) [16:17:10] (PS1) Dzahn: up version to 1.7.7 - add year to logs [debs/adminbot] - https://gerrit.wikimedia.org/r/223575 (https://phabricator.wikimedia.org/T85803) [16:18:21] (CR) Dzahn: [C: 2] up version to 1.7.7 - add year to logs [debs/adminbot] - https://gerrit.wikimedia.org/r/223575 (https://phabricator.wikimedia.org/T85803) (owner: Dzahn) [16:18:23] (Merged) jenkins-bot: up version to 1.7.7 - add year to logs [debs/adminbot] - https://gerrit.wikimedia.org/r/223575 (https://phabricator.wikimedia.org/T85803) (owner: Dzahn) [16:25:23] Ops-Access-Requests, operations, Discovery, Wikidata, Wikidata-Query-Service: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1438123 (Smalyshev) NEW [16:25:51] Ops-Access-Requests, operations, Discovery, Wikidata, Wikidata-Query-Service: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1438134 (Joe) a:Joe [16:26:19] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /home/hashar/mount is not accessible: Permission denied [16:26:37] you can ignore that labnodepool error [16:26:42] there is a task filled about it already [16:27:15] (PS20) Smalyshev: Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/216403 [16:27:30] godog: C:2V:2 is technically separate from 
the actual submit-merge itself. If you (commonly) try to do all at once, the C/V review marks go through, but the submit-merge fails at the end :) [16:27:40] Ops-Access-Requests, operations, Discovery, Wikidata, Wikidata-Query-Service: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1438139 (Joe) [16:27:56] (CR) jenkins-bot: [V: -1] Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/216403 (owner: Smalyshev) [16:29:25] bblack: indeed! seems notifying on submit-merge instead of c:2 v:2 would be useful [16:30:18] (CR) Eevans: "The rule of thumb here is 8 * num_cores, so upping this is a Good Idea(tm)." [puppet] - https://gerrit.wikimedia.org/r/223454 (owner: GWicke) [16:34:49] (Abandoned) GWicke: Increase the write request timeout to 5s [puppet] - https://gerrit.wikimedia.org/r/223496 (owner: GWicke) [16:36:27] (CR) Eevans: [C: 1] "Conventional wisdom is 16 * num_drives, (48), but many people continue to recommend 4 * num_cores (128), and I think everyone agrees that " [puppet] - https://gerrit.wikimedia.org/r/223495 (owner: GWicke) [16:42:59] RECOVERY - Host labnet1002 is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [16:45:58] (PS1) Dzahn: up to 1.7.8 for trusty rebuild [debs/adminbot] - https://gerrit.wikimedia.org/r/223576 [16:46:33] mutante: around? [16:50:18] elee: ping received. please leave a message [16:50:26] hah - figured you're busy. [16:50:44] mutante: I thought precise was EOLed? [16:51:08] (why is building for both trusty and precise necessary?)_ [16:51:25] somebody built 1.7.6 for trusty but did not update the repo [16:51:51] ah okay clued in, thanks. [16:51:54] tools-bastion was upgraded [16:52:01] i dont know why it still cant find it [16:52:38] it's a cycle.. 
i bet others worked around it because it was a pita [16:52:42] which now causes even more [16:54:48] operations, HTTPS: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1438171 (RobH) NEW a:RobH [16:55:03] operations, HTTPS, LDAP: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1438171 (RobH) [16:55:27] operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1438181 (RobH) [17:00:18] PROBLEM - puppet last run on mw2137 is CRITICAL puppet fail [17:02:40] elee: precise EOLed? Lucid was only fairly recently EOLed. Mix up perhaps? [17:03:15] precise is indeed not EOLed [17:03:22] (PS2) Dzahn: up to 1.7.8 for trusty rebuild [debs/adminbot] - https://gerrit.wikimedia.org/r/223576 (https://phabricator.wikimedia.org/T105169) [17:03:26] and we have ~150 servers running precise still [17:03:31] (CR) Dzahn: [C: 2] up to 1.7.8 for trusty rebuild [debs/adminbot] - https://gerrit.wikimedia.org/r/223576 (https://phabricator.wikimedia.org/T105169) (owner: Dzahn) [17:03:42] paravoid: and one running Lucid ;) [17:03:44] oh wait right, sorry I'm confusing some other stuff I'm doing [17:03:44] =p [17:05:23] (PS1) Smalyshev: add WDQS deployment repo [puppet] - https://gerrit.wikimedia.org/r/223580 [17:06:08] operations, HTTPS, LDAP: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1438210 (RobH) a:RobH>Andrew From what I can see, there is no issuer within the certificate. When you run "openssl x509 -in -noout -text" against a certificat... 
[17:06:15] <_joe_> SMalyshev: put the bug into the commit messages so that I can easily find all the related patches [17:08:28] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [17:09:19] !log bounced cassandra on restbase1004 [17:09:23] Logged the message, Master [17:09:56] operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1438227 (RobH) The ldap-mirror.wikimedia.org is not a RapidSSL certificate; so I've asked Jeff on the fr-certs sub-ticket to confirm he doesn't need the copies in our main repo. T104378 Additionally, sub-ta... [17:11:59] PROBLEM - RAID on ms-be2013 is CRITICAL 1 failed LD(s) (Offline) [17:12:18] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042 [17:13:38] PROBLEM - Disk space on ms-be2013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error [17:16:35] !log installed libwmf security updates on various systems [17:16:39] Logged the message, Master [17:17:40] RECOVERY - puppet last run on mw2137 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:19:09] (PS1) Elee: (WIP) this should work... [puppet] - https://gerrit.wikimedia.org/r/223581 [17:19:30] I have downtime'd temporarelly s5 lag on dbstore1002- there is a set of conditions that make it more likely during peak time today; do not worry if you see that alarm off [17:20:08] I left already a note for sean [17:20:38] mutante: okay so -strict isn't complaining about anything [17:20:44] (related to this that is) [17:21:33] so this appears to be good - someone should sanity check though [17:22:02] (also I'm not familiar with the entire workflow yet so if that should've been a patch with yours let me know) [17:25:37] elee: thanks!:) do you have the "git review" tool installed? 
[17:26:02] yeah - I typically -s before branching and then -R to submit a review [17:27:05] elee: so try this as an example: git review -d 223500 .. add your changes .. git commit --amend .. git review [17:27:23] it should add a new patch set to my existing change [17:27:38] RECOVERY - RAID on ms-be2013 is OK optimal, 13 logical, 13 physical [17:29:29] (CR) Joal: Add flag --all-projects to projectviews aggregator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) (owner: Mforns) [17:30:10] PROBLEM - puppet last run on ms-be2013 is CRITICAL Puppet has 1 failures [17:30:37] (PS3) Elee: (WIP) - make icinga firewall more readable [puppet] - https://gerrit.wikimedia.org/r/223500 (owner: Dzahn) [17:30:48] mutante: okay looks good - can I... delete the other one? [17:30:49] RECOVERY - Cassandra database on restbase1008 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [17:30:59] RECOVERY - Cassanda CQL query interface on restbase1008 is OK: TCP OK - 0.006 second response time on port 9042 [17:32:27] elee: yes, you can hit the "abandon" button [17:32:51] (Abandoned) Elee: (WIP) this should work... [puppet] - https://gerrit.wikimedia.org/r/223581 (owner: Elee) [17:32:57] done, thanks mutante [17:34:05] (PS2) Andrew Bogott: enable ferm for neptunium [puppet] - https://gerrit.wikimedia.org/r/223355 (owner: Muehlenhoff) [17:35:02] (CR) Andrew Bogott: [C: 2] enable ferm for neptunium [puppet] - https://gerrit.wikimedia.org/r/223355 (owner: Muehlenhoff) [17:37:18] did… someone else merge ^ ? 
[17:39:04] elee: it seems you updated the commit message but the \ are not in the actual file [17:39:20] ah I forgot to git add [17:39:21] standby [17:39:42] andrewbogott: no, but linked to the ticket fo rit [17:39:49] (PS4) Elee: (WIP) - make icinga firewall more readable [puppet] - https://gerrit.wikimedia.org/r/223500 (owner: Dzahn) [17:39:57] elee: thanks:) [17:40:04] mutante: I merged in gerrit but palladium says there’s nothing to merge. [17:40:05] thanks for pointing it out =p [17:40:15] and the affected system isn’t showing a change in puppet. [17:40:23] cassandra on restbase1008 is expected to be down, so don't worry about the related alert [17:40:48] andrewbogott: the second PS..it's like empty [17:40:55] andrewbogott: there is only a commit message [17:40:59] oh? hm [17:41:02] * andrewbogott looks again [17:41:09] rebased into nothing [17:41:09] is it me or is jenkins slowing down? [17:41:22] the time it takes to run tests is... increasing. [17:41:30] andrewbogott: wrong host, nembus alreayd hat it, you want neptunium [17:41:35] had [17:41:41] I guess people keep adding more tests and it's getting worse [17:41:48] yep, I see, the patch was wrong to begin with. There were two identical patches with different names :) [17:42:05] andrewbogott: nembus already has the rules, that's what confused me last night [17:42:18] andrewbogott: like how was it possible that neptunium did not have it but nembus did [17:42:36] mutante: we intentionally merged it in codfw first and gave it a day to settle [17:42:41] to ensure it didn’t break unexpected things [17:42:55] andrewbogott: makes sense.. *nod*.. 
so it made me investigate the whole thing [17:42:59] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [17:43:22] mutante: sorry :) I’m sorting it out and will merge on neptunium shortly [17:43:26] operations, ops-eqiad: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1438432 (Cmjohnson) The raid is failed but disk is good...state is Unconfigured(good), Spun Up. Need to fix that [17:44:10] andrewbogott: cool [17:44:35] operations, ops-eqiad, Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1438434 (Cmjohnson) Shouldn't we name these lvs1007-13? [17:44:48] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7928 bytes in 0.450 second response time [17:46:07] operations, ops-eqiad, Labs, Labs-Infrastructure, Labs-Sprint-102: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1438442 (Cmjohnson) We do not have spare md1200 shelves lying around. I have one that is not used that is waiti... [17:48:46] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1438450 (Cmjohnson) I can take some of the sc-sc fibers from eqiad and bring with me. We have several spare and since we started using copper I don't need as many on-site. [17:49:57] (PS1) Andrew Bogott: Add firewall to neptunium. [puppet] - https://gerrit.wikimedia.org/r/223584 (https://phabricator.wikimedia.org/T102481) [17:53:04] (CR) Andrew Bogott: [C: 2] Add firewall to neptunium. [puppet] - https://gerrit.wikimedia.org/r/223584 (https://phabricator.wikimedia.org/T102481) (owner: Andrew Bogott) [17:53:09] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[17:54:59] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150708T1800). [18:00:54] so it's that time again... [18:04:10] (PS1) 20after4: group1 wikis to 1.26wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/223587 [18:04:32] (CR) 20after4: [C: 2] group1 wikis to 1.26wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/223587 (owner: 20after4) [18:04:37] (Merged) jenkins-bot: group1 wikis to 1.26wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/223587 (owner: 20after4) [18:05:11] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf13 [18:05:19] Logged the message, Master [18:08:15] (PS5) Dzahn: make icinga firewall more readable [puppet] - https://gerrit.wikimedia.org/r/223500 [18:10:33] (PS6) Dzahn: make icinga firewall more readable [puppet] - https://gerrit.wikimedia.org/r/223500 [18:11:16] (CR) Dzahn: [C: 2] "thanks for amending, Elee" [puppet] - https://gerrit.wikimedia.org/r/223500 (owner: Dzahn) [18:11:52] <3 you too mutante [18:12:46] (PS3) Ori.livneh: Increase concurrent_writes to 128 [puppet] - https://gerrit.wikimedia.org/r/223454 (owner: GWicke) [18:13:20] wonders what happened to $EQIAD_PRIVATE_LABS_HOSTS1_C_EQIAD [18:13:28] since we have A, B and D [18:16:53] mutante: want me to shove it in? [18:17:15] robh: around? [18:18:11] (CR) Gage: [C: 1] "Thanks Matanya. These rules look fine, though we'll also need a corresponding config in modules/jmxtrans/manifests/init.pp for TCP/2101" [puppet] - https://gerrit.wikimedia.org/r/223534 (owner: Matanya) [18:18:31] elee: no, it was right that we didnt make any changes to it for now [18:18:36] ? 
[18:19:01] "The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. " [18:19:02] ? [18:19:10] jynus: ^ [18:19:35] first UA, then NYSE, now us!? [18:19:37] I am trying to delete a stupid spammers posts right now [18:20:12] Oh [18:20:24] elee: i'm saying we did right, we did not change anything, it was just an observation that we don't have labs hosts in row C [18:20:32] roger mutante [18:20:44] also, it got applied on neon now [18:20:46] no issues [18:20:47] Bsadowski1: hm? [18:20:49] ^_^ [18:20:50] Bsadowski1: Which DB? [18:20:52] uhm wiki [18:20:56] mediawiki.org [18:21:04] the reason why I was sure \ would work was because puppet ignores that [18:21:35] when you `puppet apply ` [18:21:52] and see see the created file, it wraps properly [18:21:53] !log ran 'kafka preferred-replica-election' to promote analytics1021 back to Leader [18:21:55] Bsadowski1: Still happening? Looks good [18:22:00] Logged the message, Master [18:22:57] @seen hashar [18:22:58] mutante: Last time I saw hashar they were quitting the network with reason: no reason was given N/A at 7/8/2015 4:28:45 PM (1h54m12s ago) [18:24:45] elee: great :) [18:25:08] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6516.02042619 [18:29:45] (PS1) GWicke: Throttle RESTBase update jobs [puppet] - https://gerrit.wikimedia.org/r/223588 [18:30:26] operations, Phabricator: iridium (phab server) - Could not find dependency Package[php5] - https://phabricator.wikimedia.org/T105210#1438641 (Dzahn) NEW [18:30:47] operations, Phabricator: iridium (phab server) - Could not find dependency Package[php5] - https://phabricator.wikimedia.org/T105210#1438648 (Dzahn) [18:31:08] ACKNOWLEDGEMENT - puppet last run on iridium is CRITICAL puppet fail daniel_zahn https://phabricator.wikimedia.org/T105210 [18:31:38] 
ACKNOWLEDGEMENT - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /home/hashar/mount is not accessible: Permission denied daniel_zahn https://phabricator.wikimedia.org/T105209 [18:34:10] operations: sodium - puppet fail - : Invalid parameter show_diff - https://phabricator.wikimedia.org/T105212#1438660 (Dzahn) NEW [18:34:20] ACKNOWLEDGEMENT - puppet last run on sodium is CRITICAL puppet fail daniel_zahn https://phabricator.wikimedia.org/T105212 [18:35:18] operations, Swift: ms-be2013 - - https://phabricator.wikimedia.org/T105213#1438670 (Dzahn) NEW [18:35:46] operations, Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1438679 (Dzahn) [18:36:00] ACKNOWLEDGEMENT - Disk space on ms-be2013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error daniel_zahn https://phabricator.wikimedia.org/T105213 [18:38:11] operations, Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1438691 (Dzahn) also: CRITICAL: Puppet has 1 failures Warning: /Stage[main]/Role::Swift::Storage/Swift_new::Init_device[/dev/sdc]/Swift_new::Mount_filesystem[/dev/sdc1]/File... [18:38:58] ACKNOWLEDGEMENT - puppet last run on ms-be2013 is CRITICAL Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T105213#1438691 [18:42:00] (PS2) Smalyshev: T95679: add WDQS deployment repo [puppet] - https://gerrit.wikimedia.org/r/223580 [18:44:27] (PS1) BBlack: fix sslcert::certificate for sodium [puppet] - https://gerrit.wikimedia.org/r/223591 [18:44:43] (CR) BBlack: [C: 2 V: 2] fix sslcert::certificate for sodium [puppet] - https://gerrit.wikimedia.org/r/223591 (owner: BBlack) [18:46:21] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1438732 (DaBPunkt) >>! 
In T91504#1437522, @BBlack wrote: > We don't have a "domain hoster", we do this all ourselves in the Operations team. I spoke of the provider where you manage th... [18:47:36] operations, RESTBase-Cassandra, Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1438736 (Dzahn) [18:48:20] (CR) coren: [WIP] labstore: Rewrite of replica-addusers.pl (2 comments) [puppet] - https://gerrit.wikimedia.org/r/223564 (owner: Yuvipanda) [18:49:39] RECOVERY - puppet last run on sodium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:50:29] operations: sodium - puppet fail - : Invalid parameter show_diff - https://phabricator.wikimedia.org/T105212#1438758 (Dzahn) Open>Resolved a:Dzahn fixed by @bblack https://gerrit.wikimedia.org/r/223591 [18:52:07] operations, RESTBase-Cassandra, Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1438764 (GWicke) @dzahn, we are getting those pages, and have been working on fixing the underlying issue. See also the recent thread on the ops list (subject: "Cas... [18:52:29] operations, RESTBase-Cassandra, Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1438765 (GWicke) p:Triage>Normal [18:55:07] JohnFLewis: back now, was afk, whats up? 
[18:56:24] robh: https://phabricator.wikimedia.org/T90407 could do with a look over/comment from an op on the idea since it may make your job harder to do, plus since your on ops duty this week, :) [18:56:27] (PS1) BBlack: Align ssl_stapling_file bugfix with upstream nginx [software/nginx] (wmf-1.9.2-1) - https://gerrit.wikimedia.org/r/223593 [18:56:29] (PS1) BBlack: Update multi-cert 100x-series patch line offsets [software/nginx] (wmf-1.9.2-1) - https://gerrit.wikimedia.org/r/223594 [18:56:31] (PS1) BBlack: Release 1.9.2+wmf3 (stapling fix updated) [software/nginx] (wmf-1.9.2-1) - https://gerrit.wikimedia.org/r/223595 [18:57:01] operations, RESTBase-Cassandra, Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1438779 (Dzahn) they were not acknowledged or had comments in icinga and i didn't see a matching ticket. in these cases i just assume they are not known yet [19:00:35] JohnFLewis: seems like there isnt even concensus on thread [19:01:18] robh: its one of these weird ones. It doesn't really need consensus either way though? [19:01:27] yea but i like it non searchable ;D [19:01:41] i happen to side with the entire mailing lists are for discussions, wikis are for documentation side. [19:01:56] but, thats not my office employee view!~ [19:02:02] officially, i have no stance. 
[19:02:10] plus you have to consider, archive removals will then get into the ground of needing google search removals and so [19:04:23] robh: well either way, no work is left for mailman now - next steps are literally the quarterly goal with the vm and any planning if needed whenever that's ready to start :) [19:05:36] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1438786 (10Dzahn) 3NEW [19:06:47] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1438802 (10Dzahn) [19:07:26] 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1438806 (10RobH) I'm not sure this request even has the kind of support behind it by the mailing list administrators to warrant enabling the indexing by various search engines. Additi... [19:07:58] though mutante is right, if nemo asked for that and didnt spam the admins, someone else would be mad for not being notified ;D [19:09:44] its (nearly) always better to overcommunicate a configuration change proposal than under. [19:12:57] true true [19:13:23] (03PS13) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) [19:13:48] (03CR) 10Hashar: "Add show: true metadata property so the images show up in Horizon/Wikitech. T105015" [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [19:15:35] * Nemo_bis nods [19:15:55] anyone deploying now? If not then I will deploy https://gerrit.wikimedia.org/r/#/c/223562/ for Nikerabbit [19:17:21] Krinkle, bd808: ^ [19:17:43] * Krinkle is not deploying anythign [19:17:49] * bd808 is not either [19:18:05] yeah looks all clear. 
We really need that deployment mutex ;) [19:18:31] twentyafterfour: you just acquired it [19:18:32] Krenair: curious, why us two? [19:18:42] we are logged in on tin [19:19:04] Krinkle, you were both listed on w after twentyafterfour . [19:20:12] Krenair: w=wikitech:Deployments? [19:20:42] Krinkle, the "ssh tin.eqiad.wmnet w" [19:20:56] it's a unix command ;) [19:21:00] oh, must be a screen [19:21:26] nice [19:21:31] didn' tknow about that one [19:21:39] can't have enough one-letter commands [19:21:43] aside from the obvious [19:22:34] there aren't enough keys for all the one letter commands ;) [19:22:44] * twentyafterfour wants a 202 key keyboard [19:23:25] * twentyafterfour imagines grafting together two old IBM boards. Frankenstein style. [19:23:39] do chinese characters count? [19:24:14] http://i.imgur.com/CR0NH.png [19:24:30] Although I'm quite sure they found a better system that involves fewer actual keys [19:24:46] but even for our alphabet, I dig this: http://images-cdn.9gag.com/photo/655720_700b.jpg [19:25:58] !log restbase rolling restart [19:26:05] Logged the message, Master [19:27:08] anyone know why scap sometimes hangs for a full minute on the scappy pig before outputting any status info? it's been happening a lot lately on sync-dir or sync-file [19:27:35] !log twentyafterfour Synchronized php-1.26wmf13: deploying UniversalLanguageSelector commit 2e0990ac9879 (duration: 01m 58s) [19:27:39] twentyafterfour: ssh init to the cluster hosts is most likely I think [19:27:41] Logged the message, Master [19:28:02] twentyafterfour: fix confirmed working [19:28:13] Nikerabbit: thanks. 
[19:28:31] twentyafterfour: thanks, you saved my day (actually, my night) [19:28:57] twentyafterfour: if it's sitting for a long time I would wonder about ipv6 dns for the host names of the rsync fanout servers [19:29:10] 6operations, 10RESTBase-Cassandra, 7Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1438837 (10Milimetric) Hm, I see that eventlogging alerts went off but I'm not sure how that's related to Cassandra. I looked at graphite and indeed the raw rate of... [19:30:05] bd808: it's about 1 minute generally. just long enough for me to wonder wtf [19:30:25] hmm.. I keep wondering if we should lock the ssh connections to ipv4 only [19:30:38] it does seem like the timeframe is about the same as ipv6 rollout [19:30:57] yeah. we had some other bigger blockers when that happened too [19:31:13] `host -6 mw1010.eqiad.wmnet` times out on tin [19:31:25] why is ipv6 dns flaky? [19:31:41] I think its less flakey and more not setup [19:31:48] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [19:31:51] (03CR) 10coren: [C: 04-1] "Slight problem in localuser" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [19:32:02] !log stopped cassandra on restbase1008 [19:32:08] Logged the message, Master [19:33:26] (03CR) 10coren: [C: 031] "fact » ugly LDAP hack" [puppet] - 10https://gerrit.wikimedia.org/r/221562 (owner: 10Andrew Bogott) [19:33:39] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.012 second response time [19:37:21] Fatal error: Call to undefined method SearchResultSet::addInterwikiResults() in /srv/mediawiki/php-1.26wmf12/extensions/CirrusSearch/includes/CirrusSearch.php on line 159 is happening a lot. It seems to be fixed in wmf13 but I don't know what fixed it so I'm not sure which patch to backport. Anyone have any ideas? 
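Note on the scap stall discussed above: bd808's guess is that the pause comes from IPv6 DNS lookups for the sync hosts, and `host -6 mw1010.eqiad.wmnet` is reported to time out on tin. One way to check that guess is to time the IPv4 and IPv6 lookups separately. The sketch below is only an illustration; the helper names `time_lookup` and `compare_families` are made up here and are not part of scap.

```python
import socket
import time


def time_lookup(host, family, resolver=socket.getaddrinfo):
    """Return (seconds_elapsed, error_or_None) for one name lookup.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    `resolver` is injectable so the function can be exercised without DNS.
    """
    start = time.monotonic()
    try:
        resolver(host, None, family)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.monotonic() - start, error


def compare_families(host, resolver=socket.getaddrinfo):
    """Time the IPv4 and IPv6 lookups for `host` and return both results."""
    return {
        "ipv4": time_lookup(host, socket.AF_INET, resolver),
        "ipv6": time_lookup(host, socket.AF_INET6, resolver),
    }
```

Running `compare_families("mw1010.eqiad.wmnet")` on the deploy host would show whether only the IPv6 side is slow or failing. If it is, OpenSSH's `AddressFamily inet` setting (or `ssh -4`) is the usual way to pin connections to IPv4. Whether scap's ssh calls would pick that up is not something this log confirms.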
[19:38:01] (03CR) 10coren: [C: 04-1] "I really dislike the hardcoded IP in the facter code (see inline comment). If that IP ever changes, puppet breaks because of the fail()" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott) [19:40:29] 6operations, 10RESTBase-Cassandra, 7Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1438874 (10GWicke) @dzahn, we don't actually have the necessary permissions to ack those events in icinga. Should I create a separate ticket for that, or is it okay f... [19:40:42] (03CR) 10coren: [C: 031] "Better that way." [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [19:41:10] (03CR) 10Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott) [19:43:49] (03PS2) 10GWicke: Throttle RESTBase update jobs [puppet] - 10https://gerrit.wikimedia.org/r/223588 [19:44:12] (03CR) 10coren: [C: 04-1] "Unless I'm missing something obvious, this makes 1004 into another master with no replication?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [19:50:04] (03PS1) 10Dzahn: phabricator: insert   for some footer strings [puppet] - 10https://gerrit.wikimedia.org/r/223604 [19:52:45] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1438931 (10coren) AFAICT, this problem solved itself (as expected) since we switched to a properly... [19:55:51] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) 
names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1438941 (10coren) 5declined>3Resolved Indeed it has: ```marc@tools-bastion-01:~$ host notexist... [19:56:00] @seen aaronsw [19:56:00] mutante: I have never seen aaronsw [19:57:14] (03CR) 10coren: [C: 04-2] "This was the wrong workaround for a problem that no longer exists (that is, dnsmasq is no longer the [broken] source of DNS authority for " [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [19:57:36] (03PS3) 10GWicke: Throttle RESTBase update jobs [puppet] - 10https://gerrit.wikimedia.org/r/223588 [19:58:58] (03CR) 10coren: [C: 031] "AFAICT, this is only correct linting." [puppet] - 10https://gerrit.wikimedia.org/r/211356 (owner: 10Dzahn) [20:00:04] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150708T2000). [20:00:55] (03PS5) 10Merlijn van Deen: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:01:14] (03CR) 10Mark Bergsma: [C: 032] Throttle RESTBase update jobs [puppet] - 10https://gerrit.wikimedia.org/r/223588 (owner: 10GWicke) [20:01:26] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1433508 (10hashar) [20:01:28] manybubbles: Is someone from search handling the itwiki fatals that are spamming the logs? [20:01:47] csteipp: first I've heard of it. [20:01:56] ebernhardson: ^^^ I'll have a look [20:02:31] (03PS2) 10Dzahn: ganglia: add aggregator for ulsfo on bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/223231 (https://phabricator.wikimedia.org/T93776) [20:02:55] ebernhardson and csteipp: ok. I can track this down. 
I hate that we don't test this for shit. [20:03:42] manybubbles: Thanks! [20:04:19] (03PS7) 10Dzahn: logstash: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223230 (https://phabricator.wikimedia.org/T93776) [20:04:50] (03CR) 10BBlack: [C: 032 V: 032] Align ssl_stapling_file bugfix with upstream nginx [software/nginx] (wmf-1.9.2-1) - 10https://gerrit.wikimedia.org/r/223593 (owner: 10BBlack) [20:05:03] (03CR) 10BBlack: [C: 032 V: 032] Update multi-cert 100x-series patch line offsets [software/nginx] (wmf-1.9.2-1) - 10https://gerrit.wikimedia.org/r/223594 (owner: 10BBlack) [20:05:09] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 8 20:05:09 UTC 2015 (duration 5m 8s) [20:05:15] Logged the message, Master [20:05:22] (03CR) 10coren: [C: 04-1] "AFAICT, this will work properly but I concur with Yuvi that the oneliners are painful to parse and understand." [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt) [20:05:24] manybubbles: kk [20:05:36] (03CR) 10BBlack: [C: 032 V: 032] Release 1.9.2+wmf3 (stapling fix updated) [software/nginx] (wmf-1.9.2-1) - 10https://gerrit.wikimedia.org/r/223595 (owner: 10BBlack) [20:07:15] ebernhardson: would you mind filing a phab issue for it while I fix it? [20:07:21] manybubbles: sure [20:07:25] or finding one [20:09:10] manybubbles: https://phabricator.wikimedia.org/T104189 [20:09:38] manybubbles: its suggested as fixed in wmf13 in that ticket. itwiki is on 12 [20:10:51] ebernhardson: looks like chad already proposed a fix too [20:13:13] (03PS1) 10Andrew Bogott: Increase the NFS blocking time to 180 seconds. 
[puppet] - 10https://gerrit.wikimedia.org/r/223656 [20:15:13] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1438997 (10hashar) The check_disk parameters for production are defined in `modules/base/manifests/monitoring/host.pp`:... [20:15:32] !log bounced cassandra on restbase1001 [20:15:34] 6operations, 5Continuous-Integration-Isolation, 7Icinga, 7Monitoring, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1439010 (10hashar) [20:15:38] Logged the message, Master [20:16:06] (03PS2) 10Andrew Bogott: Increase the NFS blocking time to 180 seconds. [puppet] - 10https://gerrit.wikimedia.org/r/223656 [20:16:28] starting parsoid deploy [20:17:52] (03CR) 10Andrew Bogott: [C: 032] Increase the NFS blocking time to 180 seconds. [puppet] - 10https://gerrit.wikimedia.org/r/223656 (owner: 10Andrew Bogott) [20:21:56] (03PS1) 10Gergő Tisza: Update graphite keys in API dashboard [puppet] - 10https://gerrit.wikimedia.org/r/223659 (https://phabricator.wikimedia.org/T85841) [20:22:16] (03CR) 10Dzahn: "also see https://phabricator.wikimedia.org/T104779" [puppet] - 10https://gerrit.wikimedia.org/r/215994 (owner: 10Faidon Liambotis) [20:22:21] (03CR) 10Josve05a: [C: 031] "Reviewing my own change. Feels...like I shouldn't do this." [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [20:22:44] ebernhardson: got a stopgap, please god stop the fatalspam, kind of fix at https://gerrit.wikimedia.org/r/#/c/223658 [20:23:04] manybubbles: checking it out [20:23:05] (03CR) 10Dzahn: "oh, in this case you should, i wrote it and could have made mistakes :)" [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [20:24:06] ebernhardson: the one chad proposed looks like it breaks some integration tests. 
cindy thinks it does [20:24:15] manybubbles: yea, double checking that now [20:24:16] I figure mine is more obvious and we can just SWAT it [20:26:10] Krinkle: does merging a zuul config change cause restarts or need interaction? https://gerrit.wikimedia.org/r/#/c/223559/1 [20:27:02] mutante: they are deployed manually [20:27:11] hashar: oh, you are here:) ok [20:27:13] thanks [20:27:15] just like puppet changes need a manual merge on palladium (or whatever is the puppet master) [20:27:23] the job is completing, going to deploy it :} [20:27:24] yes, it is.. gotcha [20:27:29] cool:) ty [20:27:32] I am watching the console log [20:27:53] (03CR) 10Hashar: "recheck" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [20:27:55] (03CR) 10jenkins-bot: [V: 04-1] Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [20:28:45] (03CR) 10Hashar: [C: 04-1] "Need rebase. The pep8/pyflakes issues have been fixed and the jobs made voting with https://gerrit.wikimedia.org/r/#/c/223559/" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [20:29:01] !log deployed parsoid sha c4cfc527 [20:29:04] mutante: kudos! [20:29:08] Logged the message, Master [20:29:51] hashar: they go to Elee :) [20:29:55] elee: [20:30:55] !log added explicit exit 1 in /etc/init.d/cassandra on restbase1008 to prevent cassandra from starting up there; is puppet restarting it? 
[20:31:01] Logged the message, Master [20:35:00] mutante: Yes [20:35:49] it's manually deployed out of git (ssh gallium, git pull, zuul reload) [20:40:52] wat [20:40:54] (03PS1) 10Smalyshev: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 [20:40:58] oh <3 you too hashar [20:41:03] and yyou too of course mutante [20:41:51] (03CR) 10jenkins-bot: [V: 04-1] Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 (owner: 10Smalyshev) [20:43:21] (03PS2) 10Mforns: Add flag --all-projects to projectviews aggregator [puppet] - 10https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) [20:43:56] (03CR) 10Mforns: Add flag --all-projects to projectviews aggregator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) (owner: 10Mforns) [20:45:55] Krinkle: tx, hashar already did it [20:46:49] Is greg-g still away? [20:46:51] (03PS4) 10Merlijn van Deen: Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 [20:46:53] (03CR) 10jenkins-bot: [V: 04-1] Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [20:47:24] hoo: yeah he's out all week on a real live vaction [20:47:55] <_joe_> pfff [20:47:55] mh ok [20:47:58] (03PS2) 10Smalyshev: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 [20:48:04] I do not see anything on the logs about mediawiki [20:48:25] <_joe_> bd808: I was told you americans never take a vacation! 
[20:48:38] (03CR) 10jenkins-bot: [V: 04-1] Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 (owner: 10Smalyshev) [20:48:40] (03PS5) 10Merlijn van Deen: Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 [20:48:43] (03CR) 10jenkins-bot: [V: 04-1] Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [20:49:45] _joe_: it's not as common as it should be. Protestant work ethic or some crap like that. [20:50:30] we are pretty much taught that if we stop working for a couple of weeks a year the economy will collapse [20:50:46] (03PS6) 10Merlijn van Deen: Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 [20:50:58] 6operations, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439066 (10Dzahn) 3NEW a:3Dzahn [20:51:07] 6operations, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439074 (10Dzahn) p:5Triage>3Normal [20:51:13] (03PS3) 10Smalyshev: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 [20:51:19] 6operations, 10RESTBase-Cassandra, 7Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1439076 (10Dzahn) >>! In T105216#1438874, @GWicke wrote: > @dzahn, we don't actually have the necessary permissions to ack those events in icinga. Should I create a s... 
[20:51:52] <_joe_> bd808 yeah like http://www.ispot.tv/ad/7BkA/2014-cadillac-elr-poolside (I couldn't resist) [20:53:01] 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439087 (10GWicke) [20:53:23] 6operations, 10RESTBase-Cassandra, 7Monitoring: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#1439089 (10GWicke) @dzahn, thanks! [20:53:27] !log deployed patch for T94116 for wmf12/wmf13 [20:53:31] _joe_: Can't actually agree with the moral of that ad, really. :-) [20:53:34] Logged the message, Master [20:54:07] oh man. I hate that ad [20:54:37] <_joe_> bd808: I actually love it. I show it to my friends, when I want to explain the US to them :D [20:54:39] It reeks of 'murica execptionalism. :-) [20:54:57] <_joe_> it's beyond horrible [20:55:01] its a great ad! its useful for so much stuff! [20:55:11] it explains lots of behavior [20:55:17] (03CR) 10Jcrespo: "Coren, what is for you the difference between a master and a slave?" [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [20:55:24] and its silly [20:55:41] <_joe_> it's not even "american", it's just a glorification of consummerism. "I will live a miserable life so that I can buy shiny things" [20:56:15] _joe_: I think its poking fun at that. at least a little. [20:56:51] !log deployed patches for T103022 & T103023 [20:56:57] Logged the message, Master [20:57:40] _joe_: well, if people didn't do that, nobody would buy that car, so it makes sense for them :) [20:58:03] <_joe_> bblack: ahah it's that bad? [20:58:17] <_joe_> I mean that expensive? [20:58:42] I'm just saying, nobody needs a shiny new cadillac EV [20:58:50] I have no idea what it costs [20:58:51] (03CR) 10Odder: [C: 031] "Looks OK after implementing Alex's suggestions." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/223497 (https://phabricator.wikimedia.org/T105118) (owner: 10Ebrahim) [20:59:02] MSRP¹ STARTING AT $75,000 [20:59:11] <_joe_> holy... [20:59:15] <_joe_> it's expensive [20:59:25] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1439115 (10Dzahn) 3NEW [20:59:53] 3/4 of my first house; 3x my current car [21:00:04] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1439125 (10Dzahn) [21:00:40] 2/3 of my current house.... [21:00:49] also, the american subculture that ad exemplifies is being oversold there. most people in that subculture and not working hard and making big bucks and buying shiny things. they're pretending to work hard, making less than they wish or pretend they do, and buying a lot of stuff on credit rather than cash assuming that their income will rise astronomically in the future to pay for it because [21:00:55] they're awesome. [21:00:57] or something like that [21:01:06] s/and not working/are not working/ [21:01:10] I... have a nicer house now. Thanks Lawyer wife! :) [21:01:15] (03CR) 10coren: "I would have expected the puppet config to mention the replication user and the master.info; though I realize now that the mysql class doe" [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [21:01:15] <_joe_> eheh [21:01:25] I just live in a cheap place.... [21:01:32] bblack: I'm awesome! [21:01:46] <_joe_> I drive what to you americans looks probably like a tin can, with a 1.2 L engine [21:02:04] _joe_: it's a motorcycle? [21:02:18] "euro car" [21:02:35] really the housing is the worst of it. that's where people really overextend and feel compelled to buy way more than they need or can afford. 
[21:02:41] <_joe_> urandom: https://en.wikipedia.org/wiki/Renault_Modus [21:03:01] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1439143 (10Andrew) This is blocked pending a replacement 10g card. [21:03:05] which I'm sure is a huge factor in the bubbliness of that market [21:03:12] _joe_: you drive a flashlight? [21:03:18] _joe_: sorry, joking [21:03:20] <_joe_> ahah [21:03:28] bblack: That said, I'm a sucker for large housing. I have 260m² for three people and we feel ridiculously constrained. [21:04:09] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 3380 MB (3% inode=98%) [21:04:13] <_joe_> urandom: it's the perfect balance of short length to be parked in Rome, and baggage real estate [21:04:14] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439148 (10Dzahn) [21:04:49] that graphite thing is real: just since I've been on there looking around, rootfs has taken on 10G new stuff [21:05:13] I suspect it's related to statsd stats dropping off like a rock a couple hours ago, somehow [21:05:37] bblack: o_O I bet I know what it is... [21:05:45] 48G /var/log/upstart, 23G /var/log/syslog [21:05:49] nice spam [21:06:01] _joe_: i drive a honda, which here in texas is very dinky [21:06:25] it does have an enormous engine for the size, tho [21:06:28] because statsite is spamming: "statsite[20763]: Failed value conversion! Input: :getMessagesFileName:1" as fast as CPU/disk will allow it to apparently [21:06:33] bblack: This would have added quite a few new stats -- https://gerrit.wikimedia.org/r/#/c/222224/ [21:06:46] apparently some of them are broken [21:07:12] <_joe_> roll it back? 
[21:08:08] RECOVERY - Disk space on graphite1001 is OK: DISK OK [21:08:24] !log graphite: wiped /var/log/upstart/statsite* logs, restarted statsite processes [21:08:31] Logged the message, Master [21:08:46] statsite[2618]: Failed value conversion! Input: :moduleManager:1 [21:09:01] statsite[2620]: Failed value conversion! Input: :getMessagesFileName:1 [21:09:04] statsite[2620]: Failed value conversion! Input: :get:1 [21:09:05] all kinds of spam like that [21:09:17] <_joe_> bblack: not easy to roll that back apparently [21:09:25] <_joe_> there are followup commits :( [21:09:30] https://github.com/armon/statsite/blob/52e5d6b912b38da780c67be75757062185e6c986/src/conn_handler.c#L456-L461 [21:09:42] does 18:05 line up with when the effects would have hit? [21:09:53] when did the train roll? [21:10:16] 18:05 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf13 [21:10:19] so yes [21:10:48] <_joe_> roll back the train maybe? [21:11:25] https://graphite.wikimedia.org/render/?tz=Etc%2FUTC&title=2xx%2Fs&from=-12h&vtitle=&colorList=%23387aa3%2C%23649eb9%2C%239dc2d3%2C%23a888c2%2C%23d8aad6%2C%23e7cbe6&width=1166&height=693&_salt=1436389872.233&target=alias(sumSeries(varnish.*.backends.*.2xx.rate)%2C%20%22%22)&target=varnish.eqiad.backends.ipv4_10_2_2_27.2xx.sum [21:11:34] assuming you can even paste that horrible link [21:11:50] but that's an example of "all other statsd stats dropping off like a rock" [21:12:06] (03CR) 10Jcrespo: "Yes, the commit message is not final yes (the patch isn't)." [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [21:12:42] or this is an easier link to paste: https://tessera.wikimedia.org/dashboards/6/ciphers/d15/transform/Isolate?from=-12h [21:12:47] <_joe_> we need to roll this back. 
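Note on the statsite spam above: the rejected inputs (`:getMessagesFileName:1`, `:moduleManager:1`, `:get:1`) look like what is left over when a statsd-style line `key:value|type` is split at the first colon while the key itself contains `::`. That reading is an assumption, not something confirmed in the channel. The sketch below only mimics the wire-format convention; it is not statsite's parser, and the metric name `hooks.Foo::getMessagesFileName` is invented for illustration.

```python
def split_like_statsd(line):
    """Split 'key:value|type' at the FIRST colon, then at the first pipe.

    Returns (key, raw_value, metric_type). Sketch of the wire-format
    convention only; it is not statsite's actual parser.
    """
    key, _, rest = line.partition(":")
    raw_value, _, metric_type = rest.partition("|")
    return key, raw_value, metric_type


def is_valid_value(raw_value):
    """True if the value part parses as a float."""
    try:
        float(raw_value)
    except ValueError:
        return False
    return True


def sanitize_key(key):
    """Replace colons in a metric key so the first ':' is the separator."""
    return key.replace(":", "_")
```

With the invented key, `split_like_statsd("hooks.Foo::getMessagesFileName:1|c")` yields the value part `:getMessagesFileName:1`, which fails `is_valid_value` and matches the shape of the logged errors. Passing the key through `sanitize_key` before building the line leaves `1` as the value. Sanitizing on the sending side is one possible fix; the channel went with reverting the change instead.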
[21:12:58] _joe_: working on a patch [21:13:07] <_joe_> bd808: oh, great :) [21:14:59] I've been staring at this problem for an hour heh, but multitasking with 3 other things and hadn't gotten to the bottom yet [21:15:04] I should've focused more :P [21:15:49] _joe_, bblack: https://gerrit.wikimedia.org/r/#/c/223673/ [21:15:56] twentyafterfour: https://gerrit.wikimedia.org/r/223674 << Please +2 [21:16:08] We can cherry pick that back to the release branch [21:16:13] I just created that new branch from wmf12, because we have to do a breaking change for wmf13 [21:16:25] or whoever else wants to merge... bd808 thcipriani ^ [21:17:30] hoo: do you need that now? [21:17:31] <_joe_> hoo: wait a second, we have a rollback to perform [21:17:33] bd808: +1'd [21:17:56] bd808: Well, it's just so that we don't downgrade by mistake next week [21:17:56] this would be the second time I'd be performing a review: https://gerrit.wikimedia.org/r/#/c/180890/ [21:18:00] it all looks good [21:18:04] so I just... +1 it? [21:18:08] _joe_: That's not deployed itself, so nothing to worry ;) [21:19:03] <_joe_> elee: if you think it's good, yes [21:19:10] (03CR) 10Elee: [C: 031] Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [21:19:39] * bd808 waits on zuul [21:20:07] (03CR) 10Dzahn: "if you want to, edit debian/changelog and add a stanza to turn it into 1.7.9. or we could do that separately" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [21:20:10] hoo: {{done}} [21:20:28] Great, thanks [21:20:31] bblack, _joe_: I'll have that patch out as soon as zuul lets it through [21:20:37] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439229 (10RobH) This has already been slightly discussed in IRC, but I'll note it on task for the record. Having rights to acknowled... 
[21:20:42] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1439234 (10RobH) This has already been slightly discussed in IRC, but I'll note it on task for the record. Having rights to acknowledge and silence alerts in icinga... [21:20:49] bd808: thanks! [21:22:35] in related news we really need to speed up out test pipeline [21:24:01] akosiaris, hi, MaxSem & I created the tasks, could you check and see if anything is missing? thx! :) [21:27:07] 6operations, 7HTTPS, 7LDAP: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1439257 (10Andrew) a:5Andrew>3akosiaris ldap-mirror is plutonium which is I believe Alexandros's project. I've never touched it. So, I'm not unwilling to work on this, but it m... [21:28:00] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439261 (10RobH) Daniel pulled up the varying icinga permission groups: P926 [21:28:05] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1439263 (10RobH) Daniel pulled up the varying icinga permission groups: P926 [21:29:28] manybubbles: there is an unpulled backport of CirrusSearch on the 1.26wmf13 branch :/ [21:29:54] bd808: you mean there isn't a submodule update for it? [21:30:13] there is an update but it hasn't been applied on tin yet [21:30:29] ebernhardson was preparing them for swat so I +2ed and figured he'd build the submodule update [21:30:32] and it's sitting on top of my fix for the statsd stuff [21:30:58] bd808: they can go together then [21:31:14] grumble [21:31:15] ours was going to be swatted [21:31:23] in a couple of hours [21:31:51] * bd808 channels roan and makes grumpy noises [21:32:01] hey - I was just doing normal swap prep. 
and it fixes a fatal! [21:32:15] coolio. I'll ship it [21:32:28] does it need scap or just a file sync? [21:32:55] file sync [21:33:19] Just includes/CirrusSearch.php? [21:34:46] manybubbles: I haven't done a submodule bump in a while. I do `git submodule update --init --recursive extensions/CirrusSearch` to fetch it correct? [21:35:01] bd808: +1 [21:35:26] bd808: weren't you doing a submodule update of cirrus for your fix? [21:36:04] manybubbles: no, I was backporting to core [21:36:29] but on the wmf13 branch where your submodule bump had already been merged [21:36:36] so my fetch picked it up [21:36:41] now I'm really confused.... [21:37:02] You merged the backport -- https://gerrit.wikimedia.org/r/#/q/Id46545dd77a00ccd4ba2f18efc5d7d57f2d96626,n,z [21:37:37] so when I fetched 1.26wmf13 on tin it got pulled [21:37:42] yeah! to the cirrus submodule.... i didn't merge the submodule update [21:38:32] I didn't do this step: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_submodule [21:38:38] so how did you get the code? [21:38:46] just sync your code, dude, and we'll work it out [21:38:56] anybody around with the rights to look into job runners? [21:39:33] gwicke: You can probably do that yourself [21:39:34] !log bd808 Synchronized php-1.26wmf13/extensions/CirrusSearch/includes/CirrusSearch.php: Suppress interwiki results when they would break (duration: 00m 12s) [21:39:38] Updating the submodule [21:39:38] ... Should no longer be necessary. Gerrit should do this for you now. [21:39:40] oh [21:39:41] Logged the message, Master [21:39:43] well, unless you want to attach gdb or something along these lines [21:39:44] thats new [21:40:08] hoo: as in restart? [21:40:21] !log bd808 Synchronized php-1.26wmf13/includes/Hooks.php: Revert Count API module instantiations and Hook runs (1/2) (duration: 00m 12s) [21:40:28] Logged the message, Master [21:40:43] gwicke: You can restart hhvm... 
not nice, but it's never nice [21:40:50] bd808: I see the problem now. no submodule update is required any more.... [21:41:11] !log bd808 Synchronized php-1.26wmf13/includes/api/ApiMain.php: Revert Count API module instantiations and Hook runs (2/2) (duration: 00m 12s) [21:41:17] Logged the message, Master [21:41:18] manybubbles: oh. I heard folks whine about that before [21:41:21] manybubbles: merging the wmf branch in extensions now automatically creates submodule updates. [21:41:36] yeah - so I didn't know that and made bd808's life hard [21:41:51] manybubbles: I'll stop being grump at you then and hate on gerrit/zuul and communications instead! [21:42:03] It's actually useful if you're expecting it [21:42:06] yeah! [21:42:07] not so helpful if you're not [21:42:08] hoo: the runners are still on zend, afaik [21:42:32] but a shit show otherwise. Did I miss the giant "shit is different" email blast on that? [21:42:45] * bd808 says shit a few more times [21:42:45] I must have missed it too [21:42:59] I don't think I got an email about it [21:43:28] gwicke: No, have been migrated ages ago [21:43:31] _joe_: can you check and see if those errors have disappeared for statsite? [21:43:42] <_joe_> bd808: ack [21:43:44] sooooo - I've merged that patch. can I just go sync it now? [21:43:53] or should I roll it back [21:44:01] manybubbles: have at it. I synced wmf13 for you [21:44:17] so just wmf12 needs to go now [21:44:40] Can you poke me once you're done? I need to sync out a fix for a unbreak now bug :S [21:44:56] hoo: they are running off /usr/bin/php [21:45:03] (03CR) 10Dzahn: "what Moritz said. nevertheless we could merge it, since nothing would happen on sodium but as soon as we apply it on a new VM combined wit" [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. 
Lewis) [21:45:06] hoo: I'm out of the way (pending confirmation that I guessed the right thing to fix) [21:45:23] oh, but you are right: that's actually hhvm on those boxes [21:45:31] <_joe_> bd808: nope [21:45:35] <_joe_> bd808: statsite[2622]: Failed value conversion! Input: 2.964393875E-314 [21:45:43] <_joe_> this seems pretty broken as well [21:45:45] blerg [21:45:46] hoo: in any case, I have no rights to restart the jobrunner service [21:46:27] <_joe_> what do you guys need about the jobrunners? [21:46:46] there are resourceloader patches from timo sitting behind my cirrus fixes [21:46:48] <_joe_> you want to deploy a new version of the jobrunner service? [21:47:07] gwicke: That might be true [21:47:23] Krinkle: looks like there are patches for you sitting on the clsuter [21:47:24] _joe_: the RB jobs don't seem to be doing much, and I see jobs queuing up [21:47:48] <_joe_> gwicke: ok, lemme take a look [21:47:52] manybubbles: branch? [21:48:03] (03CR) 10Dzahn: "needs rebase" [puppet] - 10https://gerrit.wikimedia.org/r/223229 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [21:48:04] I do see various curl calls on one of the jobrunners, but no runner [21:48:11] wmf12 [21:48:17] manybubbles: rebase? [21:48:22] afaik that was merged and deplyed [21:48:26] local branch pointer is behind [21:48:34] will disappear into origin/wmf/1.26wmf12 [21:48:50] <_joe_> gwicke: no runner? [21:48:54] <_joe_> what do you mean? [21:48:59] <_joe_> and where exactly? 
[21:49:08] <_joe_> because they seem to be running alright [21:49:19] (03PS2) 10Dzahn: transparency: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223229 (https://phabricator.wikimedia.org/T104937) [21:49:22] this is on mw1001; it used to be possible to grep for a job name and see the runner processes [21:49:34] <_joe_> gwicke: ok, you did it wrong [21:49:37] Krinkle: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_tin is out of date again I guess [21:49:41] If the storm has passed, I could use a hand from someone who a) understands nutcracker and b) has a few minutes. https://phabricator.wikimedia.org/T102993 [21:49:46] <_joe_> lemme take a look [21:49:58] (03PS3) 10Dzahn: transparency: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223229 (https://phabricator.wikimedia.org/T104937) [21:50:02] manybubbles: it says to rebase [21:50:14] <_joe_> mw1001:~$ sudo service jobrunner status [21:50:15] <_joe_> jobrunner start/running, process 5102 [21:50:19] <_joe_> gwicke: ^^ [21:50:50] <_joe_> jobs are submitted to hhvm via posts from this service [21:50:56] <_joe_> since ~ 1 year [21:50:57] (03CR) 10Dzahn: [C: 032] transparency: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223229 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [21:51:05] <_joe_> no, a bit less, but still :) [21:51:16] manybubbles: Yeah, looking on it. [21:51:18] tin [21:51:20] looks good now [21:51:26] it was in a weird state indeed [21:51:28] I rebased [21:51:29] I know, but I haven't looked into the jobrunner boxes themselves since [21:51:37] manybubbles: Krinkle: Are you doing stuff or can I go ahead? [21:51:45] git log doens't show this well because it compares the local pointer based on ref instead of content [21:51:50] !log manybubbles Synchronized php-1.26wmf12/extensions/CirrusSearch/: Stop some fatals in cirrus (duration: 00m 13s) [21:51:55] I'm just observing. 
[21:51:57] Logged the message, Master [21:52:06] _joe_: I just checked runJobs.log on fluorine, and it seems to show some RESTBase activity [21:52:24] hoo: all done. you can have it [21:52:30] Nice, thanks :) [21:52:36] _joe_: ohhh, request rates just picked up [21:52:42] <_joe_> gwicke: 2015-07-08T21:11:32+0000: Runner loop 2 process in slot 2 gave status '0': [21:52:45] maybe it fixed itself [21:52:45] <_joe_> curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=frwiki&type=RestbaseUpdateJobOnDependencyChange&maxtime=30&maxmem=300M' [21:52:52] <_joe_> this timed out repeatedly [21:53:10] <_joe_> so it's restbase timing out, not the jobrunners having problems it seems [21:53:24] <_joe_> because all other jobs run fine :) [21:53:42] what is the timeout? [21:54:14] this is re-rendering articles through Parsoid, which can take ~60 seconds [21:54:15] <_joe_> "maxtime=30" seems to give a hint [21:54:29] <_joe_> so maybe that's wrong and needs tweaking [21:55:03] <_joe_> ask aaron for pointers on how to fix that [21:55:09] <_joe_> it surely needs more time. [21:55:11] kk [21:55:37] the job doesn't necessarily need to wait (we disabled retries since those seem to lead to jobs being retried forever) [21:56:03] <_joe_> the job doesn't, I don't know what happens on the restbase side if the client disconnects [21:56:28] I believe it finishes the request [21:56:29] bd808: Will gerrit also update submodules after commits if I manually do submodule update? [21:56:41] In my case where I go from wmf12 to a wmf13 branch [21:56:46] _joe_: in any case, it seems to be working now; thanks for looking into it! [21:56:52] <_joe_> np [21:56:59] <_joe_> but now I'm off to bed [21:57:11] <_joe_> midnight seems like a good time to stop working :) [21:57:12] hoo: I don't know. It's all new to me. [21:57:29] _joe_: indeed, goodnight! 
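Note on the timeout exchange above: _joe_ points at `maxtime=30` in the logged curl call, and gwicke says a Parsoid re-render can take about 60 seconds. A minimal Python sketch of that comparison follows. It only reads the query string of the URL quoted in the log. The helper names are hypothetical, and how the jobrunner actually enforces `maxtime` is an assumption not confirmed in the channel.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the jobrunner log line quoted in the channel.
URL = ("http://127.0.0.1:9005/rpc/RunJobs.php?wiki=frwiki"
       "&type=RestbaseUpdateJobOnDependencyChange&maxtime=30&maxmem=300M")


def job_time_budget(rpc_url):
    """Return the maxtime query parameter as an int, or None if absent.

    Hypothetical helper: it only parses the URL and says nothing about
    how the service applies the limit.
    """
    query = parse_qs(urlsplit(rpc_url).query)
    values = query.get("maxtime")
    return int(values[0]) if values else None


def budget_too_small(rpc_url, expected_seconds):
    """True when the URL carries a maxtime below the expected job duration."""
    budget = job_time_budget(rpc_url)
    return budget is not None and budget < expected_seconds


if __name__ == "__main__":
    # 60 is the rough Parsoid re-render time mentioned by gwicke.
    print(job_time_budget(URL), budget_too_small(URL, 60))
```

With the values from the log, the budget (30) is below the quoted render time (60), which is consistent with the repeated timeouts _joe_ saw for this job type only.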
[21:58:13] Ok, I'll write an email to other Wikidata deployers then, just in case [22:00:28] 6operations, 10Deployment-Systems, 10RESTBase, 6Release-Engineering, 6Services: Get ops feedback regarding the use of SSH for deployment system control channel. - https://phabricator.wikimedia.org/T102687#1439388 (10thcipriani) I poked at Fabric a bit this morning. Fabric uses paramiko which doesn't app... [22:01:53] manybubbles: I'll update wikitech a bit [22:02:00] it's not wrong but it'll make it slightly less confusing [22:02:07] there's a few outdated git -pull mentions as well [22:02:14] even though none of the commands feature that [22:05:06] bblack: https://tessera.wikimedia.org/dashboards/6/ciphers/d15/transform/Isolate?from=-12h looks like it is recovered. _jo.e_ said there are still some errors in the statsite logs. [22:05:46] bd808: woooh... gerrit picked that change up on its own [22:05:52] how on earth [22:06:05] (it updated to the wmf13 branch on its own) [22:06:14] the releng folks changed stuff in gerrit apparently [22:06:41] That's a scary amount of automation already [22:06:43] :P [22:08:14] Yeah this changed a few wmf branches ago (wmf7?) 
[22:08:16] And nobody knew why [22:08:17] (03PS2) 10GWicke: Increase read parallelism to 96 [puppet] - 10https://gerrit.wikimedia.org/r/223495 [22:08:20] I'm not complaining :D [22:08:24] !log hoo Synchronized php-1.26wmf13/extensions/Wikidata/: Update Wikibase: Fix JavaScript ULS usage (duration: 00m 20s) [22:08:30] Logged the message, Master [22:15:49] PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail [22:17:35] (03PS8) 10Dzahn: logstash: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223230 (https://phabricator.wikimedia.org/T93776) [22:18:27] (03CR) 10Dzahn: [C: 032] logstash: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223230 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [22:20:47] (03PS1) 10Dduvall: contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) [22:24:15] @seen ottomata [22:24:15] mutante: Last time I saw ottomata they were quitting the network with reason: Quit: Leaving. 
N/A at 7/7/2015 3:00:43 AM (1d19h23m31s ago) [22:26:26] (03PS4) 10BBlack: Increase concurrent_writes to 128 [puppet] - 10https://gerrit.wikimedia.org/r/223454 (owner: 10GWicke) [22:26:28] (03PS1) 10Southparkfan: Apache (wikitech): use /w/404.php as 404 errorpage [puppet] - 10https://gerrit.wikimedia.org/r/223694 (https://phabricator.wikimedia.org/T102147) [22:26:51] (03PS3) 10BBlack: Increase read parallelism to 96 [puppet] - 10https://gerrit.wikimedia.org/r/223495 (owner: 10GWicke) [22:27:29] (03CR) 10BBlack: [C: 032] Increase concurrent_writes to 128 [puppet] - 10https://gerrit.wikimedia.org/r/223454 (owner: 10GWicke) [22:27:42] (03CR) 10BBlack: [C: 032] Increase read parallelism to 96 [puppet] - 10https://gerrit.wikimedia.org/r/223495 (owner: 10GWicke) [22:30:33] bd808: ok thanks :) [22:31:50] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1439456 (10Dzahn) eqiad is completely switched now. ---- logstash was the last cluster. caveat: only the unicast hosts are in the ganglia cluster: 21 elasticsearch::unica... [22:33:18] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:33:54] !log legoktm Synchronized php-1.26wmf13/includes/changes/EnhancedChangesList.php: Unbreak missing flags in enhanced RC (duration: 00m 12s) [22:34:00] Logged the message, Master [22:34:34] pretty simple PS for CI if someone has a moment https://gerrit.wikimedia.org/r/#/c/223691/ [22:37:06] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1439472 (10scfc) Too bad for the readers Google will bring here in the future: Nearly four months o... 
[22:38:06] (03Abandoned) 10Dzahn: put base::firewall on neptunium (LDAP) [puppet] - 10https://gerrit.wikimedia.org/r/223232 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [22:42:00] (03PS3) 10Dzahn: restbase - fix => alignment (lint) [puppet] - 10https://gerrit.wikimedia.org/r/222535 [22:43:00] (03CR) 10Dzahn: [C: 032] "really only spaces - but fixes 70 warnings" [puppet] - 10https://gerrit.wikimedia.org/r/222535 (owner: 10Dzahn) [22:45:43] !log zirconium - stop puppet for role switch [22:45:49] Logged the message, Master [22:46:12] (03PS3) 10Dzahn: transparency report: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/223226 (https://phabricator.wikimedia.org/T104937) [22:48:56] (03CR) 10Dzahn: [C: 032] transparency report: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/223226 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [22:49:46] mutante: is it me or have you removed two letters? https://gerrit.wikimedia.org/r/#/c/222535/3/manifests/role/restbase.pp [22:50:51] SPF|Cloud: a single space per line [22:50:58] Heh [22:51:31] Here I see you removed more than just spaces [22:51:36] SPF|Cloud: seeing that too [22:52:15] (03PS1) 1020after4: Bump phabricator tag to release/2015-07-08/1 [puppet] - 10https://gerrit.wikimedia.org/r/223697 [22:52:21] (03CR) 10jenkins-bot: [V: 04-1] Bump phabricator tag to release/2015-07-08/1 [puppet] - 10https://gerrit.wikimedia.org/r/223697 (owner: 1020after4) [22:52:36] SPF|Cloud: which line? [22:53:39] mutante: 36, 46 [22:53:46] (03PS2) 1020after4: Bump phabricator tag to release/2015-07-08/1 [puppet] - 10https://gerrit.wikimedia.org/r/223697 [22:54:33] gwicke: ouch, sorry [22:54:38] at least just the description [22:54:47] Two times :p [22:55:05] But np [22:55:10] mutante: no worries, thanks for cleaning up! 
[22:56:39] PROBLEM - puppet last run on graphite1001 is CRITICAL puppet fail [22:56:39] !log finished rolling restart of cassandra cluster to apply https://gerrit.wikimedia.org/r/#/c/223495/ [22:56:45] Logged the message, Master [22:58:16] (03PS2) 1020after4: phabricator: insert &nbsp; for some footer strings [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [22:58:23] (03CR) 1020after4: [C: 031] phabricator: insert &nbsp; for some footer strings [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [22:59:27] (03CR) 1020after4: [C: 031] "I enabled the option in the web interface, but we should merge this as well" [puppet] - 10https://gerrit.wikimedia.org/r/223067 (owner: 10Chad) [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150708T2300). [23:00:04] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:21] Present [23:00:22] I'll do it [23:00:55] RoanKattouw, do you want to reschedule the triage, do it without you, or do both in parallel? [23:00:56] (03PS1) 10Dzahn: restbase: fix typos in description field [puppet] - 10https://gerrit.wikimedia.org/r/223700 [23:01:03] matt_flaschen: We'll start a bit late [23:01:45] Okay, just let me know. [23:01:53] (03CR) 10Dzahn: [C: 032] restbase: fix typos in description field [puppet] - 10https://gerrit.wikimedia.org/r/223700 (owner: 10Dzahn) [23:06:07] (03CR) 1020after4: "Looks like this is probably a dead end." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (https://phabricator.wikimedia.org/T105052) (owner: 10Matanya) [23:06:44] !log Restarted logstash on logstash1001; no hhvm input seen for last hour [23:06:50] Logged the message, Master [23:06:58] !log catrope Synchronized php-1.26wmf13/extensions/Flow: SWAT (duration: 00m 14s) [23:07:04] Logged the message, Master [23:08:06] (03PS2) 10Dzahn: misc-web varnish: switch transparency to bromine [puppet] - 10https://gerrit.wikimedia.org/r/223227 (https://phabricator.wikimedia.org/T104937) [23:08:32] (03CR) 1020after4: [C: 031] [WIP] Phabricator: Create differential puppet role [puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) (owner: 10Negative24) [23:09:01] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1439571 (10Tgr) >>! In T102566#1434294, @Tau wrote: > Still nothing ... Any further recommendations?? Make sure [[ https://w... 
[23:09:36] (03PS3) 10Dzahn: misc-web varnish: switch transparency to bromine [puppet] - 10https://gerrit.wikimedia.org/r/223227 (https://phabricator.wikimedia.org/T104937) [23:10:51] (03CR) 10Dzahn: [C: 032] "[terbium:~] $ apache-fast-test transparency.url zirconium.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/223227 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [23:14:14] ACKNOWLEDGEMENT - puppet last run on graphite1001 is CRITICAL puppet fail daniel_zahn WIP - see ops list [23:14:54] ACKNOWLEDGEMENT - puppet last run on graphite1001 is CRITICAL puppet fail daniel_zahn WIP - see ops list [23:16:08] RECOVERY - puppet last run on graphite1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:16:54] (03PS2) 10Dzahn: transparency: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/223228 (https://phabricator.wikimedia.org/T104937) [23:17:19] (03CR) 10Dzahn: [C: 032] transparency: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/223228 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [23:21:37] 6operations, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1439648 (10Dzahn) [23:21:38] 6operations, 10Traffic, 5Patch-For-Review: move transparency report from zirconium to bromine - https://phabricator.wikimedia.org/T104937#1439646 (10Dzahn) 5Open>3Resolved moved and deleted on zirconium [23:26:55] (03PS2) 10Dzahn: pay-lvs: remove from hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/219077 [23:31:06] (03CR) 10Dzahn: "i guess nevermind, because they are just tools, still not sure how to feel about automatic upgrading or not" [puppet] - 10https://gerrit.wikimedia.org/r/222534 (owner: 10Dzahn) [23:31:10] (03Abandoned) 10Dzahn: role::deployment - no ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/222534 (owner: 10Dzahn) [23:33:17] (03PS3) 10Dzahn: deployment::server: move 
releases::upload into role [puppet] - 10https://gerrit.wikimedia.org/r/223464 [23:33:59] (03PS4) 10Dzahn: deployment::server: move releases::upload into role [puppet] - 10https://gerrit.wikimedia.org/r/223464 [23:35:04] (03PS5) 10Dzahn: deployment::server: move releases::upload into role [puppet] - 10https://gerrit.wikimedia.org/r/223464 [23:35:27] (03CR) 10John F. Lewis: [C: 031] Apache (wikitech): use /w/404.php as 404 errorpage [puppet] - 10https://gerrit.wikimedia.org/r/223694 (https://phabricator.wikimedia.org/T102147) (owner: 10Southparkfan) [23:35:49] (03PS3) 10Dzahn: releases::reprepro: move class into autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/223450 [23:37:56] (03CR) 10Dzahn: [C: 032] "https://wikitech.wikimedia.org/w/404.php" [puppet] - 10https://gerrit.wikimedia.org/r/223694 (https://phabricator.wikimedia.org/T102147) (owner: 10Southparkfan) [23:38:01] (03PS2) 10Dzahn: Apache (wikitech): use /w/404.php as 404 errorpage [puppet] - 10https://gerrit.wikimedia.org/r/223694 (https://phabricator.wikimedia.org/T102147) (owner: 10Southparkfan) [23:38:34] yay [23:42:05] SPF|Cloud: https://wikitech.wikimedia.org/foo works now .. thanks for the patch [23:42:28] (03PS1) 10Negative24: Remove php5 dependency check [puppet] - 10https://gerrit.wikimedia.org/r/223701 [23:42:36] It should work for everything [23:43:07] 6operations, 6Labs, 10Wikimedia-Apache-configuration, 10wikitech.wikimedia.org, 5Patch-For-Review: Make 404.php be served as the 404 error for wikitech. - https://phabricator.wikimedia.org/T102147#1439719 (10Dzahn) 5Open>3Resolved a:3Dzahn has been applied on silver and works now :) does look nic... [23:44:14] quick ops +2; currently has puppet hosed on labs -> https://gerrit.wikimedia.org/r/#/c/223701/ [23:44:23] SPF|Cloud: of course just example. 
yes, and then redirects too [23:44:45] Normally the errordocument is only defined in apache2.conf in the mediawiki module, so I guess someone might have missed that when wikitech.wikimedia.org.erb was written [23:45:22] Negative24: sounds like https://phabricator.wikimedia.org/T105210 right [23:45:41] ah yes [23:45:47] mutante: will add to commit message [23:46:11] (03PS2) 10Negative24: Remove php5 dependency check [puppet] - 10https://gerrit.wikimedia.org/r/223701 (https://phabricator.wikimedia.org/T105210) [23:46:29] mutante: ^ [23:46:45] SPF|Cloud: apache2.conf probably just like the default from the package there and the class only adds the file in sites-enabled/ [23:46:58] Yep [23:47:01] Negative24: just a minute [23:47:11] thanks [23:47:57] maybe it should just be changed to php5-cli... [23:48:44] 6operations, 6Phabricator, 5Patch-For-Review: iridium (phab server) - Could not find dependency Package[php5] - https://phabricator.wikimedia.org/T105210#1439739 (10Negative24) [23:50:03] Reedy: I'd still love your help with that [23:50:31] I do plan on digging into it a little bit more [23:50:37] It shouldn't be much work to rig it up [23:51:36] Negative24: i think it should depend more on libapache2-mod-php5 [23:52:00] since it's file { '/etc/php5/apache2/php.ini' [23:52:03] no? [23:52:20] yea [23:53:38] so we get that from include apache::mod::php5 [23:54:57] (03PS3) 10Negative24: Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 [23:55:10] 6operations, 6Phabricator, 5Patch-For-Review: iridium (phab server) - Could not find dependency Package[php5] - https://phabricator.wikimedia.org/T105210#1439764 (10Dzahn) 16:52 < mutante> Negative24: i think it should depend more on libapache2-mod-php5 16:53 < mutante> since it's file { '/etc/php5/apache2/p... [23:55:39] (03PS4) 10Negative24: Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 [23:56:08] mutante: ^ good? 
[23:56:18] er [23:56:20] (03CR) 10jenkins-bot: [V: 04-1] Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24) [23:56:30] mutante: sorry will look when I get back [23:56:34] it wasn't right [23:56:47] i'll do it [23:56:57] 6operations, 10ops-eqiad: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1439789 (10Cmjohnson) When I attempted to add the disk back, the disk reported back as unconfigured bad. I attempted to make good and rebuild but it happened again. I am going to need to replace the disk. cmjohn... [23:58:04] (03PS5) 10Dzahn: Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24) [23:58:11] 6operations, 6Labs, 10Wikimedia-Apache-configuration, 10wikitech.wikimedia.org, 5Patch-For-Review: Make 404.php be served as the 404 error for wikitech. - https://phabricator.wikimedia.org/T102147#1439798 (10Krenair) a:5Dzahn>3Southparkfan [23:58:47] (03CR) 10jenkins-bot: [V: 04-1] Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24) [23:59:21] (03PS6) 10Dzahn: Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24)