[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151104T0000). Please do the needful. [00:00:05] James_F MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:07] hi [00:00:13] * James_F waves. [00:00:18] yo [00:00:19] Max can go first. [00:00:24] My patch doesn't exist yet. [00:01:09] I say you make James_F go first just for that reason :) [00:01:14] :-P [00:01:20] Then there'll be a wait. [00:01:28] (03CR) 10Alex Monk: [C: 032] Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250851 (owner: 10MaxSem) [00:02:23] (03Merged) 10jenkins-bot: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250851 (owner: 10MaxSem) [00:03:39] !log krenair@tin Synchronized portals: https://gerrit.wikimedia.org/r/#/c/250851/ (duration: 00m 17s) [00:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:03:44] MaxSem, ^ [00:04:20] thanks Krenair - I'll look at beta when it propagates there [00:04:37] I think I have to manually update beta for this change actually. [00:05:09] let's see if it's so [00:05:13] Yep. [00:05:26] I can push in there if zuuul doesn't work [00:05:41] Zuul won't automatically update this, it's a submodule in operations/mediawiki-config. [00:05:45] CI is being unhelpful. [00:05:51] mutante: yeah feel free to merge the k8s patch, we'll notice if it goes awry immediately since grrrit-wm will die :) [00:06:17] YuviPanda: hehe, ok :) [00:06:34] logged in -labs [00:06:38] (03PS3) 10Dzahn: k8s: Move the ferm fules into the role [puppet] - 10https://gerrit.wikimedia.org/r/246295 (owner: 10Muehlenhoff) [00:06:54] (03CR) 10Dzahn: [C: 032] k8s: Move the ferm fules into the role [puppet] - 10https://gerrit.wikimedia.org/r/246295 (owner: 10Muehlenhoff) [00:07:49] (03PS2) 10Alex Monk: Fix global for wgVisualEditorFullRestbaseURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250712 [00:07:55] (03CR) 10Alex Monk: [C: 032] Fix global for wgVisualEditorFullRestbaseURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250712 (owner: 10Alex Monk) [00:08:40] (03Merged) 10jenkins-bot: Fix global for wgVisualEditorFullRestbaseURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250712 (owner: 10Alex Monk) [00:08:44] I guess we should fix beta's logmsgbot at some point. [00:09:32] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/250712/ (duration: 00m 18s) [00:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:42] uhm [00:09:45] is Wikidata down? [00:09:54] Krenair: ^ [00:10:18] Krenair: https://gerrit.wikimedia.org/r/250869 [00:10:22] wfm? [00:10:24] WFM [00:10:31] other sites work for me as well hoo [00:10:35] ok [00:11:15] 9 packets transmitted, 0 received, 100% packet loss, time 7999ms [00:11:36] ge1-2-cr0.ixf.de.as6908.net is the last hop [00:11:43] heh, yeah, that's not going to be anything to do with me [00:11:52] yeah, just panicked for a second [00:12:05] I can't take down servers or networks [00:12:08] AFAIK :) [00:12:10] I know [00:12:12] You can ;) [00:12:21] But that's quite hard [00:12:31] (03CR) 10Jforrester: "Does this need mirroring across to MW-Vagrant?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/250712 (owner: 10Alex Monk) [00:12:35] I just saw wikidata stop working the second your sync went out [00:12:45] so I thought I'd rather stop by to say hello :P [00:14:24] (03CR) 10Alex Monk: "Nope, Vagrant sets up MW via puppet: I65b60ae2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250712 (owner: 10Alex Monk) [00:14:36] (03PS3) 10Dzahn: apache: indentation of =>, minimal lint fix [puppet] - 10https://gerrit.wikimedia.org/r/250635 [00:14:48] (03PS4) 10Dzahn: apache: indentation of =>, minimal lint fix [puppet] - 10https://gerrit.wikimedia.org/r/250635 [00:14:51] 6operations, 6Commons, 10MediaWiki-Special-pages, 5MW-1.26-release, and 3 others: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1780264 (10Bawolff) >>! In T107265#1497228, @ori wrote: > This page is now marked as expensive, so... [00:14:55] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1184/" [puppet] - 10https://gerrit.wikimedia.org/r/250635 (owner: 10Dzahn) [00:16:09] Only my broadband is having trouble... several other things in Germany work (including my University network, a server of mine and mobile) [00:17:00] (03CR) 10MZMcBride: "I don't think this change should be merged and deployed until automatic updates from Git are in place. It seems like a major and entirely " [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [00:18:47] and it's back [00:25:18] James_F, syncing [00:25:26] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/250869/ (duration: 00m 19s) [00:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:22] looks good [00:29:40] James_F, do we also want to disable the education popups? [00:29:50] No. [00:30:00] And yes, looks good here too. [00:36:10] got another patch to add to swat [00:41:52] (03Abandoned) 10Dzahn: jmxtrans: auto-fixed indentation of => [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/250641 (owner: 10Dzahn) [00:48:34] (03PS2) 10Dzahn: icinga: move "standard" inclusion to icinga::web [puppet] - 10https://gerrit.wikimedia.org/r/250621 [00:48:54] (03CR) 10Dzahn: [C: 032] icinga: move "standard" inclusion to icinga::web [puppet] - 10https://gerrit.wikimedia.org/r/250621 (owner: 10Dzahn) [00:54:27] (03PS2) 10Dzahn: bastion: move 'standard' include to role [puppet] - 10https://gerrit.wikimedia.org/r/250618 [00:57:39] (03CR) 10Dzahn: [C: 032] bastion: move 'standard' include to role [puppet] - 10https://gerrit.wikimedia.org/r/250618 (owner: 10Dzahn) [01:01:00] (03PS1) 10Hoo man: Use Zend to create the DCAT-AP RDF on snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/250878 (https://phabricator.wikimedia.org/T117534) [01:01:09] ori: mutante: ^ easy one [01:02:54] !log Manually recreated https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf using Zend php (instead of HHVM). See T117534.
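(On Krenair's point around 00:05 above: Zuul won't follow a submodule bump in operations/mediawiki-config, so the portals update has to be pulled onto the beta host by hand. A minimal sketch of that manual step, assuming a staging-style checkout; the path is an assumption, not something stated in the log.)

    # hypothetical manual step on the beta deployment host
    cd /srv/mediawiki-staging             # assumed checkout location
    git pull                              # pick up the merged "Bump portals" commit
    git submodule update --init portals   # move the submodule to its new pointer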
[01:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:06:10] (03PS2) 10Dzahn: Use Zend to create the DCAT-AP RDF on snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/250878 (https://phabricator.wikimedia.org/T117534) (owner: 10Hoo man) [01:07:40] (03CR) 10Dzahn: [C: 032] Use Zend to create the DCAT-AP RDF on snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/250878 (https://phabricator.wikimedia.org/T117534) (owner: 10Hoo man) [01:08:49] thanks [01:10:32] hoo: np, already ran puppet on the snapshot host [01:10:38] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: https://gerrit.wikimedia.org/r/#/c/250874/ (duration: 00m 17s) [01:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:11:11] The script wont be invoked before next Monday, so I manually kicked it off [01:11:20] alright [01:11:57] that sync fixed the issue, btw [01:15:11] (03PS2) 10Dzahn: toollabs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250632 [01:15:19] (03CR) 10Dzahn: [C: 032] toollabs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250632 (owner: 10Dzahn) [01:16:23] (03PS2) 10Dzahn: deployment,aptly: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250642 [01:19:12] (03CR) 10Dzahn: [C: 032] deployment,aptly: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250642 (owner: 10Dzahn) [01:25:41] (03PS5) 10Yuvipanda: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [01:25:55] (03Abandoned) 10Dzahn: move mw jobqueue monitoring class out of misc [puppet] - 10https://gerrit.wikimedia.org/r/249345 (owner: 10Dzahn) [01:26:11] (03PS1) 10Ori.livneh: webperf: port services from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/250885 [01:26:51] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [01:27:16] (03PS2) 10Ori.livneh: webperf: port services from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/250885 [01:27:23] (03PS3) 10Dzahn: openstack: some more lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250064 [01:27:26] (03CR) 10Ori.livneh: [C: 032 V: 032] webperf: port services from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/250885 (owner: 10Ori.livneh) [01:27:33] YuviPanda, I've been busy all day so never got a chance to look at that. Does it address all the comments? [01:27:47] Other than the known TODO (don't restart DNS server every time puppet runs) [01:27:49] (03CR) 10Dzahn: [C: 032] openstack: some more lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250064 (owner: 10Dzahn) [01:27:56] I'm pretty sure we had a comment about adding a Bug: T line [01:27:59] (03PS1) 10Ori.livneh: re-apply webperf role on hafnium [puppet] - 10https://gerrit.wikimedia.org/r/250886 [01:27:59] which hasn't been done [01:28:12] (03CR) 10Ori.livneh: [C: 032 V: 032] re-apply webperf role on hafnium [puppet] - 10https://gerrit.wikimedia.org/r/250886 (owner: 10Ori.livneh) [01:28:42] Krenair: except the restart one [01:28:49] Krenair: yeah let me do that too [01:29:02] what else is missing? [01:29:08] Krenair: I think the restart stuff is it [01:29:30] I had an idea for the prevention of restarts without changes, what did you think of that? 
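(The exchange below converges on restarting the DNS server only when the generated file actually changes. A minimal Puppet sketch of that write-only-if-different pattern; the updater script, its flags, and the service name are all hypothetical.)

    # the assumed updater exits 0 from --check when nothing would change, so
    # puppet skips the exec and no notify fires; when content differs, the
    # exec rewrites the file and the notified service restarts exactly once
    exec { 'refresh-labs-dns-records':
        command => '/usr/local/sbin/update-labs-dns --write /etc/powerdns/labs.conf',
        unless  => '/usr/local/sbin/update-labs-dns --check /etc/powerdns/labs.conf',
        notify  => Service['pdns-recursor'],   # service name is an assumption
    }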
[01:29:41] (03PS4) 10Dzahn: openstack: some more lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250064 [01:30:31] (03PS6) 10Yuvipanda: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [01:30:49] Krenair: looking at it now [01:31:50] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [01:31:56] Krenair: we can't actually have puppet check if a file has changed if it hasn't changed through puppet [01:32:02] aka we can't track it through execs [01:32:06] (03PS1) 10Ori.livneh: re-apply eventlogging role on hafnium [puppet] - 10https://gerrit.wikimedia.org/r/250887 [01:32:15] (03CR) 10Ori.livneh: [C: 032 V: 032] re-apply eventlogging role on hafnium [puppet] - 10https://gerrit.wikimedia.org/r/250887 (owner: 10Ori.livneh) [01:32:29] you could look at the timestamps? [01:32:44] yeah but not via puppet [01:32:46] hmm [01:32:49] actually maybe, yeah [01:33:01] notify if mtime > something? [01:33:04] that'd work mutante [01:33:18] i thought something gets executed and looks at mtime or so, yea [01:33:47] yeah, so we can do a 'service restart' based on mtime [01:33:58] Krenair: so we'd need it to write to the file only if it is different [01:34:25] ok [01:35:12] so i just got a page and a clear and no icinga output in here? [01:35:21] did someone silence the bot? [01:35:38] good Q [01:35:40] idk [01:35:40] (03PS1) 10Ori.livneh: rename statsv.conf to statsv.service, now that it is a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/250888 [01:35:59] (03CR) 10Ori.livneh: [C: 032 V: 032] rename statsv.conf to statsv.service, now that it is a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/250888 (owner: 10Ori.livneh) [01:36:17] !log ran delete from globaluser where gu_name="" limit 1 on centralauth. (T96233) [01:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:36:52] robh: i dont think so, but icinga-wm is suspiciously quiet [01:36:54] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: puppet fail [01:36:58] lol, or not [01:37:01] wtf... [01:37:14] well, it didnt page about the mariadb lag on m3 slave [01:37:19] i mean, it paged, but didnt irc echo. [01:39:34] robh: the "Slave Lag: s3" check says it has not changed since 115 days.. uhm,... [01:39:44] is the paging not from icinga? [01:40:02] db1015? [01:40:06] I think jynus made that check not page [01:40:08] a few weeks ago [01:40:12] well not page all of us [01:40:27] (03PS1) 10Ori.livneh: Port statsv service to systemd [puppet] - 10https://gerrit.wikimedia.org/r/250889 [01:40:38] eh, db1048 [01:40:40] (03CR) 10Ori.livneh: [C: 032 V: 032] Port statsv service to systemd [puppet] - 10https://gerrit.wikimedia.org/r/250889 (owner: 10Ori.livneh) [01:41:04] i did get it too, and it's icinga... 
wth [01:41:31] at least that matches in web ui [01:41:43] so just the icinga-wm part is still odd [01:42:34] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [01:44:18] (03PS1) 10Ori.livneh: declare python-pykafka dependency for statsv [puppet] - 10https://gerrit.wikimedia.org/r/250890 [01:44:53] yea, so icinga did notify as normal and also wrote it to the bot logfile [01:45:08] must have been between the bot reading it there and the IRC network [01:45:25] robh: [01:49:05] (03PS1) 10Ori.livneh: include eventlogging::package in webperf role [puppet] - 10https://gerrit.wikimedia.org/r/250894 [01:49:19] (03CR) 10Ori.livneh: [C: 032] declare python-pykafka dependency for statsv [puppet] - 10https://gerrit.wikimedia.org/r/250890 (owner: 10Ori.livneh) [01:49:29] (03CR) 10Ori.livneh: [C: 032 V: 032] include eventlogging::package in webperf role [puppet] - 10https://gerrit.wikimedia.org/r/250894 (owner: 10Ori.livneh) [01:55:17] Krenair: did you have a host you were testing this on? [01:55:42] labs-dnsrecursor or something, in the openstack project [01:55:51] might've been labs-dnsrecursor2 [01:57:13] mutante: hafnium is now fully jessified, many thanks [01:57:24] (03PS1) 10Bgerstle: adds an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) [01:57:48] 6operations, 5Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1780627 (10ori) 5Open>3Resolved a:5ori>3Dzahn [01:58:08] 6operations, 5Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1774610 (10ori) Many thanks! [01:58:17] ori: :) [01:59:25] cu later [01:59:58] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1780634 (10ori) a:5ori>3None [02:00:07] cya [02:01:46] 6operations: Problems applying role::mediawiki to a fresh Trusty install - https://phabricator.wikimedia.org/T87550#1780638 (10ori) 5Open>3Resolved This must have been fixed at some point, because it has not been an issue for recent installs. Please re-open if that is not the case. [02:05:45] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1780643 (10ori) >>! In T107507#1534816, @coren wrote: > Opsen consensus on IRC is that jessie-backports should be disabled fleet-wide and any needed package brought into jessie-wikimedia. Why?... [02:06:11] Coren: ^? [02:14:37] 6operations, 10Analytics-EventLogging, 7Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1780649 (10ori) 5Open>3Resolved a:3ori [02:15:05] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Create a package for python-pykafka for ubuntu precise and debian sid - https://phabricator.wikimedia.org/T109567#1780652 (10ori) 5Open>3Resolved hafnium is now on jessie. [02:21:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [02:26:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 2 below the confidence bounds [02:30:37] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1780687 (10Dzahn) Yes, let's 2G please. 
I looked up what the next free element name is, it's going to be ... "rutherfordium" :) [02:31:46] (03PS1) 10Dzahn: introduce rutherfordium to netboot [puppet] - 10https://gerrit.wikimedia.org/r/250902 (https://phabricator.wikimedia.org/T117517) [02:32:49] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 07m 34s) [02:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:36:48] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-11-04 02:36:48+00:00 [02:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 5 below the confidence bounds [02:56:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 5 below the confidence bounds [03:03:18] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 10m 08s) [03:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:09:07] (03PS2) 10BryanDavis: Add an apple-app-site-association file to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [03:09:44] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.5) at 2015-11-04 03:09:44+00:00 [03:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:10:12] (03CR) 10BryanDavis: [C: 04-1] Add an apple-app-site-association file to support iOS deep-linking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [03:13:01] (03CR) 10BryanDavis: [C: 031] scap: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250634 (owner: 10Dzahn) [03:18:10] (03PS7) 10Yuvipanda: Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [03:19:03] Krenair: ^ different method but should work too [03:19:43] bd808: uh, nice catch :D [03:39:38] (03PS8) 10Yuvipanda: Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [03:39:58] Krenair: ok, I've fixed all the puppet issues [03:40:03] now we just need to test the python script itself [03:54:10] Is sqstat.pl still used?
[03:55:35] It's somewhat undocumented it seems and not mentioned in many places in puppet [03:55:51] I'd like to know where/how it runs and whether it is sampled or contains all varnish frontend stats [03:57:08] There is also a separate varnish::logging::reqstats resource used in puppet for frontend varnishes [03:58:11] text specifically* [03:58:17] which presumably doesn't include everything [04:03:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [04:10:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [04:16:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [04:27:43] https://grafana-admin.wikimedia.org/dashboard/db/varnish-http-errors [04:27:48] I've added some percentages [04:27:53] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [04:29:41] 6operations, 7Graphite, 7Monitoring: evaluate tessera dashboards - https://phabricator.wikimedia.org/T104366#1414678 (10Krinkle) [04:57:46] (03CR) 10Chad: [C: 032] ContentTranslation: Use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 (owner: 10Chad) [04:58:24] (03Merged) 10jenkins-bot: ContentTranslation: Use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 (owner: 10Chad) [04:59:30] !log demon@tin Started scap: using wfLoadExtension() for ContentTranslation, shut up warning spam [04:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:00:15] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.54% of data above the critical threshold [100000000.0] [05:00:33] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [05:02:16] !log demon@tin scap failed: CalledProcessError Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'tin.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24 (duration: 02m 45s) [05:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:03:13] Well shiz. [05:03:24] !log demon@tin Started scap: using wfLoadExtension() for ContentTranslation, shut up warning spam [05:06:42] ostriches: I'm around if you want [05:06:57] Eh, it was transient. [05:07:10] ah ok :) [05:07:14] A .swp file vanished during rsync which made rsync barf. [05:07:49] Note to self: exclude .swp files from rsync. [05:42:30] YuviPanda: small patch for you, not urgent. [05:52:46] !log demon@tin Finished scap: using wfLoadExtension() for ContentTranslation, shut up warning spam (duration: 49m 22s) [05:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:02:13] and it truly was a no-op! 
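(For the record, rsync exit status 24 means "partial transfer due to vanished source files", which matches the .swp diagnosis above. A sketch of the note-to-self, adding one exclude to the invocation quoted in the failed-scap message; this is not the actual scap patch.)

    /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete \
        --exclude='**/.svn/lock' --exclude='**/.git/objects' \
        --exclude='**/.git/**/objects' --exclude='**/cache/l10n/*.cdb' \
        --exclude='*.swp' \
        --no-perms tin.eqiad.wmnet::common /srv/mediawiki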
[06:12:14] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 33, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps DWDM]BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps DWDM]BR [06:13:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 114, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps DWDM]BR [06:13:13] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]BR [06:14:14] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:30:03] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [06:30:23] (03PS1) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 [06:30:34] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:44] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:54] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [06:31:04] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:24] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:44] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 3 failures [06:38:28] Q: How to use https://puppet-compiler.wmflabs.org :) [06:38:33] Any documentation? 
[06:44:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Nov 4 06:44:23 UTC 2015 (duration 44m 22s) [06:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:46:14] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1780914 (10Arrbee) p:5Normal>3High [06:46:26] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1334128 (10Arrbee) p:5High>3Normal [06:46:33] (03PS2) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [06:56:14] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:56:24] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:56:33] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:56:34] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:45] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:57:53] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:58:34] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:06:44] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: puppet fail [07:36:23] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:55:42] (03PS2) 10Giuseppe Lavagetto: maintenance: move upload stash off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249708 (https://phabricator.wikimedia.org/T116728) [08:14:25] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [08:20:27] <_joe_> uhm what's up with hafnium? 
[08:21:26] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move upload stash off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249708 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [08:22:54] I remember it being reinstalled, I think [08:23:43] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:25:15] (03PS2) 10Giuseppe Lavagetto: maintenance: move parser cache purging off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249709 (https://phabricator.wikimedia.org/T116728) [08:25:42] <_joe_> !log moved clean_uploadstash to mw1152 from terbium [08:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:27:07] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move parser cache purging off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249709 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [08:27:29] <_joe_> !log moved parser cache purging off of terbium [08:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:38:43] (03PS2) 10ArielGlenn: copy pagecounts-al-sites files over to labs from datasets [puppet] - 10https://gerrit.wikimedia.org/r/249175 (https://phabricator.wikimedia.org/T93317) [08:39:40] (03CR) 10ArielGlenn: [C: 032] copy pagecounts-al-sites files over to labs from datasets [puppet] - 10https://gerrit.wikimedia.org/r/249175 (https://phabricator.wikimedia.org/T93317) (owner: 10ArielGlenn) [08:44:35] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix up an import of a class now in a module [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250431 (owner: 10ArielGlenn) [08:46:09] (03CR) 10ArielGlenn: [C: 032 V: 032] dumpadmin script: add "rerun" which reruns a broken job [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249754 (owner: 10ArielGlenn) [08:47:49] !log reduced durability on dbstore1002 and db1047 to improve performance [08:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:48:08] (03CR) 10ArielGlenn: [C: 032 V: 032] dumpadmin: mark a job for wiki latest run as done or failed [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250432 (owner: 10ArielGlenn) [08:54:34] (03Abandoned) 10ArielGlenn: dumps: move Runner, DumpItemList out of command line script to module [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250433 (owner: 10ArielGlenn) [08:54:52] (03PS1) 10ArielGlenn: dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250921 [08:55:55] (03PS2) 10ArielGlenn: Add cirrussearch to dumps.wikimedia.org/other html page [puppet] - 10https://gerrit.wikimedia.org/r/249761 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson) [08:56:51] Morning godog! Any chance you could take a look at this? https://gerrit.wikimedia.org/r/#/c/247866/ I have been steered toward you as you apparently do lots of graphite things ;) [08:58:18] addshore: Denied. [08:58:24] :D [08:59:57] * addshore slaps Reedy_ with a large trout [09:00:18] (03CR) 10ArielGlenn: [C: 032] Add cirrussearch to dumps.wikimedia.org/other html page [puppet] - 10https://gerrit.wikimedia.org/r/249761 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson) [09:01:14] addshore: hey, I replied on the ticket but not on gerrit heh, anyways it feels more an analytics type of analysis (?) [09:01:52] ahh yes! 
:) [09:03:17] (03PS2) 10ArielGlenn: dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250921 [09:04:02] well, currently I am just shoving all of the numbers in labs-store.staging in wikidata_* tables. Simply calling statsd is easier of course ;) But fair enough, maybe I'll stick with sql dbs and hadoop [09:04:19] or just flat files [09:05:11] And I'll just write some wrapper for doing so ;) [09:06:44] addshore: I'm not opposed to it in principle btw, but having the raw data stored in hadoop might prove more flexible e.g. if you need to ask other questions [09:08:46] well, the data is essentially just a number / count with a timestamp attached to it :/ [09:09:01] which is again one of the reasons I was thinking graphite [09:09:29] But on the other side of the argument, in graphite I have still set the retention to 25 years, which means we wouldn't actually be keeping it forever. [09:09:35] (03PS2) 10ArielGlenn: Replace Bugzilla with Phabricator [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243364 (owner: 10John Vandenberg) [09:10:07] Right now I'm creating a single database table for every number I am tracking, which is a bit ugly, but would be fine with a wrapper I imagine! [09:10:10] (03PS1) 10Giuseppe Lavagetto: maintenance: move wikidata jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250923 (https://phabricator.wikimedia.org/T116728) [09:10:12] (03PS1) 10Giuseppe Lavagetto: maintenance: move refreshlinks off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250924 (https://phabricator.wikimedia.org/T116728) [09:10:14] (03PS1) 10Giuseppe Lavagetto: maintenance: move pagetriage off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250925 (https://phabricator.wikimedia.org/T116728) [09:10:16] (03PS1) 10Giuseppe Lavagetto: maintenance: move translation-related jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250926 (https://phabricator.wikimedia.org/T116728) [09:10:18] (03PS1) 10Giuseppe Lavagetto: maintenance: move email batch, flaggedrevs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250927 (https://phabricator.wikimedia.org/T116728) [09:10:20] (03PS1) 10Giuseppe Lavagetto: maintenance: move update article count off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250928 (https://phabricator.wikimedia.org/T116728) [09:10:22] (03PS1) 10Giuseppe Lavagetto: maintenance: move updatequerypages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250929 (https://phabricator.wikimedia.org/T116728) [09:10:24] (03PS1) 10Giuseppe Lavagetto: terbium: remove role mediawiki::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/250930 (https://phabricator.wikimedia.org/T116728) [09:10:26] (03PS1) 10Giuseppe Lavagetto: terbium: move mediawiki monitoring [puppet] - 10https://gerrit.wikimedia.org/r/250931 (https://phabricator.wikimedia.org/T116728) [09:10:59] (03CR) 10ArielGlenn: [C: 032 V: 032] Replace Bugzilla with Phabricator [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243364 (owner: 10John Vandenberg) [09:12:05] addshore: I see, unrelated but I surely hope we won't be using graphite 25y from now :P not even 5 possibly [09:12:22] haha, true ;) [09:13:42] right, I will try and make a wrapper of some sorts today, but I'll leave that patch and ticket open for now until I have another concrete route!
:) [09:14:08] (03CR) 10Addshore: [C: 04-1] "on hold after a discussion with godog" [puppet] - 10https://gerrit.wikimedia.org/r/247866 (https://phabricator.wikimedia.org/T117402) (owner: 10Addshore) [09:15:36] (03CR) 10Filippo Giunchedi: Retain daily.* graphite metrics for longer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247866 (https://phabricator.wikimedia.org/T117402) (owner: 10Addshore) [09:16:55] addshore: sounds good, let me know how that goes [09:17:26] (03PS1) 10Muehlenhoff: Reorg server groups for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/250932 [09:17:52] godog: will do! many thanks! :) [09:18:59] np! [09:19:04] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [09:19:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 116, down: 0, dormant: 0, excluded: 1, unused: 0 [09:20:23] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 [09:21:18] (03PS1) 10Filippo Giunchedi: cassandra: add xenon-b instance [puppet] - 10https://gerrit.wikimedia.org/r/250933 [09:22:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add xenon-b instance [puppet] - 10https://gerrit.wikimedia.org/r/250933 (owner: 10Filippo Giunchedi) [09:25:21] !log bootstrap cassandra xenon-b.eqiad.wmnet [09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:37:43] PROBLEM - cassandra-b CQL 10.64.0.203:9042 on xenon is CRITICAL: Connection refused [09:37:58] icinga-wm: shush [09:39:57] 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown, 10Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1781063 (10Addshore) [09:52:32] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, modulo what mobrovac said re: refresh" [puppet] - 10https://gerrit.wikimedia.org/r/250682 (https://phabricator.wikimedia.org/T103134) (owner: 10Giuseppe Lavagetto) [09:55:06] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781117 (10fgiunchedi) xenon-b bootstrapping ``` $ nodetool-b netstats | grep "bytes total" Receiving 53 files, 49753415567 bytes total.... 
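(On the "Retain daily.* graphite metrics for longer" change under review above: a cheap way to sanity-check such a retention before deploying it is to create a throwaway whisper file with the stock whisper CLI tools and inspect it. One point per day for 25 years is roughly 9,131 points at 12 bytes each, on the order of 100 KB per metric; the path below is arbitrary.)

    # "1d:25y" is the same retention string a storage-schemas.conf stanza
    # would use, e.g.  retentions = 1d:25y
    whisper-create.py /tmp/daily-test.wsp 1d:25y
    whisper-info.py /tmp/daily-test.wsp    # prints archive layout and file size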
[09:57:59] (03PS6) 10Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) [10:04:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [10:08:53] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:11:15] (03PS2) 10Giuseppe Lavagetto: ganglia: remove jobqueue stats [puppet] - 10https://gerrit.wikimedia.org/r/250674 [10:13:12] PROBLEM - Host ms-be2017 is DOWN: PING CRITICAL - Packet loss = 100% [10:14:11] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia: remove jobqueue stats [puppet] - 10https://gerrit.wikimedia.org/r/250674 (owner: 10Giuseppe Lavagetto) [10:14:23] RECOVERY - Host ms-be2017 is UP: PING OK - Packet loss = 0%, RTA = 34.49 ms [10:15:22] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:47] (03CR) 10DCausse: "There's also elastic 1.7.3 which contains a fix for If796b85 and some fixes concerning synced flush." [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson) [10:56:39] (03CR) 10Alex Monk: "Only wikipedia because there's only an app for wikipedia?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [11:07:48] !log testing new mariadb-server version on db2056.codfw.wmnet [11:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:14] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1781285 (10fgiunchedi) 3NEW [11:30:22] (03PS1) 10KartikMistry: Beta: Consistence quote [puppet] - 10https://gerrit.wikimedia.org/r/250946 [11:30:30] (03PS2) 10Giuseppe Lavagetto: maintenance: move wikidata jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250923 (https://phabricator.wikimedia.org/T116728) [11:30:50] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [11:31:39] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move wikidata jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250923 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [11:32:29] <_joe_> !log migrating wikidata jobs off of terbium [11:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:42] (03PS1) 10KartikMistry: Beta: Keep deployment-urldownloader format same as other [puppet] - 10https://gerrit.wikimedia.org/r/250947 [11:35:04] j [11:52:10] 6operations, 7Performance: ERR_SPDY_PROTOCOL_ERROR while opening files at commons - https://phabricator.wikimedia.org/T115541#1781407 (10zhuyifei1999) [11:53:24] (03PS2) 10Giuseppe Lavagetto: maintenance: move refreshlinks off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250924 (https://phabricator.wikimedia.org/T116728) [11:54:07] (03PS2) 10Filippo Giunchedi: swift: force rsync protocol version 30 [puppet] - 10https://gerrit.wikimedia.org/r/250693 (https://phabricator.wikimedia.org/T93587) [11:54:21] (03Abandoned) 10Filippo Giunchedi: swift: force rsync protocol version 30 [puppet] - 10https://gerrit.wikimedia.org/r/250693 (https://phabricator.wikimedia.org/T93587) (owner: 10Filippo Giunchedi) [11:59:05] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move refreshlinks off of 
terbium [puppet] - 10https://gerrit.wikimedia.org/r/250924 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [11:59:56] <_joe_> !log moved refreshlink jobs off terbium [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:21:30] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [12:22:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [5000000.0] [12:27:00] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [500.0] [12:32:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [12:34:19] interestingly, from the new reqstats stuff, the 5xx spike referenced above can be isolated to only the text cluster in esams [12:34:55] isn't 500 now also up? [12:35:00] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [12:35:13] it was mostly 503 [12:35:20] yes, the spike [12:35:32] but after that [12:35:53] bulk of the spike was 12:17 - 12:19 UTC, with maybe a partial minute of ramp in/out on either side of that [12:36:18] ok [12:36:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:36:59] the graphite anomaly reported is often a bit async/late [12:37:34] I don't see the other 5xx codes having anything too strange. there's some very small elevations in 500, but not big enough to be sure it's not just part of the usual background noise [12:37:37] no, but I referred to this: [12:37:38] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1hours&from=-1hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [12:38:08] ok with the "background noise" explanation [12:38:45] specially because it is down again [12:38:57] *going [12:38:59] https://grafana-admin.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1446619127505&to=1446640727505&var-site=All&var-cache_type=All&var-status_type=5 [12:39:07] ^ look through that, it's an interesting view [12:39:38] you know what tricked me? [12:39:38] can click on the legend stuff at the bottom of the bottom graph too, to bring out just 500 or 503 alone [12:39:53] the different scale on 500 and 5xx [12:39:58] on grafana [12:40:00] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /srv 12900 MB (3% inode=99%) [12:40:11] to isolate, I just tried different clusters and/or DCs in the drop-downs at the top. Obviously, we need better visualizations so that narrowing things isn't so manual. [12:40:16] but it works for now as an interactive tool heh [12:40:46] it is nice to have it, but it can deceive you [12:40:52] yeah [12:41:13] bblack, that is a nice dashboard [12:41:40] do you know what it would be better than that? [12:41:58] probably we'll end up making several different dashboards from that data. there's about 100 useful ways to slice it to see something interesting. [12:42:02] something to do some kind of pattern, based on db, url, server, etc. [12:42:22] s/db/datacenter/ [12:42:28] that data isn't available. 
pretty much the info you see there is all we have to go on for stats [12:42:51] datacenter name, cluster (text, upload, misc, etc), and the status code [12:43:03] that's enough, [12:43:29] well and also request-method, but the method stats and status stats are independent. You can see that it's 67% GET and it's 4% 503, but you can't tell directly that the 503s were all GETs or not. [12:43:46] so we do not have to go testing each one [12:44:10] but anyways, the one I linked with all the dropdowns at the top, is mostly an exploring tool in practice [12:44:39] I think we'll want some succinct dashboards that e.g. visually break down just 5xx and make it obvious where it's coming from and what it is, without using dropdowns. [12:48:49] what's interesting about that one spike though is it's apparently *only* esams text cluster. no other DC, and no other cluster in esams. [12:49:18] my guess would be it was induced by user traffic somehow [12:49:55] or alternatively, it was an ipsec fail causing a short interrupt of traffic for some esams<->eqiad text traffic but not all [12:52:49] RECOVERY - pybal on lvs1012 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:55:49] RECOVERY - pybal on lvs1011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:57:10] RECOVERY - pybal on lvs1009 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:57:20] RECOVERY - pybal on lvs1008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:59:00] RECOVERY - Disk space on xenon is OK: DISK OK [12:59:52] RECOVERY - pybal on lvs1010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:03:15] !log upgrading pybal (1.10 -> 1.12) on lvs4004.ulsfo.wmnet (inactive low-traffic LVS @ ulsfo) [13:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:03:31] well not low-traffic, high-traffic2 heh [13:06:02] (03PS1) 10Alexandros Kosiaris: Introduce rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/250954 (https://phabricator.wikimedia.org/T117517) [13:09:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [13:18:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:20:31] (03PS2) 10Alexandros Kosiaris: Introduce rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/250954 (https://phabricator.wikimedia.org/T117517) [13:20:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/250954 (https://phabricator.wikimedia.org/T117517) (owner: 10Alexandros Kosiaris) [13:25:58] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Consistence quote [puppet] - 10https://gerrit.wikimedia.org/r/250946 (owner: 10KartikMistry) [13:26:03] (03PS2) 10Alexandros Kosiaris: Beta: Consistence quote [puppet] - 10https://gerrit.wikimedia.org/r/250946 (owner: 10KartikMistry) [13:26:10] (03CR) 10Alexandros Kosiaris: [V: 032] Beta: Consistence quote [puppet] - 10https://gerrit.wikimedia.org/r/250946 (owner: 10KartikMistry) [13:28:57] (03PS2) 10Alexandros Kosiaris: Beta: Keep deployment-urldownloader format same as other [puppet] - 10https://gerrit.wikimedia.org/r/250947 (owner: 10KartikMistry) [13:29:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Beta: Keep deployment-urldownloader format same as other [puppet] - 10https://gerrit.wikimedia.org/r/250947 (owner: 10KartikMistry) [13:42:42] (03CR) 10Alexandros
Kosiaris: [C: 032] Update Parsoid server.js path in the upstart config [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [13:42:47] (03PS4) 10Alexandros Kosiaris: Update Parsoid server.js path in the upstart config [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [13:42:51] (03CR) 10Alexandros Kosiaris: [V: 032] Update Parsoid server.js path in the upstart config [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [13:58:16] (03PS1) 10Jcrespo: Depool db2048 from codfw for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250957 [14:01:58] (03CR) 10Jcrespo: [C: 032] Depool db2048 from codfw for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250957 (owner: 10Jcrespo) [14:04:19] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2048 for maintenance (duration: 00m 18s) [14:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:42] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1781708 (10mobrovac) >>! In T117560#1778582, @yuvipanda wrote: > From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis to... [14:15:02] (03PS3) 10ArielGlenn: dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250921 [14:20:21] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781721 (10fgiunchedi) as it stands however there isn't enough space on eqiad test cluster to allocate two instances (I've stopped bootstrapping `... [14:20:48] PROBLEM - service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [14:22:45] expected ^ [14:23:19] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [14:23:58] 6operations, 5Patch-For-Review, 7Swift: incompatible rsync transfers between rsync 3.0.9 and 3.1 (precise vs trusty) - https://phabricator.wikimedia.org/T93587#1781743 (10MoritzMuehlenhoff) I would tend towards upgrading the remaining precise hosts to a trusty backport of rsync, after all the same problem yo... [14:27:23] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1781745 (10mobrovac) >>! In T114443#1777558, @Ottomata wrote: > @gwicke, I think this may be a problem. From my perspective, the goal of this project is a generalized event servi... [14:30:26] !log cloning sqldata from db2048 to db2055 [14:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:07] (03CR) 10Hashar: "Paladox please stop adding me as a reviewer to puppet changes. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [14:44:09] (03PS2) 10Alexandros Kosiaris: introduce rutherfordium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/250902 (https://phabricator.wikimedia.org/T117517) (owner: 10Dzahn) [14:46:13] (03PS1) 10Filippo Giunchedi: swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) [14:47:13] godog: I think I have decided graphite might still be the best option [14:49:23] addshore: ack, see also my comment on the code review [14:49:29] well, graphite & statsd. It seems silly to recreate it using some other backend ;) [14:49:47] (03PS1) 10Jcrespo: Depool db2049 and db2050 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250970 [14:50:14] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781831 (10Eevans) >>! In T95253#1781721, @fgiunchedi wrote: > as it stands however there isn't enough space on eqiad test cluster to allocate two... [14:50:46] I did see that too! I believe the data is aggregated in different ways? *goes to find the bit in the config* [14:51:29] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/graphite.pp#L49 [14:52:08] yeah but I don't think you'd want it aggregated after a week no? [14:52:14] yes, bah, aggregation will be annoying [14:52:34] godog: no :/ Either an average, or the value at the beginning of the week [14:52:41] or simply keep daily data :/ [14:52:49] yeah daily seems best [14:53:11] that means moooore space though :p [14:53:34] *goes to do the maths of the daily stuff compared with the current stuff* [14:53:38] (03CR) 10Jcrespo: [C: 032] Depool db2049 and db2050 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250970 (owner: 10Jcrespo) [14:54:23] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781842 (10fgiunchedi) yeah we can proceed with praseodymium and cerium meanwhile to save time. I don't think it'd be able to finish bootstrapping... [14:55:08] addshore: yeah I wouldn't worry too much about that, it is tiny anyways, you can play with whisper-create [14:55:16] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2049 and db2050 for maintenance (duration: 00m 17s) [14:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:52] addshore: also the number of distinct metrics doesn't seem high? [14:57:18] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781844 (10Eevans) >>! In T95253#1781842, @fgiunchedi wrote: > yeah we can proceed with praseodymium and cerium meanwhile to save time. I don't th... [14:57:37] 6operations, 5Patch-For-Review, 7Swift: incompatible rsync transfers between rsync 3.0.9 and 3.1 (precise vs trusty) - https://phabricator.wikimedia.org/T93587#1781845 (10fgiunchedi) sounds good!
I've backported `rsync_3.1.0-2ubuntu0.1~wmf1` to precise and will upload to `precise-wikimedia` [14:58:07] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [14:59:57] godog: heh, yeah, keeping 10 years of daily stuff is a drop in the ocean... about 1/3rd of the data points of just the first 7 days of the other metrics ;) [14:59:59] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [15:01:06] and the number of metrics we want to store is rather small, especially considering the amount of data already in graphite [15:02:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds [15:02:46] I would guess a hundred or so static things currently, but a few more dynamic ones (that might not be tracked forever) [15:03:52] for the dynamic ones right now one of the things I am tracking could in theory be split into 2000 points per day, but lots of those are 0s / I wouldn't send to graphite... and really right now there are about 20 there... [15:04:30] addshore: ack, can you put an estimation on how many metrics on the ticket? thanks [15:04:35] let's sync up there [15:04:50] yeah, I'll go and amend it now to get rid of the aggregation [15:08:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 6 below the confidence bounds [15:08:58] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-maps was over 1 day, 1:00:00 ago [15:11:02] (03PS2) 10Giuseppe Lavagetto: maintenance: move pagetriage off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250925 (https://phabricator.wikimedia.org/T116728) [15:11:15] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/250925 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:12:34] (03PS2) 10Giuseppe Lavagetto: maintenance: move translation-related jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250926 (https://phabricator.wikimedia.org/T116728) [15:12:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/250926 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:13:35] (03PS5) 10Addshore: Retain daily.* graphite metrics for longer (25y) [puppet] - 10https://gerrit.wikimedia.org/r/247866 (https://phabricator.wikimedia.org/T117402) [15:13:47] godog: ^^ ;) I sum everything up in the commit message [15:15:07] <_joe_> !log moved translationnotification, updatetranslationstats, pagetriage from terbium to mw1152 [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:54] (03PS2) 10Giuseppe Lavagetto: maintenance: move email batch, flaggedrevs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250927 (https://phabricator.wikimedia.org/T116728) [15:18:37] NTP issue or are those new installs? [15:19:06] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move email batch, flaggedrevs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250927 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:19:39] godog: in theory there are some daily things we want to keep /forever/ but others that we could easily kill after a year.
but these things that could only last for 1 year again wouldnt really fit with the current aggregation stuff :P [15:19:45] (03PS2) 10Giuseppe Lavagetto: maintenance: move update article count off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250928 (https://phabricator.wikimedia.org/T116728) [15:20:10] <_joe_> addshore: whatever we want to keep "forever", should be somewhere else than graphite [15:20:19] <_joe_> or in general any ops monitoring tool [15:20:36] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/250928 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:20:48] !log syncing sqldata on db2049 -> db2056, db2050 -> db2057, eta 2-4 hours [15:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:25] I was studying graphite storage model during the weekend [15:21:43] It is ok, I only see a couple of issues: [15:21:56] well _joe_ graphite does exactly what I want it to do, the only issue is the config [15:21:57] bad for sparse data (data that does not change much) [15:22:27] and reliance on too many filesystem files in our case [15:22:29] <_joe_> !log moving email batch, flagged reviews, update article count off of terbium [15:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:39] <_joe_> addshore: well, theory and practice tend to differ [15:23:13] <_joe_> addshore: my point being that any opsen if given the choice between a 10 minutes downtime of a monitoring tool and dropping old data will choose the latter [15:23:31] <_joe_> because it's a tool we use for monitoring production /now/ [15:23:54] <_joe_> if we want to keep historical data about performance, better options are available IMO to store such data [15:24:56] the good thing is that whisper is literally 1 file of code, so it would be very easy to implement a different backend, if needed [15:25:24] plus it is easy to shard [15:25:44] well, from what you have said _joe_ it seems like 2 graphite instances could be a solution. or what other alternatives could you point me toward? [15:27:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [15:28:54] (03PS3) 10Alexandros Kosiaris: introduce rutherfordium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/250902 (https://phabricator.wikimedia.org/T117517) (owner: 10Dzahn) [15:29:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] introduce rutherfordium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/250902 (https://phabricator.wikimedia.org/T117517) (owner: 10Dzahn) [15:33:29] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [15:37:28] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [15:37:55] _joe_: quick question, will mw1152 aka new terbium just be called "mw1152"? 
[15:38:29] <_joe_> greg-g: we can change that [15:38:55] <_joe_> greg-g: my final plan was to get everything off terbium, and reimage it eventually [15:38:59] * greg-g nods [15:39:00] gotcha [15:39:06] thanks :) [15:39:10] <_joe_> but well, it depends on a few factors [15:39:15] always [15:39:32] <_joe_> I will send out an email asking people to stop using terbium soon [15:39:50] (03CR) 10Bgerstle: "eep, amending now X-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [15:39:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [15:40:00] <_joe_> anyone who had access to terbium has access to the new server [15:40:23] (03PS3) 10Bgerstle: adds an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) [15:40:34] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1781916 (10Papaul) [15:40:35] 6operations, 10ops-codfw, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: on-site tasks for labs deployment cluster - https://phabricator.wikimedia.org/T117107#1781914 (10Papaul) 5Open>3Resolved This task is complete. closing it [15:40:43] _joe_: kk, ty [15:41:38] (03PS4) 10Bgerstle: Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) [15:42:03] (03CR) 10Bgerstle: "Alex Monk, that's correct. Starting with Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [15:42:42] (03CR) 10Bgerstle: "bd808, this will be served from root of both *.wikpedia.org and *.m.wikipedia.org?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [15:43:23] (03PS2) 10Giuseppe Lavagetto: maintenance: move updatequerypages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250929 (https://phabricator.wikimedia.org/T116728) [15:44:11] jynus: yeah whisper implementation is simple, not really friendly in terms of seeks though, way less of a problem nowadays with ssd, still [15:45:58] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781920 (10GWicke) Alternatively, wouldn't rm -rf /var/lib/cassandra/* clear up the space as well? [15:46:29] (03PS1) 10Filippo Giunchedi: cassandra: add cerium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/250975 [15:49:14] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move updatequerypages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/250929 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [15:49:22] (03PS1) 10Muehlenhoff: Enable ferm on logstash1004 [puppet] - 10https://gerrit.wikimedia.org/r/250976 [15:49:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [15:50:37] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781940 (10fgiunchedi) >>! 
In T95253#1781920, @GWicke wrote: > Alternatively, wouldn't rm -rf /var/lib/cassandra/* clear up the space as well? fr... [15:50:58] (03PS2) 10Rush: Enable ferm on logstash1004 [puppet] - 10https://gerrit.wikimedia.org/r/250976 (owner: 10Muehlenhoff) [15:51:35] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781948 (10Eevans) >>! In T95253#1781920, @GWicke wrote: > Alternatively, wouldn't rm -rf /var/lib/cassandra/* clear up the space as well? Can yo... [15:53:37] (03PS1) 10Papaul: Fixed productionn DNS for ms-be2020 and ms-be2021 Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) [15:54:12] I might have an extra patch to add to swat [15:56:03] (03CR) 10Rush: [C: 032] Enable ferm on logstash1004 [puppet] - 10https://gerrit.wikimedia.org/r/250976 (owner: 10Muehlenhoff) [15:56:05] or two [15:57:59] (03CR) 10Filippo Giunchedi: [C: 031] Fixed productionn DNS for ms-be2020 and ms-be2021 Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [15:58:17] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1781977 (10GWicke) > from xenon? I was thinking of praseodymium and cerium, but re-imaging is probably desirable anyway. We should probably look... [15:58:27] (03CR) 10BryanDavis: "> this will be served from root of both *.wikpedia.org and *.m.wikipedia.org?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [15:58:43] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1781982 (10Joe) Hi, sorry to jump in when a discussion is already underway (which seems my speciality nowada... [15:59:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [16:00:41] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1781998 (10scfc) Is the instance still inaccessible after the first Puppet run (30 minutes)? Otherwise this looks like a duplicate of T110304. [16:01:25] !log reimage cerium for multi-instance [16:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:43] jouncebot, next [16:02:43] In 2 hour(s) and 57 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151104T1900) [16:02:51] What happened to swat, jouncebot? [16:03:06] Krinkle, hey [16:03:13] o/ [16:03:55] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [16:04:14] Krinkle: Is there any chance someone could C+2 the core patch in master before it's SWATed? ;-) [16:04:51] Krinkle, this isn't merged on master... [16:04:55] It might not even need to be in master, rather the dependent patch will go in master I was hoping. This is just to gather some data in logstash to see if the problem is fixed [16:05:45] You could do the $text === null check before is_string, and avoid the $text !== null check [16:07:05] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [16:07:19] <_joe_> what's happening to cerium?
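The retention change addshore and godog hash out above comes down to one stanza in graphite's storage-schemas.conf; a minimal sketch, assuming the daily.* pattern and 25-year period from the patch (the section name is illustrative):

    [daily]
    pattern = ^daily\.
    # one datapoint per day, kept for 25 years: roughly 9000 points per metric,
    # tiny next to the per-minute series discussed above
    retentions = 1d:25y

Whisper preallocates each metric's whole archive up front, so a long retention on a low-frequency series stays cheap on disk, which is the "drop in the ocean" point.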
[16:07:47] jouncebot: next [16:07:48] In 2 hour(s) and 52 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151104T1900) [16:08:03] * greg-g shrugs [16:08:06] jouncebot: refresh [16:08:07] I refreshed my knowledge about deployments. [16:08:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [16:09:01] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1782049 (10fgiunchedi) >>! In T95253#1781977, @GWicke wrote: >> from xenon? > > I was thinking of praseodymium and cerium, but re-imaging is prob... [16:11:06] PROBLEM - cassandra CQL 10.64.16.147:9042 on cerium is CRITICAL: Connection refused [16:11:15] PROBLEM - puppet last run on cerium is CRITICAL: Connection refused by host [16:11:25] PROBLEM - Check size of conntrack table on cerium is CRITICAL: Connection refused by host [16:11:35] PROBLEM - configured eth on cerium is CRITICAL: Connection refused by host [16:11:40] (03PS2) 10Filippo Giunchedi: cassandra: add cerium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/250975 [16:11:44] PROBLEM - DPKG on cerium is CRITICAL: Connection refused by host [16:11:44] PROBLEM - salt-minion processes on cerium is CRITICAL: Connection refused by host [16:11:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add cerium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/250975 (owner: 10Filippo Giunchedi) [16:11:56] PROBLEM - Disk space on cerium is CRITICAL: Connection refused by host [16:12:05] PROBLEM - service on cerium is CRITICAL: Connection refused by host [16:12:14] PROBLEM - RAID on cerium is CRITICAL: Connection refused by host [16:12:15] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused [16:12:25] PROBLEM - dhclient process on cerium is CRITICAL: Connection refused by host [16:12:34] PROBLEM - Restbase endpoints health on cerium is CRITICAL: Connection refused by host [16:12:59] Krenair: Go. [16:13:13] (03PS1) 10Filippo Giunchedi: cassandra: add cerium-{a,b} to seeds [puppet] - 10https://gerrit.wikimedia.org/r/250985 [16:14:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 4 below the confidence bounds [16:14:25] (03PS1) 10Muehlenhoff: Enable ferm for remaining logstash hosts [puppet] - 10https://gerrit.wikimedia.org/r/250986 [16:14:43] (03PS2) 10Filippo Giunchedi: cassandra: add cerium-{a,b} to seeds [puppet] - 10https://gerrit.wikimedia.org/r/250985 [16:14:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add cerium-{a,b} to seeds [puppet] - 10https://gerrit.wikimedia.org/r/250985 (owner: 10Filippo Giunchedi) [16:15:11] (03PS2) 10Rush: Enable ferm for remaining logstash hosts [puppet] - 10https://gerrit.wikimedia.org/r/250986 (owner: 10Muehlenhoff) [16:15:51] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1782090 (10GWicke) > we could get bigger SSDs for those too, how would we go about trimming space if need be? We'd need to delete data. The quick... 
[16:16:18] (03CR) 10Rush: [C: 032] Enable ferm for remaining logstash hosts [puppet] - 10https://gerrit.wikimedia.org/r/250986 (owner: 10Muehlenhoff) [16:16:26] (03CR) 10Bgerstle: "bd808 great, thanks :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [16:18:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [16:18:58] (03PS5) 10BryanDavis: Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [16:19:45] (03CR) 10BryanDavis: [C: 031] "I made the commit message a bit more informative." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [16:20:46] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1782099 (10akosiaris) Hello, I am having a hard time grasping what we are talking about here to be honest.... [16:20:55] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1782101 (10Nuria) >I don't see these two as being mutually-exclusive. In order to meet the end goal of a generalised event service we are starting with the Services' use case. The... [16:23:27] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1782118 (10mobrovac) >>! In T95253#1782090, @GWicke wrote: > We'd need to delete data. The quickest way would be to drop all wikipedia keyspaces,... 
[16:24:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds [16:24:45] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 26, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 63, initializing_shards: 5, number_of_data_nodes: 3, [16:25:24] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 26, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 63, initializing_shards: 5, number_of_data_nodes: 3, [16:25:53] moritzm, chasemp: It looks like logstash1004 got really sad [16:25:55] we know and are waiting on this to go green ^ [16:25:56] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 26, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 63, initializing_shards: 5, number_of_data_nodes: 3, [16:25:58] yeah [16:25:58] * bd808 looks for logs [16:26:10] we think a blip during fw apply [16:26:20] which if so it's a one time thing we have to wait out [16:26:24] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 26, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 63, initializing_shards: 5, number_of_data_nodes: 3, [16:26:34] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 26, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 63, initializing_shards: 5, number_of_data_nodes: 3, [16:26:34] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 26, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 63, initializing_shards: 5, number_of_data_nodes: 3, [16:27:07] It looks like 1004 dropped out of the cluster entirely with zen-disco-master_failed errors [16:27:11] (03CR) 1020after4: "How am I omitting the heading http?" 
[puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:27:53] chasemp: I'm going to restart elasticsearch there [16:27:57] ok [16:28:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds [16:28:22] !log db2056 clone failed due to wrong partitioning, needs reinstall. Cloning db2049 -> db2063 instead [16:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:20] !log restarted elasticsearch on logstash1004; it fell out of the cluster during ferm firewall application [16:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:24] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 88, initializing_shards: 6, number_of_data_nodes: 3, delayed_unassigned_sh [16:29:39] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1782155 (10GWicke) > Why would we need to wait for all of the data to be back there? I never said we'd need to wait. However, no data makes for l... [16:29:55] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 88, initializing_shards: 6, number_of_data_nodes: 3, delayed_unassigned_sh [16:30:17] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/250971/ (duration: 00m 18s) [16:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:21] moritzm, chasemp: It is healing now. It looks like it may take a while to settle back to green everywhere [16:30:24] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 88, initializing_shards: 6, number_of_data_nodes: 3, delayed_unassigned_sh [16:30:32] James_F, ^ [16:30:34] bd808: did it require the restart?
that would be interesting [16:30:35] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 88, initializing_shards: 6, number_of_data_nodes: 3, delayed_unassigned_sh [16:30:35] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 88, initializing_shards: 6, number_of_data_nodes: 3, delayed_unassigned_sh [16:30:44] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 88, initializing_shards: 6, number_of_data_nodes: 3, delayed_unassigned_sh [16:31:06] Krenair: Thanks. Testing now. [16:31:33] lgtm [16:31:38] chasemp: the restart seemed to be needed, yes. The logs showed the master discovery failures and nothing else. Strangely, however, it wasn't split-brain. I could see all the other nodes from monitors on 1004 [16:32:08] Krenair: The 404 issue is still there for me? [16:32:19] I think this is a product of the unicast vs multicast in prod difference [16:32:25] for node reintegration [16:32:34] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:45] ...and I think we talked about this before :) deja vu [16:32:47] James_F, WFM... [16:33:16] chasemp: well this time should be the last right :) we have the firewalls up finally [16:33:23] yes [16:33:29] Krenair: Yup, working now… Must have been caching. [16:33:45] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [16:34:12] bd808: and if there's a new logstash node in the future, it'll have ferm from the (I'll enable it in the role once we're done with 100[4-6] [16:34:20] (from the start) [16:35:41] chasemp: the cluster status now is that all shards are recovering. Once that's finished I expect to see some rebalancing because there are several shards I dropped the replica count on earlier this week as a temporary disk space saving measure. At some point the cluster will realize that 1005 and 1006 have more shards than 1004 and rebalance to fix that [16:35:55] understood [16:36:07] we can wait [16:36:26] worst case I'll do it if it gets late for moritz, moritzm you cool w/ this? [16:36:59] that's fine, I'll be around [16:40:29] (03CR) 10Chad: "It's url-downloader.wikimedia.org, not eqiad.wmnet." [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:40:57] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1782199 (10mobrovac) >>! In T114443#1782101, @Nuria wrote: > I sure hope we are not thinking of having a node rest endpoint and another one based on eventlogging at the same time,...
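The yellow-to-green recovery bd808 and chasemp are watching can be followed with Elasticsearch's cluster health API; a sketch, querying one of the logstash nodes (port 9200 is the stock ES default; that it is the port used here is an assumption):

    # same fields icinga-wm echoes above (active_shards, initializing_shards,
    # unassigned_shards, ...); "status" flips yellow -> green once all replicas assign
    curl -s 'http://logstash1004.eqiad.wmnet:9200/_cluster/health?pretty'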
[16:42:19] 6operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1782204 (10akosiaris) [16:42:22] 6operations, 10vm-requests, 5Patch-For-Review: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1782202 (10akosiaris) 5Open>3Resolved Created, installed, puppetized, salt key signed. Thank you for you business, sir :-). Resolving [16:42:31] (03CR) 1020after4: "sorry that was a bad paste, this is what I was attempting to use:" [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:43:20] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1782215 (10Joe) @mobrovac so let me get this straight, we discussed something that was already overridden by an existing implementation? As far as deploying the python app, who i... [16:44:58] twentyafterfour: I can clone from both github and gerrit on iridium as myself, how is it not working? [16:45:30] ostriches: probably because I have modified the global gitconfig to make it work [16:45:36] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1782227 (10revi) [16:45:38] I think I found a working setup [16:46:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [16:48:35] as always remember to remove patch-for-review [16:48:53] the evil project never gets self-removed :( [16:49:35] (03PS3) 1020after4: iridium system-wide gitconfig needs http.proxy [puppet] - 10https://gerrit.wikimedia.org/r/250370 [16:49:58] (03CR) 1020after4: [C: 031] "ok this seems to work." [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:51:21] (03CR) 10Chad: [C: 031] "Little bit of extra spacing and one inline question but functionally ok." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:51:44] (03PS1) 10Muehlenhoff: Assign salt grains for dbstore systems [puppet] - 10https://gerrit.wikimedia.org/r/250994 [16:53:25] (03CR) 1020after4: iridium system-wide gitconfig needs http.proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:53:57] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1782267 (10mobrovac) >>! In T114443#1782215, @Joe wrote: > @mobrovac so let me get this straight, we discussed something that was already overridden by an existing implementation?... 
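For context, the iridium gitconfig fix under review boils down to a single proxy setting; a sketch, taking the host from Chad's review comment (the port is an assumption, and --system writes the machine-wide /etc/gitconfig):

    # route outbound git-over-http(s) through the URL downloader proxy
    git config --system http.proxy http://url-downloader.wikimedia.org:8080

With that in place, clones from github and gerrit work on a host that has no direct outbound access, which is what twentyafterfour was testing by hand above.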
[16:54:42] (03PS2) 10Filippo Giunchedi: Fixed productionn DNS for ms-be2020 and ms-be2021 Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [16:54:50] (03PS3) 10Filippo Giunchedi: Fixed productionn DNS for ms-be2020 and ms-be2021 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [16:54:57] (03PS4) 10Filippo Giunchedi: Fixed production DNS for ms-be2020 and ms-be2021 Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [16:55:11] (03PS5) 10Filippo Giunchedi: Fixed production DNS for ms-be2020 and ms-be2021 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [16:55:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Fixed production DNS for ms-be2020 and ms-be2021 [dns] - 10https://gerrit.wikimedia.org/r/250977 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [16:56:16] (03CR) 10Chad: iridium system-wide gitconfig needs http.proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [16:57:04] RECOVERY - Check size of conntrack table on cerium is OK: OK: nf_conntrack is 0 % full [16:57:04] RECOVERY - DPKG on cerium is OK: All packages OK [16:57:25] RECOVERY - configured eth on cerium is OK: OK - interfaces up [16:57:25] RECOVERY - Disk space on cerium is OK: DISK OK [16:57:26] RECOVERY - salt-minion processes on cerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:57:35] RECOVERY - RAID on cerium is OK: OK: no disks configured for RAID [16:57:35] RECOVERY - dhclient process on cerium is OK: PROCS OK: 0 processes with command name dhclient [16:58:57] * dbrant waves at YuviPanda [16:59:05] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:25] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:45] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:01:45] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:01:45] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:01:54] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 34.38 ms [17:02:04] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:02:04] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:02:05] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:02:06] RECOVERY - service on cerium is OK: OK - cassandra-a is active [17:02:53] (03CR) 10Rush: "cool, whatever I was reading into it before idk :) glad it works" [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [17:02:56] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:03:04] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:03:04] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:03:04] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:03:04] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 
[17:03:04] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:03:15] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1782334 (10fgiunchedi) yep still inaccessible, console log keeps repeating like below. also T110304 seems a duplicate of T103808 ? ```lines=5 err: Could not request certificate: Connec... [17:03:16] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:03:16] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4007_v4, cp4007_v6 [17:05:24] ulsfo link issues? [17:05:46] PROBLEM - swift-account-server on ms-be2021 is CRITICAL: Connection refused by host [17:05:57] PROBLEM - swift-container-server on ms-be2021 is CRITICAL: Connection refused by host [17:06:05] PROBLEM - swift-account-reaper on ms-be2021 is CRITICAL: Connection refused by host [17:06:05] PROBLEM - puppet last run on ms-be2021 is CRITICAL: Connection refused by host [17:06:06] all about cp4007 tho [17:06:14] PROBLEM - RAID on ms-be2021 is CRITICAL: Connection refused by host [17:06:25] PROBLEM - swift-container-updater on ms-be2021 is CRITICAL: Connection refused by host [17:06:27] PROBLEM - swift-object-replicator on ms-be2021 is CRITICAL: Connection refused by host [17:06:35] yeah it's confusing when this happens as all nodes will report issues to one node etc so probably all cp4007 related? [17:06:35] PROBLEM - dhclient process on ms-be2021 is CRITICAL: Timeout while attempting connection [17:06:45] PROBLEM - swift-account-auditor on ms-be2021 is CRITICAL: Timeout while attempting connection [17:06:54] PROBLEM - swift-object-auditor on ms-be2021 is CRITICAL: Timeout while attempting connection [17:06:54] PROBLEM - DPKG on ms-be2021 is CRITICAL: Timeout while attempting connection [17:06:54] PROBLEM - swift-object-updater on ms-be2021 is CRITICAL: Timeout while attempting connection [17:07:04] PROBLEM - configured eth on ms-be2021 is CRITICAL: Timeout while attempting connection [17:07:04] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1782369 (10Eevans) >>! In T95253#1782049, @fgiunchedi wrote: > we could get bigger SSDs for those too FWIW, it //might// be enough to get the dis... [17:07:14] PROBLEM - very high load average likely xfs on ms-be2021 is CRITICAL: Timeout while attempting connection [17:07:15] PROBLEM - swift-account-replicator on ms-be2021 is CRITICAL: Timeout while attempting connection [17:07:24] PROBLEM - swift-object-server on ms-be2021 is CRITICAL: Timeout while attempting connection [17:07:25] PROBLEM - Check size of conntrack table on ms-be2021 is CRITICAL: Timeout while attempting connection [17:07:31] ^godog is this you / on your radar? [17:08:00] chasemp: Can you merge https://gerrit.wikimedia.org/r/#/c/250370/? [17:08:21] chasemp: yeah papaul is reinstalling [17:08:23] looks like I missed cp4007 dying? [17:08:27] (downtimed) [17:08:33] yeah I was about to reboot, bblack objections?
it's unresponsive [17:08:46] let's depool it first before reboot [17:08:48] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1782387 (10Ottomata) 3NEW a:3Ottomata [17:08:58] I'll go do that [17:09:13] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1782397 (10Ottomata) [17:09:15] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1782396 (10Ottomata) [17:09:22] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#915957 (10Ottomata) [17:09:23] k [17:09:24] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1782387 (10Ottomata) [17:10:26] !log ms-be2020 installation complete [17:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:35] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: puppet fail [17:11:05] !log cp4007 depooled in pybal + confctl [17:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:17] chasemp: go ahead! [17:12:06] !log rebooting cp4007.mgmt.ulsfo.wmnet as it's unresponsive [17:12:09] :) [17:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:44] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [17:14:45] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [17:14:45] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK [17:14:45] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [17:14:45] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [17:14:45] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.64 ms [17:14:45] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK [17:15:05] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [17:15:05] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [17:15:34] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [17:15:34] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [17:15:45] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [17:15:46] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK [17:15:46] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [17:17:21] (03CR) 10Rush: [C: 032] iridium system-wide gitconfig needs http.proxy [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [17:17:37] ostriches: gtg [17:17:44] Ok, thx! [17:17:48] chasemp: thanks! [17:20:41] Force ran puppet, change picked up [17:22:15] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail [17:34:25] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:36:46] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:36:50] ostriches: woot! thanks [17:37:12] np. Replication to github is working now for scap.
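The "depooled in pybal + confctl" step logged above is roughly the following; a sketch in conftool's selector style (the selector syntax is an assumption, and the exact invocation depends on the conftool version in use):

    # take cp4007 out of rotation before the reboot...
    confctl select 'name=cp4007.ulsfo.wmnet' set/pooled=no
    # ...and repool once the host checks out again
    confctl select 'name=cp4007.ulsfo.wmnet' set/pooled=yes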
[17:37:26] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1782679 (10BBlack) 3NEW [17:39:07] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1782700 (10faidon) >>! In T114443#1776521, @GWicke wrote: > We have a [simple node service](https://github.com/wikimedia/restevent) that does what we need & integrates with our no... [17:45:02] !log reinstalling ms-be2021 [17:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:15] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1782752 (10coren) >>! In T107507#1780643, @ori wrote: > Why? Is the chat log public, by any chance? Sorry, if I recall correctly most of it was held on an unlogged channel. The discussion at h... [17:48:26] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [17:51:39] !log ms-be2021 installation complete [17:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:23] !log ms-be2021 - fixing puppet certs after reinstall [17:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:56] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1782793 (10Papaul) [17:55:58] 6operations, 10ops-codfw, 5Patch-For-Review, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1782791 (10Papaul) 5Open>3Resolved Servers OS installation complete. [17:57:26] !log ms-be2020 - added to puppet [17:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:45] RECOVERY - DPKG on ms-be2021 is OK: All packages OK [18:00:45] RECOVERY - swift-object-updater on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [18:00:45] RECOVERY - swift-object-auditor on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [18:00:55] RECOVERY - configured eth on ms-be2021 is OK: OK - interfaces up [18:01:04] RECOVERY - very high load average likely xfs on ms-be2021 is OK: OK - load average: 0.27, 0.20, 0.14 [18:01:24] RECOVERY - Check size of conntrack table on ms-be2021 is OK: OK: nf_conntrack is 0 % full [18:01:45] RECOVERY - swift-account-server on ms-be2021 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:01:54] papaul: ^ all done. 
thank you [18:01:55] RECOVERY - swift-container-server on ms-be2021 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:02:14] RECOVERY - RAID on ms-be2021 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:02:24] RECOVERY - swift-container-updater on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [18:02:34] RECOVERY - dhclient process on ms-be2021 is OK: PROCS OK: 0 processes with command name dhclient [18:02:44] RECOVERY - swift-account-auditor on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [18:03:06] RECOVERY - swift-account-replicator on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:03:14] RECOVERY - swift-object-server on ms-be2021 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [18:04:05] RECOVERY - swift-account-reaper on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [18:04:05] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:16] RECOVERY - swift-object-replicator on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:04:52] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1782837 (10Dzahn) ms-be2020/2021 reinstalled by @papaul. fixed puppet certs and added to puppetmaster. you can see them appearing in Icinga now. [18:06:16] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [18:06:17] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1782841 (10Dzahn) [18:07:19] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1782842 (10JMinor) Hello I'll try to address the questions of motivation and potential scale. As for maintai... [18:08:04] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1782844 (10Dzahn) a:5Papaul>3fgiunchedi @filippo fyi, the last 2 are also in puppet now and in monitoring and already have all the swift checks from the role. this is for you to check... [18:12:36] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures [18:13:43] (03CR) 10Alexandros Kosiaris: [C: 031] openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [18:17:53] twentyafterfour, greg-g: I've got a tiny backport to wmf.5 to get in in front of the train. 
zuul is processing it now and I'll sync as soon as it lands [18:17:54] (03PS6) 10Bgerstle: Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) [18:18:02] (03PS7) 10Bgerstle: Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) [18:18:40] (03CR) 10Bgerstle: "bd808 just amended the commit msg to describe that the user chooses how URLs are handled, the default behavior is still to open in Safari" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [18:18:59] kk [18:20:11] bd808: no problem [18:23:44] !log ms-be2016-2020 - accepted salt keys, hafnium.wm.org - deleted salt key - still unaccepted: capella, haedus, rhodium [18:23:44] !log bd808@tin Synchronized php-1.27.0-wmf.5/includes/debug/logger/MonologSpi.php: MonologSpi: add support for customizing Monolog\Logger instances (59eed42) (duration: 00m 17s) [18:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:10] twentyafterfour: all done and thanks [18:31:14] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:43] PROBLEM - HHVM rendering on mw1159 is CRITICAL: Connection timed out [18:31:54] PROBLEM - configured eth on mw1159 is CRITICAL: Timeout while attempting connection [18:31:54] PROBLEM - Disk space on mw1159 is CRITICAL: Timeout while attempting connection [18:32:04] PROBLEM - HHVM processes on mw1159 is CRITICAL: Timeout while attempting connection [18:32:14] PROBLEM - nutcracker port on mw1159 is CRITICAL: Timeout while attempting connection [18:32:43] PROBLEM - nutcracker process on mw1159 is CRITICAL: Timeout while attempting connection [18:33:23] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM. silly typo mistake by me back then." [dns] - 10https://gerrit.wikimedia.org/r/250458 (owner: 10Rush) [18:33:23] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] [18:33:29] (03PS4) 10Alexandros Kosiaris: Fix codfw row a labs-hosts1 and labs-support1 IP overlap [dns] - 10https://gerrit.wikimedia.org/r/250458 (owner: 10Rush) [18:34:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/250458 (owner: 10Rush) [18:34:53] PROBLEM - Check size of conntrack table on mw1159 is CRITICAL: Timeout while attempting connection [18:35:13] PROBLEM - DPKG on mw1159 is CRITICAL: Timeout while attempting connection [18:35:58] !log mw1158: power cycled - crashed, no console output [18:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:13] PROBLEM - RAID on mw1159 is CRITICAL: Timeout while attempting connection [18:36:18] arg, 1159 of course, i cycled the right one though :p [18:37:05] !log correction. 
^ that was mw1159 [18:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:21] mutante, I just fixed it in page text :) [18:37:25] PROBLEM - Host mw1159 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:41] (03PS1) 10Alexandros Kosiaris: labs reservations [dns] - 10https://gerrit.wikimedia.org/r/251017 [18:37:43] MaxSem: :) thanks [18:37:54] RECOVERY - nutcracker process on mw1159 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [18:38:04] RECOVERY - RAID on mw1159 is OK: OK: no RAID installed [18:38:04] RECOVERY - Host mw1159 is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [18:38:14] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:38:24] RECOVERY - configured eth on mw1159 is OK: OK - interfaces up [18:38:26] RECOVERY - Disk space on mw1159 is OK: DISK OK [18:38:31] 7Blocked-on-Operations, 10Flow, 3Collaboration-Team-Current, 5Patch-For-Review, 7WorkType-Maintenance: Migrate Flow content to new separate logical External Store - https://phabricator.wikimedia.org/T106363#1780264 (10Mattflaschen) [18:38:31] (03CR) 10Alexandros Kosiaris: [C: 032] labs reservations [dns] - 10https://gerrit.wikimedia.org/r/251017 (owner: 10Alexandros Kosiaris) [18:38:35] RECOVERY - Check size of conntrack table on mw1159 is OK: OK: nf_conntrack is 0 % full [18:38:54] RECOVERY - HHVM processes on mw1159 is OK: PROCS OK: 11 processes with command name hhvm [18:39:05] RECOVERY - DPKG on mw1159 is OK: All packages OK [18:39:13] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:39:14] RECOVERY - nutcracker port on mw1159 is OK: TCP OK - 0.000 second response time on port 11212 [18:39:23] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.756 second response time [18:39:24] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1783047 (10Dzahn) Thank you very much @akosiaris. Appreciate the service. [18:40:57] bblack: sorry I got pulled away for a sec there, what do you want to do for cp4007? [18:42:04] RECOVERY - HHVM rendering on mw1159 is OK: HTTP OK: HTTP/1.1 200 OK - 65376 bytes in 0.278 second response time [18:44:01] mutante: are you doing a reinstall? [18:44:34] cmjohnson1: no, i just powercycled mw1159 [18:44:44] okay..thx [18:44:48] cmjohnson1: what's up [18:45:21] i am reinstalling mw1083 right now... so was going to add dsh at the same time but not important [18:45:38] ah, ok [18:50:22] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1783068 (10MoritzMuehlenhoff) I did some digging, but couldn't identify an obvious commit which might explain this. Independant of that I'm planning to update our kernel to the latest 3.19.8-ckt9 on Friday, it brings a couple o... [18:52:43] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:54:12] godog: reimages cerium? [18:54:17] s/reimages/reimaged/ [18:55:49] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1783095 (10yuvipanda) @fgiunchedi was this created with the same name of an instance that was deleted a little while ago?
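The power cycle mutante logged goes through the host's management interface; a sketch with ipmitool (the lanplus transport and credential handling are assumptions; in practice this is done from the mgmt console):

    # hard power cycle a wedged host via its .mgmt interface
    ipmitool -I lanplus -H mw1159.mgmt.eqiad.wmnet -U root -P "$IPMI_PASS" chassis power cycle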
[18:56:26] (03PS1) 10Muehlenhoff: Move base::firewall include in the kibana and logstash roles [puppet] - 10https://gerrit.wikimedia.org/r/251019 [18:58:28] (03PS1) 10Dzahn: admin: add all groups to new "people" host [puppet] - 10https://gerrit.wikimedia.org/r/251021 (https://phabricator.wikimedia.org/T116992) [18:58:44] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:59:48] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1783127 (10GWicke) I also think that we should be careful about the interaction with other production services running on this cluster. There is not much isolation betwee... [19:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151104T1900). [19:00:36] (03CR) 10John F. Lewis: [C: 031] "sane [but not nice!] and love the alphabetical sort feature here." [puppet] - 10https://gerrit.wikimedia.org/r/251021 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [19:01:04] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1783135 (10mobrovac) >>! In T114443#1782700, @faidon wrote: > So either someone else should make it for you (//soon//) or you'll just use your own thing? No, it doesn't work like... [19:01:25] JohnFLewis: thanks! so i think krenair is onto something to create an "all" keyword or something [19:01:38] thought it was chase? [19:01:54] It was me, then Chase [19:02:09] 60 groups :p [19:02:17] yea, alpha-sorted [19:02:27] we're just referencing different times then :) [19:06:45] Krenair: ok so I can't actually test the script itself on labs [19:06:59] which script? [19:07:01] Krenair: I guess I have to create things manually and test it on some prod host [19:07:02] the labs dns change? [19:07:03] yeah [19:07:08] why not? [19:07:16] can't access the Nova api from inside labs [19:07:26] PROBLEM - salt-minion processes on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:07:27] PROBLEM - swift-object-server on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:07:29] Oh, right. [19:07:46] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Puppet has 7 failures [19:07:53] YuviPanda, I think I generated the file on silver [19:07:57] PROBLEM - swift-account-auditor on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:08:17] PROBLEM - swift-account-reaper on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:08:23] kk let me scp script, create a config file and see how it goes [19:08:26] PROBLEM - swift-object-replicator on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:08:33] godog: ^
[19:09:51] yes, it is a new server [19:09:56] handling it now [19:10:39] mutante: ah ok thanks [19:10:58] ACKNOWLEDGEMENT - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Puppet has 7 failures daniel_zahn fresh install [19:10:59] ACKNOWLEDGEMENT - salt-minion processes on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn fresh install [19:10:59] ACKNOWLEDGEMENT - swift-account-auditor on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor daniel_zahn fresh install [19:10:59] ACKNOWLEDGEMENT - swift-account-reaper on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper daniel_zahn fresh install [19:10:59] ACKNOWLEDGEMENT - swift-account-replicator on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator daniel_zahn fresh install [19:10:59] ACKNOWLEDGEMENT - swift-object-replicator on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator daniel_zahn fresh install [19:10:59] ACKNOWLEDGEMENT - swift-object-server on ms-be2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server daniel_zahn fresh install [19:11:38] ^ i do that because we cant schedule downtimes beforehand and an ACK does exactly what we want, imho: it stops it from talking about it ..until there is a status change again. and you dont have to know the end time [19:12:00] mobrovac: yep [19:12:08] YuviPanda: what mutante said :D [19:12:33] kk [19:19:50] (03PS1) 10EBernhardson: Make three new nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 [19:19:52] (03PS1) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 [19:20:07] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [19:20:18] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.015 second response time [19:20:57] (03PS2) 10EBernhardson: Make three new nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 [19:20:59] (03PS2) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 [19:21:09] (03PS3) 10EBernhardson: Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 [19:21:14] 6operations, 7Database: Puppetize grants for mysql analytics servers - https://phabricator.wikimedia.org/T114476#1783322 (10jcrespo) This has been partially done on https://phabricator.wikimedia.org/rOPUP757b0ebf40b936c1e07a3881105588b333c8b500 It has some blockers for auditing and some puppet work on refacto... [19:21:16] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1783323 (10fgiunchedi) >>! In T117673#1783095, @yuvipanda wrote: > @fgiunchedi was this created with the same name of an instance that was deleted a little while ago? the name was used befor... [19:21:42] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1783326 (10Halfak) @Gwicke, what interactions are you worried about. Is this specific to ORES or maybe a more general concern? 
[19:21:50] (03PS4) 10EBernhardson: Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 [19:21:52] (03PS3) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 [19:22:00] (03PS9) 10Yuvipanda: LabsDNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [19:22:11] Krenair: ^ works I think. [19:23:29] (03CR) 10Dzahn: [C: 032] "all shell users having access is the point of people.wm, it replaced fenari which used to be the bastion for all." [puppet] - 10https://gerrit.wikimedia.org/r/251021 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [19:23:30] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1783344 (10GWicke) @halfak, it's a general concern, but something computationally intense and research-driven like ORES is especially difficult to gauge in that regard. [19:23:32] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1783345 (10yuvipanda) What other services are currently running in scb? Is there any isolation between the multiple services running in sca? [19:23:56] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1783347 (10fgiunchedi) p:5High>3Normal \o/ using another name works, I guess this is known? `filippo@filippo-test-precise2:~$ ` [19:24:56] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1783355 (10yuvipanda) Yeah that error message is consistent with some component somewhere not caring about DNS TTL and just caching the old name forever, preventing puppet from running. @andr... 
Krenair: so it runs the script first in a no-op mode, and if it exits with 1 then it runs the script in 'real' mode [19:26:04] nice [19:26:12] this also means that if the 'real' mode errors out we'll actually get a puppet failure [19:27:17] PROBLEM - salt-minion processes on mw2214 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:27:26] RECOVERY - swift-account-auditor on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:27:47] RECOVERY - swift-account-reaper on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:27:48] RECOVERY - swift-object-replicator on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:27:53] (03PS1) 10Dzahn: rutherfordium: add domain search, fix admin groups [puppet] - 10https://gerrit.wikimedia.org/r/251027 (https://phabricator.wikimedia.org/T116992) [19:28:06] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:28:06] RECOVERY - swift-account-replicator on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:28:17] PROBLEM - salt-minion processes on db2053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:28:27] (03CR) 10Dzahn: [C: 032] rutherfordium: add domain search, fix admin groups [puppet] - 10https://gerrit.wikimedia.org/r/251027 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [19:28:56] RECOVERY - swift-object-server on ms-be2021 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:29:07] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:29:57] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: puppet fail [19:30:03] (03CR) 10Alex Monk: [C: 031] LabsDNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [19:31:16] RECOVERY - salt-minion processes on mw2214 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:32:42] I might've borked my laptop in the process of upgrading it to stretch [19:32:44] welp [19:33:40] I'm going to reboot and hope! [19:33:44] but I might be screwed >_> [19:33:46] <_< [19:34:16] RECOVERY - salt-minion processes on db2053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:34:34] (03PS1) 10Dzahn: rutherfordium: add debdeploy grains, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/251029 (https://phabricator.wikimedia.org/T116992) [19:35:19] hah! nope [19:35:34] (03CR) 10Dzahn: [C: 032] rutherfordium: add debdeploy grains, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/251029 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [19:35:47] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:41] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1783481 (10scfc) Does it become accessible after a reboot? (I //think// I successfully launched a Precise instance in that way a few months ago, but with the open bugs stalled I didn't bothe...
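The two-pass behaviour YuviPanda describes is easy to picture in shell; a sketch, with the script name and flag as hypothetical stand-ins:

    # the no-op pass reports via its exit code; 1 means "changes needed"
    /usr/local/sbin/labs-dns-update --dry-run
    if [ $? -eq 1 ]; then
        # real pass: any non-zero exit here surfaces as a puppet run failure
        exec /usr/local/sbin/labs-dns-update
    fi

That split is what makes the puppet resource both idempotent (no-op when nothing changed) and loud when the real run actually errors out.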
[19:39:01] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1783491 (10scfc) (Ignore my last comment; I fell into the trap of not reloading the task page before clicking "Submit" again. Oh, Bugzilla, sometimes I miss you very much.) [19:39:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [19:39:36] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:41:45] JohnFLewis: "11:43 < Rotonen> on the topic of www subdomains, i'm usually also making a gopher subdomain, just because :P [19:42:21] :P [19:44:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [19:45:40] (03PS1) 10Dzahn: peopleweb: move standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/251030 [19:46:17] (03CR) 10Dzahn: [C: 032] peopleweb: move standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/251030 (owner: 10Dzahn) [19:46:24] (03CR) 10Matanya: [C: 031] deactivate wikimedia.biz [dns] - 10https://gerrit.wikimedia.org/r/244084 (https://phabricator.wikimedia.org/T81344) (owner: 10Dzahn) [19:47:06] (03CR) 10Matanya: [C: 031] deactivate wikidisclosure.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243973 (owner: 10Dzahn) [19:47:22] (03CR) 10Matanya: [C: 031] deactivate wekipedia.com [dns] - 10https://gerrit.wikimedia.org/r/244085 (owner: 10Dzahn) [19:47:36] :) [19:47:40] (03CR) 10Matanya: [C: 031] deactivate wikifamily.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243972 (owner: 10Dzahn) [19:47:56] (03CR) 10Matanya: [C: 031] deactivate wikimemory.org [dns] - 10https://gerrit.wikimedia.org/r/244101 (owner: 10Dzahn) [19:48:23] (03CR) 10Matanya: [C: 031] deactivate wikimediacommons.[co.uk|eu|info|jp.net|mobi|net|org] [dns] - 10https://gerrit.wikimedia.org/r/244092 (owner: 10Dzahn) [19:48:46] (03CR) 10Matanya: [C: 031] deactivate wikipaedia.net [dns] - 10https://gerrit.wikimedia.org/r/244090 (owner: 10Dzahn) [19:49:12] (03CR) 10Matanya: [C: 031] deactivate wiki[p|m]ediastories.[com|net|org] [dns] - 10https://gerrit.wikimedia.org/r/244086 (owner: 10Dzahn) [19:49:19] (03PS1) 10Dzahn: rutherfordium: include peopleweb role [puppet] - 10https://gerrit.wikimedia.org/r/251031 (https://phabricator.wikimedia.org/T116992) [19:49:25] (03CR) 10Matanya: [C: 031] deactivate wikiknihy.cz [dns] - 10https://gerrit.wikimedia.org/r/244104 (owner: 10Dzahn) [19:49:41] (03CR) 10Matanya: [C: 031] deactivate wikimania.asia [dns] - 10https://gerrit.wikimedia.org/r/244103 (owner: 10Dzahn) [19:49:57] (03CR) 10Dzahn: [C: 032] rutherfordium: include peopleweb role [puppet] - 10https://gerrit.wikimedia.org/r/251031 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [19:49:57] PROBLEM - Disk space on labvirt1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 90714 MB (3% inode=99%) [19:52:56] jynus, just wanted to ping you about https://gerrit.wikimedia.org/r/#/c/226544/ . Technically that change itself doesn't require an immediate schema change, but it's the first step of the external store work we discussed. [19:53:03] It's tested, but I wanted to check with you that it's on the right track before we +2. 
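[editor's note] An aside on the recurring "HTTP error ratio anomaly detection" alerts threaded through this log: the check compares recent graphite datapoints against modelled confidence bounds and alerts on how many points fall outside the band, which is what the "N data above and M below" text is counting. A toy sketch of that counting logic only; the band model and the thresholds of the production check are not shown here, and the numbers below are assumptions:

    # Toy sketch: count datapoints outside a confidence band and flag when too
    # many sit above it. The band itself and the threshold are assumptions.
    def out_of_bounds(points, lower, upper):
        above = sum(1 for p, u in zip(points, upper) if p > u)
        below = sum(1 for p, l in zip(points, lower) if p < l)
        return above, below

    def is_critical(points, lower, upper, max_above=10):
        above, _below = out_of_bounds(points, lower, upper)
        return above >= max_above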
[19:53:56] (03PS1) 10Paladox: Allow viewing php files in raw format in browsers [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) [19:54:12] (03PS2) 10Paladox: Allow viewing php files in raw format in browsers [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) [19:54:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [19:55:17] matt_flaschen, we need to test it on a spare slave first [19:56:26] does it require more commits after the maintenance or it would work automatically? [19:57:41] jynus, basically, it assumes that a new external is setup (similar to the way external24, etc. are normally set up), then wants to get called like: [19:57:49] php maintenance FlowExternalStoreMoveCluster.php --from=cluster24,cluster25 --to=flowcluster1 [19:58:24] so it requires a configuration change? [19:58:43] for subsequent reads and writes [19:58:49] In order to test it on fake DBs, we would need a fake External Store (because it calls ExternalStore::insertWithFallback on flowcluster1) and a sandbox/slave copy of the flow_revision to modify. [19:59:14] jynus, it needs the new Flow External Store (or a test version of it) before running. Merging itself won't do anything. [19:59:33] ok [19:59:45] let me review it [19:59:49] Thanks [19:59:58] and I will propose a schedule [20:00:11] jynus, if you want to discuss (now or at another time), just let me know. [20:00:27] it is a bit late for me, but I need to see the code first [20:00:39] I will send you feedback on the ticket [20:00:42] I understand, there's not a rush on our side, but we understand it's ultimately blocking the ES recompression. [20:00:50] I know :-) [20:01:00] I am actually interested on it [20:02:12] another thing we could do is one run that copies but not deletes entries [20:02:32] anyway, let me review it and I will get back [20:02:42] ok I'm ready to deploy the train to group1 ...a bit late [20:02:50] this things are prone to race conditions [20:03:02] talk to you tomorrow [20:03:51] also, it may need some compaction job at db level [20:03:54] jynus, it actually doesn't delete the old ES entries (this script doesn't, but the re-compression does). However, it does modify flow_revision. But it accepts different dbr and dbw connections, so the dbw could be a sandbox/slave flow_revision. [20:03:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 4 below the confidence bounds [20:04:04] (03PS1) 10Yuvipanda: ores: Allow tuning workers per core for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/251037 [20:04:26] jynus, okay, talk to you tomorrow. Do you want to set a time? [20:04:48] matt_flaschen, how does it handle writes-while-it-is-executing? [20:05:31] jynus, that is solved by cutting over new content to flowexternal1 before the script is run. Changing where Flow writes won't stop it reading old ES content. 
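[editor's note] The mechanics matt_flaschen describes above (read flow_revision rows pointing at the old clusters, re-insert the blobs via ExternalStore::insertWithFallback, update the pointers through a separate write connection) come down to a loop like the following. This is a sketch of the shape of the script, not its actual code: the batching, the helper names and the URL handling are assumptions, while insertWithFallback() and the flow_revision table are the ones named in the discussion.

    // Sketch only: move Flow external-store content from old clusters to a new one.
    $dbr = $this->getReadConnection();   // flow_revision source (can be a slave)
    $dbw = $this->getWriteConnection();  // where updated pointers go (can be a sandbox copy)
    $rows = $dbr->select( 'flow_revision', [ 'rev_id', 'rev_content' ],
        [ 'rev_content' . $dbr->buildLike( 'DB://cluster24/', $dbr->anyString() ) ] );
    foreach ( $rows as $row ) {
        $blob = ExternalStore::fetchFromURL( $row->rev_content );
        // returns a new DB://flowcluster1/<id> URL; old entries stay in place,
        // since deletion is the recompression step's job, as noted above
        $newUrl = ExternalStore::insertWithFallback( [ 'DB://flowcluster1' ], $blob );
        $dbw->update( 'flow_revision', [ 'rev_content' => $newUrl ], [ 'rev_id' => $row->rev_id ] );
    }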
[20:05:35] (03PS10) 10Yuvipanda: LabsDNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [20:05:37] (03PS2) 10Yuvipanda: ores: Allow tuning workers per core for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/251037 [20:05:50] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Allow tuning workers per core for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/251037 (owner: 10Yuvipanda) [20:06:08] ok, so it kind of requires a code change at around the same time [20:06:14] I see your idea [20:06:29] I will check the code [20:06:33] bye [20:06:39] Talk to you later [20:07:14] The flowexternal1 cut for new content doesn't have to be the same time, just some time before. [20:08:08] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251040 [20:08:45] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251040 (owner: 1020after4) [20:09:29] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251040 (owner: 1020after4) [20:09:47] (03PS1) 10Ori.livneh: Add backports repository, but with a low priority [puppet] - 10https://gerrit.wikimedia.org/r/251042 [20:10:11] (03PS1) 10Dzahn: peopleweb: add ferm rules, firewalling [puppet] - 10https://gerrit.wikimedia.org/r/251043 [20:10:38] (03CR) 10jenkins-bot: [V: 04-1] Add backports repository, but with a low priority [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [20:10:42] jzerebecki: "< ille> what wrong do i when i got when i use letsencrypt-auto with apache and got CN happy hacker fake CA when im go to https://www.mysite.com" :p [20:11:55] hey is there any background or recurring process that could clear out files from /tmp on the video scalers? trying to track down some oddities on https://phabricator.wikimedia.org/T117771 and it looks like in some cases files just vanish from /tmp while we're working with them [20:11:57] (03PS2) 10Ori.livneh: Add backports repository, but with a low priority [puppet] - 10https://gerrit.wikimedia.org/r/251042 [20:12:19] brion: yes, let me get you the link [20:12:41] whee! [20:13:02] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1783632 (10chasemp) I spoke with @papaul since he is going on vacation as it would be good to get the hosts mentioned [[ https://phabricator.wikimedia.org/T117107#1770439 | he... 
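[editor's note] "Add backports repository, but with a low priority" above is standard APT pinning: make the repo available without ever letting it win a version comparison by default. The canonical preferences entry looks like this; priority 100 means packages are installable on request (apt-get install -t jessie-backports foo) and already-installed backports keep upgrading from there, but the main release at the default priority of 500 is otherwise preferred. The release name is an assumption:

    # /etc/apt/preferences.d/backports (sketch)
    Package: *
    Pin: release a=jessie-backports
    Pin-Priority: 100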
[20:13:09] brion: https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/manifests/multimedia.pp#L24-L30 [20:13:30] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.5 [20:13:33] interesting [20:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:41] mutante: tell him that is very good there is bounty waiting if he can tell you how to reproduce [20:13:51] jzerebecki: 12:13 < LotR> ille: you forgot the --server option that was explained in the mail you got about being accepted in the beta program [20:14:08] ori: it may take over an hour to run long transcodes, so 15 minutes from creation time is definitely going to be a problem for 2-pass encodings (we need the source file for both passes) [20:14:13] haha didn't know that was the test CN [20:14:15] it actually does "happy hacker fake CA" if you dont [20:14:25] i just learned that :) [20:14:25] brion: gotcha. let's change it then. [20:14:26] i'm also seeing some where it looks like it's deleting immediately, which _shouldn't_ explode on a 15-minute check [20:14:47] so i'm still suspicious what could be expoding there [20:15:28] don't we have some tmpfile abstraction that deletes a tmpfile when an object falls out of scope? [20:15:33] i vaguely remember seeing that somewhere in our php code [20:15:58] ori: yeah, i checked down that rabbithole for a while and can't find an obvious slip-up [20:16:12] the TempFSFile object seems to stick around in memory as it should [20:16:17] for that puppet config -- would ctime >= 2h be ok? or should it be higher? [20:16:26] ori: yeah 2h should be ok [20:16:39] slower transcodes than that really shouldn't happen :D [20:17:35] (03PS1) 10Ori.livneh: mediawiki: allow temp files to linger up to 2 hours after creation [puppet] - 10https://gerrit.wikimedia.org/r/251044 [20:18:11] hmm, i wonder if the ones that failed immediately could have been something weird where it took too long to download the file out of swift [20:18:40] 15 minutes to get ~300mb out of swift shouldn't happen though :D [20:19:22] we also have tmpreaper enabled (deletes tmp files, runs via cron) but it is configured for TMPREAPER_TIME=7d [20:19:31] (03CR) 10Brion VIBBER: [C: 031] "Looks good -- this should help with at least some of the cases we're seeing on T117771 where the file disappears while we're still working" [puppet] - 10https://gerrit.wikimedia.org/r/251044 (owner: 10Ori.livneh) [20:19:55] ori: thanks! [20:20:06] (03CR) 10Ori.livneh: [C: 032] mediawiki: allow temp files to linger up to 2 hours after creation [puppet] - 10https://gerrit.wikimedia.org/r/251044 (owner: 10Ori.livneh) [20:20:53] brion: where is the code that fetches from swift? [20:21:39] ori: SwiftFileBackend::doFetchLocalCopyMulti() or something -- calls through into Http::getMulti i believe [20:27:57] (03PS2) 10Dzahn: peopleweb: add ferm rules, firewalling [puppet] - 10https://gerrit.wikimedia.org/r/251043 [20:28:22] (03CR) 10Dzahn: [C: 032] peopleweb: add ferm rules, firewalling [puppet] - 10https://gerrit.wikimedia.org/r/251043 (owner: 10Dzahn) [20:29:54] jzerebecki: https://cipherli.st/ [20:30:23] (03PS1) 10Rush: Establish IPv6 for labs-support1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/251114 (https://phabricator.wikimedia.org/T115491) [20:30:32] (03PS1) 10John F. 
Lewis: misc: add rutherfordium and point people.wm.o to it [puppet] - 10https://gerrit.wikimedia.org/r/251115 (https://phabricator.wikimedia.org/T116992) [20:32:14] mutante: nice. thx. [20:39:51] (03CR) 10Dzahn: [C: 031] Establish IPv6 for labs-support1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/251114 (https://phabricator.wikimedia.org/T115491) (owner: 10Rush) [20:40:29] (03CR) 10Faidon Liambotis: [C: 04-1] Add backports repository, but with a low priority (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [20:47:32] (03PS5) 10EBernhardson: Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) [20:47:47] (03PS4) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 (https://phabricator.wikimedia.org/T112556) [20:47:55] (03PS5) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 (https://phabricator.wikimedia.org/T112556) [20:49:42] 6operations, 10CirrusSearch, 6Discovery, 5Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#1783751 (10EBernhardson) It seems this fell by the wayside as chase and I were both otherwise occupied. I think we should... [20:50:46] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/7: down - Transit: ! CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu]BR [20:53:04] (03PS1) 10Dzahn: peopleweb: add migration class to rsync homes [puppet] - 10https://gerrit.wikimedia.org/r/251122 (https://phabricator.wikimedia.org/T116992) [20:55:06] (03CR) 10John F. Lewis: [C: 031] peopleweb: add migration class to rsync homes [puppet] - 10https://gerrit.wikimedia.org/r/251122 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [20:56:15] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1783771 (10Milimetric) So assuming 1000 bytes for each event, and like 3 million events per day, that means:... [20:57:06] (03CR) 10Dzahn: [C: 032] peopleweb: add migration class to rsync homes [puppet] - 10https://gerrit.wikimedia.org/r/251122 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [21:00:05] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151104T2100). Please do the needful. [21:01:54] !log starting parsoid deploy [21:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:01] no mobileapps deploy today //cc:mdholloway [21:05:13] !log synced code + restarted parsoid on wtp1002 as a canary [21:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:02] looking good. going to restart parsoid on all nodes now. 
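[editor's note] The parsoid deploy above is the standard canary pattern: sync the code everywhere, restart exactly one node (wtp1002), verify it, and only then roll the restart across the fleet. A sketch of the restart-and-verify half; the host and service names come from the log, but the port, health-check URL, and fleet-restart tooling are assumptions:

    # Canary restart and check (sketch; port/URL are assumptions)
    ssh wtp1002.eqiad.wmnet 'sudo service parsoid restart'
    curl -fsS 'http://wtp1002.eqiad.wmnet:8000/' >/dev/null && echo 'canary OK'
    # only after the canary looks good: restart the remaining wtp nodes,
    # e.g. via salt or dsh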
[21:17:31] !log finished deploying parsoid sha 04893a18 [21:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [21:25:36] (03PS2) 10Dzahn: scap: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250634 [21:26:23] (03PS3) 10Mdann52: Tidy robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) [21:27:16] (03CR) 10Mdann52: "Hopefully fixed the issues in the above comments!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: 10Mdann52) [21:27:36] (03PS3) 10Dzahn: scap: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250634 [21:27:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [21:29:29] (03CR) 10Nemo bis: [C: 04-1] Tidy robots.txt (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: 10Mdann52) [21:34:30] (03CR) 10Dzahn: [C: 032] scap: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250634 (owner: 10Dzahn) [21:35:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [21:36:09] 6operations, 5Patch-For-Review: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1783879 (10Dzahn) rutherfordium already has all the user accounts now. people can connect. apache is running .. also added firewalling and an rsyncd... [21:37:18] (03CR) 10Ori.livneh: Add backports repository, but with a low priority (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [21:37:27] (03PS3) 10Ori.livneh: Add backports repository, but with a low priority [puppet] - 10https://gerrit.wikimedia.org/r/251042 [21:37:46] (03PS5) 10Dzahn: logstash: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250070 [21:39:21] (03CR) 10Dzahn: "checked with compiler - no diff" [puppet] - 10https://gerrit.wikimedia.org/r/250070 (owner: 10Dzahn) [21:40:23] (03PS1) 10Ori.livneh: Remove references to tungsten, which no longer exists [puppet] - 10https://gerrit.wikimedia.org/r/251129 [21:40:33] (03PS1) 10Jdlrobson: Deploy QuickSurveys to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) [21:41:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [21:44:03] (03CR) 10Dzahn: [C: 04-1] "https://phabricator.wikimedia.org/T106563 is open and says to use tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/251129 (owner: 10Ori.livneh) [21:46:16] (03PS1) 10Jdlrobson: WIP: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) [21:47:34] (03CR) 10Jhobs: [C: 04-1] "Awaiting finalized survey configuration by EOD Thursday (11/5)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [21:51:51] (03PS1) 10Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 [21:56:45] (03PS2) 10Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 [21:57:22] (03CR) 10jenkins-bot: [V: 04-1] add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 (owner: 10Ori.livneh) [22:00:50] Nemo_bis: do you know if anything remains to be done for https://phabricator.wikimedia.org/T106565 ? [22:03:26] (03CR) 10BryanDavis: "twentyafterfour: reading would like this to ride the train with the 1.27.0-wmf.6 branch. It has already been through beta cluster testing " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:06:11] (03CR) 10Legoktm: [C: 04-1] "Shouldn't be enabled on loginwiki or votewiki at the least. Does it need to be on private and fishbowl wikis at all?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:08:34] (03CR) 1020after4: "https://gerrit.wikimedia.org/r/#/c/251140/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:12:16] (03CR) 10John Vandenberg: "why the unnecessary rebase?" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243364 (owner: 10John Vandenberg) [22:13:48] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1783981 (10Dzahn) @matanya are you are still blocked here? [22:15:49] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1783988 (10Dzahn) >>! In T106563#1758383, @ArielGlenn wrote: > did these videos get transcoded already? or is this host still needed, I wonder? I see nothing t... [22:16:03] /win/win 12 [22:19:06] (03PS3) 10Dzahn: deactivate wikimedia.biz [dns] - 10https://gerrit.wikimedia.org/r/244084 (https://phabricator.wikimedia.org/T81344) [22:19:53] (03CR) 10Dzahn: [C: 032] deactivate wikimedia.biz [dns] - 10https://gerrit.wikimedia.org/r/244084 (https://phabricator.wikimedia.org/T81344) (owner: 10Dzahn) [22:22:09] (03CR) 10Jdlrobson: "I've been told all wikipedia's. Is there a simple way to do that? e.g. "wiki" => ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:26:16] (03CR) 10Alex Monk: "'wiki' will do more than just wikipedias. We deliberately got rid of uses of that. Use 'wikipedia'." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:26:22] (03CR) 10Krinkle: "'wikipedia' is a project key that can be used for all $langs of that $site. It's used for several other configuration keys, such as 'wgSer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:27:17] (03CR) 10Alex Monk: "(Assuming going on wikipedias *only* is justified of course. 
Otherwise no, you can't.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:27:35] (03PS2) 10Jdlrobson: Deploy QuickSurveys to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) [22:27:47] (03CR) 1020after4: [C: 031] Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 (owner: 10Reedy) [22:28:09] (03CR) 10Dzahn: ""all wikis" is a lot more than "all wikipedias"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:28:28] (03PS3) 10Jdlrobson: Deploy QuickSurveys to all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) [22:29:16] chasemp: nothing really, just waiting to see if it crashes again, and then maybe digging a bit about that kernel oops in case it's a known bug in our kernels [22:29:44] sounds good man [22:29:52] looks like moritz already did some digging: https://phabricator.wikimedia.org/T117746#1783068 [22:32:59] (03CR) 10Alex Monk: [C: 04-1] "Wikipedia only?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:33:34] !log aaron@tin Synchronized php-1.27.0-wmf.5/maintenance/findOrphanedFiles.php: fb43e95858b9 (duration: 00m 18s) [22:33:36] (03CR) 10Andrew Bogott: [C: 04-1] "I don't hate this as much as I thought I would :) Curious about why the require_package stuff is mixed in though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [22:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:39:58] (03CR) 10Greg Grossmeier: "See upstream discussion which lead to this patch: https://secure.phabricator.com/T8170" [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) (owner: 10Paladox) [22:45:44] (03CR) 10Jdlrobson: "Yes. I've made that assumption based on the fact this survey = https://phabricator.wikimedia.org/T113443 = needs to be run next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [22:46:22] (03PS11) 10Yuvipanda: LabsDNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [22:46:24] (03PS1) 10Yuvipanda: openstack: Make designate use require_package [puppet] - 10https://gerrit.wikimedia.org/r/251146 [22:46:40] andrewbogott: ^ [22:46:43] split it out! 
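[editor's note] The 'wiki' vs 'wikipedia' exchange above is about the suffix/tag keys in wmf-config's InitialiseSettings: 'wikipedia' selects every language edition of Wikipedia, while the 'wiki' database suffix also matches non-Wikipedia wikis such as loginwiki and commonswiki, which is why its use was deliberately removed. Per-dbname entries still override the project key, which is how wikis like loginwiki get excluded per legoktm's review. A sketch of the convention; the setting name is illustrative, not the actual patch contents:

    // Sketch of the per-project config convention discussed above
    'wmgUseQuickSurveys' => [
        'default' => false,
        'wikipedia' => true,     // all Wikipedias, every language
        'loginwiki' => false,    // per-dbname overrides beat the project key
    ],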
[22:47:14] I also learnt in the process that git will let you revert a commit sha that doesn't actually exist in any existing branches, which isn't surprising but a pretty neat feature but also probably something that'll bite someone in the ass at some point [22:47:56] (03CR) 10Andrew Bogott: [C: 031] openstack: Make designate use require_package [puppet] - 10https://gerrit.wikimedia.org/r/251146 (owner: 10Yuvipanda) [22:48:15] (03CR) 10Yuvipanda: [C: 032] openstack: Make designate use require_package [puppet] - 10https://gerrit.wikimedia.org/r/251146 (owner: 10Yuvipanda) [22:48:31] (03CR) 10Andrew Bogott: [C: 031] LabsDNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [22:50:04] (03PS3) 10Paladox: Add "composer test" command to lint files and run tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [22:51:02] (03PS4) 10Paladox: Add "composer test" command to lint files and run tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [22:52:10] andrewbogott: ok, first one was noop. let me do the second one careeefully [22:52:29] (03CR) 10Paladox: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [22:53:15] (03CR) 10Yuvipanda: [C: 032] LabsDNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (https://phabricator.wikimedia.org/T100990) (owner: 10Alex Monk) [22:53:37] (03CR) 10Paladox: [C: 031] "Updated phplint to 0.9.* and phpunit to 4.8.x" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [22:54:41] (03PS4) 10Jdlrobson: Deploy QuickSurveys to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) [22:56:00] andrewbogott: hmm that failed with [22:56:15] Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns: novaclient.exceptions.Unauthorized: Invalid user / password (Disable debug mode to suppress these details.) (HTTP 401) [22:56:31] bah because empty password?! [22:56:46] that’d do it [22:57:28] (03PS1) 10Andrew Bogott: Add modern nova policy file. [puppet] - 10https://gerrit.wikimedia.org/r/251149 [22:57:30] (03PS1) 10Andrew Bogott: Update novaenv for keystone v3 api [puppet] - 10https://gerrit.wikimedia.org/r/251150 [22:57:32] (03PS1) 10Andrew Bogott: Update keystone policy.json to allow the 'observer' role to observe. [puppet] - 10https://gerrit.wikimedia.org/r/251151 (https://phabricator.wikimedia.org/T104588) [22:58:04] (03PS1) 10BBlack: ssl_ciphersuite: add DHE+3DES option as well [puppet] - 10https://gerrit.wikimedia.org/r/251153 [22:58:23] (03PS1) 10Yuvipanda: labsdns: Include nova password class before using it [puppet] - 10https://gerrit.wikimedia.org/r/251154 [22:58:23] of course [22:59:50] (03PS2) 10Andrew Bogott: Add modern nova policy file. [puppet] - 10https://gerrit.wikimedia.org/r/251149 [22:59:52] (03PS2) 10Andrew Bogott: Update novaenv for keystone v3 api [puppet] - 10https://gerrit.wikimedia.org/r/251150 [22:59:54] (03PS2) 10Andrew Bogott: Update keystone policy.json to allow the 'observer' role to observe. 
[puppet] - 10https://gerrit.wikimedia.org/r/251151 (https://phabricator.wikimedia.org/T104588) [23:00:32] (03CR) 10Andrew Bogott: [C: 032] Update novaenv for keystone v3 api [puppet] - 10https://gerrit.wikimedia.org/r/251150 (owner: 10Andrew Bogott) [23:01:24] (03CR) 10Yuvipanda: [C: 032] labsdns: Include nova password class before using it [puppet] - 10https://gerrit.wikimedia.org/r/251154 (owner: 10Yuvipanda) [23:01:57] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [23:03:26] (03PS1) 10Yuvipanda: labsdns: Fix typo in variable name [puppet] - 10https://gerrit.wikimedia.org/r/251157 [23:03:31] (03PS3) 10Andrew Bogott: Update novaenv for keystone v3 api [puppet] - 10https://gerrit.wikimedia.org/r/251150 [23:03:32] ^ I'm on it [23:04:06] (03CR) 10Yuvipanda: [C: 032 V: 032] labsdns: Fix typo in variable name [puppet] - 10https://gerrit.wikimedia.org/r/251157 (owner: 10Yuvipanda) [23:04:44] * andrewbogott fights yuvi for the git head [23:05:03] (03PS4) 10Andrew Bogott: Update novaenv for keystone v3 api [puppet] - 10https://gerrit.wikimedia.org/r/251150 [23:05:04] andrewbogott: :) I usually don't wait on jenkins for rebases [23:05:20] well, conflict free rebases at least [23:05:47] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [23:07:26] (03PS3) 10Andrew Bogott: Add modern nova policy file. [puppet] - 10https://gerrit.wikimedia.org/r/251149 [23:09:18] !log aaron@tin Synchronized php-1.27.0-wmf.5/maintenance/findOrphanedFiles.php: 33d378bee6c4ac (duration: 00m 17s) [23:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:12] (03PS1) 10Yuvipanda: labsdns: Use the proper comment character for lua [puppet] - 10https://gerrit.wikimedia.org/r/251158 [23:10:31] Krenair: https://gerrit.wikimedia.org/r/#/c/251158/ is super funny :) [23:10:37] (03PS2) 10Yuvipanda: labsdns: Use the proper comment character for lua [puppet] - 10https://gerrit.wikimedia.org/r/251158 [23:10:47] (03CR) 10Yuvipanda: [C: 032 V: 032] labsdns: Use the proper comment character for lua [puppet] - 10https://gerrit.wikimedia.org/r/251158 (owner: 10Yuvipanda) [23:11:10] YuviPanda: Ugh. [23:11:41] yeah :D [23:11:43] 'tis alright [23:12:20] (03PS2) 10Dzahn: postgresl,sslcert: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250053 [23:12:33] Krenair: seems to be ok now [23:12:34] let's test! [23:13:20] all looks good to me [23:13:21] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1188/" [puppet] - 10https://gerrit.wikimedia.org/r/250053 (owner: 10Dzahn) [23:13:35] andrewbogott: can you check that DNS seems all ok to you? [23:13:51] * YuviPanda runs puppet again to make sure [23:13:55] ok [23:14:15] (03PS4) 10Dzahn: mariadb: 32 lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249038 [23:14:40] YuviPanda: looks go to me [23:14:45] um… looks ok, I mean [23:15:16] yeah to me too [23:15:18] and puppet's clean [23:15:20] (03CR) 10Dzahn: [C: 032] mariadb: 32 lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249038 (owner: 10Dzahn) [23:15:24] and isn't restarting pdns over and over again [23:15:26] \o/ [23:15:36] so I'mma call it done [23:15:58] andrewbogott: ok, your turn to break things now! [23:16:06] great! [23:16:09] (I think there might've been a 30s or so time period when the recursor on holmium was down) [23:16:26] (03PS4) 10Andrew Bogott: Add modern nova policy file. 
[puppet] - 10https://gerrit.wikimedia.org/r/251149 [23:17:01] Krenair: https://gerrit.wikimedia.org/r/#/c/251154/ was the other thing we missed [23:17:54] (03CR) 10Andrew Bogott: [C: 032] Add modern nova policy file. [puppet] - 10https://gerrit.wikimedia.org/r/251149 (owner: 10Andrew Bogott) [23:17:54] Right... I vaguely recall running into a similar error elsewhere [23:18:11] yeah [23:19:00] Krenair: thanks for working on this Krenair! [23:19:00] and sorry for letting it lie for a while [23:19:02] I'm going to do the invisible unicorn patch now too [23:21:44] YuviPanda: seems unbroken to me; let me know if you see any bad effects. [23:21:58] let me try [23:22:06] * YuviPanda runs puppet [23:22:24] You should also delete the invalid ones from the server btw, YuviPanda [23:23:52] andrewbogott: kk puppet's happy [23:24:06] dammit I can't build packages locally anymore because I am no longer on jessie [23:24:10] * YuviPanda learns to use copper [23:26:28] Failed to compile catalog for node lvs1007.eqiad.wmnet: Error from DataBinding 'hiera' while looking up 'standard::has_default_mail_relay': (): found character that cannot start any token while scanning for the next token at line 3 column 30 on node lvs1007.eqiad.wmnet [23:26:42] apparently we have this issue on all LVS boxes, so no changes can be compiled [23:31:08] (03PS2) 10Rush: Establish IPv6 for labs-support1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/251114 (https://phabricator.wikimedia.org/T115491) [23:31:58] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [23:32:00] (03CR) 10Rush: [C: 032] Establish IPv6 for labs-support1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/251114 (https://phabricator.wikimedia.org/T115491) (owner: 10Rush) [23:33:18] PROBLEM - Recursive DNS on 208.80.155.118 is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:34:18] hmm [23:34:21] andrewbogott: ^ is that us? [23:34:27] us as in labs? [23:34:59] templates/wikimedia.org:labs-recursor1 [23:35:02] yes seems it is [23:35:08] looks like [23:35:35] is that holmium? [23:35:41] it seems ok... [23:36:01] it’s in dallas [23:36:14] holmium is recursor0 [23:36:17] oh, no wait, not dalls [23:36:18] aaaah [23:36:20] oh? [23:36:24] it’s probably labs-services1001 [23:37:15] um, sorry, labservices1001.wikimedia.org [23:37:16] yeah I can hit labs-recursor0 from inside labs and labs-recursor1 just hangs [23:37:18] ok let me ssh in [23:46:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [23:50:50] andrewbogott: ok, so it can't connect to labcontrol1001 [23:50:53] andrewbogott: and hence fails... [23:51:08] did you open a special firewall hole for holmium? [23:51:22] wait, that should only make puppet fail, not pdns [23:52:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [23:52:45] andrewbogott: yeah I'm trying to figure that out now [23:53:05] it must generate unparseable lua [23:53:21] andrewbogott: and no I didn't open up any holes [23:53:21] um, lua? ruby? 
whatever [23:53:28] andrewbogott: I think what's happening is that it generates *no* files [23:53:33] and then pdns tries to read it and fails [23:53:47] so this is a unique failure mode that happens only when we do not have any file there [23:55:30] probably the firewall hole is in puppet/ferm and we just need to add a second for that second dns server [23:56:43] !log aaron@tin Synchronized php-1.27.0-wmf.5/maintenance/findOrphanedFiles.php: ced9a62893c (duration: 00m 17s) [23:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:07] possibly [23:58:28] andrewbogott: no there is only one holmium specific ferm hole [23:58:34] and that's for amqp [23:59:58] YuviPanda: look for ‘labs_designate_hostname’ in modules/openstack/manifests/controller_firewall.pp
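[editor's note] The closing hint points at the likely fix: modules/openstack/manifests/controller_firewall.pp keys its holes on $labs_designate_hostname, so only the original designate/DNS host is whitelisted and the new labservices1001 box gets its connections to labcontrol1001 dropped. The eventual change would extend each such rule to both hosts, roughly in the ferm idiom that file uses. Everything below except the quoted variable name is an assumption:

    # Sketch: whitelist both DNS/designate hosts instead of one. The second
    # variable, the resource name, and the port are assumptions.
    ferm::rule { 'nova_api_designate':
        rule => "saddr (@resolve(${labs_designate_hostname}) @resolve(${labs_designate_secondary_hostname})) proto tcp dport 8774 ACCEPT;",
    }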