[00:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160728T0000). Please do the needful.
[00:00:31] kaldari: by the way `echo $wgCategoryCollation` on aawiki still gives me uppercase, on enwiki, uca-default-u-kn
[00:00:40] so your change is well deployed on beta
[00:01:44] Do you have in your .ssh/config a ProxyCommand ssh -a -W %h:%p bastion.wmflabs.org instruction for *.wmflabs?
[00:01:46] Dereckson: true. I guess I don't really need to regenerate the sortkeys, but it would be nice to test that too.
[00:02:15] are you using your labs key to get into that one?
[00:02:35] usually you get a password prompt if you don't present the correct ssh key
[00:02:41] ah no. lemme add that...
[00:05:09] got in now
[00:05:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[00:05:59] running script now
[00:06:09] (CR) Brian Wolff: Change default gallery mode to 'packed' on the English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: Jforrester)
[00:06:10] anomie: we've a lot of 114 Undefined variable: index in /srv/mediawiki/php-1.28.0-wmf.12/includes/api/ApiQueryUserContributions.php on line 340
[00:07:32] Dereckson, bawolff: worked beautifully: http://en.wikipedia.beta.wmflabs.org/wiki/Category:Sort_test
[00:07:59] Woo
[00:08:23] er I've 9 100 101 99
[00:08:28] Dereckson: Now I'm going to go write some documentation on how to do Beta Cluster testing :)
[00:08:48] Dereckson: might have to clear your cache
[00:08:51] Blocked-on-Operations, Operations, Kartographer, Wikimedia-Extension-setup, and 4 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2500729 (Yurik) @esanders, I agree that maps are not finalized for the general
Wikipedia deployment,...
[00:09:09] I purged the page, seemed to fix things
[00:09:13] 1 9 99 100 101 after purge :)
[00:09:13] i.e. hard refresh
[00:09:15] yay
[00:09:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[00:09:46] Thanks for the help guys!
[00:10:02] You're welcome.
[00:10:25] Next step is the same for roman numerals?
[00:10:52] :)
[00:15:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[00:17:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[00:22:13] uh oh
[00:27:02] (PS1) Tim Starling: Add html5depurate::monitoring [puppet] - https://gerrit.wikimedia.org/r/301526
[00:28:04] (CR) jenkins-bot: [V: -1] Add html5depurate::monitoring [puppet] - https://gerrit.wikimedia.org/r/301526 (owner: Tim Starling)
[00:28:36] (PS1) Yuvipanda: icinga: Take yuvipanda out of paging for ores [puppet] - https://gerrit.wikimedia.org/r/301527
[00:29:06] (PS5) BryanDavis: [WIP] Provision Striker via scap3 [puppet] - https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014)
[00:30:41] (PS1) Dzahn: planet: switch many feed URLs to https [puppet] - https://gerrit.wikimedia.org/r/301528 (https://phabricator.wikimedia.org/T141480)
[00:32:17] (PS2) Tim Starling: Add html5depurate::monitoring [puppet] - https://gerrit.wikimedia.org/r/301526
[00:33:21] (CR) jenkins-bot: [V: -1] Add html5depurate::monitoring [puppet] - https://gerrit.wikimedia.org/r/301526 (owner: Tim Starling)
[00:34:20] (PS3) Tim Starling: Add html5depurate::monitoring [puppet] - https://gerrit.wikimedia.org/r/301526
[00:36:33] (CR) Tim Starling: [C: 2] Add html5depurate::monitoring [puppet] - https://gerrit.wikimedia.org/r/301526 (owner: Tim Starling)
[00:38:48] (PS2)
Dzahn: planet: switch many feed URLs to https [puppet] - https://gerrit.wikimedia.org/r/301528 (https://phabricator.wikimedia.org/T141480)
[00:40:35] (CR) Dzahn: [C: 2] planet: switch many feed URLs to https [puppet] - https://gerrit.wikimedia.org/r/301528 (https://phabricator.wikimedia.org/T141480) (owner: Dzahn)
[00:41:04] (PS3) Dzahn: planet: switch many feed URLs to https [puppet] - https://gerrit.wikimedia.org/r/301528 (https://phabricator.wikimedia.org/T141480)
[00:43:43] (PS1) Yuvipanda: tools: Reduce number of days central logs are kept to 3 [puppet] - https://gerrit.wikimedia.org/r/301529 (https://phabricator.wikimedia.org/T141270)
[00:43:45] (PS2) Yuvipanda: icinga: Take yuvipanda out of paging for ores [puppet] - https://gerrit.wikimedia.org/r/301527
[00:43:53] (CR) Yuvipanda: [C: 2 V: 2] icinga: Take yuvipanda out of paging for ores [puppet] - https://gerrit.wikimedia.org/r/301527 (owner: Yuvipanda)
[00:44:26] (PS2) Yuvipanda: tools: Reduce number of days central logs are kept to 3 [puppet] - https://gerrit.wikimedia.org/r/301529 (https://phabricator.wikimedia.org/T141270)
[00:44:30] (CR) Yuvipanda: [C: 2 V: 2] tools: Reduce number of days central logs are kept to 3 [puppet] - https://gerrit.wikimedia.org/r/301529 (https://phabricator.wikimedia.org/T141270) (owner: Yuvipanda)
[00:45:43] TimStarling mutante I puppet merged both your changes
[00:46:10] thanks
[00:46:44] yep, cool
[00:47:12] (PS2) Dzahn: ipmi: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298902
[00:48:50] (CR) Dzahn: [C: 2] ipmi: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298902 (owner: Dzahn)
[00:55:55] (PS3) Yuvipanda: tools: Reduce number of days central logs are kept to 3 [puppet] - https://gerrit.wikimedia.org/r/301529 (https://phabricator.wikimedia.org/T141270)
[00:55:58] (CR) Yuvipanda: [V: 2] tools: Reduce number of days central logs are
kept to 3 [puppet] - https://gerrit.wikimedia.org/r/301529 (https://phabricator.wikimedia.org/T141270) (owner: Yuvipanda)
[00:56:08] fuck you too, Gerrit
[00:58:41] (PS1) Tim Starling: html5depurate: notify on config changes and update security.policy [puppet] - https://gerrit.wikimedia.org/r/301531
[01:02:58] (CR) Tim Starling: [C: 2] html5depurate: notify on config changes and update security.policy [puppet] - https://gerrit.wikimedia.org/r/301531 (owner: Tim Starling)
[01:11:31] (PS1) Yuvipanda: tools: Centrally collect all logs from k8s related nodes [puppet] - https://gerrit.wikimedia.org/r/301532 (https://phabricator.wikimedia.org/T141270)
[01:13:08] (PS2) Yuvipanda: tools: Centrally collect all logs from k8s related nodes [puppet] - https://gerrit.wikimedia.org/r/301532 (https://phabricator.wikimedia.org/T141270)
[01:14:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[01:14:42] (PS3) Yuvipanda: tools: Centrally collect all logs from k8s related nodes [puppet] - https://gerrit.wikimedia.org/r/301532 (https://phabricator.wikimedia.org/T141270)
[01:19:52] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: puppet fail
[01:22:12] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[01:22:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[01:29:15] (PS1) Yuvipanda: k8s: Don't tell kubelet to read manifests locally [puppet] - https://gerrit.wikimedia.org/r/301533
[01:29:41] (PS4) Yuvipanda: tools: Centrally collect all logs from k8s related nodes [puppet] - https://gerrit.wikimedia.org/r/301532 (https://phabricator.wikimedia.org/T141270)
[01:29:49] (CR) Yuvipanda: [C: 2 V: 2] tools: Centrally collect all logs from k8s related nodes [puppet] - https://gerrit.wikimedia.org/r/301532 (https://phabricator.wikimedia.org/T141270) (owner: Yuvipanda)
[01:29:55] !log Deploying #phab-2016.30 (https://phabricator.wikimedia.org/project/profile/2118/) - no downtime is expected.
[01:29:59] (PS2) Yuvipanda: k8s: Don't tell kubelet to read manifests locally [puppet] - https://gerrit.wikimedia.org/r/301533
[01:30:03] (CR) Yuvipanda: [C: 2 V: 2] k8s: Don't tell kubelet to read manifests locally [puppet] - https://gerrit.wikimedia.org/r/301533 (owner: Yuvipanda)
[01:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:34:42] (PS1) Yuvipanda: tools: Don't load kernel logging module explicitly [puppet] - https://gerrit.wikimedia.org/r/301534
[01:36:32] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge.
[01:36:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[01:36:32] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[01:46:11] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[01:49:07] (PS2) Yuvipanda: tools: Don't load kernel logging module explicitly [puppet] - https://gerrit.wikimedia.org/r/301534
[02:01:55] (PS3) Dzahn: jsbench: add systemd compat for jsbench-browser [puppet] - https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023)
[02:02:03] (PS4) Dzahn: jsbench: add systemd compat for jsbench-browser [puppet] - https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023)
[02:02:54] (CR) Dzahn: [C: 2] jsbench: add systemd compat for jsbench-browser [puppet] - https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023) (owner: Dzahn)
[02:12:51] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[02:13:16] !log restarted apache2 on iridium to deploy 4305a9bb0300650ea40de433261c7e59cc88e4bc
[02:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:14:51] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:18:46] (PS1) Tim Starling: html5depurate: fix missing equals sign [puppet] - https://gerrit.wikimedia.org/r/301535
[02:24:02] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 08m 49s)
[02:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:05] Operations, Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#2500885 (Phabricator_maintenance)
[02:35:34] Operations, DBA, Labs, Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2500925 (Phabricator_maintenance)
[02:41:06] !log mwdeploy@tin scap sync-l10n completed
(1.28.0-wmf.12) (duration: 08m 03s)
[02:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:44:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 1, unused: 0
[02:47:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jul 28 02:47:28 UTC 2016 (duration 6m 22s)
[02:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:21] (CR) Tim Starling: [C: 2] html5depurate: fix missing equals sign [puppet] - https://gerrit.wikimedia.org/r/301535 (owner: Tim Starling)
[03:10:53] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[03:10:53] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[03:11:01] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[03:12:52] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge.
[03:12:52] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[03:12:52] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[03:13:58] (PS1) Tim Starling: html5depurate: Fix security.policy for non-localhost clients [puppet] - https://gerrit.wikimedia.org/r/301537
[03:20:45] (CR) Tim Starling: [C: 2] html5depurate: Fix security.policy for non-localhost clients [puppet] - https://gerrit.wikimedia.org/r/301537 (owner: Tim Starling)
[03:53:22] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:55:03] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:19:12] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[04:20:53] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[05:56:15] !log shutting down db1055 for upgrade
[05:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:01:36] PROBLEM - MariaDB Slave IO: s1 on db1055 is CRITICAL: CRITICAL slave_io_state could not connect
[06:02:17] PROBLEM - MariaDB Slave SQL: s1 on db1055 is CRITICAL: CRITICAL slave_sql_state could not connect
[06:04:54] (PS2) Giuseppe Lavagetto: puppetmaster: pin all the needed packages [puppet] - https://gerrit.wikimedia.org/r/301412
[06:31:02] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[06:31:32] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:02] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:21] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:33] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:43] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:23] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:39] <_joe_> !log refreshed puppet facts for the compiler
[06:36:43] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:36:46] !log installing perl security updates on eqiad and codfw jessie systems
[06:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:37:43] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster: pin all the needed packages [puppet] - https://gerrit.wikimedia.org/r/301412 (owner: Giuseppe Lavagetto)
[06:44:45] <_joe_> !log installed puppet 3.8 from backports on rhodium
[06:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:45:57] (PS2) Giuseppe Lavagetto: puppetmaster: fix master.conf on 3.7+ servers [puppet] - https://gerrit.wikimedia.org/r/301414
[06:49:05] (PS3) Giuseppe Lavagetto: puppetmaster: fix master.conf on 3.7+ servers [puppet] - https://gerrit.wikimedia.org/r/301414
[06:54:27] !log installing java security updates on restbase staging systems
[06:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:14] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster: fix master.conf on 3.7+ servers [puppet] - https://gerrit.wikimedia.org/r/301414 (owner: Giuseppe Lavagetto)
[06:56:42] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:57:11] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:11] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:22] RECOVERY - puppet last run on
mw2250 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:57] does someone know why log messages are now (after upgrade?) truncated in kibana or logstash?
[07:15:23] (PS1) Elukey: Remove references of decom api servers mw1114->mw1148 [puppet] - https://gerrit.wikimedia.org/r/301549 (https://phabricator.wikimedia.org/T139353)
[07:16:14] <_joe_> !log regenerated the ssl key for rhodium, 1024 bits
[07:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:22:53] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:40:53] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:42:24] (PS1) Raimond Spekking: Labs: Set CategoryCollation for dewiki to 'uca-de-u-kn' [mediawiki-config] - https://gerrit.wikimedia.org/r/301550 (https://phabricator.wikimedia.org/T128806)
[07:43:17] !log starting decom process for old api servers - mw11(1[4-9]|20|3[0-9]|4[0-8]).eqiad.wmnet (tracked in https://etherpad.wikimedia.org/p/appservers-decom)
[07:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:55:21] (Abandoned) Alexandros Kosiaris: puppetmaster: Conditionally set always_cache_features [puppet] - https://gerrit.wikimedia.org/r/301413 (https://phabricator.wikimedia.org/T98173) (owner: Alexandros Kosiaris)
[08:02:56] heads up to folks, my sleep schedule is very screwed up (migraines). going to try time-shifting my schedule today
[08:04:27] (PS1) Muehlenhoff: Update to Linux 4.4.16 [debs/linux44] - https://gerrit.wikimedia.org/r/301552
[08:07:38] Operations, Continuous-Integration-Infrastructure, Packaging, Traffic: piuparts fail with WARN: Broken symlinks: /etc/systemd/system... - https://phabricator.wikimedia.org/T141454#2501246 (hashar) syslog-ng fixed the warning by adding `rm` and `rmdir` in the package `postrm` step: http://anonscm....
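Editorial aside on the host pattern in the 07:43:17 !log entry: the short Python sketch below expands it to see which hostnames it would actually cover. It is only an illustration. The pattern is copied from the log entry, except that the dots are escaped here, and the helper name `decom_hosts` and the scanned range 1100-1199 are assumptions made for the example. As written, the pattern appears to skip mw1121 through mw1129, while the Gerrit change title above speaks of mw1114->mw1148; the log itself does not say whether that gap is intentional.

```python
import re

# Pattern taken from the 07:43:17 !log entry. The dots are escaped here
# (an editorial choice) so that "." only matches a literal dot.
DECOM_PATTERN = re.compile(r"mw11(1[4-9]|20|3[0-9]|4[0-8])\.eqiad\.wmnet")


def decom_hosts(first=1100, last=1199):
    """Return the host numbers in [first, last] whose FQDN matches the pattern.

    The function name and the default range are hypothetical, chosen only
    for this sketch.
    """
    return [
        n
        for n in range(first, last + 1)
        if DECOM_PATTERN.fullmatch(f"mw{n}.eqiad.wmnet")
    ]


hosts = decom_hosts()
# Show how many hosts match, plus the lowest and highest matching numbers.
print(len(hosts), hosts[0], hosts[-1])
# List the numbers between the first and last match that the pattern skips.
print([n for n in range(hosts[0], hosts[-1] + 1) if n not in hosts])
```

Using `fullmatch` rather than `search` keeps a longer name such as a hypothetical `mw11140.eqiad.wmnet` from matching by accident.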
[08:09:47] akosiaris: hello, are you the one reviewing the apertium packages with kart_ ?
[08:10:22] I got a the debian job added on all the repos, but not sure how helpful it is
[08:10:40] hashar: yes I am
[08:10:49] that actually sounds helpful
[08:11:00] what did it take to do it ?
[08:11:09] magic!
[08:11:19] started as a summer project by Azathoth back in 2013
[08:11:44] then your work on package_builder puppet module that inject hooks to recognize foo-wikimedia merely solved most of it
[08:12:05] until I found some spare time to rebuild the cow images on jessie instances, refactor the Jenkins job to use the latest jenkins-debian-glue version
[08:12:09] and it magically happened
[08:12:42] the job build on the distribution mentionned in the debian/changelog and I think apertium was targeting 'unstable'
[08:12:50] but kart_ supposedly switched to jessie
[08:13:43] yup
[08:13:46] so ... https://gerrit.wikimedia.org/r/#/q/owner:+kartik+status:open+projects:operations/debs
[08:13:56] all those v+2 are by jenkins ?
[08:14:11] yeah
[08:14:16] hashar: you rock!
[08:14:20] though the job result is ignored so jenkins ALWAYS V+2
[08:14:34] ?
[08:14:35] I am hopping package maintainers to look at the output, maybe try to fix them
[08:14:41] and later on make the job voting for realy
[08:14:43] real
[08:14:58] example https://gerrit.wikimedia.org/r/#/c/296229/
[08:15:12] it says "main build succeeded"
[08:15:36] but: debian-glue FAILURE in 17s (non-voting)
[08:15:52] and correctly
[08:16:11] giving a -1 on that one would be correct
[08:16:22] so it is not going to annoy patches submitters
[08:16:23] (CR) Muehlenhoff: [C: 2] Update to Linux 4.4.16 [debs/linux44] - https://gerrit.wikimedia.org/r/301552 (owner: Muehlenhoff)
[08:17:28] hashar: so, where can I get the config for that ?
[08:17:46] https://gerrit.wikimedia.org/r/integration/config.git ?
[08:18:20] yeah
[08:18:25] akosiaris: jjb/operations-debs.yaml
[08:18:50] https://github.com/wikimedia/integration-config/blob/master/jjb/operations-debs.yaml
[08:18:53] akosiaris: ^^
[08:19:34] (PS1) Muehlenhoff: Add CVE ID to changelog which was assigned retroactively [debs/linux44] - https://gerrit.wikimedia.org/r/301553
[08:19:37] the environment variables based settings are rather a mess. Each of cowbuilder/pbuilder/jenkins-debian-glue having their own set and sudo does not necessarily pass them around :D
[08:20:11] moritzm: out of curiosity is there any reason the debian/changelog for linux is stuck to 4.4.2 when the changelog says 4.4.16?
[08:22:35] hashar: 4.4.2 is the underlying tarball, it gets updated to 4.4.16 via patches, the kernel tarballs are really big (around 90M) and there's no real benefit to migrate to a new orig tarball every time
[08:24:11] (CR) Muehlenhoff: [C: 2] Add CVE ID to changelog which was assigned retroactively [debs/linux44] - https://gerrit.wikimedia.org/r/301553 (owner: Muehlenhoff)
[08:25:09] moritzm: so it is really 4.4.2 + patch for 4.4.2..4.4.3 + patch 4.4.3..4.4.4 ?
[08:25:11] etc
[08:28:22] yeah, the 4.4.2 tarball (the last version in Debian sid until it moved to 4.5) plus all incremental patches towards 4.4.16
[08:29:02] and to build our package one "just" has to download the 4.4.2 tarball and point gbp to it right?
[08:30:17] (PS1) Giuseppe Lavagetto: Add puppetmaster.test service alias [dns] - https://gerrit.wikimedia.org/r/301554
[08:30:20] <_joe_> akosiaris: ^^
[08:30:42] hashar: say I want to make the jenkins jobs for operations/debs/contenttranslation/* voting
[08:30:50] how would I do that ?
[08:31:06] I see zuul/layout.yaml calls test: ['debian-glue']
[08:31:12] should I create a new one ?
[08:31:20] akosiaris: I thought about that. I would duplicate a job named debian-glue-non-voting
[08:31:33] make debian-glue voting and debian-glue-non-voting ...
non voting
[08:31:41] then in the zuul/layout change the names accordingly
[08:31:50] ok I 'll try that then
[08:31:51] I dont think there is a way to change the voting status based on a repo
[08:31:58] will probably ping you soon
[08:32:01] thanks!
[08:32:11] well maybe there is but I am pretty sure it is going to be a very weird bug prone hack
[08:32:40] akosiaris: YAML aliasing is perfect to avoid code duplication
[08:35:36] hashar: all the patch application is handled inside the rules file, from gbp's perpective it doesn't matter whether which kind of patches are applied
[08:37:21] (PS1) KartikMistry: Deploy Compact Language Links as default (Stage 5) [mediawiki-config] - https://gerrit.wikimedia.org/r/301555 (https://phabricator.wikimedia.org/T136677)
[08:39:50] (or which tarball is used)
[08:48:54] moritzm: and does it make any sense to have Jenkins to attempt to build the linux4 kernel when patches are proposed?
[08:49:08] not sure how long it will take or whether lintian is going to be of any help
[08:49:36] !log swift eqiad-prod: ms-be102[3456] weight 3000
[08:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:49:56] lintian won't be of any help for the kernel, we only update the kernel source and I'm not sure how lintian-clean it's to begin with
[08:50:37] also a kernel build requires quite some diskspace due to dbg packages and the kernel being big in kernel (also also takes about 3hrs to build on copper)
[08:50:51] and probably longer on the CI host
[08:51:51] it's probably best to exclude "linux" and "linux44" at this point
[08:52:14] (CR) Alexandros Kosiaris: [C: 1] Add puppetmaster.test service alias [dns] - https://gerrit.wikimedia.org/r/301554 (owner: Giuseppe Lavagetto)
[08:56:01] (CR) Giuseppe Lavagetto: [C: 2] Add puppetmaster.test service alias [dns] - https://gerrit.wikimedia.org/r/301554 (owner: Giuseppe Lavagetto)
[09:10:10] PROBLEM - Juniper alarms on
asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[09:12:00] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[09:13:57] (PS1) Giuseppe Lavagetto: puppetmaster: add alt names support to puppetmaster.test [puppet] - https://gerrit.wikimedia.org/r/301560
[09:14:49] (CR) Nikerabbit: [C: 1] Deploy Compact Language Links as default (Stage 5) [mediawiki-config] - https://gerrit.wikimedia.org/r/301555 (https://phabricator.wikimedia.org/T136677) (owner: KartikMistry)
[09:17:44] (PS2) Giuseppe Lavagetto: puppetmaster: add alt names support to puppetmaster.test [puppet] - https://gerrit.wikimedia.org/r/301560
[09:18:39] (PS2) KartikMistry: Deploy Compact Language Links as default (Stage 5) [mediawiki-config] - https://gerrit.wikimedia.org/r/301555 (https://phabricator.wikimedia.org/T136677)
[09:22:22] (CR) Filippo Giunchedi: [C: 1] relaxed directory permissions [puppet] - https://gerrit.wikimedia.org/r/301502 (owner: Eevans)
[09:22:49] (PS3) Giuseppe Lavagetto: puppetmaster: add alt names support to puppetmaster.test [puppet] - https://gerrit.wikimedia.org/r/301560
[09:22:56] (CR) Giuseppe Lavagetto: [C: 2 V: 2] puppetmaster: add alt names support to puppetmaster.test [puppet] - https://gerrit.wikimedia.org/r/301560 (owner: Giuseppe Lavagetto)
[09:24:02] <_joe_> a brief swarm of puppet failures might happen
[09:25:23] !log installing PHP security updates
[09:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:29:59] (CR) Filippo Giunchedi: [C: 2] Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - https://gerrit.wikimedia.org/r/301409 (owner: Chad)
[09:30:08] (CR) Filippo Giunchedi: [C: 1] Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - https://gerrit.wikimedia.org/r/301409 (owner: Chad)
[09:45:37] (CR) Giuseppe Lavagetto: [C: 1] "I am unsure what is wrong with the current manifests, as no production machine showed this problem - but this won't harm I guess; on the o" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/301404 (owner: BryanDavis)
[09:50:30] Operations, Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2501317 (ema) >>! In T141373#2498969, @ema wrote: > Next steps: > > - port the test case to v4 and see how varnish 4 behaves https://phabricator.wikimedia.org/P3592 is a functional...
[09:53:01] hashar: https://gerrit.wikimedia.org/r/301566 and https://gerrit.wikimedia.org/r/301567 :-)
[09:55:21] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: puppet fail
[10:05:05] (CR) Alexandros Kosiaris: [C: 1] relaxed directory permissions [puppet] - https://gerrit.wikimedia.org/r/301502 (owner: Eevans)
[10:05:06] Operations, ops-codfw, Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2501335 (Joe) Open>Resolved
[10:06:08] Operations, Services, Patch-For-Review, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#2501336 (Joe) @GWicke nothing left now as we've separated service-checker to be an external repository
[10:06:16] Operations, Services, Patch-For-Review, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#2501337 (Joe) Open>Resolved
[10:07:03] (PS1) Gilles: Add .gitreview [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/301569
[10:07:20] (CR) Gilles: [C: 2] Add .gitreview [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/301569 (owner: Gilles)
[10:07:34] (CR) Gilles: [V: 2] Add .gitreview [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/301569 (owner: Gilles)
[10:10:52] (PS1) Gilles: Copy over from https://github.com/gi11es/thumbor-debian [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/301570
[10:11:53] (CR) Giuseppe Lavagetto: [C: 1] Remove references of decom api servers mw1114->mw1148 [puppet] - https://gerrit.wikimedia.org/r/301549 (https://phabricator.wikimedia.org/T139353) (owner: Elukey)
[10:14:48] (CR) Filippo Giunchedi: [C: 2 V: 2] Copy over from https://github.com/gi11es/thumbor-debian [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/301570 (owner: Gilles)
[10:32:17] akosiaris: https://gerrit.wikimedia.org/r/#/c/301566/1/jjb/operations-debs.yaml is off :D
[10:32:57] akosiaris: the project part is merely to group similar jobs. the name: is overridden and both generate the same job :} But that is the idea yea. I am going to amend
[10:37:22] (PS2) Muehlenhoff: Remove DNS entries for old trusty scalers [dns] - https://gerrit.wikimedia.org/r/301406 (https://phabricator.wikimedia.org/T141352)
[10:39:42] akosiaris: yaml aliasing + mapping merging for the win https://gerrit.wikimedia.org/r/#/c/301566/1..2/jjb/operations-debs.yaml :D
[10:39:48] thanks for the patch
[10:43:02] Blocked-on-Operations, Operations, Kartographer, Wikimedia-Extension-setup, and 4 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2501350 (Esanders) Adding the features to support the most basic use case is not "perfecting it". We...
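Editorial aside on the "yaml aliasing + mapping merging" remark at 10:39:42: the sketch below mimics, with plain Python dicts, what a YAML loader does when one mapping pulls in an anchored mapping through a `<<:` merge key and then overrides a single field. It is not the content of jjb/operations-debs.yaml. The job names `debian-glue` and `debian-glue-non-voting` come from the discussion above, but the `node` key, its value and the helper `merge_mapping` are hypothetical and exist only for this illustration.

```python
# Roughly equivalent YAML (hypothetical keys, for illustration only):
#
#   - job: &debian_glue
#       name: debian-glue
#       node: jessie-slave
#   - job:
#       <<: *debian_glue
#       name: debian-glue-non-voting


def merge_mapping(base, overrides):
    """Mimic a YAML merge key: copy `base`, then let explicit keys win."""
    merged = dict(base)       # shallow copy, so the anchored mapping stays untouched
    merged.update(overrides)  # keys set next to the merge key take precedence
    return merged


# The "anchored" job definition; the node label is an assumed placeholder.
debian_glue = {"name": "debian-glue", "node": "jessie-slave"}

# The second job reuses everything and only changes its name.
debian_glue_non_voting = merge_mapping(
    debian_glue, {"name": "debian-glue-non-voting"}
)

print(debian_glue_non_voting)
```

The point of the pattern is that the shared settings are written once; which of the two jobs actually votes would then be decided in the Zuul layout, as described at 08:31:41.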
[10:43:32] (PS2) Elukey: Remove references of decom api servers mw1114->mw1148 [puppet] - https://gerrit.wikimedia.org/r/301549 (https://phabricator.wikimedia.org/T139353)
[10:45:43] (CR) Elukey: [C: 2] Remove references of decom api servers mw1114->mw1148 [puppet] - https://gerrit.wikimedia.org/r/301549 (https://phabricator.wikimedia.org/T139353) (owner: Elukey)
[10:50:31] Operations, Wikimedia-Apache-configuration, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2501353 (Joe) a:Joe
[10:52:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[10:53:38] FYI I'm going to merge https://gerrit.wikimedia.org/r/#/c/282356/ shortly, statsd will be impacted briefly in production, labs won't be affected
[10:55:02] (PS3) Filippo Giunchedi: Add a new statsd_proxy module and replace statsdlb [puppet] - https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) (owner: Faidon Liambotis)
[10:55:13] (CR) Filippo Giunchedi: [C: 2 V: 2] Add a new statsd_proxy module and replace statsdlb [puppet] - https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) (owner: Faidon Liambotis)
[10:55:31] (CR) Hashar: "recheck" [debs/contenttranslation/giella-core] - https://gerrit.wikimedia.org/r/294426 (https://phabricator.wikimedia.org/T120087) (owner: KartikMistry)
[10:55:36] (PS1) Gilles: Upgrade to plugins 0.1.7 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/301573
[10:55:38] (CR) Hashar: "recheck" [debs/contenttranslation/hfst-ospell] - https://gerrit.wikimedia.org/r/296231 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry)
[10:55:49] (CR) Hashar: "recheck" [debs/contenttranslation/foma] -
10https://gerrit.wikimedia.org/r/295183 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [10:56:00] (03CR) 10Hashar: "recheck" [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/293494 (https://phabricator.wikimedia.org/T95653) (owner: 10KartikMistry) [10:56:10] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/293497 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:56:24] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/293507 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:56:34] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/294020 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:56:54] (03CR) 10Hashar: "recheck" [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/293484 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [10:57:45] (03CR) 10Hashar: "recheck" [debs/contenttranslation/giella-sme] - 10https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [10:58:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:58:27] hashar: I forgot about zuul merger /o\ [10:58:31] do you still need to do it? [10:58:37] hashar: is that going to queue builds or do in parallel? 
[10:58:37] oh man you are so right [10:58:42] elukey: completely forgot about it as well [10:59:13] godog: in a queue handled by Zuul and processed in parallel on two Jessie slaves each having 1 executor slot [10:59:22] godog: so a queue of 10 being consumed by 2 x 1 executors [10:59:53] the queue being reflected at https://integration.wikimedia.org/zuul/ and processing in the Jenkins executor status at https://integration.wikimedia.org/ci/ [11:00:02] (03PS2) 10Gilles: Upgrade to plugins 0.1.7 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 [11:00:33] if the deb building prove to be useful, it is rather trivial to add more slaves. Maybe have each instance process 2 in parallel [11:00:55] elukey: I have to double check and maybe rebuild the Zuul package for Jessie though [11:01:13] hashar: sure, let's re-sync when you are ready to do the work [11:01:43] elukey: give me an hour. Will grab a snack, hook on my home computer and do the merge/patchset/quilt dance :D [11:02:06] life pro tip: I have learned about git-buildpackage patch queue system which make it a breeze to manage patches [11:02:08] hashar: \o/ thanks for the amend. I can't say I understood what jjb did there and why my change was not good enough though [11:02:12] (03PS1) 10Filippo Giunchedi: statsd_proxy: remove dependency on /etc/default/statsd-proxy [puppet] - 10https://gerrit.wikimedia.org/r/301574 [11:02:32] akosiaris: yeah the YAML based DSL is a bit awkward :-/ That comes with practice [11:02:51] akosiaris: the CI folks usually welcome patches and polish/tweak them then deploy&& merge [11:03:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsd_proxy: remove dependency on /etc/default/statsd-proxy [puppet] - 10https://gerrit.wikimedia.org/r/301574 (owner: 10Filippo Giunchedi) [11:03:04] so.. that project thing is not an array ? 
[11:03:23] aaarg, no it's not [11:03:25] my fault [11:03:31] the idea is the 'project' is given a "name"; that name is then passed to the job template as a variable "name" [11:03:38] even I got fooled by yaml [11:03:42] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: puppet fail [11:03:47] yeah I see it now [11:03:54] so if you have project {name: "mediawiki" } then in the list of jobs list the template: {name}-phplint [11:04:02] JJB will realize the job 'mediawiki-phplint' [11:04:35] to reduce code duplication. I found out that PyYAML supports aliasing and mapping merging which is totally a hack/non trivial but is in the spec :D [11:05:07] so what I did is create a job-template 'debian-glue-non-voting' which is populated from all the content of the 'debian-glue' [11:05:08] so, the project we set up there though will now spawn 2 debian-glue tests, right ? [11:05:21] then have the project 'debian-glue' to realize both of the job templates [11:05:23] one debian-glue and one debian-glue-non-voting [11:05:37] JJB is only to generate the jobs [11:05:42] the workflow is handled in Zuul zuul/layout.yaml [11:05:55] that is where we manually (bah) associate a repository to a set of jobs [11:06:08] so we could have tens of thousands of jobs generated but only trigger a couple of them [11:06:17] ah so we don't associate to a jjb project [11:06:22] yeah [11:06:26] ok then [11:06:26] though that could be possible :} [11:06:39] in theory we could inspect the JJB projects / jobs associations [11:06:51] and use that to automatically generate / amend the Zuul layout [11:07:27] so if you create a JJB definition such as : project: { name: 'operations-debs-varnish4', jobs: ['debian-glue'] } [11:07:31] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:07:45] we should be able to interpolate that in Zuul we need operations/debs/varnish4 to trigger that 'debian-glue' job [11:08:33] 
07Blocked-on-Operations, 06Operations, 10Kartographer, 10Wikimedia-Extension-setup, and 4 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2501370 (10Esanders) Already the second comment on ca.wiki: https://ca.wikipedia.org/wiki/Tema:T878phn... [11:09:04] !log testing schema change on db2038 [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:09:42] PROBLEM - statsdlb process on graphite2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [11:10:43] elukey: so lets aim at 2pm :} heading out to get a snack [11:10:53] (03PS1) 10Filippo Giunchedi: statsd_proxy: fix config template [puppet] - 10https://gerrit.wikimedia.org/r/301576 [11:11:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsd_proxy: fix config template [puppet] - 10https://gerrit.wikimedia.org/r/301576 (owner: 10Filippo Giunchedi) [11:15:58] (03PS3) 10Gilles: Upgrade to plugins 0.1.7 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 [11:17:09] !log replace statsdlb with statsd-proxy on graphite1001 [11:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:46] (03PS4) 10Gilles: Upgrade to plugins 0.1.7 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 [11:18:22] !log deploying schema change to all ores databases T140803 [11:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:40] T140803: Remove oresc_rev index - https://phabricator.wikimedia.org/T140803 [11:20:31] PROBLEM - statsdlb process on graphite2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [11:22:03] that's me, bogus alert [11:24:01] PROBLEM - statsdlb process on graphite1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [11:28:30] PROBLEM - statsdlb process on graphite1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name 
statsdlb [11:35:51] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission all old mediawiki appservers in eqiad - https://phabricator.wikimedia.org/T139353#2501400 (10elukey) All done in https://etherpad.wikimedia.org/p/appservers-decom, last step is to remove DNS entries. [11:35:58] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission all old mediawiki appservers in eqiad - https://phabricator.wikimedia.org/T139353#2501401 (10elukey) [11:37:10] (03CR) 10Hashar: "recheck" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 (owner: 10Gilles) [11:46:18] (03PS1) 10Paladox: Support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [11:48:51] 06Operations, 10DBA, 06Revision-Scoring-As-A-Service, 07Blocked-on-schema-change, and 3 others: Remove oresc_rev index - https://phabricator.wikimedia.org/T140803#2501457 (10jcrespo) 05Open>03Resolved a:03jcrespo The schema change seems to have been succesful: ``` $ mysql -h s5-master wikidatawiki -e... [11:48:58] (03PS2) 10Paladox: Support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [11:49:08] (03PS1) 10Hashar: Basic gbp.conf / build for jessie-wikimedia [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301581 [11:49:33] (03CR) 10Hashar: "Also git build package default to build the tarball from a tag upstream/(version). Might want to point to the local HEAD. Within debian/gb" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 (owner: 10Gilles) [11:49:59] hashar: so, about the chicken+egg problem.... I know :-(. Can't be solved easily though. I 'll be solving it manually [11:50:36] akosiaris: also bunch of repos are not passing due to random reasons [11:50:56] akosiaris: be it lintian or piuparts. 
So maybe we want to hold a bit on making the job voting for content translation repos [11:51:10] 00:00:03.369 gbp:error: HEAD is not a valid branch [11:51:11] bah [11:51:16] works for me ™ [11:51:24] hashar: no don't. The idea is to get them fixed [11:51:50] so having it actually reporting the issues is great! [11:51:51] isn't it going to annoy you guys to have Jenkins vote -1 constantly ? [11:52:01] it's just me and kart [11:52:04] okkk [11:52:05] (03CR) 10Gilles: "python-derpconf is hosted on jessie-backports, not jessie-wikimedia: https://packages.debian.org/jessie-backports/python-derpconf" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 (owner: 10Gilles) [11:52:12] you can always drop the jenkins-bot vote [11:52:21] will survive for the few days until we get them working fine [11:52:23] ahh jessie-backports bah :( [11:52:53] gilles: I am not sure the pbuilder hook we have inject jessie-backports :/ [11:55:07] hashar: \o/ yey, thanks [11:55:23] in theory [11:55:33] one could craft the newer version of apertium [11:55:36] CR+2 it [11:55:43] then CR+2 the other patches that depends it [11:55:58] which would clone the apertium change ahead of it, build the deb and inject it in the cowbuilder env [11:56:13] or we could build them all in a single job :D [11:57:25] so package_builder supports APT_USE_BUILT=yes [11:57:44] so that followup builds use the results of the previous ones [11:58:02] not sure how that could work with jenkins though [11:58:36] elukey: so I am doing the zuul jessie packaging stuff and that it is non trivial :( [12:00:03] hashar what does non trivial mean, dosent work? 
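Note on the JJB exchange at 11:03 to 11:05 above: a project's "name" is substituted into each job template it lists, and the 'debian-glue-non-voting' template reuses the whole content of 'debian-glue'. A minimal Python sketch of both ideas follows. It is not Jenkins Job Builder code: `realize_jobs`, the `builders` key and its value are hypothetical names for illustration, and the dict merge only imitates what YAML aliasing plus mapping merging achieves.

```python
def realize_jobs(project):
    """Expand every job template name using the project's variables."""
    variables = {key: value for key, value in project.items() if key != "jobs"}
    return [template.format(**variables) for template in project["jobs"]]


# A project named "mediawiki" that lists the template "{name}-phplint".
mediawiki = {"name": "mediawiki", "jobs": ["{name}-phplint"]}
print(realize_jobs(mediawiki))

# Imitation of YAML mapping merging: the non-voting template copies every
# key of the original and overrides only its own name.
debian_glue = {"name": "debian-glue", "builders": ["hypothetical-build-step"]}
debian_glue_non_voting = {**debian_glue, "name": "debian-glue-non-voting"}
print(debian_glue_non_voting)
```

As stated in the log, JJB only generates the jobs; which repository triggers which job is decided separately in zuul/layout.yaml, so nothing in this sketch decides what actually runs.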
[12:04:10] akosiaris: jobs are now voting :) [12:04:22] :-) [12:05:30] paladox: gotta merge the cruft of precise-wikimedia into jessie-wikimedia and double checks the dependencies :D [12:05:37] Oh [12:05:39] ok [12:05:40] :) [12:06:09] (03CR) 10Paladox: "I have tested all this and it works except from when you comment for example" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [12:13:01] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2501550 (10hashar) [12:14:05] !log installing updates on mendelevium [12:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:14:17] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2479965 (10hashar) Since I have filled this task, the zuul server on gallium (Precise) has been upgraded to further c... [12:16:25] elukey: I got the jessie package build and it looks to do something https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-391-gbc58ea3-jessie/ [12:16:32] elukey: havent installed it on labs yet though :) [12:19:03] 06Operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor throughput towards some destinations - https://phabricator.wikimedia.org/T120425#2501557 (10mark) >>! In T120425#2444215, @Nemo_bis wrote: > Yes AFAICT, from a quick check. What makes you think it wouldn't be? Ops,... 
[12:21:01] (03Draft1) 10Giuseppe Lavagetto: WiP add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 [12:21:35] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1330890 (10fgiunchedi) [12:21:37] 06Operations, 10Graphite, 13Patch-For-Review: add more statsdlb instances for more throughput - https://phabricator.wikimedia.org/T126447#2501575 (10fgiunchedi) 05Open>03Resolved Resolving this since it essentially happened by replacing statsdlb with statsd_proxy, followup on T101141 [12:23:13] 06Operations, 10netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2501583 (10mark) [12:25:20] (03PS1) 10Alexandros Kosiaris: palladium: puppetmaster-test instead of puppetmaster.test [puppet] - 10https://gerrit.wikimedia.org/r/301585 (https://phabricator.wikimedia.org/T98173) [12:25:42] (03CR) 10jenkins-bot: [V: 04-1] palladium: puppetmaster-test instead of puppetmaster.test [puppet] - 10https://gerrit.wikimedia.org/r/301585 (https://phabricator.wikimedia.org/T98173) (owner: 10Alexandros Kosiaris) [12:26:12] elukey: tested on labs zuul upgrade on scandium is ready to go. Poke me :-} [12:27:42] hashar: here I am [12:29:14] lintian and debdiff looks all right? [12:29:33] (I am pretending to know something about debian packaging) [12:30:20] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2501646 (10fgiunchedi) I've tried replacing statsdlb with statsd_proxy today and so far so good. I can still see the occasional errors but an orde... [12:31:31] elukey: [12:31:33] yes [12:31:46] elukey: doesn't come with proper systemd integration though [12:32:11] the mess of this package is zuul requires a bunch of python modules that are not on apt.wm.o and we can't really upgrade there. 
So they ended up being embedded in the .deb package :( [12:32:21] that is usually the main source of problems [12:32:34] yeah we have the same issue with java base apps in analytics [12:32:50] (for example druid) [12:33:32] sometime I wish we had a way to apt-get install python-foobar@2.* ;} [12:33:52] (03PS1) 10Jcrespo: Repool db1055 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301586 (https://phabricator.wikimedia.org/T140650) [12:34:10] but the debian way would probably to craft a special component such as "jessie/wikimedia mysoftware" which would have the proper versions [12:34:21] so as to not clutter the rest of the infra [12:37:38] hashar: I am checking scandium atm to familiarize myself with the host, never jumped on to it [12:38:01] so that host is in the labs-support network so that instance can reach it [12:38:21] it runs a daemon named zuul-merger running as user 'zuul' [12:38:31] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:38:38] its purposes is to receive patchset notification, clone the repo, attempt to merge the patch against the tip of the branch and report back to Zuul [12:38:51] (03CR) 10jenkins-bot: [V: 04-1] apertium-es-it: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:38:59] the resulting merged reference + sha1 is passed down to Jenkins jobs which then fetch it from scandium over the git-daemon [12:39:10] so in short it is just two process: zuul-merger + git-daemon [12:40:56] super [12:41:49] (03CR) 10Hashar: "the cowdancer update (whatever it can be) failed with:" [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:43:42] 
elukey: I wrote the step by step on https://phabricator.wikimedia.org/T140894 [12:43:52] merely: stop / wget / dpkg -i / start [12:44:03] and if all fine we can look at publishing the package on apt.wm.O [12:44:40] logs being in /var/log/zuul/merger-debug.log [12:46:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [12:47:58] beside a spike of errors at 12:15 UTC logstash does not show much errors [12:48:02] hashar: yep yep I was reading it [12:48:10] all right I guess everything looks good [12:49:14] the service zuul-merger can be stopped anytime [12:49:17] the pkg version is veery long but consistent with the actual one, I never saw one like that :) [12:49:23] the zuul-scheduler will just queue the requests until zuul-merger is back [12:49:34] ah yeah the version is crazy [12:49:43] that is the best we could figure out with Filippo :/ [12:49:49] pkg is zuul_2.1.0-391-gbc58ea3-wmf1jessie1_amd64.deb right? [12:49:49] hashar: why is it trying to do a pbuilder --update ? [12:50:05] akosiaris: to keep the cow image up-to-date ? [12:50:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:50:43] on every build ? [12:50:51] elukey: roughly that is: `git-describe`-wmf<#><#> [12:50:54] akosiaris: I guess [12:51:05] that's sounds inefficient [12:51:09] !log upgrading zuul-merger to zuul_2.1.0-391-gbc58ea3-wmf1jessie (T140894) [12:51:10] T140894: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894 [12:51:12] I understand doing it once per day [12:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:17] but on every build ? 
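Note on the package version discussed at 12:49 to 12:50: the filename zuul_2.1.0-391-gbc58ea3-wmf1jessie1_amd64.deb can be split into its parts as sketched below. The `git describe` portion (tag, commit count, `g` plus abbreviated hash) is standard; reading the suffix as a WMF revision, a distribution name and a distribution revision is an assumption based on the "`git-describe`-wmf<#><#>" description, and `parse_deb_filename` is a hypothetical helper, not part of any packaging tool.

```python
import re

# Assumed layout: <package>_<tag>-<commits>-g<sha>-wmf<rev><distro><rev>_<arch>.deb
PACKAGE_RE = re.compile(
    r"^(?P<package>[a-z0-9.+-]+?)_"
    r"(?P<tag>[0-9.]+)-(?P<commits>\d+)-g(?P<sha>[0-9a-f]+)"
    r"-wmf(?P<wmf_rev>\d+)(?P<distro>[a-z]+)(?P<distro_rev>\d+)"
    r"_(?P<arch>[a-z0-9]+)\.deb$"
)


def parse_deb_filename(filename):
    """Return the filename's components as a dict, or raise ValueError."""
    match = PACKAGE_RE.match(filename)
    if match is None:
        raise ValueError("unrecognized package filename: " + filename)
    fields = match.groupdict()
    # Number of commits made on top of the upstream tag.
    fields["commits"] = int(fields["commits"])
    return fields


info = parse_deb_filename("zuul_2.1.0-391-gbc58ea3-wmf1jessie1_amd64.deb")
print(info["tag"], info["commits"], info["sha"], info["distro"])
```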
[12:52:19] hashar: done :) [12:52:26] yup [12:52:36] so now it is all about watching the logs [12:52:51] and making sure that changes are still processed properly on https://integration.wikimedia.org/zuul/ :} [12:53:27] super. I'd say that we could watch it work for the next couple of days and the upload it to the repo [12:54:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:56:53] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [12:57:53] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:58:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [12:58:42] elukey: it is all good :} [12:58:46] (03CR) 10Alexandros Kosiaris: "not sure yet what is going on. The cowdancer update (which is used to keep the image up to date) seems to be happening on every build whic" [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:59:04] elukey: may you push https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-391-gbc58ea3-jessie/ to apt.wm.o jessie-wikimedia/thirdparty ? [13:00:34] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2501693 (10hashar) 05Open>03Resolved a:03elukey Luca has done the upgrade and it went flawlessly. I have doubl... 
[13:00:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:02:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:02:28] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.039 second response time on port 9042 [13:05:58] hashar: yeah let's do it tomorrow ok? [13:07:27] elukey: sure :) [13:07:32] elukey: thank you for the upgrade! [13:07:37] welcome! [13:07:45] thanks for the detailed explanation :) [13:11:56] (03CR) 10KartikMistry: "Agree. cowbuilder should not be updated on each build." [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:14:26] (03PS1) 10Muehlenhoff: Add salt grains for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/301591 [13:14:56] 06Operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor throughput towards some destinations - https://phabricator.wikimedia.org/T120425#2501715 (10Nemo_bis) >>! In T120425#2501557, @mark wrote: > @Nemo_bis: see Faidon's reply on Dec 7; if you provide us with your IP add... [13:15:05] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2501716 (10hashar) [13:21:15] (03CR) 10Muehlenhoff: [C: 032] Add salt grains for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/301591 (owner: 10Muehlenhoff) [13:22:10] 06Operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#2501746 (10Nemo_bis) >>! In T122917#2483113, @ArielGlenn wrote: > How does http://dumps.wikimedia.your.org/ perform? I can ask them about their routing but I know a... 
[13:22:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:24:10] (03PS2) 10Muehlenhoff: logstash: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297376 [13:27:27] (03CR) 10Muehlenhoff: [C: 032] logstash: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297376 (owner: 10Muehlenhoff) [13:27:45] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2501747 (10hashar) [13:28:16] mark: if you need IP addresses, please let me know on IRC (and specify IP address of which machine... per previous instructions I have tested perhaps 5 locations and I have no idea which are considered relevant) https://phabricator.wikimedia.org/T120425#2501715 [13:28:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:30:41] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission all old mediawiki appservers in eqiad - https://phabricator.wikimedia.org/T139353#2501750 (10Joe) 05Open>03Resolved [13:32:48] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) 
- https://phabricator.wikimedia.org/T141520#2501752 (10hashar) [13:33:56] 06Operations, 10ops-eqiad: Physically decommission mw1001-mw1148 (except mw1017 and mw1099) - https://phabricator.wikimedia.org/T141522#2501753 (10Joe) [13:34:06] 06Operations, 10ops-eqiad: Physically decommission mw1001-mw1148 (except mw1017 and mw1099) - https://phabricator.wikimedia.org/T141522#2501765 (10Joe) p:05Triage>03Normal [13:36:24] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission all old mediawiki appservers in eqiad - https://phabricator.wikimedia.org/T139353#2501767 (10elukey) Task opened for physical decom (including DNS entries cleanup): https://phabricator.wikimedia.org/T14152 [13:41:26] (03CR) 10Paladox: "testing" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [13:46:43] (03PS3) 10Filippo Giunchedi: relaxed directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/301502 (owner: 10Eevans) [13:46:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] relaxed directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/301502 (owner: 10Eevans) [13:49:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "LGTM!" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301573 (owner: 10Gilles) [13:57:38] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2501794 (10hashar) [13:57:58] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2501795 (10elukey) This seems to be related only to main-eqiad (kafka100[12].eqiad). I can see these horrible messages in jmxlog: ``` [28 Jul 201... 
[13:58:20] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2501796 (10Joe) I just uploaded the latest version of the apache package to rep... [13:58:30] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2501716 (10hashar) [14:02:47] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2501801 (10elukey) Correction: even the graphs for analytics-eqiad are showing some changes, but not that heavy like main-eqiad. [14:10:01] (03CR) 10Jcrespo: [C: 032] Repool db1055 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301586 (https://phabricator.wikimedia.org/T140650) (owner: 10Jcrespo) [14:10:07] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [14:10:27] (03Merged) 10jenkins-bot: Repool db1055 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301586 (https://phabricator.wikimedia.org/T140650) (owner: 10Jcrespo) [14:10:38] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [14:10:48] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: 
Test Get html by title from storage returned the unexpected status 500 (expecting: 200) [14:11:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 after maintenance (duration: 00m 35s) [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:59] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [14:12:47] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [14:18:15] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2501833 (10hashar) I have added a new Grafana panel showing the state of that specific check on [[ https://grafana.wikimedia... [14:20:28] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [14:22:19] 06Operations, 10EventBus, 10Graphite: eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524#2501841 (10fgiunchedi) [14:27:52] (03CR) 10Mark Bergsma: "Please break this up into multiple commits, so each commit covers only one concern." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/272679 (owner: 10Giuseppe Lavagetto) [14:37:17] (03PS1) 10Giuseppe Lavagetto: pbuilder: actually set buildresult when none is given on the command line [puppet] - 10https://gerrit.wikimedia.org/r/301604 [14:37:22] <_joe_> hashar: ^^ [14:40:55] the level of layers confuses me entirely [14:41:10] pbuilder defaults to /var/cache/pbuilder/result [14:41:20] and the "../" is injected by git-pbuilder [14:42:32] (03PS1) 10Giuseppe Lavagetto: puppetmaster: move two hosts to use the test puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301605 [14:42:46] <_joe_> hashar: hah ok [14:42:52] <_joe_> so my patch is no good [14:43:02] <_joe_> akosiaris: see https://gerrit.wikimedia.org/r/301605 [14:43:22] git-pbuilder states that GIT_PBUILDER_AUTOCONF=no would skip setting --buildresult [14:44:15] hashar: pbuilder support --buildrest [14:44:22] --buildresult* [14:44:29] maybe it's possible to use that ? [14:44:52] maybe [14:45:07] the jenkins-debian-glue script invokes cowbuilder --buildresult "$WORKSPACE/binaries" to get the .deb in the job workspace [14:45:35] previously our pbuilderrc was always overriding BUILDRESULT [14:46:21] (03PS1) 10BBlack: ciphersuites: add openssl-1.1.0 dhe+3des naming [puppet] - 10https://gerrit.wikimedia.org/r/301606 [14:46:21] so the approach in https://gerrit.wikimedia.org/r/#/c/300830/3/modules/package_builder/templates/pbuilderrc.erb was wrong because, due to /usr/lib/pbuilder/pbuilder-loadconfig being loaded before pretty much everything else and setting BUILDRESULT by sourcing /usr/share/pbuilder/pbuilderrc, the BUILDRESULT=${BUILDRESULT:-<%= @basepath %>/result/${DIST}-${ARCH}} always defaults to the already set value [14:46:23] (03PS1) 10BBlack: ciphersuites: add openssl-1.1.0 chacha20-poly1305 [puppet] - 10https://gerrit.wikimedia.org/r/301607 [14:46:36] I'll revert it. 
it's wrong on premise unfortunately [14:47:43] (03CR) 10Mark Bergsma: [WiP] Add ipvs-related FSM (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/272679 (owner: 10Giuseppe Lavagetto) [14:48:11] maybe me setting GIT_PBUILDER_OUTPUT_DIR would be enough [14:48:50] (03CR) 10BBlack: [C: 032] ciphersuites: add openssl-1.1.0 dhe+3des naming [puppet] - 10https://gerrit.wikimedia.org/r/301606 (owner: 10BBlack) [14:49:05] (03PS1) 10Alexandros Kosiaris: Revert "package_builder: do not override BUILDRESULT" [puppet] - 10https://gerrit.wikimedia.org/r/301608 [14:49:07] (03CR) 10BBlack: [C: 032] ciphersuites: add openssl-1.1.0 chacha20-poly1305 [puppet] - 10https://gerrit.wikimedia.org/r/301607 (owner: 10BBlack) [14:49:29] (03PS2) 10Alexandros Kosiaris: Revert "package_builder: do not override BUILDRESULT" [puppet] - 10https://gerrit.wikimedia.org/r/301608 [14:49:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "package_builder: do not override BUILDRESULT" [puppet] - 10https://gerrit.wikimedia.org/r/301608 (owner: 10Alexandros Kosiaris) [14:50:13] hashar: I am still struggling a bit with jenkins-debian-glue [14:50:35] I haven't yet figured out if it is possible to NOT call cowbuilder --update on every build for example [14:50:47] given we already do it anyway via cron every day [14:51:07] akosiaris: SKIP_COWBUILDER_UPDATE=true [14:51:16] hmm nice [14:51:18] just found it while looking at scripts/build-and-provide-package [14:51:20] i am an idiot ? [14:51:28] na pure luck on my side :} [14:51:38] (03PS1) 10Addshore: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) [14:51:38] I have searched for /--update/ in the source [14:51:43] god damn I love open source [14:51:45] not documented as it seems [14:52:55] (03CR) 10Alexandros Kosiaris: [C: 031] "let's see what we will wreck!" 
[puppet] - 10https://gerrit.wikimedia.org/r/301605 (owner: 10Giuseppe Lavagetto) [14:52:57] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: move two hosts to use the test puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301605 (owner: 10Giuseppe Lavagetto) [14:53:02] (03PS2) 10Alexandros Kosiaris: puppetmaster: move two hosts to use the test puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301605 (owner: 10Giuseppe Lavagetto) [14:53:06] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: move two hosts to use the test puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301605 (owner: 10Giuseppe Lavagetto) [14:53:22] <_joe_> akosiaris: I'll take scb2001 [14:53:48] hashar: is it not documented? :) [14:53:58] hashar: so, SKIP_COWBUILDER_UPDATE on the jenkins job ? [14:54:10] or in zuul config... I am unsure right now [14:54:15] I suspect jjb [14:54:24] then the cow image will be stale [14:54:30] nope [14:54:32] * hashar looks at the doc [14:54:34] sudo crontab -l [14:54:44] you will see we have a crontab to update all images [14:54:51] once per day IIRC [14:55:13] yup, at 7:34 UTC [14:57:22] <_joe_> akosiaris: uhm nothing happening based on our patch, grr [14:58:37] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:58:49] <_joe_> akosiaris: rotfl $puppetmaster = hiera('puppetmaster') [14:58:53] <_joe_> grrr [14:59:17] !log bounce carbon-cache on graphite1001 - T101141 [14:59:18] T101141: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141 [14:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:56] * aude can do swat today [15:00:05] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160728T1500). Please do the needful.
[15:00:05] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:17] * kart_ is around [15:00:23] Who is SWAT'ng? [15:00:26] (03PS5) 10Eevans: [WIP]: Configurable `vm.diry_background_bytes` parameter [puppet] - 10https://gerrit.wikimedia.org/r/301425 [15:00:33] (03PS1) 10Giuseppe Lavagetto: puppetmaster: fixup for I039b88e2e [puppet] - 10https://gerrit.wikimedia.org/r/301613 [15:00:51] kart_: i can [15:00:58] aude: thanks! [15:01:28] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: fixup for I039b88e2e [puppet] - 10https://gerrit.wikimedia.org/r/301613 (owner: 10Giuseppe Lavagetto) [15:01:34] (03PS2) 10Giuseppe Lavagetto: puppetmaster: fixup for I039b88e2e [puppet] - 10https://gerrit.wikimedia.org/r/301613 [15:01:36] (03PS3) 10Alexandros Kosiaris: puppetmaster: fixup for I039b88e2e [puppet] - 10https://gerrit.wikimedia.org/r/301613 (owner: 10Giuseppe Lavagetto) [15:01:42] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: fixup for I039b88e2e [puppet] - 10https://gerrit.wikimedia.org/r/301613 (owner: 10Giuseppe Lavagetto) [15:01:44] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: fixup for I039b88e2e [puppet] - 10https://gerrit.wikimedia.org/r/301613 (owner: 10Giuseppe Lavagetto) [15:01:51] kart_: do you have a way to verify the dump-corpora.php patches? [15:02:18] * aude thinks these can skip the canary step [15:02:22] aude: no direct way. apergos need to run it for dump. 
[15:02:22] aude: I don't think they need verifying right now, we are working with apergos to get them set up and this fixes one issue we found [15:02:49] kart_: Nikerabbit ok [15:03:23] (03PS3) 10Aude: Deploy Compact Language Links as default (Stage 5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301555 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:03:36] (03CR) 10Aude: [C: 032] Deploy Compact Language Links as default (Stage 5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301555 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:03:42] akosiaris: yeah I forgot about the magic cron updating the cow image. Meanwhile I have sent a PR to document SKIP_COWBUILDER_UPDATE https://github.com/mika/jenkins-debian-glue.org/pull/13 [15:03:49] hashar: I 'll upload a patch for SKIP_COWBUILDER_UPDATE and add you as reviewer [15:03:55] hashar: \o/ [15:04:00] (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301555 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:04:13] someone ping me when the patch goes around (and note it on the ticket please) [15:04:27] apergos: ok [15:04:31] thanks [15:05:13] <_joe_> akosiaris: double-success [15:06:16] _joe_: looks lie it [15:06:18] like* [15:06:19] kart_: please check content translation on mw1099 [15:06:33] _joe_: mw2231 just applied config [15:06:47] it was somewhat slower though. 40 secs vs 34 before [15:06:57] aude: Compact Language Links - 301555 - right? 
[15:07:06] !log adding new index (schema change) to recentchanges T140108 [15:07:07] T140108: ApiQueryRecentChanges::run is spiking, nuking API servers - https://phabricator.wikimedia.org/T140108 [15:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:19] kart_: yes [15:07:29] aude, kart_: I can confirm cll is enabled on ukwiki on mw1099 [15:07:34] (03PS3) 10Muehlenhoff: Remove DNS entries for old trusty scalers [dns] - 10https://gerrit.wikimedia.org/r/301406 (https://phabricator.wikimedia.org/T141352) [15:08:00] hashar: https://github.com/mika/jenkins-debian-glue.org/search?utf8=%E2%9C%93&q=SKIP_COWBUILDER_UPDATE [15:08:05] should I be worried ? [15:08:14] Cow builder? [15:08:19] Nikerabbit: ok [15:08:22] You can build cows? [15:08:27] (03Abandoned) 10Giuseppe Lavagetto: pbuilder: actually set buildresult when none is given on the command line [puppet] - 10https://gerrit.wikimedia.org/r/301604 (owner: 10Giuseppe Lavagetto) [15:08:29] Bsadowski1: goats as well [15:08:32] haha [15:08:50] akosiaris: yeah it is undocumented but definitely here https://github.com/mika/jenkins-debian-glue.org/pull/13 fix the doc [15:08:58] PROBLEM - puppet last run on lvs1012 is CRITICAL: CRITICAL: puppet fail [15:09:04] akosiaris: and the cron are definitely there so I will add SKIP_COWBUILDER_UPDATE [15:09:08] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [15:09:13] <_joe_> uh? [15:09:14] how abotu sheep? 
[15:09:14] hashar: no I meant I can't find it on the github repo [15:09:28] like not in the code [15:09:28] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet has 1 failures [15:09:29] akosiaris: there is a repo for the doc and another one for the software :( [15:09:35] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable content translation on more wikis (duration: 00m 25s) [15:09:39] ah [15:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:42] that would explain it [15:09:46] akosiaris: https://github.com/mika/jenkins-debian-glue/search?utf8=✓&q=SKIP_COWBUILDER_UPDATE [15:09:47] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Puppet has 1 failures [15:09:50] This is the git repository for the jenkins-debian-glue.org website. [15:09:54] <_joe_> strontium fail ^^ [15:09:58] yeah yeah, my mistake, sorry hashar [15:10:07] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Puppet has 1 failures [15:10:12] akosiaris: that confuses me every single time :} [15:10:17] !log aude@tin Synchronized dblists/clldefault.dblist: Enable content translation on more wikis (duration: 00m 23s) [15:10:17] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Puppet has 1 failures [15:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:26] aude: looks good. [15:10:36] _joe_: strontium fail ? 
[15:10:37] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Puppet has 1 failures [15:10:45] <_joe_> yeah it's back though [15:10:48] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 2 failures [15:11:08] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:11:18] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [15:11:37] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures [15:11:40] waiting for jenkins on the other patches [15:11:49] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 1 failures [15:11:58] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [15:12:38] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 1 failures [15:12:49] aude: err, that was Compact Language Links :) [15:13:01] ah [15:13:03] ok :) [15:13:19] aude: synced dblist too, right? [15:13:22] yes [15:13:37] cool. Thanks. [15:13:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove DNS entries for old trusty scalers [dns] - 10https://gerrit.wikimedia.org/r/301406 (https://phabricator.wikimedia.org/T141352) (owner: 10Muehlenhoff) [15:14:30] akosiaris: https://gerrit.wikimedia.org/r/301616 should do :) [15:14:56] :-) [15:15:35] ok. I am abandoning mine then [15:15:52] !log bounce statsite on graphite1001 - T101141 [15:15:53] T101141: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141 [15:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:00] akosiaris: lets keep your actually . 
For stats purposes :) [15:16:18] I 've already abandoned and +1ed yours [15:16:28] and even linked to yours [15:16:59] (03Abandoned) 10Alexandros Kosiaris: palladium: puppetmaster-test instead of puppetmaster.test [puppet] - 10https://gerrit.wikimedia.org/r/301585 (https://phabricator.wikimedia.org/T98173) (owner: 10Alexandros Kosiaris) [15:17:26] akosiaris: that will be one more commit not authored by me in the repo :} [15:17:46] (03PS2) 10Addshore: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) [15:17:57] 06Operations, 10hardware-requests: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2502022 (10MoritzMuehlenhoff) Also dropped from DNS (except mgmt) [15:18:33] (03CR) 10Chad: [C: 04-1] Support linking to a phabricator comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [15:18:35] (03PS3) 10Addshore: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) [15:19:34] (03CR) 10jenkins-bot: [V: 04-1] Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) (owner: 10Addshore) [15:19:37] !log aude@tin Synchronized php-1.28.0-wmf.12/extensions/ContentTranslation: Fix DumpCorpora script (duration: 00m 31s) [15:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:44] kart_: ^ [15:19:54] (03CR) 10BryanDavis: "> (1 comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301404 (owner: 10BryanDavis) [15:20:04] (03PS3) 10Paladox: Support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [15:20:09] (03PS4) 10Paladox: Support linking to a 
phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [15:20:27] !log aude@tin Synchronized php-1.28.0-wmf.11/extensions/ContentTranslation: Fix DumpCorpora script (duration: 00m 27s) [15:20:30] (03CR) 10Paladox: Support linking to a phabricator comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:35] and apergos ^ [15:20:53] oh good [15:21:09] which groups are running .12, do you know, aude? [15:21:14] Thanks aude! [15:21:21] apergos: both wmf11, wmf12 [15:21:22] apergos: everything except most wikipedias [15:21:32] ok I'll test run it on a non wp then [15:21:34] thanks! [15:22:25] 06Operations, 10hardware-requests: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2502028 (10MoritzMuehlenhoff) [15:22:28] 06Operations, 06Labs: labnet100[12].eqiad.wmnet need to be reimaged with RAID - https://phabricator.wikimedia.org/T136718#2502029 (10Andrew) labnet1001 is reimaged now. The steps are a bit weird due to unusual networking setup: The 10g nic can't support pxe-booting. So to install you have to go into the bio... [15:22:43] apergos: you can always find out what version is running on what group at https://tools.wmflabs.org/versions/ [15:22:47] 06Operations, 10ops-eqiad: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2502032 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03Cmjohnson [15:23:24] bd808: bookmarked [15:23:34] bd808: that's handy. Bookmarked. [15:23:37] In the past I've dug around for it in the versions file but meh [15:23:55] Nikerabbit: give me a non wikipedia with content translation enabled? 
[15:23:57] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:21] I ask because I just tried this on elwiktionary and got [15:24:23] Fatal error: Class 'ContentTranslation\Translation' not found in /srv/mediawiki/php-1.28.0-wmf.12/extensions/ContentTranslation/scripts/dump-corpora.php on line 121 [15:24:44] (03PS4) 10Addshore: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) [15:25:30] o_O [15:26:10] apergos: huh? cx is not enabled outside wikipedia and testwiki [15:26:21] ah, then I have to wait til it makes it to the pedias [15:26:22] thanks then [15:26:33] apergos: you can try on cawiki [15:26:37] ohhh [15:26:52] some of the wikipedias are in group1 and the patch is also on wmf11 (so all wikipedias, afaik) [15:26:57] apergos: as I said, it is deploy in all wikis. [15:27:02] Um if it's not enabled on elwikt then it probably shouldn't show a fatal if you try :) [15:27:08] (03PS5) 10Addshore: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) [15:27:09] Anyway ;-) [15:27:15] side issue though [15:27:21] (even if it is, fatals are bad mmkay!) [15:27:48] well, it's the usual MW issue of running maintenance script using paths [15:27:58] It'll run and blow up due to dependancies [15:28:01] !log aude@tin Synchronized php-1.28.0-wmf.12/extensions/Wikidata: Fix exception when undeleting items and fix css bug (duration: 01m 52s) [15:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:35] (03CR) 10Mark Bergsma: "Nice start. :-)" (033 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/272679 (owner: 10Giuseppe Lavagetto) [15:28:37] Reedy: Well, we should bail nicer ;-) [15:28:47] "You can't run this thing you weirdo!" 
[15:28:55] I wonder if we can make it WMF agnostic [15:28:58] test is running now [15:29:01] Using the extension loader stuff [15:29:20] (03PS3) 10Aude: Update entityNamespace settings to not use content model ids [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297240 (https://phabricator.wikimedia.org/T138982) [15:29:22] if ( !extensionloaded ) { die nicely } [15:29:27] (03CR) 10Aude: [C: 032] Update entityNamespace settings to not use content model ids [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297240 (https://phabricator.wikimedia.org/T138982) (owner: 10Aude) [15:29:54] (03Merged) 10jenkins-bot: Update entityNamespace settings to not use content model ids [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297240 (https://phabricator.wikimedia.org/T138982) (owner: 10Aude) [15:32:17] (03PS6) 10Addshore: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) [15:33:24] (03PS1) 10Muehlenhoff: pmacct: Limit to production networks [puppet] - 10https://gerrit.wikimedia.org/r/301621 [15:34:42] (03PS7) 10Ottomata: Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) (owner: 10Addshore) [15:34:59] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:34:59] Reedy: ooooh, that sounds nice [15:35:08] RECOVERY - puppet last run on lvs1012 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:35:27] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:35:30] Ideally, we'd do it in the Maintenance parent class [15:35:39] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:35:46] Not sure how exactly we'd work out which extension we 
were loading [15:35:48] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:35:49] * Reedy files a task [15:35:58] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:36:08] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:08] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:36:17] (03CR) 10Mark Bergsma: [WiP] Add ipvs-related FSM (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/272679 (owner: 10Giuseppe Lavagetto) [15:36:27] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:29] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:36:48] Nikerabbit: run completed, bunch of files created, output on ticket, can you check/verify contents? 
thanks [15:36:48] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:58] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:37:17] apergos: thanks, will do tomorrow [15:37:24] cool thanks [15:37:27] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:37:29] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:37:33] 06Operations, 10Cassandra, 10RESTBase-Cassandra: track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#2502056 (10Eevans) [15:37:44] 06Operations, 10Cassandra, 10RESTBase-Cassandra: track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#1858341 (10Eevans) p:05Normal>03High [15:38:13] (03CR) 10Alexandros Kosiaris: [C: 031] "was it even ever used? 
Anyway indeed it makes sense only on production networks" [puppet] - 10https://gerrit.wikimedia.org/r/301621 (owner: 10Muehlenhoff) [15:38:19] RECOVERY - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on port 9042 [15:38:35] (03CR) 10Ottomata: [C: 032] Add analytics-wmde user to role::analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/301610 (https://phabricator.wikimedia.org/T141525) (owner: 10Addshore) [15:38:41] (03PS2) 10Ottomata: Add dump_log_dir to stats::wmde config [puppet] - 10https://gerrit.wikimedia.org/r/301511 (https://phabricator.wikimedia.org/T119070) (owner: 10Addshore) [15:38:51] (03CR) 10Ottomata: [C: 032 V: 032] Add dump_log_dir to stats::wmde config [puppet] - 10https://gerrit.wikimedia.org/r/301511 (https://phabricator.wikimedia.org/T119070) (owner: 10Addshore) [15:39:05] Filed https://phabricator.wikimedia.org/T141531 [15:39:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Generally good, but I'd rather not use hiera if the value is fixed on any cluster, or if it depends on e.g. the RAM of the system." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/301425 (owner: 10Eevans) [15:40:26] akosiaris: 00:00:05.683 *** Skipping cowbuilder update as requested via SKIP_COWBUILDER_UPDATE *** [15:40:31] akosiaris: it works thanks :) [15:40:36] :-) [15:41:21] !log aude@tin Synchronized wmf-config/Wikibase.php: Update entityNamespaces setting (duration: 00m 27s) [15:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:33] then our pbuilderrc overrides cowbuilder --buildresult $WORKSPACE/binaries so the .Deb ends up somewhere under /mnt/pbuilder/result :/ [15:41:52] swat is done [15:42:00] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [15:42:09] akosiaris: the env variable takes precedence over the --buildresult parameter :( [15:42:43] it shouldn't [15:42:46] lemme see what's up [15:42:47] (03PS1) 10Addshore: Move statistics::wmde::user to correct dir [puppet] - 10https://gerrit.wikimedia.org/r/301624 [15:42:55] ottomata: ^^ [15:43:24] (03CR) 10Ottomata: [C: 032 V: 032] Move statistics::wmde::user to correct dir [puppet] - 10https://gerrit.wikimedia.org/r/301624 (owner: 10Addshore) [15:43:29] akosiaris: https://integration.wikimedia.org/ci/job/debian-glue/352/consoleFull [15:43:48] akosiaris: at 00:00:05.761 (or search for "buildresult") [15:44:01] cowbuilder is invoked with --buildresult /mnt/jenkins-workspace/workspace/debian-glue/binaries/ [15:44:17] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail [15:44:37] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [15:44:46] then forks I guess read pbuilderrc and ends up invoking pbuilder --buildresult /mnt/pbuilder/result/jessie-amd64 [15:45:28] maybe I could have our /etc/pbuilderrc to detect jenkins and if that is the case export BUILDRESULT="$WORKSPACE/binaries" which is dirty 
but might just work [15:47:07] (03PS1) 10Addshore: Fix statistics::wmde::user usage in statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/301625 [15:47:12] ottomata: ^^ [15:48:01] addshore [15:48:03] class { [15:48:07] actual [15:48:07] literal [15:48:08] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:48:08] class [15:48:17] class { 'statistics::wmde::user': ... [15:48:17] ahh class, not the class name! :[p [15:48:18] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:48:19] ja! [15:48:36] (03PS2) 10Addshore: Fix statistics::wmde::user usage in statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/301625 [15:48:39] ottomata: ^^ [15:49:06] (03CR) 10Ottomata: [C: 032 V: 032] Fix statistics::wmde::user usage in statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/301625 (owner: 10Addshore) [15:51:36] (03PS1) 10Muehlenhoff: contint::firewall: Limit to production networks [puppet] - 10https://gerrit.wikimedia.org/r/301627 [15:52:37] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:54:15] =] [15:56:39] hashar: I can't reproduce it with sudo DIST=jessie-wikimedia ARCH=amd64 cowbuilder --buildresult /mnt/ --build lala.dsc [15:56:59] can we find out what --configfile=/tmp/tmp.peciRankrh has inside ? [15:57:16] it's obviously a tempfile [15:58:22] there we go [15:58:34] 15:36:16 + '[' -n /etc/pbuilderrc ']' [15:58:34] 15:36:16 + echo '*** PBUILDER_CONFIG is set, considering /etc/pbuilderrc for pbuilder config ***' [15:58:34] 15:36:16 *** PBUILDER_CONFIG is set, considering /etc/pbuilderrc for pbuilder config *** [15:58:34] 15:36:16 + '[' -r /etc/pbuilderrc ']' [15:58:40] hashar: that probably explains it [15:59:31] akosiaris: yeah that is to benefit from all the logic in that file / hooks etc. 
It then cat /etc/pbuilderrc to a tmp file and append extra parameters [15:59:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160728T1600). Please do the needful. [16:01:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [16:02:08] PROBLEM - HTTPS-policy on policy.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate policy.wikimedia.org valid until 2016-08-27 16:01:01 +0000 (expires in 29 days) [16:05:44] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2353511 (10faidon) >>! In T136957#2485532, @GWicke wrote: > This suggests that there are several issues in firejail: > > - Error return status codes... [16:06:17] (03PS1) 10Hashar: package_builder: set BUILDRESULT on Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/301633 [16:06:25] akosiaris: going to try https://gerrit.wikimedia.org/r/301633 :D [16:08:46] hashar: ew... I am not merging that :P [16:08:54] oh come on! 
[16:09:02] ;-D [16:12:33] hashar: ok confirmed [16:12:38] akosiaris@packager:~$ sudo DIST=jessie-wikimedia ARCH=amd64 cowbuilder --buildresult /mnt --dumpconfig |grep buildre [16:12:38] buildresult: /mnt [16:12:38] akosiaris@packager:~$ sudo DIST=jessie-wikimedia ARCH=amd64 cowbuilder --buildresult /mnt --configfile=/etc/pbuilderrc --dumpconfig |grep buildres [16:12:38] buildresult: /var/cache/pbuilder/result/jessie-amd64 [16:12:50] it's that --configfile thing that messes things up [16:12:57] probably some wrong ordering somewhere [16:13:33] my patch does not work anyway ( https://integration.wikimedia.org/ci/job/debian-glue/354/console still fork pbuilder with the wrong --buildresult ) [16:13:57] I would be surprised if it did [16:14:02] (03Abandoned) 10Hashar: package_builder: set BUILDRESULT on Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/301633 (owner: 10Hashar) [16:14:07] this is like source 5 levels down [16:14:23] and cowbuilder is actually a C program [16:14:31] I am wondering how it parses the config file [16:14:41] yeah I am looking at it :( [16:14:51] else if (!strcmp(buf, "BUILDRESULT")) [16:14:51] { [16:14:51] pc->buildresult=strdup_strip_quote(delim); [16:14:51] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2502169 (10fgiunchedi) I looked at the whisper files for `OneMinuteRate` and I think the NULLs are due to data not making it to the whisper bucket... [16:14:56] guess what... 
[16:15:02] https://anonscm.debian.org/git/pbuilder/cowdancer.git at debian/0.73 but my skill.C == 0 [16:22:30] (03PS1) 10Filippo Giunchedi: statsite: flush to graphite every 30s [puppet] - 10https://gerrit.wikimedia.org/r/301635 (https://phabricator.wikimedia.org/T101141) [16:27:34] hashar: ahahaha [16:27:35] so [16:27:42] sudo DIST=jessie-wikimedia ARCH=amd64 cowbuilder --configfile=/etc/pbuilderrc --buildresult /mnt --dumpconfig |grep buildresult: [16:27:42] buildresult: /mnt [16:27:46] guess what [16:27:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: flush to graphite every 30s [puppet] - 10https://gerrit.wikimedia.org/r/301635 (https://phabricator.wikimedia.org/T101141) (owner: 10Filippo Giunchedi) [16:27:56] the position of the arguments is important... [16:27:57] (03PS2) 10Filippo Giunchedi: statsite: flush to graphite every 30s [puppet] - 10https://gerrit.wikimedia.org/r/301635 (https://phabricator.wikimedia.org/T101141) [16:28:06] (03CR) 10Filippo Giunchedi: [V: 032] statsite: flush to graphite every 30s [puppet] - 10https://gerrit.wikimedia.org/r/301635 (https://phabricator.wikimedia.org/T101141) (owner: 10Filippo Giunchedi) [16:28:07] .... [16:28:16] that is enough. I am done with IT and computer [16:28:24] :D [16:28:27] hahahaha [16:28:33] rotfl [16:28:45] * akosiaris not sure what he is laughing with though [16:29:02] damn... [16:29:39] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2502223 (10jcrespo) @akosiaris I think now should be a safe time to restart bacula, but I will block on you giving an ok for that/confirming there are no ongoing jobs running. [16:30:35] akosiaris: nice finding really [16:30:43] so --configfile=/etc/pbuilderrc --buildresult /mnt [16:30:45] to override [16:31:16] 06Operations, 10Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2502239 (10BBlack) The facts of the situation: 1. 
We considered long ago in another ticket using cloudflare's OpenSSL patch (an older one) to implement the old Draft versions of ChaCha20-Poly1305 ci... [16:31:41] akosiaris: and yeah jenkins-debian-glue pass ---configfile as the very last parameter [16:31:53] lol [16:32:12] * akosiaris goes away to play pokemon go... less stressful [16:32:15] :P [16:32:28] akosiaris: Till the servers go down [16:32:31] Then you can't fix them [16:32:35] And then you have to login again [16:32:44] And it forgets all the pokemon you've caught [16:32:44] till ? they are constantly down [16:32:45] akosiaris: thank you very much :} [16:32:48] lol [16:32:54] Hey Reedy :) [16:33:59] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2502243 (10akosiaris) It is. Let's do it. I already did a restart, everything is fine. [16:35:00] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2502246 (10MoritzMuehlenhoff) firejail 0.38 is available in jessie-wikipedia, but restbase* isn't upgraded to the new version yet (haven't checked ho... [16:35:06] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2502247 (10jcrespo) Oh, thank you a lot! I see bacula already from the new ip! Thank you again! [16:36:46] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2502252 (10demon) Would be nice for tool users too, who are routinely doing git operations and are stuck on an ancient 1.9.x as well. [16:36:51] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2502253 (10Joe) >>! 
In T136957#2502163, @faidon wrote: > So yes, this is probably some systemd/firejail interaction issue and not RESTBase or node's... [16:38:50] !log stopping dbproxy1001 haproxy service [16:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:00] hashar: so, I don't think this is easily fixeable in cowbuilder... better fix it in jenkins-debian-glue [16:39:45] (03PS1) 10Ottomata: Remove analytics-wmde from analytics cluster, improve docs around analytics cluster users [puppet] - 10https://gerrit.wikimedia.org/r/301639 (https://phabricator.wikimedia.org/T141525) [16:40:09] hello, icinga are you there? [16:40:24] akosiaris: fuck yeah, caterpie from an egg [16:40:33] ts ts ts caterpie ? [16:40:43] get a pikachu first and then we are talking [16:40:46] lol [16:40:47] Reedy: You play Pokemon Go? LOL [16:40:50] :O [16:40:52] Nice! [16:40:57] My little sister wanted me to [16:41:05] cmjohnson1: are there any pokemon in eqiad? [16:41:08] Do you like the game? [16:41:23] ottomata: Hrm...i never checked [16:41:35] if you find any, we can blame the next outage on them [16:41:52] (03CR) 10Ottomata: [C: 032] Remove analytics-wmde from analytics cluster, improve docs around analytics cluster users [puppet] - 10https://gerrit.wikimedia.org/r/301639 (https://phabricator.wikimedia.org/T141525) (owner: 10Ottomata) [16:41:56] "A pikachu overloaded our generators" [16:43:17] papaul: have you thoroughly checked codfw? 
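[editor's note] The cowbuilder behaviour akosiaris found above (the same two options giving a different buildresult depending on their order) would be consistent with options being applied strictly left to right, with a config file overwriting whatever was set before it. The Python below is only a toy model of that assumption, not cowbuilder's actual parsing code; `parse_args` and the `PBUILDERRC` mapping are hypothetical names made up for the illustration, and the paths are the ones pasted in the channel.

```python
# Toy model (NOT cowbuilder's real code): options are applied strictly
# left to right, and loading a config file overwrites earlier values.

def parse_args(argv, config_files):
    """Return the effective settings after processing argv in order.

    config_files maps a (hypothetical) config path to the settings it defines.
    """
    settings = {"buildresult": "/var/cache/pbuilder/result"}  # assumed default
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--buildresult":
            # Explicit flag: takes the next argument as its value.
            settings["buildresult"] = argv[i + 1]
            i += 2
        elif arg.startswith("--configfile="):
            # Config file: every key it defines replaces the current value.
            path = arg.split("=", 1)[1]
            settings.update(config_files[path])
            i += 1
        else:
            i += 1
    return settings


# Stand-in for what /etc/pbuilderrc ends up setting on the build host.
PBUILDERRC = {
    "/etc/pbuilderrc": {"buildresult": "/var/cache/pbuilder/result/jessie-amd64"}
}

# Flag first, config file last: the config file wins.
flag_first = parse_args(
    ["--buildresult", "/mnt", "--configfile=/etc/pbuilderrc"], PBUILDERRC
)
# Config file first, flag last: the flag wins.
config_first = parse_args(
    ["--configfile=/etc/pbuilderrc", "--buildresult", "/mnt"], PBUILDERRC
)

print(flag_first["buildresult"])
print(config_first["buildresult"])
```

Under this model, whichever of the two comes last on the command line decides the result, which is why moving --configfile ahead of --buildresult lets the explicit flag take effect.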
[16:43:54] akosiaris: will check why --configfile is last [16:44:12] akosiaris: maybe that is made on purpose so the pbuidderrc can override jenkins-debian-glue hardcoded settings : d I have filled https://phabricator.wikimedia.org/T141538 anyway [16:44:24] you are subscribed [16:45:45] (03PS1) 10Faidon Liambotis: nagios: fix check_ssl warning when no SANs are present [puppet] - 10https://gerrit.wikimedia.org/r/301641 [16:45:48] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2502307 (10greg) Ori is out for family reasons right now, but since he helped craft this alert I'm adding him here for his t... [16:45:58] giving up for today *wave* [16:46:23] (03CR) 10Filippo Giunchedi: [C: 031] nagios: fix check_ssl warning when no SANs are present [puppet] - 10https://gerrit.wikimedia.org/r/301641 (owner: 10Faidon Liambotis) [16:46:44] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-urd] - 10https://gerrit.wikimedia.org/r/296229 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [16:47:03] (03CR) 10Faidon Liambotis: [C: 032] nagios: fix check_ssl warning when no SANs are present [puppet] - 10https://gerrit.wikimedia.org/r/301641 (owner: 10Faidon Liambotis) [16:47:09] (03CR) 10jenkins-bot: [V: 04-1] apertium-urd: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-urd] - 10https://gerrit.wikimedia.org/r/296229 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [16:49:54] (03PS1) 10Eevans: Enable Casssandra instance restbase1014-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301642 (https://phabricator.wikimedia.org/T134016) [16:49:59] !log bounce statsite on graphite1001 T101141 [16:50:00] T101141: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141 [16:50:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:09] (03PS1) 10Eevans: Enable Cassandra instance restbase2006-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301643 (https://phabricator.wikimedia.org/T134016) [16:53:12] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2502362 (10Joe) So I added ``` StandardOutput=tty ``` to the unit and magically I found everything was being logged in real time to `/dev/console`... [16:53:27] (03CR) 10Eevans: [C: 04-1] "Just queuing this up for now; Do not yet merge. I'll +1 this when ready." [puppet] - 10https://gerrit.wikimedia.org/r/301642 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:53:35] (03CR) 10Eevans: [C: 04-1] "Just queuing this up for now; Do not yet merge. I'll +1 this when ready." [puppet] - 10https://gerrit.wikimedia.org/r/301643 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:53:40] 06Operations, 10EventBus, 10Graphite: eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524#2502363 (10Ottomata) a:03Ottomata [16:54:01] ottomata: nerver checked [16:54:05] :) [16:56:29] (03PS1) 10Kaldari: Test numeric sorting on test wiki, remove test from beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301644 (https://phabricator.wikimedia.org/T141433) [16:59:34] (03PS1) 10Muehlenhoff: Remove role::logging::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/301645 [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160728T1700). 
[17:04:48] (03PS1) 10Yuvipanda: tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 [17:05:19] (03CR) 10jenkins-bot: [V: 04-1] tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 (owner: 10Yuvipanda) [17:05:35] (03PS2) 10Yuvipanda: tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 [17:06:45] bd808 ^ CR if you have time (No obligations, I assume you are still banging about striker) [17:06:50] (03CR) 10jenkins-bot: [V: 04-1] tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 (owner: 10Yuvipanda) [17:07:36] YuviPanda: I'm in sprint planning, so CR is a perfect distraction :) [17:07:59] hehe ok :) [17:08:09] bd808 feel free to CR that entire file really. it has some code smells etc [17:10:31] (03PS3) 10Yuvipanda: tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 [17:10:33] (03PS3) 10Yuvipanda: tools: Don't load kernel logging module explicitly [puppet] - 10https://gerrit.wikimedia.org/r/301534 [17:13:51] 06Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2502454 (10jcrespo) [17:13:53] godog: hiiii [17:13:54] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2502450 (10jcrespo) 05Open>03Resolved labsdb1002 itself has been decomissioned; the tracking of the new server setup (and labs fix in general) will be done at T140452 [17:14:17] hey ottomata [17:14:51] looked quickly into how the tornado statsd stuff works [17:14:55] that's what is sending eventbus stats [17:14:56] https://github.com/sprockets/sprockets.clients.statsd/blob/master/sprockets/clients/statsd/__init__.py [17:15:06] unfortunetly they just sent to a socket [17:16:55] ottomata: ah! 
I missed that part, I was looking at eventlogging/handlers.py that uses 'import statsd' [17:17:19] how are the two related? i.e. tornado and handlers.py ? [17:18:45] totally different, so handlers are eventlogging readers and writer endpoints [17:19:07] if we wanted to send some stats about things associated with a stream reader or writer [17:19:17] like say, inserting into mysql (which is what i think statsd in handlers is used for) [17:19:19] we could do it there [17:19:26] tornado is used for eventlogging-service [17:19:34] and http endpoint for producing to eventlogging writers [17:19:44] so, the eventbus status you are seeing those from are about http requests [17:19:55] eventlogging-service uses tornado [17:20:12] and the sprockets statsd stuff was a transparent way to get the http stats out of tornado [17:20:37] (03PS5) 10Paladox: Support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [17:20:43] (03PS6) 10Paladox: Support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [17:20:56] PROBLEM - Disk space on Hadoop worker on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:20:57] ottomata: ah thanks for the explanation, do you know if upstream would be interested in a patch? [17:21:07] (03CR) 10BryanDavis: tools: Send combined status / httpver stats to graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301646 (owner: 10Yuvipanda) [17:21:15] PROBLEM - dhclient process on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:21:26] PROBLEM - Hadoop DataNode on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:21:35] PROBLEM - puppet last run on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
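Note on the batching request (T141524): the log says only that the sprockets client writes each stat straight to a socket. A minimal sketch of what batching could look like follows, assuming the receiving daemon accepts several newline-separated metrics in one UDP datagram. The function names, the metric names in the usage note, and the 1432-byte default are all illustrative assumptions, not taken from eventlogging or sprockets.

```python
import socket
from typing import Iterable, Iterator, List


def batch_metrics(lines: Iterable[str], max_bytes: int = 1432) -> Iterator[bytes]:
    """Pack statsd metric lines into newline-separated payloads.

    Each payload stays at or under max_bytes so it can travel in one
    UDP datagram. A single line longer than max_bytes is emitted alone.
    """
    current: List[bytes] = []
    size = 0
    for line in lines:
        encoded = line.encode("utf-8")
        # +1 accounts for the newline joining this line to the previous one.
        extra = len(encoded) + (1 if current else 0)
        if current and size + extra > max_bytes:
            yield b"\n".join(current)
            current, size = [], 0
            extra = len(encoded)
        current.append(encoded)
        size += extra
    if current:
        yield b"\n".join(current)


def send_batched(lines, host="127.0.0.1", port=8125, max_bytes=1432):
    """Send metric lines to a statsd daemon, one datagram per batch.

    host and port are placeholders (hypothetical), not production values.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for payload in batch_metrics(lines, max_bytes):
            sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Usage would be along the lines of `send_batched(["eventbus.GET.200:1|c", "eventbus.POST.201:1|c"])`. Keeping the packing logic separate from the socket call makes it testable without a running daemon.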
[17:21:40] (03CR) 10BryanDavis: [C: 031] tools: Don't load kernel logging module explicitly [puppet] - 10https://gerrit.wikimedia.org/r/301534 (owner: 10Yuvipanda) [17:21:46] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:21:49] PROBLEM - salt-minion processes on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:06] PROBLEM - Check size of conntrack table on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:06] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:16] checking analytics1032.. [17:22:36] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:45] PROBLEM - DPKG on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:56] PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:56] PROBLEM - Disk space on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:33] godog: i'm not in touch with sprockets folks, but i bet they'd be ok with it [17:23:37] if not, we can fork, i build the deb packages for it anyway [17:23:46] elukey me too! [17:26:00] (03CR) 10Chad: [C: 031] "Minor last bit about the regex, but otherwise great and what we need." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [17:27:23] (03PS7) 10Paladox: gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [17:28:58] (03PS8) 10Paladox: gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [17:29:51] 06Operations, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2502551 (10jcrespo) [17:30:56] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2502549 (10jcrespo) 05Open>03Resolved Both servers are now in jessie. The only remaining trusty servers is dbproxy1002 and dbproxy1004, which will be upgraded to jessie when we perform... [17:31:13] 06Operations, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1938131 (10jcrespo) 15- dbproxies done. [17:31:21] 06Operations, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2502562 (10jcrespo) [17:31:54] elukey: i'm going to reboot analytics1032 [17:31:54] ok? 
[17:32:35] (03PS4) 10Dzahn: Gerrit: Remove all the junk to support 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/300930 (owner: 10Chad) [17:34:22] (03PS4) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [17:34:24] (03PS2) 10BBlack: VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 [17:34:26] (03PS4) 10BBlack: VCL backends 3/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:34:28] (03PS2) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 [17:34:30] (03PS4) 10BBlack: VCL backends 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [17:34:32] (03PS1) 10BBlack: cache_misc: no need for (?i) on planet regex [puppet] - 10https://gerrit.wikimedia.org/r/301650 (https://phabricator.wikimedia.org/T110717) [17:35:18] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2502581 (10jcrespo) 05Open>03Resolved No more issues from dbproxy1002, and we now have proxy redundancy. 
[17:35:32] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3526/lead.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/300930 (owner: 10Chad) [17:35:34] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: no need for (?i) on planet regex [puppet] - 10https://gerrit.wikimedia.org/r/301650 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:36:25] ottomata: sure [17:36:26] (03PS5) 10Dzahn: Gerrit: Remove all the junk to support 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/300930 (owner: 10Chad) [17:36:29] (03Abandoned) 10BBlack: VCL backends 3/N: no need for (?i) on planet [puppet] - 10https://gerrit.wikimedia.org/r/300580 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:36:35] I didn't find anything in server-board and I can ssh [17:36:37] really weird [17:37:07] you can ssh? [17:37:14] sorry can't [17:37:14] i can't ssh. [17:37:15] ok [17:37:23] @log powercycling analytics1032 [17:37:28] !log powercycling analytics1032 [17:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:55] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 (owner: 10BBlack) [17:39:21] 06Operations, 10DBA, 07Availability: Setup automatic failover for misc database servers - https://phabricator.wikimedia.org/T141547#2502618 (10jcrespo) [17:39:33] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (owner: 10BBlack) [17:39:35] 06Operations, 10DBA, 07Availability: Setup automatic failover for misc database servers - https://phabricator.wikimedia.org/T141547#2502636 (10jcrespo) [17:39:37] 06Operations, 10DBA, 07Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831329 (10jcrespo) [17:40:01] (03CR) 10Ottomata: "Whoa really, this is no longer used in prod?" 
[puppet] - 10https://gerrit.wikimedia.org/r/301645 (owner: 10Muehlenhoff) [17:40:25] ottomata: ok thanks! do you mind reporting this conversation into the task too? it isn't a huge deal if we can't batch stats but would be nice to have [17:40:30] k [17:40:34] PROBLEM - Host analytics1032 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:36] bblack: in cache/misc.pp the plural in "backends" seems to imply there can be multiple ones, but currently in misc they all just have one. what about adding one, if i had one that does the same thing in the other dc [17:41:43] oh, nevermind actually [17:41:44] mutante hi, you need to c+2 https://gerrit.wikimedia.org/r/#/c/300930/ again please [17:41:50] since you rebased it after +2 [17:41:52] i see logstash and wdqs, ok [17:41:58] and rcs [17:42:05] (03CR) 10Alex Monk: "You would not expect it to be used in labs? It contains some cases specifically for labs..." [puppet] - 10https://gerrit.wikimedia.org/r/301645 (owner: 10Muehlenhoff) [17:42:12] mutante: but it's not for multi-dc, that's entirely separate [17:42:25] 06Operations, 10EventBus, 10Graphite: eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524#2502657 (10Ottomata) So, these stats are HTTP request statistics directly from Python tornado eventlogging-service (eventbus) uses tornado as its http server. The metrics are collected by... [17:42:31] (03PS9) 10Paladox: gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [17:42:47] mutante: right now misc has some cool things for declarative directors/backends, but really only works for the eqiad-only case. text/upload/maps have some completely-different cool stuff to support multi-dc. 
[17:43:03] the "VCL backends" patches I'm working on above are working towards merging up these two ideas so they both get both things [17:43:21] bblack: ok, yea, i have planet2001, it was to switch when we use the other dc [17:43:38] mutante: for now there's not really a sane way. but soon! [17:43:52] bblack: cool :) thanks [17:44:09] hey, first of all, sorry for the page this morning [17:44:32] ostriches: submitted the junk cleanup change just now [17:44:37] paladox: yep [17:44:40] second, I will be leaving a schema change that will take multiple hours running on neodymium [17:44:46] mutante thanks [17:44:47] :) [17:45:23] (03PS10) 10Paladox: gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) [17:45:37] (03PS3) 10Chad: Deployment master: Make sure that none of MediaWiki got taken by a root [puppet] - 10https://gerrit.wikimedia.org/r/301327 [17:45:50] it kills itself if it detects anything strange, but I wanted to let it you know (it an ongoing SAL from this morning) [17:46:21] ugh, dependency cycle [17:46:25] with puppet change [17:46:41] but in compiler it did not show.. what [17:47:21] jynus: thanks for the heads up [17:47:56] oh, and metrics meeting too [17:48:23] (03CR) 10Chad: "Since `find` returns 0 whether or not it finds anything, I basically interpreted it as a string and passed it to `test -z` to see if it's " [puppet] - 10https://gerrit.wikimedia.org/r/301327 (owner: 10Chad) [17:49:42] (03PS5) 10Krinkle: graphite: Set xFilesFactor to 0 for sum/count. [puppet] - 10https://gerrit.wikimedia.org/r/300911 [17:49:44] (03PS2) 10Krinkle: contint: Remove 'integration/phpcs' deployment source [puppet] - 10https://gerrit.wikimedia.org/r/301523 [17:49:58] Weird, Gerrit keeps saying "merge conflict" and yet it rebased fine ^ [17:50:39] Krinkle, try a manual rebase? 
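Note on Chad's review comment at 17:48:23: the point is that the check has to look at whether the list of matches is empty, because the exit status of `find` does not say whether anything matched. The puppet change itself is not shown in this log. The following is only a rough Python equivalent of the same idea, with hypothetical function names, and it assumes a POSIX-style filesystem where `st_uid` is meaningful.

```python
import os
from typing import List


def paths_owned_by(root_dir: str, uid: int) -> List[str]:
    """Return every directory and file under root_dir (inclusive) owned by uid."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            # lstat so that a symlink is judged by its own owner, not its target's.
            if os.lstat(path).st_uid == uid:
                matches.append(path)
    return matches


def tree_is_clean(root_dir: str, uid: int = 0) -> bool:
    """True when nothing under root_dir is owned by uid (root by default).

    Mirrors the `test -z` idea: the decision rests on the match list
    being empty, not on any exit status.
    """
    return not paths_owned_by(root_dir, uid)
```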
[17:50:47] (03PS5) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [17:50:49] (03PS3) 10BBlack: VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 [17:50:51] (03PS5) 10BBlack: VCL backends 3/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:50:53] (03PS3) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 [17:50:55] (03PS5) 10BBlack: VCL backends 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [17:51:07] Krinkle, it looks good to me [17:51:09] jynus: That's just it. In the gerrit overview dashboard, it says "merge conflict", but on the patch it self pressing rebase makes it fine. [17:51:12] probably you are stuk [17:51:16] on the old version [17:51:20] clik on the title [17:51:23] No, there is no conflict. [17:51:28] rebasing showed no conflict. 
[17:51:34] it goes to the "general" view [17:51:37] I'm not getting a conflict locally or anywhere else [17:51:59] I think it is because puppet is fast-forward only and the gerrit UI will say "Merge conflict" if it cannot allow a merge right away [17:52:00] (03CR) 10Yuvipanda: "It is being used in beta cluster http://tools.wmflabs.org/watroles/role/role::logging::mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/301645 (owner: 10Muehlenhoff) [17:52:08] so even though it rebases fine, because the outdated parent, it will say merge conflict yea [17:52:09] h [17:52:14] anyway, I'll stop rebasing :) [17:55:01] (03PS1) 10BBlack: add fake ecc-star.planet key [labs/private] - 10https://gerrit.wikimedia.org/r/301654 [17:55:13] (03CR) 10BBlack: [C: 032 V: 032] add fake ecc-star.planet key [labs/private] - 10https://gerrit.wikimedia.org/r/301654 (owner: 10BBlack) [17:58:07] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1032 disk failure - https://phabricator.wikimedia.org/T141550#2502716 (10Ottomata) [17:58:10] cmjohnson1: ^ [17:58:59] (03PS1) 10Dzahn: gerrit: fix dependency cycle Letsencrypt/Apache [puppet] - 10https://gerrit.wikimedia.org/r/301655 [17:59:16] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1032 disk failure - https://phabricator.wikimedia.org/T141550#2502732 (10Ottomata) I've disconnected from the console, but I've left the box open to the PERC menu. [17:59:48] (03CR) 10Paladox: [C: 031] gerrit: fix dependency cycle Letsencrypt/Apache [puppet] - 10https://gerrit.wikimedia.org/r/301655 (owner: 10Dzahn) [18:00:04] (03PS4) 10Yuvipanda: tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 [18:00:13] bd808 defaultdict was a good idea, I've added it [18:00:20] (03CR) 10BBlack: [C: 031] "Yeah, this is more-correct. the "integrated" class handles dependency internally..." 
[puppet] - 10https://gerrit.wikimedia.org/r/301655 (owner: 10Dzahn) [18:01:17] (03CR) 10Chad: [C: 031] gerrit: fix dependency cycle Letsencrypt/Apache [puppet] - 10https://gerrit.wikimedia.org/r/301655 (owner: 10Dzahn) [18:01:29] (03CR) 10Dzahn: [C: 032] gerrit: fix dependency cycle Letsencrypt/Apache [puppet] - 10https://gerrit.wikimedia.org/r/301655 (owner: 10Dzahn) [18:01:39] bblack: Weird part was puppet compiler didn't catch that. mutante and I both ran it [18:01:51] yea, that part [18:02:00] i got the result still open [18:03:44] (03CR) 10Chad: [C: 031] gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [18:03:48] it builds the catalog but dependency issues like that will just be caught when it actually runs [18:06:16] (03Abandoned) 10GWicke: Service::node: Capture stdout and stderr in journal [puppet] - 10https://gerrit.wikimedia.org/r/301309 (https://phabricator.wikimedia.org/T136957) (owner: 10GWicke) [18:07:28] (03CR) 10Dzahn: [C: 032] gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [18:07:39] (03PS11) 10Dzahn: gerrit: support linking to a phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/301580 (https://phabricator.wikimedia.org/T76459) (owner: 10Paladox) [18:08:02] mutante needs c+2 again ^^ please since it was rebased. 
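Note on the defaultdict remark at 18:00:13: the patch for change 301646 is not reproduced in this log, so the following is only a sketch of how `collections.defaultdict` might be used to count combined status / HTTP version pairs and format them for Graphite's plaintext protocol. The function names, the metric prefix, and the input shape are assumptions made for illustration.

```python
from collections import defaultdict


def count_status_httpver(requests):
    """Count requests per (status, http_version) pair.

    requests is an iterable of (status, http_version) tuples,
    e.g. (200, "1.1"). defaultdict(int) removes the need to
    initialise each key before incrementing it.
    """
    counts = defaultdict(int)
    for status, httpver in requests:
        counts[(status, httpver)] += 1
    return counts


def to_graphite_lines(counts, prefix, timestamp):
    """Format counts as Graphite plaintext lines: 'path value timestamp'."""
    lines = []
    for (status, httpver), n in sorted(counts.items()):
        # Dots separate path components in Graphite, so "1.1" becomes "1_1".
        ver = httpver.replace(".", "_")
        lines.append("%s.%s.%s %d %d" % (prefix, status, ver, n, timestamp))
    return lines
```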
[18:09:06] (03CR) 10BryanDavis: [C: 031] tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 (owner: 10Yuvipanda) [18:15:41] (03Abandoned) 10Muehlenhoff: Remove role::logging::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/301645 (owner: 10Muehlenhoff) [18:18:02] (03PS6) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [18:18:04] (03PS4) 10BBlack: VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 [18:18:06] (03PS6) 10BBlack: VCL backends 3/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [18:18:08] (03PS4) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 [18:18:10] (03PS6) 10BBlack: VCL backends 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [18:27:22] (03PS5) 10BBlack: VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 [18:27:24] (03PS5) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 [18:27:44] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2502820 (10zhuyifei1999) [18:33:33] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2502832 (10fgiunchedi) recap: jmxtrans pushes gauges to statsd every 15s, I can see the metrics being received by statsd-proxy and forwarded to st... 
[18:41:58] (03PS1) 10Chad: Gerrit: Reload replication plugin automatically when config changes [puppet] - 10https://gerrit.wikimedia.org/r/301658 [18:43:24] (03CR) 10Paladox: [C: 031] Gerrit: Reload replication plugin automatically when config changes [puppet] - 10https://gerrit.wikimedia.org/r/301658 (owner: 10Chad) [18:44:31] (03PS1) 10Dzahn: gerrit: fix another dependency issue apache::site/LE [puppet] - 10https://gerrit.wikimedia.org/r/301659 [18:45:10] (03PS2) 10Dzahn: gerrit: fix another dependency issue apache::site/LE [puppet] - 10https://gerrit.wikimedia.org/r/301659 [18:45:37] (03CR) 10Paladox: [C: 031] gerrit: fix another dependency issue apache::site/LE [puppet] - 10https://gerrit.wikimedia.org/r/301659 (owner: 10Dzahn) [18:45:46] (03CR) 10Dzahn: [C: 032] gerrit: fix another dependency issue apache::site/LE [puppet] - 10https://gerrit.wikimedia.org/r/301659 (owner: 10Dzahn) [18:50:29] gerrit restart ... [18:51:14] done [18:51:27] paladox: you can check now too [18:51:36] mutante thanks [18:51:38] yay it works [18:51:41] great [18:51:44] you can now do [18:51:47] T1#1 [18:51:48] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1 [18:53:18] mutante i closed https://phabricator.wikimedia.org/T76459 as resolved :) [18:55:15] paladox: very nice, back from 2014 :) [18:55:22] Yep [18:55:30] seems when i publish it [18:55:35] it gets a 500 server error [18:55:38] i wanted to link to comments before [18:55:38] mutante ^^ [18:55:50] i think that may be because of the gerrit review bot [18:55:58] ? [18:56:00] which connects to phabricator [18:56:28] Never mind works [18:56:48] But the phabricator bot wont support Bug: T1#1 though [18:56:48] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1 [18:57:03] but that is not what the bug i closed was on about i think [18:57:05] mutante ^^ [18:57:36] the ticket does not mention the bot, no [18:57:59] where did you see a 500 error? 
[18:58:07] On gerrit [18:58:16] testing here [18:58:16] https://gerrit.wikimedia.org/r/#/c/145575/ [18:58:21] it published now [18:58:37] but i got 500 error when i did it, probaly gerritbot. [18:58:43] brb, dinner. [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160728T1900). Please do the needful. [19:00:25] * thcipriani does [19:01:23] !log restarted grrrit-wm [19:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:51] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520#2502892 (10hashar) The alarm is not delayed by half an hour. It is delay but not as much as I thought. Looking at the green... [19:02:52] mutante: i have a couple of instances to bootstrap, if you have the bandwidth :) [19:03:38] (03PS3) 10Dzahn: Gerrit: Remove old library linking [puppet] - 10https://gerrit.wikimedia.org/r/300932 (owner: 10Chad) [19:03:51] (03PS1) 10Thcipriani: all wikis to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301663 [19:05:02] (03CR) 10Dzahn: [C: 032] Gerrit: Remove old library linking [puppet] - 10https://gerrit.wikimedia.org/r/300932 (owner: 10Chad) [19:05:06] (03PS5) 10Chad: Minor tweaks to 2.12.2 package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 [19:05:23] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[19:05:37] (03CR) 10Thcipriani: [C: 032] all wikis to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301663 (owner: 10Thcipriani) [19:05:55] urandom: in a minute, after the gerrit one [19:06:02] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301663 (owner: 10Thcipriani) [19:06:24] mutante: cool; let me know [19:06:25] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.12 [19:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:54] (03CR) 10Eevans: [C: 031] "I'm ready; Let's do this! :)" [puppet] - 10https://gerrit.wikimedia.org/r/301642 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:06:59] (03PS2) 10Eevans: Enable Casssandra instance restbase1014-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301642 (https://phabricator.wikimedia.org/T134016) [19:07:14] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [19:08:44] (03CR) 10Dzahn: [C: 032] Enable Casssandra instance restbase1014-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301642 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:08:52] rhodium? [19:08:59] Luke081515: new puppetmaster [19:09:06] ah [19:09:26] strontium is decomissioned? [19:09:30] no [19:10:00] no or not yet? :) [19:10:09] urandom: 1014-c merged [19:10:15] mutante: thanks! [19:10:50] Luke081515: https://phabricator.wikimedia.org/T98173 [19:11:09] ah [19:12:35] 07Blocked-on-Operations, 06Operations, 10Kartographer, 10Wikimedia-Extension-setup, and 4 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2502925 (10Yurik) > We can already get technical feedback from WikiVoyage, we aren't going to get much... 
[19:14:05] (03PS2) 10Dzahn: Gerrit: Reload replication plugin automatically when config changes [puppet] - 10https://gerrit.wikimedia.org/r/301658 (owner: 10Chad) [19:14:36] (03CR) 10Dzahn: [C: 032] ""gerrit.autoReload : If true, automatically reloads replication destinations and settings after replication.config file is updated, withou" [puppet] - 10https://gerrit.wikimedia.org/r/301658 (owner: 10Chad) [19:17:00] Luke081515: from https://phabricator.wikimedia.org/T139471 i think it's going to be in addition, and strontium will still be upgraded and stay [19:17:09] ah [19:20:51] (03PS6) 10Dzahn: labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [19:22:15] !log T134016: Bootstrapping restbase1014-c.eqiad.wmnet [19:22:16] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:15] (03CR) 10Dzahn: "Moritz has explained why it does not hit corp OIT mirror, it only happens in a clustered setup. removed those comments." [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [19:23:27] (03PS1) 10Chad: Revert "gerrit: support linking to a phabricator comment" [puppet] - 10https://gerrit.wikimedia.org/r/301665 [19:23:38] mutante: That's the one ^^^ [19:24:49] (03CR) 10Dzahn: [C: 032] Revert "gerrit: support linking to a phabricator comment" [puppet] - 10https://gerrit.wikimedia.org/r/301665 (owner: 10Chad) [19:24:57] (03PS2) 10Dzahn: Revert "gerrit: support linking to a phabricator comment" [puppet] - 10https://gerrit.wikimedia.org/r/301665 (owner: 10Chad) [19:25:03] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[19:26:42] I see we've a lot of Undefined variable: index in /srv/mediawiki/php-1.28.0-wmf.12/includes/api/ApiQueryUserContributions.php on line 340. I've a fix merged on master, should I cherry pick it for this evening SWAT? [19:28:21] Hey, a question. We had two spikes of failed jobs for ores today in 12:52 and 13:52 UTC. Was something happening at that time? [19:28:35] https://grafana.wikimedia.org/dashboard/db/ores-extension [19:28:59] 12:47:58 < hashar> beside a spike of errors at 12:15 UTC logstash does not show much errors [19:30:03] Dereckson: if you can cherry pick it, I can deploy it now. [19:30:16] thcipriani: fine [19:31:45] thcipriani: https://gerrit.wikimedia.org/r/#/c/301666/ [19:33:43] Dereckson: thank you! [19:34:11] You're welcome. [19:36:20] Dereckson: hello [19:36:35] ah [19:36:51] Amir1: I have noticed the spikes but they were rather short and I haven't looked at the errors themselves. [19:37:16] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1330890 (10Krinkle) >>! In T101141#2502351, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVYya... [19:37:19] gilles: https://phabricator.wikimedia.org/T101141#2502979 [19:37:42] godog: ^ [19:38:01] hashar_: thanks, I was thinking if sometime was down. Did the spike happen for all jobs or just ores jobs? [19:38:08] *something [19:38:12] Amir1: no idea I havent looked at them [19:38:24] okay [19:38:37] let me check graphite [19:38:53] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: Connection refused [19:39:16] ^ that just got installed, ok [19:39:23] ya [19:39:42] * urandom was waiting for it... [19:39:44] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-30 19:39:21. 
[19:39:58] :) yep thx [19:42:27] !log deploy restbase cdd164c4e canary on restbase1007 [19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:28] Im back [19:43:32] mutante ^^ [19:43:32] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:43:32] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:43:42] !log thcipriani@tin Synchronized php-1.28.0-wmf.12/includes/api/ApiQueryUserContributions.php: [[gerrit:301666|Fix Undefined variable issue in ApiQueryUserContributions]] (duration: 00m 32s) [19:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:54] paladox: unfortunately one change had to reverteted [19:43:55] Amir1: do you have access to the production logstash ? [19:44:02] paladox: and that caused the 500 you saw [19:44:08] hashar_: yup [19:44:16] I checked that too [19:44:16] be reverted, cant type [19:44:28] https://phabricator.wikimedia.org/T141368 [19:47:33] it seems it was just ores [19:47:52] 06Operations, 10Monitoring, 06Release-Engineering-Team: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) 
- https://phabricator.wikimedia.org/T141520#2503030 (10hashar) p:05Triage>03Normal [19:55:26] thanks [20:02:53] !log deploy restbase cdd164c4e [20:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:01] 06Operations, 10Traffic: SSL cert for policy.wm.org expiring Aug 27 - https://phabricator.wikimedia.org/T141564#2503063 (10Dzahn) [20:06:24] mutante: https://gerrit.wikimedia.org/r/#/c/301643 is ready whenever you've got a moment (last today) [20:06:35] ACKNOWLEDGEMENT - HTTPS-policy on policy.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate policy.wikimedia.org valid until 2016-08-27 16:01:01 +0000 (expires in 29 days) daniel_zahn https://phabricator.wikimedia.org/T141564 [20:07:08] 06Operations, 10Traffic: SSL cert for policy.wm.org expiring Aug 27 - https://phabricator.wikimedia.org/T141564#2503063 (10Dzahn) a:05RobH>03None [20:07:44] 06Operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571514 (10Dzahn) This cert is going to expire next month. now. Created a subtask for that. 
[20:09:24] 06Operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#2503097 (10Dzahn) [20:09:26] 06Operations, 10Traffic: SSL cert for policy.wm.org expiring Aug 27 - https://phabricator.wikimedia.org/T141564#2503093 (10Dzahn) 05Open>03Invalid oops, duplicate of T140263 [20:09:58] 06Operations, 10Traffic: SSL cert for policy.wm.org expiring Aug 27 - https://phabricator.wikimedia.org/T141564#2503099 (10Dzahn) [20:10:00] 06Operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571514 (10Dzahn) [20:10:59] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase2006-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301643 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:11:38] urandom: go ahead [20:11:50] mutante: awesome; thanks again [20:13:42] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:15:33] mutante: heh, you merged the ssl task into an invisible one [20:16:51] Luke081515: i closed the new one as duplicate. did not create the original one. it's not public because it's in the procurement space [20:17:15] :-/ [20:18:32] Luke081515: https://phabricator.wikimedia.org/T93796 [20:18:54] hm [20:19:06] grrrit-wm: is dead or behind [20:19:40] ostriches: ^^ [20:19:44] * apergos runs away [20:19:55] Somebody *just* kicked it [20:20:10] i restarted it after each gerrit restart but _this_ was not me [20:20:38] I'm raising my bribe. Instead of $20. Now it's $50 who can make that stupid-ass bot restart itself when it hits a gerrit d/c. 
[20:21:08] LOL [20:21:40] ostriches why not create a !grrit-wm restart command lol [20:23:16] that entire dialog just happened in -labs [20:23:47] Yep [20:24:11] the way we use -operations and -labs and -releng at the same time for the same thing :) [20:24:20] and randomly switch [20:25:29] !log T134016: Bootstrapping restbase2006-c.codfw.wmnet [20:25:30] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:28] some of us don't camp in those other channels [20:27:44] I slum in wikimedia-dev, seems good enough to me :-P [20:28:25] E_TOOMANYCHANNELS [20:28:49] yeah no kidding [20:28:52] yes, and we always solve that problem by adding another channel ("so we can focus") [20:28:54] what did I just find out about the other day [20:29:00] a databases specific channel I think [20:29:11] LOL [20:29:31] mutante: maybe what we need is to add another medium, say slack, or discourse [20:29:34] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:29:37] mutante: so we can focus. [20:29:48] both, I'd say [20:34:59] not sure if kidding [20:35:06] yes, it's always both [20:39:35] PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: Connection refused [20:41:44] well slack, discourse, and um... instagram :-P [20:42:08] I think twitter is using too much market share to add to the list :-P [20:42:59] lol, i don't even have twitter, but i do fb. [20:44:05] yeah don't really do fb [20:44:20] don't do slack, discourse or instagram either tbh :-P [20:44:34] i (sort of) use twitter, but not facebook [20:44:56] mostly because facebook is so much better at connecting you to others, and i don't need to be that connected [20:45:15] at all. 
[20:45:33] twitter is nice as a live news stream [20:45:45] fb not so much and I don't really use social media to be social [20:46:52] Oh [20:46:56] i always thought twitter's purpose was to deteriorate the english language, but yeah, i guess you could use it for news too [20:47:11] sigh, restarted it again. [20:47:21] I use twitter only to follow on the latest windows 10 insiders builds [20:47:22] :) [20:47:23] good luck mu tante [20:47:37] paladox: nobody's perfect :-P [20:47:51] Yep [20:47:53] all right I'm checking out for the night, have a good one [20:47:59] (03CR) 10Dzahn: "hi, i'm a bot" [puppet] - 10https://gerrit.wikimedia.org/r/301327 (owner: 10Chad) [20:49:24] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-30 20:49:04. [20:50:45] (03PS8) 10BBlack: VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 [20:50:47] (03PS8) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 [20:58:44] (03PS1) 10Eevans: Undo temporary rsync setup on RESTBase codfw staging nodes [puppet] - 10https://gerrit.wikimedia.org/r/301674 [21:00:17] mutante: re: ^^^ I'm all done syncing that data. Thanks again for setting that up, it was really helpful. [21:01:03] mutante: and I can do the cleanup on those machines if you like (i guess that's just killing the daemon and removing the config?) [21:02:23] and the systemd unit, i guess? [21:04:30] urandom: does it make sense to keep them separate in site.pp? 
[21:04:39] wondering how soon we want to do something again [21:04:56] where the 3 machines will be slightly different [21:04:59] urandom oh, never thought facebook was better at connecting [21:05:21] i use it as i have other friends but i mainly use it since you have news in your news feed [21:05:22] lol [21:05:45] mutante: so, what will happen (i think), is that the eqiad staging nodes will be replaced with beefier hardware [21:06:06] mutante: at which time, i'd want to transfer those snapshots back over to seed them [21:06:43] mutante: there is an issue open, and general agreement that we ought to upgrade staging, but nothing has actually happened [21:06:51] so i expect that means we're talking "months" [21:07:12] if you think it's OK to leave that in site.pp in the meantime, I'm fine with that [21:08:19] mutante: i don't think it hurts anything to have rsync running there, but i was under the impression that putting this into site.pp was a quick-hack sort of thing [21:08:46] so i didn't want to drop the ball on my end [21:10:06] urandom: i did not mean to keep rsync running, i just meant the part where all 3 nodes are a single regex [21:10:16] which made it harder to change things on just a single one of them [21:10:24] reverting is fine though [21:10:29] it's the cleanest change [21:10:41] oh, sorry, i misunderstood [21:12:38] (03CR) 10Dzahn: [C: 032] Undo temporary rsync setup on RESTBase codfw staging nodes [puppet] - 10https://gerrit.wikimedia.org/r/301674 (owner: 10Eevans) [21:14:15] urandom: done. yes, please do the clean up. the 3 things you already mentioned, yea, stop the daemon and kill the config snippet. there is no package to remove because having rsync (client) is normal [21:14:25] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [21:15:51] mutante: will do. 
[21:15:56] :) [21:17:43] (03CR) 10Dzahn: [C: 032] "this works" [puppet] - 10https://gerrit.wikimedia.org/r/301327 (owner: 10Chad) [21:18:06] (03PS4) 10Dzahn: Deployment master: Make sure that none of MediaWiki got taken by a root [puppet] - 10https://gerrit.wikimedia.org/r/301327 (owner: 10Chad) [21:20:23] ^ that's a new icinga check that will tell us when permissions are messed up on -staging ... and once it gets added it's expected that it will be CRIT right away :) [21:37:14] ACKNOWLEDGEMENT - Host analytics1032 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn disk failure = https://phabricator.wikimedia.org/T141550 [22:06:14] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2503579 (10Ladsgroup) [22:08:05] (03CR) 10MaxSem: Test numeric sorting on test wiki, remove test from beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301644 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [22:08:27] (03PS2) 10MaxSem: Labs: remove wmgUseGuidedTour - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300472 [22:09:09] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseGuidedTour - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300472 (owner: 10MaxSem) [22:09:38] (03Merged) 10jenkins-bot: Labs: remove wmgUseGuidedTour - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300472 (owner: 10MaxSem) [22:11:29] (03PS1) 10ArielGlenn: tiny script that retrieves config values from dump config files [dumps] - 10https://gerrit.wikimedia.org/r/301712 (https://phabricator.wikimedia.org/T141563) [22:11:34] (03PS2) 10MaxSem: Labs: remove wmgUseWPB - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300473 [22:11:42] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseWPB - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300473 (owner: 10MaxSem) [22:11:50] (03PS4) 10Yuvipanda: 
tools: Don't load kernel logging module explicitly [puppet] - 10https://gerrit.wikimedia.org/r/301534 [22:11:55] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't load kernel logging module explicitly [puppet] - 10https://gerrit.wikimedia.org/r/301534 (owner: 10Yuvipanda) [22:12:04] (03PS5) 10Yuvipanda: tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 [22:12:07] (03PS2) 10MaxSem: Labs: remove wgThumbnailMinimumBucketDistance - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300474 [22:12:09] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Send combined status / httpver stats to graphite [puppet] - 10https://gerrit.wikimedia.org/r/301646 (owner: 10Yuvipanda) [22:12:11] (03Merged) 10jenkins-bot: Labs: remove wmgUseWPB - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300473 (owner: 10MaxSem) [22:12:13] (03CR) 10MaxSem: [C: 032] Labs: remove wgThumbnailMinimumBucketDistance - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300474 (owner: 10MaxSem) [22:12:22] (03PS2) 10MaxSem: Labs: remove wgThumbnailBuckets - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300475 [22:12:28] (03CR) 10MaxSem: [C: 032] Labs: remove wgThumbnailBuckets - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300475 (owner: 10MaxSem) [22:12:37] (03PS2) 10MaxSem: Labs: remove wgUseBetaFeatures - it's wmg actually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300476 [22:12:44] (03Merged) 10jenkins-bot: Labs: remove wgThumbnailMinimumBucketDistance - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300474 (owner: 10MaxSem) [22:12:46] (03CR) 10MaxSem: [C: 032] Labs: remove wgUseBetaFeatures - it's wmg actually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300476 (owner: 10MaxSem) [22:12:49] (03CR) 10jenkins-bot: [V: 04-1] tiny script that retrieves config values from dump config files [dumps] - 
10https://gerrit.wikimedia.org/r/301712 (https://phabricator.wikimedia.org/T141563) (owner: 10ArielGlenn) [22:13:03] (03Merged) 10jenkins-bot: Labs: remove wgThumbnailBuckets - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300475 (owner: 10MaxSem) [22:13:14] (03Merged) 10jenkins-bot: Labs: remove wgUseBetaFeatures - it's wmg actually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300476 (owner: 10MaxSem) [22:14:53] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2503579 (10Pchelolo) > Change method of precaching to several models at a time. Please do this in the backwards compatible way and ping me when it's done, so that I could upd... [22:16:01] (03PS2) 10MaxSem: Labs: remove wmgUseMultimediaViewer - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300477 [22:16:06] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseMultimediaViewer - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300477 (owner: 10MaxSem) [22:16:29] (03Merged) 10jenkins-bot: Labs: remove wmgUseMultimediaViewer - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300477 (owner: 10MaxSem) [22:17:05] 06Operations, 10Incident-20151216-Labs-NFS, 05Wikimedia-Incident: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#2503649 (10greg) [22:17:10] 06Operations, 10Incident-20151216-Labs-NFS, 05Wikimedia-Incident: Add step in start-nfs to ask operator to consider dropping some snapshots - https://phabricator.wikimedia.org/T121890#2503650 (10greg) [22:17:12] 06Operations, 10Incident-20151216-Labs-NFS, 06Labs, 05Wikimedia-Incident: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2503651 (10greg) [22:17:16] 06Operations, 10Incident-20151216-Labs-NFS, 06Labs, 05Wikimedia-Incident: Investigate better way of deferring activation of 
Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#2503654 (10greg) [22:17:18] (03PS2) 10ArielGlenn: tiny script that retrieves config values from dump config files [dumps] - 10https://gerrit.wikimedia.org/r/301712 (https://phabricator.wikimedia.org/T141563) [22:17:19] 06Operations, 10Continuous-Integration-Config, 10Incident-20160126-WikimediaDomainRedirection, 07Regression, 05Wikimedia-Incident: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#2503652 (10greg) [22:17:22] 06Operations, 10Incident-20150825-Redis, 10Monitoring, 05Wikimedia-Incident: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#2503656 (10greg) [22:17:24] 06Operations, 10Incident-20150825-Redis, 10Monitoring, 05Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2503655 (10greg) [22:17:28] 06Operations, 10ops-codfw, 06DC-Ops, 10Incident-20150617-LabsNFSOutage, 05Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2503658 (10greg) [22:17:32] 06Operations, 10ops-eqiad, 06DC-Ops, 10Incident-20151216-Labs-NFS, and 3 others: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#2503660 (10greg) [22:17:39] 06Operations, 10Architecture, 10Incident-20150423-Commons, 10RESTBase, and 6 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#2503662 (10greg) [22:17:41] (sorry) [22:17:48] 06Operations, 10Incident-20150820-OCSP, 10Traffic, 07HTTPS, 05Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2503668 (10greg) [22:17:53] 06Operations, 10DBA, 10Incident-20150205-SiteOutage, 05Wikimedia-Incident: sleeper database connection surges during outage - https://phabricator.wikimedia.org/T88770#2503670 (10greg) [22:17:56] 
06Operations, 10Incident-20150205-SiteOutage, 07Availability, 13Patch-For-Review, 05Wikimedia-Incident: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2503671 (10greg) [22:22:10] (03PS2) 10MaxSem: Labs: remove wmgUseRestbaseVRS - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300479 [22:22:12] (03PS2) 10MaxSem: Labs: RevisionSlider is already loaded in prod, remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300482 [22:22:14] (03PS2) 10MaxSem: Labs: Kartographer is already loaded in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300483 [22:22:16] (03PS2) 10MaxSem: Labs: remove wmgVisualEditorAccessRESTbaseDirectly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300480 [22:22:18] (03PS2) 10MaxSem: Labs: remove wmgUseNavigationTiming - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300481 [22:22:20] (03PS2) 10MaxSem: Labs: remove commented out OnlineStatusBar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300486 (https://phabricator.wikimedia.org/T34128) [22:22:22] (03PS2) 10MaxSem: Labs: don't load Interwiki - duplicates prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300484 [22:22:24] (03PS2) 10MaxSem: Labs: don't load MultimediaViewer - already done in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300485 [22:22:35] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 13Patch-For-Review: Increase time before alert for elasticsearch disk space issues - https://phabricator.wikimedia.org/T136702#2503719 (10debt) p:05Triage>03Normal [22:24:41] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2503730 (10Ladsgroup) >>! In T141575#2503622, @Pchelolo wrote: >> Change method of precaching to several models at a time. > Please do this in the backwards compatible way an... 
[22:27:05] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseRestbaseVRS - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300479 (owner: 10MaxSem) [22:27:11] (03CR) 10MaxSem: [C: 032] Labs: remove wmgVisualEditorAccessRESTbaseDirectly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300480 (owner: 10MaxSem) [22:27:15] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseNavigationTiming - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300481 (owner: 10MaxSem) [22:27:21] (03CR) 10MaxSem: [C: 032] Labs: RevisionSlider is already loaded in prod, remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300482 (owner: 10MaxSem) [22:27:25] (03CR) 10MaxSem: [C: 032] Labs: Kartographer is already loaded in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300483 (owner: 10MaxSem) [22:27:29] (03CR) 10MaxSem: [C: 032] Labs: don't load Interwiki - duplicates prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300484 (owner: 10MaxSem) [22:27:33] (03Merged) 10jenkins-bot: Labs: remove wmgUseRestbaseVRS - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300479 (owner: 10MaxSem) [22:27:36] (03Merged) 10jenkins-bot: Labs: remove wmgVisualEditorAccessRESTbaseDirectly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300480 (owner: 10MaxSem) [22:27:38] (03CR) 10MaxSem: [C: 032] Labs: don't load MultimediaViewer - already done in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300485 (owner: 10MaxSem) [22:27:41] (03Merged) 10jenkins-bot: Labs: remove wmgUseNavigationTiming - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300481 (owner: 10MaxSem) [22:27:43] (03CR) 10MaxSem: [C: 032] Labs: remove commented out OnlineStatusBar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300486 (https://phabricator.wikimedia.org/T34128) (owner: 10MaxSem) [22:27:47] (03Merged) 10jenkins-bot: Labs: RevisionSlider is already loaded in prod, remove [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/300482 (owner: 10MaxSem) [22:27:49] (03PS3) 10Thcipriani: contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [22:27:51] (03Merged) 10jenkins-bot: Labs: Kartographer is already loaded in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300483 (owner: 10MaxSem) [22:27:58] (03Merged) 10jenkins-bot: Labs: don't load Interwiki - duplicates prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300484 (owner: 10MaxSem) [22:28:02] I approved MaxSem deploying some Beta Cluster only changes, btw [22:28:03] (03Merged) 10jenkins-bot: Labs: don't load MultimediaViewer - already done in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300485 (owner: 10MaxSem) [22:28:09] (03Merged) 10jenkins-bot: Labs: remove commented out OnlineStatusBar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300486 (https://phabricator.wikimedia.org/T34128) (owner: 10MaxSem) [22:29:07] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 14 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [22:32:47] 07Blocked-on-Operations, 06Operations, 10Kartographer, 10Wikimedia-Extension-setup, and 4 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2503750 (10Esanders) >>! In T139946#2502925, @Yurik wrote: >> ...maps are not working in the apps...... [22:42:05] !log maxsem@tin Synchronized wmf-config/: Labs only (duration: 00m 45s) [22:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:58] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[22:43:50] (03Abandoned) 10MaxSem: Labs: remove wmgUseImageMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300478 (owner: 10MaxSem) [23:00:04] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160728T2300). Please do the needful. [23:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:55] kaldari, I left a comment on your patch [23:01:13] oh? [23:02:46] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10Dzahn) Looks like this ticket is lacking a (manager) approval. Please get one. Meanwhile i am moving on with creating the user (it has to be separate from addin... [23:04:21] (03CR) 10Kaldari: Test numeric sorting on test wiki, remove test from beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301644 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [23:04:30] MaxSem: replied [23:05:26] Krenair: i see a disadvantage now if we'd remove bastiononly and replace it with allusers, we cant create user accounts before giving them actual access. slows down access requests since everything can only happen after approvals [23:06:03] MaxSem: hmm, maybe I'm wrong: https://gerrit.wikimedia.org/r/#/c/301550/1/wmf-config/InitialiseSettings-labs.php [23:06:10] might useful for other folks [23:06:22] I guess I'll leave it... 
[23:06:29] (03PS1) 10Dzahn: admin: add shell account for Jan Dittrich [puppet] - 10https://gerrit.wikimedia.org/r/301721 (https://phabricator.wikimedia.org/T141339) [23:07:02] mutante, yeah well, that's just how all-users already works [23:07:07] (03PS2) 10Kaldari: Test numeric sorting on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301644 (https://phabricator.wikimedia.org/T141433) [23:07:13] you're giving them rutherfordium access when you add their user [23:07:44] Krenair: not really since they cant get there without a bastion [23:07:52] but after that change.. then it would [23:08:09] mutante, because of the network, not the admin module [23:08:41] yes, but what matters if they actually have access or not [23:09:05] like this guy i am creating now [23:09:09] i couldnt do this anymore [23:09:32] until after WMF found out who is the manager of WMDE [23:10:21] I'm not sure users should be created prior to approval anyway [23:11:11] (03PS6) 10BryanDavis: [WIP] Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) [23:11:21] ok fine, i'm gonna let it sit another week [23:11:43] (03CR) 10MaxSem: [C: 032] Test numeric sorting on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301644 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [23:11:45] (03Abandoned) 10Dzahn: admin: add shell account for Jan Dittrich [puppet] - 10https://gerrit.wikimedia.org/r/301721 (https://phabricator.wikimedia.org/T141339) (owner: 10Dzahn) [23:13:07] we are already so effective at this [23:13:45] (03PS3) 10MaxSem: Test numeric sorting on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301644 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [23:16:20] kaldari, pulled on mw1099 [23:17:47] MaxSem: This change only affects how things are written to the database, so I'm not sure how to test it without just syching it to testwiki. i.e. 
I don't think X-Wikimedia-Debug is going to be useful. [23:18:32] mhm, indeed - LinksUpdate is now asynchronous so can't test reasonably [23:18:54] it did work fine on Beta Labs at least :) [23:19:36] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/301644/ (duration: 00m 29s) [23:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:07] 06Operations, 06Release-Engineering-Team, 15User-greg, 05Wikimedia-Incident: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2503945 (10greg) [23:20:14] kaldari, MWException from line 282 of /srv/mediawiki-staging/php-1.28.0-wmf.12/includes/collation/IcuCollation.php: MediaWiki does not support ICU locale "default-u-kn" [23:20:40] :( [23:21:19] ah I bet I know what's wrong. I bet the code it's dependent on didn't make the train [23:21:24] better rollback [23:21:42] sorry [23:22:01] (03PS1) 10MaxSem: Revert "Test numeric sorting on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301724 [23:22:09] (03PS1) 10Kaldari: Revert "Test numeric sorting on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301725 [23:22:15] (03CR) 10MaxSem: [C: 032] Revert "Test numeric sorting on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301724 (owner: 10MaxSem) [23:22:25] oops :) [23:22:38] looks like you're on it [23:22:40] (03Merged) 10jenkins-bot: Revert "Test numeric sorting on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301724 (owner: 10MaxSem) [23:23:04] (03Abandoned) 10Kaldari: Revert "Test numeric sorting on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301725 (owner: 10Kaldari) [23:23:09] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2503966 (10Dzahn) a:03yuvipanda Hey Yuvi, could you please do the 
checkbox about labs lists? Otherwise i'd have to reset the password for all you admi... [23:23:55] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/301724/1 (duration: 00m 24s) [23:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:18] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2503969 (10yuvipanda) @Dzahn I meant 'admin'. I don't have the password at all, I've just been using the mailman master password, which is probably pret... [23:26:13] MaxSem: Thanks though. I'll wait and try it again after the next train [23:28:06] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2503993 (10Dzahn) I would prefer it if the list admins could handle administration of their lists. The password needs to be shared with the other admins. [23:29:05] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2503994 (10yuvipanda) Sure, I'll do that later then. There's no hurry for us to finish that box off. [23:31:48] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: puppet fail [23:33:06] kaldari: MaxSem: another solution: we could cherry-pick the uca change? [23:33:21] kaldari: what's your goal and ETA to deploy that in real wikis? [23:33:37] Dereckson: Not really necessary. I'm fine with waiting as there is no hard deadline on this. 
[23:34:20] okay so next candidate SWAT windows is Tuesday evening [23:34:26] (the first after the next train) [23:36:15] I vote for waiting if that's how kaldari is feeling :) [23:36:32] (03PS5) 10Dzahn: Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [23:36:42] I'll do the Tuesday evening window [23:37:24] (03CR) 10Dzahn: "the journalctl part is just like in many other admin groups. so if we want to change it we'd change it for all of them. so i'm going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [23:38:39] (03CR) 10Dzahn: [C: 032] Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [23:41:36] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484278 (10Dzahn) group has been created. (but not added to hosts) [23:51:56] (03PS1) 10Dzahn: eventbus: add new eventbus-admins group to nodes via role [puppet] - 10https://gerrit.wikimedia.org/r/301726 (https://phabricator.wikimedia.org/T141013) [23:59:18] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures