[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T0000).
[00:10:10] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[00:17:34] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, Patch-For-Review: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2657991 (awight) @AndyRussG I believe we have server-side logging that will show us every o...
[00:31:14] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[00:34:43] Operations, Phabricator (Upstream), Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2658003 (Paladox)
[00:38:47] (PS2) BBlack: upload storage: transition cp1063+cp1064 [puppet] - https://gerrit.wikimedia.org/r/311998
[00:39:15] (CR) BBlack: [C: 2 V: 2] upload storage: transition cp1063+cp1064 [puppet] - https://gerrit.wikimedia.org/r/311998 (owner: BBlack)
[00:48:49] PROBLEM - Varnishkafka log producer on cp1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[01:10:44] RECOVERY - Varnishkafka log producer on cp1074 is OK: PROCS OK: 1 process with command name varnishkafka
[01:19:58] (PS3) Dzahn: admin: create shell account for Volker E. [puppet] - https://gerrit.wikimedia.org/r/307667 (https://phabricator.wikimedia.org/T143465)
[01:24:08] (CR) Dzahn: [C: 2] admin: create shell account for Volker E. [puppet] - https://gerrit.wikimedia.org/r/307667 (https://phabricator.wikimedia.org/T143465) (owner: Dzahn)
[01:26:11] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:28:35] !log Rebooting iridium to apply kernel update
[01:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:33:49] PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100%
[01:34:13] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:34:41] RECOVERY - Host iridium is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms
[01:34:46] and it's back
[01:35:05] !log reboot successful, iridium is back online
[01:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:36:36] :)
[01:36:53] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:37:07] hmm that's odd
[01:38:15] it appears to be up
[01:38:19] i hope that's just a bit delayed
[01:38:24] and comes back in a moment
[01:38:40] well I can access git-ssh so I believe it must be just delayed
[01:38:48] i think so yea
[01:39:20] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy
[01:39:24] it does say ssh6, so IPv6
[01:39:26] and there we go
[01:39:31] ok
[01:39:41] NTP service is listed as critical
[01:39:50] service ntp start doesn't work
[01:40:19] is it "offset unknown"?
[01:40:35] yeah, now it just went to status: ok
[01:40:45] that's relatively normal
[01:40:50] after a reboot
[01:40:51] everything is green now
[01:40:53] :)
[01:40:53] cool
[01:41:02] thanks for having my back
[01:41:05] yw
[01:46:48] :)
[01:49:39] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1803.309624 Seconds
[01:49:39] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 1804.17389 Seconds
[01:50:11] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 1832.332595 Seconds
[01:51:33] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, Patch-For-Review: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2658041 (AndyRussG) @awight woohoo yeah let's do it! Thanks!!!!
[01:52:03] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 101.280585 Seconds
[01:52:10] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 107.105938 Seconds
[01:52:40] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 135.045516 Seconds
[01:53:00] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:55:23] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2658044 (Dzahn) Hey @Volker_E This is now done :) Your user account has been created on the host called "rutherfordium.eqiad.wmnet". That is where people.wikimedia.org lives. Here yo...
[01:55:49] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2658045 (Dzahn) Open>Resolved
[01:58:53] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, Patch-For-Review: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2658060 (awight) The bug can be reproduced from the backend: ``` mwrepl metawiki print_r(Re...
[01:59:26] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:03:14] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, Patch-For-Review: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2658061 (awight) Also interesting: ``` $cache = MessageCache::singleton(); print_r($cache->...
[02:03:39] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:18:15] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2658067 (Dzahn) I made the public_html directory and put a placeholder there: https://people.wikimedia.org/~volkere/
[02:26:24] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:31:52] Operations, Monitoring, Traffic, HTTPS, Patch-For-Review: adjust ssl certificate montioring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2658075 (Dzahn) p:Triage>Normal
[02:32:05] Operations, Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#2658076 (Dzahn) p:Triage>Normal
[02:32:53] Operations, Pybal, Services, Patch-For-Review, User-mobrovac: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518#2658077 (Dzahn) p:Triage>High
[02:33:10] Operations, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#2658078 (Dzahn) p:Triage>Normal
[02:34:56] Operations, Icinga: Make nagios check_disk check for inode usage as well - https://phabricator.wikimedia.org/T84171#923686 (Dzahn) I guess we want to wait with this until neon has been moved to a beefier box (because we already have T1242 and this would at a lot more checks at once).
[02:35:21] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:37:02] Operations, Mail, OTRS, Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2658083 (Dzahn) p:Triage>High
[02:37:48] Operations, DBA, Prometheus-metrics-monitoring: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072#2658084 (Dzahn) p:Triage>Normal
[02:38:22] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 16m 39s)
[02:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:38:49] Operations: Have Diamond collect Linux KSM metrics on Ganeti hosts - https://phabricator.wikimedia.org/T146038#2658085 (Dzahn) p:Triage>Low
[02:39:52] Operations, ops-codfw, DBA: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2658103 (Dzahn) p:Triage>Normal
[02:40:21] Operations, ops-codfw: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#2658107 (Dzahn) p:Triage>Normal
[02:45:09] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:02:30] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:09:40] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[03:12:00] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 17m 08s)
[03:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:18:48] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 22 03:18:48 UTC 2016 (duration 6m 48s)
[03:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:31:36] (PS1) Catrope: Remove individual wikis' config for wgOresModels, use 'default' instead [mediawiki-config] - https://gerrit.wikimedia.org/r/312168
[03:33:10] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[03:36:39] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:57:40] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:33:13] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:01:49] (PS1) KartikMistry: Fix typo in changelog [debs/contenttranslation/apertium-swe] - https://gerrit.wikimedia.org/r/312174
[05:05:05] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:30:15] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:36:24] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:38:13] Operations, Phabricator (Upstream), Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2658156 (Peachey88) > Paladox added a commit: rPHAB1667c5b2bf9f: Expose the field text to maniphest advanced search again. This is different to...
[05:51:27] (PS1) Giuseppe Lavagetto: naggen2: do not order resources alphabetically [puppet] - https://gerrit.wikimedia.org/r/312186
[05:52:01] (CR) Giuseppe Lavagetto: [C: 2 V: 2] naggen2: do not order resources alphabetically [puppet] - https://gerrit.wikimedia.org/r/312186 (owner: Giuseppe Lavagetto)
[05:54:34] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:00:53] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:10:12] (PS1) KartikMistry: giella-core: Fix distribution [debs/contenttranslation/giella-core] - https://gerrit.wikimedia.org/r/312189
[06:31:29] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:45:24] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[moreutils]
[06:45:54] !log Puppet disabled on analytics1027 to stop periodic Java daemons (prep step for Hadoop cluster reboots)
[06:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:39] Operations, Phabricator (Upstream), Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2658246 (Paladox) That will be done when @mmodell merges from upstream. At the time I did the patch there was no patch from upstream.
[07:02:18] (PS1) KartikMistry: apertium-es-ro: Rebuild for Jessie [debs/contenttranslation/apertium-es-ro] - https://gerrit.wikimedia.org/r/312192
[07:09:38] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:00] (PS1) Giuseppe Lavagetto: puppet-merge: run conftool after having merged on all the other hosts [puppet] - https://gerrit.wikimedia.org/r/312193
[07:11:02] (PS1) Giuseppe Lavagetto: puppet-merge: only sync to other machines if a sha1 is not provided. [puppet] - https://gerrit.wikimedia.org/r/312194
[07:17:36] (CR) Alexandros Kosiaris: [C: 2] puppet-merge: run conftool after having merged on all the other hosts [puppet] - https://gerrit.wikimedia.org/r/312193 (owner: Giuseppe Lavagetto)
[07:17:52] (CR) Alexandros Kosiaris: [C: 2] puppet-merge: only sync to other machines if a sha1 is not provided. [puppet] - https://gerrit.wikimedia.org/r/312194 (owner: Giuseppe Lavagetto)
[07:18:27] <_joe_> akosiaris: uh did you check my bash for the second commit?
[07:18:32] <_joe_> because I didn't :P
[07:18:45] (CR) Alexandros Kosiaris: "Note that this removes from the frontend the ability to do a" [puppet] - https://gerrit.wikimedia.org/r/312194 (owner: Giuseppe Lavagetto)
[07:19:18] the if [ -z ${sha1} ]; then ?
[07:19:24] <_joe_> eheh yes
[07:19:31] you've added 2 lines
[07:19:34] <_joe_> it's correct, but I just printed down the idea :P
[07:19:37] one of them, and fi :P
[07:19:42] an* fi
[07:20:00] how badly could you mess that up ?
[07:20:07] <_joe_> I can be creative
[07:20:23] yeah.. actually that question takes all forms of answers
[07:20:32] and a different one per person at the least
[07:20:35] (CR) Giuseppe Lavagetto: "I am aware; I thought we might just want to:" [puppet] - https://gerrit.wikimedia.org/r/312194 (owner: Giuseppe Lavagetto)
[07:20:40] anyway, merging
[07:20:41] Operations, Labs: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422#2658263 (elukey) About the `gzip: stdin: file size changed while zipping` email: it should be related to the upstart logrotate config. I logged onto `l...
[07:21:17] (CR) Alexandros Kosiaris: "yeah makes sense" [puppet] - https://gerrit.wikimedia.org/r/312194 (owner: Giuseppe Lavagetto)
[07:21:49] off to fixing the freaking innodb deadlock now before jaime finds out
[07:21:59] <_joe_> ahah
[07:22:08] <_joe_> run!
[07:27:03] (PS1) Giuseppe Lavagetto: puppetmasters: introduce a 'puppet' cluster, assign to all masters [puppet] - https://gerrit.wikimedia.org/r/312196
[07:31:41] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish]
[07:33:55] !log rebooting stat1004 for kernel upgrades
[07:34:00] (PS3) Ema: run-no-puppet: do not interpret grep pattern as a regex [puppet] - https://gerrit.wikimedia.org/r/312004
[07:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:34:06] (CR) Ema: [C: 2 V: 2] run-no-puppet: do not interpret grep pattern as a regex [puppet] - https://gerrit.wikimedia.org/r/312004 (owner: Ema)
[07:40:11] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 1558 MB (3% inode=88%)
[07:40:18] !log rolling restart of trusty swift frontend servers in codfw for kernel security update
[07:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:41:13] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:44:20] PROBLEM - Host ms-fe2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:08] ^ fixed, forgot to click "Ok" in the Icinga dialogue...
[07:47:41] RECOVERY - Disk space on thumbor1002 is OK: DISK OK
[07:48:28] (CR) MarcoAurelio: "How many users do have the 'moodbar-admin' right on all wikis? We should remove it from users **before**, to avoid having to request its r" [mediawiki-config] - https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: Catrope)
[07:52:36] !log rebooted stat100[23] for kernel upgrades
[07:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:53:54] (PS1) Ema: varnish-backend-restart: do not interfere with puppet [puppet] - https://gerrit.wikimedia.org/r/312199
[07:54:51] RECOVERY - Host ms-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms
[07:58:12] !log uploaded varnishkafka 1.0.12-1 to reprepro
[07:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:01:01] !log rolling restart of the whole Analytics Hadoop cluster for kernel upgrades (analytics* hosts)
[08:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:02:12] (CR) Alexandros Kosiaris: [C: 2] Fix typo in changelog [debs/contenttranslation/apertium-swe] - https://gerrit.wikimedia.org/r/312174 (owner: KartikMistry)
[08:03:41] ACKNOWLEDGEMENT - MegaRAID on db2017 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Marostegui https://phabricator.wikimedia.org/T145844
[08:03:44] Operations, MediaWiki-JobRunner, Beta-Cluster-reproducible, Patch-For-Review: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2658328 (hashar) `jobchron.log` did not rotate but I believe that is due to logrotate only considering them after a c...
[08:03:49] (PS1) Hashar: jobchron on trusty did not log at the proper place [puppet] - https://gerrit.wikimedia.org/r/312201 (https://phabricator.wikimedia.org/T146040)
[08:05:30] Operations, ops-codfw, DBA: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2658330 (Marostegui) Hey @PPaul just adding you here to make sure this doesn't get missed. Thanks!
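Aside on the `if [ -z ${sha1} ]; then` check discussed at 07:19: the puppet-merge script itself is not shown in this log, so what follows is only a minimal sketch of how such an emptiness test behaves. The function name `describe_sync` and both messages are hypothetical, chosen for illustration. The sketch quotes the expansion, which is the more defensive spelling, since an unquoted value containing spaces would split into several words and break the test.

```shell
#!/bin/bash
# Hypothetical helper, NOT taken from puppet-merge: reports which branch an
# "is sha1 empty?" test takes for a given argument.
describe_sync() {
    sha1="${1:-}"
    # Quoted expansion: stays a single word even when empty or containing spaces.
    if [ -z "${sha1}" ]; then
        echo "no sha1 given: sync to other hosts"
    else
        echo "sha1 given: local merge only"
    fi
}

describe_sync ""
describe_sync "abc123"
```

The unquoted form from the chat also takes the "empty" branch when the variable is unset, because `[ -z ]` is a one-argument test on the non-empty string `-z`; quoting simply makes that behaviour explicit rather than incidental.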
[08:07:35] (PS2) Gehel: Monitor usage of in-memory elasticsearch datastructures [puppet] - https://gerrit.wikimedia.org/r/311848 (https://phabricator.wikimedia.org/T144387) (owner: EBernhardson)
[08:07:47] Operations, ops-codfw, DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2658333 (jcrespo) Open>Resolved a:jcrespo
[08:09:07] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/312199 (owner: Ema)
[08:09:09] (CR) Gehel: [C: 2] Monitor usage of in-memory elasticsearch datastructures [puppet] - https://gerrit.wikimedia.org/r/311848 (https://phabricator.wikimedia.org/T144387) (owner: EBernhardson)
[08:09:44] !log Resyncing all jobrunner deployment installations since only 41/68 minions have completed fetch/checkout
[08:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:11:09] (PS2) Ema: upload storage: transition cp1071+cp1072 [puppet] - https://gerrit.wikimedia.org/r/311999 (owner: BBlack)
[08:11:15] (CR) Ema: [C: 2 V: 2] upload storage: transition cp1071+cp1072 [puppet] - https://gerrit.wikimedia.org/r/311999 (owner: BBlack)
[08:11:42] (PS2) Ema: varnish-backend-restart: do not interfere with puppet [puppet] - https://gerrit.wikimedia.org/r/312199
[08:11:46] (CR) Ema: [C: 2 V: 2] varnish-backend-restart: do not interfere with puppet [puppet] - https://gerrit.wikimedia.org/r/312199 (owner: Ema)
[08:17:19] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[08:17:42] ^ that's me
[08:19:17] !log Cleanup jobrunner list of minions in redis ( "deploy:jobrunner/jobrunner:minions" )
[08:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:19:44] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:24:31] (PS2) Gehel: Bump batch size for WDQS updater to 500 [puppet] - https://gerrit.wikimedia.org/r/311209 (owner: Smalyshev)
[08:26:03] (CR) Gehel: [C: 2] Bump batch size for WDQS updater to 500 [puppet] - https://gerrit.wikimedia.org/r/311209 (owner: Smalyshev)
[08:26:41] (Abandoned) KartikMistry: giella-core: Fix distribution [debs/contenttranslation/giella-core] - https://gerrit.wikimedia.org/r/312189 (owner: KartikMistry)
[08:32:58] PROBLEM - Varnishkafka log producer on cp1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[08:34:49] ema are you working on cp1099 or can I restart vk?
[08:35:15] elukey: please go ahead, you might also want to upgrade it while you're there
[08:35:55] !log restarted varnishkafka on cp1099 (log abandoned )
[08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:49] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:36:50] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 18 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[08:37:52] RECOVERY - Varnishkafka log producer on cp1099 is OK: PROCS OK: 1 process with command name varnishkafka
[08:39:21] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:39:58] (PS2) Ema: upload storage: finish up eqiad (cp1073+cp1074) [puppet] - https://gerrit.wikimedia.org/r/312000 (owner: BBlack)
[08:40:05] (CR) Ema: [C: 2 V: 2] upload storage: finish up eqiad (cp1073+cp1074) [puppet] - https://gerrit.wikimedia.org/r/312000 (owner: BBlack)
[08:40:13] !log installed varnishkafka 1.0.12 on cp1099
[08:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:43:07] !log installing varnishkafka 1.0.12 on cache:upload esams
[08:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:44:54] Operations, MediaWiki-JobRunner, Release-Engineering-Team, Trebuchet: Some Trebuchet minions are not responding to salt call when deploying jobrunner - https://phabricator.wikimedia.org/T146352#2658377 (hashar)
[08:46:29] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:49:16] !log Deploying schema change on S7 master - T141951
[08:49:17] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951
[08:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:01:01] Operations, MediaWiki-JobRunner, Release-Engineering-Team, Trebuchet: Some Trebuchet minions are not responding to salt call when deploying jobrunner - https://phabricator.wikimedia.org/T146352#2658407 (hashar) Open>declined I am giving up trying to deploy jobrunner update. The whole Tre...
[09:02:37] !log installing varnishkafka 1.0.12 on cache:upload codfw
[09:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:03:50] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:05:17] (PS2) Hashar: jobchron on trusty did not log at the proper place [puppet] - https://gerrit.wikimedia.org/r/312201 (https://phabricator.wikimedia.org/T96132)
[09:07:59] (PS1) Ema: cache_upload eqiad: set upload_storage_experiment in the right place [puppet] - https://gerrit.wikimedia.org/r/312203
[09:10:13] (CR) Ema: [C: 2] cache_upload eqiad: set upload_storage_experiment in the right place [puppet] - https://gerrit.wikimedia.org/r/312203 (owner: Ema)
[09:12:28] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:13:35] (PS1) Giuseppe Lavagetto: ganglia_clusters: s/name/description/ [puppet] - https://gerrit.wikimedia.org/r/312204
[09:13:37] (PS1) Giuseppe Lavagetto: hieradata: stop repeating data for clusters [puppet] - https://gerrit.wikimedia.org/r/312205
[09:13:40] (PS1) Giuseppe Lavagetto: hiera: allow searching for the full key when using expand_path [puppet] - https://gerrit.wikimedia.org/r/312206
[09:16:20] (CR) jenkins-bot: [V: -1] hiera: allow searching for the full key when using expand_path [puppet] - https://gerrit.wikimedia.org/r/312206 (owner: Giuseppe Lavagetto)
[09:17:04] hashar: ready to merge that when you like (https://gerrit.wikimedia.org/r/#/c/312201/)
[09:17:20] how are things on jessie?
[09:17:46] apergos: good morning!
[09:18:22] apergos: jobchron.log did not rotate, but I think it is because logrotate has a delay to notice a file ( /var/lib/logrotate/status track the date )
[09:18:27] but I can read the log files now!
[09:18:34] the patch above fixes up the upstart script for jobchron
[09:18:38] yes I saw
[09:18:44] it missed the >> 2>&1 :]
[09:18:47] yep
[09:18:49] so I have merely copy pasted from the other
[09:19:08] and I believe it should be safe. Might have to reload or restart the jobchron service on all trusty jobrunners
[09:19:11] so like I say I'm ready to merge that through
[09:19:20] if you are happy with it
[09:19:28] because lgtm
[09:19:45] !log upgrade / restart of elasticsearch eqiad cluster done T145404 / T146123
[09:19:46] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404
[09:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:01] \o/
[09:20:09] Operations, CirrusSearch, Discovery, Discovery-Search (Current work), Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2658440 (Gehel)
[09:20:19] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:29] Operations, Graphite: upgrade grafana to 3.1 - https://phabricator.wikimedia.org/T146354#2658442 (fgiunchedi)
[09:23:00] Operations, Graphite: upgrade grafana to 3.1.1 - https://phabricator.wikimedia.org/T146354#2658454 (fgiunchedi)
[09:23:43] (PS1) Ema: upload storage: transition cp2002+cp2005 [puppet] - https://gerrit.wikimedia.org/r/312208
[09:23:45] (PS1) Ema: upload storage: transition cp2008+cp2011 [puppet] - https://gerrit.wikimedia.org/r/312209
[09:23:47] (PS1) Ema: upload storage: transition cp2014+cp2017 [puppet] - https://gerrit.wikimedia.org/r/312210
[09:23:49] (PS1) Ema: upload storage: transition cp2020+cp2022 [puppet] - https://gerrit.wikimedia.org/r/312211
[09:23:51] (PS1) Ema: upload storage: finish up codfw (cp2024+cp2026) [puppet] - https://gerrit.wikimedia.org/r/312212
[09:27:10] Operations, discovery-system: Replace etcd internal auth mechanism with a frontend proxy - https://phabricator.wikimedia.org/T146355#2658457 (Joe)
[09:27:30] <_joe_> done :)
[09:30:03] There are some ApiQueryRevisions::run on dewiki (db1070) running for 1 hour
[09:31:06] no they are not
[09:31:14] monitoring glitch?
[09:36:27] apergos: sorry I have missed your reply. Yeah you can do https://gerrit.wikimedia.org/r/#/c/312201/ :)
[09:36:33] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:36:38] great!
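Aside on the `>> 2>&1` remark at 09:18:44: the upstart job for jobchron is not quoted in this log, so the following is only a minimal sketch of the redirection idiom being referred to, under stated assumptions. `noisy_daemon` is a hypothetical stand-in for a service process, and the log path is a temporary file rather than anything under /var/log/mediawiki.

```shell
#!/bin/bash
# Hypothetical stand-in for a daemon: one line to stdout, one to stderr.
noisy_daemon() {
    echo "normal output"
    echo "error output" >&2
}

# Assumed log path for this demo only (a temp file, not a real service log).
logfile="$(mktemp)"

# ">> file" appends stdout to the file; "2>&1" then points stderr at the
# same destination, so both streams end up in the log.
noisy_daemon >> "$logfile" 2>&1
noisy_daemon >> "$logfile" 2>&1

cat "$logfile"
```

Order matters here: `2>&1` has to come after `>> "$logfile"`, otherwise stderr is duplicated onto the original stdout before stdout is redirected, and errors never reach the file. Appending with `>>` rather than truncating with `>` is what lets a restarted service keep its earlier log lines.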
[09:36:51] (PS3) ArielGlenn: jobchron on trusty did not log at the proper place [puppet] - https://gerrit.wikimedia.org/r/312201 (https://phabricator.wikimedia.org/T96132) (owner: Hashar)
[09:38:35] (CR) ArielGlenn: [C: 2] jobchron on trusty did not log at the proper place [puppet] - https://gerrit.wikimedia.org/r/312201 (https://phabricator.wikimedia.org/T96132) (owner: Hashar)
[09:40:56] Operations, MediaWiki-JobRunner, Release-Engineering-Team, Trebuchet: Some Trebuchet minions are not responding to salt call when deploying jobrunner - https://phabricator.wikimedia.org/T146352#2658493 (hashar) declined>Open So @volans found: NameError: global name '__pillar__' is n...
[09:41:57] hashar: live on mw1161. jobchron restarts via puppet since the conf file was changed
[09:42:11] apergos: ah good to know it is automatic :-]
[09:42:43] guess who is now writing stuff to /var/log/mediawiki/jobchron.log ? mw1161 !
[09:42:45] thanks a ton!
[09:43:13] Operations, MediaWiki-JobRunner, Release-Engineering-Team, Trebuchet: Some Trebuchet minions are not responding to salt call when deploying jobrunner - https://phabricator.wikimedia.org/T146352#2658510 (Volans) >>! In T146352#2658493, @hashar wrote: > So @volans found: > > NameError: global...
[09:44:08] Operations, MediaWiki-JobRunner, Release-Engineering-Team, Trebuchet: Some Trebuchet minions are not responding to salt call when deploying jobrunner - https://phabricator.wikimedia.org/T146352#2658511 (hashar) Open>Resolved a:Joe Giuseppe has done all the magic restart of salt minion...
[09:44:19] <_joe_> apergos: why merge that change? O
[09:44:32] <_joe_> I am pretty sure that messes up with upstart
[09:44:52] <_joe_> anyways, they're getting decommissioned, it's still ok
[09:46:04] it follows the setup for the jobrunner logs
[09:46:19] unless you know that also to be broken
[09:49:01] (PS2) Giuseppe Lavagetto: puppetmasters: introduce a 'puppet' cluster, assign to all masters [puppet] - https://gerrit.wikimedia.org/r/312196
[09:49:15] Operations, Discovery, Traffic, Wikidata, and 2 others: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2658527 (Gehel)
[09:50:48] (PS1) Gehel: wdqs LVS DNS entries [dns] - https://gerrit.wikimedia.org/r/312216 (https://phabricator.wikimedia.org/T132457)
[09:50:57] !log updated jobrunner code to a0e82166 (tweak errors reporting in logs) | Does not include 51014242 "Batch stats to statsd" (poke addshore )
[09:50:57] (CR) Giuseppe Lavagetto: [C: 2] puppetmasters: introduce a 'puppet' cluster, assign to all masters [puppet] - https://gerrit.wikimedia.org/r/312196 (owner: Giuseppe Lavagetto)
[09:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:56:52] (PS2) Giuseppe Lavagetto: ganglia_clusters: s/name/description/ [puppet] - https://gerrit.wikimedia.org/r/312204
[09:59:29] !log rebooting subra/suhail for kernel security update
[09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:01:47] going to run a fast errand while it's not raining
[10:02:17] (acquire food and a diffuser)
[10:09:11] *reads up*
[10:09:31] ACKNOWLEDGEMENT - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 13 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdb3],Exec[mkfs-/dev/sdc1] Filippo Giunchedi disks being diagnosed T140597
[10:17:09] addshore: I refreshed the jobrunner but haven't pulled in your change to batch statsd metrics
[10:17:32] gehel: if you are around, I am looking at the mysterious tox failure for elasticseach-tool :/
[10:18:21] hashar: thanks! Don't spend too much time on it. I used pyscaffold to create the project structure, but I should probably just remove half of the generated code...
[10:18:51] gehel: the issue is that tox does "python setup.py sdist" outside of a virtualenv
[10:19:35] setuptools 27.3 is apparently installed but not used/recognized by pyscaffold bah
[10:19:57] gehel: some log at https://integration.wikimedia.org/ci/job/tox-jessie/11682/artifact/.tox/log/tox-0.log/*view*/
[10:21:20] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap
[10:21:27] gehel: or yeah maybe drop pyscaffold entirely. It seems to use pbr which is all fine
[10:22:01] hashar: at the same time, the packaging of that project is probably crap at this point and needs to be improved (I know mostly nothing about python packaging)
[10:22:46] the oozie alarm is mine
[10:23:01] hashar: I'll try to do some cleanup after lunch
[10:23:08] hashar: thanks a lot for the help!
[10:24:04] gehel: and I caught a bug in the CI job which was not capturing some tox log file :] [10:24:24] hashar: at least that was useful to something :) [10:24:54] !log rolling reboot of trusty swift backend servers in eqiad for kernel security update [10:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:26] (03PS1) 10Filippo Giunchedi: prometheus: pre aggregate CPU utilization across instances [puppet] - 10https://gerrit.wikimedia.org/r/312222 [10:29:35] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: pre aggregate CPU utilization across instances [puppet] - 10https://gerrit.wikimedia.org/r/312222 (owner: 10Filippo Giunchedi) [10:30:17] (03PS1) 10Gehel: wdqs - LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/312223 (https://phabricator.wikimedia.org/T132457) [10:30:19] (03PS1) 10Gehel: wdqs - add icinga check for LVS services [puppet] - 10https://gerrit.wikimedia.org/r/312224 (https://phabricator.wikimedia.org/T132457) [10:30:21] (03PS1) 10Gehel: wdqs - configure varnish to use LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) [10:32:36] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap [10:32:55] (03CR) 10Gehel: "This change is based on a very limited understanding of role::cache and its configuration. 
It is probably wrong in many ways, feedback is " [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [10:39:47] (03PS3) 10Giuseppe Lavagetto: ganglia_clusters: s/name/description/ [puppet] - 10https://gerrit.wikimedia.org/r/312204 [10:41:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "why go through all this when you can just point the wdqs backend to" [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [10:43:29] (03CR) 10Giuseppe Lavagetto: "Also, application-level routes are defined in hieradata/common/discovery.yaml, which is where you should define your route." [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [10:44:09] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia_clusters: s/name/description/ [puppet] - 10https://gerrit.wikimedia.org/r/312204 (owner: 10Giuseppe Lavagetto) [10:50:35] <_joe_> sigh [10:50:38] (03PS1) 10Giuseppe Lavagetto: ganglia: fix parameter name change [puppet] - 10https://gerrit.wikimedia.org/r/312227 [10:50:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ganglia: fix parameter name change [puppet] - 10https://gerrit.wikimedia.org/r/312227 (owner: 10Giuseppe Lavagetto) [10:53:17] (03PS1) 10Muehlenhoff: Imported Upstream version 1.0.2i [debs/openssl] - 10https://gerrit.wikimedia.org/r/312228 [10:53:19] (03PS1) 10Muehlenhoff: Bump changelog for 1.0.2i update [debs/openssl] - 10https://gerrit.wikimedia.org/r/312229 [10:57:02] (03PS2) 10Giuseppe Lavagetto: hieradata: stop repeating data for clusters [puppet] - 10https://gerrit.wikimedia.org/r/312205 [11:03:01] (03CR) 10Gehel: "@Giuseppe: I'm going through "all this" because it seems to be the way it is done for other services (in particular, looking at role::cach" [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [11:03:02] PROBLEM - puppet last run on 
mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:24] <_joe_> gehel: where exactly is that done in cache::misc? [11:06:25] <_joe_> gehel: that is done for things not going through misc [11:06:47] sorry, not cache::misc, cache::text [11:07:00] <_joe_> so one might ask himself if wdqs is still something we are experimenting with, and should remain in cache::misc [11:07:09] <_joe_> or if it should be moved to cache::text [11:07:25] <_joe_> so for instance: do we have anything that depends on it? [11:07:32] <_joe_> as in, other services calling it? [11:08:08] there is some graph tools that get data from wdqs [11:08:16] (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312231 [11:08:59] <_joe_> gehel: is it going to work active-active, or just one of the two will be active at any time? [11:09:10] <_joe_> and anyways, wait for bblack/ema's feedback [11:09:41] _joe_: ideally it should work active / active, but I understood from bblack that this is not something we are able to do at the moment [11:10:09] <_joe_> gehel: ? [11:10:30] <_joe_> gehel: I don't think that's accurate in general, he was probably referring to cache::misc [11:10:57] <_joe_> anyways. bbiab [11:11:03] _joe_: I might have understood wrong, but I asked the same question for maps (which could be active/active) and I remember that there was still some work to be done on that side [11:11:13] _joe_: thanks for the comment! see you! 
[11:12:55] (03CR) 10Gehel: "Correction, I took cache::text as example, not cache::misc as I did not see any service with codfw+eqiad backends in cache::misc" [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [11:24:16] (03PS1) 10Jcrespo: analytics-backups: unblock 301076 by stop using mysql_wmf class [puppet] - 10https://gerrit.wikimedia.org/r/312232 [11:24:22] ^elukey [11:27:13] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312231 (owner: 10Marostegui) [11:27:37] (03PS1) 10Muehlenhoff: Update ca.patch and cloudflare-c20p1305.patch for changes in 1.0.2i [debs/openssl] - 10https://gerrit.wikimedia.org/r/312234 [11:31:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312231 (owner: 10Marostegui) [11:31:39] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:32:02] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312231 (owner: 10Marostegui) [11:33:49] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with some light weight (duration: 00m 52s) [11:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:58] buffer pool eficiency went down to 98.7% [11:36:17] hope it goes up soon [11:40:05] and indeed it does [11:42:27] (03CR) 10Jcrespo: "@Paladox I am trying to have Analytics ops to give the ok to https://gerrit.wikimedia.org/r/312232" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/308785 (owner: 10Paladox) [11:50:11] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:52:49] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:58] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:55:38] 06Operations, 06Discovery, 06Maps, 10Tilerator, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2658668 (10Yurik) [11:56:15] 06Operations, 10Pybal, 06Services, 13Patch-For-Review, 15User-mobrovac: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518#2658670 (10mobrovac) a:03mobrovac [11:57:16] (03PS1) 10Hashar: Polish up setup and drop pyscaffold [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312235 [11:57:19] (03PS1) 10Hashar: tox: add a 'venv' to run abitrary command [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312236 [11:57:21] (03PS1) 10Hashar: Add basic documentation glue with Sphinx [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 [11:58:15] (03CR) 10jenkins-bot: [V: 04-1] Add basic documentation glue with Sphinx [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 (owner: 10Hashar) [11:58:36] gehel: you got patch :D [11:59:31] gehel: or in short, I have dropped pyscaffold entirely in favor of "pbr" https://gerrit.wikimedia.org/r/#/c/312235/1 [11:59:33] and build pass [12:00:08] gehel: we also have a bot to auto add you as a reviewer : https://www.mediawiki.org/wiki/Git/Reviewers ;] [12:00:13] hashar: kool, thanks a lot! 
[12:03:23] gehel: doc still has to be polished though, and it fails the build [12:03:24] (03PS2) 10Muehlenhoff: Update Debian patches for 1.0.2i [debs/openssl] - 10https://gerrit.wikimedia.org/r/312234 [12:03:27] but hey, baby steps [12:03:57] gehel: you will probably want to squash my change https://gerrit.wikimedia.org/r/#/c/312235/1 in your draft [12:04:02] and that should work (TM) [12:04:56] (03CR) 10Hashar: "A random magic python stacktrace. I have no idea why :(" [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 (owner: 10Hashar) [12:05:06] hashar: I feel bad that you fixed all my issues for me, but thanks! [12:05:24] (03CR) 10Elukey: [C: 031] "LGTM but I'd like to wait for Andrew's opinion before proceeding :)" [puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [12:05:34] gehel: a smarter one would have spent a couple of hours explaining it. But I felt lazy :] [12:06:06] hashar: nah, I'll take some time to actually read your change, that's probably enough explanation!
[12:10:49] !restbase deploy start of d5538ad [12:10:53] elukey: ^^ [12:11:33] lol, log fail [12:11:39] !log restbase deploy start of d5538ad [12:11:39] (03PS3) 10Giuseppe Lavagetto: hieradata: stop repeating data for clusters [puppet] - 10https://gerrit.wikimedia.org/r/312205 [12:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:12] !log Early SWAT for mobile team ( https://gerrit.wikimedia.org/r/#/c/311977/ ) [12:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:56] (03PS1) 10Elukey: Set read_only false for the analytics' mariadb instance [puppet] - 10https://gerrit.wikimedia.org/r/312240 [12:14:41] mobrovac: checking metrics [12:17:00] (03CR) 10Paladox: [C: 031] "Hi sorry I coulden do this inline but the description in the file needs updating since it still says the old class :)" [puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [12:18:39] (03PS1) 10Gehel: maps - increase osm replication frequency to hourly [puppet] - 10https://gerrit.wikimedia.org/r/312241 (https://phabricator.wikimedia.org/T137939) [12:18:51] (03PS3) 10Paladox: Remove mysql_wmf::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310964 [12:19:08] (03PS4) 10Paladox: Remove mysql_wmf::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310964 [12:19:25] (03Abandoned) 10Paladox: Switch analytics_cluster to the new mariadb::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310963 (owner: 10Paladox) [12:20:11] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-es-ro: Rebuild for Jessie [debs/contenttranslation/apertium-es-ro] - 10https://gerrit.wikimedia.org/r/312192 (owner: 10KartikMistry) [12:20:39] (03PS2) 10Jcrespo: analytics-backups: unblock 301076 by stop using mysql_wmf class [puppet] - 10https://gerrit.wikimedia.org/r/312232 [12:21:00] (03CR) 10Jcrespo: "Good catch, fixed in the latest patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [12:23:33] (03CR) 10Alexandros Kosiaris: [C: 031] "+1 but make sure to test this carefully. I 've been trying to get these down to the minute for some time and usually some issue shows up w" [puppet] - 10https://gerrit.wikimedia.org/r/312241 (https://phabricator.wikimedia.org/T137939) (owner: 10Gehel) [12:24:32] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-ro_0.7.3~r57551-2+wmf1 [12:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:25:02] (03CR) 10Gehel: "@Alexandros: ok, I'll modify this to deploy on maps-test only at this point, make sure it works and replicate on all maps servers." [puppet] - 10https://gerrit.wikimedia.org/r/312241 (https://phabricator.wikimedia.org/T137939) (owner: 10Gehel) [12:25:51] !log installing varnishkafka 1.0.12 on cache:upload ulsfo and eqiad [12:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:37] !log restbase deploy end of d5538ad [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:31:11] (03PS2) 10Gehel: maps - increase osm replication frequency to hourly [puppet] - 10https://gerrit.wikimedia.org/r/312241 (https://phabricator.wikimedia.org/T137939) [12:31:38] !log hashar@tin Synchronized php-1.28.0-wmf.20/extensions/Popups: Merge mw.popups.experiment into mw.popups.core T146035 (duration: 00m 49s) [12:31:39] T146035: Popups ResourceLoader modules are not declaring their dependencies properly - https://phabricator.wikimedia.org/T146035 [12:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:01] (03PS1) 10Alexandros Kosiaris: ganeti: Enable KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/312245 (https://phabricator.wikimedia.org/T146038) [12:34:32] (03PS2) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 
10https://gerrit.wikimedia.org/r/312237 (owner: 10Hashar) [12:35:12] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 (owner: 10Hashar) [12:36:34] (03PS7) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 [12:37:15] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [12:38:41] (03CR) 10Ema: [C: 032] upload storage: transition cp2002+cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/312208 (owner: 10Ema) [12:38:47] (03PS2) 10Ema: upload storage: transition cp2002+cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/312208 [12:38:48] (03CR) 10Ema: [V: 032] upload storage: transition cp2002+cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/312208 (owner: 10Ema) [12:39:47] (03PS8) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 [12:40:16] akosiaris: :-] [12:40:25] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [12:42:14] 06Operations, 10Gerrit: Update gerrit to 2.13 - https://phabricator.wikimedia.org/T146350#2658694 (10Peachey88) [12:44:52] (03CR) 10Gehel: [C: 032] maps - increase osm replication frequency to hourly [puppet] - 10https://gerrit.wikimedia.org/r/312241 (https://phabricator.wikimedia.org/T137939) (owner: 10Gehel) [12:44:58] (03PS3) 10Gehel: maps - increase osm replication frequency to hourly [puppet] - 10https://gerrit.wikimedia.org/r/312241 (https://phabricator.wikimedia.org/T137939) [12:46:21] gehel: the doc change is not ready yet :( [12:47:43] (03CR) 10Muehlenhoff: [C: 032] Imported Upstream version 1.0.2i [debs/openssl] - 10https://gerrit.wikimedia.org/r/312228 (owner: 10Muehlenhoff) [12:47:44] hashar: don't worry, now that you gave me a direction, I can also look a bit on my own! 
[12:48:00] (03CR) 10Muehlenhoff: [C: 032] Bump changelog for 1.0.2i update [debs/openssl] - 10https://gerrit.wikimedia.org/r/312229 (owner: 10Muehlenhoff) [12:48:14] gehel: https://bugs.launchpad.net/pbr/+bug/1384919 :) [12:49:31] hashar: that reads like half chinese to me... [12:52:01] (03PS3) 10Hashar: Add basic documentation glue with Sphinx [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 [12:52:17] gehel: rest is half french so it is parseable :] [12:52:46] (03CR) 10jenkins-bot: [V: 04-1] Add basic documentation glue with Sphinx [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 (owner: 10Hashar) [12:53:48] (03CR) 10Jcrespo: "Look at the comment, aside from that- what you asked for is exactly what this patch does." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312240 (owner: 10Elukey) [12:54:05] (03PS4) 10Hashar: Add basic documentation glue with Sphinx [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/312237 [12:54:54] gehel: fix is https://gerrit.wikimedia.org/r/#/c/312237/2..4/requirements.txt [12:55:05] just add pbr to requirements.txt and that workaround whatever crazy issue [12:55:20] so I think you can add to your change https://gerrit.wikimedia.org/r/#/c/312235/1 [12:55:21] hashar: just like magic! [12:55:25] and I will rebase mine on top of your :] [12:55:29] then you are all settled up [12:55:34] delta doc having to be written [12:56:02] jouncebot: next [12:56:02] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T1300) [12:57:14] hashar: thanks a lot! [12:59:33] (03PS2) 10Elukey: Update the Analytics mariadb config [puppet] - 10https://gerrit.wikimedia.org/r/312240 [13:00:05] hashar, Dereckson, addshore, and aude: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T1300). 
[13:00:28] no patches for swat today [13:00:41] we had one for mobile but I got it deployed half an hour ago [13:00:43] (03PS9) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 [13:00:55] (03CR) 10Muehlenhoff: [C: 032] Update Debian patches for 1.0.2i [debs/openssl] - 10https://gerrit.wikimedia.org/r/312234 (owner: 10Muehlenhoff) [13:00:57] (03PS3) 10Elukey: Update the Analytics mariadb config [puppet] - 10https://gerrit.wikimedia.org/r/312240 [13:01:24] hashar: ok, I think I squashed all your improvements in my initial commit... [13:02:08] !log uploaded openssl 1.0.2i for jessie-wikimedia to carbon [13:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:39] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: Enable KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/312245 (https://phabricator.wikimedia.org/T146038) (owner: 10Alexandros Kosiaris) [13:02:43] (03PS2) 10Alexandros Kosiaris: ganeti: Enable KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/312245 (https://phabricator.wikimedia.org/T146038) [13:02:45] (03CR) 10Alexandros Kosiaris: [V: 032] ganeti: Enable KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/312245 (https://phabricator.wikimedia.org/T146038) (owner: 10Alexandros Kosiaris) [13:11:44] (03CR) 10Elukey: [C: 032] "Thanks Jaime! I've also double checked with https://puppet-compiler.wmflabs.org/4156/ and everything looks good to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/312240 (owner: 10Elukey) [13:11:54] (03PS4) 10Elukey: Update the Analytics mariadb config [puppet] - 10https://gerrit.wikimedia.org/r/312240 [13:18:32] (03PS2) 10Ema: upload storage: transition cp2008+cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/312209 (https://phabricator.wikimedia.org/T145661) [13:21:47] (03PS2) 10Ema: upload storage: transition cp2014+cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/312210 (https://phabricator.wikimedia.org/T145661) [13:21:49] (03PS3) 10Ema: upload storage: transition cp2008+cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/312209 (https://phabricator.wikimedia.org/T145661) [13:22:18] !log resume rolling reboot of trusty swift backend servers in eqiad for kernel security update [13:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:05] (03PS2) 10Ema: upload storage: transition cp2020+cp2022 [puppet] - 10https://gerrit.wikimedia.org/r/312211 (https://phabricator.wikimedia.org/T145661) [13:24:08] (03PS2) 10Ema: upload storage: finish up codfw (cp2024+cp2026) [puppet] - 10https://gerrit.wikimedia.org/r/312212 (https://phabricator.wikimedia.org/T145661) [13:27:26] (03PS1) 10Gehel: osm - move logs to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/312248 [13:27:42] (03PS4) 10Giuseppe Lavagetto: hieradata: stop repeating data for clusters [puppet] - 10https://gerrit.wikimedia.org/r/312205 [13:28:12] (03PS5) 10Giuseppe Lavagetto: hieradata: stop repeating data for clusters [puppet] - 10https://gerrit.wikimedia.org/r/312205 [13:34:03] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Contrary to my ingenuous belief, the two lists of clusters, the icinga one and the ganglia one, do not map 1:1 to each other! 
See https://" [puppet] - 10https://gerrit.wikimedia.org/r/312205 (owner: 10Giuseppe Lavagetto) [13:45:17] (03CR) 10Alexandros Kosiaris: [C: 031] osm - move logs to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/312248 (owner: 10Gehel) [13:45:39] (03PS2) 10Gehel: osm - move logs to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/312248 [13:47:32] (03CR) 10Gehel: [C: 032] osm - move logs to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/312248 (owner: 10Gehel) [13:55:47] (03PS4) 10Andrew Bogott: labs_dns: Fix optional parameter listed before required parameter [puppet] - 10https://gerrit.wikimedia.org/r/308340 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [13:57:53] (03CR) 10Andrew Bogott: [C: 032] labs_dns: Fix optional parameter listed before required parameter [puppet] - 10https://gerrit.wikimedia.org/r/308340 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [14:00:10] (03CR) 10Ottomata: [C: 031] "Sounds good to me. Can we use puppet-compiler to be sure this will be a no-op?" [puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [14:03:11] doing it --^ [14:03:48] 06Operations, 10ops-codfw, 06DC-Ops, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2658845 (10chasemp) >>! In T102626#2581102, @Papaul wrote: > @chasemp what do you want to do with this? Sorry I didn't see this message sooner @papaul. I'm not s... [14:04:40] _joe_: I just found the ticket I was thinking about about active/active clusters: T134404 [14:04:40] T134404: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404 [14:04:45] (03CR) 10Elukey: "Looks good from https://puppet-compiler.wmflabs.org/4158/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [14:11:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] service::node: restrict readability of configurations. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309522 (owner: 10Giuseppe Lavagetto) [14:11:14] (03PS4) 10Giuseppe Lavagetto: service::node: restrict readability of configurations. [puppet] - 10https://gerrit.wikimedia.org/r/309522 [14:11:27] (03CR) 10Hashar: [C: 04-1 V: 04-1] contint: migrate slaves to /srv [puppet] - 10https://gerrit.wikimedia.org/r/311959 (owner: 10Hashar) [14:12:46] (03PS5) 10Muehlenhoff: beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [14:18:27] (03CR) 10Muehlenhoff: [C: 032] beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [14:23:59] (03PS1) 10Brion VIBBER: static.php should use deployed branch for invalid hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) [14:34:15] (03PS5) 10Giuseppe Lavagetto: service::node: restrict readability of configurations. 
[puppet] - 10https://gerrit.wikimedia.org/r/309522 [14:53:10] (03PS1) 10Alexandros Kosiaris: puppetmaster: servermon puppet handler concurrency improvements [puppet] - 10https://gerrit.wikimedia.org/r/312258 [14:54:09] (03PS2) 10Alexandros Kosiaris: puppetmaster: servermon puppet handler concurrency improvements [puppet] - 10https://gerrit.wikimedia.org/r/312258 [14:54:11] (03PS3) 10Hashar: contint: migrate slaves to /srv [puppet] - 10https://gerrit.wikimedia.org/r/311959 [14:54:13] (03PS2) 10Hashar: contint: labs instance all have /dev/vdb [puppet] - 10https://gerrit.wikimedia.org/r/311954 [14:54:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: servermon puppet handler concurrency improvements [puppet] - 10https://gerrit.wikimedia.org/r/312258 (owner: 10Alexandros Kosiaris) [14:54:40] (03CR) 10Hashar: [V: 04-1] contint: migrate slaves to /srv [puppet] - 10https://gerrit.wikimedia.org/r/311959 (owner: 10Hashar) [14:57:46] (03PS1) 10Alexandros Kosiaris: puppetdb: Stop storing reports for now [puppet] - 10https://gerrit.wikimedia.org/r/312259 [14:58:06] (03PS2) 10Alexandros Kosiaris: puppetdb: Stop storing reports for now [puppet] - 10https://gerrit.wikimedia.org/r/312259 [14:58:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetdb: Stop storing reports for now [puppet] - 10https://gerrit.wikimedia.org/r/312259 (owner: 10Alexandros Kosiaris) [15:00:35] (03PS1) 10Ottomata: Copy hive-site.xml into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/312260 (https://phabricator.wikimedia.org/T133208) [15:01:14] (03CR) 10Hashar: [C: 031] "Some cleanup that comes from the pmtpa era :] All Jenkins slaves no mount /dev/vdb !" 
[puppet] - 10https://gerrit.wikimedia.org/r/311954 (owner: 10Hashar) [15:01:39] (03CR) 10jenkins-bot: [V: 04-1] Copy hive-site.xml into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/312260 (https://phabricator.wikimedia.org/T133208) (owner: 10Ottomata) [15:02:43] !log upgrading openssl on cp* [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:14] (03PS2) 10Ottomata: Copy hive-site.xml into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/312260 (https://phabricator.wikimedia.org/T133208) [15:03:42] (03CR) 10Paladox: "@Mobrovac per @Dzahn." [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [15:04:40] (03CR) 10Paladox: [C: 031] "Per @Ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [15:05:24] (03CR) 10Mobrovac: "@Paladox, my -1 still stands, see in-lined comments." [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [15:07:43] (03PS1) 10Hashar: contint: mount /srv the same as /mnt [puppet] - 10https://gerrit.wikimedia.org/r/312262 (https://phabricator.wikimedia.org/T146381) [15:08:05] (03CR) 10Dzahn: "yep, just talked about it. please amend so that the variable names are changed but _not_ the part that is just straight config options of " [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [15:08:31] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:08:48] (03PS3) 10Jcrespo: analytics-backups: unblock 301076 by stop using mysql_wmf class [puppet] - 10https://gerrit.wikimedia.org/r/312232 [15:09:04] elukey, ottomata I am going to deploy ^ [15:09:09] + [15:09:10] 1 [15:09:12] please keep an eye on the backups [15:09:17] ok [15:09:17] for the next week [15:09:28] in case something doesn't work [15:10:02] !log restbase deploy start of d96fbc1 [15:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:17] (03CR) 10Jcrespo: [C: 032] analytics-backups: unblock 301076 by stop using mysql_wmf class [puppet] - 10https://gerrit.wikimedia.org/r/312232 (owner: 10Jcrespo) [15:12:27] (03PS6) 10Paladox: aqs and restbase: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) [15:12:48] mobrovac done ^^ does that look better? [15:13:27] the new cron on analytics1003 seems sane [15:13:41] but time will tell if there is an undersired effect [15:14:44] Anyone else want to give a +3 here? https://gerrit.wikimedia.org/r/#/c/301076/ [15:15:22] +1 ^^ :) [15:15:32] need manual rebase [15:15:38] doing now [15:15:42] Thanks :) [15:16:11] depending on the changes, a new patch will be easier [15:16:38] (03CR) 10Mobrovac: [C: 04-1] "If we are changing camelCase to snake_case, then that should be properly done, not blindly replaced. So, the var names should be cassandra" [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [15:19:00] (03PS5) 10Jcrespo: Delete coredb_mysql module and dependent roles and modules [puppet] - 10https://gerrit.wikimedia.org/r/301076 [15:19:16] I do not discard missing some orphan file [15:20:11] (03PS1) 10Andrew Bogott: Puppet panel: Fill in formatted_params in cached class list. 
[puppet] - 10https://gerrit.wikimedia.org/r/312266 [15:23:04] (03CR) 10Andrew Bogott: [C: 032] Puppet panel: Fill in formatted_params in cached class list. [puppet] - 10https://gerrit.wikimedia.org/r/312266 (owner: 10Andrew Bogott) [15:24:24] PROBLEM - Disk space on ms-be1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=81%) [15:24:32] ok, let's do this- hopefuly we will not break much [15:25:14] I'll take a look at ms-be1004 [15:25:30] maybe the previous logging issue, being / ? [15:25:51] (03CR) 10Jcrespo: [C: 032] Delete coredb_mysql module and dependent roles and modules [puppet] - 10https://gerrit.wikimedia.org/r/301076 (owner: 10Jcrespo) [15:26:00] :) [15:26:02] (03PS6) 10Jcrespo: Delete coredb_mysql module and dependent roles and modules [puppet] - 10https://gerrit.wikimedia.org/r/301076 [15:26:14] (03PS1) 10Hashar: contint: migrate browsertest redis to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312267 (https://phabricator.wikimedia.org/T146381) [15:29:24] !log restbase deploy end of d96fbc1 [15:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:56] (03PS1) 10Jcrespo: Revert "Delete coredb_mysql module and dependent roles and modules" [puppet] - 10https://gerrit.wikimedia.org/r/312268 [15:31:13] I am going to prepare myself, just in case [15:31:54] godog: did you find the issue already? [15:32:19] volans: yeah, currently in a meeting but found it [15:32:55] ok, great, so no need of what I've found :) [15:33:17] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:33:18] volans, ? 
[15:33:35] jynus: for ms-be1004 [15:33:38] I saw many servers with old kernels [15:33:49] and dpkg cache taking 1-2 GB [15:33:54] it's not that [15:34:01] in 5-8 GB / partitions [15:37:27] (03PS3) 10Ottomata: Copy hive-site.xml into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/312260 (https://phabricator.wikimedia.org/T133208) [15:37:35] (03CR) 10Ottomata: [C: 032 V: 032] "Looking fine here: https://puppet-compiler.wmflabs.org/4160/analytics1027.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/312260 (https://phabricator.wikimedia.org/T133208) (owner: 10Ottomata) [15:43:46] (03CR) 10Ema: [C: 032] upload storage: transition cp2008+cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/312209 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:43:48] (03PS7) 10Paladox: aqs and restbase: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) [15:43:51] (03PS4) 10Ema: upload storage: transition cp2008+cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/312209 (https://phabricator.wikimedia.org/T145661) [15:43:53] (03CR) 10Ema: [V: 032] upload storage: transition cp2008+cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/312209 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:44:00] mobrovac done ^^ does that look better ? :) [15:44:34] (03PS8) 10Paladox: aqs and restbase: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) [15:45:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 702 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4286876 keys - replication_delay is 702 [15:52:18] paladox: checking.. [15:52:25] Thanks [15:53:21] paladox: i still see cassandra_localdc instead of cassandra_local_dc etc... 
[15:53:28] Oh [15:53:28] not sure i see what was changed in the latest patch [15:53:31] wait [15:53:36] You can [15:53:39] use that [15:53:44] no, sorry, looking at ps6 [15:53:47] damn new gerrit [15:54:13] Yep [15:54:35] mobrovac it is made easyer in gerrit 2.13 which was just released today [15:54:40] (03CR) 10Mobrovac: [C: 04-1] "it's still localdc instead of local_dc ..." [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [15:54:46] since it shows oranage for old patches [15:54:53] but yeh sorry that i missed that one [15:55:05] I was wondering if that variable was one name [15:56:27] 06Operations, 10ArchCom-RfC, 06Performance-Team, 06Services, and 4 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#2659323 (10mark) [15:56:50] (03PS1) 10Hashar: contint: migrate package_builder from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312270 (https://phabricator.wikimedia.org/T146381) [15:59:30] (03PS9) 10Paladox: aqs and restbase: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) [15:59:41] mobrovac ^^ done :) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T1600). Please do the needful. [16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:02:36] (03CR) 10Paladox: "Is this reverted just in case it fails, ir did it fails?" [puppet] - 10https://gerrit.wikimedia.org/r/312268 (owner: 10Jcrespo) [16:03:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4261116 keys - replication_delay is 0 [16:04:40] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file],Exec[generate varnish.pyconf] [16:05:09] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [16:05:38] (03CR) 10Mobrovac: [C: 031] "Ok with me, OK with PCC - https://puppet-compiler.wmflabs.org/4161/" [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [16:05:49] RECOVERY - Disk space on ms-be1004 is OK: DISK OK [16:05:58] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:31] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:07:28] (03CR) 10Paladox: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [16:08:12] (03Abandoned) 10Jcrespo: Revert "Delete coredb_mysql module and dependent roles and modules" [puppet] - 10https://gerrit.wikimedia.org/r/312268 (owner: 10Jcrespo) [16:08:39] paladox, check some of the pending commits regarding the deleted modules [16:08:55] some may not apply any longer [16:08:58] Oh [16:09:10] the mysql_wmf, etc. 
[16:09:38] Oh yeh i have a patch that removes the copycat module from mysql now since it is not needed [16:10:08] jynus https://gerrit.wikimedia.org/r/#/c/310964/ :) [16:10:34] (03PS5) 10Paladox: Remove mysql_wmf::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310964 [16:10:41] Oh wait never mind [16:10:46] it was removed already lol [16:10:51] (03Abandoned) 10Paladox: Remove mysql_wmf::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310964 (owner: 10Paladox) [16:10:54] yes, it was that [16:10:57] at least [16:11:00] not sure if others [16:11:34] Yep [16:11:45] now that we do not have 20 modules [16:11:55] I will be able to clean up the one we use [16:12:04] (03PS4) 10Paladox: archiva: Fix it not being a autoload module [puppet] - 10https://gerrit.wikimedia.org/r/311194 (https://phabricator.wikimedia.org/T119042) [16:12:17] (03PS7) 10Paladox: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [16:12:22] Oh yep :) [16:12:30] thanks for cleaning it up too :) [16:12:35] jynus ^^ [16:12:56] it sometimes take some time for me to respond [16:13:00] too much backlog [16:13:04] Oh [16:13:33] (03PS1) 10Hashar: contint: remove obsoletes file { ensure => absent } [puppet] - 10https://gerrit.wikimedia.org/r/312275 (https://phabricator.wikimedia.org/T146381) [16:15:04] (03PS3) 10Ema: upload storage: transition cp2014+cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/312210 (https://phabricator.wikimedia.org/T145661) [16:15:10] (03CR) 10Ema: [C: 032 V: 032] upload storage: transition cp2014+cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/312210 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [16:16:43] (03PS2) 10Hashar: contint: remove obsolete files { ensure => absent } [puppet] - 10https://gerrit.wikimedia.org/r/312275 (https://phabricator.wikimedia.org/T146381) [16:17:04] !log offline sdd on ms-be1004 via megacli T144499 [16:17:05] T144499: 
ms-be1004.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T144499 [16:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:18] (03PS10) 10Dzahn: aqs and restbase: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [16:19:36] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 84.11 ms [16:19:37] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 89.05 ms [16:19:37] godog: no puppetswat? [16:19:51] ACKNOWLEDGEMENT - MegaRAID on ms-be1004 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi sdd failed,T144499 [16:19:51] ACKNOWLEDGEMENT - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdd1] Filippo Giunchedi sdd failed,T144499 [16:19:58] my lonely patch is waiting - https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0September.C2.A022 :) [16:20:30] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 37 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [16:20:30] mobrovac: yeah sorry I was busy with the stuff above [16:20:38] no pb [16:20:43] I'm not the only one running the swat either though :) [16:20:43] was just checking [16:21:08] it's never clear to me who is on puppetswat [16:22:00] formally the people on Deployments, practically it depends of course [16:22:08] anyways, looking at the patch [16:22:28] (03PS2) 10Filippo Giunchedi: RESTBase config: Add Swagger UI header info [puppet] - 10https://gerrit.wikimedia.org/r/311958 (owner: 10Mobrovac) [16:23:00] (03CR) 10Dzahn: [C: 032] "compiler says no-op, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [16:23:01] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:23:14] thanks ^^ mutante :) [16:24:05] "swat" [16:24:17] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase config: Add Swagger UI header info [puppet] - 10https://gerrit.wikimedia.org/r/311958 (owner: 10Mobrovac) [16:24:34] (03PS3) 10Filippo Giunchedi: RESTBase config: Add Swagger UI header info [puppet] - 10https://gerrit.wikimedia.org/r/311958 (owner: 10Mobrovac) [16:24:37] (03CR) 10Filippo Giunchedi: [V: 032] RESTBase config: Add Swagger UI header info [puppet] - 10https://gerrit.wikimedia.org/r/311958 (owner: 10Mobrovac) [16:25:06] mobrovac: {{done}} [16:25:22] thnx godog! [16:26:06] np! 
[16:30:07] (03PS3) 10Dzahn: contint: labs instance all have /dev/vdb [puppet] - 10https://gerrit.wikimedia.org/r/311954 (owner: 10Hashar) [16:30:19] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 697 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4262798 keys - replication_delay is 697 [16:30:37] (03CR) 10Dzahn: [C: 032] "yes, we don't have pmtpa anymore:) (and labs is so far only in eqiad)" [puppet] - 10https://gerrit.wikimedia.org/r/311954 (owner: 10Hashar) [16:35:34] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:35:45] hasharAway: hmm, i just saw one potential issue with this: [16:36:04] actually, no, nevermind [16:36:07] ok [16:37:11] (03CR) 10Dzahn: "re-removing self" [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [16:37:28] mutante: thanks :) [16:37:43] yw hashar [16:37:54] mutante: it was really just a dirty trick from when ptmpa instance had a large / while eqiad instance need a mount of extended disk :D [16:38:25] mutante: I am removing all occurences of /mnt and transitioning to /srv [16:38:49] yep, i thought Require "File" became Require "Mount" but that isn't true [16:38:54] it's all good [16:38:59] yeah largely confusing [16:40:19] !log forced logrotation for /etc/logrotate.d/upstart on labvirt1014 to investigate cronspam [16:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:43] (probably not needed but post it anyway) [16:45:02] (03CR) 10Dzahn: "i can deploy this, but only after there is consensus that we should" [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [16:45:37] (03PS1) 10Alex Monk: openstack: Import nova_fixed_multi designate plugin [puppet] - 10https://gerrit.wikimedia.org/r/312278 
(https://phabricator.wikimedia.org/T144317) [16:47:41] (03CR) 10Krinkle: [C: 04-1] "Need to double check this doesn't break assumptions made in VCL about it responding irregardless of hostname when a query string is given." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [16:51:16] (03PS1) 10Alex Monk: Follow-up I695dab22: Fix style [puppet] - 10https://gerrit.wikimedia.org/r/312280 [16:53:59] (03PS2) 10Andrew Bogott: openstack: Import nova_fixed_multi designate plugin [puppet] - 10https://gerrit.wikimedia.org/r/312278 (https://phabricator.wikimedia.org/T144317) (owner: 10Alex Monk) [16:54:07] (03Abandoned) 10Alex Monk: Revert "Revert "New wikitext editor: Enable the Beta Feature in Beta Cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311983 (owner: 10Alex Monk) [16:55:49] (03CR) 10Andrew Bogott: [C: 032] openstack: Import nova_fixed_multi designate plugin [puppet] - 10https://gerrit.wikimedia.org/r/312278 (https://phabricator.wikimedia.org/T144317) (owner: 10Alex Monk) [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T1700). 
[17:00:36] no parsoid deploy today [17:01:41] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:14] (03CR) 10Brion VIBBER: "I think the relevant regex is this one:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [17:03:38] (03CR) 10Gehel: "Comments added from discussion with volans" (034 comments) [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [17:09:21] brb [17:09:53] !log rolling reboot of trusty swift backend servers in eqiad completed [17:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:12] 06Operations, 10media-storage, 07Documentation: Document how to handle 'inconsistent state within the internal storage backends' issues - https://phabricator.wikimedia.org/T135318#2659494 (10Dzahn) p:05Triage>03Normal [17:16:55] 06Operations, 10Graphite: upgrade grafana to 3.1.1 - https://phabricator.wikimedia.org/T146354#2659518 (10Dzahn) p:05Triage>03Normal [17:18:11] (03PS3) 10BBlack: upload storage: transition cp2020+cp2022 [puppet] - 10https://gerrit.wikimedia.org/r/312211 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [17:18:47] (03CR) 10BBlack: [C: 032 V: 032] upload storage: transition cp2020+cp2022 [puppet] - 10https://gerrit.wikimedia.org/r/312211 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [17:21:12] (03PS2) 10Brion VIBBER: static.php should use deployed branch for invalid hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) [17:24:47] !log rebooting ms-be1016, high load caused by XFS bug [17:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:39] 06Operations, 10ops-eqiad, 10DBA: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2659560 (10jcrespo) Accoording to 
lifecycle logs "System is turning off." is the cause of the issue. No logs. No signs of a crash. Logs continue as usual: ``` Sep 19 01:09:39 db1061 sshd[47512]: Set /proc... [17:26:45] (03CR) 10Brion VIBBER: "@Krinkle patchset 2 uses the same regex as the VCL that decides whether to send us to static to decide whether to prepend the current vers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [17:27:11] 06Operations, 10ops-eqiad, 10DBA: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2659561 (10jcrespo) 05Open>03Resolved I think there is not much left to do here, except wait if it happens again. [17:31:45] 06Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659577 (10BBlack) [17:34:05] (03CR) 10Krinkle: [C: 04-1] static.php should use deployed branch for invalid hashes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [17:39:36] (03Draft1) 10Paladox: toollabs: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/312284 [17:39:39] (03Draft2) 10Paladox: toollabs: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/312284 [17:41:51] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/312284 (owner: 10Paladox) [17:45:03] 06Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659624 (10faidon) Nothing appears abnormal in the logs of either csw2, asw nor cr2. Which other hosts on the same network did you try from? I'm interested to find out if they connecte... 
[17:46:15] (03PS3) 10Brion VIBBER: static.php should use deployed branch for invalid hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) [17:47:24] (03CR) 10Brion VIBBER: static.php should use deployed branch for invalid hashes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [17:50:04] apergos: if still around, can you answer some questions about certcleaner.py? [17:51:26] Ι ψαν γιωε ιτ α σηοτ [17:51:28] ouch [17:51:31] I can give it a shot [17:51:32] what's up? [17:51:54] the main question is: Do we still need that at all? [17:52:29] given that the designate-sink plugin should be cleaning up these things these days [17:52:35] subquestions are: why reject before deleting? And, what is this --rotate-aes-key about? [17:53:06] (03PS1) 10Yuvipanda: puppetmaster: Allow enabling cherrypicks [puppet] - 10https://gerrit.wikimedia.org/r/312287 [17:53:20] the rotate-aes-key: normally on deletion of a minion key, the master aes key is rotated, which causes all minions to have to reauth [17:53:57] so if you do a bunch of these in a rwo then you have a bunch of these reauths on in a few minutes [17:54:02] so that's what that is [17:54:06] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/openstack/files/liberty/designate/nova_ldap/base.py;f0340c7c682948670ddebb99dd909918f061b7d5$211 [17:54:22] 06Operations, 10Phabricator (2016-10-xx), 07Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2659676 (10Paladox) [17:54:47] (03PS3) 10BBlack: upload storage: finish up codfw (cp2024+cp2026) [puppet] - 10https://gerrit.wikimedia.org/r/312212 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [17:54:56] (03CR) 10BBlack: [C: 032 V: 032] upload storage: finish up codfw (cp2024+cp2026) [puppet] - 
10https://gerrit.wikimedia.org/r/312212 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [17:55:08] I don't remember why we needed rejection before deletion at this point [17:55:18] apergos: is it harmful to disable it always, as you do here? [17:55:20] --rotate-aes-key I mean [17:55:38] well it's a hole in forward secrecy [17:56:09] So it sounds like leaving it to default ('true') is ok if we're not doing a big batch of a dozen at a time [17:56:15] yes [17:56:16] I think for the purposes of labs instances we don't care about that, right? [17:56:44] a salt command once issued in labs is public to all instances, i.e. anyone [17:56:56] what is the designate-sink plugin? [17:56:58] (03PS2) 10Yuvipanda: puppetmaster: Allow enabling cherrypicks [puppet] - 10https://gerrit.wikimedia.org/r/312287 [17:57:14] this plugin is code that runs when instances are created and deleted [17:57:14] apergos: that's a plugin that is run whenever an instance is deleted. [17:57:25] linky? [17:57:29] I linked it above [17:57:33] ah [17:57:35] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/openstack/files/liberty/designate/nova_ldap/base.py;f0340c7c682948670ddebb99dd909918f061b7d5$211 [17:57:37] 06Operations, 10Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2659684 (10Dzahn) 05Open>03Resolved a:03Dzahn 10:57 Josephine says Ms. Fowler is ok! [17:57:53] sorry, at that point I hadn't realized you were in the conversation :-D [17:57:55] (03CR) 10Yuvipanda: [C: 032 V: 032] "https://puppet-compiler.wmflabs.org/4162/ is ok" [puppet] - 10https://gerrit.wikimedia.org/r/312287 (owner: 10Yuvipanda) [17:58:10] apergos: it's a more recent addition. In theory it makes the certcleaner cron obsolete... 
[17:58:43] Due to other changes the certcleaner needs some updates, so it's suddenly interesting whether or not we can just dump it instead :) [17:59:03] heh [17:59:04] specifically it was https://gerrit.wikimedia.org/r/#/c/204075/3 [17:59:23] not sure why the commit message said git instead of salt [17:59:32] yeah I'm reading the function now [17:59:59] Krenair: probably because I was typing a git command in a different window at the same time :/ [18:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T1800). Please do the needful. [18:00:04] brion and AndyRussG: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:28] well it does seeeeem link the _delete code ought to put the salt key deletion stuff from the cert script out of business [18:00:35] s/link/like/ [18:00:42] \o/ [18:01:03] 06Operations, 10ops-eqiad, 10Analytics-Cluster: decom titanium - https://phabricator.wikimedia.org/T145666#2659725 (10Dzahn) alright, handing it over to ops-eqiad now. the server has been shutdown, public IP removed from DNS, removed from puppet/icinga etc. please go ahead with physical decom steps, disk... [18:01:19] 06Operations, 10ops-eqiad, 10Analytics-Cluster: decom titanium - https://phabricator.wikimedia.org/T145666#2659739 (10Dzahn) a:05Dzahn>03None [18:01:24] certcleaner is run on cron as root [18:01:29] apergos, Krenair, let's disable the certcleaner cron, wait a couple of weeks, and see if there's a mess :) [18:01:32] (03PS1) 10Yuvipanda: puppetmaster: Enable cherrypicks by default in standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312291 [18:01:36] sounds good to me [18:01:41] (But not today, in case it causes an immediate mess) [18:01:45] if there's a ticket can you make sure I'm subscribed please? 
[18:01:51] that way I can get the updates [18:01:52] (03PS2) 10Yuvipanda: puppetmaster: Enable cherrypicks by default in standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312291 [18:01:53] it sends output to /dev/null [18:01:59] but you could have it email us instead [18:02:12] add some print calls in there etc. [18:02:14] I can SWAT today. [18:02:33] I'm updating the ticket now [18:02:37] whee [18:02:42] 06Operations, 10ops-eqiad, 10Analytics-Cluster: decom titanium - https://phabricator.wikimedia.org/T145666#2637714 (10Dzahn) location: eqiad row B, B4 @ 13 [18:03:00] ticket is https://phabricator.wikimedia.org/T146303 [18:03:17] at least, that's the ticket that prompted us to investigate whether the script is still necessary [18:03:32] it uses ldap instead of keystone+nova/labs_metal, and it uses ldapsupportlib and optparse [18:03:43] thcipriani: so i've got the config change for static.php, and two backports on extensions/TimedMediaHandler. if the config change isn't ready to roll, i still need the backports to resolve the caching bug by making all versions consistent [18:04:11] *two backports = one backport on 2 branches [18:04:23] brion: fyi, I've changed the CentralNotice patches, just updating the Deployments wiki now... [18:04:33] whee [18:05:08] awight: those are AndyRussG|bassoo 's :) [18:05:22] brion: awight one quick clarification: we're not deploying wmf.19 anywhere, we're jumping straight to wmf.20. I can still merge the backports, but they won't be deployed anywhere. 
[18:05:28] perfect, thanks [18:05:32] thcipriani: spiff [18:05:52] yeah i need it on wmf.18 for commons, just did wmf.19 just in case :D [18:05:54] brion: thcipriani: great, yeah I've made wmf.18 and wmf.20 backport patches [18:06:41] (03PS2) 10Dzahn: ganglia: ship native systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [18:06:51] brion: okay, the wiki page reflects what we need now [18:07:05] (Holler when I can test or answer questions!) [18:07:05] awight: thanks :) [18:08:10] (03PS1) 10Yuvipanda: puppetmaster: Allow customizing which user owns git repos [puppet] - 10https://gerrit.wikimedia.org/r/312295 [18:08:16] brion: I'm not very familiar with static.php. Krinkle have you had a chance to look at the updated https://gerrit.wikimedia.org/r/#/c/312254 ? [18:08:21] (03CR) 10Yuvipanda: [C: 032] puppetmaster: Enable cherrypicks by default in standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312291 (owner: 10Yuvipanda) [18:08:43] i updated per krinkle's last comment but yeah double-check :) [18:08:44] (03CR) 10Dzahn: [C: 032] ganglia: ship native systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [18:08:49] (03PS3) 10Dzahn: ganglia: ship native systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [18:09:47] thcipriani: if we don't hear back, it's safest to just deploy the backports; that will resolve the immediate issue while we wait (it'll just break again next time if we don't update static.php first ;) [18:10:11] brion: okie doke, sounds good, thanks :) [18:10:16] ok :D [18:10:18] thanks! 
[18:11:58] last thing i want to do is break all the static files on the site the last day of SWATs before the offsite hehe [18:12:59] (03PS2) 10Yuvipanda: puppetmaster: Allow customizing which user owns git repos [puppet] - 10https://gerrit.wikimedia.org/r/312295 [18:13:02] (03CR) 10Dduvall: [C: 031] "This patch has already been cherry picked on deployment-puppetmaster and has been applying cleanly for a while. It would be great to get i" [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:13:08] (03CR) 10Yuvipanda: [C: 032] "https://puppet-compiler.wmflabs.org/4163/ noop" [puppet] - 10https://gerrit.wikimedia.org/r/312295 (owner: 10Yuvipanda) [18:13:11] (03CR) 10Yuvipanda: [V: 032] puppetmaster: Allow customizing which user owns git repos [puppet] - 10https://gerrit.wikimedia.org/r/312295 (owner: 10Yuvipanda) [18:13:31] awight: sorry relocation issues... I see ^ deployment? [18:13:53] (03CR) 10Dduvall: [C: 031] "This patch has already been cherry picked on deployment-puppetmaster and has been applying cleanly for a while. It would be great to get i" [puppet] - 10https://gerrit.wikimedia.org/r/310360 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:14:22] AndyRussG: yup, just +2'd CentralNotice submodule bumps for wmf.18 and wmf.20 (which are the only two branches deployed recently) [18:14:43] AndyRussG: yah, and I'm heads-upping cos the approach in the merged patches is slightly improved beyond my kludgey unicode string comparison [18:15:03] thcipriani: awight cool thx!!! 
[18:16:48] (03PS1) 10Yuvipanda: puppetmaster: Fix enabling cherrypicks in standalone master [puppet] - 10https://gerrit.wikimedia.org/r/312297 [18:16:58] (03PS2) 10Yuvipanda: puppetmaster: Fix enabling cherrypicks in standalone master [puppet] - 10https://gerrit.wikimedia.org/r/312297 [18:17:02] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Fix enabling cherrypicks in standalone master [puppet] - 10https://gerrit.wikimedia.org/r/312297 (owner: 10Yuvipanda) [18:18:18] (03PS1) 10Yuvipanda: puppetmaster: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/312298 [18:18:27] (03PS2) 10Yuvipanda: puppetmaster: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/312298 [18:18:32] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/312298 (owner: 10Yuvipanda) [18:18:32] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 182.18, 178.36, 158.24 [18:20:03] thcipriani: so I guess pls lmk when it's actually out :) [18:20:43] AndyRussG: yup, getting everything setup on the deployment host right now, will ping you when I have something for you to check :) [18:21:10] K thx! [18:24:23] brion: TimedMediaHandler update for wmf.18/19 is live on mw1099, check please (although you'll only be able to check wmf.18) [18:24:38] ok lemme check [18:25:03] (03PS4) 10Dzahn: ganglia: ship native systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [18:25:38] thcipriani: looks good! [18:25:49] brion: ok, going live everywhere [18:25:54] thanks! [18:26:01] (03CR) 10ArielGlenn: [C: 031] "Working for me now, excellent. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [18:28:11] !log thcipriani@tin Synchronized php-1.28.0-wmf.18/extensions/TimedMediaHandler/MwEmbedModules: SWAT: [[gerrit:312263|Update ogv.js to 1.2.0 (T145983)]] (duration: 00m 51s) [18:28:12] T145983: Seeking in Ogg audio files with ogv.js player seems broken - https://phabricator.wikimedia.org/T145983 [18:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:28:18] ^ brion wmf.18 live everywhere [18:29:27] (03PS1) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:31:04] (03CR) 10jenkins-bot: [V: 04-1] labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 (owner: 10Yuvipanda) [18:31:12] !log thcipriani@tin Synchronized php-1.28.0-wmf.19/extensions/TimedMediaHandler/MwEmbedModules: SWAT: [[gerrit:312264|Update ogv.js to 1.2.0 (T145983)]] (duration: 00m 48s) [18:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:19] (03PS2) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:31:22] ^ sync'd for house-keeping sake [18:31:43] (03PS3) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:32:39] (03PS1) 10BBlack: upload storage: transition cp3034+cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/312302 [18:32:42] (03PS1) 10BBlack: upload storage: transition cp3036+cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/312303 [18:32:44] (03PS1) 10BBlack: upload storage: transition cp3038+cp3039 [puppet] - 10https://gerrit.wikimedia.org/r/312304 [18:32:46] (03PS1) 10BBlack: upload storage: transition cp3044+cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/312305 [18:32:48] (03PS1) 10BBlack: upload storage: transition 
cp3046+cp3047 [puppet] - 10https://gerrit.wikimedia.org/r/312306 [18:32:50] (03PS1) 10BBlack: upload storage: finish esams (cp3048+cp3049) [puppet] - 10https://gerrit.wikimedia.org/r/312307 [18:32:52] (03CR) 10jenkins-bot: [V: 04-1] labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 (owner: 10Yuvipanda) [18:32:54] (03CR) 10Dzahn: [C: 031] toollabs: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/312284 (owner: 10Paladox) [18:33:04] woohoo I'm seeing correct version of ogv.js show up now [18:33:07] on safari [18:33:17] (03PS2) 10BBlack: upload storage: transition cp3038+cp3039 [puppet] - 10https://gerrit.wikimedia.org/r/312304 (https://phabricator.wikimedia.org/T145661) [18:33:19] (03PS2) 10BBlack: upload storage: transition cp3044+cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/312305 (https://phabricator.wikimedia.org/T145661) [18:33:21] (03PS2) 10BBlack: upload storage: transition cp3046+cp3047 [puppet] - 10https://gerrit.wikimedia.org/r/312306 (https://phabricator.wikimedia.org/T145661) [18:33:23] (03PS2) 10BBlack: upload storage: finish esams (cp3048+cp3049) [puppet] - 10https://gerrit.wikimedia.org/r/312307 (https://phabricator.wikimedia.org/T145661) [18:33:25] (03PS2) 10BBlack: upload storage: transition cp3036+cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/312303 (https://phabricator.wikimedia.org/T145661) [18:33:27] (03PS2) 10BBlack: upload storage: transition cp3034+cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/312302 (https://phabricator.wikimedia.org/T145661) [18:33:50] (03PS4) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:33:53] (03CR) 10Dzahn: "the file was already there on bast3001. 
change was just mode changed '0644' to '0444'" [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [18:34:44] 06Operations, 13Patch-For-Review: ganglia-monitor and puppet failing on bast3001 - https://phabricator.wikimedia.org/T144778#2659895 (10Dzahn) @bast3001:/etc/systemd/system# systemctl status ganglia-monitor.service ● ganglia-monitor.service - Ganglia monitor Loaded: loaded (/etc/systemd/system/ganglia-monit... [18:35:30] AaronSchulz: Looks like I pulled down a change of your for wmf.20 during SWAT [18:36:10] thcipriani: I was waiting for the second to merge [18:36:17] * AaronSchulz will sync now [18:37:00] AndyRussG: CentralNotice changes are live on mw1099 for wmf.18 and wmf.20, check please [18:37:42] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2659919 (10RobH) [18:38:09] 06Operations, 13Patch-For-Review: ganglia-monitor and puppet failing on bast3001 - https://phabricator.wikimedia.org/T144778#2659923 (10Dzahn) ``` bast3001:/etc/systemd/system# systemctl stop ganglia-monitor.service bast3001:/etc/systemd/system# systemctl start ganglia-monitor.service (failed reverse-i-search)... 
[18:38:10] (03CR) 10BBlack: [C: 032 V: 032] upload storage: transition cp3034+cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/312302 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [18:38:18] (03PS5) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:38:21] !log aaron@tin Synchronized php-1.28.0-wmf.20/includes/libs/rdbms/database/Database.php: 844cfd568a7c7953faa6ac69acebff1cee943b7f & 014a420b4525798b1202cc488b337acdaf09c49a (duration: 00m 49s) [18:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:37] 06Operations, 13Patch-For-Review: ganglia-monitor and puppet failing on bast3001 - https://phabricator.wikimedia.org/T144778#2659925 (10Dzahn) 05Open>03Resolved [18:38:51] 06Operations: ganglia-monitor and puppet failing on bast3001 - https://phabricator.wikimedia.org/T144778#2609955 (10Dzahn) [18:40:08] thcipriani: checking! [18:40:57] 06Operations, 10hardware-requests: codfw/eqiad:(4+4) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#2659943 (10Dzahn) p:05Triage>03Normal [18:42:30] (03PS6) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:43:03] thcipriani: lgtm! [18:43:34] AndyRussG: ok, I will go live with wmf.20, then with wmf.18 (for reference: https://tools.wmflabs.org/versions/) [18:44:11] thcipriani: K! \o/ [18:45:33] (03PS7) 10Yuvipanda: labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 [18:45:40] (03CR) 10Yuvipanda: [C: 032] "https://puppet-compiler.wmflabs.org/4171/ looks ok!" 
[puppet] - 10https://gerrit.wikimedia.org/r/312301 (owner: 10Yuvipanda) [18:45:44] (03CR) 10Yuvipanda: [V: 032] labs: Make labs puppetmaster use the standalone role [puppet] - 10https://gerrit.wikimedia.org/r/312301 (owner: 10Yuvipanda) [18:46:18] !log disable puppet on labcontrol1001 for https://gerrit.wikimedia.org/r/#/c/312301/ [18:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:47] !log thcipriani@tin Synchronized php-1.28.0-wmf.20/extensions/CentralNotice: SWAT: [[gerrit:312293|Update extensions/CentralNotice submodule (T144952)]] (duration: 00m 52s) [18:47:47] T144952: Banner not showing up on site - https://phabricator.wikimedia.org/T144952 [18:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:53] thcipriani: Maybe you know--what volume of logging can we tolerate in a custom mw-logs bucket? I'm a bit concerned that my patch is gonna flood CentralNotice.log [18:47:53] ^ AndyRussG live on wmf.20 [18:48:12] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2659965 (10Dzahn) picked up from office. i also received 5 yubikeys from OIT (4 x yubikey4 and 1 x yubikey4nano) (zendesk Ticket #11781) [18:48:27] It could be... similar to the number of pageviews. [18:49:04] awight: that sounds like a lot. I don't know the answer to that question. [18:49:05] ^ in Australia, no? [18:49:10] Maybe I should prepare a patch to disable the logging bucket. [18:49:22] awight: ^ agreed, good point [18:50:12] I think the Australia FR test is in the Australian morning... awight all we would really like is a few logs to get more info on whassup, no? 
[18:50:20] !log enabling puppet on labcontrol1001, run on labtestcontrol2001 seems ok [18:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:47] AndyRussG: I was sort of hoping to get an idea of the magnitude of the problem, too [18:51:03] (also to have this warning set up in case of future glitches) [18:51:08] awight: hmm right... I think we can rightly assume, "big" [18:51:31] (03PS1) 10Awight: Revert "Capture the "CentralNotice" log bucket" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312310 [18:51:52] AndyRussG: thcipriani: ^ in case it looks like we're going to fill the disk [18:52:47] awight: hrm, should I just go ahead with that one preemptively? [18:53:35] thcipriani: I think we should be able to withstand a few minutes of logging, at least... [18:54:06] AndyRussG: It would be a pity to sample that error. But maybe that's the right thing to do in the short-term? [18:54:17] oh good :) [18:54:28] * awight tightens parachute [18:54:36] alright, well, I'm going to go live with wmf.18 [18:54:44] thcipriani: yeah wmf.20 looks good! [18:55:32] * awight attaches popcorn to parachute [18:55:43] Hmmm which group is meta in? 
I though it'd be 20 but maybe it's 18 [18:55:50] This code actually only runs on meta [18:56:13] !log thcipriani@tin Synchronized php-1.28.0-wmf.18/extensions/CentralNotice: SWAT: [[gerrit:312292|Update extensions/CentralNotice submodule (T144952)]] (duration: 00m 50s) [18:56:14] T144952: Banner not showing up on site - https://phabricator.wikimedia.org/T144952 [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:21] ^ awight AndyRussG live everywhere [18:56:31] * awight is creeped out by log silence [18:57:15] AndyRussG: I triggered the logging with https://en.wikipedia.org/w/index.php?title=Special:BannerLoader&banner=B1617_0921_en6C_ipd_p2_sm_pos_btm&uselang=en&debug=false [18:57:20] I seem to be the *only* one, though [18:57:29] (03PS1) 10Yuvipanda: labs: Temporary hack to make git-sync-upstream to work [puppet] - 10https://gerrit.wikimedia.org/r/312312 [18:57:30] Maybe cos we have campaigns down? [18:57:41] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [18:57:54] (03PS2) 10Yuvipanda: labs: Temporary hack to make git-sync-upstream to work [puppet] - 10https://gerrit.wikimedia.org/r/312312 [18:57:57] AndyRussG: > 2016-09-22 18:56:58 [V@QpegpAAEQAAVLIjnIAAADW] mw1273 enwiki 1.28.0-wmf.18 CentralNotice INFO: Banner message key Centralnotice-template-B1617_0921_en6C_ipd_p2_sm_pos_btm could not be found in en [18:58:36] thcipriani: Logging thing looks good for now. I'll keep the revert in my back pocket... [18:58:40] ty! [18:58:50] awight: cool, good looking out :) [18:59:17] (03CR) 10Yuvipanda: [C: 032] labs: Temporary hack to make git-sync-upstream to work [puppet] - 10https://gerrit.wikimedia.org/r/312312 (owner: 10Yuvipanda) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T1900). 
Please do the needful. [19:00:15] aaand right back to deploying :) [19:00:31] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:01:17] !log wmf.20 to group1 will watch until 20 UTC and move forward to all wikis [19:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:45] (03PS2) 10Awight: [DO NOT MERGE] Revert "Capture the "CentralNotice" log bucket" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312310 [19:02:05] (03PS1) 10Hashar: contint: create /srv based directory hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/312313 [19:02:05] thcipriani: nice watch command btw--do you know if all logs get the "repeated message" treatment? [19:03:07] what do you mean? "repeated message" treatment? [19:03:36] thcipriani: I see you're doing watch tail -n 1000 /a/mw-log/hhvm.log |? awk '/message repeated/{for(i=$7;i>0;i--){print}}{print}' |? sed 's/message repeated [0-9]* times: \[ //' |? sed 's/]$//' |? sed 's/#012//' |? cut -d ' ' -f 7- |? sort |? uniq -c |? sort -rn [19:03:59] Wondering if that's a syslog or monolog handler that's rolling up identical messages with "message repeated" [19:04:15] oh, that-that's `fatalmonitor` on fluorine :) [19:04:36] If that handler exists on the custom log buckets, it would protect us from potential logflood... [19:05:09] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:05:25] awight: hrm, unsure, this script is a bd808 joint, afaik, he may have the answers that you seek [19:05:37] :) [19:07:08] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2660027 (10awight) a:05awight>03None [19:09:20] (03PS1) 10Yuvipanda: puppetmaster: Cleanup unused vars / crons in labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/312317 [19:09:28] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.20 [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:40] awight: I believe syslog does the "message repeated" stuff [19:10:08] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Cleanup unused vars / crons in labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/312317 (owner: 10Yuvipanda) [19:10:23] (03PS2) 10Yuvipanda: puppetmaster: Cleanup unused vars / crons in labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/312317 [19:10:47] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:11:26] legoktm: good news, thanks! [19:12:27] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[19:12:58] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:14:49] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [19:15:41] (03PS2) 10Hashar: contint: create /srv based directory hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/312313 (https://phabricator.wikimedia.org/T146381) [19:19:15] (03PS3) 10BBlack: upload storage: transition cp3036+cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/312303 (https://phabricator.wikimedia.org/T145661) [19:19:24] (03CR) 10BBlack: [C: 032 V: 032] upload storage: transition cp3036+cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/312303 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [19:21:02] (03PS3) 10Hashar: contint: create /srv based directory hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/312313 (https://phabricator.wikimedia.org/T146381) [19:21:48] (03CR) 10BBlack: [C: 031] wdqs LVS DNS entries [dns] - 10https://gerrit.wikimedia.org/r/312216 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [19:22:07] bblack: thanks! [19:22:11] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2660074 (10RobH) wmf4749 is having boot issues: Broadcom UNDI PXE-2.1 v17.0.1 Copyright (C) 2000-2015 Broadcom Corporation Copyright (C) 1997-2000 Intel Corporation All rights reserved. PX... [19:22:17] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[19:23:23] !log Deploying new version of WDQS GUI [19:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:37] (03CR) 10BBlack: [C: 04-1] "This shouldn't use the app-routing stuff like cache_text, it should stick with the same scheme as all the other LVS services in cache_misc" [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [19:26:39] (03CR) 10BBlack: [C: 04-1] "Seems to be missing the actual service/cluster/node definitions in the conftool data?" [puppet] - 10https://gerrit.wikimedia.org/r/312223 (https://phabricator.wikimedia.org/T132457) (owner: 10Gehel) [19:29:38] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:36:13] PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:26] PROBLEM - swift-object-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:38:30] RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:38:33] RECOVERY - swift-object-auditor on ms-be1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:40:07] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [19:43:05] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 17 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [19:44:36] (03PS1) 10RobH: correcting ip assignment for wmf4750 [dns] - 10https://gerrit.wikimedia.org/r/312320 [19:44:57] (03CR) 10RobH: [C: 032] correcting ip assignment for wmf4750 [dns] - 10https://gerrit.wikimedia.org/r/312320 (owner: 10RobH) [19:45:33] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:46:06] (03PS2) 10Gehel: wdqs - LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/312223 (https://phabricator.wikimedia.org/T132457) [19:47:11] (03PS1) 10Hashar: contint: migrate castor server to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312322 [19:53:24] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:55:11] Hi, does anyone know how i can use [19:55:12] if ( BetaFeatures::isFeatureEnabled( $this->getUser(), 'my-awesome-feature' ) ) { [19:55:14] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:55:17] In a static function [19:55:44] $this->getUser() wont work in a static [19:56:37] legoktm ^^ i wonder if you know that? [19:56:40] paladox: this isn't really the channel to ask that... try #wikimedia-tech maybe? 
[19:56:45] Oh sorry [19:58:49] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [20:00:34] !log rolling out wmf.20 to all wikis [20:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:16] (03PS2) 10Gehel: wdqs - add icinga check for LVS services [puppet] - 10https://gerrit.wikimedia.org/r/312224 (https://phabricator.wikimedia.org/T132457) [20:04:56] ^ twentyafterfour auto +2 in deploy-promote kills grrrit-wm, seemingly :( [20:05:20] thcipriani: you need to add 🚂 to train SAL messages :) [20:05:22] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:06:08] someday we will be able to use 🚄 [20:06:17] haha [20:06:18] thruth. [20:06:21] *truth. [20:06:35] (03PS2) 10Gehel: wdqs - configure varnish to use LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/312225 (https://phabricator.wikimedia.org/T132457) [20:08:12] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.20 [20:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:12] * greg-g missed the emojis in his terminal [20:09:13] :( [20:09:23] * greg-g packs up to go coffee shop'ing [20:10:11] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [20:10:13] thcipriani: any idea why that would kill the bot? [20:10:23] greg-g: https://en.wikipedia.org/wiki/Intelligentsia_Coffee_%26_Tea#Honors ? 
[20:11:17] (03PS2) 10Hashar: contint: migrate castor server to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312322 (https://phabricator.wikimedia.org/T146381) [20:11:19] (03PS1) 10Hashar: contint: add a tmpfs on /srv [puppet] - 10https://gerrit.wikimedia.org/r/312328 (https://phabricator.wikimedia.org/T146381) [20:11:37] twentyafterfour: two messages at ~the same time? Dunno anything about that bot, just noticed that was happening after running deploy-promote [20:11:52] or rather, didn't notice a ping in IRC when I ran deploy-promote [20:12:06] paladox: ^ see comments about grrrit-wm , i know you alreayd looked into that [20:12:16] (why it keeps crashing etc) [20:12:24] Oh [20:12:50] I will try a change then [20:12:57] 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash.wikimedia.org (kibana) to an LVS service - https://phabricator.wikimedia.org/T132458#2660234 (10bd808) [20:13:09] I guess it is hitting something [20:13:43] I think it may be the protection it has enabled [20:16:14] Lets see if this will work [20:16:18] otherwise it is a bug [20:16:29] PROBLEM - configured eth on puppetmaster1002 is CRITICAL: Connection refused by host [20:16:31] I was testing a new node version [20:16:33] nodejs 6 [20:16:42] but looks like that made things a little unstable [20:16:59] but one thing we could do is have better logs for this bot since it wont tell me where it breaks [20:17:09] mutante twentyafterfour ^ [20:17:32] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:17:39] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[20:17:50] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:24] paladox: better logs sounds good :) [20:18:46] Yep, but i have no idea how to do that :) [20:18:50] RECOVERY - configured eth on puppetmaster1002 is OK: OK - interfaces up [20:19:49] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [20:20:50] !log T133395: RESTBase Staging: Restarting Cassandra to pick up TWCS jar in classpath [20:20:51] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:30] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:37] twentyafterfour: Can I roll out https://gerrit.wikimedia.org/r/#/c/312283/ in a bit? [20:31:44] (that'll close the blocking train task) [20:31:50] already reverted in .18 [20:31:55] but we'll keep it out for now and figure this for later. [20:32:19] Krinkle: I'm not on train duty today but I don't see why not. thcipriani are you train this week? [20:33:07] Krinkle: yup, you're clear to roll that out whenever. Train is complete, just monitoring at this point. 
[20:36:22] we should probably put "train duty" or whatever in the /topic each week [20:36:28] man we need a bot to manage all of this :) [20:36:34] (03PS3) 10Hashar: contint: remove obsolete files { ensure => absent } [puppet] - 10https://gerrit.wikimedia.org/r/312275 (https://phabricator.wikimedia.org/T146381) [20:36:36] (03PS3) 10Hashar: contint: migrate castor server to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312322 (https://phabricator.wikimedia.org/T146381) [20:36:38] (03PS2) 10Hashar: contint: mount /srv the same as /mnt [puppet] - 10https://gerrit.wikimedia.org/r/312262 (https://phabricator.wikimedia.org/T146381) [20:36:40] (03PS2) 10Hashar: contint: add a tmpfs on /srv [puppet] - 10https://gerrit.wikimedia.org/r/312328 (https://phabricator.wikimedia.org/T146381) [20:36:42] (03PS2) 10Hashar: contint: migrate browsertest redis to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312267 (https://phabricator.wikimedia.org/T146381) [20:36:44] (03PS4) 10Hashar: contint: create /srv based directory hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/312313 (https://phabricator.wikimedia.org/T146381) [20:36:46] (03PS2) 10Hashar: contint: migrate package_builder from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312270 (https://phabricator.wikimedia.org/T146381) [20:36:51] ohai antoine [20:36:58] (03PS1) 10Gehel: maps - tilerator cassandra backend has a different name on maps-test [puppet] - 10https://gerrit.wikimedia.org/r/312331 [20:37:06] greg-g: your nerd snip won't work ... today [20:37:38] bd808: don't worry, I'm persistent :) [20:38:02] mutante: btw, I'm at a Peets right now. /me shrugs [20:38:06] Is someone using Safari 10 too? I'm having troubles connecting to Wikimedia sites every now and then. 
[20:38:11] (03CR) 10Yurik: [C: 031] maps - tilerator cassandra backend has a different name on maps-test [puppet] - 10https://gerrit.wikimedia.org/r/312331 (owner: 10Gehel) [20:39:40] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 70.77, 68.18, 79.80 [20:41:29] greg-g: i just learned Peets bought them :o [20:42:08] intelligensia? [20:42:11] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:42:11] yea [20:42:25] interesting [20:43:04] also stumptown IIRC [20:47:38] !log T133395: RESTBase Staging: altering table to set TWCS on wikipedia parsoid.html table [20:47:39] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:59] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:52:05] !log T133395: RESTBase Staging: starting dumps (3, eqiad) [20:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:37] (03CR) 10Gehel: [C: 032] maps - tilerator cassandra backend has a different name on maps-test [puppet] - 10https://gerrit.wikimedia.org/r/312331 (owner: 10Gehel) [21:02:19] !log imported nodepool_0.1.1-wmf5_amd64 into jessie-wikimedia (T145142) [21:02:20] T145142: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142 [21:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:22] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2621216 (10Dzahn) ``` root@carbon:~/nodepool# diff -ru 4/ 5/ diff -ru 4/usr/lib/python2.7/dist-packages/nodepool/pro... 
[21:04:19] !log krinkle@tin Synchronized php-1.28.0-wmf.20/resources/src/mediawiki/mediawiki.js: T146099 (duration: 00m 48s) [21:04:20] T146099: mw-1.28.0-wmf.18 load-time regression - https://phabricator.wikimedia.org/T146099 [21:04:22] !log upgradede nodepool to 0.1.1-wmf on labnodepool1001 [21:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:28] !log upgradede nodepool to 0.1.1-wmf5 on labnodepool1001 [21:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:35] deletes the first one [21:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:34] (03PS1) 10Jdlrobson: Cleanup deprecated MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312339 [21:05:55] !log stopped nodepooled and restarted it with 0.1.1-wmf5 [21:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:40] (03PS2) 10Jdlrobson: Cleanup deprecated MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312339 [21:10:03] thcipriani, done with train? i need to update tilerator service [21:10:07] (03PS1) 10Gehel: logstash - DNS entries for LVS service [dns] - 10https://gerrit.wikimedia.org/r/312342 (https://phabricator.wikimedia.org/T132458) [21:10:08] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2660496 (10hashar) 05Open>03Resolved a:05hashar>03Dzahn I stopped Nodepool. Daniel apt-get installed. Done... [21:10:13] yurik: train is complete [21:10:20] thx! [21:12:45] thcipriani: so all on wmf.20 now ? [21:13:02] hasharAway: yep, all moved over [21:13:55] well done! 
[21:14:06] !log deployed tilerator https://gerrit.wikimedia.org/r/#/c/312329/ [21:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:51] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2660522 (10yuvipanda) So, https://wikitech.wikimedia.org/wiki/Standalone_puppetmaster exists now. It is based off the puppe... [21:23:35] thcipriani: congratulations really [21:24:04] !log Nodepool is all back and operational. Reduced amount of queries to the OpenStack API by more than 10% [21:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:36] hasharAway: nice [21:27:24] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2660533 (10Krinkle) The ones that start with `/skins` and `/static` are most likely from on-wiki gadgets and site scripts and stylesheets (e.g. Common.css) which will have been broken by... 
[21:27:32] hasharAway: wow, nice :) [21:32:10] thcipriani: greg-g yeah the orange layer on https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=22&fullscreen would be gone tomorrow [21:35:07] bed bed time [21:36:57] 06Operations, 10Wikimedia-Apache-configuration: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2660590 (10Dereckson) [21:38:10] (03PS3) 10BBlack: upload storage: transition cp3038+cp3039 [puppet] - 10https://gerrit.wikimedia.org/r/312304 (https://phabricator.wikimedia.org/T145661) [21:38:19] (03CR) 10BBlack: [C: 032 V: 032] upload storage: transition cp3038+cp3039 [puppet] - 10https://gerrit.wikimedia.org/r/312304 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [21:39:23] (03PS1) 10Gergő Tisza: Add 'message-format' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312404 (https://phabricator.wikimedia.org/T146416) [21:45:11] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:54:03] (03CR) 10MaxSem: Cleanup deprecated MobileFrontend variables (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312339 (owner: 10Jdlrobson) [21:56:43] 06Operations, 13Patch-For-Review: Have Diamond collect Linux KSM metrics on Ganeti hosts - https://phabricator.wikimedia.org/T146038#2660692 (10hashar) 05Open>03Resolved a:03akosiaris I did a very basic graph on https://grafana.wikimedia.org/dashboard/db/ganeti Metrics use different units so they should... [22:08:22] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2660720 (10Dzahn) contacts in private repo: ``` define contact{ contact_name slaporte alias Stephen LaPorte host_notification_period 24x7... 
[22:09:22] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:11:17] (03PS1) 10Dzahn: icinga: add contactgroup for legal [puppet] - 10https://gerrit.wikimedia.org/r/312420 (https://phabricator.wikimedia.org/T146227) [22:11:41] (03PS2) 10Dzahn: icinga: add contactgroup for legal [puppet] - 10https://gerrit.wikimedia.org/r/312420 (https://phabricator.wikimedia.org/T146227) [22:13:15] (03PS3) 10Dzahn: icinga: add contactgroup for legal [puppet] - 10https://gerrit.wikimedia.org/r/312420 (https://phabricator.wikimedia.org/T146227) [22:13:31] (03PS4) 10Dzahn: icinga: add contactgroup for legal [puppet] - 10https://gerrit.wikimedia.org/r/312420 (https://phabricator.wikimedia.org/T146227) [22:13:44] (03CR) 10Dzahn: [C: 032] icinga: add contactgroup for legal [puppet] - 10https://gerrit.wikimedia.org/r/312420 (https://phabricator.wikimedia.org/T146227) (owner: 10Dzahn) [22:18:38] !log added slaporte and zhousquared to wmf LDAP group (T146227) [22:18:39] T146227: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227 [22:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:29] 06Operations, 13Patch-For-Review: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2660790 (10Dzahn) [22:28:28] (03PS1) 10Dzahn: icinga: add legal contactgroup to legal footer checks [puppet] - 10https://gerrit.wikimedia.org/r/312424 (https://phabricator.wikimedia.org/T146227) [22:28:42] (03PS2) 10Dzahn: icinga: add legal contactgroup to legal footer checks [puppet] - 10https://gerrit.wikimedia.org/r/312424 (https://phabricator.wikimedia.org/T146227) [22:29:47] jouncebot, next [22:29:47] In 0 hour(s) and 30 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T2300) [22:30:28] jouncebot: now [22:30:29] No deployments scheduled for the next 0 hour(s) and 29 minute(s) 
[22:30:43] just wanted to see if that was added, cool [22:33:02] (03CR) 10Dzahn: [C: 032] icinga: add legal contactgroup to legal footer checks [puppet] - 10https://gerrit.wikimedia.org/r/312424 (https://phabricator.wikimedia.org/T146227) (owner: 10Dzahn) [22:42:07] 06Operations, 13Patch-For-Review: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2660835 (10Dzahn) @Zhouz @slaporte You can now try logging in at https://icinga.wikimedia.org/icinga/ Then you can type "legal" into the the search field, or you can directly bookmark and go to... [22:46:02] (03CR) 10MaxSem: [C: 04-1] Cleanup deprecated MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312339 (owner: 10Jdlrobson) [22:49:42] !log aaron@tin Synchronized php-1.28.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php: a73a7ef9286275f797411646f9c5af60d4894c73 (duration: 01m 04s) [22:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:32] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:53] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:55:57] jouncebot: refresh [22:55:58] (03PS3) 10BBlack: upload storage: transition cp3044+cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/312305 (https://phabricator.wikimedia.org/T145661) [22:56:00] I refreshed my knowledge about deployments. 
[22:56:06] jouncebot: next [22:56:06] In 0 hour(s) and 3 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T2300) [22:56:29] (03CR) 10BBlack: [C: 032 V: 032] upload storage: transition cp3044+cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/312305 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160922T2300). Please do the needful. [23:00:05] Jdlrobson, RoanKattouw, MaxSem, and Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:17] hey [23:00:58] oooh, we get pinged for every swat now! :) [23:01:00] I can swat [23:01:33] addshore, we have for a long while? [23:02:01] Yeah that's not new [23:02:12] Although, that was a lot of people it pinged [23:02:16] the list of deployers does seem bigger than usual [23:02:16] it's new [23:02:20] email incoming :) [23:02:20] RoanKattouw, I assume you don't need the wmf.18 patches? [23:02:33] someone edited the deployments page to add a lot more people to the deployers column for every SWAT [23:02:38] MaxSem: Yes I do, because of train weirdness [23:02:42] wmf20 isn't actually on all wikis yet [23:02:47] (greg-g correct me if I'm wrong) [23:02:47] Um [23:02:50] I thought it was on all wikis [23:02:51] yeh, list of swatters merged :) so far no complaints(from me)! [23:02:59] Well 2 days ago we had wmf.18 on all wikis [23:03:00] it is [23:03:04] wait, it is? 
[23:03:05] it's on all wikis
[23:03:06] http://tools.wmflabs.org/versions/
[23:03:16] krenair@tin:~$ scap wikiversions-inuse
[23:03:16] 1.28.0-wmf.20
[23:03:16] krenair@tin:~$
[23:03:17] so, swat like normal :)
[23:03:24] 20:08 logmsgbot: thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.20
[23:03:24] I see
[23:03:31] jdlrobson, around?
[23:03:31] OK, yes then skip the wmf.18 patches
[23:03:33] also: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=859186&oldid=858991
[23:04:05] (that's why I did the jouncebot refresh, to see if I broke jouncebot by adding all those names :) )
[23:04:32] greg-g, now you will annoy ppl who use bouncers :]
[23:04:55] minimally (hopefully) only when you're sleeping :)
[23:05:01] and sorry/not sorry :P
[23:05:14] Hm
[23:05:16] Maybe not
[23:05:19] Okay, so... RoanKattouw?
[23:05:31] Krenair: jdlrobson is around, but can't use IRC for the moment
[23:05:44] freenode is busted apparently
[23:05:44] Let's return to him later then
[23:05:48] great :/
[23:05:52] Krenair: I'm here, good to go for my patch
[23:05:56] (Only 1 of the 3 left now)
[23:06:06] kaldari, He could try leguin.freenode.net
[23:06:07] oh we're on 20 everywhere
[23:06:09] that's what I'm currently on
[23:06:12] Yeah freenode had a big netsplit recently and IRCCloud was down for like an hour
[23:06:12] congrats folks!
[23:06:24] apergos: we try
[23:07:00] you know what yoda says about that
[23:07:06] Krenair: hey
[23:07:10] ok ok, we did
[23:07:12] heh
[23:07:14] in webchat.freenode.net
[23:07:14] :)
[23:07:35] on that cheerful note I'm checking out (2 am? really? ugh)
[23:07:46] have a good one folks
[23:07:54] jdlrobson_, hey. I just started on RoanKattouw's patch, sorry
[23:08:17] order: RoanKattouw, jdlrobson_, MaxSem, me
[23:08:22] np.
I have a European Product owner who wants to go to sleep though :)
[23:08:31] jdlrobson_, I also -1'd one of your patches
[23:08:38] MaxSem: will take a look
[23:08:40] thanks for flagging
[23:08:45] I'm afraid it's already going through jenkins
[23:08:56] (roan's)
[23:09:32] MaxSem: don't understand your comment. mf-uploadbutton permission does nothing at moment
[23:09:33] you can always merge all the extensions at once and then just update submodules as needed
[23:09:35] it's not used anywhere
[23:09:56] jdlrobson_, so all upload functionality is gone?
[23:10:03] MaxSem: yeh it went years ago
[23:10:21] so killkillkill the perms from extension.json
[23:10:50] also, messages
[23:11:15] MaxSem: already on that
[23:11:18] https://gerrit.wikimedia.org/r/312343
[23:12:25] I'm watching mediawiki-extensions-php55
[23:12:48] 23:07:39 PHP Notice: Cannot find site mywiki in sites table [Called from Wikibase\Client\WikibaseClient::newSiteGroup in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php at line 711] in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php on line 311
[23:14:16] and im back in the room
[23:15:19] RoanKattouw, okay your patch is on mw1099
[23:15:21] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 28 seconds ago with 1 failures.
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[23:15:24] please test
[23:16:31] Checking
[23:17:50] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:18:22] Krenair: Works
[23:19:16] (CR) MaxSem: Cleanup deprecated MobileFrontend variables [mediawiki-config] - https://gerrit.wikimedia.org/r/312339 (owner: Jdlrobson)
[23:19:24] !log krenair@tin Synchronized php-1.28.0-wmf.20/resources/src/mediawiki.less/mediawiki.ui/mixins.less: https://gerrit.wikimedia.org/r/#/c/312340/ (duration: 00m 48s)
[23:19:27] RoanKattouw, ^
[23:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:20:21] jdlrobson, your hovercards patch has a -2... but it's no longer relevant, is it?
[23:20:33] it links to a bug that says it's fine with wmf.20 deployed - which it now is, so
[23:21:01] Thanks Krenair
[23:21:55] Krenair: you can ignore yes
[23:22:16] phuedx 's kids needed bed time stories
[23:22:20] :)
[23:23:31] :)
[23:23:32] that's fine
[23:23:35] Just one thing with your patch
[23:23:47] It's setting $wgPopupsSchemaPopupsSamplingRate
[23:23:50] But I found this
[23:23:56] php-1.28.0-wmf.20/extensions/Popups/Popups.hooks.php: $vars['wgPopupsSchemaPopupsSamplingRate'] = $conf->get( 'SchemaPopupsSamplingRate' );
[23:24:04] Implying it should be called $wgSchemaPopupsSamplingRate instead
[23:25:05] (obviously the $vars key is for RL's JS config, we're interested in the server-side config->get)
[23:27:01] jdlrobson?
[23:27:30] Krenair, continue with something else meanwhile?
[23:28:10] yeah, suppose we can do your change simultaneously
[23:29:16] MaxSem, your commit looks more like a feature addition to me?
[23:30:28] rather a tweak to existing feature
[23:31:02] it's not a regression fix or a config change?
[23:31:10] no
[23:31:18] (Krenair: looking)
[23:31:29] MaxSem, so what's the backport rationale?
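Note on the config-name mismatch raised at 23:23:47: a minimal sketch, in Python rather than PHP, of why the variable name in the patch matters. It assumes the server-side config object prepends a `wg` prefix to the requested key before looking it up among the global settings, which is how MediaWiki's default global-variable config behaves. The class name `PrefixedConfig` and the 0.1 sampling rate are made up for illustration and are not taken from the Popups extension or the patch.

```python
# Toy model of a prefix-based config lookup. This is NOT MediaWiki code;
# PrefixedConfig and the 0.1 value are hypothetical, for illustration only.
class PrefixedConfig:
    def __init__(self, global_vars, prefix="wg"):
        self.global_vars = global_vars
        self.prefix = prefix

    def has(self, name):
        # The caller asks for 'Name'; the stored key is prefix + 'Name'.
        return self.prefix + name in self.global_vars

    def get(self, name):
        key = self.prefix + name
        if key not in self.global_vars:
            raise KeyError(f"undefined option {name!r} (looked for ${key})")
        return self.global_vars[key]


# The name used in the patch as first submitted: the lookup key does not match.
misnamed = PrefixedConfig({"wgPopupsSchemaPopupsSamplingRate": 0.1})
misnamed_is_visible = misnamed.has("SchemaPopupsSamplingRate")

# The name suggested in channel: the lookup key matches.
renamed = PrefixedConfig({"wgSchemaPopupsSamplingRate": 0.1})
renamed_value = renamed.get("SchemaPopupsSamplingRate")

print(misnamed_is_visible, renamed_value)
```

Under these assumptions the misnamed setting is simply never read by the `get( 'SchemaPopupsSamplingRate' )` call quoted above, so the lookup would not see the value the patch sets. That is consistent with the rename being agreed in channel before the change was synced.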
[23:31:35] but i think you are right
[23:31:48] long deployment hiatus :P
[23:31:59] Krenair: yup you are right
[23:32:22] Krenair: thanks for catching that
[23:32:37] MaxSem, I don't think that's a good enough reason
[23:32:44] pfft
[23:32:53] then I'll do that afterwards
[23:33:01] (PS3) Jdlrobson: Initiate Hovercards A/B test on ruwiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) (owner: Jhobs)
[23:33:03] greg-g, ^
[23:33:48] link?
[23:33:51] (CR) Alex Monk: [C: 2] Initiate Hovercards A/B test on ruwiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) (owner: Jhobs)
[23:34:13] greg-g, I think https://gerrit.wikimedia.org/r/#/c/312406/ is ineligible for deployment outside of the train
[23:34:29] Agreed.
[23:35:02] MaxSem: we need to keep the surface area of non-train backports to fixes only. There's no need to use SWATs or your deployer powers to push out new changes early just because.
[23:35:41] greg-g, I was preparing to grab a window then?
[23:36:09] why does it need to go out now? and no, not after 5pm on a Thursday
[23:36:22] ok
[23:36:29] (PS4) Alex Monk: Initiate Hovercards A/B test on ruwiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) (owner: Jhobs)
[23:37:14] There has been an increase in the number of non-fixes backports going out in SWAT lately (I have no numbers, all hand-wavy): we need to nip that in the bud, so to speak.
[23:37:44] greg-g: make the train even quicker!
[23:37:47] * Reedy hides
[23:37:56] (this isn't meant to be a comment to any particular person, just general)
[23:38:12] Reedy: daily!
[23:38:14] jdlrobson, your change is on mw1099
[23:38:19] Krenair: on it!
[23:38:43] wait, was phuedx's kid going to sleep at 11pm? ;)
[23:39:28] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[23:41:30] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:43:41] jdlrobson, everything okay?
[23:43:48] yup just being thorough
[23:43:53] almost ready to give you the green light :)
[23:43:53] kk
[23:45:27] Krenair: looks good to me!
[23:46:19] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/310483 (duration: 00m 48s)
[23:46:21] jdlrobson, ^
[23:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:48:58] Krenair: im seeing strange behaviour
[23:49:08] the A/B test worked when i tested on mw1099
[23:49:15] but now i'm seeing it show for everyone
[23:49:15] should I revert?
[23:49:25] be on standby just verifying with a few others
[23:50:49] i think we might be okay...
[23:51:30] Krenair: yeh looks good
[23:51:31] phew
[23:51:43] Okay
[23:51:49] I've been looking at your other patch
[23:52:36] Operations, Patch-For-Review: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2660989 (ZhouZ) Thanks - it works.
[23:53:54] Operations, Patch-For-Review: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2660992 (Dzahn) Open>Resolved great :) i'll resolve the ticket then.
[23:54:11] yeah looks good
[23:54:22] jdlrobson, MFUploadMinEdits is no longer used anywhere, is it?
[23:54:49] I found some references in MF that document it/set it up, none that actually use it