[00:03:24] <wikibugs>	 (PS7) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[00:10:49] <wikibugs>	 (PS1) Bstorm: Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/425448
[00:11:12] <XioNoX>	 Krinkle: updated my comment with more info https://phabricator.wikimedia.org/T191940#4122060
[00:12:01] <bstorm_>	 !log Updated views and indexes on labsdb1011
[00:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:05] <librenms-wmf>	 Warning Device cr1-eqsin.wikimedia.org recovered from Processor usage over 85%
[00:14:07] <wikibugs>	 (PS2) Bstorm: Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/425448
[00:18:37] <wikibugs>	 (CR) Dzahn: [C: -1] "for some reason still enabled for all:" [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[00:23:16] <wikibugs>	 (PS8) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[00:28:32] <wikibugs>	 (CR) Dzahn: [C: -1] "still not http://puppet-compiler.wmflabs.org/10894/restbase-dev1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[00:29:21] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4122111 (awight) Still finding strangeness...  Reading the virtualenv so...
[00:36:51] <wikibugs>	 Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4122112 (thcipriani) `fwrite` is definitely different in hhvm -- accounts for all the `lseek` in the hhvm output: https://gist.github.com/thcipriani/...
[01:37:21] <wikibugs>	 Operations, netops: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4122151 (ayounsi)
[02:36:09] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.28) (duration: 05m 41s)
[02:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:30] <wikibugs>	 (PS1) Samwilson: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466
[05:16:18] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/425448 (owner: Bstorm)
[05:17:20] <marostegui>	 !log Reload haproxy on dbprox1010 to repool labsdb1010 
[05:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:49] <wikibugs>	 Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4122327 (Joe) What are the blockers for the use of PHP7?  All I see on the ticket mentioned is the memcached issue, which ops are working on right no...
[05:22:48] <marostegui>	 !log Deploy schema change on codfw s8 master (db2045) with replication enabled (this will generate lag on codfw) - T187089 T185128 T153182
[05:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:55] <stashbot>	 T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:22:55] <stashbot>	 T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:22:55] <stashbot>	 T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:25:26] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4122342 (Marostegui) @Papaul next one will be db2042 Thanks!
[05:28:53] <Krinkle>	 !log manual coal back-fill still running with the normal coal disabled via systemd. Will restore normal coal when I wake up.
[05:28:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:57] <wikibugs>	 (PS1) Marostegui: mariadb: Move db2069 from s1 to x1 [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275)
[05:35:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db2069 from s1 to x1 [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[05:37:16] <wikibugs>	 (PS2) Marostegui: mariadb: Move db2069 from s1 to x1 [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275)
[05:40:34] <wikibugs>	 (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10895/" [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[05:48:28] <wikibugs>	 (PS1) Marostegui: install_server: Allow reimage db2069 [puppet] - https://gerrit.wikimedia.org/r/425472
[05:50:00] <wikibugs>	 (CR) Marostegui: [C: 2] install_server: Allow reimage db2069 [puppet] - https://gerrit.wikimedia.org/r/425472 (owner: Marostegui)
[05:58:56] <wikibugs>	 (PS2) Giuseppe Lavagetto: Update rbenv ruby version to match production [puppet] - https://gerrit.wikimedia.org/r/425280
[06:02:15] <wikibugs>	 Operations, Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122370 (Dzahn) @Sharvaniharan @MoritzMuehlenhoff  The user name is now correct. The remaining issue (that looks like the same from outside but is a different problem now) is:...
[06:11:19] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275)
[06:11:20] <wikibugs>	 Operations, HHVM, Patch-For-Review, User-Elukey, User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4122371 (Joe)
[06:12:43] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[06:14:10] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[06:15:51] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2069 to the config as depooled x1 slave - T191275 (duration: 01m 01s)
[06:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:58] <stashbot>	 T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[06:17:02] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2069 to the config as depooled x1 slave - T191275 (duration: 01m 03s)
[06:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:04] <wikibugs>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[06:20:31] <marostegui>	 !log Stop MySQL on db2033 to clone db2069 - T191275
[06:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:23] <wikibugs>	 (PS3) Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924)
[06:29:47] <icinga-wm>	 PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:32:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "A few minor comments, which can be addressed now or later, and an important fix in class mcrouter, where an override is improperly applied" (9 comments) [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[06:37:23] <wikibugs>	 (PS1) Elukey: network::constants: add conf100[456] to zookeeper_main_hosts [puppet] - https://gerrit.wikimedia.org/r/425474 (https://phabricator.wikimedia.org/T182924)
[06:49:06] <elukey>	 !log restart Yarn Resource Manager daemons on analytics100[12] to pick up the new Prometheus configuration file
[06:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:11] <wikibugs>	 Operations, ops-eqsin, Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4122392 (Vgutierrez) Fixed following @Volans recommendations: ``` vgutierrez@neodymium:~$ sudo cumin 'R:class%site = eqsin' 'ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volati...
[06:52:32] <wikibugs>	 Operations, ops-eqsin, Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4122393 (Vgutierrez) Open>Resolved a:Vgutierrez
[06:59:47] <icinga-wm>	 RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:15] <wikibugs>	 (PS1) Vgutierrez: Revert "Revert "install_server: Reimage lvs5003 as stretch"" [puppet] - https://gerrit.wikimedia.org/r/425475
[07:08:05] <vgutierrez>	 !log Reimaging lvs5003.eqsin as stretch (2nd attempt)
[07:08:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:33] <wikibugs>	 (CR) Vgutierrez: [C: 2] Revert "Revert "install_server: Reimage lvs5003 as stretch"" [puppet] - https://gerrit.wikimedia.org/r/425475 (owner: Vgutierrez)
[07:12:08] <wikibugs>	 (CR) Elukey: [C: 2] network::constants: add conf100[456] to zookeeper_main_hosts [puppet] - https://gerrit.wikimedia.org/r/425474 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[07:12:13] <wikibugs>	 (PS2) Elukey: network::constants: add conf100[456] to zookeeper_main_hosts [puppet] - https://gerrit.wikimedia.org/r/425474 (https://phabricator.wikimedia.org/T182924)
[07:12:50] <wikibugs>	 Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5003.eqsin.wmnet ``` The log can be found in `/var/lo...
[07:14:27] <wikibugs>	 (CR) Daimona Eaytoy: [C: -1] "> Should we merge this now or wait https://gerrit.wikimedia.org/r/#/c/201104/" [mediawiki-config] - https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy)
[07:20:52] <wikibugs>	 (PS1) Marostegui: mariadb: notifications enable/disable db2069/2033 [puppet] - https://gerrit.wikimedia.org/r/425476 (https://phabricator.wikimedia.org/T191275)
[07:22:56] <wikibugs>	 (PS1) Elukey: profile::prometheus::alerts: tune mirror maker alert [puppet] - https://gerrit.wikimedia.org/r/425477
[07:23:56] <wikibugs>	 (CR) Elukey: [C: 2] profile::prometheus::alerts: tune mirror maker alert [puppet] - https://gerrit.wikimedia.org/r/425477 (owner: Elukey)
[07:25:30] <wikibugs>	 (PS2) Marostegui: mariadb: notifications enable/disable db2069/2033 [puppet] - https://gerrit.wikimedia.org/r/425476 (https://phabricator.wikimedia.org/T191275)
[07:26:27] <wikibugs>	 (CR) Marostegui: [C: 2] mariadb: notifications enable/disable db2069/2033 [puppet] - https://gerrit.wikimedia.org/r/425476 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:27:33] <marostegui>	 !log Stop MySQL on db2033 to copy its data away before reimaging - T191275
[07:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:41] <stashbot>	 T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[07:30:21] <wikibugs>	 Operations, Documentation: Please document how to try fixing IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956#4122418 (ema)
[07:30:44] <wikibugs>	 Operations, Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956#4122431 (ema) p:Triage>Normal
[07:30:59] <wikibugs>	 Operations, HHVM, Patch-For-Review, User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4122433 (ArielGlenn)
[07:32:36] <wikibugs>	 (PS1) Marostegui: install_server: Allow reimage db2033 [puppet] - https://gerrit.wikimedia.org/r/425480 (https://phabricator.wikimedia.org/T191275)
[07:33:51] <wikibugs>	 (CR) Marostegui: [C: 2] install_server: Allow reimage db2033 [puppet] - https://gerrit.wikimedia.org/r/425480 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:39:52] <wikibugs>	 Operations, Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122437 (MoritzMuehlenhoff) You're getting connection failures from bast4001, but in your config you've configured your SSH client to use bast1002?  Try changing the last four l...
[07:40:30] <wikibugs>	 (PS1) Marostegui: db2069.yaml: Binlog format ROW [puppet] - https://gerrit.wikimedia.org/r/425481
[07:42:14] <wikibugs>	 (CR) Marostegui: [C: 2] db2069.yaml: Binlog format ROW [puppet] - https://gerrit.wikimedia.org/r/425481 (owner: Marostegui)
[07:45:17] <ema>	 !log cp2022: restart varnish-be due to child process crash https://phabricator.wikimedia.org/P6979 T191229
[07:45:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:24] <stashbot>	 T191229: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229
[07:46:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] Update rbenv ruby version to match production [puppet] - https://gerrit.wikimedia.org/r/425280 (owner: Giuseppe Lavagetto)
[07:46:40] <wikibugs>	 (PS3) Giuseppe Lavagetto: Update rbenv ruby version to match production [puppet] - https://gerrit.wikimedia.org/r/425280
[07:57:59] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275)
[07:58:33] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:59:01] <marostegui>	 _joe_: ^ I guess?
[08:00:38] <_joe_>	 marostegui: yeah sorry
[08:00:49] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:00:51] <_joe_>	 that is the kind of change with no production effect I forget to merge
[08:01:43] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[08:02:08] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:02:22] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:03:28] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2069 as candidate master for x1 - T191275 (duration: 01m 03s)
[08:03:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:34] <stashbot>	 T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[08:18:20] <jynus>	 !log rerunning eqiad misc backups
[08:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:13] <wikibugs>	 (PS5) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:22:18] <wikibugs>	 (CR) Hashar: "I had to rename libtidy.sa to use the 0.99 suffix which lead to the following mess:" [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: Hashar)
[08:24:56] <wikibugs>	 (PS6) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:25:07] <wikibugs>	 (PS1) Marostegui: s1,x1.hosts: Move db2069 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/425487 (https://phabricator.wikimedia.org/T191275)
[08:26:17] <wikibugs>	 (Abandoned) Elukey: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - https://gerrit.wikimedia.org/r/423851 (owner: Elukey)
[08:27:36] <wikibugs>	 Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122514 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5003.eqsin.wmnet'] ```  Of which those **FAILED**: ``` ['lvs5003.eqsin.wmnet'] ```
[08:28:22] <wikibugs>	 Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122515 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5003.eqsin.wmnet ``` The log can be found in `/var/lo...
[08:28:48] <wikibugs>	 (CR) Marostegui: [C: 2] s1,x1.hosts: Move db2069 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/425487 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:29:39] <wikibugs>	 (Merged) jenkins-bot: s1,x1.hosts: Move db2069 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/425487 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:31:18] <wikibugs>	 (PS2) Muehlenhoff: Reimage mw1265 with stretch [puppet] - https://gerrit.wikimedia.org/r/425269 (https://phabricator.wikimedia.org/T174431)
[08:35:48] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Reimage mw1265 with stretch [puppet] - https://gerrit.wikimedia.org/r/425269 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff)
[08:39:28] <wikibugs>	 (PS7) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:40:10] <wikibugs>	 (CR) Hashar: "A couple lintian issues:" [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: Hashar)
[08:49:19] <wikibugs>	 (PS8) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:49:53] <wikibugs>	 (CR) Hashar: "Fixed doc-base and binary without manpage." [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: Hashar)
[08:54:09] <hashar>	 LD_LIBRARY_PATH=/home/hashar/projects/tidy/hacked php tests/phpunit/phpunit.php 
[08:54:15] <hashar>	 (wrong term grrr)
[08:55:09] <wikibugs>	 (PS1) Jcrespo: mariadb: Allow reimage of all es2*** hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/425491
[08:56:29] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Allow reimage of all es2*** hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/425491 (owner: Jcrespo)
[08:56:37] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Allow reimage of all es2*** hosts to stretch" [puppet] - https://gerrit.wikimedia.org/r/425492
[08:59:25] <wikibugs>	 (CR) Mark Bergsma: [C: 2] Create FSM test cases according to the RFC 4271 definition [debs/pybal] - https://gerrit.wikimedia.org/r/423995 (owner: Mark Bergsma)
[08:59:43] <moritzm>	 !log reimaging mw1265 to stretch (T174431)
[08:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:49] <stashbot>	 T174431: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431
[08:59:59] <wikibugs>	 (Merged) jenkins-bot: Create FSM test cases according to the RFC 4271 definition [debs/pybal] - https://gerrit.wikimedia.org/r/423995 (owner: Mark Bergsma)
[09:02:14] <wikibugs>	 (PS3) Mark Bergsma: Handle non-IDLE states in idleHoldTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/423997
[09:02:16] <wikibugs>	 (PS3) Mark Bergsma: Fix sendNotification invocation [debs/pybal] - https://gerrit.wikimedia.org/r/423998
[09:02:18] <wikibugs>	 (PS3) Mark Bergsma: Fix two typos in bgp.FSM.openReceived [debs/pybal] - https://gerrit.wikimedia.org/r/423999
[09:02:20] <wikibugs>	 (PS3) Mark Bergsma: Fix holdTimeEvent incrementing connectRetryCounter twice [debs/pybal] - https://gerrit.wikimedia.org/r/424000
[09:02:22] <wikibugs>	 (PS3) Mark Bergsma: Fix distinction between events 19 and 20 (delayOpen) [debs/pybal] - https://gerrit.wikimedia.org/r/424001
[09:02:24] <wikibugs>	 (PS3) Mark Bergsma: Handle state ESTABLISHED in versionError (event 24) [debs/pybal] - https://gerrit.wikimedia.org/r/424002
[09:02:26] <wikibugs>	 (PS3) Mark Bergsma: Handle state OPENSENT in keepAliveEvent (event 11) [debs/pybal] - https://gerrit.wikimedia.org/r/424003
[09:02:28] <wikibugs>	 (PS3) Mark Bergsma: Handle state OPENSENT in keepAliveReceived [debs/pybal] - https://gerrit.wikimedia.org/r/424004
[09:02:32] <wikibugs>	 (PS3) Mark Bergsma: Correctly handle event 9 (connectRetryTimeEvent) in ACTIVE [debs/pybal] - https://gerrit.wikimedia.org/r/424005
[09:02:34] <wikibugs>	 (PS3) Mark Bergsma: Fix typo in FSM.delayOpenTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/424006
[09:02:36] <wikibugs>	 (PS1) Jcrespo: mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493
[09:02:54] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:02:54] <vgutierrez>	 oh my :)
[09:03:32] <mark>	 :)
[09:03:41] <ema>	 !log restart pybal on lvs1003 for UDP monitoring config changes https://gerrit.wikimedia.org/r/#/c/425251/
[09:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:48] <mark>	 that was not all, i can only push 10 changes at once ;)
[09:03:55] <wikibugs>	 (PS3) Mark Bergsma: Move updating of FSM metric labels to the protocol's connectionMade [debs/pybal] - https://gerrit.wikimedia.org/r/424007
[09:03:57] <wikibugs>	 (PS3) Mark Bergsma: Ignore headerError and openMessageError in state IDLE [debs/pybal] - https://gerrit.wikimedia.org/r/424008
[09:04:00] <wikibugs>	 (PS3) Mark Bergsma: Cleanup module for consistency [debs/pybal] - https://gerrit.wikimedia.org/r/424009
[09:04:02] <wikibugs>	 (PS3) Mark Bergsma: Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - https://gerrit.wikimedia.org/r/424011
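The PyBal commit titles above refer to BGP finite-state-machine events by their RFC 4271 (section 8.1) numbers: 9, 11, 19/20, 24, 25 and 27. As a reading aid, here is a minimal Python lookup of those numbers with the names RFC 4271 gives them. This is not PyBal's code; the names `RFC4271_EVENTS`, `classify_open` and `describe` are hypothetical, and only the events mentioned in this log plus a few neighbours are listed.

```python
"""Hypothetical reading aid: RFC 4271 BGP FSM event numbers seen in the
PyBal commit titles. Not taken from PyBal itself."""

# Event numbers and names as listed in RFC 4271, section 8.1 (subset).
RFC4271_EVENTS = {
    9: "ConnectRetryTimer_Expires",
    10: "HoldTimer_Expires",
    11: "KeepaliveTimer_Expires",
    12: "DelayOpenTimer_Expires",
    13: "IdleHoldTimer_Expires",
    19: "BGPOpen",
    20: "BGPOpen with DelayOpenTimer running",
    21: "BGPHeaderErr",
    22: "BGPOpenMsgErr",
    24: "NotifMsgVerErr",
    25: "NotifMsg",
    26: "KeepAliveMsg",
    27: "UpdateMsg",
}


def classify_open(delay_open_timer_running: bool) -> int:
    """Return the RFC 4271 event number for a received OPEN message.

    The RFC distinguishes event 19 (plain BGPOpen) from event 20
    (BGPOpen while the DelayOpenTimer is running)."""
    return 20 if delay_open_timer_running else 19


def describe(event: int) -> str:
    """Human-readable label for an event number, if it is in the table."""
    return f"Event {event}: {RFC4271_EVENTS.get(event, 'unknown/not listed')}"


if __name__ == "__main__":
    # Events named in the commit titles, with an OPEN received while
    # the DelayOpenTimer is running standing in for 19/20.
    for ev in (9, 11, classify_open(True), 24, 25, 27):
        print(describe(ev))
```

The split between 19 and 20 is why a commit needed to "fix distinction between events 19 and 20": the same OPEN message maps to a different event depending on whether the DelayOpenTimer is active.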
[09:04:08] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:04:15] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:04:21] <vgutierrez>	 lvs5003 it's me
[09:05:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:06:02] <wikibugs>	 (Merged) jenkins-bot: mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:06:13] <wikibugs>	 (Abandoned) Mark Bergsma: Fix test case ESTABLISHED event 27 hold time nonzero [debs/pybal] - https://gerrit.wikimedia.org/r/424010 (owner: Mark Bergsma)
[09:06:23] <apergos>	 gone for a couple of hours, back later
[09:07:37] <icinga-wm>	 PROBLEM - Check rp_filter disabled on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:07:37] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:07:57] <icinga-wm>	 PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied
[09:08:28] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2014 (duration: 01m 03s)
[09:08:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:19] <jynus>	 someone working in deploy1001.eqiad.wmnet? it failed to sync (ok if you are, I will research if not)
[09:10:55] <wikibugs>	 (PS7) Elukey: Modify eventlogging purging script to read from YAML whitelist [puppet] - https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) (owner: Mforns)
[09:11:30] <wikibugs>	 (CR) Elukey: [C: 2] "Tested in labs (deployment-eventlog05) with both tsv and yaml encoding, everything seems good." [puppet] - https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) (owner: Mforns)
[09:11:33] <marostegui>	 jynus: I saw that too earlier and saw this in SAL mutante: deploy1001 - reinstalled with stretch - re-adding to puppet so I guess it is still WIP
[09:11:37] <icinga-wm>	 RECOVERY - Check rp_filter disabled on lvs5003 is OK: OK: kernel parameters are set to expected value.
[09:15:47] <wikibugs>	 (CR) Mark Bergsma: [C: 2] Fix sendNotification invocation [debs/pybal] - https://gerrit.wikimedia.org/r/423998 (owner: Mark Bergsma)
[09:16:08] <wikibugs>	 Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#4122579 (Qgil)
[09:16:57] <wikibugs>	 (CR) Mark Bergsma: [C: 2] Fix two typos in bgp.FSM.openReceived [debs/pybal] - https://gerrit.wikimedia.org/r/423999 (owner: Mark Bergsma)
[09:17:52] <jynus>	 !log start reimage of es2014
[09:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:50] <icinga-wm>	 RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old.
[09:19:10] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs5003 is OK: OK: no difference between hosts in IPVS/PyBal
[09:20:25] <wikibugs>	 Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122595 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5003.eqsin.wmnet'] ```  and were **ALL** successful.
[09:22:00] <icinga-wm>	 RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK
[09:22:39] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 12 connections established with conf2003.codfw.wmnet:2379 (min=12)
[09:23:30] <wikibugs>	 (CR) jenkins-bot: mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:24:43] <wikibugs>	 (PS1) Volans: wmf-auto-reimage: bugfix Phabricator client [puppet] - https://gerrit.wikimedia.org/r/425495
[09:25:00] <icinga-wm>	 PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied
[09:28:04] <wikibugs>	 (PS1) Elukey: profile::mariadb::misc::el::sanitization: add package [puppet] - https://gerrit.wikimedia.org/r/425496 (https://phabricator.wikimedia.org/T189692)
[09:28:39] <wikibugs>	 (CR) Elukey: [C: 2] profile::mariadb::misc::el::sanitization: add package [puppet] - https://gerrit.wikimedia.org/r/425496 (https://phabricator.wikimedia.org/T189692) (owner: Elukey)
[09:29:28] <arturo>	 !log doing some testing in labtestvirt2001 mounting instance's qcow2 files into /home/aborrero/mnt
[09:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:40] <icinga-wm>	 PROBLEM - HHVM processes on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:32:40] <icinga-wm>	 PROBLEM - nutcracker port on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:34:19] <icinga-wm>	 PROBLEM - HHVM rendering on mw1265 is CRITICAL: connect to address 10.64.0.60 and port 80: Connection refused
[09:34:19] <icinga-wm>	 PROBLEM - nutcracker process on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:36:00] <icinga-wm>	 PROBLEM - puppet last run on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:36:28] <moritzm>	 ^reimage, silencing again
[09:39:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1265 is CRITICAL: connect to address 10.64.0.60 and port 80: Connection refused
[09:39:29] <icinga-wm>	 PROBLEM - MD RAID on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:41:09] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:41:10] <icinga-wm>	 RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK
[09:42:09] <wikibugs>	 (PS1) Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692)
[09:45:46] <wikibugs>	 (PS1) Marostegui: db2033.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/425499
[09:46:19] <wikibugs>	 (CR) Marostegui: [C: 2] db2033.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/425499 (owner: Marostegui)
[09:49:34] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501
[09:51:47] <wikibugs>	 (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501 (owner: Jcrespo)
[09:51:56] <wikibugs>	 (PS1) Jcrespo: mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502
[09:52:30] <moritzm>	 !log installing java security updates on kafka/analytics cluster
[09:52:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[09:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:08] <wikibugs>	 (Merged) jenkins-bot: Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501 (owner: Jcrespo)
[09:53:51] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502 (owner: Jcrespo)
[09:54:42] <wikibugs>	 (PS2) Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692)
[09:55:09] <wikibugs>	 (Merged) jenkins-bot: mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502 (owner: Jcrespo)
[09:57:09] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2014, depool es2015 (duration: 01m 02s)
[09:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:41] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425503
[09:59:39] <wikibugs>	 (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10898/" [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey)
[10:00:28] <moritzm>	 !log installing java security updates on kafka/jumbo cluster
[10:00:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:31] <jynus>	 !log start reimage of es2015
[10:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:59] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425501 (owner: 10Jcrespo)
[10:05:06] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool es2015 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425502 (owner: 10Jcrespo)
[10:10:56] <icinga-wm>	 RECOVERY - MD RAID on mw1265 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[10:11:07] <icinga-wm>	 RECOVERY - HHVM processes on mw1265 is OK: PROCS OK: 6 processes with command name hhvm
[10:11:26] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1265 is OK: OK: nf_conntrack is 0 % full
[10:17:44] <icinga-wm>	 RECOVERY - nutcracker process on mw1265 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[10:19:34] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1964 bytes in 0.105 second response time
[10:19:44] <icinga-wm>	 RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 75329 bytes in 0.174 second response time
[10:26:08] <icinga-wm>	 RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:28:31] <marostegui>	 !log Drop table prefstats in s5 - T154490
[10:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:36] <stashbot>	 T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[10:31:43] <marostegui>	 !log Drop table prefstats in s6 - T154490
[10:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:28] <icinga-wm>	 RECOVERY - nutcracker port on mw1265 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[10:33:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425503 (owner: 10Jcrespo)
[10:33:05] <marostegui>	 !log Drop table prefstats in s4 - T154490
[10:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:21] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425507
[10:34:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425503 (owner: 10Jcrespo)
[10:34:40] <wikibugs>	 (03PS4) 10EddieGP: beta: Combine commons, deployments, meta and zero vhost [puppet] - 10https://gerrit.wikimedia.org/r/398399
[10:35:57] <wikibugs>	 (03PS3) 10EddieGP: Run initSiteStats twice a month [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad)
[10:37:31] <wikibugs>	 (03CR) 10EddieGP: "I've signed this one up for tomorrow's puppet swat. It's trivial and a no-op in prod." [puppet] - 10https://gerrit.wikimedia.org/r/398399 (owner: 10EddieGP)
[10:38:58] <wikibugs>	 (03PS1) 10Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508
[10:39:16] <wikibugs>	 (03PS2) 10Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508
[10:39:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based app servers [puppet] - 10https://gerrit.wikimedia.org/r/425509 (https://phabricator.wikimedia.org/T185195)
[10:41:02] <wikibugs>	 (03CR) 10EddieGP: [C: 031] "I've signed this up for tomorrow's puppet swat, even though it's not my patch (I'd have done the same, but Chad was faster). I hope that's " [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad)
[10:43:10] <wikibugs>	 (03CR) 10Ema: [C: 031] pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508 (owner: 10Vgutierrez)
[10:43:13] <marostegui>	 !log Drop table prefstats in s2 - T154490
[10:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:18] <stashbot>	 T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[10:44:17] <wikibugs>	 (03PS3) 10Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897)
[10:44:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez)
[10:45:47] <wikibugs>	 (03PS4) 10Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897)
[10:46:30] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez)
[10:46:40] <wikibugs>	 (03PS5) 10Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - 10https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897)
[10:49:01] <wikibugs>	 (03CR) 10Jcrespo: "SiteStatsInit.php seems well written, using the replica for slow queries, I can endorse this." [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad)
[10:49:13] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] Run initSiteStats twice a month [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad)
[10:50:49] <moritzm>	 !log installing openssl updates
[10:50:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:26] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425507 (owner: 10Jcrespo)
[10:53:46] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425507 (owner: 10Jcrespo)
[10:53:48] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425503 (owner: 10Jcrespo)
[10:53:58] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425507 (owner: 10Jcrespo)
[10:56:53] <ema>	 !log stop pybal on lvs5001 to test requests through lvs5003, reimaged as stretch T191897
[10:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:59] <stashbot>	 T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[10:59:04] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2015, depool es2011 (duration: 00m 59s)
[10:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:03] <icinga-wm>	 PROBLEM - pybal on lvs5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[11:00:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090
[11:00:29] <ema>	 that's me, ignore ^
[11:01:57] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Ema Pybal stopped to test lvs5003 T191897
[11:01:57] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=4) Ema Pybal stopped to test lvs5003 T191897
[11:01:57] <icinga-wm>	 ACKNOWLEDGEMENT - pybal on lvs5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Ema Pybal stopped to test lvs5003 T191897
[11:04:08] <marostegui>	 !log Drop table prefstats in s7 - T154490
[11:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:15] <stashbot>	 T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[11:09:45] <ema>	 !log start pybal on lvs5001, test completed on lvs5003 
[11:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:03] <icinga-wm>	 RECOVERY - pybal on lvs5001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[11:10:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy
[11:11:04] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122849 (10Vgutierrez)
[11:28:20] <wikibugs>	 (03PS8) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504)
[11:28:22] <wikibugs>	 (03PS6) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504)
[11:28:24] <wikibugs>	 (03PS8) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504)
[11:28:26] <wikibugs>	 (03PS4) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504)
[11:28:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[11:28:36] * volans waiting for the -1s
[11:28:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[11:28:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[11:28:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[11:34:40] <wikibugs>	 (03PS3) 10EddieGP: Kill some hiera paths [labs/private] - 10https://gerrit.wikimedia.org/r/423189
[11:36:31] <wikibugs>	 (03PS2) 10EddieGP: cloud hiera: Remove unused paths from hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/423190
[11:47:07] <jynus>	 !log start reimage of es2011
[11:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:40] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516
[11:59:03] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517
[11:59:40] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 (owner: 10Jcrespo)
[11:59:59] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 (owner: 10Jcrespo)
[12:01:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 (owner: 10Jcrespo)
[12:01:22] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 (owner: 10Jcrespo)
[12:01:25] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 (owner: 10Jcrespo)
[12:04:20] <wikibugs>	 (03PS2) 10Mobrovac: Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko)
[12:05:06] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2011, depool es2012 (duration: 01m 01s)
[12:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:47] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 (owner: 10Jcrespo)
[12:09:42] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.085 second response time
[12:09:51] <jynus>	 !log start reimage of es2012
[12:09:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:03] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518
[12:16:07] <mobrovac>	 taking over tin for 10 mins
[12:16:22] <icinga-wm>	 PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:17:32] <wikibugs>	 (03CR) 10Mobrovac: [C: 032] Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko)
[12:18:47] <wikibugs>	 (03Merged) 10jenkins-bot: Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko)
[12:19:17] <wikibugs>	 (03CR) 10jenkins-bot: Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko)
[12:20:33] <icinga-wm>	 PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 322 MB (3% inode=78%)
[12:21:07] <moritzm>	 !log enable production traffic for mw1265 (stretch app server) for a brief test period
[12:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:48] <logmsgbot>	 !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 1/2 - T190327 (duration: 01m 01s)
[12:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:55] <stashbot>	 T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327
[12:32:10] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122973 (10Sharvaniharan) My config looks like this now..   ``` Host bastlabs HostName bastion-eqiad.wmflabs.org User sharan IdentityFile ~/.ssh/id_rsa  Host *.eqiad.wmflabs !bast...
[12:32:37] <logmsgbot>	 !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 1/2 (retry) - T190327 (duration: 01m 00s)
[12:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:43] <stashbot>	 T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327
[12:33:53] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122985 (10Sharvaniharan) @MoritzMuehlenhoff  please let me know if a hangout would be better. I will be available for the next hour and then anytime after 9am Pacific time.
[12:39:32] <logmsgbot>	 !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 1/2 (retry #2) (duration: 01m 01s)
[12:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:23] <icinga-wm>	 RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:41:29] <wikibugs>	 (03PS1) 10Vgutierrez: install_server: Reimage lvs4007 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425520 (https://phabricator.wikimedia.org/T191897)
[12:43:04] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (watching): Scap sync-file failing for 9 hosts - https://phabricator.wikimedia.org/T191972#4122995 (10mobrovac)
[12:43:46] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (watching): Scap sync-file failing for 9 hosts - https://phabricator.wikimedia.org/T191972#4123005 (10mobrovac) p:05Triage>03Unbreak!
[12:44:11] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123007 (10MoritzMuehlenhoff) >>! In T191673#4122973, @Sharvaniharan wrote: > Not sure if this is what you meant by ssh -add -L    > ``` > wmf2050:~ sharan$ ssh -add -L releases10...
[12:44:19] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123008 (10ema) p:05Triage>03High
[12:45:47] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for 9 hosts - https://phabricator.wikimedia.org/T191972#4123012 (10mobrovac)
[12:46:44] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.099 second response time
[12:47:28] <wikibugs>	 (03CR) 10Ema: [C: 031] install_server: Reimage lvs4007 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425520 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez)
[12:47:34] <icinga-wm>	 RECOVERY - Disk space on labtestcontrol2001 is OK: DISK OK
[12:47:41] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123013 (10Sharvaniharan) This is what I have in etc/ssh   ``` wmf2050:~ sharan$ ls /etc/ssh moduli  ssh_config sshd_config  ```  They are both empty files though. Should i delete...
[12:48:00] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123014 (10jcrespo)
[12:48:41] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123015 (10Sharvaniharan) And..   ``` wmf2050:~ sharan$ ssh-add -L releases1001.eqiad.wmnet The agent has no identities. ```
[12:50:20] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] install_server: Reimage lvs4007 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425520 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez)
[12:53:09] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123026 (10mobrovac) Apparently `deploy1001` has been recently reimaged. However, it doesn't seem like it has a role associated with i...
[12:54:11] <wikibugs>	 (03CR) 10Mobrovac: [C: 032] "NOTE: this hasn't been fully synced due to T191972" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko)
[12:56:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 (owner: 10Jcrespo)
[12:56:42] <wikibugs>	 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589662 (10mobrovac) It seems that the reimage is now blocking deployments, cf. {T191972}
[12:57:53] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 (owner: 10Jcrespo)
[12:57:58] <mobrovac>	 jouncebot: next
[12:57:58] <jouncebot>	 In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1300)
[12:58:11] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123031 (10MoritzMuehlenhoff) >>! In T191673#4123015, @Sharvaniharan wrote: > And.. >  >  > ``` > wmf2050:~ sharan$ ssh-add -L releases1001.eqiad.wmnet > The agent has no identiti...
[12:58:17] <mobrovac>	 probably not happening ^ due to T191972
[12:58:17] <stashbot>	 T191972: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972
[12:59:12] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 (owner: 10Jcrespo)
[12:59:42] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123034 (10Sharvaniharan) yes... I was looking in the wrong directory... I have updated my comment to reflect the contents of /etc/ssh/ssh_config
[13:00:05] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1300).
[13:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[13:00:11] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2012 (duration: 01m 00s)
[13:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:51] <mobrovac>	 jynus: is it not failing for you or you don't care about the 9 servers?
[13:01:01] <vgutierrez>	 !log Reimage lvs4007 as stretch
[13:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:34] <zeljkof>	 nice, nothing for swat today :)
[13:01:45] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.325 second response time
[13:01:59] <jynus>	 mobrovac: please, they are not 9 servers
[13:02:04] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123036 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo...
[13:02:21] <jynus>	 please read the logs carefully, I had to correct your ticket
[13:03:14] <mobrovac>	 jynus: what are the mw hosts listed in the log then? my read is that it fails to connect to these
[13:03:31] <jynus>	 they happen to be on the same batch, but they are successful
[13:03:41] <jynus>	 ok: 269; fail: 1; left: 0
[13:04:06] <jynus>	 and that is coming from me, not a "real" deployer and not knowing scap
[13:04:20] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123038 (10Sharvaniharan) finally!!!! I was able to ssh into releases1001 and stat1006!
[13:04:36] <jynus>	 the command returns 1 and only 1 error for a single host
[13:04:53] <jynus>	 plus my changes are no-ops: I deploy and revert, as usual
[13:04:55] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123039 (10Imarlier) Change in observed performance due to depooling of Singapore:  Synthetic tests (from AWS Mumbai): https://grafana.wikimedia.org/dash...
[13:05:28] <mobrovac>	 jynus: indeed
[13:05:29] <mobrovac>	 k
[13:05:34] <mobrovac>	 i'll finish my stuff then
[13:05:46] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123040 (10jcrespo)
[13:06:21] <jynus>	 most likely, that will be setup as a deployment server, but having 3 of those is complicated, so it ended up in a limbo
[13:06:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] cassandra/icinga: make monitoring configurable, skip on dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn)
[13:06:44] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123042 (10mobrovac) p:05Unbreak!>03High It's failing only on `deploy1001`, so lowering the priority.
[13:09:08] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ```  Of which those **FAILED**: ``` ['lvs4007.ulsfo.wmnet'] ```
[13:09:21] <marostegui>	 !log Drop prefstats table on s3 codfw master - db2043 (this might generate lag on codfw) - T154490
[13:09:25] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123048 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo...
[13:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:27] <stashbot>	 T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[13:09:28] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123049 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ```  Of which those **FAILED**: ``` ['lvs4007.ulsfo.wmnet'] ```
[13:10:00] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123051 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo...
[13:10:30] <wikibugs>	 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123052 (10Sharvaniharan) 05Open>03Resolved
[13:12:34] <elukey>	 !log restart kafka brokers on kafka1012->23 for openjdk-7 upgrades
[13:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:53] <marostegui>	 !log Drop prefstats table on s1 codfw master - db2048 (this might generate lag on codfw) - T154490
[13:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339 (owner: 10Herron)
[13:18:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "Nice, wanna have a look at https://gerrit.wikimedia.org/r/#/c/356021/ as well ? Can also be useful (I never received input so I am resolic" [puppet] - 10https://gerrit.wikimedia.org/r/425335 (owner: 10Herron)
[13:20:27] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490)
[13:22:33] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) (owner: 10Marostegui)
[13:23:49] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) (owner: 10Marostegui)
[13:23:54] <wikibugs>	 (03PS1) 10Pmiazga: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966)
[13:25:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga)
[13:25:09] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 (duration: 01m 00s)
[13:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:22] <moritzm>	 !log installing java security updates on kafka/main cluster
[13:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:20] <marostegui>	 !log Drop prefstats table on s3 sanitarium master (db1072) this might cause lag on labs - T154490
[13:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:25] <stashbot>	 T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[13:28:20] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523
[13:28:50] <wikibugs>	 (03CR) 10Ottomata: "Thanks Luca!" [puppet] - 10https://gerrit.wikimedia.org/r/425477 (owner: 10Elukey)
[13:29:07] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) (owner: 10Marostegui)
[13:29:50] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 (owner: 10Marostegui)
[13:30:47] <wikibugs>	 (03PS1) 10Gehel: maps: run populate_admin() regularly [puppet] - 10https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605)
[13:31:04] <wikibugs>	 (03PS2) 10Pmiazga: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966)
[13:31:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 (owner: 10Marostegui)
[13:31:45] <logmsgbot>	 !log ppchelko@tin Started deploy [cpjobqueue/deploy@2b59313]: Enable second bulk of low-traffic jobs T190327
[13:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:51] <stashbot>	 T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327
[13:32:13] <mobrovac>	 marostegui: when you are done with the sync, i'll need to sync too
[13:32:29] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 (duration: 01m 07s)
[13:32:30] <marostegui>	 mobrovac: all yours
[13:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:35] <mobrovac>	 kk thnx
[13:32:53] <logmsgbot>	 !log ppchelko@tin Started deploy [cpjobqueue/deploy@2b59313]: Enable second bulk of low-traffic jobs T190327
[13:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:58] <logmsgbot>	 !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 2/2 - T190327 (duration: 01m 00s)
[13:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:17] <icinga-wm>	 PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:34:44] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 (owner: 10Marostegui)
[13:37:36] <moritzm>	 !log rolling restart of restbase in codfw to pick up openssl update
[13:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey)
[13:40:28] <wikibugs>	 (03PS2) 10Gehel: maps: run populate_admin() regularly [puppet] - 10https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605)
[13:41:21] <logmsgbot>	 !log ppchelko@tin Finished deploy [cpjobqueue/deploy@2b59313]: Enable second bulk of low-traffic jobs T190327 (duration: 08m 27s)
[13:41:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:26] <stashbot>	 T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327
[13:44:01] <logmsgbot>	 !log ppchelko@tin Started deploy [cpjobqueue/deploy@3ba6580]: Enable second bulk of low-traffic jobs T190327 take 2
[13:44:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:17] <icinga-wm>	 RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational
[13:44:50] <logmsgbot>	 !log ppchelko@tin Finished deploy [cpjobqueue/deploy@3ba6580]: Enable second bulk of low-traffic jobs T190327 take 2 (duration: 00m 49s)
[13:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:50] <elukey>	 akosiaris: thanks a lot! 
[13:48:28] <wikibugs>	 (03CR) 10Mobrovac: [C: 032] "> NOTE: this hasn't been fully synced due to T191972" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko)
[13:51:45] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123145 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ```  and were **ALL** successful.
[13:53:44] <wikibugs>	 (03PS3) 10Elukey: Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081)
[13:54:00] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey)
[13:58:46] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4123181 (10Andrew)
[14:05:54] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531
[14:06:05] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089)
[14:07:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 (owner: 10Jcrespo)
[14:09:04] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 (owner: 10Jcrespo)
[14:09:18] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 (owner: 10Jcrespo)
[14:09:56] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089)
[14:11:16] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui)
[14:11:50] <wikibugs>	 (03PS1) 10Jcrespo: Revert "Revert "mariadb: Depool es2012 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425534
[14:11:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: php: add module for basic installation [puppet] - 10https://gerrit.wikimedia.org/r/425535
[14:11:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "mariadb: Depool es2012 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425534 (owner: 10Jcrespo)
[14:12:06] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2013 (duration: 01m 00s)
[14:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:33] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui)
[14:12:48] <wikibugs>	 (03Abandoned) 10Jcrespo: Revert "Revert "mariadb: Depool es2012 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425534 (owner: 10Jcrespo)
[14:13:10] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536
[14:13:50] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 for alter table (duration: 01m 00s)
[14:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:22] <marostegui>	 !log Deploy schema change on db1099:3318 - T187089 T185128 T153182
[14:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:29] <stashbot>	 T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[14:14:29] <stashbot>	 T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[14:14:29] <stashbot>	 T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[14:14:47] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui)
[14:15:08] <jynus>	 !log start reimage of es2013
[14:15:11] <wikibugs>	 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4123214 (10Halfak) This sounds surprising and strange.  Please ping me on...
[14:15:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:29] <wikibugs>	 (03PS1) 10Ema: pybal: Re-enable BGP on lvs4007 [puppet] - 10https://gerrit.wikimedia.org/r/425537 (https://phabricator.wikimedia.org/T191897)
[14:20:16] <wikibugs>	 (03CR) 10Ema: [C: 032] pybal: Re-enable BGP on lvs4007 [puppet] - 10https://gerrit.wikimedia.org/r/425537 (https://phabricator.wikimedia.org/T191897) (owner: 10Ema)
[14:26:51] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (10jcrespo)
[14:27:06] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: php: add module for basic installation [puppet] - 10https://gerrit.wikimedia.org/r/425535
[14:30:13] <wikibugs>	 (03PS1) 10Herron: puppet-agent: log puppet runs via syslog [puppet] - 10https://gerrit.wikimedia.org/r/425538
[14:31:00] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123270 (10jcrespo) T150160 suggests `racadm reset` may fix it.
[14:36:33] <logmsgbot>	 !log ppchelko@tin Started deploy [cpjobqueue/deploy@a090a3c]: Fix the low priority jobs topic names
[14:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:11] <logmsgbot>	 !log ppchelko@tin Finished deploy [cpjobqueue/deploy@a090a3c]: Fix the low priority jobs topic names (duration: 00m 38s)
[14:37:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:02] <Krinkle>	 !log Turned regular coal back on (T191239)
[14:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:08] <stashbot>	 T191239: coal metrics changed after deploying new code - https://phabricator.wikimedia.org/T191239
[14:38:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet-agent: log puppet runs via syslog [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron)
[14:39:30] <moritzm>	 !log rolling restart of restbase in eqiad to pick up openssl update
[14:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:22] <wikibugs>	 (03CR) 10Mforns: [C: 031] "LGTM!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey)
[14:46:36] <wikibugs>	 (03CR) 10Herron: "Looking at this and the patch to remove colors (https://gerrit.wikimedia.org/r/#/c/425335/) I realized that by tuning our syslog config we" [puppet] - 10https://gerrit.wikimedia.org/r/356021 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris)
[14:47:26] <wikibugs>	 10Operations, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4123336 (10MoritzMuehlenhoff)
[14:48:13] <wikibugs>	 (03Abandoned) 10Herron: puppet: disable color output in puppet log /var/log/puppet.log [puppet] - 10https://gerrit.wikimedia.org/r/425335 (owner: 10Herron)
[14:49:17] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536
[14:49:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 (owner: 10Jcrespo)
[14:49:44] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[14:50:26] <mdholloway>	 ^ saw this, will look into it
[14:50:57] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 (owner: 10Jcrespo)
[14:51:11] <wikibugs>	 (03CR) 10Herron: "Thanks akosiaris!" [puppet] - 10https://gerrit.wikimedia.org/r/425339 (owner: 10Herron)
[14:51:13] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 (owner: 10Jcrespo)
[14:51:15] <wikibugs>	 (03PS2) 10Herron: puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339
[14:51:44] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[14:51:58] <wikibugs>	 (03CR) 10Herron: [C: 032] puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339 (owner: 10Herron)
[14:53:00] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool es1011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542
[14:53:51] <wikibugs>	 (03PS2) 10Ema: lvs: use UDP monitor for logstash-{json,syslog}-udp [puppet] - 10https://gerrit.wikimedia.org/r/425253
[14:54:24] <gehel>	 !log starting rolling restart of elasticsearch cirrus / eqiad for jvm upgrade
[14:54:27] <wikibugs>	 (03CR) 10Ema: [C: 032] lvs: use UDP monitor for logstash-{json,syslog}-udp [puppet] - 10https://gerrit.wikimedia.org/r/425253 (owner: 10Ema)
[14:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:31] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123393 (10Krinkle)
[14:54:40] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2013 (duration: 01m 00s)
[14:54:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:26] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542
[14:55:30] <wikibugs>	 (03CR) 10Herron: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron)
[14:57:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo)
[14:57:58] <wikibugs>	 (03CR) 10Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey)
[14:58:05] <wikibugs>	 (03PS3) 10Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692)
[14:58:49] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123399 (10Vgutierrez)
[14:58:53] <wikibugs>	 (03CR) 10Elukey: [C: 032] role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey)
[14:59:05] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4123400 (10akosiaris) >>! In T191648#4113134, @Andrew wrote: >>>! In T191648#4112786, @bd808 wrote: >> I wonder if the specific ordering issue is t...
[14:59:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "I think we can solve this in a bit cleaner approach than mangling the plugins settings list. See https://phabricator.wikimedia.org/T191648" [puppet] - 10https://gerrit.wikimedia.org/r/424638 (https://phabricator.wikimedia.org/T191648) (owner: 10Andrew Bogott)
[14:59:30] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo)
[15:01:13] <marlier>	 !log Stopping coal on graphite2001.codfw.wmnet for data replay
[15:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:51] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1012 (duration: 01m 00s)
[15:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:46] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo)
[15:03:44] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1981 bytes in 0.144 second response time
[15:06:00] <wikibugs>	 (03PS2) 10Ppchelko: Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888
[15:06:37] <wikibugs>	 (03CR) 10Ppchelko: "rebased." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 (owner: 10Ppchelko)
[15:06:53] <robh>	 !log shutting down cp2008, cp2011, and cp2018 for onsite work
[15:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:03] <ema>	 !log restart pybal on lvs1006 for logstash-{json,syslog} UDP monitoring config changes https://gerrit.wikimedia.org/r/#/c/425253/
[15:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:33] <icinga-wm>	 PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:10:33] <icinga-wm>	 PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100%
[15:10:43] <icinga-wm>	 PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100%
[15:11:39] <wikibugs>	 (03PS1) 10Ppchelko: Switch second bulk of low-traffic jobs for all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327)
[15:13:22] <wikibugs>	 (03PS2) 10Rduran: Add integration tests to test agains MariaDB [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/425291
[15:14:04] <icinga-wm>	 PROBLEM - Host cp2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:14:12] <ema>	 !log restart pybal on lvs1003 for logstash-{json,syslog} UDP monitoring config changes https://gerrit.wikimedia.org/r/#/c/425253/
[15:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:24] <icinga-wm>	 PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:24] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:14:33] <icinga-wm>	 PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:14:33] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:14:33] <icinga-wm>	 PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:34] <icinga-wm>	 PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:34] <icinga-wm>	 PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:34] <icinga-wm>	 PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:34] <icinga-wm>	 PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:43] <icinga-wm>	 PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:43] <icinga-wm>	 PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:43] <icinga-wm>	 PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:44] <icinga-wm>	 PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:44] <icinga-wm>	 PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:53] <icinga-wm>	 PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:53] <icinga-wm>	 PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:54] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:14:54] <icinga-wm>	 PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:14:54] <icinga-wm>	 PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:14:54] <icinga-wm>	 PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:03] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:03] <icinga-wm>	 PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:04] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:04] <icinga-wm>	 PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:04] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:04] <icinga-wm>	 PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:04] <icinga-wm>	 PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:05] <icinga-wm>	 PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:05] <icinga-wm>	 PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:14] <icinga-wm>	 PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:14] <icinga-wm>	 PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:14] <icinga-wm>	 PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:14] <icinga-wm>	 PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[15:15:14] <icinga-wm>	 PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:14] <icinga-wm>	 PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:15] <icinga-wm>	 PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2018_v4, cp2018_v6
[15:15:15] <icinga-wm>	 PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:16] <icinga-wm>	 PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:16] <icinga-wm>	 PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:17] <icinga-wm>	 PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:24] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:25] <icinga-wm>	 PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:25] <icinga-wm>	 PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6
[15:15:25] <icinga-wm>	 PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:25] <icinga-wm>	 PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:25] <icinga-wm>	 PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:33] <icinga-wm>	 PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2018_v4, cp2018_v6
[15:15:33] <icinga-wm>	 PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:34] <icinga-wm>	 PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[15:15:34] <icinga-wm>	 PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[15:15:43] <icinga-wm>	 PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[15:15:44] <icinga-wm>	 PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2018_v4, cp2018_v6
[15:15:50] <jynus>	 ema: will you take care of running puppet there and on icinga?
[15:15:54] <icinga-wm>	 PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[15:16:25] <ema>	 jynus: ?
[15:16:38] <jynus>	 for the alerts noise, I mean
[15:17:18] <jynus>	 aren't you decomming servers?
[15:17:22] <ema>	 jynus: there's ongoing dc-ops work causing that, it isn't me (and it isn't puppet related)
[15:17:25] <jynus>	 ah
[15:17:27] <jynus>	 sorry
[15:17:39] <jynus>	 now I get it
[15:17:46] <ema>	 no worries. Those ipsec alerts are a pain, I know
[15:19:14] <icinga-wm>	 RECOVERY - Host cp2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms
[15:19:51] <jynus>	 !log fixing grant issue on db1114
[15:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:26] <Krinkle>	 !log disabling coal service on graphite2001 and disabling puppet – T191239
[15:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:32] <stashbot>	 T191239: coal metrics changed after deploying new code - https://phabricator.wikimedia.org/T191239
[15:23:44] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.074 second response time
[15:26:53] <icinga-wm>	 PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied
[15:27:28] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4123551 (10debt) Hi @BBlack - can you add your concerns to this ticket....we're needing to get this figured out soon. Thanks!
[15:27:53] <icinga-wm>	 RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK
[15:28:28] <arturo>	 ^^^ that's me
[15:29:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice idea, some inline comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron)
[15:29:34] <icinga-wm>	 PROBLEM - Host cp2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:29:49] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: Timestamp puppet-run logs [puppet] - 10https://gerrit.wikimedia.org/r/356021 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris)
[15:31:23] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123569 (10Papaul) DIMM A2 replaced  DIMM B2 replaced DIMM B6 replaced
[15:33:44] <icinga-wm>	 RECOVERY - Host cp2008 is UP: PING WARNING - Packet loss = 28%, RTA = 36.04 ms
[15:34:43] <icinga-wm>	 RECOVERY - Host cp2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms
[15:42:26] <wikibugs>	 (03PS1) 10Ema: cache::ipsec: remove non-jumbo hosts from kafka::nodes [puppet] - 10https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136)
[15:43:03] <robh>	 !log cp2008 repooled after memory swap
[15:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:03] <icinga-wm>	 PROBLEM - Host cp2018.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:46:16] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123634 (10RobH) system has been pushed back into service with the new memory in use
[15:50:14] <icinga-wm>	 RECOVERY - Host cp2018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms
[15:52:24] <icinga-wm>	 PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:56:49] <wikibugs>	 (03PS2) 10Ema: role::kafka::analytics: get rid of ipsec [puppet] - 10https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136)
[15:58:25] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4123709 (10Papaul) DIMM B3 replaced  BIOS update  IDRAC update
[15:58:52] <Krinkle>	 !log Re-enabled puppet and coal on graphite2001
[15:58:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:04] <icinga-wm>	 RECOVERY - Host cp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[15:59:34] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Install git-lfs on scap source and target [puppet] - 10https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) (owner: 10Awight)
[15:59:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Install git-lfs on scap source and target [puppet] - 10https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) (owner: 10Awight)
[15:59:43] <icinga-wm>	 RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK
[15:59:43] <icinga-wm>	 RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK
[15:59:43] <icinga-wm>	 RECOVERY - Host cp2008 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[15:59:44] <icinga-wm>	 RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK
[15:59:44] <icinga-wm>	 RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK
[15:59:44] <icinga-wm>	 RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK
[15:59:44] <icinga-wm>	 RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[15:59:45] <icinga-wm>	 RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK
[15:59:53] <icinga-wm>	 RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK
[15:59:53] <icinga-wm>	 RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK
[15:59:53] <icinga-wm>	 RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK
[15:59:53] <icinga-wm>	 RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK
[15:59:53] <icinga-wm>	 RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK
[15:59:54] <icinga-wm>	 RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK
[15:59:54] <icinga-wm>	 RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK
[15:59:55] <icinga-wm>	 RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK
[15:59:55] <icinga-wm>	 RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK
[15:59:56] <icinga-wm>	 RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK
[15:59:56] <icinga-wm>	 RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK
[16:00:00] <wikibugs>	 (CR) Mobrovac: [C: -1] Switch second bulk of low-traffic jobs for all wikis. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) (owner: Ppchelko)
[16:00:03] <icinga-wm>	 RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[16:00:03] <icinga-wm>	 RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK
[16:00:04] <icinga-wm>	 RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK
[16:00:04] <icinga-wm>	 RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK
[16:00:04] <icinga-wm>	 RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK
[16:00:04] <icinga-wm>	 RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK
[16:00:04] <icinga-wm>	 RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK
[16:00:10] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] "Just noting for completeness that this is not gonna make any difference on tin.eqiad.wmnet as it is not stretch" [puppet] - https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) (owner: Awight)
[16:00:14] <icinga-wm>	 RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK
[16:00:14] <icinga-wm>	 RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[16:00:14] <icinga-wm>	 RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[16:00:23] <icinga-wm>	 RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK
[16:00:23] <icinga-wm>	 RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK
[16:00:24] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[16:00:24] <icinga-wm>	 RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK
[16:00:24] <icinga-wm>	 RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[16:00:34] <icinga-wm>	 RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK
[16:03:35] <wikibugs>	 Operations, ops-codfw, Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123741 (RobH) also note I rebooted cp2008 into the post and debian kernel selection screen 7 times, without any memory post errors.
[16:04:44] <wikibugs>	 Operations, ops-codfw, Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4123756 (Papaul) Open>Resolved
[16:04:58] <wikibugs>	 Operations, ops-codfw, Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (Papaul)
[16:05:01] <wikibugs>	 Operations, ops-codfw, Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123757 (Papaul) Open>Resolved
[16:05:40] <wikibugs>	 (PS1) BBlack: Revert "Depolling eqsin due to router issue" [dns] - https://gerrit.wikimedia.org/r/425552 (https://phabricator.wikimedia.org/T191667)
[16:06:23] <icinga-wm>	 PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:55] <wikibugs>	 (CR) BBlack: [C: 2] Revert "Depolling eqsin due to router issue" [dns] - https://gerrit.wikimedia.org/r/425552 (https://phabricator.wikimedia.org/T191667) (owner: BBlack)
[16:08:13] <wikibugs>	 Operations, Performance-Team, monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4123787 (Krinkle)
[16:09:22] <wikibugs>	 Operations, Performance-Team, monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (Krinkle)
[16:10:04] <icinga-wm>	 RECOVERY - Host cp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[16:10:14] <icinga-wm>	 PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:21] <wikibugs>	 (PS1) Bstorm: wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/425553 (https://phabricator.wikimedia.org/T181650)
[16:10:23] <icinga-wm>	 PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:23] <icinga-wm>	 PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:23] <icinga-wm>	 PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:24] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:24] <icinga-wm>	 PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:24] <icinga-wm>	 PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:24] <icinga-wm>	 PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:25] <icinga-wm>	 PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:44] <icinga-wm>	 PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:44] <icinga-wm>	 PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:44] <icinga-wm>	 PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:50] <robh>	 yeah yeah 
[16:10:53] <icinga-wm>	 PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:53] <icinga-wm>	 PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:53] <icinga-wm>	 PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:10:54] <icinga-wm>	 PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:00] <robh>	 i didnt mean to let cp2011 hit the os but it totally did
[16:11:03] <robh>	 hence these.
[16:11:03] <wikibugs>	 (CR) Jcrespo: [C: 2] "Dumps are running, so I will have to wait to touch this host." [mediawiki-config] - https://gerrit.wikimedia.org/r/425542 (owner: Jcrespo)
[16:11:04] <icinga-wm>	 PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:13] <icinga-wm>	 PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:13] <icinga-wm>	 PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:13] <icinga-wm>	 PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:13] <icinga-wm>	 PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:14] <icinga-wm>	 PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6
[16:11:31] <urandom>	 !log restarting cassandra, dev environment (testing default GC settings) -- T186751
[16:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:37] <stashbot>	 T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751
[16:12:43] <icinga-wm>	 PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:24] <wikibugs>	 (PS3) Elukey: role::analytics_cluster::hadoop:master|standby: enable HDFS trash [puppet] - https://gerrit.wikimedia.org/r/424237 (https://phabricator.wikimedia.org/T189051)
[16:14:10] <elukey>	 !log reboot notebook1001 for kernel updates
[16:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:33] <wikibugs>	 (CR) Elukey: [C: 2] role::analytics_cluster::hadoop:master|standby: enable HDFS trash [puppet] - https://gerrit.wikimedia.org/r/424237 (https://phabricator.wikimedia.org/T189051) (owner: Elukey)
[16:15:33] <icinga-wm>	 RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[16:15:33] <icinga-wm>	 RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK
[16:15:34] <icinga-wm>	 RECOVERY - Host cp2011 is UP: PING WARNING - Packet loss = 28%, RTA = 36.29 ms
[16:15:44] <icinga-wm>	 RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK
[16:15:53] <icinga-wm>	 RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[16:15:54] <icinga-wm>	 RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK
[16:15:54] <icinga-wm>	 RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK
[16:15:54] <icinga-wm>	 RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK
[16:15:54] <icinga-wm>	 RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK
[16:15:54] <icinga-wm>	 RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK
[16:16:04] <icinga-wm>	 RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[16:16:13] <icinga-wm>	 RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK
[16:16:13] <icinga-wm>	 RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK
[16:16:13] <icinga-wm>	 RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK
[16:16:13] <icinga-wm>	 RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK
[16:16:14] <icinga-wm>	 RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK
[16:16:20] <wikibugs>	 (PS1) MusikAnimal: Enable PageAssessments on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393)
[16:16:23] <icinga-wm>	 RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK
[16:16:23] <icinga-wm>	 RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[16:16:24] <icinga-wm>	 RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[16:16:24] <icinga-wm>	 RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK
[16:16:24] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[16:16:33] <icinga-wm>	 RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK
[16:16:33] <icinga-wm>	 RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK
[16:17:14] <icinga-wm>	 RECOVERY - Host cp2018 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms
[16:17:14] <icinga-wm>	 RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 40 ESP OK
[16:17:23] <icinga-wm>	 RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK
[16:17:24] <icinga-wm>	 RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[16:17:24] <icinga-wm>	 RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[16:17:25] <icinga-wm>	 RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[16:17:33] <icinga-wm>	 RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[16:17:33] <icinga-wm>	 RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 40 ESP OK
[16:17:33] <icinga-wm>	 RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[16:17:33] <icinga-wm>	 RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[16:17:34] <icinga-wm>	 RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[16:17:34] <icinga-wm>	 RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK
[16:17:37] <wikibugs>	 Operations, ops-codfw, Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4123838 (Papaul) DIMM A2 replaced DIMM A6 replaced BIOS update IDRAC update
[16:17:53] <icinga-wm>	 RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK
[16:17:53] <icinga-wm>	 RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[16:18:03] <icinga-wm>	 RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 40 ESP OK
[16:18:03] <icinga-wm>	 RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[16:18:03] <icinga-wm>	 RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK
[16:18:03] <icinga-wm>	 RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[16:18:03] <icinga-wm>	 RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK
[16:18:04] <icinga-wm>	 RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK
[16:18:53] <icinga-wm>	 RECOVERY - Check systemd state on notebook1001 is OK: OK - running: The system is fully operational
[16:19:54] <wikibugs>	 Operations, ops-codfw, Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4123845 (RobH) so we rebooted this system half a dozen times through post and kernel section splash screen and no more memory errors.
[16:20:48] <wikibugs>	 Operations, Deployments, Release-Engineering-Team: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123848 (Dzahn) a:Dzahn
[16:21:48] <wikibugs>	 (PS2) Marostegui: wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/425553 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[16:22:04] <wikibugs>	 (PS2) Herron: puppet-agent: log puppet runs via syslog [puppet] - https://gerrit.wikimedia.org/r/425538
[16:22:38] <wikibugs>	 (CR) Marostegui: [C: 2] wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/425553 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[16:23:03] <icinga-wm>	 PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:23:56] <marostegui>	 !log Reload haproxy on dbproxy1011 to depool labsdb1009
[16:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:11] <robh>	 !log cp2011 returned to service
[16:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:04] <icinga-wm>	 PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100%
[16:27:06] <wikibugs>	 (CR) Herron: puppet-agent: log puppet runs via syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425538 (owner: Herron)
[16:27:23] <icinga-wm>	 RECOVERY - Host cp2018 is UP: PING WARNING - Packet loss = 93%, RTA = 36.13 ms
[16:32:04] <icinga-wm>	 PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100%
[16:33:03] <icinga-wm>	 RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:33:04] <icinga-wm>	 RECOVERY - Host cp2018 is UP: PING WARNING - Packet loss = 66%, RTA = 36.02 ms
[16:33:37] <foks>	 !log See T191887
[16:33:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:41] <robh>	 !log cp2018 returned to service
[16:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:19] <wikibugs>	 Operations, Performance-Team, Traffic, Patch-For-Review, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123938 (ayounsi) Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Routing
[16:38:03] <wikibugs>	 Operations, ops-codfw, Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4123940 (RobH) rebooted this half a dozen times after the memory swap, and no memory errors have cropped back up.  pushed back into service.  @papaul: can you please post the return tag tracking numb...
[16:44:15] <elukey>	 !log restart hadoop hdfs namenodes on analytics100[12] to pick up HDFS Trash settings - T189051
[16:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:21] <stashbot>	 T189051: Add trash folder to hadoop - https://phabricator.wikimedia.org/T189051
[16:51:45] <logmsgbot>	 !log sbisson@tin Started deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 to maps-test*
[16:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:01] <logmsgbot>	 !log sbisson@tin Finished deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 to maps-test* (duration: 01m 16s)
[16:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:04] <wikibugs>	 (PS9) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[17:00:05] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1700).
[17:00:05] <jouncebot>	 subbu and raynor: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:15] <subbu>	 o/
[17:00:30] <raynor>	 o/
[17:00:47] <raynor>	 subbu - you can go first - my change will require a bit to test
[17:00:59] <subbu>	 ok .. any swatters?
[17:06:19] <subbu>	 anyone able to swat? :)
[17:08:10] <wikibugs>	 (PS1) Dzahn: remove deploy1001 from scap hosts [puppet] - https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972)
[17:08:33] <wikibugs>	 (CR) Dzahn: [C: 2] remove deploy1001 from scap hosts [puppet] - https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972) (owner: Dzahn)
[17:09:20] <wikibugs>	 (CR) Dzahn: [C: 2] "thanks, fixed it. now it works" [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[17:09:23] <subbu>	 thcipriani, RoanKattouw can one of you?
[17:09:32] <wikibugs>	 (PS10) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[17:10:14] <wikibugs>	 (PS2) Dzahn: remove deploy1001 from scap hosts [puppet] - https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972)
[17:10:16] <RoanKattouw>	 subbu: I'm in a meeting, as is half the foundation
[17:10:24] <subbu>	 ah, ok. 
[17:10:35] <wikibugs>	 (CR) Dzahn: [V: 2 C: 2] remove deploy1001 from scap hosts [puppet] - https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972) (owner: Dzahn)
[17:10:52] <RoanKattouw>	 I've complained previously that 10am PT is a bad time to schedule a SWAT but I'm not sure if that ever made it back to releng
[17:11:09] <thcipriani>	 I can SWAT, give me a few to get setup
[17:11:28] <subbu>	 thcipriani, thanks.
[17:12:17] <wikibugs>	 (CR) Jdlrobson: [C: 1] Deploy page previews for anons on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: Pmiazga)
[17:12:22] <thcipriani>	 RoanKattouw: I'll bring it up at the monday meeting, this was initially to give some space for the train for swat overruns, etc. But it's a tricky time.
[17:12:42] <RoanKattouw>	 Yeah it's just that a lot of teams have standups between the hours of 10 and 11
[17:13:08] <RoanKattouw>	 So both the number of available SWATters and the number of people even wanting to use that window are reduced
[17:14:05] <wikibugs>	 Operations, Deployments, Release-Engineering-Team, Patch-For-Review: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4124068 (Dzahn) deploy1001 has been removed from scap hosts and puppet ran on tin.  This should have fixed the immediate scap issue.
[17:14:31] <thcipriani>	 Tuesday swat has caught me by surprise quite a bit
[17:14:38] <thcipriani>	 or quite often rather
[17:15:02] <wikibugs>	 (PS3) Thcipriani: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: Subramanya Sastry)
[17:15:08] <wikibugs>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: Subramanya Sastry)
[17:16:36] <wikibugs>	 (Merged) jenkins-bot: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: Subramanya Sastry)
[17:19:58] <thcipriani>	 subbu: ^ is on mwdebug1002, check please
[17:20:28] <subbu>	 thanks. will do.
[17:21:06] <wikibugs>	 Operations, Deployments, Release-Engineering-Team, Patch-For-Review: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4124089 (Dzahn) Open>Resolved
[17:21:38] <subbu>	 thcipriani, lgtm on two of the wikis. as long as there are no errors, good to go.
[17:21:56] <thcipriani>	 logs look good, syncing
[17:22:09] <subbu>	 k
[17:24:07] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:423496|Enable RemexHtml on wikis with <50 issues in high priority linter cats]] T190731 (duration: 00m 59s)
[17:24:08] <wikibugs>	 (PS2) Madhuvishy: dumps: Remove stat1005|6 from nfs clients for dataset1001 [puppet] - https://gerrit.wikimedia.org/r/423733 (https://phabricator.wikimedia.org/T188644)
[17:24:13] <thcipriani>	 subbu: live now
[17:24:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:13] <stashbot>	 T190731: Enable RemexHTML on additional wikis with < 50 errors in all high priority categories - https://phabricator.wikimedia.org/T190731
[17:24:23] <subbu>	 great. thanks.
[17:24:52] <wikibugs>	 (PS3) Thcipriani: Deploy page previews for anons on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: Pmiazga)
[17:25:02] <wikibugs>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: Pmiazga)
[17:25:15] <wikibugs>	 Operations, ops-codfw, Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4124125 (Papaul)
[17:25:18] <wikibugs>	 Operations, ops-codfw, Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4124122 (Papaul) Open>Resolved a:Papaul>None I do not have them Dell tech already took all the boxes
[17:25:36] <wikibugs>	 Operations, ops-codfw, Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (Papaul)
[17:25:39] <mutante>	 urandom: could i possible enable and run puppet on restbase-dev1004 (to confirm my change to disable icinga checks for cassandra if on 'dev' works)
[17:25:43] <wikibugs>	 Operations, ops-codfw, Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4124126 (Papaul) Open>Resolved
[17:26:39] <wikibugs>	 (Merged) jenkins-bot: Deploy page previews for anons on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: Pmiazga)
[17:27:32] <wikibugs>	 (CR) Madhuvishy: [C: 2] dumps: Remove stat1005|6 from nfs clients for dataset1001 [puppet] - https://gerrit.wikimedia.org/r/423733 (https://phabricator.wikimedia.org/T188644) (owner: Madhuvishy)
[17:27:45] <thcipriani>	 raynor: your change is live on mwdebug1002, check please
[17:28:21] <raynor>	 ok, thanks thcipriani: I'
[17:28:23] <logmsgbot>	 !log sbisson@tin Started deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 everywhere
[17:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:32] <raynor>	 I'm testing that
[17:29:14] <thcipriani>	 okie doke, looks like you've got a good checklist so take your time :)
[17:29:48] <Krinkle>	 !log actually re-enabled puppet on graphite2001
[17:29:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:51] <logmsgbot>	 !log sbisson@tin Finished deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 everywhere (duration: 02m 27s)
[17:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:28] <wikibugs>	 Operations, monitoring, Patch-For-Review: restbase: skip (some) icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4124163 (Dzahn) The change above will now ensure that cassandra Icinga checks are not added if on the dev cluster.  We don't see the results yet because p...
[17:32:42] <wikibugs>	 Operations, monitoring, Patch-For-Review: restbase/cassandra: skip (some) icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4124165 (Dzahn)
[17:38:03] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[17:38:17] <raynor>	 thcipriani - looks good on mwdebug1002
[17:38:22] <raynor>	 please deploy to production
[17:38:32] * thcipriani does
[17:38:51] <Niharika>	 thcipriani: Are you a gerrit admin? 
[17:40:35] <wikibugs>	 (Abandoned) Dzahn: prometheus: ganglia-gen outdated resource names (WIP) [puppet] - https://gerrit.wikimedia.org/r/409390 (https://phabricator.wikimedia.org/T186918) (owner: Dzahn)
[17:40:56] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:425522|Deploy page previews for anons on dewiki]] T191966 (duration: 00m 54s)
[17:41:02] <thcipriani>	 raynor: ^ live now
[17:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:02] <stashbot>	 T191966: Deploy page previews for anons on dewiki - https://phabricator.wikimedia.org/T191966
[17:41:04] <thcipriani>	 Niharika: I don't know. I do have some special powers on gerrit, I think.
[17:41:24] <raynor>	 thanks thcipriani 
[17:41:28] <raynor>	 let us check it
[17:41:42] <Niharika>	 thcipriani: Can you grant +2 access on mediawiki-config to two people? https://phabricator.wikimedia.org/T189414 and https://phabricator.wikimedia.org/T161181 
[17:42:16] <Niharika>	 I'm going to do a deployment training with them. They are both staffers and have been around for a while.
[17:42:41] <thcipriani>	 hrm, I've never tried to do that...
[17:42:42] * thcipriani digs
[17:42:43] <icinga-wm>	 PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 62873 keys, up 15 hours 15 minutes - replication_delay is 626
[17:43:48] <wikibugs>	 (CR) jenkins-bot: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: Subramanya Sastry)
[17:43:51] <wikibugs>	 (CR) jenkins-bot: Deploy page previews for anons on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: Pmiazga)
[17:44:03] <icinga-wm>	 PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 640 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 63066 keys, up 15 hours 19 minutes - replication_delay is 640
[17:44:33] <raynor>	 everything works as expected thcipriani thanks for deployment
[17:44:49] <thcipriani>	 raynor: thanks for testing!
[17:45:04] <Niharika>	 thcipriani: Maybe you see something on this screen - https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config
[17:45:09] <Niharika>	 You are a gerrit admin. 
[17:46:11] <bd808>	 thcipriani: they need to be added to https://gerrit.wikimedia.org/r/#/admin/groups/21,members to give +2 typically.
[17:49:15] <thcipriani>	 yep, just found that group after flailing around a bit
[17:49:27] <wikibugs>	 Operations, Cassandra, hardware-requests, Services (blocked), User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4124258 (RobH) The order for this is escalated for placement.  This should arrive sometime next week.  (Just upda...
[17:53:12] <thcipriani>	 Niharika: musikanimal: MusikAnimal is the gerrit username, correct?
[17:53:23] <Niharika>	 thcipriani: Yes. 
[17:53:27] <musikanimal>	 yep
[17:53:53] <thcipriani>	 all added
[17:54:07] <Niharika>	 thcipriani: And samwilson too, right?
[17:54:12] <Niharika>	 Thank you so much!
[17:54:15] <thcipriani>	 Niharika: yes indeed
[17:54:38] <thcipriani>	 sure thing, glad to have more deployers :)
[17:54:43] <icinga-wm>	 PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 62765 keys, up 15 hours 24 minutes - replication_delay is 653
[17:57:04] <icinga-wm>	 PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1424 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 63066 keys, up 15 hours 32 minutes - replication_delay is 1424
[17:57:43] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/ores on deploy1001 is OK: No errors detected
[17:58:03] <icinga-wm>	 RECOVERY - Disk space on deploy1001 is OK: DISK OK
[17:58:03] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/maps on deploy1001 is OK: No errors detected
[17:58:04] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is OK: No errors detected
[17:58:04] <icinga-wm>	 RECOVERY - Check size of conntrack table on deploy1001 is OK: OK: nf_conntrack is 0 % full
[17:58:04] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is OK: No errors detected
[17:58:04] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/cassandra on deploy1001 is OK: No errors detected
[17:58:04] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is OK: No errors detected
[17:58:05] <icinga-wm>	 RECOVERY - confd service on deploy1001 is OK: OK - confd is active
[17:58:05] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on deploy1001 is OK: No changes to merge.
[17:58:06] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on deploy1001 is OK: OK ferm input default policy is set
[17:58:06] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/parsoid on deploy1001 is OK: No errors detected
[17:58:07] <icinga-wm>	 RECOVERY - MD RAID on deploy1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:58:14] <icinga-wm>	 RECOVERY - configured eth on deploy1001 is OK: OK - interfaces up
[17:58:14] <icinga-wm>	 RECOVERY - Confd template for /etc/dsh/group/jobrunner on deploy1001 is OK: No errors detected
[17:58:14] <icinga-wm>	 RECOVERY - dhclient process on deploy1001 is OK: PROCS OK: 0 processes with command name dhclient
[17:59:04] <icinga-wm>	 RECOVERY - DPKG on deploy1001 is OK: All packages OK
[17:59:04] <icinga-wm>	 PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 621 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 62926 keys, up 15 hours 43 minutes - replication_delay is 621
[17:59:04] <musikanimal>	 thanks!
[17:59:14] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1800)
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:01:03] <icinga-wm>	 RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:03:14] <icinga-wm>	 PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4766041 keys, up 15 hours 53 minutes - replication_delay is 633
[18:05:54] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is OK: Files ownership is ok.
[18:10:04] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on deploy1001 is OK: OK: synced at Wed 2018-04-11 18:10:00 UTC.
[18:10:24] <icinga-wm>	 RECOVERY - IPMI Sensor Status on deploy1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[18:10:30] <wikibugs>	 (PS2) Madhuvishy: nfsclient: Cleanup absented dumps mount from labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423728 (https://phabricator.wikimedia.org/T188643)
[18:11:25] <wikibugs>	 (CR) Madhuvishy: [C: 2] nfsclient: Cleanup absented dumps mount from labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423728 (https://phabricator.wikimedia.org/T188643) (owner: Madhuvishy)
[18:11:47] <mutante>	 !log deploy1001 is back on stretch once again - it has been removed from scap hosts though (T175288 T185275)
[18:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:54] <stashbot>	 T185275: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275
[18:11:54] <stashbot>	 T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288
[18:14:53] <icinga-wm>	 PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 62887 keys, up 16 hours 2 minutes - replication_delay is 616
[18:16:48] <wikibugs>	 (PS2) Madhuvishy: dumps: Turn off cron that rsyncs to labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423731 (https://phabricator.wikimedia.org/T188643)
[18:17:31] <wikibugs>	 (CR) Madhuvishy: [C: 2] dumps: Turn off cron that rsyncs to labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423731 (https://phabricator.wikimedia.org/T188643) (owner: Madhuvishy)
[18:21:24] <wikibugs>	 (PS2) Madhuvishy: nfs: Stop exporting dumps from labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423727 (https://phabricator.wikimedia.org/T188643)
[18:22:15] <wikibugs>	 (CR) Madhuvishy: [C: 2] nfs: Stop exporting dumps from labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423727 (https://phabricator.wikimedia.org/T188643) (owner: Madhuvishy)
[18:25:07] <wikibugs>	 (PS3) Madhuvishy: dumps: Clean up code that rsyncs to labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423732 (https://phabricator.wikimedia.org/T188643)
[18:25:52] <wikibugs>	 (CR) Madhuvishy: [C: 2] dumps: Clean up code that rsyncs to labstore1003 [puppet] - https://gerrit.wikimedia.org/r/423732 (https://phabricator.wikimedia.org/T188643) (owner: Madhuvishy)
[18:26:05] <wikibugs>	 (PS3) Herron: puppet-agent: log puppet runs via syslog [puppet] - https://gerrit.wikimedia.org/r/425538 (https://phabricator.wikimedia.org/T75989)
[18:35:28] <wikibugs>	 (CR) Chad: [C: -1] "Per discussion on IRC the other day, we want to serve this from a separate vhost over something like gerrit.wmfusercontent.org." [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: Paladox)
[18:35:41] <icinga-wm>	 PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 63347 keys, up 16 hours 22 minutes - replication_delay is 629
[18:38:50] <icinga-wm>	 PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 63136 keys, up 16 hours 22 minutes - replication_delay is 619
[18:41:48] <wikibugs>	 (CR) Chad: [C: 1] Gerrit: Switch gc back on [puppet] - https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: Paladox)
[18:43:16] <wikibugs>	 (CR) Chad: [V: 2 C: 2] add plugin avatars-external [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/424710 (owner: Paladox)
[18:45:19] <urandom>	 mutante: ugh
[18:47:22] <wikibugs>	 (PS1) Pmiazga: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T191101)
[18:47:26] <urandom>	 !log restarting cassandra, dev environment (set -XX:+PerfDisableSharedMem) -- T186751
[18:47:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:33] <stashbot>	 T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751
[18:47:56] <mutante>	 urandom: no worries, it can wait, the compiler said it's all good 
[18:48:10] <wikibugs>	 (PS1) Ladsgroup: Add badge for good lists [mediawiki-config] - https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976)
[18:48:15] <urandom>	 mutante: i need to figure out something here anyway...
[18:48:25] <urandom>	 i've had it disabled too long
[18:48:57] <urandom>	 but yeah, if it can wait, let me see about getting it sorted
[18:49:35] <wikibugs>	 (CR) Jdlrobson: [C: 1] Enable Page Previews for 10% enwiki anon users [mediawiki-config] - https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T191101) (owner: Pmiazga)
[18:49:49] <mutante>	 urandom: take your time
[18:50:25] <wikibugs>	 (PS2) Ladsgroup: Add badge for good lists [mediawiki-config] - https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976)
[18:51:58] <wikibugs>	 (PS2) Herron: base: auto logout idle bash shells after 2 days [puppet] - https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922)
[18:52:47] <mutante>	 no_justification: do you want me to merge the GC change now (with or without restart )
[18:53:15] <no_justification>	 If you want. I'm OOO today
[18:54:42] <mutante>	 i'm also OOO
[18:55:51] <wikibugs>	 (PS3) Ladsgroup: Add badge for good lists [mediawiki-config] - https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976)
[18:56:10] <wikibugs>	 (PS2) Jdlrobson: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga)
[19:00:04] <jouncebot>	 thcipriani: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1900).
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:00:52] <thcipriani>	 this may be a reference I don't get.
[19:01:08] <thcipriani>	 but I am working on the train.
[19:02:07] <wikibugs>	 (PS5) Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183)
[19:05:14] <urandom>	 mutante: fyi, in the interest of getting a clean diff of some live hacks I've made, I just ran puppet on restbase-dev1004
[19:05:49] <paladox>	 no_justification done
[19:05:51] <mutante>	 urandom: thanks, i think that's all i needed
[19:06:00] <paladox>	 for gerrit.wmfusercontent.org
[19:06:03] <wikibugs>	 (CR) Chad: [C: -1] Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: Paladox)
[19:06:47] <paladox>	 no_justification yep, though i had a look in apache.conf template and it seems it does ServerAlias with that config (ie as an array)
[19:07:26] <paladox>	 no_justification ie
[19:07:26] <paladox>	 ServerAlias <%= Array(@slave_hosts).join(' ') %>
[19:08:08] <mutante>	 urandom: it works. running puppet on icinga server and the cassandra checks are being removed because it's a "dev" 
[19:08:15] <mutante>	 that was what i wanted 
[19:08:24] <urandom>	 mutante: cool
[19:08:47] <urandom>	 i'm still going to work on getting this to the state where i can re-enable puppet
[19:08:58] <wikibugs>	 (CR) Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: Paladox)
[19:09:55] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.29/includes/libs/rdbms/database: [[gerrit:425566|rdbms: fix transaction flushing in Database::close]] T191916 (duration: 01m 01s)
[19:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:01] <stashbot>	 T191916: Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBUnexpectedError' with message 'Wikimedia\Rdbms\Database::close: Expected mass commit of all peer transactions (DBO_TRX set).' in /srv/mediawiki/php-1.31.0-wmf.29/includes/libs/rdbms/database/Database.php:3602 - https://phabricator.wikimedia.org/T191916
[19:11:32] <logmsgbot>	 !log thcipriani@tin rebuilt and synchronized wikiversions files: testwiki back to 1.31.0-wmf.29
[19:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:54] <wikibugs>	 (PS1) Catrope: Allow sysops to create Flow boards on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500)
[19:15:28] <icinga-wm>	 RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4849 keys, up 16 hours 59 minutes - replication_delay is 0
[19:15:32] <mutante>	 urandom: i am done with my part, all up to you :)
[19:15:37] <icinga-wm>	 RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5018 keys, up 17 hours 2 minutes - replication_delay is 0
[19:16:16] <wikibugs>	 (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/10906/bast1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922) (owner: Herron)
[19:16:22] <wikibugs>	 (PS3) Herron: base: auto logout idle bash shells after 2 days [puppet] - https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922)
[19:17:38] <icinga-wm>	 PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 5630 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 62765 keys, up 16 hours 47 minutes - replication_delay is 5630
[19:17:54] <wikibugs>	 (CR) Herron: [C: 2] base: auto logout idle bash shells after 2 days [puppet] - https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922) (owner: Herron)
[19:18:08] <icinga-wm>	 RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 63445 keys, up 17 hours 2 minutes - replication_delay is 0
[19:18:15] <wikibugs>	 (PS1) Thcipriani: Group0 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425595
[19:19:18] <icinga-wm>	 RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4622 keys, up 17 hours 6 minutes - replication_delay is 0
[19:19:53] <wikibugs>	 (CR) Thcipriani: [C: 2] Group0 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425595 (owner: Thcipriani)
[19:20:08] <icinga-wm>	 RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4707403 keys, up 17 hours 9 minutes - replication_delay is 0
[19:20:43] <wikibugs>	 (CR) Chad: [C: -1] Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: Paladox)
[19:21:08] <wikibugs>	 (Merged) jenkins-bot: Group0 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425595 (owner: Thcipriani)
[19:21:23] <wikibugs>	 (CR) jenkins-bot: Group0 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425595 (owner: Thcipriani)
[19:21:38] <icinga-wm>	 RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 62813 keys, up 16 hours 51 minutes - replication_delay is 0
[19:23:28] <logmsgbot>	 !log thcipriani@tin rebuilt and synchronized wikiversions files: Group0 to 1.31.0-wmf.29
[19:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:48] <icinga-wm>	 RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4437 keys, up 16 hours 58 minutes - replication_delay is 0
[19:26:18] <icinga-wm>	 RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4495 keys, up 17 hours 1 minutes - replication_delay is 0
[19:44:13] <wikibugs>	 (PS1) Andrew Bogott: WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648)
[19:44:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648) (owner: Andrew Bogott)
[19:46:06] <wikibugs>	 (PS1) Thcipriani: Group1 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425599
[19:46:17] <wikibugs>	 (PS2) Andrew Bogott: WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648)
[19:49:45] <wikibugs>	 (CR) Andrew Bogott: [C: 2] WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648) (owner: Andrew Bogott)
[19:51:26] <wikibugs>	 (CR) Thcipriani: [C: 2] Group1 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425599 (owner: Thcipriani)
[19:52:42] <wikibugs>	 (Merged) jenkins-bot: Group1 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425599 (owner: Thcipriani)
[19:52:48] <wikibugs>	 (PS1) Ppchelko: Enable EventBus for job events for all the wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464)
[19:52:58] <wikibugs>	 (CR) jenkins-bot: Group1 to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425599 (owner: Thcipriani)
[19:57:48] <wikibugs>	 (CR) Ppchelko: [C: -1] "Because blocked by T192005" [mediawiki-config] - https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: Ppchelko)
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T2000).
[20:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[20:00:13] <logmsgbot>	 !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.29
[20:00:22] <wikibugs>	 Operations, Puppet, Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4124737 (Andrew) > Fixing this looks to be as easy as passing $service_settings => '--die-on-term'in openstack::puppet::master::encapi  Indeed, t...
[20:00:24] <wikibugs>	 (CR) Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: Paladox)
[20:00:25] <awight>	 ORES has some minor patches to roll out.
[20:00:30] * halfak watches
[20:00:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:00] <awight>	 halfak: fyi, I’m starting slow, just gonna deploy to the (new) canary, ores1001 for starters.
[20:01:21] * awight dons hairshirt in hope of not repeating words
[20:02:09] <logmsgbot>	 !log awight@tin Started deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071
[20:02:10] <logmsgbot>	 !log thcipriani@tin Synchronized php: group1 to 1.31.0-wmf.29 (duration: 01m 16s)
[20:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:26] <stashbot>	 T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071
[20:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:17] <wikibugs>	 (CR) Ppchelko: "Actually, since $wmgUseEventBus is still respected, this change is a no-op, but must be deployed before Id1d043e5ce02e73b51f75ee54575647b5" [mediawiki-config] - https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: Ppchelko)
[20:04:40] <subbu>	 nothing to deploy for parsoid
[20:05:20] <mdholloway>	 no deploy for mobileapps
[20:20:42] <logmsgbot>	 !log awight@tin Finished deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071 (duration: 18m 34s)
[20:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:48] <stashbot>	 T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071
[20:22:05] <wikibugs>	 (PS4) Awight: Update ORES venv path to use versioned cache [puppet] - https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071)
[20:22:33] <wikibugs>	 (PS5) Awight: Update ORES venv path to use versioned cache [puppet] - https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071)
[20:23:21] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4124827 (awight)
[20:26:35] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4124865 (awight) @akosiaris We're finally ready to deploy the puppet cha...
[20:27:09] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3851398 (awight) p:Triage>High
[20:27:24] <halfak>	 \o/ 
[20:28:52] <bawolff>	 I'm having intermittent timeouts when connecting to gerrit.wikimedia.org
[20:29:52] <awight>	 Pchelolo: Feel like kicking this patch today?  https://gerrit.wikimedia.org/r/#/c/424145/
[20:34:19] <_joe_>	 bawolff: still ongoing?
[20:34:45] <bawolff>	 No, it seems to be better now
[20:35:00] <bawolff>	 At least for the moment
[20:35:20] <bawolff>	 and it was only when connecting via ssh, not the web interface
[20:35:22] <wikibugs>	 (PS2) Hashar: Tag 'latest' during build instead of at publishing [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/398265
[20:36:22] <Pchelolo>	 awight: can we do it tomorrow UTC morning? We have a huge undeployed backlog of patches for Change-Prop that we need to deploy with great caution, it's super late for mobrovac and kinda late for me already
[20:36:52] <urandom>	 !log increase change-prop sample rate in dev env to 40% (from 20) -- T186751
[20:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:58] <stashbot>	 T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751
[20:42:00] <awight>	 Pchelolo: Sounds good to me, thanks!  It would be good if Scoring staff were monitoring, so let me know what time you’re thinking of deploying once it becomes more concrete.
[20:42:51] <Pchelolo>	 kk, I'm in UTC-3 and I start working pretty early
[20:46:50] <awight>	 :)
[20:50:03] <awight>	 Pchelolo: fyi, these are the main graphs to look at, if we’re unable to be awake: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=3&fullscreen&orgId=1 will show how many scores each machine is handling.  We expect that to increase by 50-100%, but keeping the same shape.  Danger signs would be if any machine stops processing scores.  Also, https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=2&fullscreen&orgId
[20:50:04] <awight>	 might have a small spike of up to maybe 10 errors per minute during deployment, but should otherwise stay close to zero at all times.
[20:51:20] <Pchelolo>	 kk
[20:51:28] <Pchelolo>	 acknowledged awight 
[20:51:40] <awight>	 whew!  sorry about the brain dump, I should put that on a wiki page.
[20:51:55] <Pchelolo>	 we will probably just get rid of our deploy backlog tomorrow morning and make another deploy as you wake up
[20:52:48] <awight>	 :) Our patch isn’t a huge rush, I’ve just been energized by *nearly* unblocking on some other ORES stuff.
[21:18:25] <wikibugs>	 (PS1) Rush: openstack: add bootstrap instructions for provider created vxlan [puppet] - https://gerrit.wikimedia.org/r/425713 (https://phabricator.wikimedia.org/T188266)
[21:18:35] <wikibugs>	 (PS2) MusikAnimal: Enable PageAssessments on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393)
[21:19:32] <wikibugs>	 (PS2) MusikAnimal: Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697)
[21:20:18] <wikibugs>	 (CR) Rush: [C: 2] openstack: add bootstrap instructions for provider created vxlan [puppet] - https://gerrit.wikimedia.org/r/425713 (https://phabricator.wikimedia.org/T188266) (owner: Rush)
[21:41:14] <wikibugs>	 (CR) Thcipriani: [C: 2] "tests pass, workflow makes sense, tested working locally." [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/398265 (owner: Hashar)
[21:41:46] <wikibugs>	 (Merged) jenkins-bot: Tag 'latest' during build instead of at publishing [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/398265 (owner: Hashar)
[21:42:02] <mutante>	 what's "dashiki" ?  as in "labs-project-dashiki" 
[21:42:37] <James_F>	 mutante: The wiki-configured stats system from Analytics.
[21:43:09] <mutante>	 thanks James_F 
[21:43:40] <James_F>	 I don't know what that particular WMCloud instance is specifically for, though, sorry.
[21:51:52] <nuria_>	 bblack: yt?
[21:53:44] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] "The reason to tag only at publishing was to keep the "latest" tag consistent with the remote repository, but it's admittedly useless and c" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/398265 (owner: Hashar)
[22:00:05] <jouncebot>	 samwilson and musikanimal: (Dis)respected human, time to deploy Logging and PageAssessments (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T2200). Please do the needful.
[22:00:05] <jouncebot>	 samwilson and musikanimal: A patch you scheduled for Logging and PageAssessments is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[22:00:09] <mutante>	 _joe_: deploy1001 is on stretch again (but removed from scap hosts right now). tin is still active deployment_server and on jessie. i also have a ticket to upgrade naos.codfw to stretch. should i go ahead and do that now? i mean it's codfw
[22:00:41] <mutante>	 well the ticket says to also rename it to deploy2001
[22:00:48] <_joe_>	 no, let's wait for a week until it's clearer what we want to do
[22:00:54] <mutante>	 ok
[22:01:54] <_joe_>	 we still have to make some decisions around what we will do, with releng and other people involved in the php7 migration
[22:03:05] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (Dzahn) adding @RStallman-legalteam for the NDA step
[22:03:32] <mutante>	 *nod* i figured that. thanks
[22:04:18] <wikibugs>	 (CR) MusikAnimal: [C: 2] Enable PageAssessments on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) (owner: MusikAnimal)
[22:05:36] <wikibugs>	 (Merged) jenkins-bot: Enable PageAssessments on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) (owner: MusikAnimal)
[22:07:17] <wikibugs>	 (CR) Dzahn: [C: 1] "it will also add him on the bastion hosts via the "all-users" special group and magic" [puppet] - https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: ArielGlenn)
[22:09:20] <wikibugs>	 (CR) jenkins-bot: Enable PageAssessments on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) (owner: MusikAnimal)
[22:12:23] <wikibugs>	 (CR) Dzahn: "it would affect all these:" [puppet] - https://gerrit.wikimedia.org/r/415510 (owner: Dzahn)
[22:13:09] <wikibugs>	 (Abandoned) Dzahn: cache::misc: switch webserver_misc_static to codfw backend [puppet] - https://gerrit.wikimedia.org/r/420142 (https://phabricator.wikimedia.org/T188163) (owner: Dzahn)
[22:13:52] <logmsgbot>	 !log musikanimal@tin Synchronized wmf-config/InitialiseSettings.php: Enabling PageAssessments on frwiki (T153393) (duration: 01m 26s)
[22:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:58] <stashbot>	 T153393: Deploy PageAssessments to French wikipedia - https://phabricator.wikimedia.org/T153393
[22:16:10] <wikibugs>	 (CR) MusikAnimal: [C: 2] Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: MusikAnimal)
[22:16:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: MusikAnimal)
[22:17:36] <wikibugs>	 (PS3) MusikAnimal: Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697)
[22:17:51] <wikibugs>	 (CR) MusikAnimal: [C: 2] Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: MusikAnimal)
[22:19:06] <wikibugs>	 (Merged) jenkins-bot: Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: MusikAnimal)
[22:19:59] <mutante>	 awight: so i just looked at the ores change and then saw your ticket comment.. i see that would only affect all ores1* machines but not also the ones with ores::redis, right?  also the admin groups look like you have the powers to do the restarts. i would merge that if you are here to check on it afterwards
[22:20:56] <mutante>	 could also run the right restart command via cumin
[22:21:10] <wikibugs>	 (CR) jenkins-bot: Enable PageAssessments on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: MusikAnimal)
[22:23:02] <awight>	 mutante: Great, yes it should work like you described.  I can run the restarts via pssh :D
[22:23:25] <bstorm_>	 !log views updated on labsdb1009
[22:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:31] <wikibugs>	 (PS6) Dzahn: Update ORES venv path to use versioned cache [puppet] - https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) (owner: Awight)
[22:24:49] <logmsgbot>	 !log musikanimal@tin Synchronized wmf-config/InitialiseSettings.php: Enabling PageAssessments on huwiki (T191697) (duration: 01m 17s)
[22:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:56] <stashbot>	 T191697: Deploy PageAssessments to Hungarian Wikipedia - https://phabricator.wikimedia.org/T191697
[22:25:29] <wikibugs>	 (CR) Dzahn: [C: 2] Update ORES venv path to use versioned cache [puppet] - https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) (owner: Awight)
[22:26:09] <milimetric>	 labs-project-dashiki I’m not too sure, mutante, but we call our custom dashboarding tool dashiki: https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Dashboards
[22:26:11] <wikibugs>	 (CR) Pnorman: [C: 1] "I double-checked, and the admin tables are currently owned by postgres, so this is okay" [puppet] - https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) (owner: Gehel)
[22:28:06] <mutante>	 !log ores - running puppet on all instances to apply venv path change for T181071
[22:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:14] <stashbot>	 T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071
[22:28:17] <wikibugs>	 (PS2) Samwilson: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466
[22:28:36] <awight>	 mutante: Thanks!  I see that ores1001 already has the updated config, so I’ll try a canary restart.
[22:28:39] <mutante>	 awight: running puppet on all of them ..applying the config change (not restarting)
[22:28:48] <mutante>	 awight: yes, i did it on 1001 manually and now on all
[22:29:08] <mutante>	 milimetric: thanks! ok
[22:29:16] <awight>	 mutante: ah, you already restarted I see
[22:29:40] <mutante>	 awight: no, i did not. then puppet did it
[22:29:46] <mutante>	 yes, puppet
[22:30:18] <awight>	 !! ok that’s odd, it must have been a dependency within the systemd module.
[22:30:19] <mutante>	 !log ores - all eqiad instances are being restarted by puppet after config change
[22:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:39] <mutante>	 indeed
[22:30:40] <mutante>	 Info: Base::Service_unit[uwsgi-ores]: Scheduling refresh of Exec[systemd reload for uwsgi-ores]
[22:30:46] <wikibugs>	 (PS1) Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/425718
[22:31:27] <awight>	 Wow, it worked.
[22:31:31] <mutante>	 pphew :)
[22:31:44] <awight>	 mutante: Last thing to bug you about, how do you suggest we clean up the now-unused virtualenv path?
[22:31:45] <wikibugs>	 (CR) MaxSem: [C: 2] Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466 (owner: Samwilson)
[22:31:47] <wikibugs>	 (PS2) Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/425718
[22:31:47] <mutante>	 there is still codfw to go and eqiad are 6/9
[22:31:55] <wikibugs>	 (CR) Samwilson: [V: 2] Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466 (owner: Samwilson)
[22:31:57] <awight>	 nice
[22:33:00] <wikibugs>	 (Merged) jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466 (owner: Samwilson)
[22:33:14] <wikibugs>	 (CR) jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466 (owner: Samwilson)
[22:33:39] <mutante>	 awight: ok, now it's actually done on eqiad. 100.0% (9/9) success ratio
[22:33:45] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (RobH) I'm not sure why L2 was listed as a requirement?  The phabricator NDA doesn't really work into any workflow that I'm aware of, we require an NDA on...
[22:33:45] <mutante>	 doing codfw
[22:36:01] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4125249 (RStallman-legalteam) @Matthias_Geisler_WMDE - will reach out to you via email in the next day or so with the NDA. Thanks!
[22:36:10] <mutante>	 awight: i suggest:  i run:   rm -rf /srv/deployment/ores/venv  on  ores1001   and then you restart it again?
[22:36:38] <mutante>	 and then i run that on all 
[22:36:59] <awight>	 mutante: That sounds right to me.
[22:37:29] <mutante>	 !log ores - same for codfw instances, change of venv path to /srv/deployment/ores/deploy/venv/
[22:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:40] <mutante>	 !log ores1001 - rm -rf /srv/deployment/ores/venv/
[22:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:13] <mutante>	 awight: done on 1001
[22:38:58] <awight>	 I’ll restart services there.
[22:39:42] <awight>	 mutante: Still healthy.
[22:40:07] <mutante>	 awight: great, will delete it on all eqiad. codfw is still running puppet
[22:41:40] <mutante>	 !log ores1002-1009 - deleting old venv dir - rm -f /srv/deployment/ores/venv (T181071)
[22:41:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:46] <stashbot>	 T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071
[22:42:41] <mutante>	 awight: done.. eh.. i just happened to see this: https://phabricator.wikimedia.org/rORESDEPLOYadc5c06417290c9980cc2b35d599c6da13ea24c6
[22:42:48] <mutante>	 what about that "install libs in old path" 
[22:43:06] <wikibugs>	 (PS1) Ladsgroup: Stop logging autopatrol actions everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485)
[22:44:53] <awight>	 mutante: That’s the last piece of the migration, it stops us from rebuilding /srv/deployment/ores/venv
[22:45:07] <awight>	 but I think you grokked that, maybe I missed the question?
[22:46:18] <mutante>	 awight: ah. yea so then "venv_old" is already gone. then i get it. 
[22:46:47] <awight>	 It’s awesome to finally get to this point!  Such a simple change, but with many breakable pieces...
[22:47:39] <mutante>	 !log ores2* - puppet ran to change venv config, then 'rm -rf /srv/deployment/ores/venv/' via cumin to clean-up (T181071)
[22:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:44] <stashbot>	 T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071
[22:47:49] <mutante>	 awight: ^ that's it, you can restart it all 
[22:47:50] <logmsgbot>	 !log samwilson@tin Synchronized wmf-config/InitialiseSettings.php: Deploy GlobalPreferences T184121 (duration: 01m 17s)
[22:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:55] <stashbot>	 T184121: Deploy checklist for GlobalPreferences on production - https://phabricator.wikimedia.org/T184121
[22:49:44] <mutante>	 oops, missed ores1009 because it had been reinstalled or something.. also done now
[22:49:51] <awight>	 mutante: wicked.  ok, doing so.
[23:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T2300).
[23:00:05] <jouncebot>	 RoanKattouw and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:17] <Dereckson>	 Hello, I can SWAT this evening.
[23:00:20] <Amir1>	 o/
[23:00:27] <Dereckson>	 But first let's check if samwilson is done
[23:01:01] <samwilson>	 Dereckson: yep, we're all done for now
[23:01:05] <Dereckson>	 Thanks
[23:01:50] <wikibugs>	 (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) (owner: Ladsgroup)
[23:02:41] <Amir1>	 Dereckson: not testable. It's config switch that has been working for a while though
[23:03:05] <wikibugs>	 (Merged) jenkins-bot: Stop logging autopatrol actions everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) (owner: Ladsgroup)
[23:04:02] <awight>	 mutante: Everything has been restarted.  Thanks for your time!
[23:07:33] <Dereckson>	 Amir1: it seems there isn't any mayhem on logs (enabled on mwdebug1002)
[23:07:46] <Dereckson>	 so yes, it seems indeed fine to sync
[23:08:11] <Amir1>	 cool
[23:09:53] <logmsgbot>	 !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Stop logging autopatrol actions everywhere (T184485) (duration: 01m 18s)
[23:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:59] <stashbot>	 T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485
[23:10:33] <Amir1>	 Thanks!
[23:11:00] <wikibugs>	 (PS2) Dereckson: Allow sysops to create Flow boards on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: Catrope)
[23:13:56] <Dereckson>	 RoanKattouw: ping?
[23:14:26] <mutante>	 awight: :) welcome
[23:14:31] <RoanKattouw>	 Here
[23:14:57] <Dereckson>	 Hello, let's SWAT sysop flow permission to create board for eu.
[23:15:29] <wikibugs>	 (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: Catrope)
[23:16:43] <wikibugs>	 (Merged) jenkins-bot: Allow sysops to create Flow boards on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: Catrope)
[23:17:03] <RoanKattouw>	 Sure
[23:17:10] <RoanKattouw>	 lmk when it's on mwdebug1002
[23:17:11] <Dereckson>	 If you've an eu. admin available, they can test on mwdebug1002 if it works (if not we can check https://eu.wikipedia.org/wiki/Berezi:ListGroupRights)
[23:17:21] <Dereckson>	 just now 
[23:17:31] <RoanKattouw>	 I'll just check the special page
[23:17:57] <RoanKattouw>	 The special page looks good, that's good enough for me
[23:19:44] <logmsgbot>	 !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Allow sysops to create Flow boards on euwiki (T190500) (duration: 01m 17s)
[23:19:44] <Dereckson>	 Syncing
[23:19:48] <Dereckson>	 Synced
[23:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:50] <stashbot>	 T190500: Enable Extension:StructuredDiscussions in Basque Wikipedia - https://phabricator.wikimedia.org/T190500
[23:21:27] <RoanKattouw>	 Thanks!
[23:21:32] <Dereckson>	 You're welcome :)
[23:29:10] <wikibugs>	 (CR) Dereckson: [C: 1] "The namespaced classes have been introduced in 5.4 and backported to 4.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: Umherirrender)
[23:33:12] <wikibugs>	 (CR) jenkins-bot: Stop logging autopatrol actions everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) (owner: Ladsgroup)
[23:33:16] <wikibugs>	 (CR) jenkins-bot: Allow sysops to create Flow boards on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: Catrope)
[23:53:18] <wikibugs>	 (PS1) MaxSem: Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723
[23:54:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723 (owner: MaxSem)
[23:55:00] <wikibugs>	 (CR) MaxSem: [C: 2] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723 (owner: MaxSem)
[23:56:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723 (owner: MaxSem)
[23:56:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723 (owner: MaxSem)
[23:58:17] <wikibugs>	 (PS2) MaxSem: Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723
[23:59:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/425723 (owner: MaxSem)
[23:59:40] <MaxSem>	 what's wrong with jerkins?