[00:03:24] (PS7) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[00:10:49] (PS1) Bstorm: Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/425448
[00:11:12] Krinkle: updated my comment with more info https://phabricator.wikimedia.org/T191940#4122060
[00:12:01] !log Updated views and indexes on labsdb1011
[00:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:05] Warning Device cr1-eqsin.wikimedia.org recovered from Processor usage over 85%
[00:14:07] (PS2) Bstorm: Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/425448
[00:18:37] (CR) Dzahn: [C: -1] "for some reason still enabled for all:" [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[00:23:16] (PS8) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[00:28:32] (CR) Dzahn: [C: -1] "still not http://puppet-compiler.wmflabs.org/10894/restbase-dev1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[00:29:21] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4122111 (awight) Still finding strangeness... Reading the virtualenv so...
[00:36:51] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4122112 (thcipriani) `fwrite` is definitely different in hhvm -- accounts for all the `lseek` in the hhvm output: https://gist.github.com/thcipriani/...
[01:37:21] Operations, netops: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4122151 (ayounsi)
[02:36:09] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.28) (duration: 05m 41s)
[02:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:30] (PS1) Samwilson: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - https://gerrit.wikimedia.org/r/425466
[05:16:18] (CR) Marostegui: [C: 2] Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/425448 (owner: Bstorm)
[05:17:20] !log Reload haproxy on dbprox1010 to repool labsdb1010
[05:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:49] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4122327 (Joe) What are the blockers for the use of PHP7? All I see on the ticket mentioned is the memcached issue, which ops are working on right no...
[05:22:48] !log Deploy schema change on codfw s8 master (db2045) with replication enabled (this will generate lag on codfw) - T187089 T185128 T153182
[05:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:55] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:22:55] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:22:55] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:25:26] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4122342 (Marostegui) @Papaul next one will be db2042 Thanks!
[05:28:53] !log manual coal back-fill still running with the normal coal disabled via systemd. Will restore normal coal when I wake up.
[05:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:57] (PS1) Marostegui: mariadb: Move db2069 from s1 to x1 [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275)
[05:35:34] (CR) jerkins-bot: [V: -1] mariadb: Move db2069 from s1 to x1 [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[05:37:16] (PS2) Marostegui: mariadb: Move db2069 from s1 to x1 [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275)
[05:40:34] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10895/" [puppet] - https://gerrit.wikimedia.org/r/425468 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[05:48:28] (PS1) Marostegui: install_server: Allow reimage db2069 [puppet] - https://gerrit.wikimedia.org/r/425472
[05:50:00] (CR) Marostegui: [C: 2] install_server: Allow reimage db2069 [puppet] - https://gerrit.wikimedia.org/r/425472 (owner: Marostegui)
[05:58:56] (PS2) Giuseppe Lavagetto: Update rbenv ruby version to match production [puppet] - https://gerrit.wikimedia.org/r/425280
[06:02:15] Operations, Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122370 (Dzahn) @Sharvaniharan @MoritzMuehlenhoff The user name is now correct. The remaining issue (that looks like the same from outside but is a different problem now) is:...
[06:11:19] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275)
[06:11:20] Operations, HHVM, Patch-For-Review, User-Elukey, User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4122371 (Joe)
[06:12:43] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[06:14:10] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[06:15:51] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2069 to the config as depooled x1 slave - T191275 (duration: 01m 01s)
[06:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:58] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[06:17:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2069 to the config as depooled x1 slave - T191275 (duration: 01m 03s)
[06:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:04] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2069 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/425473 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[06:20:31] !log Stop MySQL on db2033 to clone db2069 - T191275
[06:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:23] (PS3) Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924)
[06:29:47] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:32:01] (CR) Giuseppe Lavagetto: [C: -1] "A few minor comments, which can be addressed now or later, and an important fix in class mcrouter, where an override is improperly applied" (9 comments) [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[06:37:23] (PS1) Elukey: network::constants: add conf100[456] to zookeeper_main_hosts [puppet] - https://gerrit.wikimedia.org/r/425474 (https://phabricator.wikimedia.org/T182924)
[06:49:06] !log restart Yarn Resource Manager daemons on analytics100[12] to pick up the new Prometheus configuration file
[06:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:11] Operations, ops-eqsin, Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4122392 (Vgutierrez) Fixed following @Volans recommendations: ``` vgutierrez@neodymium:~$ sudo cumin 'R:class%site = eqsin' 'ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volati...
[06:52:32] Operations, ops-eqsin, Traffic: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4122393 (Vgutierrez) Open>Resolved a:Vgutierrez
[06:59:47] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:15] (PS1) Vgutierrez: Revert "Revert "install_server: Reimage lvs5003 as stretch"" [puppet] - https://gerrit.wikimedia.org/r/425475
[07:08:05] !log Reimaging lvs5003.eqsin as stretch (2nd attempt)
[07:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:33] (CR) Vgutierrez: [C: 2] Revert "Revert "install_server: Reimage lvs5003 as stretch"" [puppet] - https://gerrit.wikimedia.org/r/425475 (owner: Vgutierrez)
[07:12:08] (CR) Elukey: [C: 2] network::constants: add conf100[456] to zookeeper_main_hosts [puppet] - https://gerrit.wikimedia.org/r/425474 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[07:12:13] (PS2) Elukey: network::constants: add conf100[456] to zookeeper_main_hosts [puppet] - https://gerrit.wikimedia.org/r/425474 (https://phabricator.wikimedia.org/T182924)
[07:12:50] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5003.eqsin.wmnet ``` The log can be found in `/var/lo...
[07:14:27] (CR) Daimona Eaytoy: [C: -1] "> Should we merge this now or wait https://gerrit.wikimedia.org/r/#/c/201104/" [mediawiki-config] - https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy)
[07:20:52] (PS1) Marostegui: mariadb: notifications enable/disable db2069/2033 [puppet] - https://gerrit.wikimedia.org/r/425476 (https://phabricator.wikimedia.org/T191275)
[07:22:56] (PS1) Elukey: profile::prometheus::alerts: tune mirror maker alert [puppet] - https://gerrit.wikimedia.org/r/425477
[07:23:56] (CR) Elukey: [C: 2] profile::prometheus::alerts: tune mirror maker alert [puppet] - https://gerrit.wikimedia.org/r/425477 (owner: Elukey)
[07:25:30] (PS2) Marostegui: mariadb: notifications enable/disable db2069/2033 [puppet] - https://gerrit.wikimedia.org/r/425476 (https://phabricator.wikimedia.org/T191275)
[07:26:27] (CR) Marostegui: [C: 2] mariadb: notifications enable/disable db2069/2033 [puppet] - https://gerrit.wikimedia.org/r/425476 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:27:33] !log Stop MySQL on db2033 to copy its data away before reimaging - T191275
[07:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:41] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[07:30:21] Operations, Documentation: Please document how to try fixing IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956#4122418 (ema)
[07:30:44] Operations, Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956#4122431 (ema) p:Triage>Normal
[07:30:59] Operations, HHVM, Patch-For-Review, User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4122433 (ArielGlenn)
[07:32:36] (PS1) Marostegui: install_server: Allow reimage db2033 [puppet] - https://gerrit.wikimedia.org/r/425480 (https://phabricator.wikimedia.org/T191275)
[07:33:51] (CR) Marostegui: [C: 2] install_server: Allow reimage db2033 [puppet] - https://gerrit.wikimedia.org/r/425480 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:39:52] Operations, Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122437 (MoritzMuehlenhoff) You're getting connection failures from bast4001, but in your config you've configured your SSH client to use bast1002? Try changing the last four l...
[07:40:30] (PS1) Marostegui: db2069.yaml: Binlog format ROW [puppet] - https://gerrit.wikimedia.org/r/425481
[07:42:14] (CR) Marostegui: [C: 2] db2069.yaml: Binlog format ROW [puppet] - https://gerrit.wikimedia.org/r/425481 (owner: Marostegui)
[07:45:17] !log cp2022: restart varnish-be due to child process crash https://phabricator.wikimedia.org/P6979 T191229
[07:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:24] T191229: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229
[07:46:33] (CR) Giuseppe Lavagetto: [C: 2] Update rbenv ruby version to match production [puppet] - https://gerrit.wikimedia.org/r/425280 (owner: Giuseppe Lavagetto)
[07:46:40] (PS3) Giuseppe Lavagetto: Update rbenv ruby version to match production [puppet] - https://gerrit.wikimedia.org/r/425280
[07:57:59] (PS1) Marostegui: db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275)
[07:58:33] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:59:01] _joe_: ^ I guess?
[08:00:38] <_joe_> marostegui: yeah sorry
[08:00:49] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:00:51] <_joe_> that is the kind of change with no production effect I forget to merge
[08:01:43] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[08:02:08] (Merged) jenkins-bot: db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:02:22] (CR) jenkins-bot: db-codfw.php: Repool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/425482 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:03:28] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2069 as candidate master for x1 - T191275 (duration: 01m 03s)
[08:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:34] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[08:18:20] !log rerunning eqiad misc backups
[08:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:13] (PS5) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:22:18] (CR) Hashar: "I had to rename libtidy.sa to use the 0.99 suffix which lead to the following mess:" [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: Hashar)
[08:24:56] (PS6) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:25:07] (PS1) Marostegui: s1,x1.hosts: Move db2069 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/425487 (https://phabricator.wikimedia.org/T191275)
[08:26:17] (Abandoned) Elukey: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - https://gerrit.wikimedia.org/r/423851 (owner: Elukey)
[08:27:36] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122514 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5003.eqsin.wmnet'] ``` Of which those **FAILED**: ``` ['lvs5003.eqsin.wmnet'] ```
[08:28:22] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122515 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5003.eqsin.wmnet ``` The log can be found in `/var/lo...
[08:28:48] (CR) Marostegui: [C: 2] s1,x1.hosts: Move db2069 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/425487 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:29:39] (Merged) jenkins-bot: s1,x1.hosts: Move db2069 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/425487 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:31:18] (PS2) Muehlenhoff: Reimage mw1265 with stretch [puppet] - https://gerrit.wikimedia.org/r/425269 (https://phabricator.wikimedia.org/T174431)
[08:35:48] (CR) Muehlenhoff: [C: 2] Reimage mw1265 with stretch [puppet] - https://gerrit.wikimedia.org/r/425269 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff)
[08:39:28] (PS7) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:40:10] (CR) Hashar: "A couple lintian issues:" [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: Hashar)
[08:49:19] (PS8) Hashar: Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771)
[08:49:53] (CR) Hashar: "Fixed doc-base and binary without manpage." [debs/tidy-0.99] - https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: Hashar)
[08:54:09] LD_LIBRARY_PATH=/home/hashar/projects/tidy/hacked php tests/phpunit/phpunit.php
[08:54:15] (wrong term grrr)
[08:55:09] (PS1) Jcrespo: mariadb: Allow reimage of all es2*** hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/425491
[08:56:29] (CR) Jcrespo: [C: 2] mariadb: Allow reimage of all es2*** hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/425491 (owner: Jcrespo)
[08:56:37] (PS1) Jcrespo: Revert "mariadb: Allow reimage of all es2*** hosts to stretch" [puppet] - https://gerrit.wikimedia.org/r/425492
[08:59:25] (CR) Mark Bergsma: [C: 2] Create FSM test cases according to the RFC 4271 definition [debs/pybal] - https://gerrit.wikimedia.org/r/423995 (owner: Mark Bergsma)
[08:59:43] !log reimaging mw1265 to stretch (T174431)
[08:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:49] T174431: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431
[08:59:59] (Merged) jenkins-bot: Create FSM test cases according to the RFC 4271 definition [debs/pybal] - https://gerrit.wikimedia.org/r/423995 (owner: Mark Bergsma)
[09:02:14] (PS3) Mark Bergsma: Handle non-IDLE states in idleHoldTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/423997
[09:02:16] (PS3) Mark Bergsma: Fix sendNotification invocation [debs/pybal] - https://gerrit.wikimedia.org/r/423998
[09:02:18] (PS3) Mark Bergsma: Fix two typos in bgp.FSM.openReceived [debs/pybal] - https://gerrit.wikimedia.org/r/423999
[09:02:20] (PS3) Mark Bergsma: Fix holdTimeEvent incrementing connectRetryCounter twice [debs/pybal] - https://gerrit.wikimedia.org/r/424000
[09:02:22] (PS3) Mark Bergsma: Fix distinction between events 19 and 20 (delayOpen) [debs/pybal] - https://gerrit.wikimedia.org/r/424001
[09:02:24] (PS3) Mark Bergsma: Handle state ESTABLISHED in versionError (event 24) [debs/pybal] - https://gerrit.wikimedia.org/r/424002
[09:02:26] (PS3) Mark Bergsma: Handle state OPENSENT in keepAliveEvent (event 11) [debs/pybal] - https://gerrit.wikimedia.org/r/424003
[09:02:28] (PS3) Mark Bergsma: Handle state OPENSENT in keepAliveReceived [debs/pybal] - https://gerrit.wikimedia.org/r/424004
[09:02:32] (PS3) Mark Bergsma: Correctly handle event 9 (connectRetryTimeEvent) in ACTIVE [debs/pybal] - https://gerrit.wikimedia.org/r/424005
[09:02:34] (PS3) Mark Bergsma: Fix typo in FSM.delayOpenTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/424006
[09:02:36] (PS1) Jcrespo: mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493
[09:02:54] (CR) Marostegui: [C: 1] mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:02:54] oh my :)
[09:03:32] :)
[09:03:41] !log restart pybal on lvs1003 for UDP monitoring config changes https://gerrit.wikimedia.org/r/#/c/425251/
[09:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:48] that was not all, i can only push 10 changes at once ;)
[09:03:55] (PS3) Mark Bergsma: Move updating of FSM metric labels to the protocol's connectionMade [debs/pybal] - https://gerrit.wikimedia.org/r/424007
[09:03:57] (PS3) Mark Bergsma: Ignore headerError and openMessageError in state IDLE [debs/pybal] - https://gerrit.wikimedia.org/r/424008
[09:04:00] (PS3) Mark Bergsma: Cleanup module for consistency [debs/pybal] - https://gerrit.wikimedia.org/r/424009
[09:04:02] (PS3) Mark Bergsma: Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - https://gerrit.wikimedia.org/r/424011
[09:04:08] PROBLEM - PyBal IPVS diff check on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:04:15] (CR) Jcrespo: [C: 2] mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:04:21] lvs5003 it's me
[09:05:57] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:06:02] (Merged) jenkins-bot: mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:06:13] (Abandoned) Mark Bergsma: Fix test case ESTABLISHED event 27 hold time nonzero [debs/pybal] - https://gerrit.wikimedia.org/r/424010 (owner: Mark Bergsma)
[09:06:23] gone for a couple of hours, back later
[09:07:37] PROBLEM - Check rp_filter disabled on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:07:37] PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:07:57] PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied
[09:08:28] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2014 (duration: 01m 03s)
[09:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:19] someone working in deploy1001.eqiad.wmnet? it failed to sync (ok if you are, I will research if not)
[09:10:55] (PS7) Elukey: Modify eventlogging purging script to read from YAML whitelist [puppet] - https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) (owner: Mforns)
[09:11:30] (CR) Elukey: [C: 2] "Tested in labs (deployment-eventlog05) with both tsv and yaml encoding, everything seems good." [puppet] - https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) (owner: Mforns)
[09:11:33] jynus: I saw that too earlier and saw this in SAL mutante: deploy1001 - reinstalled with stretch - re-adding to puppet so I guess it is still WIP
[09:11:37] RECOVERY - Check rp_filter disabled on lvs5003 is OK: OK: kernel parameters are set to expected value.
[09:15:47] (CR) Mark Bergsma: [C: 2] Fix sendNotification invocation [debs/pybal] - https://gerrit.wikimedia.org/r/423998 (owner: Mark Bergsma)
[09:16:08] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#4122579 (Qgil)
[09:16:57] (CR) Mark Bergsma: [C: 2] Fix two typos in bgp.FSM.openReceived [debs/pybal] - https://gerrit.wikimedia.org/r/423999 (owner: Mark Bergsma)
[09:17:52] !log start reimage of es2014
[09:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:50] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old.
[09:19:10] RECOVERY - PyBal IPVS diff check on lvs5003 is OK: OK: no difference between hosts in IPVS/PyBal
[09:20:25] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122595 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5003.eqsin.wmnet'] ``` and were **ALL** successful.
[09:22:00] RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK
[09:22:39] RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 12 connections established with conf2003.codfw.wmnet:2379 (min=12)
[09:23:30] (CR) jenkins-bot: mariadb: Depool es2014 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425493 (owner: Jcrespo)
[09:24:43] (PS1) Volans: wmf-auto-reimage: bugfix Phabricator client [puppet] - https://gerrit.wikimedia.org/r/425495
[09:25:00] PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied
[09:28:04] (PS1) Elukey: profile::mariadb::misc::el::sanitization: add package [puppet] - https://gerrit.wikimedia.org/r/425496 (https://phabricator.wikimedia.org/T189692)
[09:28:39] (CR) Elukey: [C: 2] profile::mariadb::misc::el::sanitization: add package [puppet] - https://gerrit.wikimedia.org/r/425496 (https://phabricator.wikimedia.org/T189692) (owner: Elukey)
[09:29:28] !log doing some testing in labtestvirt2001 mounting instance's qcow2 files into /home/aborrero/mnt
[09:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:40] PROBLEM - HHVM processes on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:32:40] PROBLEM - nutcracker port on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:34:19] PROBLEM - HHVM rendering on mw1265 is CRITICAL: connect to address 10.64.0.60 and port 80: Connection refused
[09:34:19] PROBLEM - nutcracker process on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:36:00] PROBLEM - puppet last run on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:36:28] ^reimage, silencing again
[09:39:29] PROBLEM - Apache HTTP on mw1265 is CRITICAL: connect to address 10.64.0.60 and port 80: Connection refused
[09:39:29] PROBLEM - MD RAID on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:41:09] PROBLEM - Check size of conntrack table on mw1265 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:41:10] RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK
[09:42:09] (PS1) Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692)
[09:45:46] (PS1) Marostegui: db2033.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/425499
[09:46:19] (CR) Marostegui: [C: 2] db2033.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/425499 (owner: Marostegui)
[09:49:34] (PS1) Jcrespo: Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501
[09:51:47] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501 (owner: Jcrespo)
[09:51:56] (PS1) Jcrespo: mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502
[09:52:30] !log installing java security updates on kafka/analytics cluster
[09:52:32] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[09:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:08] (Merged) jenkins-bot: Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501 (owner: Jcrespo)
[09:53:51] (CR) Jcrespo: [C: 2] mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502 (owner: Jcrespo)
[09:54:42] (PS2) Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692)
[09:55:09] (Merged) jenkins-bot: mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502 (owner: Jcrespo)
[09:57:09] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2014, depool es2015 (duration: 01m 02s)
[09:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:41] (PS1) Jcrespo: Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425503
[09:59:39] (CR) Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10898/" [puppet] - https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: Elukey)
[10:00:28] !log installing java security updates on kafka/jumbo cluster
[10:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:31] !log start reimage of es2015
[10:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:59] (CR) jenkins-bot: Revert "mariadb: Depool es2014 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425501 (owner: Jcrespo)
[10:05:06] (CR) jenkins-bot: mariadb: Depool es2015 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425502 (owner: Jcrespo)
[10:10:56] RECOVERY - MD RAID on mw1265 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[10:11:07] RECOVERY - HHVM processes on mw1265 is OK: PROCS OK: 6 processes with command name hhvm
[10:11:26] RECOVERY - Check size of conntrack table on mw1265 is OK: OK: nf_conntrack is 0 % full
[10:17:44] RECOVERY - nutcracker process on mw1265 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[10:19:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1964 bytes in 0.105 second response time
[10:19:44] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 75329 bytes in 0.174 second response time
[10:26:08] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:28:31] !log Drop table prefstats in s5 - T154490
[10:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:36] T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[10:31:43] !log Drop table prefstats in s6 - T154490
[10:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:28] RECOVERY - nutcracker port on mw1265 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[10:33:04] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425503 (owner: Jcrespo)
[10:33:05] !log Drop table prefstats in s4 - T154490
[10:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:21] (PS1) Jcrespo: mariadb: Depool es2011 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425507
[10:34:25] (Merged) jenkins-bot: Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425503 (owner: Jcrespo)
[10:34:40] (PS4) EddieGP: beta: Combine commons, deployments, meta and zero vhost [puppet] - https://gerrit.wikimedia.org/r/398399
[10:35:57] (PS3) EddieGP: Run initSiteStats twice a month [puppet] - https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: Chad)
[10:37:31] (CR) EddieGP: "I've signed this one up for tomorrows puppet swat. It's trivial and a no-op in prod." [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP)
[10:38:58] (PS1) Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508
[10:39:16] (PS2) Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508
[10:39:25] (PS1) Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based app servers [puppet] - https://gerrit.wikimedia.org/r/425509 (https://phabricator.wikimedia.org/T185195)
[10:41:02] (CR) EddieGP: [C: 1] "I've signed this up for tomorrows puppet swat, even though it's not my patch (I'd have done the same, but Chad was faster). I hope that's " [puppet] - https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: Chad)
[10:43:10] (CR) Ema: [C: 1] pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508 (owner: Vgutierrez)
[10:43:13] !log Drop table prefstats in s2 - T154490
[10:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:18] T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490
[10:44:17] (PS3) Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897)
[10:44:45] (CR) jerkins-bot: [V: -1] pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[10:45:47] (PS4) Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897)
[10:46:30] (CR) Vgutierrez: [C: 2] pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[10:46:40] (PS5) Vgutierrez: pybal: Reenable BGP in lvs5003 [puppet] - https://gerrit.wikimedia.org/r/425508 (https://phabricator.wikimedia.org/T191897)
[10:49:01] (CR) Jcrespo: "SiteStatsInit.php seems well written, using the replica for slow queries, I can endorse this." [puppet] - https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: Chad)
[10:49:13] (CR) Jcrespo: [C: 1] Run initSiteStats twice a month [puppet] - https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: Chad)
[10:50:49] !log installing openssl updates
[10:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:26] (CR) Jcrespo: [C: 2] mariadb: Depool es2011 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425507 (owner: Jcrespo)
[10:53:46] (Merged) jenkins-bot: mariadb: Depool es2011 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425507 (owner: Jcrespo)
[10:53:48] (CR) jenkins-bot: Revert "mariadb: Depool es2015 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425503 (owner: Jcrespo)
[10:53:58] (CR) jenkins-bot: mariadb: Depool es2011 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425507 (owner: Jcrespo)
[10:56:53] !log stop pybal on lvs5001 to test requests through lvs5003, reimaged as stretch T191897
[10:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:59] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[10:59:04] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2015, depool es2011 (duration: 00m 59s)
[10:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:03] PROBLEM - pybal on lvs5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[11:00:14] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090
[11:00:29] that's me, ignore ^
[11:01:57]
ACKNOWLEDGEMENT - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Ema Pybal stopped to test lvs5003 T191897 [11:01:57] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=4) Ema Pybal stopped to test lvs5003 T191897 [11:01:57] ACKNOWLEDGEMENT - pybal on lvs5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Ema Pybal stopped to test lvs5003 T191897 [11:04:08] !log Drop table prefstats in s7 - T154490 [11:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:15] T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490 [11:09:45] !log start pybal on lvs5001, test completed on lvs5003 [11:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:03] RECOVERY - pybal on lvs5001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [11:10:22] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy [11:11:04] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4122849 (10Vgutierrez) [11:28:20] (03PS8) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [11:28:22] (03PS6) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [11:28:24] (03PS8) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [11:28:26] (03PS4) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [11:28:36] (03CR) 
10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:28:36] * volans waiting for the -1s, [11:28:38] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:28:40] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:28:43] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:34:40] (03PS3) 10EddieGP: Kill some hiera paths [labs/private] - 10https://gerrit.wikimedia.org/r/423189 [11:36:31] (03PS2) 10EddieGP: cloud hiera: Remove unused paths from hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/423190 [11:47:07] !log start reimage of es2011 [11:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:40] (03PS1) 10Jcrespo: Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 [11:59:03] (03PS1) 10Jcrespo: mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 [11:59:40] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 (owner: 10Jcrespo) [11:59:59] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 (owner: 10Jcrespo) [12:01:02] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 (owner: 10Jcrespo) [12:01:22] (03CR) 10jenkins-bot: Revert "mariadb: Depool es2011 for 
reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425516 (owner: 10Jcrespo) [12:01:25] (03Merged) 10jenkins-bot: mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 (owner: 10Jcrespo) [12:04:20] (03PS2) 10Mobrovac: Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [12:05:06] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2011, depool es2012 (duration: 01m 01s) [12:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:47] (03CR) 10jenkins-bot: mariadb: Depool es2012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425517 (owner: 10Jcrespo) [12:09:42] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.085 second response time [12:09:51] !log start reimage of es2012 [12:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 [12:16:07] taking over tin for 10 mins [12:16:22] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:32] (03CR) 10Mobrovac: [C: 032] Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [12:18:47] (03Merged) 10jenkins-bot: Disable bulk number 2 jobs in redis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [12:19:17] (03CR) 10jenkins-bot: Disable bulk number 2 jobs in redis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [12:20:33] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 322 MB (3% inode=78%) [12:21:07] !log enable production traffic for mw1265 (stretch app server) for a brief test period [12:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:48] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 1/2 - T190327 (duration: 01m 01s) [12:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:55] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [12:32:10] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122973 (10Sharvaniharan) My config looks like this now.. ``` Host bastlabs HostName bastion-eqiad.wmflabs.org User sharan IdentityFile ~/.ssh/id_rsa Host *.eqiad.wmflabs !bast... [12:32:37] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 1/2 (retry) - T190327 (duration: 01m 00s) [12:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:43] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [12:33:53] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4122985 (10Sharvaniharan) @MoritzMuehlenhoff please let me know if a hangout would be better. I will be available for the next hour and then anytime after 9am Pacific time. 
[12:39:32] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 1/2 (retry #2) (duration: 01m 01s) [12:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:23] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:41:29] (03PS1) 10Vgutierrez: install_server: Reimage lvs4007 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425520 (https://phabricator.wikimedia.org/T191897) [12:43:04] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (watching): Scap sync-file failing for 9 hosts - https://phabricator.wikimedia.org/T191972#4122995 (10mobrovac) [12:43:46] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (watching): Scap sync-file failing for 9 hosts - https://phabricator.wikimedia.org/T191972#4123005 (10mobrovac) p:05Triage>03Unbreak! [12:44:11] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123007 (10MoritzMuehlenhoff) >>! In T191673#4122973, @Sharvaniharan wrote: > Not sure if this is what you meant by ssh -add -L > ``` > wmf2050:~ sharan$ ssh -add -L releases10... 
[12:44:19] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123008 (10ema) p:05Triage>03High [12:45:47] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for 9 hosts - https://phabricator.wikimedia.org/T191972#4123012 (10mobrovac) [12:46:44] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.099 second response time [12:47:28] (03CR) 10Ema: [C: 031] install_server: Reimage lvs4007 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425520 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [12:47:34] RECOVERY - Disk space on labtestcontrol2001 is OK: DISK OK [12:47:41] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123013 (10Sharvaniharan) This is what I have in etc/ssh ``` wmf2050:~ sharan$ ls /etc/ssh moduli ssh_config sshd_config ``` They are both empty files though. Should i delete... [12:48:00] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123014 (10jcrespo) [12:48:41] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123015 (10Sharvaniharan) And.. ``` wmf2050:~ sharan$ ssh-add -L releases1001.eqiad.wmnet The agent has no identities. 
``` [12:50:20] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage lvs4007 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425520 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [12:53:09] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123026 (10mobrovac) Apparently `deploy1001` has been recently reimaged. However, it doesn't seem like it has a role associated with i... [12:54:11] (03CR) 10Mobrovac: [C: 032] "NOTE: this hasn't been fully synced due to T191972" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [12:56:29] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 (owner: 10Jcrespo) [12:56:42] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589662 (10mobrovac) It seems that the reimage is now blocking deployments, cf. {T191972} [12:57:53] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 (owner: 10Jcrespo) [12:57:58] jouncebot: next [12:57:58] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1300) [12:58:11] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123031 (10MoritzMuehlenhoff) >>! In T191673#4123015, @Sharvaniharan wrote: > And.. > > > ``` > wmf2050:~ sharan$ ssh-add -L releases1001.eqiad.wmnet > The agent has no identiti... 
[12:58:17] probably not happening ^ due to T191972 [12:58:17] T191972: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972 [12:59:12] (03CR) 10jenkins-bot: Revert "mariadb: Depool es2012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425518 (owner: 10Jcrespo) [12:59:42] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123034 (10Sharvaniharan) yes... I was looking in the wrong directory... I have updated my comment to reflect the contents of /etc/ssh/ssh_config [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1300). [13:00:05] No GERRIT patches in the queue for this window AFAICS. [13:00:11] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2012 (duration: 01m 00s) [13:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:51] jynus: is it not failing for you or you don't care about the 9 servers? [13:01:01] !log Reimage lvs4007 as stretch [13:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:34] nice, nothing for swat today :) [13:01:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.325 second response time [13:01:59] mobrovac: please, they are not 9 servers [13:02:04] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123036 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo... 
[13:02:21] please read the logs carefully, I had to correct your ticket [13:03:14] jynus: what are the mw hosts listed in the log then? my read is that it fails to connect to these [13:03:31] they happen to be on the same batch, but they are successful [13:03:41] ok: 269; fail: 1; left: 0 [13:04:06] and that is without being a "real" deployer not knowing scap [13:04:20] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123038 (10Sharvaniharan) finally!!!! I was able to ssh into releases1001 and stat1006! [13:04:36] the command returns 1 and only 1 error for a single host [13:04:53] plus my changes are noop- I deploy and revert, as usual [13:04:55] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123039 (10Imarlier) Change in observed performance due to depooling of Singapore: Synthetic tests (from AWS Mumbai): https://grafana.wikimedia.org/dash... [13:05:28] jynus: indeed [13:05:29] k [13:05:34] i'll finish my stuff then [13:05:46] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Services (blocked): Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123040 (10jcrespo) [13:06:21] most likely, that will be setup as a deployment server, but having 3 of those is complicated, so it ended up in a limbo [13:06:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] cassandra/icinga: make monitoring configurable, skip on dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn) [13:06:44] 10Operations, 10Deployments, 10Release-Engineering-Team: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123042 (10mobrovac) p:05Unbreak!>03High It's failing only on `deploy1001`, so lowering the priority. 
[13:09:08] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['lvs4007.ulsfo.wmnet'] ``` [13:09:21] !log Drop prefstats table on s3 codfw master - db2043 (this might generate lag on codfw) - T154490 [13:09:25] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123048 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo... [13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:27] T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490 [13:09:28] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123049 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['lvs4007.ulsfo.wmnet'] ``` [13:10:00] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123051 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4007.ulsfo.wmnet ``` The log can be found in `/var/lo... 
[13:10:30] 10Operations, 10Patch-For-Review: Update SSH key in production hosts for @Sharvaniharan - https://phabricator.wikimedia.org/T191673#4123052 (10Sharvaniharan) 05Open>03Resolved [13:12:34] !log restart kafka brokers on kafka1012->23 for openjdk-7 upgrades [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:53] !log Drop prefstats table on s1 codfw master - db2048 (this might generate lag on codfw) - T154490 [13:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:48] (03CR) 10Alexandros Kosiaris: [C: 031] puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339 (owner: 10Herron) [13:18:17] (03CR) 10Alexandros Kosiaris: [C: 031] "Nice, wanna have a look at https://gerrit.wikimedia.org/r/#/c/356021/ as well ? Can also be useful (I never received input so I am resolic" [puppet] - 10https://gerrit.wikimedia.org/r/425335 (owner: 10Herron) [13:20:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) [13:22:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) (owner: 10Marostegui) [13:23:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) (owner: 10Marostegui) [13:23:54] (03PS1) 10Pmiazga: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) [13:25:05] (03CR) 10jerkins-bot: [V: 04-1] Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga) [13:25:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: 
Depool db1072 (duration: 01m 00s) [13:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:22] !log installing java security updates on kafka/main cluster [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:20] !log Drop prefstats table on s3 sanitarium master (db1072) this might cause lag on labs - T154490 [13:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:25] T154490: Delete prefstats tables - https://phabricator.wikimedia.org/T154490 [13:28:20] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 [13:28:50] (03CR) 10Ottomata: "Thanks Luca!" [puppet] - 10https://gerrit.wikimedia.org/r/425477 (owner: 10Elukey) [13:29:07] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425521 (https://phabricator.wikimedia.org/T154490) (owner: 10Marostegui) [13:29:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 (owner: 10Marostegui) [13:30:47] (03PS1) 10Gehel: maps: run populate_admin() regularly [puppet] - 10https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) [13:31:04] (03PS2) 10Pmiazga: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) [13:31:06] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 (owner: 10Marostegui) [13:31:45] !log ppchelko@tin Started deploy [cpjobqueue/deploy@2b59313]: Enable second bulk of low-traffic jobs T190327 [13:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:51] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 
[13:32:13] marostegui: when you are done with the sync, i'll need to sync too [13:32:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 (duration: 01m 07s) [13:32:30] mobrovac: all yours [13:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:35] kk thnx [13:32:53] !log ppchelko@tin Started deploy [cpjobqueue/deploy@2b59313]: Enable second bulk of low-traffic jobs T190327 [13:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:58] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch a bulk of low-traffic jobs to EventBus for testwikis, file 2/2 - T190327 (duration: 01m 00s) [13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:17] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425523 (owner: 10Marostegui) [13:37:36] !log rolling restart of restbase in codfw to pick up openssl update [13:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:02] (03CR) 10Alexandros Kosiaris: [C: 031] Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [13:40:28] (03PS2) 10Gehel: maps: run populate_admin() regularly [puppet] - 10https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) [13:41:21] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@2b59313]: Enable second bulk of low-traffic jobs T190327 (duration: 08m 27s) [13:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:26] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [13:44:01] !log ppchelko@tin 
Started deploy [cpjobqueue/deploy@3ba6580]: Enable second bulk of low-traffic jobs T190327 take 2 [13:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:17] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [13:44:50] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@3ba6580]: Enable second bulk of low-traffic jobs T190327 take 2 (duration: 00m 49s) [13:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:50] akosiaris: thanks a lot! [13:48:28] (03CR) 10Mobrovac: [C: 032] "> NOTE: this hasn't been fully synced due to T191972" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425271 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [13:51:45] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123145 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4007.ulsfo.wmnet'] ``` and were **ALL** successful. 
[13:53:44] (03PS3) 10Elukey: Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) [13:54:00] (03CR) 10Elukey: [C: 032] Add AAAA and PTR records for conf100[456] [dns] - 10https://gerrit.wikimedia.org/r/425292 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [13:58:46] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4123181 (10Andrew) [14:05:54] (03PS1) 10Jcrespo: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 [14:06:05] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) [14:07:48] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 (owner: 10Jcrespo) [14:09:04] (03Merged) 10jenkins-bot: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 (owner: 10Jcrespo) [14:09:18] (03CR) 10jenkins-bot: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425531 (owner: 10Jcrespo) [14:09:56] (03PS2) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) [14:11:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:11:50] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Depool es2012 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425534 [14:11:52] (03PS1) 10Giuseppe Lavagetto: php: add module for basic installation [puppet] - 10https://gerrit.wikimedia.org/r/425535 [14:11:59] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert 
"mariadb: Depool es2012 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425534 (owner: 10Jcrespo) [14:12:06] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2013 (duration: 01m 00s) [14:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:12:48] (03Abandoned) 10Jcrespo: Revert "Revert "mariadb: Depool es2012 for reimage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425534 (owner: 10Jcrespo) [14:13:10] (03PS1) 10Jcrespo: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 [14:13:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 for alter table (duration: 01m 00s) [14:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:22] !log Deploy schema change on db1099:3318 - T187089 T185128 T153182 [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:29] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [14:14:29] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [14:14:29] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [14:14:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425532 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:15:08] !log start reimage of es2013 [14:15:11] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned 
source - https://phabricator.wikimedia.org/T181071#4123214 (10Halfak) This sounds surprising and strange. Please ping me on... [14:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:29] (03PS1) 10Ema: pybal: Re-enable BGP on lvs4007 [puppet] - 10https://gerrit.wikimedia.org/r/425537 (https://phabricator.wikimedia.org/T191897) [14:20:16] (03CR) 10Ema: [C: 032] pybal: Re-enable BGP on lvs4007 [puppet] - 10https://gerrit.wikimedia.org/r/425537 (https://phabricator.wikimedia.org/T191897) (owner: 10Ema) [14:26:51] 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (10jcrespo) [14:27:06] (03PS2) 10Giuseppe Lavagetto: php: add module for basic installation [puppet] - 10https://gerrit.wikimedia.org/r/425535 [14:30:13] (03PS1) 10Herron: puppet-agent: log puppet runs via syslog [puppet] - 10https://gerrit.wikimedia.org/r/425538 [14:31:00] 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123270 (10jcrespo) T150160 suggests `racadm reset` may fix it. 
[14:36:33] !log ppchelko@tin Started deploy [cpjobqueue/deploy@a090a3c]: Fix the low priority jobs topic names [14:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:11] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@a090a3c]: Fix the low priority jobs topic names (duration: 00m 38s) [14:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:02] !log Turned regular coal back on (T191239) [14:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:08] T191239: coal metrics changed after deploying new code - https://phabricator.wikimedia.org/T191239 [14:38:27] (03CR) 10jerkins-bot: [V: 04-1] puppet-agent: log puppet runs via syslog [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron) [14:39:30] !log rolling restart of restbase in eqiad to pick up openssl update [14:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:22] (03CR) 10Mforns: [C: 031] "LGTM!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey) [14:46:36] (03CR) 10Herron: "Looking at this and the patch to remove colors (https://gerrit.wikimedia.org/r/#/c/425335/) I realized that by tuning our syslog config we" [puppet] - 10https://gerrit.wikimedia.org/r/356021 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [14:47:26] 10Operations, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4123336 (10MoritzMuehlenhoff) [14:48:13] (03Abandoned) 10Herron: puppet: disable color output in puppet log /var/log/puppet.log [puppet] - 10https://gerrit.wikimedia.org/r/425335 (owner: 10Herron) [14:49:17] (03PS2) 10Jcrespo: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 [14:49:33] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 (owner: 10Jcrespo) [14:49:44] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [14:50:26] ^ saw this, will look into it [14:50:57] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 (owner: 10Jcrespo) [14:51:11] (03CR) 10Herron: "Thanks akosiaris!" 
[puppet] - 10https://gerrit.wikimedia.org/r/425339 (owner: 10Herron) [14:51:13] (03CR) 10jenkins-bot: Revert "mariadb: Depool es2013 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425536 (owner: 10Jcrespo) [14:51:15] (03PS2) 10Herron: puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339 [14:51:44] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:51:58] (03CR) 10Herron: [C: 032] puppet-merge: continue despite errors during remote/ssh stage [puppet] - 10https://gerrit.wikimedia.org/r/425339 (owner: 10Herron) [14:53:00] (03PS1) 10Jcrespo: mariadb: Depool es1011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 [14:53:51] (03PS2) 10Ema: lvs: use UDP monitor for logstash-{json,syslog}-udp [puppet] - 10https://gerrit.wikimedia.org/r/425253 [14:54:24] !log starting rolling restart of elasticsearch cirrus / eqiad for jvm upgrade [14:54:27] (03CR) 10Ema: [C: 032] lvs: use UDP monitor for logstash-{json,syslog}-udp [puppet] - 10https://gerrit.wikimedia.org/r/425253 (owner: 10Ema) [14:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:31] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123393 (10Krinkle) [14:54:40] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2013 (duration: 01m 00s) [14:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:26] (03PS2) 10Jcrespo: mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 [14:55:30] (03CR) 10Herron: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron) [14:57:52] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo) [14:57:58] (03CR) 10Elukey: 
role:mariadb::misc::el::replica: add new yaml whitelist to db1108 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey) [14:58:05] (03PS3) 10Elukey: role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) [14:58:49] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4123399 (10Vgutierrez) [14:58:53] (03CR) 10Elukey: [C: 032] role:mariadb::misc::el::replica: add new yaml whitelist to db1108 [puppet] - 10https://gerrit.wikimedia.org/r/425498 (https://phabricator.wikimedia.org/T189692) (owner: 10Elukey) [14:59:05] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4123400 (10akosiaris) >>! In T191648#4113134, @Andrew wrote: >>>! In T191648#4112786, @bd808 wrote: >> I wonder if the specific ordering issue is t... [14:59:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I think we can solve this in a bit cleaner approach than mangling the plugins settings list. 
See https://phabricator.wikimedia.org/T191648" [puppet] - 10https://gerrit.wikimedia.org/r/424638 (https://phabricator.wikimedia.org/T191648) (owner: 10Andrew Bogott) [14:59:30] (03Merged) 10jenkins-bot: mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo) [15:01:13] !log Stopping coal on graphite2001.codfw.wmnet for data replay [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1012 (duration: 01m 00s) [15:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:46] (03CR) 10jenkins-bot: mariadb: Depool es1012 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo) [15:03:44] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1981 bytes in 0.144 second response time [15:06:00] (03PS2) 10Ppchelko: Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 [15:06:37] (03CR) 10Ppchelko: "rebased." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 (owner: 10Ppchelko) [15:06:53] !log shutting down cp2008, cp2011, and cp2018 for onsite work [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:03] !log restart pybal on lvs1006 for logstash-{json,syslog} UDP monitoring config changes https://gerrit.wikimedia.org/r/#/c/425253/ [15:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:33] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:33] PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:43] PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:39] (03PS1) 10Ppchelko: Switch second bulk of low-traffic jobs for all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) [15:13:22] (03PS2) 10Rduran: Add integration tests to test agains MariaDB [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/425291 [15:14:04] PROBLEM - Host cp2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:12] !log restart pybal on lvs1003 for logstash-{json,syslog} UDP monitoring config changes https://gerrit.wikimedia.org/r/#/c/425253/ [15:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:24] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:14:33] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:14:33] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:14:33] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:34] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:34] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:34] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:34] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:43] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:43] PROBLEM - IPsec on 
cp3044 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:43] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:44] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:44] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:53] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:53] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:54] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:14:54] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:14:54] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:14:54] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:03] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:03] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:04] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:04] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:04] PROBLEM - IPsec on 
kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:04] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:04] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:05] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:05] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:14] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:14] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:14] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:14] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 [15:15:14] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:14] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:15] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2018_v4, cp2018_v6 [15:15:15] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:16] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:16] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:17] PROBLEM - IPsec on 
cp3037 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:24] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:25] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:25] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 130 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6, cp2018_v4, cp2018_v6 [15:15:25] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:25] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:25] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:33] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2018_v4, cp2018_v6 [15:15:33] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:34] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [15:15:34] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 [15:15:43] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 [15:15:44] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2018_v4, cp2018_v6 [15:15:50] ema: will you take care of running puppet there and on icinga? [15:15:54] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 [15:16:25] jynus: ? [15:16:38] for the alerts noise, I mean [15:17:18] aren't you decomming servers? 
[15:17:22] jynus: there's ongoing dc-ops work causing that, it isn't me (and it isn't puppet related) [15:17:25] ah [15:17:27] sorry [15:17:39] now I get it [15:17:46] no worries. Those ipsec alerts are a pain, I know [15:19:14] RECOVERY - Host cp2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms [15:19:51] !log fixing grant issue on db1114 [15:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:26] !log disabling coal service on graphite2001 and disabling puppet – T191239 [15:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:32] T191239: coal metrics changed after deploying new code - https://phabricator.wikimedia.org/T191239 [15:23:44] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.074 second response time [15:26:53] PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied [15:27:28] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4123551 (10debt) Hi @BBlack - can you add your concerns to this ticket....we're needing to get this figured out soon. Thanks! 
[15:27:53] RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK [15:28:28] ^^^ that's me [15:29:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice idea, some inline comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron) [15:29:34] PROBLEM - Host cp2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:29:49] (03Abandoned) 10Alexandros Kosiaris: Timestamp puppet-run logs [puppet] - 10https://gerrit.wikimedia.org/r/356021 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [15:31:23] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123569 (10Papaul) DIMM A2 replaced DIMM B2 replaced DIMM B6 replaced [15:33:44] RECOVERY - Host cp2008 is UP: PING WARNING - Packet loss = 28%, RTA = 36.04 ms [15:34:43] RECOVERY - Host cp2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms [15:42:26] (03PS1) 10Ema: cache::ipsec: remove non-jumbo hosts from kafka::nodes [puppet] - 10https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) [15:43:03] !log cp2008 repooled after memory swap [15:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:03] PROBLEM - Host cp2018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123634 (10RobH) system has been pushed back into service with the new memory in use [15:50:14] RECOVERY - Host cp2018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms [15:52:24] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:49] (03PS2) 10Ema: role::kafka::analytics: get rid of ipsec [puppet] - 10https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) [15:58:25] 10Operations, 10ops-codfw, 10Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4123709 (10Papaul) DIMM B3 replaced BIOS 
update IDRAC update [15:58:52] !log Re-enabled puppet and coal on graphite2001 [15:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:04] RECOVERY - Host cp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:59:34] (03PS3) 10Alexandros Kosiaris: Install git-lfs on scap source and target [puppet] - 10https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) (owner: 10Awight) [15:59:37] (03CR) 10Alexandros Kosiaris: [C: 032] Install git-lfs on scap source and target [puppet] - 10https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) (owner: 10Awight) [15:59:43] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK [15:59:43] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK [15:59:43] RECOVERY - Host cp2008 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [15:59:44] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [15:59:44] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [15:59:44] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [15:59:44] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [15:59:45] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK [15:59:53] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK [15:59:53] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK [15:59:53] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK [15:59:53] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK [15:59:53] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [15:59:54] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK [15:59:54] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK [15:59:55] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK [15:59:55] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [15:59:56] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK [15:59:56] RECOVERY - IPsec on cp1064 is OK: Strongswan 
OK - 66 ESP OK [16:00:00] (03CR) 10Mobrovac: [C: 04-1] Switch second bulk of low-traffic jobs for all wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:00:03] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [16:00:03] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [16:00:04] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [16:00:04] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [16:00:04] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [16:00:04] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK [16:00:04] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [16:00:10] (03CR) 10Alexandros Kosiaris: [C: 032] "Just noting for completeness that this is not gonna make any difference on tin.eqiad.wmnet as it is not stretch" [puppet] - 10https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) (owner: 10Awight) [16:00:14] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [16:00:14] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [16:00:14] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [16:00:23] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK [16:00:23] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [16:00:24] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [16:00:24] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [16:00:24] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK [16:00:34] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK [16:03:35] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123741 (10RobH) also note I rebooted cp2008 into the post and debian kernel selection screen 7 times, without any memory post errors. 
[16:04:44] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4123756 (10Papaul) 05Open>03Resolved [16:04:58] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10Papaul) [16:05:01] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4123757 (10Papaul) 05Open>03Resolved [16:05:40] (03PS1) 10BBlack: Revert "Depolling eqsin due to router issue" [dns] - 10https://gerrit.wikimedia.org/r/425552 (https://phabricator.wikimedia.org/T191667) [16:06:23] PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:55] (03CR) 10BBlack: [C: 032] Revert "Depolling eqsin due to router issue" [dns] - 10https://gerrit.wikimedia.org/r/425552 (https://phabricator.wikimedia.org/T191667) (owner: 10BBlack) [16:08:13] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4123787 (10Krinkle) [16:09:22] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) [16:10:04] RECOVERY - Host cp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [16:10:14] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:21] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/425553 (https://phabricator.wikimedia.org/T181650) [16:10:23] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:23] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:23] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan 
CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:24] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:24] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:24] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:24] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:25] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:44] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:44] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:44] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:50] yeah yeah [16:10:53] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:53] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:53] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:10:54] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:00] i didnt mean to let cp2011 hit the os but it totally did [16:11:03] hence these. [16:11:03] (03CR) 10Jcrespo: [C: 032] "Dumps are running, so I will have to wait to touch this host." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425542 (owner: 10Jcrespo) [16:11:04] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:13] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:13] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:13] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:13] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:14] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2011_v4, cp2011_v6 [16:11:31] !log restarting cassandra, dev environment (testing default GC settings) -- T186751 [16:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:37] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [16:12:43] PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:24] (03PS3) 10Elukey: role::analytics_cluster::hadoop:master|standby: enable HDFS trash [puppet] - 10https://gerrit.wikimedia.org/r/424237 (https://phabricator.wikimedia.org/T189051) [16:14:10] !log reboot notebook1001 for kernel updates [16:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:33] (03CR) 10Elukey: [C: 032] role::analytics_cluster::hadoop:master|standby: enable HDFS trash [puppet] - 10https://gerrit.wikimedia.org/r/424237 (https://phabricator.wikimedia.org/T189051) (owner: 10Elukey) [16:15:33] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK [16:15:33] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [16:15:34] RECOVERY - Host cp2011 is UP: PING WARNING - Packet loss = 28%, RTA = 36.29 ms [16:15:44] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK [16:15:53] RECOVERY - IPsec on cp1050 is OK: Strongswan 
OK - 66 ESP OK [16:15:54] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [16:15:54] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [16:15:54] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [16:15:54] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK [16:15:54] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK [16:16:04] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [16:16:13] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [16:16:13] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [16:16:13] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [16:16:13] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [16:16:14] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK [16:16:20] (03PS1) 10MusikAnimal: Enable PageAssessments on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) [16:16:23] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [16:16:23] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [16:16:24] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [16:16:24] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [16:16:24] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [16:16:33] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK [16:16:33] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [16:17:14] RECOVERY - Host cp2018 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [16:17:14] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 40 ESP OK [16:17:23] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [16:17:24] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [16:17:24] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [16:17:25] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [16:17:33] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK 
[16:17:33] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 40 ESP OK [16:17:33] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [16:17:33] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [16:17:34] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [16:17:34] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [16:17:37] 10Operations, 10ops-codfw, 10Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4123838 (10Papaul) DIMM A2 replaced DIMM A6 replaced BIOS update IDRAC update [16:17:53] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [16:17:53] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [16:18:03] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 40 ESP OK [16:18:03] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [16:18:03] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [16:18:03] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [16:18:03] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [16:18:04] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [16:18:53] RECOVERY - Check systemd state on notebook1001 is OK: OK - running: The system is fully operational [16:19:54] 10Operations, 10ops-codfw, 10Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4123845 (10RobH) so we rebooted this system half a dozen times through post and kernel section splash screen and no more memory errors. 
[16:20:48] 10Operations, 10Deployments, 10Release-Engineering-Team: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4123848 (10Dzahn) a:03Dzahn [16:21:48] (03PS2) 10Marostegui: wiki replicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/425553 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:22:04] (03PS2) 10Herron: puppet-agent: log puppet runs via syslog [puppet] - 10https://gerrit.wikimedia.org/r/425538 [16:22:38] (03CR) 10Marostegui: [C: 032] wiki replicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/425553 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:23:03] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:23:56] !log Reload haproxy on dbproxy1011 to depool labsdb1009 [16:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:11] !log cp2011 returned to service [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:04] PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:06] (03CR) 10Herron: puppet-agent: log puppet runs via syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425538 (owner: 10Herron) [16:27:23] RECOVERY - Host cp2018 is UP: PING WARNING - Packet loss = 93%, RTA = 36.13 ms [16:32:04] PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:03] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:33:04] RECOVERY - Host cp2018 is UP: PING WARNING - Packet loss = 66%, RTA = 36.02 ms [16:33:37] !log See T191887 [16:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:41] !log cp2018 returned to service [16:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:37:19] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4123938 (10ayounsi) Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Routing [16:38:03] 10Operations, 10ops-codfw, 10Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4123940 (10RobH) rebooted this half a dozen times after the memory swap, and no memory errors have cropped back up. pushed back into service. @papaul: can you please post the return tag tracking numb... [16:44:15] !log restart hadoop hdfs namenodes on analytics100[12] to pick up HDFS Trash settings - T189051 [16:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:21] T189051: Add trash folder to hadoop - https://phabricator.wikimedia.org/T189051 [16:51:45] !log sbisson@tin Started deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 to maps-test* [16:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:01] !log sbisson@tin Finished deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 to maps-test* (duration: 01m 16s) [16:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:04] (03PS9) 10Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) [17:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1700). [17:00:05] subbu and raynor: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:15] o/ [17:00:30] o/ [17:00:47] subbu - you can go first - my change will require a bit to test [17:00:59] ok .. any swatters? [17:06:19] anyone able to swat? :) [17:08:10] (03PS1) 10Dzahn: remove deploy1001 from scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972) [17:08:33] (03CR) 10Dzahn: [C: 032] remove deploy1001 from scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972) (owner: 10Dzahn) [17:09:20] (03CR) 10Dzahn: [C: 032] "thanks, fixed it. now it works" [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: 10Dzahn) [17:09:23] thcipriani, RoanKattouw can one of you? [17:09:32] (03PS10) 10Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - 10https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) [17:10:14] (03PS2) 10Dzahn: remove deploy1001 from scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972) [17:10:16] subbu: I'm in a meeting, as is half the foundation [17:10:24] ah, ok. [17:10:35] (03CR) 10Dzahn: [V: 032 C: 032] remove deploy1001 from scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/425561 (https://phabricator.wikimedia.org/T191972) (owner: 10Dzahn) [17:10:52] I've complained previously that 10am PT is a bad time to schedule a SWAT but I'm not sure if that ever made it back to releng [17:11:09] I can SWAT, give me a few to get setup [17:11:28] thcipriani, thanks. [17:12:17] (03CR) 10Jdlrobson: [C: 031] Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga) [17:12:22] RoanKattouw: I'll bring it up at the monday meeting, this was initially to give some space for the train for swat overruns, etc. 
But it's a tricky time. [17:12:42] Yeah it's just that a lot of teams have standups between the hours of 10 and 11 [17:13:08] So both the number of available SWATters and the number of people even wanting to use that window are reduced [17:14:05] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Patch-For-Review: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4124068 (10Dzahn) deploy1001 has been removed from scap hosts and puppet ran on tin. This should have fixed the immediate scap issue. [17:14:31] Tuesday swat has caught me by surprise quite a bit [17:14:38] or quite often rather [17:15:02] (03PS3) 10Thcipriani: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: 10Subramanya Sastry) [17:15:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: 10Subramanya Sastry) [17:16:36] (03Merged) 10jenkins-bot: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: 10Subramanya Sastry) [17:19:58] subbu: ^ is on mwdebug1002, check please [17:20:28] thanks. will do. [17:21:06] 10Operations, 10Deployments, 10Release-Engineering-Team, 10Patch-For-Review: Scap sync-file failing for deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T191972#4124089 (10Dzahn) 05Open>03Resolved [17:21:38] thcipriani, lgtm on two of the wikis. as long as there are no errors, good to go. 
[17:21:56] logs look good, syncing [17:22:09] k [17:24:07] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:423496|Enable RemexHtml on wikis with <50 issues in high priority linter cats]] T190731 (duration: 00m 59s) [17:24:08] (03PS2) 10Madhuvishy: dumps: Remove stat1005|6 from nfs clients for dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/423733 (https://phabricator.wikimedia.org/T188644) [17:24:13] subbu: live now [17:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:13] T190731: Enable RemexHTML on additional wikis with < 50 errors in all high priority categories - https://phabricator.wikimedia.org/T190731 [17:24:23] great. thanks. [17:24:52] (03PS3) 10Thcipriani: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga) [17:25:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga) [17:25:15] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4124125 (10Papaul) [17:25:18] 10Operations, 10ops-codfw, 10Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4124122 (10Papaul) 05Open>03Resolved a:05Papaul>03None I do not have them Dell tech already took all the boxes [17:25:36] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10Papaul) [17:25:39] urandom: could i possible enable and run puppet on restbase-dev1004 (to confirm my change to disable icinga checks for cassandra if on 'dev' works) [17:25:43] 10Operations, 10ops-codfw, 10Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4124126 
(10Papaul) 05Open>03Resolved [17:26:39] (03Merged) 10jenkins-bot: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga) [17:27:32] (03CR) 10Madhuvishy: [C: 032] dumps: Remove stat1005|6 from nfs clients for dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/423733 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [17:27:45] raynor: your change is live on mwdebug1002, check please [17:28:21] ok, thanks thcipriani: I' [17:28:23] !log sbisson@tin Started deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 everywhere [17:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:32] I'm testing that [17:29:14] okie doke, looks like you've got a good checklist so take your time :) [17:29:48] !log actually re-enabled puppet on graphite2001 [17:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] !log sbisson@tin Finished deploy [kartotherian/deploy@4cd5a19]: Deploying kartotherian v0.0.38 everywhere (duration: 02m 27s) [17:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:28] 10Operations, 10monitoring, 10Patch-For-Review: restbase: skip (some) icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4124163 (10Dzahn) The change above will now ensure that cassandra Icinga checks are not added if on the dev cluster. We don't see the results yet because p... [17:32:42] 10Operations, 10monitoring, 10Patch-For-Review: restbase/cassandra: skip (some) icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4124165 (10Dzahn) [17:38:03] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[17:38:17] thcipriani - looks good on mwdebug1002 [17:38:22] please deploy to production [17:38:32] * thcipriani does [17:38:51] thcipriani: Are you a gerrit admin? [17:40:35] (03Abandoned) 10Dzahn: prometheus: ganglia-gen outdated resource names (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/409390 (https://phabricator.wikimedia.org/T186918) (owner: 10Dzahn) [17:40:56] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:425522|Deploy page previews for anons on dewiki]] T191966 (duration: 00m 54s) [17:41:02] raynor: ^ live now [17:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:02] T191966: Deploy page previews for anons on dewiki - https://phabricator.wikimedia.org/T191966 [17:41:04] Niharika: I don't know. I do have some special powers on gerrit, I think. [17:41:24] thanks thcipriani [17:41:28] let us check it [17:41:42] thcipriani: Can you grant +2 access on mediawiki-config to two people? https://phabricator.wikimedia.org/T189414 and https://phabricator.wikimedia.org/T161181 [17:42:16] I'm going to do a deployment training with them. They are both staffers and have been around for a while. [17:42:41] hrm, I've never tried to do that... 
[17:42:42] * thcipriani digs [17:42:43] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 62873 keys, up 15 hours 15 minutes - replication_delay is 626 [17:43:48] (03CR) 10jenkins-bot: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) (owner: 10Subramanya Sastry) [17:43:51] (03CR) 10jenkins-bot: Deploy page previews for anons on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425522 (https://phabricator.wikimedia.org/T191966) (owner: 10Pmiazga) [17:44:03] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 640 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 63066 keys, up 15 hours 19 minutes - replication_delay is 640 [17:44:33] everything works as expected thcipriani thanks for deployment [17:44:49] raynor: thanks for testing! [17:45:04] thcipriani: Maybe you see something on this screen - https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config [17:45:09] You are a gerrit admin. [17:46:11] thcipriani: they need to be added to https://gerrit.wikimedia.org/r/#/admin/groups/21,members to give +2 typically. [17:49:15] yep, just found that group after flailing around a bit [17:49:27] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4124258 (10RobH) The order for this is escalated for placement. This should arrive sometime next week. (Just upda... [17:53:12] Niharika: musikanimal: MusikAnimal is the gerrit username, correct? [17:53:23] thcipriani: Yes. [17:53:27] yep [17:53:53] all added [17:54:07] thcipriani: And samwilson too, right? [17:54:12] Thank you so much! 
[17:54:15] Niharika: yes indeed [17:54:38] sure thing, glad to have more deployers :) [17:54:43] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 62765 keys, up 15 hours 24 minutes - replication_delay is 653 [17:57:04] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1424 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 63066 keys, up 15 hours 32 minutes - replication_delay is 1424 [17:57:43] RECOVERY - Confd template for /etc/dsh/group/ores on deploy1001 is OK: No errors detected [17:58:03] RECOVERY - Disk space on deploy1001 is OK: DISK OK [17:58:03] RECOVERY - Confd template for /etc/dsh/group/maps on deploy1001 is OK: No errors detected [17:58:04] RECOVERY - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is OK: No errors detected [17:58:04] RECOVERY - Check size of conntrack table on deploy1001 is OK: OK: nf_conntrack is 0 % full [17:58:04] RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is OK: No errors detected [17:58:04] RECOVERY - Confd template for /etc/dsh/group/cassandra on deploy1001 is OK: No errors detected [17:58:04] RECOVERY - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is OK: No errors detected [17:58:05] RECOVERY - confd service on deploy1001 is OK: OK - confd is active [17:58:05] RECOVERY - Unmerged changes on repository mediawiki_config on deploy1001 is OK: No changes to merge. 
[17:58:06] RECOVERY - Check whether ferm is active by checking the default input chain on deploy1001 is OK: OK ferm input default policy is set [17:58:06] RECOVERY - Confd template for /etc/dsh/group/parsoid on deploy1001 is OK: No errors detected [17:58:07] RECOVERY - MD RAID on deploy1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:58:14] RECOVERY - configured eth on deploy1001 is OK: OK - interfaces up [17:58:14] RECOVERY - Confd template for /etc/dsh/group/jobrunner on deploy1001 is OK: No errors detected [17:58:14] RECOVERY - dhclient process on deploy1001 is OK: PROCS OK: 0 processes with command name dhclient [17:59:04] RECOVERY - DPKG on deploy1001 is OK: All packages OK [17:59:04] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 621 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 62926 keys, up 15 hours 43 minutes - replication_delay is 621 [17:59:04] thanks! [17:59:14] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:03] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:03:14] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4766041 keys, up 15 hours 53 minutes - replication_delay is 633 [18:05:54] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is OK: Files ownership is ok. [18:10:04] RECOVERY - Check the NTP synchronisation status of timesyncd on deploy1001 is OK: OK: synced at Wed 2018-04-11 18:10:00 UTC. 
[18:10:24] RECOVERY - IPMI Sensor Status on deploy1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:10:30] (03PS2) 10Madhuvishy: nfsclient: Cleanup absented dumps mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423728 (https://phabricator.wikimedia.org/T188643) [18:11:25] (03CR) 10Madhuvishy: [C: 032] nfsclient: Cleanup absented dumps mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423728 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [18:11:47] !log deploy1001 is back on stretch once again - it has been removed from scap hosts though (T175288 T185275) [18:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:54] T185275: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275 [18:11:54] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [18:14:53] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 62887 keys, up 16 hours 2 minutes - replication_delay is 616 [18:16:48] (03PS2) 10Madhuvishy: dumps: Turn off cron that rsyncs to labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423731 (https://phabricator.wikimedia.org/T188643) [18:17:31] (03CR) 10Madhuvishy: [C: 032] dumps: Turn off cron that rsyncs to labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423731 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [18:21:24] (03PS2) 10Madhuvishy: nfs: Stop exporting dumps from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423727 (https://phabricator.wikimedia.org/T188643) [18:22:15] (03CR) 10Madhuvishy: [C: 032] nfs: Stop exporting dumps from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423727 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [18:25:07] (03PS3) 10Madhuvishy: dumps: Clean up code that rsyncs 
to labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423732 (https://phabricator.wikimedia.org/T188643) [18:25:52] (03CR) 10Madhuvishy: [C: 032] dumps: Clean up code that rsyncs to labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/423732 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [18:26:05] (03PS3) 10Herron: puppet-agent: log puppet runs via syslog [puppet] - 10https://gerrit.wikimedia.org/r/425538 (https://phabricator.wikimedia.org/T75989) [18:35:28] (03CR) 10Chad: [C: 04-1] "Per discussion on IRC the other day, we want to serve this from a separate vhost over something like gerrit.wmfusercontent.org." [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [18:35:41] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 63347 keys, up 16 hours 22 minutes - replication_delay is 629 [18:38:50] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 63136 keys, up 16 hours 22 minutes - replication_delay is 619 [18:41:48] (03CR) 10Chad: [C: 031] Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: 10Paladox) [18:43:16] (03CR) 10Chad: [V: 032 C: 032] add plugin avatars-external [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/424710 (owner: 10Paladox) [18:45:19] mutante: ugh [18:47:22] (03PS1) 10Pmiazga: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T191101) [18:47:26] !log restarting cassandra, dev environment (set -XX:+PerfDisableSharedMem) -- T186751 [18:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:47:33] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [18:47:56] urandom: no worries, it can wait, the compiler said it's all good [18:48:10] (03PS1) 10Ladsgroup: Add badge for good lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976) [18:48:15] mutante: i need to figure out something here anyway... [18:48:25] i've had it disabled too long [18:48:57] but yeah, if it can wait, let me see about getting it sorted [18:49:35] (03CR) 10Jdlrobson: [C: 031] Enable Page Previews for 10% enwiki anon users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [18:49:49] urandom: take your time [18:50:25] (03PS2) 10Ladsgroup: Add badge for good lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976) [18:51:58] (03PS2) 10Herron: base: auto logout idle bash shells after 2 days [puppet] - 10https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922) [18:52:47] no_justification: do you want me to merge the GC change now (with or without restart ) [18:53:15] If you want. I'm OOO today [18:54:42] i'm also OOO [18:55:51] (03PS3) 10Ladsgroup: Add badge for good lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976) [18:56:10] (03PS2) 10Jdlrobson: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T189906) (owner: 10Pmiazga) [19:00:04] thcipriani: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:52] this may be a reference I don't get. [19:01:08] but I am working on the train. 
[19:02:07] (03PS5) 10Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) [19:05:14] mutante: fyi, in the interest of getting a clean diff of some live hacks I've made, I just ran puppet on restbase-dev1004 [19:05:49] no_justification done [19:05:51] urandom: thanks, i think that's all i needed [19:06:00] for gerrit.wmfusercontent.org [19:06:03] (03CR) 10Chad: [C: 04-1] Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [19:06:47] no_justification yep, though i had a look in apache.conf template and it seems it does ServerAlias with that config (ie as an array) [19:07:26] no_justification ie [19:07:26] ServerAlias <%= Array(@slave_hosts).join(' ') %> [19:08:08] urandom: it works. running puppet on icinga server and the cassandra checks are being removed because it's a "dev" [19:08:15] that was what i wanted [19:08:24] mutante: cool [19:08:47] i'm still going to work on getting this to the state where i can re-enable puppet [19:08:58] (03CR) 10Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [19:09:55] !log thcipriani@tin Synchronized php-1.31.0-wmf.29/includes/libs/rdbms/database: [[gerrit:425566|rdbms: fix transaction flushing in Database::close]] T191916 (duration: 01m 01s) [19:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:01] T191916: Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBUnexpectedError' with message 'Wikimedia\Rdbms\Database::close: Expected mass commit of all peer transactions (DBO_TRX set).' 
in /srv/mediawiki/php-1.31.0-wmf.29/includes/libs/rdbms/database/Database.php:3602 - https://phabricator.wikimedia.org/T191916 [19:11:32] !log thcipriani@tin rebuilt and synchronized wikiversions files: testwiki back to 1.31.0-wmf.29 [19:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:54] (03PS1) 10Catrope: Allow sysops to create Flow boards on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) [19:15:28] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4849 keys, up 16 hours 59 minutes - replication_delay is 0 [19:15:32] urandom: i am done with my part, all up to you :) [19:15:37] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5018 keys, up 17 hours 2 minutes - replication_delay is 0 [19:16:16] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/10906/bast1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922) (owner: 10Herron) [19:16:22] (03PS3) 10Herron: base: auto logout idle bash shells after 2 days [puppet] - 10https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922) [19:17:38] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 5630 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 62765 keys, up 16 hours 47 minutes - replication_delay is 5630 [19:17:54] (03CR) 10Herron: [C: 032] base: auto logout idle bash shells after 2 days [puppet] - 10https://gerrit.wikimedia.org/r/392698 (https://phabricator.wikimedia.org/T122922) (owner: 10Herron) [19:18:08] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 63445 keys, up 17 hours 2 minutes - replication_delay is 0 
[19:18:15] (03PS1) 10Thcipriani: Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425595 [19:19:18] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4622 keys, up 17 hours 6 minutes - replication_delay is 0 [19:19:53] (03CR) 10Thcipriani: [C: 032] Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425595 (owner: 10Thcipriani) [19:20:08] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4707403 keys, up 17 hours 9 minutes - replication_delay is 0 [19:20:43] (03CR) 10Chad: [C: 04-1] Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [19:21:08] (03Merged) 10jenkins-bot: Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425595 (owner: 10Thcipriani) [19:21:23] (03CR) 10jenkins-bot: Group0 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425595 (owner: 10Thcipriani) [19:21:38] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 62813 keys, up 16 hours 51 minutes - replication_delay is 0 [19:23:28] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group0 to 1.31.0-wmf.29 [19:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:48] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4437 keys, up 16 hours 58 minutes - replication_delay is 0 [19:26:18] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4495 keys, up 17 hours 1 minutes - replication_delay is 0 [19:44:13] (03PS1) 10Andrew 
Bogott: WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - 10https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648) [19:44:42] (03CR) 10jerkins-bot: [V: 04-1] WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - 10https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648) (owner: 10Andrew Bogott) [19:46:06] (03PS1) 10Thcipriani: Group1 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425599 [19:46:17] (03PS2) 10Andrew Bogott: WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - 10https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648) [19:49:45] (03CR) 10Andrew Bogott: [C: 032] WMCS puppet enc api: remove --autoload from uwsgi service settings. [puppet] - 10https://gerrit.wikimedia.org/r/425598 (https://phabricator.wikimedia.org/T191648) (owner: 10Andrew Bogott) [19:51:26] (03CR) 10Thcipriani: [C: 032] Group1 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425599 (owner: 10Thcipriani) [19:52:42] (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425599 (owner: 10Thcipriani) [19:52:48] (03PS1) 10Ppchelko: Enable EventBus for job events for all the wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) [19:52:58] (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425599 (owner: 10Thcipriani) [19:57:48] (03CR) 10Ppchelko: [C: 04-1] "Because blocked by T192005" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: 10Ppchelko) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:13] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.29 [20:00:22] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4124737 (10Andrew) > Fixing this looks to be as easy as passing $service_settings => '--die-on-term'in openstack::puppet::master::encapi Indeed, t... [20:00:24] (03CR) 10Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [20:00:25] ORES has some minor patches to roll out. [20:00:30] * halfak watches [20:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:00] halfak: fyi, I’m starting slow, just gonna deploy to the (new) canary, ores1001 for starters. 
[20:01:21] * awight dons hairshirt in hope of not repeating words [20:02:09] !log awight@tin Started deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071 [20:02:10] !log thcipriani@tin Synchronized php: group1 to 1.31.0-wmf.29 (duration: 01m 16s) [20:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:26] T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071 [20:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:17] (03CR) 10Ppchelko: "Actually, since $wmgUseEventBus is still respected, this change is a no-op, but must be deployed before Id1d043e5ce02e73b51f75ee54575647b5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: 10Ppchelko) [20:04:40] nothing to deploy for parsoid [20:05:20] no deploy for mobileapps [20:20:42] !log awight@tin Finished deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071 (duration: 18m 34s) [20:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:48] T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071 [20:22:05] (03PS4) 10Awight: Update ORES venv path to use versioned cache [puppet] - 10https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) [20:22:33] (03PS5) 10Awight: Update ORES venv path to use versioned cache [puppet] - 10https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) [20:23:21] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4124827 (10awight) [20:26:35] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 
10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4124865 (10awight) @akosiaris We're finally ready to deploy the puppet cha... [20:27:09] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3851398 (10awight) p:05Triage>03High [20:27:24] \o/ [20:28:52] I'm having intermittent timeouts when connecting to gerrit.wikimedia.org [20:29:52] Pchelolo: Feel like kicking this patch today? https://gerrit.wikimedia.org/r/#/c/424145/ [20:34:19] <_joe_> bawolff: still ongoing? [20:34:45] No, it seems to be better now [20:35:00] At least for the moment [20:35:20] and it was only when connecting via ssh, not the web interface [20:35:22] (03PS2) 10Hashar: Tag 'latest' during build instead of at publishing [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398265 [20:36:22] awight: can we do it tomorrow UTC morning? We have a huge undeployed backlog of patches for Change-Prop that we need to deploy with great caution, it's super late for mobrovac and kinda late for me already [20:36:52] !log increase change-prop sample rate in dev env to 40% (from 20) -- T186751 [20:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:58] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [20:42:00] Pchelolo: Sounds good to me, thanks! It would be good if Scoring staff were monitoring, so let me know what time you’re thinking of deploying once it becomes more concrete. [20:42:51] kk, I'm in UTC-3 and I start working pretty early [20:46:50] :) [20:50:03] Pchelolo: fyi, these are the main graphs to look at, if we’re unable to be awake: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=3&fullscreen&orgId=1 will show how many scores each machine is handling. 
We expect that to increase by 50-100%, but keeping the same shape. Danger signs would be if any machine stops processing scores. Also, https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=2&fullscreen&orgId [20:50:04] might have a small spike of up to maybe 10 errors per minute during deployment, but should otherwise stay close to zero at all times. [20:51:20] kk [20:51:28] acknowledged awight [20:51:40] whew! sorry about the brain dump, I should put that on a wiki page. [20:51:55] we will probably just get rid of our deploy backlog tomorrow morning and make another deploy as you wake up [20:52:48] :) Our patch isn’t a huge rush, I’ve just been energized by *nearly* unblocking on some other ORES stuff. [21:18:25] (03PS1) 10Rush: openstack: add bootstrap instructions for provider created vxlan [puppet] - 10https://gerrit.wikimedia.org/r/425713 (https://phabricator.wikimedia.org/T188266) [21:18:35] (03PS2) 10MusikAnimal: Enable PageAssessments on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) [21:19:32] (03PS2) 10MusikAnimal: Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) [21:20:18] (03CR) 10Rush: [C: 032] openstack: add bootstrap instructions for provider created vxlan [puppet] - 10https://gerrit.wikimedia.org/r/425713 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:41:14] (03CR) 10Thcipriani: [C: 032] "tests pass, workflow makes sense, tested working locally." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398265 (owner: 10Hashar) [21:41:46] (03Merged) 10jenkins-bot: Tag 'latest' during build instead of at publishing [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398265 (owner: 10Hashar) [21:42:02] what's "dashiki" ? as in "labs-project-dashiki" [21:42:37] mutante: The wiki-configured stats system from Analytics. 
[21:43:09] thanks James_F [21:43:40] I don't know what that particular WMCloud instance is specifically for, though, sorry. [21:51:52] bblack: yt? [21:53:44] (03CR) 10Giuseppe Lavagetto: [C: 031] "The reason to tag only at publishing was to keep the "latest" tag consistent with the remote repository, but it's admittedly useless and c" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398265 (owner: 10Hashar) [22:00:05] samwilson and musikanimal: (Dis)respected human, time to deploy Logging and PageAssessments (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T2200). Please do the needful. [22:00:05] samwilson and musikanimal: A patch you scheduled for Logging and PageAssessments is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [22:00:09] _joe_: deploy1001 is on stretch again (but removed from scap hosts right now). tin is still active deployment_server and on jessie. i also have a ticket to upgrade naos.codfw to stretch. should i go ahead and do that now? i mean it's codfw [22:00:41] well the ticket says to also rename it to deploy2001 [22:00:48] <_joe_> no, let's wait for a week until it's clearer what we want to do [22:00:54] ok [22:01:54] <_joe_> we still have to make some decisions around what we will do, with releng and other people involved in the php7 migration [22:03:05] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (10Dzahn) adding @RStallman-legalteam for the NDA step [22:03:32] *nod* i figured that. 
thanks [22:04:18] (03CR) 10MusikAnimal: [C: 032] Enable PageAssessments on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) (owner: 10MusikAnimal) [22:05:36] (03Merged) 10jenkins-bot: Enable PageAssessments on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) (owner: 10MusikAnimal) [22:07:17] (03CR) 10Dzahn: [C: 031] "it will also add him on the bastion hosts via the "all-users" special group and magic" [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: 10ArielGlenn) [22:09:20] (03CR) 10jenkins-bot: Enable PageAssessments on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425554 (https://phabricator.wikimedia.org/T153393) (owner: 10MusikAnimal) [22:12:23] (03CR) 10Dzahn: "it would affect all these:" [puppet] - 10https://gerrit.wikimedia.org/r/415510 (owner: 10Dzahn) [22:13:09] (03Abandoned) 10Dzahn: cache::misc: switch webserver_misc_static to codfw backend [puppet] - 10https://gerrit.wikimedia.org/r/420142 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn) [22:13:52] !log musikanimal@tin Synchronized wmf-config/InitialiseSettings.php: Enabling PageAssessments on frwiki (T153393) (duration: 01m 26s) [22:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:58] T153393: Deploy PageAssessments to French wikipedia - https://phabricator.wikimedia.org/T153393 [22:16:10] (03CR) 10MusikAnimal: [C: 032] Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: 10MusikAnimal) [22:16:21] (03CR) 10jerkins-bot: [V: 04-1] Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: 10MusikAnimal) [22:17:36] (03PS3) 10MusikAnimal: Enable PageAssessments on huwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) [22:17:51] (03CR) 10MusikAnimal: [C: 032] Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: 10MusikAnimal) [22:19:06] (03Merged) 10jenkins-bot: Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: 10MusikAnimal) [22:19:59] awight: so i just looked at the ores change and then saw your ticket comment.. i see that would only affect all ores1* machines but not also the ones with ores::redis, right? also the admin groups look like you have the powers to do the restarts. i would merge that if you are here to check on it afterwards [22:20:56] could also run the right restart command via cumin [22:21:10] (03CR) 10jenkins-bot: Enable PageAssessments on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425212 (https://phabricator.wikimedia.org/T191697) (owner: 10MusikAnimal) [22:23:02] mutante: Great, yes it should work like you described. 
I can run the restarts via pssh :D [22:23:25] !log views updated on labsdb1009 [22:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:31] (03PS6) 10Dzahn: Update ORES venv path to use versioned cache [puppet] - 10https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) (owner: 10Awight) [22:24:49] !log musikanimal@tin Synchronized wmf-config/InitialiseSettings.php: Enabling PageAssessments on huwiki (T191697) (duration: 01m 17s) [22:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:56] T191697: Deploy PageAssessments to Hungarian Wikipedia - https://phabricator.wikimedia.org/T191697 [22:25:29] (03CR) 10Dzahn: [C: 032] Update ORES venv path to use versioned cache [puppet] - 10https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) (owner: 10Awight) [22:26:09] labs-project-dashiki I’m not too sure, mutante, but we call our custom dashboarding tool dashiki: https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Dashboards [22:26:11] (03CR) 10Pnorman: [C: 031] "I double-checked, and the admin tables are currently owned by postgres, so this is okay" [puppet] - 10https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) (owner: 10Gehel) [22:28:06] !log ores - running puppet on all instances to apply venv path change for T181071 [22:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:14] T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071 [22:28:17] (03PS2) 10Samwilson: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425466 [22:28:36] mutante: Thanks! I see that ores1001 already has the updated config, so I’ll try a canary restart. 
[22:28:39] awight: running puppet on all of them ..applying the config change (not restarting) [22:28:48] awight: yes, i did it on 1001 manually and now on all [22:29:08] milimetric: thanks! ok [22:29:16] mutante: ah, you already restarted I see [22:29:40] awight: no, i did not. then puppet did it [22:29:46] yes, puppet [22:30:18] !! ok that’s odd, it must have been a dependency within the systemd module. [22:30:19] !log ores - all eqiad instances are being restarted by puppet after config change [22:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:39] indeed [22:30:40] Info: Base::Service_unit[uwsgi-ores]: Scheduling refresh of Exec[systemd reload for uwsgi-ores] [22:30:46] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/425718 [22:31:27] Wow, it worked. [22:31:31] pphew :) [22:31:44] mutante: Last thing to bug you about, how do you suggest we clean up the now-unused virtualenv path? 
[22:31:45] (03CR) 10MaxSem: [C: 032] Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425466 (owner: 10Samwilson) [22:31:47] (03PS2) 10Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/425718 [22:31:47] there is still codfw to go and eqiad are 6/9 [22:31:55] (03CR) 10Samwilson: [V: 032] Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425466 (owner: 10Samwilson) [22:31:57] nice [22:33:00] (03Merged) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425466 (owner: 10Samwilson) [22:33:14] (03CR) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (second time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425466 (owner: 10Samwilson) [22:33:39] awight: ok, now it's actually done on eqiad. 100.0% (9/9) success ratio [22:33:45] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (10RobH) I'm not sure why L2 was listed as a requirement? The phabricator NDA doesn't really work into any workflow that I'm aware of, we require an NDA on... [22:33:45] doing codfw [22:36:01] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4125249 (10RStallman-legalteam) @Matthias_Geisler_WMDE - will reach out to you via email in the next day or so with the NDA. Thanks! [22:36:10] awight: i suggest: i run: rm -rf /srv/deployment/ores/venv on ores1001 and then you restart it again? [22:36:38] and then i run that on all [22:36:59] mutante: That sounds right to me. 
[22:37:29] !log ores - same for codfw instances, change of venv path to /srv/deployment/ores/deploy/venv/ [22:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:40] !log ores1001 - rm -rf /srv/deployment/ores/venv/ [22:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:13] awight: done on 1001 [22:38:58] I’ll restart services there. [22:39:42] mutante: Still healthy. [22:40:07] awight: great, will delete it on all eqiad. codfw is still running puppet [22:41:40] !log ores1002-1009 - deleting old venv dir - rm -f /srv/deployment/ores/venv (T181071) [22:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:46] T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071 [22:42:41] awight: done.. eh.. i just happened to see this: https://phabricator.wikimedia.org/rORESDEPLOYadc5c06417290c9980cc2b35d599c6da13ea24c6 [22:42:48] what about that "install libs in old path" [22:43:06] (03PS1) 10Ladsgroup: Stop logging autopatrol actions everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) [22:44:53] mutante: That’s the last piece of the migration, it stops us from rebuilding /srv/deployment/ores/venv [22:45:07] but I think you grokked that, maybe I missed the question? [22:46:18] awight: ah. yea so then "venv_old" is already gone. then i get it. [22:46:47] It’s awesome to finally get to this point! Such a simple change, but with many breakable pieces... 
[22:47:39] !log ores2* - puppet ran to change venv config, then 'rm -rf /srv/deployment/ores/venv/' via cumin to clean-up (T181071) [22:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:44] T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071 [22:47:49] awight: ^ that's it, you can restart it all [22:47:50] !log samwilson@tin Synchronized wmf-config/InitialiseSettings.php: Deploy GlobalPreferences T184121 (duration: 01m 17s) [22:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:55] T184121: Deploy checklist for GlobalPreferences on production - https://phabricator.wikimedia.org/T184121 [22:49:44] oops, missed ores1009 because it had been reinstalled or something.. also done now [22:49:51] mutante: wicked. ok, doing so. [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180411T2300). [23:00:05] RoanKattouw and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:17] Hello, I can SWAT this evening. [23:00:20] o/ [23:00:27] But first let's check if samwilson is done [23:01:01] Dereckson: yep, we're all done for now [23:01:05] Thanks [23:01:50] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [23:02:41] Dereckson: not testable. 
It's config switch that has been working for a while though [23:03:05] (03Merged) 10jenkins-bot: Stop logging autopatrol actions everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [23:04:02] mutante: Everything has been restarted. Thanks for your time! [23:07:33] Amir1: it seems there isn't any mayhem on logs (enabled on mwdebug1002) [23:07:46] so yes, it seems indeed fine to sync [23:08:11] cool [23:09:53] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Stop logging autopatrol actions everywhere (T184485) (duration: 01m 18s) [23:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:59] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485 [23:10:33] Thanks! [23:11:00] (03PS2) 10Dereckson: Allow sysops to create Flow boards on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: 10Catrope) [23:13:56] RoanKattouw: ping? [23:14:26] awight: :) welcome [23:14:31] Here [23:14:57] Hello, let's SWAT sysop flow permission to create board for eu. [23:15:29] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: 10Catrope) [23:16:43] (03Merged) 10jenkins-bot: Allow sysops to create Flow boards on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: 10Catrope) [23:17:03] Sure [23:17:10] lmk when it's on mwdebug1002 [23:17:11] If you've an eu. 
admin available, they can test on mwdebug1002 if it works (if not we can check https://eu.wikipedia.org/wiki/Berezi:ListGroupRights) [23:17:21] just now [23:17:31] I'll just check the special page [23:17:57] The special page looks good, that's good enough for me [23:19:44] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Allow sysops to create Flow boards on euwiki (T190500) (duration: 01m 17s) [23:19:44] Syncing [23:19:48] Synced [23:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:50] T190500: Enable Extension:StructuredDiscussions in Basque Wikipedia - https://phabricator.wikimedia.org/T190500 [23:21:27] Thanks! [23:21:32] You're welcome :) [23:29:10] (03CR) 10Dereckson: [C: 031] "The namespaced classes have been introduced in 5.4 and backported to 4.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: 10Umherirrender) [23:33:12] (03CR) 10jenkins-bot: Stop logging autopatrol actions everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425719 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [23:33:16] (03CR) 10jenkins-bot: Allow sysops to create Flow boards on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425594 (https://phabricator.wikimedia.org/T190500) (owner: 10Catrope) [23:53:18] (03PS1) 10MaxSem: Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425723 [23:54:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425723 (owner: 10MaxSem) [23:55:00] (03CR) 10MaxSem: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425723 (owner: 10MaxSem) [23:56:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/425723 (owner: 10MaxSem) [23:56:27] (03CR) 10jerkins-bot: [V: 04-1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425723 (owner: 10MaxSem) [23:58:17] (03PS2) 10MaxSem: Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425723 [23:59:27] (03CR) 10jerkins-bot: [V: 04-1] Revert "Deploy GlobalPreferences to test wikis and mw.org (second time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425723 (owner: 10MaxSem) [23:59:40] what's wrong with jerkins?