[00:00:45] !log T224850 repooled labsdb1009 after completing view updates
[00:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:49] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove unused preference T47877-buster (duration: 00m 47s)
[00:00:49] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850
[00:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:54] T47877: During deployment old servers may populate new cache URIs - https://phabricator.wikimedia.org/T47877
[00:02:08] (CR) Volans: [C: +1] "LGTM, thanks for finishing it!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514459 (owner: Jbond)
[00:04:13] (CR) Volans: [C: +2] "LGTM" (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/514968 (owner: CDanis)
[00:06:24] (PS1) BryanDavis: systemd::timer::job: always provision NRPE plugin [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268)
[00:06:46] (Merged) jenkins-bot: dbctl config: remove comment cruft [software/conftool] - https://gerrit.wikimedia.org/r/514968 (owner: CDanis)
[00:11:06] (CR) Volans: dbconfig: save live config before updating it (2 comments) [software/conftool] - https://gerrit.wikimedia.org/r/514870 (owner: Volans)
[00:11:13] (PS6) Volans: types: do not pre-compile regex in SchemaRule [software/conftool] - https://gerrit.wikimedia.org/r/514544
[00:11:15] (PS6) Volans: dbconfig: use lists of dicts for sectionLoads [software/conftool] - https://gerrit.wikimedia.org/r/514633
[00:11:17] (PS4) Volans: dbconfig: pretty-print get actions [software/conftool] - https://gerrit.wikimedia.org/r/514869
[00:11:19] (PS5) Volans: dbconfig: save live config before updating it [software/conftool] -
https://gerrit.wikimedia.org/r/514870
[00:11:21] (PS2) Volans: dbconfig: add config restore action [software/conftool] - https://gerrit.wikimedia.org/r/514983
[00:12:16] (CR) CDanis: [C: +2] dbconfig: save live config before updating it [software/conftool] - https://gerrit.wikimedia.org/r/514870 (owner: Volans)
[00:21:51] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Krenair) Okay. You might just want to check that something like `mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -h analytics-store.eqiad.wmnet -e 'describe cx_translations;'`...
[00:27:48] (CR) BryanDavis: "https://puppet-compiler.wmflabs.org/compiler1002/16927/cloudcontrol1004.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis)
[01:12:30] (PS1) Smalyshev: Migrate CirrusSearch to extension.json officially [mediawiki-config] - https://gerrit.wikimedia.org/r/514994 (https://phabricator.wikimedia.org/T87892)
[01:22:02] Operations, Wikimedia-Mailing-lists: Mailman: Consider restricting access to members - https://phabricator.wikimedia.org/T225269 (MarkAHershberger)
[01:31:28] Operations, Wikimedia-Mailing-lists: Consider restricting access to list subscriber list - https://phabricator.wikimedia.org/T225269 (Krenair)
[01:33:16] Operations, Wikimedia-Mailing-lists: Consider restricting access to list subscriber list - https://phabricator.wikimedia.org/T225269 (Krenair) Don't most of our lists require people to have a list admin password to read the subscriber list? Do we have any that don't?
[03:17:17] (CR) Andrew Bogott: "running an epic compiler test: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16928/" [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis)
[05:09:28] (PS1) Marostegui: mariadb: db1132, enable notifications [puppet] - https://gerrit.wikimedia.org/r/515003 (https://phabricator.wikimedia.org/T221533)
[05:10:40] (CR) Marostegui: [C: +2] mariadb: db1132, enable notifications [puppet] - https://gerrit.wikimedia.org/r/515003 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:37:16] (PS1) Marostegui: db-codfw.php: Move db2051 from s4 to s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/515004 (https://phabricator.wikimedia.org/T221533)
[05:38:45] (CR) Marostegui: [C: +2] db-codfw.php: Move db2051 from s4 to s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/515004 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:39:36] (Merged) jenkins-bot: db-codfw.php: Move db2051 from s4 to s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/515004 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:39:50] (CR) jenkins-bot: db-codfw.php: Move db2051 from s4 to s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/515004 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:40:48] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Move db2051 from s4 to s2T221533 (duration: 00m 49s)
[05:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:56] (PS1) Marostegui: install_server: Allow reimage db2051 [puppet] - https://gerrit.wikimedia.org/r/515005 (https://phabricator.wikimedia.org/T221533)
[05:48:04] (CR) Marostegui: [C: +2] install_server: Allow reimage db2051 [puppet] - https://gerrit.wikimedia.org/r/515005 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:56:42] (PS1) Marostegui: mariadb: Move db2051 from s4 to s2 [puppet] - https://gerrit.wikimedia.org/r/515006 (https://phabricator.wikimedia.org/T221533)
[06:01:39] (CR) Giuseppe Lavagetto: [C: -1] "Production etcd must only be restarted manually under supervision." [puppet] - https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287) (owner: Arturo Borrero Gonzalez)
[06:03:02] (CR) Giuseppe Lavagetto: [C: -1] [RFC] etcd::ssl: restart etcd service when the SSL cert changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287) (owner: Arturo Borrero Gonzalez)
[06:04:41] (CR) Marostegui: [C: +2] mariadb: Move db2051 from s4 to s2 [puppet] - https://gerrit.wikimedia.org/r/515006 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[06:10:45] (CR) Giuseppe Lavagetto: [C: +2] types: do not pre-compile regex in SchemaRule (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/514544 (owner: Volans)
[06:13:14] (Merged) jenkins-bot: types: do not pre-compile regex in SchemaRule [software/conftool] - https://gerrit.wikimedia.org/r/514544 (owner: Volans)
[06:29:53] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:38:38] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[06:44:25] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 18 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/GlobalSign_Organization_Validation_CA_-_SHA256_-_G2.crt]
[06:47:54] (CR) Giuseppe Lavagetto: [C: +1] "Some minor comments, but lgtm if you prefer this route."
(5 comments) [software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans)
[06:55:11] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:04] (PS2) Mobrovac: RESTBase: Remove restbase10(0[7-9]|1[0-5]) and set them as spares [puppet] - https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976)
[07:06:23] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (Marostegui)
[07:09:52] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (Marostegui)
[07:29:17] !log Drop unused temporary test tables on db1111 and db1112
[07:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:18] (PS1) Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257)
[07:42:19] Operations, Traffic, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (ema) Thanks @Legoktm and @Quiddity!
[07:42:28] (PS8) Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg [puppet] - https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244)
[07:44:45] (PS2) Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257)
[07:57:30] Operations, Maps: reimage of maps2002 fails to "run preseeded command" - https://phabricator.wikimedia.org/T225238 (Gehel)
[07:57:32] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (Gehel)
[08:03:12] (CR) Thiemo Kreuz (WMDE): [C: +1] dologmsg: fix variable [puppet] - https://gerrit.wikimedia.org/r/511750 (owner: Lucas Werkmeister (WMDE))
[08:04:16] Operations, ops-eqiad, Analytics, netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ayounsi) Great, over to @Cmjohnson then!
[08:12:29] !log upgrading certbot in wikitech-static
[08:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:11] ooohhh nice thanks
[08:16:10] hopefully I won't break anything
[08:17:33] :-)
[08:19:04] !log remove BGP session to AS55658 on cr1-eqsin (left the IXP)
[08:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:08] !log Upgrade s2 codfw to 10.1.39 in preparation for its codfw failover - T221533
[08:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:13] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[08:25:56] (PS1) Ema: cache: reimage cp3043 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515011 (https://phabricator.wikimedia.org/T222937)
[08:26:32] ema: if you are going to reimage, be aware of T225278
[08:26:33] T225278: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278
[08:26:48] marostegui: ah, thanks!
[08:28:08] I'll wait for jbond42|away to come online and take a look before reimaging then
[08:28:18] :)
[08:46:41] (CR) Alexandros Kosiaris: [C: +2] Set LVS eventgate-* service to critical: true [puppet] - https://gerrit.wikimedia.org/r/514575 (owner: Ottomata)
[08:46:43] !log start the reboot of the Analytics Hadoop's worker nodes for kernel+openjdk upgrades
[08:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:48] (PS3) Alexandros Kosiaris: Set LVS eventgate-* service to critical: true [puppet] - https://gerrit.wikimedia.org/r/514575 (owner: Ottomata)
[08:46:54] (CR) Alexandros Kosiaris: [V: +2 C: +2] Set LVS eventgate-* service to critical: true [puppet] - https://gerrit.wikimedia.org/r/514575 (owner: Ottomata)
[08:50:49] Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (S.piccardi) I only signed up Mediawiki to file a bug report. I don't know anything about this, nor about homes at my name. I have no idea why there are 11G of data somewere, which da...
[08:53:34] Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (elukey) @S.piccardi sorry my bad! There is another person with your surname that works with Analytics, and I added the wrong phab username. Apologies :)
[08:56:22] (CR) Arturo Borrero Gonzalez: [C: +1] "The patch and PCC LGTM. A comment inline though."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis)
[09:00:40] !log Upgrade x1 codfw hosts in preparation for its failover T220170
[09:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:45] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[09:00:46] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. Please have @andrew review this as well and merge if appropriate (he is probably more familiar with this code)." [puppet] - https://gerrit.wikimedia.org/r/513752 (https://phabricator.wikimedia.org/T224708) (owner: Alex Monk)
[09:02:10] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but please get @Andrew to review and merge as well, since he will probably be more familiar with this code." [puppet] - https://gerrit.wikimedia.org/r/513909 (https://phabricator.wikimedia.org/T224708) (owner: Alex Monk)
[09:07:32] (CR) Arturo Borrero Gonzalez: [C: +1] "Same as in the other patches: LGTM, but please have andrew in the loop." [puppet] - https://gerrit.wikimedia.org/r/513910 (https://phabricator.wikimedia.org/T224708) (owner: Alex Monk)
[09:08:53] Operations, ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (ayounsi) Some PSUs humidity alerts were flapping so I disabled them. https://librenms.wikimedia.org/device/device=107/tab=edit/section=health/ https://librenms.wikimedia.org/device/device=108/tab=edit/section=heal...
[09:09:08] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but please get andrew in the loop (and possibly Bryan as well)."
[puppet] - https://gerrit.wikimedia.org/r/513911 (https://phabricator.wikimedia.org/T224708) (owner: Alex Monk)
[09:16:18] !log mobrovac@deploy1001 scap-helm mathoid upgrade production stable/mathoid -f mathoid-values.yaml [namespace: mathoid, clusters: eqiad,codfw]
[09:16:21] !log mobrovac@deploy1001 scap-helm mathoid cluster eqiad completed
[09:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:23] !log mobrovac@deploy1001 scap-helm mathoid cluster codfw completed
[09:16:23] !log mobrovac@deploy1001 scap-helm mathoid finished
[09:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:17] (PS1) Jbond: install: only enable facter/puppet components on test hosts [puppet] - https://gerrit.wikimedia.org/r/515017
[09:34:04] (PS1) Jcrespo: labsdb: Depool labsdb1010 to proceed with compression [puppet] - https://gerrit.wikimedia.org/r/515018 (https://phabricator.wikimedia.org/T222978)
[09:38:03] (PS2) Arturo Borrero Gonzalez: etcd::ssl: restart etcd service when the SSL cert changes [puppet] - https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287)
[09:38:52] (CR) Muehlenhoff: [C: +1] install: only enable facter/puppet components on test hosts [puppet] - https://gerrit.wikimedia.org/r/515017 (owner: Jbond)
[09:40:01] (CR) Jbond: [C: +2] install: only enable facter/puppet components on test hosts [puppet] - https://gerrit.wikimedia.org/r/515017 (owner: Jbond)
[09:40:50] (CR) Volans: "Replies inline" (5 comments) [software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans)
[09:41:26] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (jbond) Sorry for this i have just pushed
https://gerrit.wikimedia.org/r/c/operations/puppet/+/515017 so this should only be broken for sarin and neodyium now, sorry for the interruption and ping me if you still s...
[09:42:40] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (Marostegui) >>! In T225278#5242570, @jbond wrote: > Sorry for this i have just pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/515017 so this should only be broken for sarin and neodyium now, sorry fo...
[09:43:25] (PS3) Arturo Borrero Gonzalez: etcd::ssl: restart etcd service when the SSL cert changes [puppet] - https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287)
[09:44:22] (CR) jerkins-bot: [V: -1] etcd::ssl: restart etcd service when the SSL cert changes [puppet] - https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287) (owner: Arturo Borrero Gonzalez)
[09:45:10] (PS1) Giuseppe Lavagetto: mediawiki::php::monitoring: use runbook links in notes_url [puppet] - https://gerrit.wikimedia.org/r/515019
[09:49:49] !log upload libleatherman1.4.0_1.4.0+dfsg-1~bpo8+1_amd64.deb to wikimedia-jessie component/facter3
[09:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:38] (PS4) Arturo Borrero Gonzalez: etcd::ssl: restart etcd service when the SSL cert changes [puppet] - https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287)
[09:51:17] (PS1) Vgutierrez: redirects.dat: Get rid of redundant wikiipedia.org entries [puppet] - https://gerrit.wikimedia.org/r/515020 (https://phabricator.wikimedia.org/T224539)
[09:52:48] <_joe_> vgutierrez: having fun heh?
[09:53:01] sure
[09:57:57] Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2002.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906070957_gehel_3...
[09:58:52] (PS1) Vgutierrez: redirects.dat: Remove redirections for invalid DNS entries [puppet] - https://gerrit.wikimedia.org/r/515022 (https://phabricator.wikimedia.org/T224539)
[09:59:15] !log upload libleatherman1.4.0_1.4.0+dfsg-1~bpo9+1_amd64.deb to wikimedia-stretch component/facter3
[09:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:21] Operations, ops-codfw, decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['sarin.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906071003_jbond_74...
[10:04:12] (PS1) Vgutierrez: redirects.dat: Remove redundant wikipedia.com rules [puppet] - https://gerrit.wikimedia.org/r/515024 (https://phabricator.wikimedia.org/T224539)
[10:04:15] (PS1) Vgutierrez: redirects.dat: Remove redundant wikipedia.net redirections [puppet] - https://gerrit.wikimedia.org/r/515025 (https://phabricator.wikimedia.org/T224539)
[10:04:17] (PS1) Vgutierrez: redirects.dat: Remove redundant rules for wiktionary.com [puppet] - https://gerrit.wikimedia.org/r/515026 (https://phabricator.wikimedia.org/T224539)
[10:04:51] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php::monitoring: use runbook links in notes_url [puppet] - https://gerrit.wikimedia.org/r/515019 (owner: Giuseppe Lavagetto)
[10:05:17] (CR) Vgutierrez: [C: +1] cache: reimage cp3043 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515011 (https://phabricator.wikimedia.org/T222937) (owner: Ema)
[10:06:22] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (Marostegui) @jbond the install worked fine this time. Thanks for the fix. I will leave up to you if you want to close this task as resolved or leave it open until you've found the proper fix. Thanks!
[10:07:30] Operations: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278 (jbond) Awesome, ill keep the ticket open so people now im playing with the other two servers
[10:09:44] <_joe_> !log restarting php-fpm on the codfw hosts to pick up the recent changes in opcache
[10:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:51] (PS2) Jcrespo: labsdb: Depool labsdb1010 to proceed with compression [puppet] - https://gerrit.wikimedia.org/r/515018 (https://phabricator.wikimedia.org/T222978)
[10:11:57] (CR) Jcrespo: [C: +2] labsdb: Depool labsdb1010 to proceed with compression [puppet] - https://gerrit.wikimedia.org/r/515018 (https://phabricator.wikimedia.org/T222978) (owner: Jcrespo)
[10:12:08] (PS3) Vgutierrez: redirects.dat: Ban using .*. [puppet] - https://gerrit.wikimedia.org/r/513142 (https://phabricator.wikimedia.org/T133548)
[10:13:55] (CR) Vgutierrez: [C: +2] redirects.dat: Ban using .*. [puppet] - https://gerrit.wikimedia.org/r/513142 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[10:14:07] (PS1) Ema: varnish: bump general upload rate limits [puppet] - https://gerrit.wikimedia.org/r/515027 (https://phabricator.wikimedia.org/T224884)
[10:16:25] (CR) Ema: [C: +1] redirects.dat: Get rid of redundant wikiipedia.org entries [puppet] - https://gerrit.wikimedia.org/r/515020 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[10:16:27] ema: looks like T225278 is fixed
[10:16:27] T225278: Installation failing on late_command.sh - https://phabricator.wikimedia.org/T225278
[10:16:35] ema: my reimage worked, just fyi
[10:16:43] marostegui: wonderful, thank you!
[10:17:23] (CR) Ema: [C: +1] redirects.dat: Remove redirections for invalid DNS entries [puppet] - https://gerrit.wikimedia.org/r/515022 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[10:17:49] (CR) Ema: [C: +1] redirects.dat: Remove redundant wikipedia.com rules [puppet] - https://gerrit.wikimedia.org/r/515024 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[10:18:27] (CR) Ema: [C: +1] redirects.dat: Remove redundant wikipedia.net redirections [puppet] - https://gerrit.wikimedia.org/r/515025 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[10:18:38] (CR) Ema: [C: +1] redirects.dat: Remove redundant rules for wiktionary.com [puppet] - https://gerrit.wikimedia.org/r/515026 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[10:18:45] (CR) Vgutierrez: [C: +1] varnish: bump general upload rate limits [puppet] - https://gerrit.wikimedia.org/r/515027 (https://phabricator.wikimedia.org/T224884) (owner: Ema)
[10:19:41] (PS2) Ema: varnish: bump general upload rate limits [puppet] - https://gerrit.wikimedia.org/r/515027 (https://phabricator.wikimedia.org/T224884)
[10:22:00] (CR) Ema: [C: +2] varnish: bump general upload rate limits [puppet] - https://gerrit.wikimedia.org/r/515027 (https://phabricator.wikimedia.org/T224884) (owner: Ema)
[10:24:09] (PS1) Alexandros Kosiaris: termbox: Use newer ENV variables [deployment-charts] - https://gerrit.wikimedia.org/r/515028 (https://phabricator.wikimedia.org/T220402)
[10:24:11] (PS1) Alexandros Kosiaris: Add termbox-0.0.2.tgz [deployment-charts] - https://gerrit.wikimedia.org/r/515029
[10:30:18] (CR) Elukey: [C: +1] systemd::timer::job: always provision NRPE plugin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis)
[10:31:31] Operations, Wikimedia-Mailing-lists: Consider restricting access to list
subscriber list - https://phabricator.wikimedia.org/T225269 (Aklapper) @MarkAHershberger: As far as I know, access to subscribers lists has been and is "restricted". Do you know specific mailing lists where this is not the case, or...
[10:31:45] Operations, Wikimedia-Mailing-lists: Consider restricting access to list subscriber list - https://phabricator.wikimedia.org/T225269 (Aklapper) Open→Stalled
[10:39:06] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) @tarrow, @WMDE-leszek Hi, sorry for taking so long to answer to this, it's been really busy. >>! In T220402#521447...
[10:40:01] (CR) Giuseppe Lavagetto: [C: +1] systemd::timer::job: always provision NRPE plugin [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis)
[10:43:12] !log depool cp3043 and reimage as upload_ats T222937
[10:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:17] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[10:43:39] (PS2) Ema: cache: reimage cp3043 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515011 (https://phabricator.wikimedia.org/T222937)
[10:45:43] !log upload libleatherman-data_1.4.0+dfsg-1\~bpo9+1_all.deb to wikimedia-stretch component/facter3
[10:45:46] (CR) Ema: [C: +2] cache: reimage cp3043 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515011 (https://phabricator.wikimedia.org/T222937) (owner: Ema)
[10:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0]
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:50:56] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3043.esams.wmnet'] ` The log can be found in `...
[10:51:14] ACKNOWLEDGEMENT - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1076_v4, cp1076_v6, cp2002_v4, cp2002_v6, cp2020_v4, cp2020_v6, cp2026_v4, cp2026_v6 Ema reimaging cp3043 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[10:51:40] !log upload libcpp-hocon0.1.6_0.1.6-1~bpo9+1_amd64.deb to wikimedia-stretch component/facter3
[10:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:48] (CR) Bartosz Dziewoński: "Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/514193 (owner: Bartosz Dziewoński)
[11:01:59] Operations, ops-codfw, decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sarin.codfw.wmnet'] ` Of which those **FAILED**: ` ['sarin.codfw.wmnet'] `
[11:05:17] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp3043_v4, cp3043_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:08:01] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 30 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:10:30] (PS1) Jcrespo: wikireplica: Depool actually labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/515036 (https://phabricator.wikimedia.org/T222978)
[11:10:54] (PS2) Jcrespo: wikireplica: Depool actually labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/515036 (https://phabricator.wikimedia.org/T222978)
[11:11:55] (CR) Jcrespo: [C: +2] wikireplica: Depool actually labsdb1010 [puppet] -
https://gerrit.wikimedia.org/r/515036 (https://phabricator.wikimedia.org/T222978) (owner: Jcrespo)
[11:13:35] (CR) Muehlenhoff: Allow Hadoop-related profiles to deploy Kerberos keytabs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[11:14:15] (PS1) Volans: dbconfig: honor scope in config get [software/conftool] - https://gerrit.wikimedia.org/r/515037
[11:17:23] (CR) Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[11:21:49] (CR) Muehlenhoff: Allow Hadoop-related profiles to deploy Kerberos keytabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[11:23:05] (CR) Elukey: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[11:23:10] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[11:30:28] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3043.esams.wmnet'] ` and were **ALL** successful.
[11:31:24] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[11:38:33] jouncebot: Nemo_bis
[11:38:36] err sorry
[11:38:38] jouncebot: next
[11:38:39] In 60 hour(s) and 21 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190610T0000)
[11:39:09] :)
[11:41:45] (PS1) Jbond: install_server - late_command: mask the puppet service [puppet] - https://gerrit.wikimedia.org/r/515042
[11:43:01] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Tarrow) @akosiaris Thanks! Now https://gerrit.wikimedia.org/r/c/wikibase/termbox/+/515040 is merged the Healthcheck query shoul...
[11:43:15] (CR) Jbond: [C: +2] install_server - late_command: mask the puppet service [puppet] - https://gerrit.wikimedia.org/r/515042 (owner: Jbond)
[11:45:04] Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2002.codfw.wmnet'] ` and were **ALL** successful.
[11:45:40] !log pool cp3043 w/ ATS backend T222937
[11:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:45] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[12:13:47] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) Indeed this was fixed. However another regression has crept up it's head Doing a `curl 'http://192.168.99.100:18788...
[12:20:48] (CR) Alexandros Kosiaris: [C: -1] Remove trailing slash in base path (1 comment) [software/service-checker] - https://gerrit.wikimedia.org/r/514681 (owner: Mobrovac)
[12:35:23] (PS1) Jbond: wmf_auto_reimage: improve fingerprint detection [puppet] - https://gerrit.wikimedia.org/r/515051
[12:37:32] (CR) Muehlenhoff: install_server - late_command: mask the puppet service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515042 (owner: Jbond)
[12:38:03] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[12:40:25] (CR) Jbond: install_server - late_command: mask the puppet service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515042 (owner: Jbond)
[12:43:30] (CR) Volans: "I'm not against the change, but I'd like to have more info on when/how this can happen as in a systemd world we generate the certificate a" [puppet] - https://gerrit.wikimedia.org/r/515051 (owner: Jbond)
[12:47:21] (CR) Alexandros Kosiaris: "> My main motivation for this right now was to set up a beta cluster instance. Are there docs on how to do this with a new service on k8s?"
[puppet] - https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) (owner: Mholloway) [12:49:29] (CR) CDanis: [C: +2] dbconfig: honor scope in config get [software/conftool] - https://gerrit.wikimedia.org/r/515037 (owner: Volans) [12:49:49] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/515051 (owner: Jbond) [12:56:17] (PS2) Arturo Borrero Gonzalez: systemd::timer::job: always provision NRPE plugin [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis) [12:56:29] (Abandoned) Hashar: shinken: add basic spec [puppet] - https://gerrit.wikimedia.org/r/497253 (owner: Hashar) [12:56:42] (Abandoned) Hashar: Honor absolute paths in .dockerignore [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/484547 (https://phabricator.wikimedia.org/T183546) (owner: Hashar) [12:58:56] (CR) Arturo Borrero Gonzalez: [C: +2] systemd::timer::job: always provision NRPE plugin [puppet] - https://gerrit.wikimedia.org/r/514988 (https://phabricator.wikimedia.org/T225268) (owner: BryanDavis) [13:00:06] (PS3) Hashar: cassandra: fix spec service provider [puppet] - https://gerrit.wikimedia.org/r/503996 [13:00:45] (CR) Hashar: [C: +1] "With puppet (4.8.2)" [puppet] - https://gerrit.wikimedia.org/r/503996 (owner: Hashar) [13:01:49] (PS6) Hashar: Rake: honor rubocop AllCops/Excludes [puppet] - https://gerrit.wikimedia.org/r/484410 [13:02:00] (PS8) Hashar: rsync: readd incoming and outgoing chmod [puppet] - https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) [13:02:02] (CR) CDanis: [C: +2] dbconfig: use lists of dicts for sectionLoads (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans) [13:02:12] (PS1) Filippo Giunchedi: hieradata: add netbox swift dummy credentials [labs/private] - https://gerrit.wikimedia.org/r/515057 
[13:02:35] (PS1) Filippo Giunchedi: hieradata: add netbox swift user [puppet] - https://gerrit.wikimedia.org/r/515058 [13:03:19] !log aborrero@cumin1001:~$ sudo cumin "P{R:Systemd::Timer::Job}" "puppet agent --disable 'arturo merging systemd timer nrpe change'" (19 hosts affected) merging: https://gerrit.wikimedia.org/r/c/operations/puppet/+/514988 [13:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:42] (CR) Filippo Giunchedi: [V: +2 C: +2] hieradata: add netbox swift dummy credentials [labs/private] - https://gerrit.wikimedia.org/r/515057 (owner: Filippo Giunchedi) [13:04:48] (Merged) jenkins-bot: dbconfig: use lists of dicts for sectionLoads [software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans) [13:04:50] (Merged) jenkins-bot: dbconfig: pretty-print get actions [software/conftool] - https://gerrit.wikimedia.org/r/514869 (owner: Volans) [13:04:52] (Merged) jenkins-bot: dbconfig: save live config before updating it [software/conftool] - https://gerrit.wikimedia.org/r/514870 (owner: Volans) [13:04:58] !log aborrero@cumin1001:~ $ sudo cumin "P{R:Systemd::Timer::Job}" "puppet agent --enable && run-puppet-agent" (patch already merged) [13:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:10] (Merged) jenkins-bot: dbconfig: add config restore action [software/conftool] - https://gerrit.wikimedia.org/r/514983 (owner: Volans) [13:05:12] (Merged) jenkins-bot: dbconfig: honor scope in config get [software/conftool] - https://gerrit.wikimedia.org/r/515037 (owner: Volans) [13:05:15] (PS3) Mobrovac: Remove trailing slash in base path [software/service-checker] - https://gerrit.wikimedia.org/r/514681 [13:06:06] (CR) Mobrovac: Remove trailing slash in base path (1 comment) [software/service-checker] - https://gerrit.wikimedia.org/r/514681 (owner: Mobrovac) [13:06:35] (PS1) Vgutierrez: Park 
wiki-pedia.org with the same config as wikipedia.org [dns] - https://gerrit.wikimedia.org/r/515059 [13:07:37] (CR) Volans: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/515051 (owner: Jbond) [13:07:44] (Abandoned) Ema: cache: reimage cp3043 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/514407 (https://phabricator.wikimedia.org/T222937) (owner: BBlack) [13:08:03] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/16933/" [puppet] - https://gerrit.wikimedia.org/r/515058 (owner: Filippo Giunchedi) [13:10:19] Operations, ops-eqiad, decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['neodymium.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906071310_... [13:10:27] (PS1) Ema: cache: reiamge cp3039 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515061 (https://phabricator.wikimedia.org/T222937) [13:11:01] (CR) Ema: [C: +1] Park wiki-pedia.org with the same config as wikipedia.org [dns] - https://gerrit.wikimedia.org/r/515059 (owner: Vgutierrez) [13:11:30] (CR) Alexandros Kosiaris: [C: +2] Remove trailing slash in base path [software/service-checker] - https://gerrit.wikimedia.org/r/514681 (owner: Mobrovac) [13:12:04] arturo: FYI pro-tips: 1) P{} is not needed for a simple query given that puppetdb is the default backendbin prod. 2) run-puppet-agent has the --enable option :-) [13:12:18] TIL!! 
[13:12:23] (PS2) Ema: cache: reimage cp3039 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515061 (https://phabricator.wikimedia.org/T222937) [13:13:02] (PS3) Hashar: swift: hiera-ize object-replicator interval [puppet] - https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) [13:13:04] (PS3) Hashar: beta: tweak swift replicator [puppet] - https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) [13:13:06] (PS2) Hashar: swift: hiera-ize object server number of workers [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) [13:13:08] (PS3) Hashar: beta: lower swift server workers [puppet] - https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) [13:13:10] (PS2) Hashar: swift: hierarize container_replicator settings [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) [13:13:12] (PS2) Hashar: beta: slow down swift container replication [puppet] - https://gerrit.wikimedia.org/r/513063 (https://phabricator.wikimedia.org/T160990) [13:13:31] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [13:13:34] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [13:13:36] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [13:14:37] (CR) Vgutierrez: [C: +1] cache: reimage cp3039 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515061 (https://phabricator.wikimedia.org/T222937) (owner: Ema) [13:15:48] !log depool cp3039 and reimage as upload_ats T222937 [13:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:53] T222937: Replace 
Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [13:15:55] (CR) Ema: [C: +2] cache: reimage cp3039 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/515061 (https://phabricator.wikimedia.org/T222937) (owner: Ema) [13:17:51] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3039.esams.wmnet'] ` The log can be found in `... [13:17:55] (PS1) Jhedden: wiki replicas: unfilter deleted rev_len versions [puppet] - https://gerrit.wikimedia.org/r/515062 (https://phabricator.wikimedia.org/T101631) [13:18:12] (PS1) Jcrespo: WMFBackup: Increase xtrabackup memory use to 20GB [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515063 (https://phabricator.wikimedia.org/T206203) [13:18:14] (PS1) Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) [13:18:42] (CR) jerkins-bot: [V: -1] WMFBackup: Increase xtrabackup memory use to 20GB [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515063 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo) [13:18:44] (CR) jerkins-bot: [V: -1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo) [13:21:24] (CR) Muehlenhoff: [C: +1] install_server - late_command: mask the puppet service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515042 (owner: Jbond) [13:22:28] (PS1) Alexandros Kosiaris: sessionstore: Add discovery records [dns] - https://gerrit.wikimedia.org/r/515065 [13:23:00] (PS2) Alexandros Kosiaris: sessionstore: Add 
discovery records [dns] - https://gerrit.wikimedia.org/r/515065 (https://phabricator.wikimedia.org/T220401) [13:23:59] (CR) Alexandros Kosiaris: [C: +2] sessionstore: Add discovery records [dns] - https://gerrit.wikimedia.org/r/515065 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris) [13:25:04] PROBLEM - Check systemd state on db2101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:27:38] I will check that [13:28:32] what was it? [13:28:32] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) `sessionstore.discovery.wmnet` is now around and should be the canonical DNS used to address the service. [13:28:52] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3039_v4, cp3039_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:28:58] jynus: I am checking [13:29:45] it is expecting a mariadb service [13:29:51] when it only has an x1 [13:30:01] ah ok, you are checking then? [13:30:30] well, it took you too much time! [13:30:33] :-) [13:30:59] what is db2101? [13:32:06] I am guessing a database source [13:32:24] it needs disable of the service, do you want to do it or should I? [13:32:33] you do it [13:32:38] ok [13:32:42] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3039_v4, cp3039_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:32:42] (PS1) Alexandros Kosiaris: Assign termbox.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - https://gerrit.wikimedia.org/r/515067 (https://phabricator.wikimedia.org/T220402) [13:32:44] (PS1) Alexandros Kosiaris: fixup! 
Assign termbox.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - https://gerrit.wikimedia.org/r/515068 [13:32:46] (PS1) Alexandros Kosiaris: Enable discovery for termbox [dns] - https://gerrit.wikimedia.org/r/515069 (https://phabricator.wikimedia.org/T220402) [13:33:32] (PS2) Alexandros Kosiaris: Assign termbox.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - https://gerrit.wikimedia.org/r/515067 (https://phabricator.wikimedia.org/T220402) [13:33:34] (PS2) Alexandros Kosiaris: Enable discovery for termbox [dns] - https://gerrit.wikimedia.org/r/515069 (https://phabricator.wikimedia.org/T220402) [13:33:46] (Abandoned) Alexandros Kosiaris: fixup! Assign termbox.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - https://gerrit.wikimedia.org/r/515068 (owner: Alexandros Kosiaris) [13:33:48] marostegui: done [13:33:57] ok [13:34:02] RECOVERY - Check systemd state on db2101 is OK: OK - running: The system is fully operational [13:34:09] (CR) Alexandros Kosiaris: [C: -1] "-1 as a precaution against having it merged before LVS changes are merged" [dns] - https://gerrit.wikimedia.org/r/515069 (https://phabricator.wikimedia.org/T220402) (owner: Alexandros Kosiaris) [13:34:14] did you restart or do maintenance there or something? 
[13:34:24] it is weird it randomly gives that error [13:34:26] (CR) Alexandros Kosiaris: [C: +1] Assign termbox.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - https://gerrit.wikimedia.org/r/515067 (https://phabricator.wikimedia.org/T220402) (owner: Alexandros Kosiaris) [13:34:34] I upgraded mariadb@x1 in the morning [13:34:42] ah, then expected [13:34:44] no problem [13:34:54] maybe something else had done something [13:34:58] I know now [13:35:04] on mariadb package install [13:35:11] that's why I said I would check it [13:35:11] the service may get enabled [13:35:36] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/515051 (owner: Jbond) [13:35:40] yeah, but in a way it is my fault [13:35:54] for having a bad package [13:36:28] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:38:32] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 28 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [13:43:04] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:43:45] (PS1) Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203) [13:47:47] (PS2) Alexandros Kosiaris: blubberoid: Don't page on LVS failures [puppet] - https://gerrit.wikimedia.org/r/514574 [13:47:51] (CR) Alexandros Kosiaris: [V: +2 C: +2] blubberoid: Don't page on LVS failures [puppet] - https://gerrit.wikimedia.org/r/514574 (owner: Alexandros Kosiaris) [13:49:23] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3039.esams.wmnet'] ` and were **ALL** successful. 
[13:53:22] (CR) Hashar: "The new deployment pipeline does use the Blubberoid service ( via https://blubberoid.wikimedia.org/v1/ ). That is used by the pipeline to" [puppet] - https://gerrit.wikimedia.org/r/514574 (owner: Alexandros Kosiaris) [13:53:30] akosiaris: ^ blubberoid is production! :D [13:53:58] the deployment pipeline does use https://blubberoid.wikimedia.org/v1/ [13:57:34] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3039 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema Sadly known T203272 [13:59:30] (PS1) Alexandros Kosiaris: Introduce termbox LVS configuration [puppet] - https://gerrit.wikimedia.org/r/515074 (https://phabricator.wikimedia.org/T220402) [14:00:56] !log pool cp3039 w/ ATS backend T222937 [14:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:01] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [14:01:48] (CR) Alexandros Kosiaris: [V: +2 C: +2] "OK but it doesn't look like it warrants paging 26 people for a non end-user visible outage. Alerts btw will still be raised in IRC + email" [puppet] - https://gerrit.wikimedia.org/r/514574 (owner: Alexandros Kosiaris) [14:02:04] (PS1) Jbond: puppet agent: mask service [puppet] - https://gerrit.wikimedia.org/r/515075 [14:02:39] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/515074 (https://phabricator.wikimedia.org/T220402) (owner: Alexandros Kosiaris) [14:04:30] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ema) Open→Resolved a:ema cp3039 was the last node in upload@esams still running Varnish. With its upgrade to cp3039, only the defunct cp3037... 
[14:04:50] (PS1) Fsero: ldap_requests: bug: replace cstone with ceec [puppet] - https://gerrit.wikimedia.org/r/515079 [14:05:31] (CR) Fsero: [C: +2] ldap_requests: bug: replace cstone with ceec [puppet] - https://gerrit.wikimedia.org/r/515079 (owner: Fsero) [14:06:14] (CR) Hashar: [V: -1] "Puppet compiler fails:" [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [14:06:17] (Abandoned) Vgutierrez: Park wiki-pedia.org with the same config as wikipedia.org [dns] - https://gerrit.wikimedia.org/r/515059 (owner: Vgutierrez) [14:06:46] (CR) Hashar: [V: -1] "puppet compiler fails :-\" [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [14:08:34] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:08:48] (CR) Bstorm: [C: +1] wiki replicas: unfilter deleted rev_len versions [puppet] - https://gerrit.wikimedia.org/r/515062 (https://phabricator.wikimedia.org/T101631) (owner: Jhedden) [14:11:35] (PS4) Hashar: swift: hiera-ize object-replicator interval [puppet] - https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) [14:12:06] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [14:12:50] (PS3) Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) [14:15:30] (CR) Hashar: "PS3 adds a couple hosts for the puppet compiler suggested by Filippo: ms-fe1005 and ms-be1040" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [14:15:40] (PS4) Hashar: beta: tweak swift replicator [puppet] - 
https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) [14:15:53] (PS3) Hashar: swift: hiera-ize object server number of workers [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) [14:16:01] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [14:17:10] (PS1) Vgutierrez: redirects.dat: Get rid of rules not working due to DNS misconfiguration [puppet] - https://gerrit.wikimedia.org/r/515080 [14:21:09] Operations, ops-eqiad, decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['neodymium.eqiad.wmnet'] ` and were **ALL** successful. [14:21:28] (PS4) Ema: ATS: add hardening features to systemd unit [puppet] - https://gerrit.wikimedia.org/r/510168 [14:24:03] (PS1) Andrew Bogott: mediawiki config: update to support newer MW versions [wikitech-static] - https://gerrit.wikimedia.org/r/515084 [14:24:20] (CR) Andrew Bogott: [V: +2 C: +2] mediawiki config: update to support newer MW versions [wikitech-static] - https://gerrit.wikimedia.org/r/515084 (owner: Andrew Bogott) [14:25:31] (PS5) Ema: ATS: add hardening features to systemd unit [puppet] - https://gerrit.wikimedia.org/r/510168 [14:28:42] (CR) Ema: ATS: add hardening features to systemd unit (3 comments) [puppet] - https://gerrit.wikimedia.org/r/510168 (owner: Ema) [14:31:28] (PS1) Jbond: install - late_command: Ensure correct version of puppet/facter are installed [puppet] - https://gerrit.wikimedia.org/r/515087 [14:31:44] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/510168 (owner: Ema) [14:32:56] (CR) Giuseppe Lavagetto: "I would argue this is more a case of lack of dns configuration than anything else." 
[puppet] - https://gerrit.wikimedia.org/r/515080 (owner: Vgutierrez) [14:37:30] (PS4) Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) [14:42:28] (PS2) Vgutierrez: redirects.dat: Get rid of rules not working due to DNS misconfiguration [puppet] - https://gerrit.wikimedia.org/r/515080 [14:46:57] Operations, netbox: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (crusnov) a:Volans→crusnov [14:49:35] (CR) Fsero: "why did you removed the go.mod and go.sum files? without them we cannot build the package using the vendored folder applied as a patch" [debs/helm-diff] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/512323 (owner: Hashar) [14:54:32] (CR) Bstorm: "That looks right, I think https://puppet-compiler.wmflabs.org/compiler1001/16935/tools-sgebastion-07.tools.eqiad.wmflabs/" [puppet] - https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: Lucas Werkmeister (WMDE)) [14:55:42] (PS9) Bstorm: dologmsg: extract variables from Toolforge dologmsg [puppet] - https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: Lucas Werkmeister (WMDE)) [14:59:07] (CR) Bstorm: [C: +2] dologmsg: extract variables from Toolforge dologmsg [puppet] - https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: Lucas Werkmeister (WMDE)) [15:01:42] (CR) Hashar: [C: -1] swift: hiera-ize object server number of workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [15:02:21] Operations, Wikimedia-Mailing-lists: Consider restricting access to list subscriber list - https://phabricator.wikimedia.org/T225269 (MarkAHershberger) Perhaps because 
I'm an admin on some lists, I can see the subscribers and that is what made me think this might be an issue. Still, using my own mailman... [15:03:33] Operations, Wikimedia-Mailing-lists: Verify that all mailman mailing lists have private_roster=2 - https://phabricator.wikimedia.org/T225269 (MarkAHershberger) [15:04:01] (PS4) Hashar: swift: hiera-ize object server number of workers [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) [15:04:21] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [15:08:50] (CR) Hashar: "> why did you removed the go.mod and go.sum files? without them we cannot build the package using the vendored folder applied as a patch" [debs/helm-diff] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/512323 (owner: Hashar) [15:09:19] !log restart thorium for kernel upgrades [15:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:24] uff reboot [15:09:57] PROBLEM - puppet last run on dns2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[15:12:21] (CR) Hashar: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1001/204/" [puppet] - https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [15:12:33] (PS4) Hashar: beta: lower swift server workers [puppet] - https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) [15:12:47] (PS3) Hashar: swift: hierarize container_replicator settings [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) [15:13:04] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: Hashar) [15:14:55] (CR) Bstorm: "Ah, shoot. There was a typo. I'll fix it." [puppet] - https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: Lucas Werkmeister (WMDE)) [15:16:32] (PS1) Bstorm: dologmsg: fix error in file definition [puppet] - https://gerrit.wikimedia.org/r/515100 [15:17:12] (CR) Bstorm: "This was the error. Sorry I missed it." 
[puppet] - https://gerrit.wikimedia.org/r/515100 (owner: Bstorm) [15:17:26] (CR) jerkins-bot: [V: -1] dologmsg: fix error in file definition [puppet] - https://gerrit.wikimedia.org/r/515100 (owner: Bstorm) [15:17:43] Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (elukey) @tizianopiccardi you were the correct one to ping :) [15:18:14] (PS2) Bstorm: dologmsg: fix error in file definition [puppet] - https://gerrit.wikimedia.org/r/515100 [15:18:50] (CR) Elukey: "Aaron/Timo: whenever you have time let me know if you like the plan and the new change :)" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz) [15:19:50] (CR) Bstorm: [C: +2] dologmsg: fix error in file definition [puppet] - https://gerrit.wikimedia.org/r/515100 (owner: Bstorm) [15:20:59] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (Papaul) @akosiaris All those 10 servers should be in the public VLAN like the old onces? [15:27:11] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[15:30:35] Operations, Wikimedia-Mailing-lists: Verify that all mailman mailing lists have private_roster=2 - https://phabricator.wikimedia.org/T225269 (Aklapper) Stalled→Open [15:33:51] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [15:34:15] !log bounce rsyslog on wezen - T199406 [15:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:20] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 [15:34:44] (PS1) Bstorm: dologmsg: move this little script out of toolforge [puppet] - https://gerrit.wikimedia.org/r/515104 [15:35:19] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 806 days) https://wikitech.wikimedia.org/wiki/Logs [15:36:00] (PS2) Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104 [15:36:17] PROBLEM - Host ms-be1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:53] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (Papaul) @ayounsi I am planning on installing those new servers in row c and row D and I don't have the "interface-range ganeti" in both of those rows Is it okay for me to go ahead and create... 
[15:37:07] RECOVERY - puppet last run on dns2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:38:30] (CR) Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/16936/tools-sgebastion-07.tools.eqiad.wmflabs/" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm) [15:42:11] (CR) BryanDavis: [C: +1] dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm) [15:47:17] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Tarrow) >>! In T220402#5242902, @akosiaris wrote: > Indeed this was fixed. However another regression has crept up it's head T... [15:51:06] (CR) Jforrester: "Note for deployer: Depends on I0ddf0c099d380ba61991b44cd2426b40ecc5e79f which is in wmf.8 and so should be fine to deploy whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/514994 (https://phabricator.wikimedia.org/T87892) (owner: Smalyshev) [15:57:50] (CR) Faidon Liambotis: [C: +1] "Up to Riccardo at this point. My only question is: are there any kind of fatals or warnings reported with this PS?" 
[software/netbox-reports] - https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: CRusnov) [15:59:49] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:01:49] PROBLEM - HHVM rendering on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:02:51] Operations, netbox: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (crusnov) okay so this works, mostly, in labs when manually configured to operate against the deployment-prep Swift cluster. Netbox lets me upload... [16:03:17] RECOVERY - HHVM rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 80187 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:03:22] (CR) Lucas Werkmeister (WMDE): "Thanks for taking care of it!" [puppet] - https://gerrit.wikimedia.org/r/515100 (owner: Bstorm) [16:04:25] (CR) CRusnov: "Thank you for the review. I shall post an updated results list to the ticket." [software/netbox-reports] - https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: CRusnov) [16:07:23] Operations, Cloud-Services, wikitech.wikimedia.org: Investigate issues with wikitech-static.wikimedia.org - https://phabricator.wikimedia.org/T156570 (ArielGlenn) You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org <-- this is the header on ever... 
[16:07:56] (CR) Lucas Werkmeister (WMDE): "I’m not sure if reusing this script is the best way to get the functionality on Cloud VPS instances – we wouldn’t only need to adjust it t" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm) [16:08:11] (CR) CRusnov: "> Patch Set 6:" [software/netbox-reports] - https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: CRusnov) [16:08:29] (PS1) Papaul: DNS: Add mgmt and production DNS for ganeti2009, ganeti201[0-8] [dns] - https://gerrit.wikimedia.org/r/515111 [16:08:52] (CR) jerkins-bot: [V: -1] DNS: Add mgmt and production DNS for ganeti2009, ganeti201[0-8] [dns] - https://gerrit.wikimedia.org/r/515111 (owner: Papaul) [16:12:12] (PS2) Papaul: DNS: Add mgmt and production DNS for ganeti2009, ganeti201[0-8] [dns] - https://gerrit.wikimedia.org/r/515111 [16:12:32] (PS1) EBernhardson: Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) [16:13:24] (CR) EBernhardson: "I'm not 100% that this is all that is required, but it might be. 
There is additional JDBC configuration that can be setup, but as far as i" [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson) [16:16:01] RECOVERY - Host ms-be1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [16:18:47] (CR) Bstorm: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm) [16:20:21] RECOVERY - Host ms-be1033 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:22:04] Operations, ops-eqiad, User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (Cmjohnson) Open→Resolved The motherboard was replaced and the server is back up [16:46:58] Puppet, Cloud-VPS, Wikidata: role::wikibase in wikidata-dev Cloud VPS project broken - https://phabricator.wikimedia.org/T225312 (Lucas_Werkmeister_WMDE) [16:48:35] Puppet, Cloud-VPS, Wikidata: role::wikibase in wikidata-dev Cloud VPS project broken - https://phabricator.wikimedia.org/T225312 (Lucas_Werkmeister_WMDE) I tried this out by creating `other-test-T225307` and `wikibase-test-T225307` instances; `other-test-T225307` worked fine, `wikibase-test-T225307`... [16:48:41] Puppet, Cloud-VPS, Wikidata: role::wikibase in wikidata-dev Cloud VPS project broken - https://phabricator.wikimedia.org/T225312 (Andrew) I wouldn't say that the role is necessarily broken; it may just be that it needs the hiera args provided if the class is applied. You could provide a good default... 
[16:50:37] (PS2) Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203)
[16:51:02] (CR) jerkins-bot: [V: -1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[16:51:23] (PS2) BBlack: cp3037: normalize config of dead node [puppet] - https://gerrit.wikimedia.org/r/514406 (https://phabricator.wikimedia.org/T222041)
[16:52:00] Puppet, Cloud-VPS, Wikidata: role::wikibase in wikidata-dev Cloud VPS project broken (⇒ can’t SSH into wikibase-* instances) - https://phabricator.wikimedia.org/T225312 (Lucas_Werkmeister_WMDE)
[16:52:25] (CR) BBlack: [C: +2] cp3037: normalize config of dead node [puppet] - https://gerrit.wikimedia.org/r/514406 (https://phabricator.wikimedia.org/T222041) (owner: BBlack)
[16:57:03] (PS3) Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104
[16:57:27] (PS2) Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203)
[16:58:20] (CR) jerkins-bot: [V: -1] mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[17:03:16] (CR) Bstorm: "Actually Wikitech's Yaml parser or something does something really bad with strings that include a pound sign (#).
I was just trying to s" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[17:03:43] (PS3) Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203)
[17:03:59] (PS3) Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203)
[17:04:27] (CR) jerkins-bot: [V: -1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[17:04:49] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 7.414 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:06:09] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:14:31] (PS1) Mforns: Remove turnilo configuration for old netflow datasource [puppet] - https://gerrit.wikimedia.org/r/515121 (https://phabricator.wikimedia.org/T225314)
[17:15:13] (CR) Elukey: [C: +2] Remove turnilo configuration for old netflow datasource [puppet] - https://gerrit.wikimedia.org/r/515121 (https://phabricator.wikimedia.org/T225314) (owner: Mforns)
[17:23:14] (PS4) Hashar: swift: hierarize container_replicator settings [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990)
[17:31:31] (PS1) Mforns: analytics::refinery::job::druid_load add netflow job [puppet] - https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314)
[17:32:27] (CR) jerkins-bot: [V: -1] analytics::refinery::job::druid_load add netflow job [puppet] -
https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: Mforns)
[17:32:30] (CR) Mforns: [C: -1] "Before merging this, we should fix the problems with netflow ingestion." [puppet] - https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: Mforns)
[17:33:45] (PS2) Mforns: analytics::refinery::job::druid_load add netflow job [puppet] - https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314)
[17:34:58] (CR) Mforns: [C: -1] "We should not merge this until the netflow data ingestion has been fixed." [puppet] - https://gerrit.wikimedia.org/r/515125 (https://phabricator.wikimedia.org/T225314) (owner: Mforns)
[17:37:05] Puppet, Cloud-VPS, Wikidata: role::wikibase in wikidata-dev Cloud VPS project broken (⇒ can’t SSH into wikibase-* instances) - https://phabricator.wikimedia.org/T225312 (Lucas_Werkmeister_WMDE) Okay, I think I’ve figured out the most important parts. Basically, you don’t want to name your instance `w...
[17:37:10] Puppet, Cloud-VPS, Wikidata: role::wikibase in wikidata-dev Cloud VPS project broken (⇒ can’t SSH into wikibase-* instances) - https://phabricator.wikimedia.org/T225312 (Lucas_Werkmeister_WMDE) Open→Resolved a:Lucas_Werkmeister_WMDE
[17:47:43] (CR) Lucas Werkmeister (WMDE): "> Actually Wikitech's Yaml parser or something does something really bad with strings that include a pound sign (#). I was just trying to" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[17:56:29] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[18:18:17] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[18:24:50] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Bstorm) So for network: if possible, can we do one port on the public lan and one port on the private? @RobH Everything else in my first co...
[18:26:20] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Bstorm) Note: the hosts aren't expected to work as routers. They just should have management traffic separated if we can properly run them t...
[18:28:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:28:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[18:31:54] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: Hashar)
[18:33:56] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (RobH) a:Andrew→ayounsi Ok, I've synced up with @Bstorm via IRC, and we have the following questions to be addressed by our network ad...
[18:34:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:34:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[18:35:41] (CR) Hashar: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1001/206/" [puppet] - https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: Hashar)
[18:35:51] (PS3) Hashar: beta: slow down swift container replication [puppet] - https://gerrit.wikimedia.org/r/513063 (https://phabricator.wikimedia.org/T160990)
[18:39:31] (CR) Lucas Werkmeister (WMDE): "I guess one possible behavior for a grand unified dologmsg would be:" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[18:43:04] (CR) Hashar: "> Alerts btw will still be raised in IRC + email" [puppet] - https://gerrit.wikimedia.org/r/514574 (owner: Alexandros Kosiaris)
[18:56:40] !log performing rolling reboots of logstash codfw frontends for security updates
[18:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:05] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:07:23] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 80129 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:11:13] (PS1) Legoktm: ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 [mediawiki-config] - https://gerrit.wikimedia.org/r/515142
[19:12:27] (CR) Reedy: [C: +1]
ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 [mediawiki-config] - https://gerrit.wikimedia.org/r/515142 (owner: Legoktm)
[19:14:17] (CR) Jforrester: "Let's just ship it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/515142 (owner: Legoktm)
[19:19:08] Operations, ops-codfw, Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (akosiaris) >>! In T224603#5243200, @Papaul wrote: > @ayounsi I am planning on installing those new servers in row c and row D and I don't have the "interface-range gane...
[19:29:31] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[19:31:54] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) >>! In T220402#5243209, @Tarrow wrote: > This should now be fixed. Sadly this was due to a mismatch between the code...
[19:34:50] (CR) Bstorm: ">" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[19:38:16] (CR) Alexandros Kosiaris: [V: +2 C: +2] Add termbox-0.0.2.tgz [deployment-charts] - https://gerrit.wikimedia.org/r/515029 (owner: Alexandros Kosiaris)
[19:38:26] (CR) Alexandros Kosiaris: [V: +2 C: +2] termbox: Use newer ENV variables [deployment-charts] - https://gerrit.wikimedia.org/r/515028 (https://phabricator.wikimedia.org/T220402) (owner: Alexandros Kosiaris)
[19:39:01] (CR) Bstorm: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[19:43:05] (PS4) Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203)
[19:43:35] (PS4) Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203)
[19:44:00] (CR) jerkins-bot: [V: -1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:44:09] (CR) jerkins-bot: [V: -1] mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:45:11] (PS4) Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104
[19:47:28] (CR) Bstorm: "Figuring that this script's overall logic and thinking is not really appropriate for production use anyway (and requires access to cloud a" [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[19:48:18] (PS5) Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] -
https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203)
[19:48:21] (PS5) Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104
[19:48:43] (CR) jerkins-bot: [V: -1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:50:21] (PS5) Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203)
[19:56:43] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:24:20] (PS2) Alexandros Kosiaris: Introduce termbox LVS configuration [puppet] - https://gerrit.wikimedia.org/r/515074 (https://phabricator.wikimedia.org/T220402)
[21:24:22] (PS1) Alexandros Kosiaris: termbox: Add kubernetes stanzas [puppet] - https://gerrit.wikimedia.org/r/515177 (https://phabricator.wikimedia.org/T220402)
[21:31:02] (PS6) Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203)
[21:31:30] (PS6) Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203)
[21:31:55] (CR) jerkins-bot: [V: -1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[22:32:21] PROBLEM - Apache HTTP on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:33:39] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently
- 616 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers