[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[01:12:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:34:19] (CR) Thcipriani: [C: 2] Switch path handling to pathlib.Path [software/keyholder] - https://gerrit.wikimedia.org/r/458246 (owner: Faidon Liambotis)
[01:34:55] (Merged) jenkins-bot: Switch path handling to pathlib.Path [software/keyholder] - https://gerrit.wikimedia.org/r/458246 (owner: Faidon Liambotis)
[01:34:57] (Merged) jenkins-bot: Unlink the Unix domain socket when exiting [software/keyholder] - https://gerrit.wikimedia.org/r/458247 (owner: Faidon Liambotis)
[01:39:57] (CR) Thcipriani: [C: 2] Abstract the SSH fingerprint generation [software/keyholder] - https://gerrit.wikimedia.org/r/458248 (owner: Faidon Liambotis)
[01:41:00] (Merged) jenkins-bot: Abstract the SSH fingerprint generation [software/keyholder] - https://gerrit.wikimedia.org/r/458248 (owner: Faidon Liambotis)
[01:59:32] (CR) Thcipriani: [C: 2] "thought inline" (1 comment) [software/keyholder] - https://gerrit.wikimedia.org/r/458249 (owner: Faidon Liambotis)
[02:00:06] (Merged) jenkins-bot: Stop spawning ssh-keygen but generate fps ourselves [software/keyholder] - https://gerrit.wikimedia.org/r/458249 (owner: Faidon Liambotis)
[02:00:25] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:09:40] (CR) Thcipriani: [C: 2] "Works beautifully!" [software/keyholder] - https://gerrit.wikimedia.org/r/469807 (owner: Faidon Liambotis)
[02:10:16] (Merged) jenkins-bot: Reload the config on SIGHUP [software/keyholder] - https://gerrit.wikimedia.org/r/469807 (owner: Faidon Liambotis)
[02:10:35] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:13:35] (PS1) Thcipriani: Fix keyholder script permissions [software/keyholder] - https://gerrit.wikimedia.org/r/473146
[02:14:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:18:08] (PS1) Thcipriani: Move public key read_bytes inside try [software/keyholder] - https://gerrit.wikimedia.org/r/473147
[02:24:41] (PS1) Huji: Revert the language of votewiki to English (en) [mediawiki-config] - https://gerrit.wikimedia.org/r/473148
[02:25:29] (PS2) Huji: Revert the language of votewiki to English (en) [mediawiki-config] - https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560)
[02:26:29] (PS3) Huji: Revert the language of votewiki to English (en) [mediawiki-config] - https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560)
[02:34:25] (PS1) Mathew.onipe: wdqs::gui: extract crontasks into a new crontasts.pp [puppet] - https://gerrit.wikimedia.org/r/473149 (https://phabricator.wikimedia.org/T209257)
[02:48:36] (CR) Mathew.onipe: "PCC output looks good: https://puppet-compiler.wmflabs.org/compiler1002/13447/" [puppet] - https://gerrit.wikimedia.org/r/473149 (https://phabricator.wikimedia.org/T209257) (owner: Mathew.onipe)
[03:02:44] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:20:48] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (9 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[03:23:12] (PS20) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918)
[03:30:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 754.36 seconds
[03:46:49] (PS1) Mathew.onipe: setup.py: added dateutil to test dependencies [cookbooks] - https://gerrit.wikimedia.org/r/473150
[03:47:20] (CR) jerkins-bot: [V: -1] setup.py: added dateutil to test dependencies [cookbooks] - https://gerrit.wikimedia.org/r/473150 (owner: Mathew.onipe)
[03:53:20] (Abandoned) Mathew.onipe: setup.py: added dateutil to test dependencies [cookbooks] - https://gerrit.wikimedia.org/r/473150 (owner: Mathew.onipe)
[04:08:39] (PS1) Mathew.onipe: setup.py: added python-dateutil to dependencies [cookbooks] - https://gerrit.wikimedia.org/r/473151
[04:12:07] elukey: ok! I'll check it out tomorrow morning :) thx!!!
[04:15:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 247.48 seconds
[04:16:14] (PS12) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[04:19:23] (CR) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe)
[05:30:33] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/473152 (https://phabricator.wikimedia.org/T203709)
[05:32:21] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/473152 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[05:33:32] (Merged) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/473152 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[05:34:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 55s)
[05:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:53] !log Deploy schema change on db1109 T203709
[05:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:56] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[05:35:14] (CR) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/473152 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[05:36:04] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.65 seconds
[05:36:15] This is because of the rename ^
[05:36:35] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.07 seconds
[05:37:56] (PS1) Marostegui: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/473153 (https://phabricator.wikimedia.org/T208383)
[05:39:25] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.48 seconds
[05:39:27] !log Deploy schema change on db2048 (s1 codfw master), this will create lag on s1 codfw - T114117
[05:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:30] T114117: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117
[05:40:04] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[05:40:36] (CR) Marostegui: [C: 2] db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/473153 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[05:41:41] (Merged) jenkins-bot: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/473153 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[05:42:57] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2004 - T208383 (duration: 00m 53s)
[05:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:00] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[05:43:27] !log Stop MySQL on pc2004 to transfer its data to pc2007 - T208383
[05:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:27] (CR) jenkins-bot: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/473153 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:02:31] !log Add ipb_sitewide column to db1073:labtestwiki
[06:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:31] * elukey missed marostegui's altering tables in the morning <3
[06:51:49] <3
[07:05:52] !log powercycle lvs2006 - mgmt/serial console blank, not responsive since hours ago
[07:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:56] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/473156
[07:14:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/473156 (owner: Marostegui)
[07:15:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/473156 (owner: Marostegui)
[07:16:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/473156 (owner: Marostegui)
[07:16:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 (duration: 00m 54s)
[07:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:30] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/473158 (https://phabricator.wikimedia.org/T203709)
[07:18:06] Operations, ops-codfw, Traffic: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (elukey) p:Triage>High
[07:18:39] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/473158 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[07:19:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/473158 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[07:20:44] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[07:20:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 (duration: 00m 53s)
[07:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:06] <_joe_> uh
[07:21:11] <_joe_> lvs2006 is now up
[07:21:13] 2006 came back online?
[07:21:15] !log Deploy schema change on db1104 T203709
[07:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:18] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:21:44] <_joe_> ema: I'd verify it's working ok thoroughly
[07:21:54] that's suspicious...
[07:24:10] (PS1) Elukey: profile::an::refinery::job::import_wikitech:dumps: fix timer's script [puppet] - https://gerrit.wikimedia.org/r/473159
[07:25:37] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13448/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/473159 (owner: Elukey)
[07:26:53] gtirloni: o/ are you around?
[07:27:20] there seems to be "toolforge: refactor mail server" still to merge on puppetmaster
[07:29:23] Cc: arturo --^
[07:31:01] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/473158 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[07:36:55] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:37:19] yes yes we know
[07:37:48] ^ that must be ChatOps ;P
[07:38:09] (PS1) Elukey: Revert "toolforge: refactor mail server" [puppet] - https://gerrit.wikimedia.org/r/473160
[07:39:03] (CR) Elukey: [C: 2] "It was left to puppet-merge and I am not 100% sure what are the effects of merging this code, so to be safe I am reverting it. Sorry :(" [puppet] - https://gerrit.wikimedia.org/r/473160 (owner: Elukey)
[07:39:11] (PS2) Elukey: Revert "toolforge: refactor mail server" [puppet] - https://gerrit.wikimedia.org/r/473160
[07:40:24] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[07:43:05] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[07:51:35] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Finally it seems that mcrouter now waits only 3s (or a littl...
[07:52:38] (CR) Elukey: [C: 1] Remove obsolete cloudera-trusty repository [puppet] - https://gerrit.wikimedia.org/r/472950 (owner: Muehlenhoff)
[07:55:35] (CR) DCausse: [C: 1] elasticsearch: cookbook for multi-cluster services rolling restart (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe)
[08:01:04] (PS2) Muehlenhoff: Remove obsolete cloudera-trusty repository [puppet] - https://gerrit.wikimedia.org/r/472950
[08:01:28] (PS1) Elukey: Move import_wikitext_dumps to stat1007 [puppet] - https://gerrit.wikimedia.org/r/473161 (https://phabricator.wikimedia.org/T202489)
[08:02:33] (CR) Muehlenhoff: [C: 2] Remove obsolete cloudera-trusty repository [puppet] - https://gerrit.wikimedia.org/r/472950 (owner: Muehlenhoff)
[08:04:02] (CR) DCausse: [C: 1] elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[08:08:04] (CR) Joal: [C: 1] "LGTM ! Thanks elukey :)" [puppet] - https://gerrit.wikimedia.org/r/473161 (https://phabricator.wikimedia.org/T202489) (owner: Elukey)
[08:08:08] (CR) DCausse: [C: 1] elasticsearch_cluster: multi-cluster/multi-instance support (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[08:08:13] sorry elukey, thanks for handling
[08:09:33] I will be in front of my laptop ready for working in a while
[08:11:37] ack! np
[08:16:02] (CR) Elukey: [C: 2] Move import_wikitext_dumps to stat1007 [puppet] - https://gerrit.wikimedia.org/r/473161 (https://phabricator.wikimedia.org/T202489) (owner: Elukey)
[08:16:09] (PS2) Elukey: Move import_wikitext_dumps to stat1007 [puppet] - https://gerrit.wikimedia.org/r/473161 (https://phabricator.wikimedia.org/T202489)
[08:23:50] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[08:26:02] (PS13) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[08:26:28] (CR) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe)
[08:27:14] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:37:01] !log updating remaining rsyslog on stretch to 8.38.0-1~bpo9+1wmf1
[08:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:51] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[08:38:32] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:38:47] working on --^
[08:42:41] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[08:58:32] PROBLEM - DPKG on mwdebug1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:03:58] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/473162
[09:08:52] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/473162 (owner: Marostegui)
[09:09:58] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/473162 (owner: Marostegui)
[09:11:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 (duration: 00m 55s)
[09:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:05] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/473162 (owner: Marostegui)
[09:13:02] RECOVERY - DPKG on mwdebug1002 is OK: All packages OK
[09:14:56] (CR) Gehel: [C: 2] "Build and signatures verified." [software/elasticsearch/plugins] (5.x) - https://gerrit.wikimedia.org/r/473045 (https://phabricator.wikimedia.org/T209293) (owner: DCausse)
[09:17:38] (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/473165 (https://phabricator.wikimedia.org/T203709)
[09:18:04] Operations, Research-Programs, SRE-Access-Requests, Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (ArielGlenn)
[09:19:55] (PS30) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:20:25] !log rollout new prometheus-mcrouter-exporter to mw* - previous rollout didn't work as expected
[09:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:37] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/473165 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[09:22:40] (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/473165 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[09:23:33] (CR) Volans: "In general I'm not a big fan of adding dependencies to do simple things, but the TZ parsing in the ISO 8601 format strings might be tricky" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/473151 (owner: Mathew.onipe)
[09:23:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 52s)
[09:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:48] !log Deploy schema change on db1092 T203709
[09:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:51] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[09:25:10] <_joe_> !log uploading new versions of php-msgpack, php-geoip compatible with both php 7.0 and php 7.2 to thirdparty/php72 T208433
[09:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:13] T208433: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433
[09:26:38] (CR) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/473165 (https://phabricator.wikimedia.org/T203709) (owner: Marostegui)
[09:26:51] Operations, Puppet, Proposal, cloud-services-team (Kanban): Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (ArielGlenn) p:Triage>Normal
[09:30:54] (PS1) Gehel: Build is now for Stretch by default [software/elasticsearch/plugins] (5.x) - https://gerrit.wikimedia.org/r/473167
[09:32:37] Operations, Patch-For-Review, User-Elukey: mcrouter prometheus exporter stops working when mcrouter restarts - https://phabricator.wikimedia.org/T208375 (elukey) Upgraded according to https://debmonitor.wikimedia.org/packages/prometheus-mcrouter-exporter
[09:32:48] Operations, Patch-For-Review, User-Elukey: mcrouter prometheus exporter stops working when mcrouter restarts - https://phabricator.wikimedia.org/T208375 (elukey) Open>Resolved
[09:32:52] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey)
[09:35:13] Operations, Puppet, Proposal, cloud-services-team (Kanban): Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (ArielGlenn) While I don't want to open up this discussion here, it may be the case that some scripts moved out of puppet might share a repo, not nec...
[09:35:36] Operations, Wikimedia-Mailing-lists: Requesting creation of librarycard-dev mailing list - https://phabricator.wikimedia.org/T209081 (ArielGlenn) p:Triage>Normal
[09:37:32] PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:38:12] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog],Package[rsyslog-gnutls]
[09:38:50] (CR) DCausse: [C: 1] Build is now for Stretch by default [software/elasticsearch/plugins] (5.x) - https://gerrit.wikimedia.org/r/473167 (owner: Gehel)
[09:39:42] RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 75312 bytes in 0.706 second response time
[09:43:06] (CR) Gehel: wdqs::gui: extract crontasks into a new crontasts.pp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/473149 (https://phabricator.wikimedia.org/T209257) (owner: Mathew.onipe)
[09:43:16] Operations, DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (ArielGlenn) p:Triage>Normal
[09:44:58] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) So the on server caching has reduced the traffic fa...
[09:46:52] (PS2) Mathew.onipe: wdqs::gui: extract crontasks into a new crontasks.pp [puppet] - https://gerrit.wikimedia.org/r/473149 (https://phabricator.wikimedia.org/T209257)
[09:49:31] (PS3) Gehel: wdqs::gui: extract crontasks into a new crontasks.pp [puppet] - https://gerrit.wikimedia.org/r/473149 (https://phabricator.wikimedia.org/T209257) (owner: Mathew.onipe)
[09:50:11] (CR) Gehel: [C: 2] wdqs::gui: extract crontasks into a new crontasks.pp [puppet] - https://gerrit.wikimedia.org/r/473149 (https://phabricator.wikimedia.org/T209257) (owner: Mathew.onipe)
[09:50:49] Operations, Goal, Technical-Debt, User-CDanis, User-fgiunchedi: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195 (ArielGlenn)
[09:52:07] Operations, Icinga, monitoring, Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069 (ArielGlenn)
[09:54:21] !log restarting jenkins on releases1001 to pick up Java security update
[09:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:32] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (Krenair) >>! In T174596#4740124, @aborrero wrote: > BTW, I'm focusing on the `eqiad1` deployment...
[09:56:57] Operations, Puppet, User-Joe: Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944 (ArielGlenn)
[09:57:12] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) a:Addshore>None Unassigning until we look a...
[09:58:23] (03CR) 10Gehel: [C: 032] Build is now for Stretch by default [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/473167 (owner: 10Gehel) [09:58:59] (03PS1) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in esams [puppet] - 10https://gerrit.wikimedia.org/r/473170 (https://phabricator.wikimedia.org/T206633) [09:59:01] (03PS1) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/473171 (https://phabricator.wikimedia.org/T206633) [09:59:03] (03PS1) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/473172 (https://phabricator.wikimedia.org/T206633) [09:59:06] (03PS1) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/473173 (https://phabricator.wikimedia.org/T206633) [09:59:08] (03PS1) 10Filippo Giunchedi: hieradata: add namespaced keys note to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/473174 (https://phabricator.wikimedia.org/T209265) [10:00:54] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: install extensions with versioned package names [puppet] - 10https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433) [10:00:56] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::php: allow picking a php version [puppet] - 10https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433) [10:01:05] elukey: crap, I forgot to puppet-merge.. sorry about that :| thanks for handling it [10:01:33] gtirloni: np! 
Sorry for the revert but I wasn't sure :( [10:01:41] <_joe_> gtirloni: we just figured it was better to wait for someone involved with the patch to be back and re-revert [10:01:53] yep, that's a sane approach [10:02:47] (03PS2) 10Filippo Giunchedi: hieradata: add namespaced keys note to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/473174 (https://phabricator.wikimedia.org/T209265) [10:02:49] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add namespaced keys note to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/473174 (https://phabricator.wikimedia.org/T209265) (owner: 10Filippo Giunchedi) [10:03:03] (03CR) 10DCausse: setup.py: added python-dateutil to dependencies (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/473151 (owner: 10Mathew.onipe) [10:03:44] (03PS14) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [10:03:51] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:04:38] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [10:05:24] (03PS1) 10GTirloni: Revert "Revert "toolforge: refactor mail server"" [puppet] - 10https://gerrit.wikimedia.org/r/473175 (https://phabricator.wikimedia.org/T208579) [10:05:51] (03PS2) 10GTirloni: Revert "Revert "toolforge: refactor mail server"" [puppet] - 10https://gerrit.wikimedia.org/r/473175 (https://phabricator.wikimedia.org/T208579) [10:06:26] (03PS15) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [10:06:43] (03CR) 10GTirloni: [C: 032] Revert "Revert "toolforge: refactor mail server"" [puppet] - 
10https://gerrit.wikimedia.org/r/473175 (https://phabricator.wikimedia.org/T208579) (owner: 10GTirloni) [10:08:06] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Patch-For-Review, and 3 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 (10ArielGlenn) Does this still need to happen or did the cleanup save us from further incidents? [10:09:52] (03PS1) 10Marostegui: pc2007: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/473176 (https://phabricator.wikimedia.org/T208383) [10:10:42] (03Abandoned) 10Mathew.onipe: setup.py: added python-dateutil to dependencies [cookbooks] - 10https://gerrit.wikimedia.org/r/473151 (owner: 10Mathew.onipe) [10:14:40] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) >>! In T174596#4741643, @Krenair wrote: > I guess there's still the question of whethe... [10:14:54] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) 05Open>03Resolved [10:15:08] !log restart elasticsearch on relforge for plugin upgrade - T209293 [10:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:12] T209293: Prepare a deb package with the experimental highlighter 5.5.2.4 - https://phabricator.wikimedia.org/T209293 [10:17:42] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:17:42] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: install extensions with versioned package names [puppet] - 10https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433) [10:17:44] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::php: allow picking a php version [puppet] - 10https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433) [10:18:38] (03PS2) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/473171 (https://phabricator.wikimedia.org/T206633) [10:18:43] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout syslog_exporter in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/473171 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [10:25:43] (03PS1) 10Marostegui: dump_section.py: Increase retention from 18 days to 45 [puppet] - 10https://gerrit.wikimedia.org/r/473177 [10:26:43] (03CR) 10jerkins-bot: [V: 04-1] dump_section.py: Increase retention from 18 days to 45 [puppet] - 10https://gerrit.wikimedia.org/r/473177 (owner: 10Marostegui) [10:26:52] (03PS2) 10Marostegui: dump_section.py: Increase retention from 18 days to 45 [puppet] - 10https://gerrit.wikimedia.org/r/473177 [10:28:04] (03PS2) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in esams [puppet] - 10https://gerrit.wikimedia.org/r/473170 (https://phabricator.wikimedia.org/T206633) [10:28:12] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout syslog_exporter in esams [puppet] - 10https://gerrit.wikimedia.org/r/473170 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [10:28:24] (03PS3) 10Marostegui: dump_section.py: Increase retention from 18 days to 45 [puppet] - 10https://gerrit.wikimedia.org/r/473177 [10:29:36] (03PS4) 10Marostegui: dump_section.py: Increase retention from 18 days to 45 [puppet] - 10https://gerrit.wikimedia.org/r/473177 [10:30:33] (03CR) 10Marostegui: [C: 032] 
dump_section.py: Increase retention from 18 days to 45 [puppet] - 10https://gerrit.wikimedia.org/r/473177 (owner: 10Marostegui) [10:31:22] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473178 [10:32:29] (03PS5) 10Giuseppe Lavagetto: profile::mediawiki::php: allow picking a php version [puppet] - 10https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433) [10:32:45] (03CR) 10Gehel: "minor comment inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:32:49] (03PS2) 10Filippo Giunchedi: graphite: remove old graphite hardware from receiving metrics [puppet] - 10https://gerrit.wikimedia.org/r/471987 (https://phabricator.wikimedia.org/T196484) [10:33:02] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:33:51] (03CR) 10Filippo Giunchedi: [C: 032] graphite: remove old graphite hardware from receiving metrics [puppet] - 10https://gerrit.wikimedia.org/r/471987 (https://phabricator.wikimedia.org/T196484) (owner: 10Filippo Giunchedi) [10:34:04] volans: ping T208576 [10:34:05] T208576: netbox: wmcs reports - https://phabricator.wikimedia.org/T208576 [10:34:37] arturo: pong [10:34:42] :-) [10:35:41] (03CR) 10Gehel: WIP: logstash::input::kafka: add topics_prefix support (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:36:06] arturo: what do you need from me on that task? I thought it was mostly to get an agreement on how WMCS will use netbox, so primarily to gather feedback within the WMCS team [10:36:11] but I might have misunderstood it ;) [10:36:51] we discussed on IRC several ways to adapt netbox to our team: tenants? tags? 
etc [10:37:28] I would like to have a clear workflow to follow, an agreement with you all [10:37:44] (we in WMCS will just adapt to common netbox workflows) [10:38:48] volans: so what I need from you are concrete instructions on to do our reports, basically :-P [10:38:57] on how to do our reports* [10:39:46] ehehe, reports will depend how the data is stored ofc, so better to decide that first :) [10:40:02] cool [10:40:05] !log stop sending metrics to old graphite hardware [10:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:09] as of now we don't have any guideline for tags, as they are basically not used (apart from a test from ARzhel) [10:40:35] volans: but instead of another IRC chat, what if we persist all this into the phab task? [10:41:08] sure I can comment my thoughts, but you probably want at least Fai.don and Ro.bH to comment too ;) [10:41:36] ok [10:41:49] as for the reports, as part of this Q-goal we'll have to write 3 of them at least hence we'll also do the related puppettization [10:42:06] hence it should boild down for you to just write some python code once that first part is ready [10:42:16] and hopefully there is an agreement on how to identify WMCS hosts :D [10:43:22] volans: mmm, when I say report I'm just looking for something like this search result https://netbox.wikimedia.org/search/?q=cloud&obj_type= [10:43:44] but that doesn't involve any python code, right? What do you mean by report? [10:45:23] Netbox has a concept of "reports" that are python scripts that you can run and have access to the Python API of netbox, so you can implement custom logic and are meant to be used as data validation, for example to check that each server has a mgmt interface defined, etc... 
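The kind of data-validation check described above (each server should have a mgmt interface defined) could be sketched roughly as follows. This is plain Python and deliberately does not use the Netbox reports API; the `Server` record, the `check_mgmt_interfaces` function, the interface name `mgmt`, and the host names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    """Hypothetical stand-in for a device record; not a Netbox model."""
    name: str
    interfaces: List[str] = field(default_factory=list)


def check_mgmt_interfaces(
    servers: List[Server], mgmt_name: str = "mgmt"
) -> Dict[str, List[str]]:
    """Sort server names into 'success' (has a mgmt interface) or 'failure'."""
    result: Dict[str, List[str]] = {"success": [], "failure": []}
    for server in servers:
        # A server passes only if an interface with the exact mgmt name exists.
        key = "success" if mgmt_name in server.interfaces else "failure"
        result[key].append(server.name)
    return result


# Made-up inventory: one host with a mgmt interface, one without.
inventory = [
    Server("example1001", ["eth0", "mgmt"]),
    Server("example1002", ["eth0"]),
]
report = check_mgmt_interfaces(inventory)
print(report)
```

A real report would presumably read its devices from Netbox rather than from a hard-coded list; the sketch only shows the shape of the per-server check.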
[10:45:51] ok, then our needs are much more simpler :_P [10:45:51] what you mean is just a way to filter them in the UI :) [10:46:42] sorry I thought you might need something more advanced [10:47:11] !log Deploy schema change on db1116:3318 T203709 [10:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:14] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [10:47:16] why isn't search/q=cloud sufficient? [10:48:25] our servers haven't the same prefix... [10:48:42] and the search box apparently doesn't allow regexps [10:48:55] AFAIK labs* (ideally only temporary) and I guess there might also be "false positives" like labsdb*? [10:49:14] * arturo nods [10:49:28] why aren't two searches (one with lab and one with cloud) not sufficient while this migration is in flight? :P [10:49:48] jouncebot: next [10:49:48] In 1 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1200) [10:51:16] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi) [10:51:27] the naming scheme migration may take a while to complete. 
We would like to have a single point to get an overview of the HW [10:51:44] but again, I would rather move this to the phab task [10:52:43] first step is to understand the problem I think :) [10:53:23] I was trying to avoid doing the "five whys" over phab, but sure, can do that if you prefer :) [10:54:03] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473178 (owner: 10Marostegui) [10:54:35] yes please, so others in my team can interact as well :-) [10:55:51] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473178 (owner: 10Marostegui) [10:56:06] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473178 (owner: 10Marostegui) [10:57:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 (duration: 00m 52s) [10:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:50] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) [11:04:46] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set a harbor registry for testing - https://phabricator.wikimedia.org/T209271 (10fselles) After looking into it a little bit, packaging harbor would be challenging. Harbor is a set of microservices published as containers. The in... 
[11:05:50] (03PS6) 10Giuseppe Lavagetto: profile::mediawiki::php: allow picking a php version [puppet] - 10https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433) [11:12:28] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10ArielGlenn) [11:13:24] 10Operations: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10ArielGlenn) [11:16:06] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 (10ArielGlenn) [11:19:50] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:20:53] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10ArielGlenn) Who in Cloud Services is able to look at this? 
[11:21:26] ms-be1037 was the debmonitor session, failed [11:21:30] ms-be1037 was the debmonitor session, fixed [11:21:41] :( thx [11:22:00] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational [11:27:17] (03PS1) 10GTirloni: toolforge: Only add motd if in tools/toolsbeta projects [puppet] - 10https://gerrit.wikimedia.org/r/473184 [11:27:46] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Only add motd if in tools/toolsbeta projects [puppet] - 10https://gerrit.wikimedia.org/r/473184 (owner: 10GTirloni) [11:28:12] (03PS1) 10Arturo Borrero Gonzalez: toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) [11:29:01] (03CR) 10jerkins-bot: [V: 04-1] toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) (owner: 10Arturo Borrero Gonzalez) [11:29:12] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) >>! In T207377#4741857, @ArielGlenn wrote: > Who in Cloud Services is able to look at this? We will discuss this in our team meeting today. [11:30:13] (03PS2) 10GTirloni: toolforge: Only add motd if in tools/toolsbeta projects [puppet] - 10https://gerrit.wikimedia.org/r/473184 [11:30:30] (03PS4) 10Giuseppe Lavagetto: mediawiki::php: install extensions with versioned package names [puppet] - 10https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433) [11:30:59] (03CR) 10Giuseppe Lavagetto: "cherry-picked in beta, did its job correctly." 
[puppet] - 10https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto) [11:31:11] (03PS2) 10Arturo Borrero Gonzalez: toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) [11:31:20] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::php: install extensions with versioned package names [puppet] - 10https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto) [11:31:25] (03CR) 10GTirloni: [C: 032] toolforge: Only add motd if in tools/toolsbeta projects [puppet] - 10https://gerrit.wikimedia.org/r/473184 (owner: 10GTirloni) [11:31:49] <_joe_> grr [11:31:59] (03PS5) 10Giuseppe Lavagetto: mediawiki::php: install extensions with versioned package names [puppet] - 10https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433) [11:32:01] (03CR) 10jerkins-bot: [V: 04-1] toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) (owner: 10Arturo Borrero Gonzalez) [11:37:34] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10MoritzMuehlenhoff) While nagios-nrpe-plugin has been upgraded on the Icinga hosts, the nagios-nrpe source package als... [11:40:17] (03Abandoned) 10Giuseppe Lavagetto: php::default_extensions: explicitly install php7.x-json [puppet] - 10https://gerrit.wikimedia.org/r/470864 (owner: 10Giuseppe Lavagetto) [11:59:52] (03PS2) 10Giuseppe Lavagetto: mediawiki: switch to sethandler everywhere [puppet] - 10https://gerrit.wikimedia.org/r/471230 [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! 
You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1200). [12:00:04] phuedx: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:14] o/ [12:01:01] (03CR) 10Giuseppe Lavagetto: [C: 032] "This has been running on the canaries (serving ~ 10% of the traffic) for some time now, I would hope any serious issue would've been caugh" [puppet] - 10https://gerrit.wikimedia.org/r/471230 (owner: 10Giuseppe Lavagetto) [12:01:40] looks like it's just me today [12:01:53] ^ zeljkof/hashar [12:02:59] i'll take https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/473164/ first [12:04:57] is https://integration.wikimedia.org/zuul/ empty for everyone else? [12:05:37] (03PS1) 10Arturo Borrero Gonzalez: role: aptly: refactor [puppet] - 10https://gerrit.wikimedia.org/r/473187 [12:06:20] phuedx: sorry, not around today [12:06:33] (03CR) 10jerkins-bot: [V: 04-1] role: aptly: refactor [puppet] - 10https://gerrit.wikimedia.org/r/473187 (owner: 10Arturo Borrero Gonzalez) [12:07:21] zuul status seems to be working again now (was seeing $.zuul is not a function) [12:07:33] (03PS3) 10Arturo Borrero Gonzalez: toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) [12:11:00] putting the change onto mwdebug1002 [12:11:19] (03PS2) 10Arturo Borrero Gonzalez: role: aptly: refactor [puppet] - 10https://gerrit.wikimedia.org/r/473187 [12:11:39] (03PS3) 10Arturo Borrero Gonzalez: role: aptly: refactor [puppet] - 10https://gerrit.wikimedia.org/r/473187 [12:13:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] role: aptly: refactor [puppet] - 10https://gerrit.wikimedia.org/r/473187 (owner: 10Arturo Borrero Gonzalez) [12:14:40] seeing the messages appear here 
https://en.wikipedia.org/w/index.php?title=Special%3AAllMessages&prefix=wikibase-page-schema&filter=all&lang=en&limit=50 on mwdebug1002 [12:16:19] (03PS4) 10Arturo Borrero Gonzalez: toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) [12:17:21] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: aptly: create stretch-toolsbeta repo [puppet] - 10https://gerrit.wikimedia.org/r/473185 (https://phabricator.wikimedia.org/T207970) (owner: 10Arturo Borrero Gonzalez) [12:22:03] syncing [12:22:52] !log phuedx@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMessages/: SWAT: [[gerrit:473164|Define WikimediaMessages for Wikibase SEO change (T208755)]] (duration: 00m 56s) [12:22:53] phuedx: sorry I am just back from lunch [12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:55] T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755 [12:22:58] seems you sorted it out :) [12:23:06] hashar: no worries [12:23:26] post-deployment verification [12:23:55] (03PS1) 10Arturo Borrero Gonzalez: toollabs: servives: publish stretch-toolsbeta repo also in the old code [puppet] - 10https://gerrit.wikimedia.org/r/473188 (https://phabricator.wikimedia.org/T207970) [12:24:29] Wouldn't that require a full scap as an i18n change? [12:24:47] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: servives: publish stretch-toolsbeta repo also in the old code [puppet] - 10https://gerrit.wikimedia.org/r/473188 (https://phabricator.wikimedia.org/T207970) (owner: 10Arturo Borrero Gonzalez) [12:26:09] bawolff: that's what i'm looking at. i saw the messages defined on mwdebug but i don't on prod [12:26:27] so scap sync-all or can i do an l18n refresh without a full sync? 
[12:26:51] The i18n refresh is the expensive part of the full sync [12:26:58] so you might as well just do that [12:28:13] When you do scap pull, it does refresh the i18n for that host, which is why you saw it on mwdebug [12:28:21] !log phuedx@deploy1001 Started scap: SWAT: [[gerrit:473164|Define WikimediaMessages for Wikibase SEO change]] l18n refresh [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:39] noted. thanks bawolff [12:30:11] (03PS1) 10Arturo Borrero Gonzalez: aptly: client: include default servername [puppet] - 10https://gerrit.wikimedia.org/r/473190 [12:30:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] aptly: client: include default servername [puppet] - 10https://gerrit.wikimedia.org/r/473190 (owner: 10Arturo Borrero Gonzalez) [12:31:40] rsync common seems to be taking a while... [12:32:50] * phuedx twiddles thumbs [12:33:41] Its normal it takes forever [12:34:07] Or at least I'm told, I've only done it like a couple times [12:34:55] bawolff: i'm taking notes and i'll add 'em to https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#General_Advice :) [12:35:11] oh yes, it will take ~45 mins probably [12:35:34] phuedx: >or can i do an l18n refresh without a full sync? < nope [12:36:24] follow on question: can i deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/473166/ while i'm waiting for the l18n refresh ;) [12:36:28] I think maybe part of the problem is we have multiple competing how to deploy code docs. The other how to deploy code talks about i18n changes a bit [12:36:39] phuedx: nope :) [12:36:48] ah well. no worries [12:36:49] Sorry, I probably should have told you to do this last [12:37:04] no worries bawolff [12:37:19] the experience is good! [12:37:52] phuedx: So far, you are only on minute 5 of 45 of twiddling your thumbs ;) [12:37:59] haha [12:38:29] new phabricator badge: master thumb twiddler [12:39:20] I kind of found out the hard way too. 
There was fairly significant vandalism on translatewiki that timed with the bot update and I went to deploy the fix at like 2 am, and then spent 45 minutes of, "I really want to go to bed already" [12:40:28] ouch :/ [12:42:33] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set a harbor registry for testing - https://phabricator.wikimedia.org/T209271 (10Joe) I would say that this sounds like a better direction to go into, yes. What we still miss is a clear idea of how we want our registry infrastru... [12:42:49] so what i'm learning is that i shouldn't get too much experience deploying and/or don't be involved in security otherwise i'll find myself awake at 2 am waiting for scripts to run? ;) [12:43:13] don't sign up to do the midnight swat [12:43:25] ^ noted [12:43:52] I stayed away from running a full scap whenever possible [12:43:57] <_joe_> we should really move the l18n refresh to php 7.x [12:44:02] jouncebot, now [12:44:02] For the next 0 hour(s) and 15 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1200) [12:45:21] phuedx: Or just get enough experience to know it takes forever, and convince someone else to run the script :) [12:45:54] ^^ i'll reschedule the other change that i've got lined up [12:45:56] bawolff: lol [12:48:52] yes it should get moved, people would be dancing in the streets if that happened [12:48:56] s/if/when/ [12:49:59] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13450/mw1261.eqiad.wmnet/ this is a noop until we convert some server to php 7.2" [puppet] - 10https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto) [12:50:05] !log phuedx@deploy1001 Finished scap: SWAT: [[gerrit:473164|Define WikimediaMessages for Wikibase SEO change]] l18n refresh (duration: 21m 43s) [12:50:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:11] (03PS7) 10Giuseppe Lavagetto: profile::mediawiki::php: allow picking a php version [puppet] - 10https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433) [12:50:26] thanks, bawolff [12:50:29] lgtm now [12:50:39] i've rescheduled my other change to the next window [12:50:52] And actually only took 22 minutes [12:51:06] (03PS1) 10GTirloni: toolforge: Add TLS support to mailrelay [puppet] - 10https://gerrit.wikimedia.org/r/473192 (https://phabricator.wikimedia.org/T209347) [12:51:09] i remain a journeyman thumb twiddler [12:51:28] !log European Mid-day SWAT finished [12:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:47] (03PS1) 10Arturo Borrero Gonzalez: Revert "role: aptly: refactor" [puppet] - 10https://gerrit.wikimedia.org/r/473193 [12:52:54] 10Puppet, 10Patch-For-Review: Validate no namespaced keys are present in hieradata/*.yaml - https://phabricator.wikimedia.org/T209265 (10akosiaris) IIRC this is because of the expand_data directive in https://github.com/wikimedia/puppet/blob/production/modules/puppetmaster/files/production.hiera.yaml#L8 It's... [12:53:31] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10ayounsi) [12:53:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "role: aptly: refactor" [puppet] - 10https://gerrit.wikimedia.org/r/473193 (owner: 10Arturo Borrero Gonzalez) [12:54:30] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] "Sorry for the wmf-style violations. They were there before." 
[puppet] - 10https://gerrit.wikimedia.org/r/473193 (owner: 10Arturo Borrero Gonzalez) [12:54:44] (03PS2) 10Arturo Borrero Gonzalez: Revert "role: aptly: refactor" [puppet] - 10https://gerrit.wikimedia.org/r/473193 [12:54:58] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] Revert "role: aptly: refactor" [puppet] - 10https://gerrit.wikimedia.org/r/473193 (owner: 10Arturo Borrero Gonzalez) [12:55:29] (03PS2) 10GTirloni: toolforge: Add TLS support to mailrelay [puppet] - 10https://gerrit.wikimedia.org/r/473192 (https://phabricator.wikimedia.org/T209347) [12:56:18] (03PS1) 10Banyek: mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) [12:56:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [12:57:06] (03CR) 10GTirloni: [C: 032] toolforge: Add TLS support to mailrelay [puppet] - 10https://gerrit.wikimedia.org/r/473192 (https://phabricator.wikimedia.org/T209347) (owner: 10GTirloni) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1300) [13:00:20] 10Puppet, 10Patch-For-Review: Validate no namespaced keys are present in hieradata/*.yaml - https://phabricator.wikimedia.org/T209265 (10Joe) >>! In T209265#4742010, @akosiaris wrote: > IIRC this is because of the expand_data directive in https://github.com/wikimedia/puppet/blob/production/modules/puppetmaster... [13:01:12] phuedx: congratulations :) [13:13:24] (03PS1) 10GTirloni: toolforge: Add default external hostname for mail relay [puppet] - 10https://gerrit.wikimedia.org/r/473196 (https://phabricator.wikimedia.org/T209356) [13:14:20] 10Puppet, 10Patch-For-Review: Validate no namespaced keys are present in hieradata/*.yaml - https://phabricator.wikimedia.org/T209265 (10akosiaris) >>! In T209265#4742019, @Joe wrote: >>>! 
In T209265#4742010, @akosiaris wrote: >> IIRC this is because of the expand_data directive in https://github.com/wikimedia... [13:14:21] (03CR) 10GTirloni: [C: 032] toolforge: Add default external hostname for mail relay [puppet] - 10https://gerrit.wikimedia.org/r/473196 (https://phabricator.wikimedia.org/T209356) (owner: 10GTirloni) [13:23:39] (03PS1) 10GTirloni: toolforge: Change active mail relay to tools-mail-02 [puppet] - 10https://gerrit.wikimedia.org/r/473200 (https://phabricator.wikimedia.org/T209356) [13:24:20] 10Operations, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10fgiunchedi) [13:24:40] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10fgiunchedi) 05Open>03Resolved graphite1004 is fully in service, followup for decom is {T209357} [13:24:48] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10fgiunchedi) [13:27:10] (03CR) 10GTirloni: [C: 032] toolforge: Change active mail relay to tools-mail-02 [puppet] - 10https://gerrit.wikimedia.org/r/473200 (https://phabricator.wikimedia.org/T209356) (owner: 10GTirloni) [13:28:08] PROBLEM - Host pc2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:29:26] I thought pc2010 was downtimed [13:29:48] RECOVERY - Host pc2010 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [13:30:13] !log draining ganeti1008 for reboot/kernel security update [13:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:25] (03PS2) 10Banyek: mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) [13:30:58] (03CR) 10jerkins-bot: [V: 04-1] mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) 
(owner: 10Banyek) [13:31:32] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set a harbor registry for testing - https://phabricator.wikimedia.org/T209271 (10akosiaris) >>! In T209271#4741841, @fselles wrote: > After looking into it a little bit, packaging harbor would be challenging. Harbor is a set of m... [13:33:09] RECOVERY - puppet last run on pc2010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:34:28] PROBLEM - Apache HTTP on mw1313 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:35:38] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time [13:35:55] (03PS1) 10GTirloni: toolforge: Fix mail relay certificate (mail_domain->external_hostname) [puppet] - 10https://gerrit.wikimedia.org/r/473203 (https://phabricator.wikimedia.org/T209356) [13:36:51] (03CR) 10GTirloni: [C: 032] toolforge: Fix mail relay certificate (mail_domain->external_hostname) [puppet] - 10https://gerrit.wikimedia.org/r/473203 (https://phabricator.wikimedia.org/T209356) (owner: 10GTirloni) [13:37:05] (03PS2) 10Marostegui: pc2007: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/473176 (https://phabricator.wikimedia.org/T208383) [13:40:49] (03PS3) 10Banyek: mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) [13:40:50] !log Cutting wmf/1.33.0-wmf.4 branch | T206658 [13:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:53] T206658: 1.33.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T206658 [13:41:14] !log Deploy schema change on s8 codfw master (db2045) this will generate lag on s8 codfw - T203709 [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:17] T203709: Schema change for adding indexes of ct_tag_id - 
https://phabricator.wikimedia.org/T203709 [13:41:19] (03CR) 10jerkins-bot: [V: 04-1] mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [13:42:10] (03CR) 10Marostegui: [C: 032] pc2007: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/473176 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:42:35] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10Miriam) Thanks so much @ArielGlenn ! @toddleroux , @RyanSteinberg @Afandian please follow the steps above! [13:44:56] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10ArielGlenn) Note that the three checklists are for whatever sres are working on these tasks; as we see different users doing the steps we will double-che... [13:46:26] (03PS4) 10Banyek: mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) [13:47:04] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10Krenair) Are collaborators going to be able to log in to officewiki? bastiononly hasn't existed for years. 
[13:47:26] (03PS1) 10Marostegui: db-codfw.php: Pool pc2007 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473206 (https://phabricator.wikimedia.org/T208383) [13:52:45] (03CR) 10Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13451/" [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [13:52:58] (03CR) 10Banyek: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473206 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:53:23] (03CR) 10Marostegui: [C: 032] db-codfw.php: Pool pc2007 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473206 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:54:31] (03Merged) 10jenkins-bot: db-codfw.php: Pool pc2007 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473206 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:54:46] (03CR) 10Marostegui: [C: 031] "This will do the trick. 
We need remove that "if" once the old parsercache have been decommissioned and leave the default basedir to 101 as" [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [13:58:27] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool pc2007 to replace pc2004 (duration: 00m 48s) [13:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:31] (03PS3) 10Alexandros Kosiaris: Proton: Enable cancellable promises [puppet] - 10https://gerrit.wikimedia.org/r/472735 (https://phabricator.wikimedia.org/T204055) (owner: 10Mobrovac) [13:58:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Proton: Enable cancellable promises [puppet] - 10https://gerrit.wikimedia.org/r/472735 (https://phabricator.wikimedia.org/T204055) (owner: 10Mobrovac) [13:59:35] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [13:59:36] hashar: thanks \o/ [13:59:52] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) pc2007 is now in production replacing pc2004 [14:00:04] hashar: How many deployers does it take to do MediaWiki train - European version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1400). [14:00:08] 10Operations, 10Traffic, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) Seems to be testing fine on https://pinkunicorn.wikimedia.org/ , and the pre-deployment to all caches hosts and OCSP Stapling looks fine too. The skew window for the transitio... 
[14:00:18] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [14:00:19] !log scap prep 1.33.0-wmf.4 # T206658 [14:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:22] T206658: 1.33.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T206658 [14:00:50] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) a:05Banyek>03Marostegui Assigning this to myself to reflect the current status [14:01:55] (03PS1) 10Filippo Giunchedi: prometheus: add rsyslog_exporter rules [puppet] - 10https://gerrit.wikimedia.org/r/473207 (https://phabricator.wikimedia.org/T206633) [14:03:36] !log start plugin and JVM upgrade on elasticsearch / cirrus / codfw - T209293 [14:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:39] T209293: Prepare a deb package with the experimental highlighter 5.5.2.4 - https://phabricator.wikimedia.org/T209293 [14:03:47] (03PS1) 10GTirloni: toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) [14:03:53] !log Applied security patches to 1.33.0-wmf.4 | T206658 [14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:03] (03PS2) 10GTirloni: toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) [14:04:17] (03CR) 10Banyek: [C: 032] mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [14:04:30] (03PS5) 10Banyek: mariadb: parsercache different basedirs [puppet] - 
10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) [14:04:41] (03CR) 10Banyek: [V: 032 C: 032] mariadb: parsercache different basedirs [puppet] - 10https://gerrit.wikimedia.org/r/473194 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [14:04:44] (03PS1) 10Hashar: Group 0 to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473209 [14:04:52] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) (owner: 10GTirloni) [14:05:02] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add rsyslog_exporter rules [puppet] - 10https://gerrit.wikimedia.org/r/473207 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [14:05:13] (03PS2) 10Filippo Giunchedi: prometheus: add rsyslog_exporter rules [puppet] - 10https://gerrit.wikimedia.org/r/473207 (https://phabricator.wikimedia.org/T206633) [14:05:15] (03CR) 10jenkins-bot: db-codfw.php: Pool pc2007 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473206 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [14:05:44] (03PS3) 10GTirloni: toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) [14:06:00] (03PS4) 10GTirloni: toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) [14:06:39] banyek: ok to merge your change? [14:06:47] Banyek: mariadb: parsercache different basedirs (a4c73eb922) [14:07:06] yes, please [14:07:12] (03PS5) 10GTirloni: toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) [14:07:15] banyek: ok! 
done [14:07:23] I was just asking this in #security :D [14:07:27] godog: thanks [14:07:44] hahah great timining [14:07:49] timing even [14:08:18] (03CR) 10GTirloni: [C: 032] toolforge: Add strict rules against spam [puppet] - 10https://gerrit.wikimedia.org/r/473208 (https://phabricator.wikimedia.org/T208579) (owner: 10GTirloni) [14:12:21] (03PS1) 10Gehel: elasticsearch: add hostname and fqdn as node attributes [puppet] - 10https://gerrit.wikimedia.org/r/473210 [14:13:37] (03PS1) 10BBlack: Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804) [14:15:13] (03CR) 10Gehel: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/13452/" [puppet] - 10https://gerrit.wikimedia.org/r/473210 (owner: 10Gehel) [14:15:16] !log hashar@deploy1001 Pruned MediaWiki: 1.32.0-wmf.24 (duration: 08m 55s) [14:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:58] !log increase the migration downtime for logstash1007, logstash1008, logstash1009. 
It should make live migration of these VMs easier and without the need for manual fiddling [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:35] (03PS1) 10Volans: interactive: add ensure_shell_session() [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) [14:19:37] (03PS1) 10Volans: remote: refactor Remote.query() API [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) [14:20:16] !log reboot logstash1007, logstash1008, logstash1009 with 500 secs of sleep between them for the migration_downtime ganeti setting to be applied [14:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:18] gehel ^ [14:20:26] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.4 | T206658 [14:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:28] T206658: 1.33.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T206658 [14:20:48] akosiaris: thanks (cc godog) [14:20:56] ah yes, my bad [14:21:07] I should have pinged infra foundations [14:21:09] akosiaris: np, happy to know about it! [14:22:00] (03CR) 10Volans: "Almost moved verbatim from the wmf-auto-reimage library as part of the linked task." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:22:59] 10Operations, 10cloud-services-team, 10Continuous-Integration-Infrastructure (shipyard), 10Nodepool, 10Release-Engineering-Team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [14:23:22] 10Operations, 10cloud-services-team, 10Continuous-Integration-Infrastructure (shipyard), 10Nodepool, 10Release-Engineering-Team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [14:24:54] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473215 (https://phabricator.wikimedia.org/T204745) [14:25:34] (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473215 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [14:27:59] !log draining ganeti1007 for reboot/kernel security update [14:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:11] 10Operations, 10cloud-services-team, 10Continuous-Integration-Infrastructure (shipyard), 10Nodepool, 10Release-Engineering-Team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [14:29:52] (03CR) 10DCausse: [C: 031] elasticsearch: add hostname and fqdn as node attributes [puppet] - 10https://gerrit.wikimedia.org/r/473210 (owner: 10Gehel) [14:30:13] (03CR) 10Gehel: [C: 032] elasticsearch: add hostname and fqdn as node attributes [puppet] - 10https://gerrit.wikimedia.org/r/473210 (owner: 10Gehel) [14:30:21] (03PS2) 10Gehel: elasticsearch: add hostname and fqdn as node attributes [puppet] - 10https://gerrit.wikimedia.org/r/473210 [14:31:28] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:33:09] akosiaris: ^ related to your thing? [14:33:22] yup [14:35:27] (03PS1) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [14:39:00] 10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) @jkim_wikimedia any luck logging in? [14:39:06] (03PS2) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [14:39:19] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational [14:39:29] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational [14:39:50] (03Abandoned) 10Giuseppe Lavagetto: php::extension: use version-specific package name by default [puppet] - 10https://gerrit.wikimedia.org/r/470863 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto) [14:40:00] !log hashar@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.4 | T206658 (duration: 19m 34s) [14:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:03] T206658: 1.33.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T206658 [14:40:28] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [14:41:41] (03PS3) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [14:44:06] (03CR) 10Niedzielski: "This goes out tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [14:44:27] hmm, im getting 404s on it.wp javascript files... [14:44:35] Request from 212.178.66.253 via cp1079 cp1079, Varnish XID 785144653 [14:44:38] Error: 404, Not Found [14:45:20] oh nvmd [14:45:37] this gadget is broken, no wonder [14:46:40] (03CR) 10Gehel: "Naming question. I trust that the implementation is correct." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:47:24] (03PS1) 10Niedzielski: Prod: increase Schema.org page split test to 5% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473221 (https://phabricator.wikimedia.org/T208755) [14:47:41] gehel: you know the rules, complains against naming can come only with alternative suggestions :-P [14:48:00] damn, I'll have to rack my brain on that one [14:48:11] (03CR) 10Niedzielski: "This goes out tomorrow evening." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473221 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [14:48:16] on wmf auto reimage it was 'ensure_shell_mode' [14:49:28] should we use 'ensure_shell-multiplexed'? :-P [14:49:38] s/-/_/ [14:50:01] (03CR) 10Hashar: [C: 032] "Train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473209 (owner: 10Hashar) [14:50:01] it's not really the multiplexed part that we are interested in, more the fact that it is resistant to SSH reset [14:50:08] yeah [14:50:24] reattachable? [14:50:39] (03PS1) 10Niedzielski: Prod: increase Schema.org page split test to 25% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473224 (https://phabricator.wikimedia.org/T208755) [14:50:40] as you can see I was totally out of good names ;) [14:51:10] yeah, 2 hard problems: naming, cache invalidation and off by one errors [14:51:20] (03CR) 10Niedzielski: "This goes out Thursday morning." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473224 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [14:51:52] (03CR) 10Gehel: interactive: add ensure_shell_session() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:52:00] (03Merged) 10jenkins-bot: Group 0 to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473209 (owner: 10Hashar) [14:52:12] volans: I think I like the "reattachable" best [14:52:49] (03PS1) 10Niedzielski: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) [14:54:13] (03CR) 10Niedzielski: "This goes out Monday morning." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [14:54:23] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group to 1.33.0-wmf.4 | T206658 [14:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:29] T206658: 1.33.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T206658 [14:54:38] (03PS1) 10Filippo Giunchedi: Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226 [14:55:37] (03CR) 10Filippo Giunchedi: "To be merged in the next couple of days" [puppet] - 10https://gerrit.wikimedia.org/r/473226 (owner: 10Filippo Giunchedi) [14:57:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Cmjohnson) @ottomata Lets go with cloudvirtan1xxx. 
[14:59:01] gehel: the only problem of reattachable is that that check passes also for non-interactive shells, that clearly you can't attach to at all :) [14:59:46] yeah, naming is hard (note that I did not -1 the change, I'm fine with merging if we don't have a better name, it is not awful, just not very descriptive) [14:59:59] !log increase the migration downtime for kafkamon1001. It should make live migration of these VMs easier and without the need for manual fiddling [15:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] what about 'persistent'? [15:00:43] same issue for non-interactive, at least to some extent [15:00:51] but probably better [15:00:51] (03PS2) 10Niedzielski: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) [15:00:53] (03PS2) 10Niedzielski: Prod: increase Schema.org page split test to 5% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473221 (https://phabricator.wikimedia.org/T208755) [15:00:55] (03PS2) 10Niedzielski: Prod: increase Schema.org page split test to 25% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473224 (https://phabricator.wikimedia.org/T208755) [15:00:58] (03PS2) 10Niedzielski: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) [15:01:00] (03PS1) 10Niedzielski: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) [15:01:02] volans: it matches the intent at least [15:01:08] detached [15:01:20] * volans going into the wild now [15:02:00] (03CR) 10Niedzielski: "This goes out Monday evening." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [15:03:15] (03CR) 10jenkins-bot: Group 0 to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473209 (owner: 10Hashar) [15:04:24] (03PS1) 10Fdans: Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) [15:08:23] (03PS2) 10Filippo Giunchedi: hieradata: rollout syslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/473172 (https://phabricator.wikimedia.org/T206633) [15:08:34] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout syslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/473172 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [15:18:12] !log Restore replication consistency options on dbstore2002:3313 as it has caught up - T208320 [15:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:15] T208320: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 [15:18:57] 10Operations, 10DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) I just restored the original flags to sync_binlog=1 and trx_commit=1 as s3 caught up. [15:21:05] gehel: volans: how about 'durable'? [15:21:39] !log draining ganeti1006 for reboot/kernel security update [15:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:16] cdanis: also, maybe 'local' even? 
[15:22:50] group0 looks good [15:22:51] given that the real session is local to the host (non-interactive or tmux/screen) and if you're attached to it remotely is not influent [15:22:55] (03PS1) 10Vgutierrez: certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) [15:22:55] going back home [15:23:47] (03CR) 10jerkins-bot: [V: 04-1] certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [15:25:12] (03CR) 10Herron: "Thanks for the feedback! This hasn't yet been deployed to my test environment (working on that today) so I'll set the WIP flag in gerrit " [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [15:25:43] (03PS2) 10Vgutierrez: certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) [15:26:41] (03CR) 10Alex Monk: certcentral: switch to active/passive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [15:26:53] local seems potentially confusing to me [15:28:03] (03CR) 10Vgutierrez: certcentral: switch to active/passive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [15:28:30] short names are confusing, like 'safe', 'local', etc... longer names are hard to convey the right concept :) [15:28:45] it's a conundrum [15:29:25] (03PS3) 10Vgutierrez: certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) [15:29:36] volans: don't spend too much time on it, there is a docstring explaining what it does... 
[15:29:48] I'm not, you are :) [15:29:56] (03CR) 10jerkins-bot: [V: 04-1] certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [15:30:00] (03PS4) 10Vgutierrez: certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) [15:30:08] I think "durable session" captures all of that for me, but it was also my idea, so of course it does ;) [15:30:09] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:30:15] but yeah it is a conundrum [15:30:18] PROBLEM - Check systemd state on ganeti1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:30:50] (03CR) 10jerkins-bot: [V: 04-1] certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [15:31:18] (03PS1) 10Giuseppe Lavagetto: mediawiki: install php 7.2 instead of 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/473230 (https://phabricator.wikimedia.org/T208433) [15:31:22] (03PS1) 10Giuseppe Lavagetto: mediawiki::appserver: install php-fpm everywhere [puppet] - 10https://gerrit.wikimedia.org/r/473231 (https://phabricator.wikimedia.org/T208433) [15:31:26] (03PS5) 10Vgutierrez: certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) [15:32:08] (03PS2) 10Giuseppe Lavagetto: mediawiki: install php 7.2 instead of 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/473230 (https://phabricator.wikimedia.org/T208433) [15:32:10] (03PS2) 10Giuseppe Lavagetto: mediawiki::appserver: install php-fpm everywhere [puppet] - 10https://gerrit.wikimedia.org/r/473231 (https://phabricator.wikimedia.org/T208433) [15:32:14] (03PS1) 10Ema: ATS: logging.yaml support [puppet] - 
10https://gerrit.wikimedia.org/r/473232 (https://phabricator.wikimedia.org/T204225) [15:34:30] 10Operations, 10ops-codfw, 10Traffic: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Vgutierrez) The system is online since 07:30 UTC [15:39:20] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: install php 7.2 instead of 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/473230 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto) [15:42:57] (03PS1) 10Giuseppe Lavagetto: mediawiki: correctly use a string in hiera for php versions [puppet] - 10https://gerrit.wikimedia.org/r/473233 [15:43:02] <_joe_> sigh ^^ [15:43:25] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: correctly use a string in hiera for php versions [puppet] - 10https://gerrit.wikimedia.org/r/473233 (owner: 10Giuseppe Lavagetto) [15:43:55] <_joe_> why is jenkins so slow today? [15:44:54] some murphy's law corollary? [15:45:09] !log restart tilerator on maps1004 [15:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:28] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:45:29] (03PS1) 10Andrew Bogott: mwopenstackclients: select the designate endpoint for the proper region [puppet] - 10https://gerrit.wikimedia.org/r/473234 (https://phabricator.wikimedia.org/T209374) [15:46:01] <_joe_> mw2135 is me [15:46:04] (03CR) 10jerkins-bot: [V: 04-1] mwopenstackclients: select the designate endpoint for the proper region [puppet] - 10https://gerrit.wikimedia.org/r/473234 (https://phabricator.wikimedia.org/T209374) (owner: 10Andrew Bogott) [15:48:17] (03CR) 10GTirloni: [C: 031] mwopenstackclients: select the designate endpoint for the proper region [puppet] - 10https://gerrit.wikimedia.org/r/473234 (https://phabricator.wikimedia.org/T209374) (owner: 10Andrew Bogott) [15:48:42] (03CR) 10BBlack: [C: 031] certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [15:48:56] <_joe_> !log upgrading extensions on all appservers / jobrunners while upgrading to php 7.2 [15:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:22] (03PS2) 10Andrew Bogott: mwopenstackclients: select the designate endpoint for the proper region [puppet] - 10https://gerrit.wikimedia.org/r/473234 (https://phabricator.wikimedia.org/T209375) [15:50:38] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:51:00] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: select the designate endpoint for the proper region [puppet] - 10https://gerrit.wikimedia.org/r/473234 (https://phabricator.wikimedia.org/T209375) (owner: 10Andrew Bogott) [15:51:19] (03PS1) 10Dzahn: switch icinga-stretch to icinga and icinga to icinga-old [dns] - 10https://gerrit.wikimedia.org/r/473235 (https://phabricator.wikimedia.org/T202782) [15:54:12] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - 
https://phabricator.wikimedia.org/T202782 (10Dzahn) [15:58:17] (03PS2) 10Cwhite: admin: add add shell account for Leszek Manicki [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) [15:58:56] (03CR) 10Cwhite: "> Okay, yeah, we re-use the ldap uid these days. If you haven't done" [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [15:58:59] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/php-fpm.conf],File[/etc/php/7.2/fpm/pool.d] [16:04:08] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:04:58] here to assist in Gerrit upgrade if anything needed [16:07:18] !log anomie@mwmaint1002 Running refreshExternallinksIndex.php on section 3 wikis in group 0 for T209373 [16:07:18] !log anomie@mwmaint1002 Running refreshExternallinksIndex.php on labtestwiki for T209373 [16:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:21] T209373: Run maintenance/refreshExternallinksIndex.php on all wikis - https://phabricator.wikimedia.org/T209373 [16:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:44] (03PS1) 10Vgutierrez: lvs: configure lvs2010 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/473238 (https://phabricator.wikimedia.org/T209337) [16:09:28] (03CR) 10Hashar: [C: 032] Merge tag 'v2.15.6' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/471758 (owner: 10Thcipriani) [16:11:12] addshore: o/ are there any additional steps that i should take when deploying a change to the wikibase extension or is it a regular extension deploy these days? 
[16:11:50] (03CR) 10Hashar: [C: 032] "Archiva is like magic" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/472003 (https://phabricator.wikimedia.org/T205784) (owner: 10Thcipriani) [16:12:09] (03PS2) 10Ema: ATS: logging.yaml support [puppet] - 10https://gerrit.wikimedia.org/r/473232 (https://phabricator.wikimedia.org/T204225) [16:13:09] (03CR) 10Ema: [C: 032] ATS: logging.yaml support [puppet] - 10https://gerrit.wikimedia.org/r/473232 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [16:13:41] (03CR) 10Nuria: [C: 031] Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [16:13:44] (03CR) 10Hashar: [V: 032 C: 032] Gerrit 2.15.6 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/472003 (https://phabricator.wikimedia.org/T205784) (owner: 10Thcipriani) [16:14:13] (03CR) 10Hashar: [V: 032 C: 032] Merge tag 'v2.15.6' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/471758 (owner: 10Thcipriani) [16:14:59] (03CR) 10Elukey: Add info about new fields in the uniques dump (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [16:17:44] (03PS2) 10Fdans: Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) [16:17:51] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Ok! 
[16:18:24] (03PS4) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [16:19:28] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) [16:19:42] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-old on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473241 (https://phabricator.wikimedia.org/T188327) [16:20:12] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) @Cmjohnson I updated {T207194} to reflect the new naming. Please proceed and then assign to Cloud VPS folks f... [16:20:18] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473241 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:21:08] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10RobH) [16:21:35] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-old on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473241 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:21:46] (03CR) 10Gehel: "PCC agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/13458/" [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [16:22:08] PROBLEM - DPKG on mw2205 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:22:45] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-old on test wikis and mediawikiwiki (T188327) (duration: 00m 54s) [16:22:48] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:48] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [16:23:09] RECOVERY - DPKG on mw2205 is OK: All packages OK [16:24:14] (03CR) 10Ema: [C: 031] lvs: configure lvs2010 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/473238 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez) [16:24:25] (03PS1) 10Dzahn: icinga: make icinga1001 active and einsteinium passive [puppet] - 10https://gerrit.wikimedia.org/r/473244 (https://phabricator.wikimedia.org/T202782) [16:24:44] <_joe_> uhm mw2205 might be me [16:24:51] (03CR) 10Vgutierrez: [C: 032] lvs: configure lvs2010 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/473238 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez) [16:24:59] (03PS2) 10Vgutierrez: lvs: configure lvs2010 interfaces [puppet] - 10https://gerrit.wikimedia.org/r/473238 (https://phabricator.wikimedia.org/T209337) [16:25:08] PROBLEM - DPKG on mw2155 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:26:18] RECOVERY - DPKG on mw2155 is OK: All packages OK [16:28:06] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Vgutierrez) we will be replacing lvs2006 with lvs2010 [16:28:45] we are going to restart Gerrit soonish for a scheduled upgrade [16:28:49] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can... 
[16:29:48] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@d2763c6]: v2.15.6 to gerrit2001 [16:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:59] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@d2763c6]: v2.15.6 to gerrit2001 (duration: 00m 11s) [16:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:19] oooohhhh [16:30:27] (03CR) 10Herron: [C: 031] "Ready when you are!" [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [16:31:02] (03CR) 10Cwhite: [C: 031] switch icinga-stretch to icinga and icinga to icinga-old [dns] - 10https://gerrit.wikimedia.org/r/473235 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:31:08] (03CR) 10Cwhite: [C: 031] icinga: make icinga1001 active and einsteinium passive [puppet] - 10https://gerrit.wikimedia.org/r/473244 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:31:09] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) [16:31:18] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) [16:32:31] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@d2763c6]: v2.15.6 to cobalt [16:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:33] Do tell me if GerritBot breaks (as i've updated it to use the new phab api) [16:32:38] PROBLEM - DPKG on mw2256 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:32:41] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@d2763c6]: v2.15.6 to cobalt (duration: 00m 10s) [16:32:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:09] !log restarting gerrit service for upgrade to 2.15.6 [16:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:39] RECOVERY - DPKG on mw2256 is OK: All packages OK [16:35:33] hashar: Yay! It's down :) [16:36:17] It's up again :) [16:37:20] <_joe_> d3r1ck: what's down? [16:37:32] _joe_ gerrit [16:37:33] _joe_: Gerrit :), but it's up now :) [16:37:54] <_joe_> d3r1ck: hah, sorry, I thought it was a more general consideration :P [16:38:58] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [16:40:39] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-old on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473241 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:41:08] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [16:42:51] running puppet on kafka2001 to see if it was temporary [16:42:55] 10Operations, 10Patch-For-Review, 10User-Joe: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433 (10Joe) status update: php 7.2 is installed on the application servers in production and in beta. Exception is for now the maintenance servers, and the dump hosts, where... 
[16:43:28] works fine, so puppet was probably running when gerrit was restarted [16:43:59] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:45:57] 10Operations: change my email address in the techcom alias - https://phabricator.wikimedia.org/T209391 (10kchapman) [16:48:47] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) a:03aborrero [16:48:49] PROBLEM - DPKG on mw2202 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:49:41] 10Operations, 10ops-eqiad, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: rack/setup/install sessionstore100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T209393 (10RobH) [16:49:58] RECOVERY - DPKG on mw2202 is OK: All packages OK [16:50:18] yea, one of those failed git pulls usually happens when gerrit restarts [16:52:30] jouncebot: now [16:52:33] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [16:52:35] jouncebot: next [16:52:35] In 0 hour(s) and 7 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1700) [16:52:52] ok, i see no changes scheduled [16:53:17] we will go ahead with switching over icinga server in a bit [16:53:44] (03PS3) 10Cwhite: admin: add add shell account for Leszek Manicki [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) [16:54:26] (03CR) 10Cwhite: [C: 032] admin: add add shell account for Leszek Manicki [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [16:55:51] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (10colewhite) 05Open>03Resolved [16:56:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: 
Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (10colewhite) [16:57:51] 10Operations, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10colewhite) p:05Triage>03Low [16:59:08] (03PS1) 10Andrew Bogott: Horizon: allow wikidumpparse to create VMs in the eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/473249 (https://phabricator.wikimedia.org/T209386) [17:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:22] <_joe_> jouncebot: none, apparently [17:01:29] !log starting migration of icinga server - maintenance windows [17:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:36] (03PS2) 10Dzahn: icinga: make icinga1001 active and einsteinium passive [puppet] - 10https://gerrit.wikimedia.org/r/473244 (https://phabricator.wikimedia.org/T202782) [17:03:14] (03CR) 10Dzahn: [C: 032] icinga: make icinga1001 active and einsteinium passive [puppet] - 10https://gerrit.wikimedia.org/r/473244 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [17:04:13] !log disabled puppet on all 3 icinga servers, re-enabling on einsteinium , going through https://wikitech.wikimedia.org/wiki/Icinga#Failover_Icinga_between_the_active_and_passive_servers [17:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:34] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:09:35] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: 
` ['lvs2010.codfw.wmnet'] ` [17:09:47] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can... [17:09:50] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] ` [17:10:11] 10Operations, 10ops-codfw: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [17:10:22] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can... [17:11:04] 10Operations, 10ops-codfw: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) p:05Triage>03Normal [17:11:25] PROBLEM - HTTPS on einsteinium is CRITICAL: SSL CRITICAL - failed to verify icinga-old.wikimedia.org against icinga.wikimedia.org [17:12:22] (03CR) 10Dzahn: [C: 032] switch icinga-stretch to icinga and icinga to icinga-old [dns] - 10https://gerrit.wikimedia.org/r/473235 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [17:12:38] icinga-wm nicely reports the changein icinga itself.. changing that now [17:13:25] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[acme-setup-acme-icinga] [17:13:26] !log Added 172.16.0.0/21 to the allowed connections for wikilabels postgresql on labsdb1004 [17:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:07] 10Operations, 10ops-codfw: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10fgiunchedi) Racking plan looks good to me! [17:19:16] (03PS1) 10Cwhite: icinga: disable service notifications on einsteinium and enable on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/473250 (https://phabricator.wikimedia.org/T202782) [17:19:44] RECOVERY - HTTPS on einsteinium is OK: SSL OK - Certificate icinga-old.wikimedia.org valid until 2019-02-11 16:18:45 +0000 (expires in 89 days) [17:19:48] (03PS3) 10Elukey: Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [17:20:39] (03CR) 10Dzahn: [C: 031] icinga: disable service notifications on einsteinium and enable on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/473250 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:21:23] !log ran puppet on einsteinium; re-enabling puppet on tegmen and icinga1001 [17:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:05] (03PS1) 10Cwhite: icinga: transition to new active host [puppet] - 10https://gerrit.wikimedia.org/r/473251 (https://phabricator.wikimedia.org/T202782) [17:22:23] (03CR) 10Dzahn: [C: 032] icinga: disable service notifications on einsteinium and enable on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/473250 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:23:09] (03PS2) 10Cwhite: icinga: transition to new active host [puppet] - 10https://gerrit.wikimedia.org/r/473251 (https://phabricator.wikimedia.org/T202782) [17:23:19] (03CR) 10Elukey: [C: 032] Add info about new fields
in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [17:23:24] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:23:27] (03PS4) 10Elukey: Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [17:23:51] (03CR) 10Dzahn: [C: 031] icinga: transition to new active host [puppet] - 10https://gerrit.wikimedia.org/r/473251 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:24:01] (03CR) 10Cwhite: [C: 032] icinga: transition to new active host [puppet] - 10https://gerrit.wikimedia.org/r/473251 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:24:34] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set a harbor registry for testing - https://phabricator.wikimedia.org/T209271 (10fselles) Regarding @Joe 's comment. The last picture should be something similar to this, this is quite complex and is build up from the idea that... [17:24:44] mutante: I'm around, if needed just ping me ;) [17:24:53] shdubsh too ofc [17:25:05] (03PS5) 10Elukey: Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [17:25:22] (03CR) 10Elukey: [V: 032 C: 032] Add info about new fields in the uniques dump [puppet] - 10https://gerrit.wikimedia.org/r/473228 (https://phabricator.wikimedia.org/T167539) (owner: 10Fdans) [17:27:05] volans: thanks :) [17:27:18] volans: yep, thanks! [17:27:35] so far so good [17:27:36] !log re-enabled puppet on icinga1001, einsteinium becoming passive [17:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:43] so long icinga-wm from einsteinium [17:28:10] it is disabling notifications for each and every check right now.. 
on old [17:28:17] and enabling them on new at the same time [17:29:09] hello icinga-wm [17:29:13] Notice: /Stage[main]/Icinga/Service[icinga]: Triggered 'refresh' from 727 events [17:29:17] eheheh [17:29:21] same on both? [17:29:27] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fselles) [17:29:30] that was old.. new is not finished yet [17:29:45] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473254 [17:29:47] (03PS1) 10Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) [17:30:04] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473254 (owner: 10Zoranzoki21) [17:30:09] (03CR) 10jerkins-bot: [V: 04-1] Add new throttle rule for Art+Feminism Event on 2018-11-17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) (owner: 10Zoranzoki21) [17:30:20] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473254 (owner: 10Zoranzoki21) [17:30:35] 728 events on icinga1001. 
done [17:30:40] (03PS2) 10Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) [17:30:44] dig +short icinga.wikimedia.org shows icinga1001 on all [17:30:56] no puppet errors [17:32:07] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) This will be missing rack rails: >>! In T207801#4743235, @Pswaby wrote: > @Pap... [17:32:32] https://icinga.wikimedia.org/icinga/ works and is new version 1.13.4 [17:32:43] volans: we did all the steps from your list too.. done [17:32:53] i would call it smooth [17:33:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10elukey) @mforns created https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/prometheus-varnishkafka-exporter [17:33:45] downtimes seem to be there, UI doesn't have notification disabled so is the active one [17:34:10] mutante: maybe trigger an IRC-only alert to check it works fine and probably a paging one too to be on the safe side? [17:34:31] btw, update that list on wiki if there was anything different, was from a while ago ;) [17:36:24] CUSTOM - Check systemd state on ganeti1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:29] ^ test 1 [17:37:25] custom notification from lvs1001 work as a paging test? [17:38:47] lvs1001 probably isn't a good source of testing for anything, it's pretty critical!
[17:40:10] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) Some servers may have newer packages still: labservices1002: ` linux-image-generic/trusty-updates,trusty-security 3.13.0.161.171 amd64 [upgradable from: 3.13.0.160.170] ` Should... [17:40:16] well, we need something that is "critical => true" or it wouldn't page [17:40:44] cloudcontrol1003 should also work [17:40:51] 3100 contact_groups admins,sms,admins [17:40:57] the "sms" contactgroup does it [17:41:35] wut? admins should appear one time only and puppet should create the list as such... [17:42:01] if you use something cloud-related check with cloud-admin first [17:42:10] yea, known issue but it doesn't break anything for icinga [17:42:31] (the duplicate group thing isn't new) [17:43:52] mutante, shdubsh: why are the config files in /etc/nagios? seems conceptually wrong [17:44:25] *some config files [17:44:31] or maybe are not used? [17:46:02] volans: they are used. iirc, it's a side-effect of how the nagios puppet resources behave.
[17:47:29] ok [17:52:03] (03PS5) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [17:52:32] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [17:55:27] (03PS1) 10MSantos: Disable replication cron in maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473257 [17:57:43] (03PS6) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [17:57:45] (03PS1) 10Gehel: elasticsearch: create multiple elasticsearch instances on cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/473258 (https://phabricator.wikimedia.org/T207918) [17:59:32] (03PS1) 10MSantos: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 [18:00:02] (03CR) 10jerkins-bot: [V: 04-1] Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T1800). 
[18:00:26] (03PS7) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [18:00:28] (03PS2) 10Gehel: elasticsearch: create multiple elasticsearch instances on cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/473258 (https://phabricator.wikimedia.org/T207918) [18:00:51] (03CR) 10Gehel: [C: 032] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/13461/maps1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/473257 (owner: 10MSantos) [18:01:30] mateusbs17: ^ [18:02:03] Thanks! I am fixing some stuff on the ncpu right now. [18:02:36] mateusbs17: ping me to review / deploy if I miss the message! [18:03:04] !log icinga migration has concluded, we are now on stretch and icinga1001, einsteinium is passive (T202782) [18:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:07] T202782: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 [18:03:16] (03PS2) 10MSantos: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 [18:04:05] (03CR) 10jerkins-bot: [V: 04-1] Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [18:04:31] (03CR) 10Gehel: "PCC agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/13462/" [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [18:07:14] (03PS3) 10MSantos: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 [18:08:05] (03CR) 10jerkins-bot: [V: 04-1] Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [18:09:10] CUSTOM - LVS HTTP IPv4 on ores.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 6261 bytes in 0.077 second response time [18:09:45] (03PS4) 
10MSantos: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 [18:09:57] I assume that's a test? [18:10:21] ok :) [18:10:23] yes [18:11:01] !log the CUSTOM message from ores.svc.codfw was the (one-time) test of the new Icinga server [18:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:50] mmmh I didn't get paged [18:12:08] i put "THIS IS A TEST" into the form field, btw [18:12:14] but it didn't show up like i wanted it to [18:13:21] volans: previous tests of aql looked good. would notification periods prevent you from getting them? [18:13:28] I got it on SMS [18:13:55] I'm on 24h [18:14:05] but might be my phone, let me check first [18:14:09] PROBLEM - Check systemd state on ganeti1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:14:20] ^that's real and also a nice confirmation [18:14:28] that the bot works [18:14:43] shdubsh, mutante: did you add icinga1001 to aql? [18:14:52] yes, we did [18:14:56] otherwise non-US pages don't work [18:15:01] and we also tested AQL specifically [18:16:32] echo "CRITICAL - it's all down" | mail -a "From: icinga@icinga1001.wikimedia.org" -s TEST @text.aql.com worked [18:16:44] let me do that with your number [18:16:53] (03CR) 10Gehel: Increase tilerator num_workers maps1004 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [18:17:26] mutante: sure [18:17:36] I'm sending myself an sms from another phone too to double check [18:17:55] sent [18:18:15] mutante: got it ..."fun and profit"? [18:18:22] yea [18:18:37] hrmm [18:18:40] i got ores ok [18:18:43] but nothing with 'test' [18:18:48] got this one but not the other, any european got the other? [18:18:51] was i supposed to?
[18:18:52] robh: was a test to me only [18:18:55] oh, cool [18:18:57] whew [18:19:02] mutante: I can check the dashboard on aql [18:19:31] the part that "THIS IS A TEST" did not show is because the actual "body" of the message gets cut off , and it can be different per phone .. or something [18:19:53] that is unrelated..you should received just the ORES test [18:21:13] volans: yes, let's do that. it definitely allows the IP ... [18:21:22] oh it could be the sender address.. logging in too [18:21:35] i know i have to send from icinga@icinga1001 header too or it won't accept it [18:21:50] in addition to the IP whitelist [18:28:38] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Wikimedia-Logstash, and 2 others: Rate limit wdqs logs - https://phabricator.wikimedia.org/T204364 (10Smalyshev) p:05Triage>03Normal [18:33:05] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul we need to re-wire lvs2009 & lvs2010 to connect the first interface (enp175s0f0) to the main row for each server. 
[18:33:58] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] ` [18:35:15] (03PS5) 10MSantos: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 [18:37:09] (03CR) 10MSantos: Increase tilerator num_workers maps1004 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [18:37:41] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Wikimedia-Logstash, and 2 others: Rate limit wdqs logs - https://phabricator.wikimedia.org/T204364 (10Smalyshev) 05Open>03Resolved [18:39:16] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10MoritzMuehlenhoff) It's fine to upgrade the kernel, I've installed running what was recent when I created the task and those versions are sufficient to fix L1TF, but it's good to move to a... 
[18:41:00] (03PS1) 10Dzahn: icinga: fix path to puppet_hosts in icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/473261 [18:41:45] (03PS2) 10Dzahn: icinga: fix path to puppet_hosts in icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/473261 [18:43:02] 10Operations, 10ops-codfw: Rename asw-c4-codfw and asw2-c4-codfw - https://phabricator.wikimedia.org/T209077 (10Papaul) 05Open>03Resolved Done [18:43:51] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul) [18:43:53] (03CR) 10Herron: icinga: fix path to puppet_hosts in icinga-downtime (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473261 (owner: 10Dzahn) [18:45:44] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul) [18:48:21] (03PS3) 10Dzahn: icinga: fix path to puppet_hosts in icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/473261 (https://phabricator.wikimedia.org/T202782) [18:49:15] (03PS4) 10Dzahn: icinga: fix path to puppet_hosts in icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/473261 (https://phabricator.wikimedia.org/T202782) [18:49:43] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul) @ayounsi can not remove console port since the new switch is using the same port. (port 20) [18:50:10] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) We could start with spare systems and standby servers: * cloudelastic1001.wikimedia.org (stretch) (spare) * cloudelastic1002.wikimedia.org (stretch) (spare) * cloudelastic1003.wi... 
[18:50:20] 10Operations, 10ops-codfw: unrack/decom cr1-eqord - https://phabricator.wikimedia.org/T208049 (10Papaul) [18:50:41] 10Operations, 10ops-codfw: unrack/decom cr1-eqord - https://phabricator.wikimedia.org/T208049 (10Papaul) 05Open>03Resolved Done [18:52:04] (03CR) 10Dzahn: [C: 032] icinga: fix path to puppet_hosts in icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/473261 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:00:11] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Software caused connection abort [19:02:33] (03PS2) 10Volans: interactive: add ensure_shell_session() [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) [19:02:35] (03PS2) 10Volans: remote: refactor Remote.query() API [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) [19:04:02] (03PS1) 10Herron: exim minimal labs: allow from localhost interface addrs in rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/473263 [19:04:28] (03PS1) 10Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - 10https://gerrit.wikimedia.org/r/473264 [19:05:03] (03CR) 10Paladox: [C: 031] "Works for me! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/473263 (owner: 10Herron) [19:05:21] (03CR) 10Herron: [C: 032] exim minimal labs: allow from localhost interface addrs in rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/473263 (owner: 10Herron) [19:06:19] RECOVERY - Disk space on stat1004 is OK: DISK OK [19:06:38] just did a force remount, fuse hdfs wasn't feeling well [19:08:00] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [19:08:52] elukey: heavy digestion? 
:-P [19:12:42] (03PS2) 10Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - 10https://gerrit.wikimedia.org/r/473264 [19:13:59] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez the first NIC of each server is connected to the switch where the server is racked in. Example: lvs2010 is racked in D2 so the first NIC is connect... [19:14:07] PROBLEM - IPsec on rdb2001 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: rdb1001_v4 [19:14:17] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1542136455 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 23 hours 51 minutes - replication_delay is 1542136455 [19:14:23] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1542136462 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 23 hours 51 minutes - replication_delay is 1542136462 [19:14:43] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1542136482 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 23 hours 52 minutes - replication_delay is 1542136482 [19:17:29] (03CR) 10CDanis: [C: 031] "+1 for the name :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [19:20:14] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul so at least in lvs2010, debian installer seems to think that enp175s0f0 is the first NIC, the mac addr is 00:0a:f7:f0:0c:10. in lvs2009 the mac addres... 
[19:21:27] (03PS2) 10Herron: WIP: logstash::input::kafka: add topics_prefix support [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) [19:22:57] (03CR) 10Herron: WIP: logstash::input::kafka: add topics_prefix support (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:23:29] (03PS3) 10Herron: WIP: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) [19:31:25] !log uploaded librdkafka_0.11.6-1~bpo9+1+wikimedia1 packages to stretch-wikimedia T209300 [19:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:30] T209300: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 [19:33:53] (03PS1) 10Thcipriani: Add example systemd service file [software/keyholder] - 10https://gerrit.wikimedia.org/r/473270 [19:38:47] (03PS1) 10Ottomata: Refine needs to use Hive 1.1.0 jars from CDH 5.10.0 to work around jar version conflict [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) [19:39:51] (03CR) 10jerkins-bot: [V: 04-1] Refine needs to use Hive 1.1.0 jars from CDH 5.10.0 to work around jar version conflict [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) (owner: 10Ottomata) [19:40:10] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [19:41:02] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) The switch has happened. icinga1001 is now the active server and einsteinium is passive. This ticket still open for a grace period until we decom ein... 
[19:42:48] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/13464/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) (owner: 10Ottomata) [19:43:22] (03PS2) 10Ottomata: Refine needs Hive 1.1.0 jars from CDH 5.10.0 to work around jar version conflict [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) [19:43:51] !log otto@deploy1001 Started deploy [analytics/refinery@62d6f4b]: Deploy hive jars from CDH 5.10.0 to workaround Refine bug: T209407 [19:43:52] (03CR) 10jerkins-bot: [V: 04-1] Refine needs Hive 1.1.0 jars from CDH 5.10.0 to work around jar version conflict [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) (owner: 10Ottomata) [19:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:54] T209407: EventLogging Hive Refine broken after upgrade to CDH 5.15.0 - https://phabricator.wikimedia.org/T209407 [19:44:01] 10Operations, 10ops-eqiad, 10DC-Ops: kubestage1001.mgmt down or flapping - https://phabricator.wikimedia.org/T209112 (10Cmjohnson) The cable looks fine, I checked to make sure it was secured. if this is still a problem I can replace the cable [19:44:49] (03PS3) 10Ottomata: Refine needs Hive 1.1.0 jars from CDH 5.10.0 to work around jar version conflict [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) [19:44:56] 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1003 - https://phabricator.wikimedia.org/T209408 (10ops-monitoring-bot) [19:45:04] PROBLEM - Host cloudelastic1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:06] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10Cmjohnson) a:05Cmjohnson>03RobH Robh can you do the installs please. 
[19:45:18] RECOVERY - Host cloudelastic1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [19:46:25] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez on lvs2010 can you tell me which interface has this MAC address Routing instance : default-switch Vlan MAC MAC... [19:47:34] (03PS11) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [19:49:48] !log otto@deploy1001 Finished deploy [analytics/refinery@62d6f4b]: Deploy hive jars from CDH 5.10.0 to workaround Refine bug: T209407 (duration: 05m 57s) [19:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:51] T209407: EventLogging Hive Refine broken after upgrade to CDH 5.15.0 - https://phabricator.wikimedia.org/T209407 [19:50:11] (03CR) 10Ottomata: [C: 032] Refine needs Hive 1.1.0 jars from CDH 5.10.0 to work around jar version conflict [puppet] - 10https://gerrit.wikimedia.org/r/473271 (https://phabricator.wikimedia.org/T209407) (owner: 10Ottomata) [19:50:22] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:50:23] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul enp59s0f0 [19:50:31] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Cmjohnson) @ottomata can these go in any row or does it need to be row B? 
[19:50:38] (03PS1) 10Thcipriani: Pep8, multiline imports [software/keyholder] - 10https://gerrit.wikimedia.org/r/473272 [19:50:45] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) [19:51:32] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) IIUC it has to be row B for them to be used as Cloud Virts. @Andrew to confirm. If they can go any row, then they should be spread out as even... [19:53:42] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:54:23] (03PS1) 10GTirloni: shinken: Update IP after migration to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/473273 [19:55:43] (03CR) 10GTirloni: [C: 032] shinken: Update IP after migration to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/473273 (owner: 10GTirloni) [19:57:16] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez which position 2nd, 3rd or 4th? 
since enp175s0f0 is 1st [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181113T2000) [20:11:52] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) [20:11:56] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10herron) 05Open>03Resolved a:03herron [20:12:28] (03PS1) 10Dzahn: icinga: enable daemon log, in addition to syslog, again [puppet] - 10https://gerrit.wikimedia.org/r/473275 (https://phabricator.wikimedia.org/T202782) [20:12:44] (03PS3) 10Herron: logstash: add generic kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [20:13:57] (03CR) 10jerkins-bot: [V: 04-1] logstash: add generic kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [20:19:39] (03PS3) 10Herron: wmcs: cut over to new smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) [20:20:26] (03CR) 10Herron: [C: 032] wmcs: cut over to new smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:20:54] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 43 probes of 326 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:22:58] !log updated labs realm smarthosts (via hiera) to mx-out0[12].wmflabs.org T41785 [20:23:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:01] T41785: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 [20:25:31] (03PS2) 10Dzahn: icinga: enable daemon log, in addition to syslog, again [puppet] - 10https://gerrit.wikimedia.org/r/473275 (https://phabricator.wikimedia.org/T202782) [20:26:00] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 326 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:26:22] (03PS1) 10Dzahn: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) [20:29:01] (03CR) 10Dzahn: [C: 032] icinga: enable daemon log, in addition to syslog, again [puppet] - 10https://gerrit.wikimedia.org/r/473275 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:32:17] (03CR) 10Dzahn: "not planning to merge this before einsteinium is decom'ed to role(spare).. but here it is" [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:43:47] (03PS1) 10Dzahn: icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 [20:44:09] (03CR) 10Dzahn: "only after https://gerrit.wikimedia.org/r/473278" [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:47:03] (03PS4) 10Herron: logstash: add generic kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [20:48:25] (03CR) 10jerkins-bot: [V: 04-1] logstash: add generic kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [20:49:44] 178 [20:50:02] wrong window! 
[20:53:34] (03PS2) 10Andrew Bogott: Horizon: allow wikidumpparse to create VMs in the eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/473249 (https://phabricator.wikimedia.org/T209386) [20:53:36] (03PS1) 10Andrew Bogott: Cloud instances: update nrpe host for new shinken IP [puppet] - 10https://gerrit.wikimedia.org/r/473280 [20:54:21] (03CR) 10Andrew Bogott: [C: 032] Cloud instances: update nrpe host for new shinken IP [puppet] - 10https://gerrit.wikimedia.org/r/473280 (owner: 10Andrew Bogott) [20:54:25] (03PS5) 10Herron: logstash: add generic kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [20:55:29] (03CR) 10Gehel: [C: 031] "Minor comment inline, but already +1, feel free to merge." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [20:56:36] (03PS2) 10Dzahn: icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T202782) [20:56:55] (03PS3) 10Volans: interactive: add ensure_shell_is_durable() [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) [20:57:02] (03CR) 10Volans: interactive: add ensure_shell_is_durable() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [20:57:41] (03CR) 10jerkins-bot: [V: 04-1] icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:00:35] (03PS2) 10Paladox: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:00:37] (03CR) 10Gehel: [C: 04-1] "There are a few issues with type constraints. I'll dig into them tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [21:00:43] (03PS3) 10Dzahn: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) [21:03:06] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) Looks like this VM is substantially slower - I st... [21:03:09] (03CR) 10Volans: [C: 032] interactive: add ensure_shell_is_durable() [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [21:03:49] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:05:11] (03Merged) 10jenkins-bot: interactive: add ensure_shell_is_durable() [software/spicerack] - 10https://gerrit.wikimedia.org/r/473212 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [21:10:45] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time [21:11:25] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [21:11:49] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [21:21:07] XioNoX: for the purpose of being a smokeping target, does it matter if the host has a public IP or private? i get the point is that we have one target per row, and the current test hosts are .wikimedia.org [21:21:53] i want to replace einsteinium with .. ehm.. labservices1001.wikimedia.org then, because it's public and in same row (D3) [21:21:54] D3: test - ignore - https://phabricator.wikimedia.org/D3 [21:22:27] mutante: I guess icinga1001 moved to a different row? 
[21:22:58] XioNoX: that's right, and why i didn't just replace it with the same service [21:23:01] C8 [21:23:23] first i thought i cant find any other public ones in D3.. then i found labservices though [21:24:21] uhm.. and i was still in racktables..out of habit [21:24:25] mutante: dns1002.wikimedia.org [21:24:32] looking at NetBox [21:25:02] https://netbox.wikimedia.org/dcim/devices/?status=1&rack_group_id=8&role=server [21:25:10] XioNoX: thanks! , yea, my problem was looking at racktables [21:25:44] old habits die hard [21:28:07] (03PS1) 10Dzahn: smokeping: replace target einsteinium with dns1002 [puppet] - 10https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782) [21:29:26] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10atgo) 05Resolved>03Open @Dzahn just commented on the wrong task, whoops. Reopening because: @Dzahn sometimes I'm getting a Bugzilla page instead of the Bienvendia pag... [21:31:14] XioNoX: that is D1 instead of D3 though? [21:31:15] D3: test - ignore - https://phabricator.wikimedia.org/D3 [21:31:16] D1: Initial commit - https://phabricator.wikimedia.org/D1 [21:31:23] stashbot: sush :) [21:31:23] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [21:31:29] hahaha [21:31:53] mutante: not an issue, same switching fabric [21:33:07] XioNoX: ah! i see.. and .. we already have that target in tthere. i would be duplicating it [21:33:14] just realized on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473282/1/modules/smokeping/files/config.d/Targets [21:33:31] should i just remove einsteinium then? [21:33:49] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10atgo) p:05High>03Unbreak! 
Campaign is meant to launch tomorrow :| [21:34:49] mutante: we have 2 in C, 1 in A, one in D, so short answer, yeah, just delete it, long answer if you feel like it, find something in row B :) [21:36:34] XioNoX: ack! i will replace it with authdns1001.wikimedia.org (B1) [21:36:55] great! thanks [21:39:15] 10Operations, 10ops-codfw: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [21:51:43] mutante: nzr is looking for you in -dev [21:52:04] Reedy: yes, i saw the ticket [21:52:15] bblack: who should new readers talk to for this: https://phabricator.wikimedia.org/T207816#4744216 [21:52:18] sehr gut [21:52:39] Does traffic handle new urls? [21:52:42] coreyfloyd: on it haha.. came here for mutante [21:52:55] i mean corey is on it.. but i came here for same reason [21:52:57] Lol [21:53:26] Cool [21:53:27] the new url shows the landing page 1 out of 6 times.. otherwise gives bugzilla page [21:53:39] mutante: ^ [21:53:48] I'll poke it at [21:54:16] nzr_: no worries, i saw all that on the ticket [21:54:31] thanks!! [21:54:54] mutante: do you already have some idea what's going on? [21:54:58] if you need any more info.. feel free to ping me.. [21:55:16] 10Operations, 10decommission, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10RobH) [21:55:26] (also just a general note: this is very late before launch to be noticing problems!) [21:57:10] bblack: yes, clients without support for name based virtual host are getting the default [21:57:12] (03PS6) 10Herron: logstash: add generic kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [21:57:28] nzr_: rest assured i will look into this before the launch [21:57:48] Anne says Thanks! and I do too!! 
[21:57:57] mutante: there are very few clients like that in the world anymore, surely our own testers wouldn't be hitting that without mentioning the ancient client in question? [21:59:16] (and also, there's only one cert at this IP anyways, but that's beside the point) [21:59:40] screenshot is some version of chrome/chromium [22:00:25] yeah I can see it with latest-chromium on random reloads [22:02:11] i saw the ticket reopen immediately. i am just also sitting at lunch. it's not a broken live site. all sites have the same priority of 50 [22:02:29] that's what i can confirm [22:02:48] 10Operations, 10netops, 10cloud-services-team (Kanban): Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 (10Andrew) [22:04:27] the one the screenshot shows is the default server and all the other namevhosts have the same priority [22:05:06] Is it served by just one host? [22:05:18] yes [22:05:54] this used to be active-active with one host per DC but codfw is commented out currently since the last DC switch [22:06:12] even when it is a/a, one user would only map to one DC [22:06:20] (until some DC-level failover event) [22:06:35] i was going to ask if we can enable both again [22:06:41] but that's separate [22:06:54] there's probably something annoying happening with the varnish->webserver_misc_static persistent conns, as static-bugzilla is another site served by the same backend, probably the default one with no Host: header. 
[22:07:11] bblack: yes, that is definitely the default one [22:07:17] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [22:07:31] but all the other vhosts on it seem to be the same and not have that issue [22:07:32] huh [22:07:35] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:07:40] that's odd [22:07:44] I can still ping wikitech-static [22:07:57] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 28.16 ms [22:08:18] Krenair: it could've been a flap in outbound routing from WMF -> wikitech-static for monitoring [22:08:24] esp given timing with mr1-eqsin dropping [22:08:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:08:31] and that ^ [22:08:55] 6939 is Hurricane Electric, problems with them seem to happen lately... [22:09:09] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki) [22:09:31] mutante: both DCs for webserver_misc_static are enabled in puppet config [22:09:40] (bromine/vega) [22:09:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 174 probes of 327 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:11:39] bblack: oh, nice, i thought it was still commented, checking vega too [22:11:55] I'm on vega anyways [22:12:03] (since I map to codfw edge) [22:12:20] so apache2ctl -S shows me all the namevhosts [22:12:32] and i dont see why it would be different from the others [22:12:47] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 223.54 ms [22:13:42] bblack: something annoying would be "the requested server name header gets lost in the caching layer"? 
[22:14:41] i notice it's the 10th virtual host [22:14:51] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 20 probes of 327 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:14:53] wonder if somewhere there is something 0-9 [22:14:59] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:19:28] (03PS4) 10Herron: WIP: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) [22:21:23] PROBLEM - ensure kvm processes are running on labvirt1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 [22:24:45] !log add term labnet-nova-api to cloud-in4 on cr1/2-eqiad - T209424 [22:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:48] T209424: Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 [22:26:33] bblack: there is one virtual host on there that is not used. i could remove it.. then it would be under 10 again. i dont see what else on the backend makes the difference [22:26:35] (03PS5) 10Herron: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) [22:27:48] mutante: I doubt it's the count of them that matters [22:28:02] something else tricky is going on here, I think [22:28:29] well, you can try swapping those two vhost [22:28:34] to see if that makes a difference [22:30:02] i could make this site the default, yea. 
but will others get bienvenida then who shouldnt [22:30:31] (03PS2) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) [22:30:50] (03CR) 10Herron: logstash::input::kafka add support for SSL/TLS options (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [22:30:56] 10Operations, 10netops, 10cloud-services-team (Kanban): Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 (10ayounsi) Pushed to cr1/2-eqiad `lang=diff [edit firewall family inet filter cloud-in4] [...] + term labnet-nova-api { + from { +... [22:32:28] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10RyanSteinberg) I don't seem to have access to Office Wiki and I don't see an option to create an account. Should I share my public SSH key here or wait f... 
[22:34:00] 10Operations, 10netops, 10cloud-services-team (Kanban): Permit routing from eqiad1-r instances to labnet1001 - https://phabricator.wikimedia.org/T209424 (10ayounsi) 05Open>03Resolved [22:34:25] (03PS7) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [22:34:27] (03CR) 10Andrew Bogott: [C: 032] Horizon: allow wikidumpparse to create VMs in the eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/473249 (https://phabricator.wikimedia.org/T209386) (owner: 10Andrew Bogott) [22:35:41] (03PS1) 10Niedzielski: Prod use HTTPS for Wikidata repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) [22:36:36] (03CR) 10Herron: logstash: add rsyslog-shipper kafka input config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [22:37:16] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Andrew) Yep, Row B. @Cmjohnson, these are to be handled like any other cloudvirt server. For network details it might be best to consult with @ayounsi [22:40:58] (03CR) 10Abián: [C: 04-1] "Sorry, but I think there's no consensus on changing concept URIs to HTTPS. Note that these aren't simple URLs but stable identifiers." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:42:16] !log restart librenms irc bot [22:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:43] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (10Andrew) [22:43:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (10Andrew) [22:47:25] (03PS2) 10Niedzielski: Prod use HTTPS for Wikidata repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) [22:52:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "This has been extensively discussed in T153563, and ultimately declined. Entity URIs are supposed to be stable, and thanks to HSTS and HST" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:55:59] (03PS1) 10Bstorm: sonofgridengine: bring k8s bastion role information into bastion profile [puppet] - 10https://gerrit.wikimedia.org/r/473293 (https://phabricator.wikimedia.org/T200557) [22:59:53] (03CR) 10Brian Wolff: "If we are going to keep it http, i think a code comment is warrented to explain the situation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [23:01:05] (03PS2) 10Bstorm: sonofgridengine: bring k8s bastion role information into bastion profile [puppet] - 10https://gerrit.wikimedia.org/r/473293 (https://phabricator.wikimedia.org/T200557) [23:02:43] (03CR) 10Niedzielski: [C: 04-1] "Thank you, Lucas!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [23:03:58] (03CR) 10Bstorm: [C: 032] sonofgridengine: bring k8s bastion role information into bastion profile [puppet] - 10https://gerrit.wikimedia.org/r/473293 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [23:05:48] the way we have all VirtualHosts as *:80 is following docs though for several name based virtual hosts on a single IP [23:07:46] (03CR) 10Cwhite: "This looks good. When do you think this should be deployed?" [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:11:58] (03PS1) 10Cwhite: ntp: ensure absent ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) [23:15:08] (03CR) 10Cwhite: "This would come after I13078e6399e188fcb478d2243318cd86e8a5a9f9." [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [23:34:58] (03PS1) 10BBlack: misc-common: piwik cookies should not block caching either [puppet] - 10https://gerrit.wikimedia.org/r/473299 [23:44:20] (03PS1) 10Cwhite: diamond: remove diamond::collector::nginx [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) [23:57:43] o/