[00:05:21] (PS1) Dzahn: installserver: add webperf1002,webperf2002 to DHCP [puppet] - https://gerrit.wikimedia.org/r/433301 (https://phabricator.wikimedia.org/T194390)
[00:06:13] (CR) Dzahn: [C: 2] installserver: add webperf1002,webperf2002 to DHCP [puppet] - https://gerrit.wikimedia.org/r/433301 (https://phabricator.wikimedia.org/T194390) (owner: Dzahn)
[00:19:03] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[00:27:13] RECOVERY - Long running screen/tmux on labcontrol1001 is OK: OK: No SCREEN or tmux processes detected.
[01:25:43] (PS4) Subramanya Sastry: Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats [mediawiki-config] - https://gerrit.wikimedia.org/r/432621 (https://phabricator.wikimedia.org/T193685)
[01:56:02] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[02:22:22] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:40:31] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.3) (duration: 08m 25s)
[02:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:42] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 987.22 seconds
[03:38:32] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.4) (duration: 16m 26s)
[03:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed May 16 03:46:12 UTC 2018 (duration 7m 41s)
[03:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 73.90 seconds
[04:39:50] !log kartik@tin
Started deploy [cxserver/deploy@7e898c7]: Update cxserver to 112a1a1 (T191285)
[04:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:55] T191285: Update API to access short descriptions in Content Translation - https://phabricator.wikimedia.org/T191285
[04:43:42] !log kartik@tin Finished deploy [cxserver/deploy@7e898c7]: Update cxserver to 112a1a1 (T191285) (duration: 03m 52s)
[04:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:26] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[05:14:49] !log Deploy schema change on s2 primary master db1054 - T191519 T188299 T190148
[05:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:55] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:14:55] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:14:56] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:16:30] (PS1) Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/433314 (https://phabricator.wikimedia.org/T190148)
[05:18:22] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/433314 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:19:49] (Merged) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/433314 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:20:12] (CR) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/433314 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:21:36] !log marostegui@tin
Synchronized wmf-config/db-eqiad.php: Depool db1086 for alter table (duration: 01m 23s)
[05:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:58] Operations, ops-codfw, DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4209355 (Marostegui) a:jcrespo>Papaul It is indeed on predictive failure: ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure) ``` @Papaul can we replace it agai...
[05:22:11] !log Deploy schema change on db1086 - T191519 T188299 T190148
[05:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:17] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:22:17] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:22:17] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:22:56] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:24:37] (CR) Marostegui: [C: 1] mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[05:29:05] (CR) Marostegui: [C: 1] mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[05:34:47] (PS1) KartikMistry: WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342)
[05:35:41] (CR) jerkins-bot: [V: -1] WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) (owner: KartikMistry)
[05:42:59]
(PS1) Marostegui: s4,s7.hosts: Add db1120 [software] - https://gerrit.wikimedia.org/r/433319
[05:43:56] (CR) Marostegui: [C: 2] s4,s7.hosts: Add db1120 [software] - https://gerrit.wikimedia.org/r/433319 (owner: Marostegui)
[05:44:47] (Merged) jenkins-bot: s4,s7.hosts: Add db1120 [software] - https://gerrit.wikimedia.org/r/433319 (owner: Marostegui)
[05:49:03] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209381 (Marostegui) eqiad is now ready with all the data on multi-instance, so as soon as the final HW arrives we can just s...
[05:52:16] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[05:53:58] (PS1) Marostegui: db-codfw.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/433320 (https://phabricator.wikimedia.org/T190704)
[05:54:22] !log removed acpi_power_meter manually from conf1004 (blacklisted module in puppet), Acpi errors in dmesg
[05:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:55] the analytics1001/2 puppet failures seems due to an exec that tries to check a znode on zookeeper, and when it hits conf1004 it times ou
[05:54:58] out
[05:54:59] lovely
[05:55:07] (yesterday I swapped conf1001 with conf1004)
[05:58:16] ah wait the analytics firewall!
[05:58:24] * elukey checks
[05:59:44] yes I forgot to add conf1004 in there
[05:59:55] so analytics100[12] cannot contact it /o\
[06:11:40] !log Drop unused tables msg_resource msg_resource_links from s5 - T194663
[06:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:44] T194663: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663
[06:15:49] !log Drop unused tables msg_resource msg_resource_links from s6 - T194663
[06:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:06] !log Drop unused tables msg_resource msg_resource_links from s8 - T194663
[06:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:10] T194663: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663
[06:18:27] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:19:44] !log update analytics-in4 on cr1/cr2 eqiad to allow conf100[4-6] (new zookeeper hosts)
[06:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:58] !log Drop unused tables msg_resource msg_resource_links from s4 - T194663
[06:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:26] much better now
[06:24:05] Cc: XioNoX ---^ (updated cr1/cr2 analytics-in4 firewall rules adding 3 ips to a term)
[06:25:11] !log Drop unused tables msg_resource msg_resource_links from s7 - T194663
[06:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:16] T194663: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663
[06:29:56] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:30:06] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity]
[06:39:11] (PS4) Elukey: role::configcluster: decom zookeeper on conf1001 [puppet] - https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924)
[06:44:42] (CR) Elukey: [C: 2] role::configcluster: decom zookeeper on conf1001 [puppet] - https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[06:45:15] !log Drop unused tables msg_resource msg_resource_links from s3 codfw - T194663
[06:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:32] T194663: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663
[06:49:46] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:56:03] !log upgrading mwdebug servers to HHVM 3,18.5+dfsg-1+wmf8+deb9u1
[06:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:16] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:56:26] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:29] (PS1) Elukey: Swap zookeeper on conf1002 with conf1005 [puppet] - https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924)
[07:04:13] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11225/" [puppet] - https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[07:18:43] !log upgrading mw1240-mw1258 to HHVM 3,18.5+dfsg-1+wmf8+deb9u1
[07:18:46] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:18] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/433323
[07:23:58] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/433323 (owner: Marostegui)
[07:25:19] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/433323 (owner: Marostegui)
[07:25:54] !log Drop unused tables msg_resource msg_resource_links from s1 - T194663
[07:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:00] T194663: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663
[07:27:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086 after alter table (duration: 01m 34s)
[07:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:31] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/433324 (https://phabricator.wikimedia.org/T190148)
[07:29:46] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/433323 (owner: Marostegui)
[07:30:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/433324 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[07:32:15] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/433324 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[07:33:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 for alter table (duration: 01m 20s)
[07:33:49] !log Deploy schema change on db1094 - T191519 T188299 T190148
[07:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:55] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:55] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[07:33:56] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[07:33:56] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[07:35:47] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/433324 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[07:36:59] !log installing systemd updates from stretch SUA
[07:37:02] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[07:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:21] (CR) Jcrespo: [C: 1] db-codfw.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/433320 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:37:41] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:38:18] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/433320 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:39:36] (Merged) jenkins-bot: db-codfw.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/433320 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:41:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2075 to convert it to temporary sanitarium (duration: 01m 20s)
[07:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:30] (CR) jenkins-bot: db-codfw.php: Depool db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/433320
(https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:48:19] (PS1) Marostegui: mariadb: Convert db2075 to temp sanitarium [puppet] - https://gerrit.wikimedia.org/r/433325 (https://phabricator.wikimedia.org/T190704)
[07:51:22] (PS2) Marostegui: mariadb: Convert db2075 to temp sanitarium [puppet] - https://gerrit.wikimedia.org/r/433325 (https://phabricator.wikimedia.org/T190704)
[07:52:02] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:52:08] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11226/" [puppet] - https://gerrit.wikimedia.org/r/433325 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:52:11] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0
[07:52:46] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2687 bytes in 1.664 second response time
[07:52:55] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2702 bytes in 1.664 second response time
[07:53:02] FYI - sites seem to be down - in case that's not known :)
[07:53:31] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0
[07:53:39] <_joe_> varnent: define "sites"
[07:53:50] <_joe_> it has probably to do with ulsfo
[07:53:55] They are back - but for a few minutes it was at least Commons and enWP
[07:54:13] <_joe_> yes, see the alerts here ^^
[07:54:15] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17501 bytes in 0.467 second response time
[07:54:20] RECOVERY - LVS HTTPS IPv6 on
text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17502 bytes in 0.456 second response time
[07:54:24] <_joe_> it was the sf cache
[07:54:28] no bueno
[07:54:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:54:46] varnish availability metrics didn't fully recover
[07:54:46] <_joe_> varnent: can you tell me what IP do you get for any random wikipedia?
[07:54:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:55:29] _joe_: 198.35.26.96
[07:55:52] yeah that's ulsfo
[07:56:08] (PS1) Giuseppe Lavagetto: Depool ulsfo from traffic, having issues [dns] - https://gerrit.wikimedia.org/r/433326
[07:56:14] cr2-ulsfo logs are not very clear
[07:56:26] <_joe_> let's depool now, ask questions later?
[07:56:31] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[07:56:33] May 16 07:49:24 cr2-ulsfo MGMT: rpd[1659]: EVENT xe-1/3/0 index 154 address #0 a8.d0.e5.55.21.98
[07:56:47] <_joe_> ema: care to review?
^^
[07:56:51] that's zayo
[07:57:01] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:57:05] (CR) Ema: [C: 1] Depool ulsfo from traffic, having issues [dns] - https://gerrit.wikimedia.org/r/433326 (owner: Giuseppe Lavagetto)
[07:57:09] (CR) Giuseppe Lavagetto: [C: 2] Depool ulsfo from traffic, having issues [dns] - https://gerrit.wikimedia.org/r/433326 (owner: Giuseppe Lavagetto)
[07:57:26] You all rock :)
[07:57:29] hmmm cr1 has issues too
[07:57:41] <_joe_> !log depooled ULSFO from live traffic in dns
[07:57:42] _joe_: yes you got my +2, depool now
[07:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:46] varnent: :) thanks for the report!
[07:57:50] <_joe_> akosiaris: already done
[07:57:53] cool
[07:58:02] <_joe_> varnent: yes, thanks a lot :)
[07:58:45] Happy to help - especially when it interrupts late night Commons uploading of photos ;)
[07:58:49] <_joe_> any traffic should be drained from ulsfo in ~ 5 minutes
[08:00:01] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:00:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:01:03] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209481 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db207...
[08:01:11] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209482 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2075.codfw.wmnet'] ``` Of which those **FAILED**: ```...
[08:01:37] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209483 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db207...
[08:06:31] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:07:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[08:07:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[08:20:54] !log Stop MySQL on db1120 on s2, s4, s6 and s7 to copy its content to db2075 - T190704
[08:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:59] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[08:21:57] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209492 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2075.codfw.wmnet'] ``` and were **ALL** successful.
[08:28:17] (PS1) Marostegui: db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/433331
[08:28:35] (CR) Marostegui: [C: -2] "Wait for the transfer to be finished" [puppet] - https://gerrit.wikimedia.org/r/433331 (owner: Marostegui)
[08:37:09] Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4209501 (Deskana) @RobH Thank you!
Does their access also cover the mobile domains, such as en.m.wikipedia.org? I'm sorry for not being more explicit about that in the original...
[08:37:24] (CR) Volans: [C: 1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[08:41:10] Operations, ops-eqiad, Traffic: cp1068 memory correctable errors - https://phabricator.wikimedia.org/T194757#4207584 (ema) p:Triage>Normal
[08:41:37] Operations, Puppet, Cloud-Services, Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724#4209506 (ema) p:Triage>Normal
[08:44:30] <_joe_> win 62
[08:45:11] !log restart dbproxy1003 for upgrade
[08:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:06] !log upgrading mw1221-mw1235 to HHVM 3,18.5+dfsg-1+wmf8+deb9u1
[08:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:11] (CR) Alexandros Kosiaris: [C: 1] "Nice! A few inline comments. Thanks for this!"
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/433296 (owner: Dzahn)
[08:48:17] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/433332
[08:48:51] (Abandoned) Mholloway: Add mholloway-shell to maps-admins, kartotherian-admin, tilerator-admin [puppet] - https://gerrit.wikimedia.org/r/432393 (https://phabricator.wikimedia.org/T194404) (owner: Mholloway)
[08:50:35] (CR) Alexandros Kosiaris: [C: 1] tor: add icinga check_tcp for ORPort and DirPort [puppet] - https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) (owner: Dzahn)
[08:50:42] Operations, Reading-Infrastructure-Team-Backlog, SRE-Access-Requests, Patch-For-Review: Add Michael Holloway (Reading Infrastructure) to maps admin groups - https://phabricator.wikimedia.org/T194404#4209515 (Mholloway) Looks good! Thank you, @Dzahn!
[08:52:38] (CR) Volans: "Thanks for the review @moritzm see the replies inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[08:54:08] (CR) Giuseppe Lavagetto: [C: -1] "Small redundancy in aphlict, but this is 1:1 the patch I was about to write, thanks!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[08:54:55] !log restart dbproxy1002 for upgrade
[08:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:50] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures.
Failed resources (up to 3 shown): Package[python-ldap],Package[initramfs-tools]
[09:00:46] (PS1) Ema: varnish: use systemd::service instead of base::service_unit [puppet] - https://gerrit.wikimedia.org/r/433333 (https://phabricator.wikimedia.org/T194724)
[09:01:01] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:04:42] (CR) Alexandros Kosiaris: [C: 1] debmonitor: add server side puppettization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:06:24] (PS3) Zhuyifei1999: profile::docker::flannel: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893)
[09:06:34] (CR) Alexandros Kosiaris: [C: 1] Allow enabling microcode updates gradually [puppet] - https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[09:07:06] akosiaris: lol
[09:07:06] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/433332 (owner: Marostegui)
[09:07:12] (CR) Muehlenhoff: [C: 1] debmonitor: add server side puppettization (3 comments) [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:08:33] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/433332 (owner: Marostegui)
[09:09:51] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/433332 (owner: Marostegui)
[09:10:14] Operations, Traffic, Patch-For-Review: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4209526 (ema)
[09:11:13] (CR) Muehlenhoff: [C: 1] "Looks fine, thanks.
Let's still keep the Phab task open, though as we should still deploy tor_nagios as some point as it offers more featu" [puppet] - https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) (owner: Dzahn)
[09:16:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 after alter table (duration: 01m 21s)
[09:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:38] (PS1) Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/433334 (https://phabricator.wikimedia.org/T190148)
[09:18:48] (PS1) Fdans: Copy geoip archive to hdfs using the hdfs user [puppet] - https://gerrit.wikimedia.org/r/433335
[09:19:27] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/433334 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:20:55] (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/433334 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:22:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 for alter table (duration: 01m 21s)
[09:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:46] !log Deploy schema change on db1101:3317 - T191519 T188299 T190148
[09:22:50] (CR) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/433334 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:51] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[09:22:51] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[09:22:51] T188299: Schema change for refactored actor storage -
https://phabricator.wikimedia.org/T188299
[09:24:26] (PS5) Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962)
[09:24:53] (CR) Elukey: [C: 2] Copy geoip archive to hdfs using the hdfs user [puppet] - https://gerrit.wikimedia.org/r/433335 (owner: Fdans)
[09:25:13] !log upgrading video scalers to HHVM 3,18.5+dfsg-1+wmf8+deb9u1
[09:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:48] (CR) Jcrespo: [C: 2] mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[09:27:39] (PS1) Ema: vcl: invoke builtin vcl_hit [puppet] - https://gerrit.wikimedia.org/r/433338 (https://phabricator.wikimedia.org/T192368)
[09:38:10] (PS2) Jcrespo: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962)
[09:39:41] (PS8) Volans: debmonitor: add server side puppettization [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299)
[09:39:53] (CR) Volans: "Addressed comments" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:40:28] (CR) jerkins-bot: [V: -1] debmonitor: add server side puppettization [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:40:48] (PS5) Jakob: Prepare Lexeme config for test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250)
[09:41:30] (PS9) Volans: debmonitor: add server side puppettization [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299)
[09:41:56] (Draft1)
10Jakob: Enable WikibaseLexeme on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433147 [09:42:04] (03PS2) 10Jakob: Enable WikibaseLexeme on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433147 [09:42:42] (03PS1) 10Mark Bergsma: Add full unit test coverage of IdleConnection [debs/pybal] - 10https://gerrit.wikimedia.org/r/433341 [09:45:45] (03CR) 10Muehlenhoff: [C: 031] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:47:16] (03PS1) 10Marostegui: db-codfw.php: Specifiy sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433343 (https://phabricator.wikimedia.org/T190704) [09:47:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] mcrouter: add support for listening on the ssl port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [09:55:12] (03CR) 10DCausse: Add extra-analysis analyzers as separate plugins (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/432136 (https://phabricator.wikimedia.org/T193734) (owner: 10DCausse) [09:55:39] (03PS3) 10DCausse: Add extra-analysis analyzers as separate plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/432136 (https://phabricator.wikimedia.org/T193734) [09:55:43] (03CR) 10Alexandros Kosiaris: [C: 031] Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724) (owner: 10Giuseppe Lavagetto) [09:56:15] (03CR) 10Alexandros Kosiaris: [C: 031] Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163 (owner: 10Giuseppe Lavagetto) [09:56:55] jouncebot next [09:56:55] In 0 hour(s) and 33 minute(s): Deploy WikibaseLexeme on test.wikidata.org 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1030) [09:57:13] !log stop and reimage db2053 [09:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:05] (03CR) 10Alexandros Kosiaris: [C: 031] debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [10:01:53] (03CR) 10Alexandros Kosiaris: [C: 031] profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [10:04:10] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [10:07:07] (03PS1) 10Marostegui: sX.hosts: Move db2075 from s5 to s2,4,6,7 [software] - 10https://gerrit.wikimedia.org/r/433345 [10:08:09] (03CR) 10Marostegui: [C: 032] sX.hosts: Move db2075 from s5 to s2,4,6,7 [software] - 10https://gerrit.wikimedia.org/r/433345 (owner: 10Marostegui) [10:08:59] (03Merged) 10jenkins-bot: sX.hosts: Move db2075 from s5 to s2,4,6,7 [software] - 10https://gerrit.wikimedia.org/r/433345 (owner: 10Marostegui) [10:12:11] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433346 (https://phabricator.wikimedia.org/T193835) [10:13:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433346 (https://phabricator.wikimedia.org/T193835) (owner: 10Marostegui) [10:14:22] !log Drop unused tables msg_resource msg_resource_links from s2 - T194663 [10:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:26] T194663: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663 [10:15:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool 
db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433346 (https://phabricator.wikimedia.org/T193835) (owner: 10Marostegui) [10:15:20] (03PS1) 10Jcrespo: mariadb: Allow reimage to stretch of db2048, db2039 and db2046 [puppet] - 10https://gerrit.wikimedia.org/r/433347 [10:16:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 as it will be moved to a different rack - T193835 (duration: 01m 21s) [10:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:59] T193835: Move db1067 to row C - https://phabricator.wikimedia.org/T193835 [10:17:54] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage to stretch of db2048, db2039 and db2046 [puppet] - 10https://gerrit.wikimedia.org/r/433347 (owner: 10Jcrespo) [10:18:44] 10Operations, 10Traffic, 10media-storage: Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4209672 (10ema) [10:18:55] 10Operations, 10Traffic, 10media-storage: Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4209683 (10ema) p:05Triage>03Normal [10:21:03] (03PS1) 10Gehel: wdqs: partman config for wdqs10(09|10) [puppet] - 10https://gerrit.wikimedia.org/r/433348 (https://phabricator.wikimedia.org/T194184) [10:22:54] 10Operations, 10Traffic, 10media-storage: Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4209672 (10MoritzMuehlenhoff) We could also simply avoid X-Powered-By at the source; our PHP configs already use "expose_php=off" and for HHVM per https://github.com/facebook/hhvm/issues/... 
[10:24:16] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433346 (https://phabricator.wikimedia.org/T193835) (owner: 10Marostegui) [10:26:47] jouncebot: next [10:26:47] In 0 hour(s) and 3 minute(s): Deploy WikibaseLexeme on test.wikidata.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1030) [10:26:51] o/ [10:29:40] (03CR) 10Giuseppe Lavagetto: [C: 031] "I'm not sure why you left the declarations to still be in the single hosts; isn't this going to be applied to whole roles?" [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [10:29:56] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:30:04] addshore: It is that lovely time of the day again! You are hereby commanded to deploy Deploy WikibaseLexeme on test.wikidata.org. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1030). [10:30:40] o/ [10:31:44] 10Operations, 10Traffic, 10media-storage: Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4209672 (10Joe) The `X-Powered-By` part is actually useful for us in order to discern the source of rendering of a page - be it hhvm or php. We will use it during the HHVM => PHP7 migrat... [10:31:45] (03PS6) 10Addshore: Prepare Lexeme config for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [10:32:12] 10Operations, 10Traffic, 10media-storage: Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4209672 (10fgiunchedi) Ditto for some #thumbor headers: ``` thumbor-engine: wikimedia_thumbor.engine.imagemagick thumbor-processing-time: 413 thumbor-processing-utime: 316 thumbor-reques... 
[10:32:30] (03PS1) 10Elukey: profile::prometheus::alerts: blacklist some change-prop old cgroups [puppet] - 10https://gerrit.wikimedia.org/r/433350 [10:33:25] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: blacklist some change-prop old cgroups [puppet] - 10https://gerrit.wikimedia.org/r/433350 (owner: 10Elukey) [10:34:11] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4209746 (10ayounsi) While preparing the firewall rule for Dallas I discovered a limitation not accounted for previously. The rule that says "if ping to VIPs, then redirect to... [10:34:28] (03CR) 10Addshore: [C: 032] Prepare Lexeme config for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [10:34:42] !log stop and reimage db2046 [10:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:54] (03PS1) 10Gehel: wdqs: configure new wdqs test cluster [puppet] - 10https://gerrit.wikimedia.org/r/433351 (https://phabricator.wikimedia.org/T194184) [10:36:14] (03Merged) 10jenkins-bot: Prepare Lexeme config for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [10:40:49] (03CR) 10jenkins-bot: Prepare Lexeme config for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [10:41:37] !log uploaded linux 4.9.88-1+deb9u1~bpo8+1 to apt.wikimedia.org/jessie-wikimedia [10:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:58] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:433145|Prepare Lexeme config for test.wikidata.org]] T194250 PT 1/2 (duration: 01m 22s) [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:02] T194250: 
Prepare config for test.wikidata.org - https://phabricator.wikimedia.org/T194250 [10:47:58] !log addshore@tin Synchronized wmf-config/Wikibase.php: [[gerrit:433145|Prepare Lexeme config for test.wikidata.org]] T194250 PT 2/2 (duration: 01m 21s) [10:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:03] (03PS3) 10Addshore: Enable WikibaseLexeme on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433147 (owner: 10Jakob) [10:50:02] (03CR) 10Addshore: [C: 032] Enable WikibaseLexeme on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433147 (owner: 10Jakob) [10:51:20] (03Merged) 10jenkins-bot: Enable WikibaseLexeme on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433147 (owner: 10Jakob) [10:54:01] (03CR) 10jenkins-bot: Enable WikibaseLexeme on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433147 (owner: 10Jakob) [10:57:02] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [10:57:38] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:433147|Enable WikibaseLexeme on test.wikidata.org]] T191458 (duration: 01m 20s) [10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:42] T191458: Deploy WikibaseLexeme on test.wikidata.org - https://phabricator.wikimedia.org/T191458 [10:58:47] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms [11:01:20] !log addshore@terbium mwscript extensions/WikibaseLexeme/maintenance/createBlacklistedLexemes.php --wiki testwikidatawiki [11:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:57] jakob_WMDE: so I actually see some errors from parsoid [11:05:58] Unknown contentmodel wikibase-lexeme [11:06:03] leszek_wmde: ^^ [11:06:15] interesting [11:06:27] and then sudddenly it becase known? 
[11:06:28] I'll file a ticket and I expect we will need to follow up before wikidatawiki [11:06:34] *became [11:06:45] I guess there is some other config somewhere it needs to be in [11:07:22] ok, let's discuss this internally then [11:08:57] created https://phabricator.wikimedia.org/T194821 [11:10:17] !log rebooting multatuli for some tests [11:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:39] (03CR) 10Volans: "I agree with some of the previous comments on validation and data structure, but looks good enough to unblock this. One comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [11:16:13] (03PS3) 10Muehlenhoff: Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) [11:17:18] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Connect [11:18:32] (03CR) 10Muehlenhoff: [C: 032] Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [11:24:26] (03PS1) 10Muehlenhoff: Enable intel-microcode for multatuli [puppet] - 10https://gerrit.wikimedia.org/r/433356 [11:28:00] !log WikibaseLexeme slot done [11:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:13] (03CR) 10Muehlenhoff: [C: 032] Enable intel-microcode for multatuli [puppet] - 10https://gerrit.wikimedia.org/r/433356 (owner: 10Muehlenhoff) [11:36:25] (03PS1) 10Addshore: Configure WikibaseLexeme after Repo & Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433358 (https://phabricator.wikimedia.org/T191458) [11:45:21] (03PS1) 10Muehlenhoff: Enable Intel microcode installation for labvirt [puppet] - 10https://gerrit.wikimedia.org/r/433359 (https://phabricator.wikimedia.org/T194258) [12:07:20]
(03CR) 10Daniel Kinzler: [C: 031] "agree with intent." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [12:13:37] (03CR) 10Jcrespo: [C: 031] "+1, but merge puppet explicit changes at around the same time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433343 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [12:27:14] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [12:27:33] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [12:34:33] PROBLEM - ganeti-mond running on ganeti2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:34:43] PROBLEM - ganeti-noded running on ganeti2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [12:35:24] PROBLEM - ganeti-confd running on ganeti2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd [12:35:27] ignore this ^ [12:35:29] I am reimaging [12:35:46] jouncebot: next [12:35:46] In 0 hour(s) and 24 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1300) [12:38:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Aside from the inline comments, I am not sold on the docker build approach at all. a) I think we should be using blubber, not docker-pkg, " (036 comments) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:41:45] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#4210035 (10Aklapper) >>! 
In T174342#3790202, @Mholloway wrote: > I've reached out to Partnerships about getting in touch with Maroc and INWI for IP range updates. @Mholloway:... [12:42:41] (03PS2) 10Mark Bergsma: Handle HTTP status 302 and 303 as well as 301 [debs/pybal] - 10https://gerrit.wikimedia.org/r/430393 (https://phabricator.wikimedia.org/T102393) [12:42:43] (03PS2) 10Mark Bergsma: Add full unit test coverage of IdleConnection [debs/pybal] - 10https://gerrit.wikimedia.org/r/433341 [12:42:45] (03PS1) 10Mark Bergsma: Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] - 10https://gerrit.wikimedia.org/r/433364 [12:42:50] 10Operations, 10Cloud-Services, 10netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4210037 (10faidon) The /25 -> /24 renumbering seems fairly straightforward, but given a) IPv4's depletion (we effectively cannot get more IPv4 space from any of the RIRs), b... [12:45:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] Client self-update capability (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:49:43] RECOVERY - ganeti-noded running on ganeti2003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [12:49:53] (03CR) 10Muehlenhoff: Initial working version (031 comment) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:50:33] RECOVERY - ganeti-confd running on ganeti2003 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd [12:50:40] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#4210044 (10Aklapper) [12:50:53] RECOVERY - ganeti-mond running on ganeti2003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:54:18] 
jouncebot: next [12:54:18] In 0 hour(s) and 5 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1300) [12:56:02] !log stop and reimage db2039 (s6 master) [12:56:05] I can SWAT today (as I have a patch in it) [12:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:40] addshore: cool. [12:56:46] addshore: I'm around. [12:56:50] and kart_ I can do yours first :) [12:56:59] :) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1300). [13:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] o/ [13:00:31] kart_: already +2ed your patch :) [13:02:48] addshore: yep. Thanks. [13:02:53] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4210052 (10MaxBioHazard) usage: jsub [options...] program [args...] jsub: error: argument program: Program 'MONO_TLS_PROVIDER=btls' not found. [13:03:26] addshore: I have two more commits if that's OK, sorry for the last-minute addition [13:03:36] tgr: should be fine :) [13:03:46] as long as they are not evil!
:D [13:03:53] https://gerrit.wikimedia.org/r/#/c/433367/ and https://gerrit.wikimedia.org/r/#/c/433368/ [13:04:22] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [13:04:37] (03PS1) 10Mark Bergsma: Cleanup monitor shutdown handler (invoking stop) after run [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 [13:04:39] (03PS1) 10Mark Bergsma: Split monitor tests into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/433370 [13:04:47] cool, not evil :) [13:05:21] kart_: will yours be testable on mwdebug1002? [13:05:22] thanks! they are maint scripts, no testing needed [13:05:23] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [13:05:34] elukey: how's turnilo working out? [13:05:41] addshore: yes [13:06:24] tgr: can you add them to the deploy calendar ? [13:06:36] paravoid: atm it seems good, I am pretty sure that it will become the "default" very soon [13:07:26] !log rebooting deployment-ms-be03 for tests related to IBPB passthrough [13:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:29] super slow jenkins... [13:09:06] that isn't new :) [13:09:27] (03CR) 10Mark Bergsma: [C: 031] Cleanup monitor shutdown handler (invoking stop) after run [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 (owner: 10Mark Bergsma) [13:11:17] twentyafterfour: Look like I have one of your patches too... [13:11:21] elukey: cool :) [13:11:30] "Add CongressLookup submodule, it was not added branch config.json" [13:11:37] * addshore moans as per the SWAT rules [13:11:47] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4210055 (10Vgutierrez) >>! In T194380#4210052, @MaxBioHazard wrote: > usage: jsub [options...] program [args...] 
> jsub: error: argument program: Program 'MONO_TLS_PROVIDER=btls'... [13:12:25] addshore: done [13:12:31] tgr: thanks [13:12:42] * addshore thinks for a moment about what to do with twentyafterfour's patch [13:12:48] paravoid: we also don't have much choice, and I am pretty sure that pivot will eventually break after a Druid upgrade [13:12:57] (03CR) 10Volans: "Replies inline, patch will follow" (037 comments) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:13:04] yeah ofc [13:14:08] * addshore added a comment to https://gerrit.wikimedia.org/r/#/c/433216/ and will continue and just not sync it.. [13:14:58] (03CR) 10Mark Bergsma: [C: 031] Split monitor tests into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/433370 (owner: 10Mark Bergsma) [13:15:09] kart_: currently on mwdebug1002 [13:15:16] addshore: ok. testing. [13:15:23] tgr: justed +2ed your 2 :) [13:15:26] *just [13:17:31] !log restarting mariadb processes on dbstore2001 T194516 [13:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:36] T194516: Bug on mariadb systemd unit on stretch for multi-instance hosts - https://phabricator.wikimedia.org/T194516 [13:18:21] addshore: works fine. go ahead. [13:18:28] kart_: thanks! 
[13:19:33] syncing [13:20:20] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433372 [13:20:29] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433372 [13:20:51] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/ContentTranslation/modules/ve-cx/init/ve.init.mw.CXTarget.js: SWAT: [[gerrit:433355|Fix mistake in 84caceee that causes exceptions with MT card]] T194811 (duration: 01m 21s) [13:20:52] kart_: done, please double check :) [13:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:55] T194811: CX2: Translation or MT fails with an error - https://phabricator.wikimedia.org/T194811 [13:21:05] addshore: Thanks. Checking. [13:21:15] tgr: I'll do my config change before your core ones [13:21:25] (03CR) 10Addshore: [C: 032] Configure WikibaseLexeme after Repo & Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433358 (https://phabricator.wikimedia.org/T191458) (owner: 10Addshore) [13:22:46] (03CR) 10Ottomata: [C: 031] "Thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/433350 (owner: 10Elukey) [13:22:49] (03Merged) 10jenkins-bot: Configure WikibaseLexeme after Repo & Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433358 (https://phabricator.wikimedia.org/T191458) (owner: 10Addshore) [13:22:57] (03CR) 10Volans: "Thanks for review, answer/question inline" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [13:23:10] (03CR) 10jenkins-bot: Configure WikibaseLexeme after Repo & Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433358 (https://phabricator.wikimedia.org/T191458) (owner: 10Addshore) [13:23:14] * addshore tests on mwdebug [13:24:55] * addshore syncs [13:26:13] !log addshore@tin Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:433358|Configure WikibaseLexeme after Repo & Client]] T191458 T194250 (duration: 01m 21s) [13:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:18] T194250: Prepare config for test.wikidata.org - https://phabricator.wikimedia.org/T194250 [13:26:18] T191458: Deploy WikibaseLexeme on test.wikidata.org - https://phabricator.wikimedia.org/T191458 [13:27:03] * addshore waits for the last CI on the core patches [13:27:31] tgr: you only added the .4 patch to the calendar ! [13:27:48] oh wait, they are just both on the same line! 
[13:27:59] * addshore hasn't seen one like that before :) [13:31:28] (03PS6) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) [13:31:29] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [13:31:34] (03CR) 10Ottomata: [V: 032 C: 032] Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [13:32:08] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 18, down: 0, shutdown: 0 [13:32:38] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [13:34:52] !log addshore@tin Synchronized php-1.32.0-wmf.4/maintenance/: SWAT: [[gerrit:433368|Deduplicate archive.ar_rev_id]] (duration: 01m 31s) [13:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] ^we can probably repool ulsfo now [13:36:04] tgr: syncing yours (in bits) [13:36:52] !log addshore@tin Synchronized php-1.32.0-wmf.4/includes/installer/DatabaseUpdater.php: SWAT: [[gerrit:433368|Deduplicate archive.ar_rev_id]] (duration: 01m 21s) [13:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:54] !log restart dbstore2002 for upgrade [13:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:17] !log addshore@tin Synchronized php-1.32.0-wmf.4/autoload.php: SWAT: [[gerrit:433368|Deduplicate archive.ar_rev_id]] (duration: 01m 21s) [13:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:54] !log addshore@tin Synchronized php-1.32.0-wmf.3/maintenance/: SWAT: [[gerrit:433367|Deduplicate archive.ar_rev_id]] (duration: 01m 30s) [13:39:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:59] !log rolling restart of Kafka jumbo brokers to apply jdk.tls.namedGroups=secp256r1 https://phabricator.wikimedia.org/T182993 [13:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:27] !log addshore@tin Synchronized php-1.32.0-wmf.3/includes/installer/DatabaseUpdater.php: SWAT: [[gerrit:433367|Deduplicate archive.ar_rev_id]] (duration: 01m 21s) [13:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:52] !log addshore@tin Synchronized php-1.32.0-wmf.3/autoload.php: SWAT: [[gerrit:433367|Deduplicate archive.ar_rev_id]] (duration: 01m 21s) [13:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:00] tgr: all done [13:43:17] !log SWAT all done [13:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433372 (owner: 10Marostegui) [13:44:59] addshore: thanks! 
[13:45:08] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [13:45:16] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433372 (owner: 10Marostegui) [13:45:59] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [13:47:18] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [13:47:38] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [13:48:07] maybe not [13:48:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 after alter table (duration: 01m 21s) [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433376 (https://phabricator.wikimedia.org/T190148) [13:50:31] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433372 (owner: 10Marostegui) [13:51:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433376 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:51:59] 10Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#4210120 (10MoritzMuehlenhoff) Another case I found: ms-be2013-ms-be2021 were unable to install the systemd update that was released via stretch-updates and it turned that that stretch-updates was missing in /etc/apt/sources.... 
[13:52:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433376 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:53:58] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [13:54:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 for alter table (duration: 01m 20s) [13:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:21] !log Deploy schema change on db1098:3317 - T191519 T188299 T190148 [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:27] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [13:54:27] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [13:54:27] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [13:55:09] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [13:56:30] 10Operations, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4210133 (10chasemp) a:05chasemp>03Cmjohnson heyo -- I talked to the team about this situation yesterday and the outcome is to mimic labstore1004/5 (except in the pu... 
[13:56:45] !log Restart mysql on db1116 for testing [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:14] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433376 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:01:33] (03PS1) 10Alexandros Kosiaris: Reimage as stretch ganeti2006, ganeti2002 [puppet] - 10https://gerrit.wikimedia.org/r/433380 [14:04:22] !log stop and reimage db2048 (s1 master) [14:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:24] !log depool ms-fe1007 for asw2 move - T187962 [14:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:29] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [14:19:03] PROBLEM - Host mw2182 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:08] (03PS1) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433383 (https://phabricator.wikimedia.org/T193579) [14:19:30] !log filippo@neodymium conftool action : set/pooled=no; selector: name=ms-fe1007.eqiad.wmnet [14:19:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] Client self-update capability (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [14:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:38] (03PS2) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433383 (https://phabricator.wikimedia.org/T193579) [14:19:55] (03CR) 10Alexandros Kosiaris: [C: 032] Reimage as stretch ganeti2006, ganeti2002 [puppet] - 10https://gerrit.wikimedia.org/r/433380 (owner: 10Alexandros Kosiaris) [14:24:14] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: 
Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4210178 (10fgiunchedi) WRT ms-fe servers (1008 and 1007), please move to asw2 and reallocate to be in two different physical racks. Ditto for ms-be machines,... [14:37:46] PROBLEM - Host ms-fe1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:49] (03PS2) 10Volans: Client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) [14:40:51] (03PS2) 10Volans: CLI: use lsb_release for OS detection [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432395 [14:41:09] (03CR) 10Volans: "Addressed comment" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [14:41:36] (03PS2) 10Volans: Make model validation stronger [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432377 [14:42:40] (03CR) 10Volans: [C: 032] Make model validation stronger [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432377 (owner: 10Volans) [14:42:46] (03PS1) 10Elukey: Release 1.1.0-1 [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/433390 (https://phabricator.wikimedia.org/T194808) [14:43:06] RECOVERY - Host ms-fe1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [14:43:39] (03Merged) 10jenkins-bot: Make model validation stronger [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432377 (owner: 10Volans) [14:43:41] (03PS2) 10Elukey: Release 1.1.0-1 [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/433390 (https://phabricator.wikimedia.org/T194808) [14:43:46] (03PS3) 10Volans: Client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) [14:48:22] marostegui, jynus: Hey! Are we rebooting one of the labsdb's in 12 minutes or so? 
[14:48:33] bstorm_, o/ [14:48:41] halfak: bstorm_ is coordinating :) [14:48:48] Got it :) [14:48:51] Thanks [14:49:50] !log pool ms-fe1007 for asw2 move - T187962 [14:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:54] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [14:50:22] !log upgrading codfw job runners to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [14:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:29] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4210261 (10mobrovac) [14:53:34] Halfak and marostegui: Yup, I'm getting set up. [14:53:43] o/ [14:53:58] bstorm_: I am going to silence labsdb1004 and labsdb1005 for 2 hours [14:54:19] Ok thanks. So, labsdb1005 might alert if replication stops? [14:54:46] !log depool ms-fe1008 for asw2 move - T187962 [14:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:02] bstorm_: nope, they do not alert on replication [14:55:31] I will silence labsdb1004 only, much better [14:55:38] We are not touching 1005 today no? [14:55:47] Ok. Just checking. We shouldn't be abusing labsdb1005 today. I figured I'd wait until we found out if there are any issues on 1004. [14:56:00] Cool [14:56:08] Plus, I have to do announcements for the outage on 1005 (and need like a week of lead time) [14:56:08] do you want us to stop replication/mariadb for you? [14:56:20] If you want to do that, that would be great! [14:56:21] or you can handle that? 
[14:56:34] bstorm_: I will handle that, and upgrade mariadb too [14:56:37] I think I know what to do, but I'm all for having it done right lol [14:56:58] I would suggest us doing it this time, but you being aware of it [14:57:04] bstorm_: I am ready from my side [14:57:05] in case there is an emergency in the future [14:57:14] Great [14:57:38] we showed other cloud people in the past too [14:58:04] bstorm_: I will share with you the notes of what I have done so you can be aware [14:58:06] although literally there is only replication to stop and the service, so not precisely too complicated :-) [14:58:17] Also great [14:58:58] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4210293 (10mobrovac) [14:59:02] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3954016 (10mobrovac) 05stalled>03Open [14:59:04] marostegui: which mariadb version is running, does that need an upgrade? [14:59:08] too? [14:59:21] yep [14:59:25] from .33 to .34 [14:59:33] I think there is .35 now [14:59:45] we should be thinking about stretch upgrade some time soon [14:59:53] but not that easy with postgres [14:59:53] I was going to do a full upgrade and whatever version we have already :) [15:00:05] andrewbogott and chasemp: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for WMCS network maintenance . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1500). [15:00:18] hahaha [15:00:56] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4210310 (10MoritzMuehlenhoff) >>! In T190717#4210047, @Lea_WMDE wrote: > Hi @MoritzMuehlenhoff, we have news :) > > - The changes for mobile... 
[15:00:57] I'm checking which kernel we need to be on... [15:01:13] bstorm_: an apt full-upgrade should give you the correct one [15:01:19] That is true! :-D [15:01:21] !log Stop MySQL on db1067 - T193835 [15:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:25] T193835: Move db1067 to row C - https://phabricator.wikimedia.org/T193835 [15:01:36] Ok, so halfak: are you good to go from your end? Should I stop postgres? [15:01:44] Good to go. [15:02:14] Ok [15:02:32] marostegui, we are ready for the upgrade and DB stop [15:02:35] cool [15:02:37] doing it [15:02:43] I'll kick the reboot when you are done. [15:02:45] !log Stop MySQL on labsdb1004 [15:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] Upgrading kernel and mysql [15:03:50] I see that :) [15:04:40] All done [15:04:41] Confirmed Wikilabels is offline [15:04:43] ok [15:04:45] Feel free to reboot whenever you want [15:04:49] In that case, I'll reboot [15:04:54] But the maintenance notice is still up :) [15:05:18] I love molly-guard. lol [15:06:07] so the only real gotcha about dbs is that the server doesn't wait for a clean stop, so we ask other ops to do that manually [15:06:33] it wouldn't be the first time we have to go over a recovery because of an unclean shutdown [15:06:55] huh [15:07:11] wait.. should I have taken a backup right before the reboot? [15:07:16] so we are a bit overprotective about that :-) [15:07:21] I thought we were just restarting postgres :| [15:07:22] no, halfak [15:07:25] kk :) [15:07:27] halfak: we still have labsdb1005 [15:07:40] processes were shut down cleanly [15:07:42] I don't think halfak's stuff replicates marostegui [15:07:45] this is for example [15:07:48] I didn't realize that postgres was mirrored :D [15:07:53] ah! 
[15:07:58] It isn't :-D [15:07:59] labsdb1004 is back up [15:08:01] It could be [15:08:05] when people do rolling restarts for kernel upgrades [15:08:07] bstorm_: Am I good to start mysql again? [15:08:15] 4.9.88-1+deb9u1~bpo8+1 [15:08:18] One sec... [15:08:19] yeah [15:08:41] [bstorm@labsdb1004]:~ $ uname -r [15:08:41] 4.9.0-0.bpo.6-amd64 [15:08:50] (03PS3) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433383 (https://phabricator.wikimedia.org/T193579) [15:08:58] We have some stuff that pins things. Let me make sure that one is good. [15:09:03] cool [15:09:14] That *is* a patch upgrade [15:09:22] OK looks like wikilabels is back [15:09:31] bstorm_, am I good to take down the maintenance notice now? [15:09:34] (03CR) 10Andrew Bogott: [C: 032] openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433383 (https://phabricator.wikimedia.org/T193579) (owner: 10Andrew Bogott) [15:10:01] !log pool ms-fe1008 for asw2 move - T187962 [15:10:04] bstorm_: let me know when I can start mysql [15:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:06] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [15:10:19] halfak: just wait a moment, I want to make sure I don't need to manually change kernels in grub [15:11:41] It looks to me like we are good [15:12:12] So please start the mysql services again marostegui [15:12:19] cool [15:12:29] halfak: I think we are good [15:12:53] moritzm: could you confirm the kernel bstorm_ mentioned is good? [15:13:09] linux-image-4.9.0-0.bpo.6-amd64 4.9.88-1+deb9u1~bpo8+1 [15:14:19] I was checking around first. I think it's what we have been going to [15:14:36] Yup, it's what we have on other servers we did this to [15:14:58] bstorm_: all done [15:15:07] Great! 
[15:15:07] halfak: fyi, as soon as your setup is back up and running I'm going to break it again for network maintenance [15:15:18] Uh [15:15:22] 😂 [15:15:22] what? [15:15:27] Scheduled maintenance? [15:15:29] yep [15:15:43] Yeah. They held off on breaking the network for us to do this. [15:15:52] (03CR) 10Elukey: [C: 032] Release 1.1.0-1 [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/433390 (https://phabricator.wikimedia.org/T194808) (owner: 10Elukey) [15:16:03] Thank you marostegui and halfak!! [15:16:07] I think we are good. [15:16:08] halfak: it should be pretty quick, maybe 5-10 [15:16:10] bstorm_: you are welcome! [15:16:25] bstorm_: ack, that's fine [15:16:31] andrewbogott, this is likely to take the whole service down for a bit, right? [15:16:41] So no point in putting the maintenance notice back up since no one will see it? [15:16:47] halfak: correct [15:16:52] OK w/e then :) [15:17:07] !Log upload burrow 1.1.0 to stretch|jessie-wikimedia [15:17:10] !log labsdb1004 reboot and upgrade kernel to 4.9.0-0.bpo.6-amd64 [15:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:21] argh capital L :P [15:17:27] !log upload burrow 1.1.0 to stretch|jessie-wikimedia [15:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:49] !log upgrade burrow from 1.0.0 to 1.1.0 on kafkamon* hosts [15:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:55] thanks moritzm [15:17:56] bstorm_: https://phabricator.wikimedia.org/T189115#4080358 <-- according to this, `4.9.0-0.bpo.6-amd64` is the good one for jessie [15:18:11] Yup. 
I checked servers that were already done and bugged other people :) [15:18:21] I had the notes on my now-dead computer previously [15:18:36] wait, moritzm mentioned a different version in #-cloud-admin [15:18:47] !log move ms-be1024 for asw2-c-eqiad - T187962 [15:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:51] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [15:29:06] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of db108* hosts [puppet] - 10https://gerrit.wikimedia.org/r/433402 (owner: 10Jcrespo) [15:29:26] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433401 (owner: 10Jcrespo) [15:31:03] (03Merged) 10jenkins-bot: mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433401 (owner: 10Jcrespo) [15:31:17] (03CR) 10jenkins-bot: mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433401 (owner: 10Jcrespo) [15:31:23] (03CR) 10Ema: "> I'm not sure why you left the declarations to still be in the" [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [15:31:38] (03PS3) 10Elukey: role::configcluster: fix zookeeper role config [puppet] - 10https://gerrit.wikimedia.org/r/433400 [15:33:20] (03PS4) 10Elukey: role::configcluster: fix zookeeper role config [puppet] - 10https://gerrit.wikimedia.org/r/433400 [15:34:23] (03CR) 10Mobrovac: [C: 031] "GTG now, works in Beta" [puppet] - 10https://gerrit.wikimedia.org/r/432347 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [15:35:00] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433404 [15:35:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1088 (duration: 03m 26s) [15:35:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:15] !log move ms-be1025 for asw2-c-eqiad - T187962 [15:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:19] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [15:36:50] (03PS1) 10Ema: Revert "Depool ulsfo from traffic, having issues" [dns] - 10https://gerrit.wikimedia.org/r/433405 [15:37:06] (03PS3) 10Giuseppe Lavagetto: Proton: Add role and profile [puppet] - 10https://gerrit.wikimedia.org/r/432347 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [15:37:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Proton: Add role and profile [puppet] - 10https://gerrit.wikimedia.org/r/432347 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [15:38:01] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11230/" [puppet] - 10https://gerrit.wikimedia.org/r/433400 (owner: 10Elukey) [15:38:08] (03PS5) 10Elukey: role::configcluster: fix zookeeper role config [puppet] - 10https://gerrit.wikimedia.org/r/433400 [15:38:32] (03PS3) 10Rduran: [WIP] Refactor code in transfer.py [puppet] - 10https://gerrit.wikimedia.org/r/432569 (https://phabricator.wikimedia.org/T156462) [15:38:44] (03CR) 10Elukey: [C: 032] role::configcluster: fix zookeeper role config [puppet] - 10https://gerrit.wikimedia.org/r/433400 (owner: 10Elukey) [15:40:16] RECOVERY - Host mw2182 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [15:43:23] (03CR) 10Ema: [C: 032] Revert "Depool ulsfo from traffic, having issues" [dns] - 10https://gerrit.wikimedia.org/r/433405 (owner: 10Ema) [15:43:29] (03PS2) 10Ema: Revert "Depool ulsfo from traffic, having issues" [dns] - 10https://gerrit.wikimedia.org/r/433405 [15:49:12] !log move ms-be1034 for asw2-c-eqiad - T187962 [15:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:19] T187962: Rack/cable/configure asw2-c-eqiad switch stack - 
https://phabricator.wikimedia.org/T187962 [15:50:04] !log stop and reimage db1088 [15:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] !log upgrading kernel, microcode and rebooting labnet1002 [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:46] PROBLEM - Host ms-be1034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:02] RECOVERY - Host ms-be1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [15:59:32] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[proton/deploy] [16:00:32] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[proton/deploy] [16:01:47] <_joe_> those will recover on the next run ^^ [16:03:11] <_joe_> !log installing cergen 0.2.2 on the puppetmaster frontends [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:26] !log move ms-be1035 for asw2-c-eqiad - T187962 [16:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:31] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [16:10:57] PROBLEM - Host ms-be1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:15:42] (03PS1) 10Cmjohnson: New production IP db1067 [dns] - 10https://gerrit.wikimedia.org/r/433416 (https://phabricator.wikimedia.org/T193835) [16:16:17] RECOVERY - Host ms-be1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.76 ms [16:18:02] !log move ms-be1036 for asw2-c-eqiad - T187962 [16:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:07] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [16:18:28] !log Power off db1067 for rack move - 
T193835 [16:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:32] T193835: Move db1067 to row C - https://phabricator.wikimedia.org/T193835 [16:19:58] (03CR) 10Marostegui: [C: 032] New production IP db1067 [dns] - 10https://gerrit.wikimedia.org/r/433416 (https://phabricator.wikimedia.org/T193835) (owner: 10Cmjohnson) [16:20:48] HI, labels.wmflabs.org no work. Is maintenance in progress or is done? [16:21:17] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db1067 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433417 (https://phabricator.wikimedia.org/T193835) [16:21:27] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:22:30] !log upgrade burrow on kafkamon1001 from 1.0 to 1.1 [16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:06] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db1067 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433417 (https://phabricator.wikimedia.org/T193835) (owner: 10Marostegui) [16:23:38] <_joe_> Zoranzoki21: works for me [16:23:47] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [16:24:37] PROBLEM - Host ms-be1036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:24:53] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db1067 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433417 (https://phabricator.wikimedia.org/T193835) (owner: 10Marostegui) [16:25:15] Zoranzoki21: the network maintenance is done [16:25:45] But I can not to load new work list [16:29:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db1067 IP - T193835 (duration: 01m 17s) [16:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:05] T193835: Move db1067 to row C - https://phabricator.wikimedia.org/T193835 [16:29:36] (03CR) 10jenkins-bot: 
db-eqiad,db-codfw.php: Change db1067 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433417 (https://phabricator.wikimedia.org/T193835) (owner: 10Marostegui) [16:30:01] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:30:19] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433404 (owner: 10Jcrespo) [16:30:24] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433404 [16:30:30] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db1067 IP - T193835 (duration: 01m 21s) [16:30:31] (03PS4) 10Marostegui: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: 10Jcrespo) [16:30:32] RECOVERY - Host ms-be1036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [16:30:40] Hmmm [16:30:49] I cleaned cache memory of browser [16:30:52] (03PS3) 10Volans: Initial working version [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) [16:31:11] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:30] (03CR) 10Volans: "Some comment addressed, the other are pending agreement." 
[software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [16:32:22] PROBLEM - Host db1067.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:32:28] ^ expected [16:33:35] (03PS1) 10Milimetric: Use standard logging approach similar to sqoop job [puppet] - 10https://gerrit.wikimedia.org/r/433420 [16:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:48] T193835: Move db1067 to row C - https://phabricator.wikimedia.org/T193835 [16:35:37] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433404 (owner: 10Jcrespo) [16:36:41] (03PS2) 10Giuseppe Lavagetto: mcrouter: add support for listening on the ssl port [puppet] - 10https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) [16:37:41] RECOVERY - Host db1067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [16:38:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 with low weight (duration: 01m 21s) [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:41] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:41] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:41] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:42] PROBLEM - Host cp1051 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:42] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:42] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:42] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:51] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [16:41:51] PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:11] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:21] PROBLEM - Host cp1047 is DOWN: PING 
CRITICAL - Packet loss = 100% [16:42:44] XioNoX: ^ known ? [16:42:51] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:10] wtf [16:43:21] PROBLEM - Juniper alarms on asw-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [16:43:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 [16:43:41] I cannot ssh into a random one of them, seems real [16:43:45] thats all c8 [16:44:21] RECOVERY - Host cp1049 is UP: PING WARNING - Packet loss = 61%, RTA = 0.20 ms [16:44:21] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [16:44:21] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:44:21] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [16:44:21] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:44:22] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:44:22] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [16:44:23] RECOVERY - Host cp1051 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [16:44:23] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:44:24] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:44:31] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:44:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [16:44:51] RECOVERY - Host cp1099 is UP: PING WARNING - Packet loss = 54%, RTA = 0.32 ms [16:45:47] looks like relatively limited impact in 503s, from logstash [16:46:43] What I should do so I can resume working on labels.wmflabs.org [16:46:57] looks like asw-c8 pooped itself and then came back, looking at the logs [16:46:57] godog: I'm wondering if the other hosts on C8 were affected too but icinga didn't alarm or 
what [16:46:57] It's best to ask in #wikimedia-cloud [16:47:10] Who to ask? I? [16:47:49] If you ask there, someone will be around shortly to answer your question there if you use !help. [16:47:57] paladox: OK, thanks [16:48:11] volans: yeah it is odd only cp machines, perhaps only 10g interfaces were affected [16:48:24] lvses don't have 10g? [16:48:52] paladox: Done, tnx [16:49:04] they do afaik yeah [16:49:17] cmjohnson: Major alarm set, FPC 8 PEM 1 is not powered for asw-c8 [16:50:01] it was C8? [16:50:07] yeah [16:50:17] let me check other hosts there [16:50:17] HASSISD_FRU_OFFLINE_NOTICE: Taking FPC 8 offline: Removal [16:50:22] okay [16:50:26] (03PS3) 10Dzahn: tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) [16:50:31] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:21] there is only cps, lvs, ms and mc there [16:51:22] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:43] fpc8: [16:51:43] ----------------------------------------------------- [16:51:43] Current time: 2018-05-16 16:51:29 UTC [16:51:43] System booted: 2018-05-16 16:41:10 UTC (00:10:19 ago) [16:52:08] I don't think the lvs's do anything [16:52:25] cmjohnson: is it possible that FPC8 had loose power cables and got bumped? [16:52:31] on a CP host it got bnx2x 0000:01:00.0 eth0: NIC Link is Down, just to confirm the current theory [16:52:45] it is possible it looks up from my end but checking the power cables [16:53:04] yeah, it's back up [16:53:38] volans: did you get that from dmesg? [16:53:44] jynus: yes [16:53:52] nothing on a random ms-be [16:53:59] on the same rack [16:54:12] so only 10G as you suggested? 
[16:54:13] jynus: the ms-be just got moved to different racks [16:54:17] oh [16:54:33] let me try with an mc [16:54:34] only the cp servers are prod servers in row C8 [16:54:55] mc are to be decommissioned [16:55:00] I was just trying to see that error, independently of productioness [16:55:57] yeah, 2 flaps on an lvs [17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1700). [17:00:05] subbu: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:07] (03PS1) 10Ottomata: Enable webrequest deletion [puppet] - 10https://gerrit.wikimedia.org/r/433425 [17:00:53] o/ [17:04:40] XioNoX: the power cables were secure and not near anything I was working on. I believe it's a 4500 in the rack, I have a spare [17:04:57] or we can rush the move of 10G servers [17:05:06] cmjohnson: the switch is stable [17:05:41] but it came back in a funky state [17:06:33] which would need a re-install of the OS to be fixed :( [17:07:01] but no need to rush as it looks stable [17:07:53] okay! [17:08:17] cmjohnson: after ensuring PEM0 is working fine, we should start by replacing PEM1 just in case [17:11:32] any swatters around? 
[17:12:53] * hoo can jump in, if there's need [17:16:10] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:16:15] !log milimetric@tin Started deploy [analytics/refinery@15be6ae]: Fix drop partitions script [17:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:11] RECOVERY - Host labnet1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:20:43] !log milimetric@tin Started deploy [analytics/refinery@a205447]: Fix drop partitions script [17:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:15] hoo ... that would be helpful .. i have one config patch in the swat window. [17:22:19] Can do :) [17:22:51] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:22:55] ty :) [17:23:01] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:23:11] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:23:50] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:24:00] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:24:00] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:24:23] (03CR) 10Hoo man: [C: 032] Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432621 (https://phabricator.wikimedia.org/T193685) (owner: 10Subramanya Sastry) [17:24:58] subbu: Does testing this on mwdebug100(1|2) make sense? Or shall I go straight ahead? [17:25:16] let me verify on mwdebug1002 sure. [17:25:32] it will be a quick verification to make sure tidy has been replaced with remex. [17:25:41] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:26:30] (03PS5) 10Hoo man: Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432621 (https://phabricator.wikimedia.org/T193685) (owner: 10Subramanya Sastry) [17:26:50] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:16] (03CR) 10Hoo man: [C: 032] Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432621 (https://phabricator.wikimedia.org/T193685) (owner: 10Subramanya Sastry) [17:27:23] had to rebase [17:27:31] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:40] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:51] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:51] k [17:28:45] (03Merged) 10jenkins-bot: Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432621 (https://phabricator.wikimedia.org/T193685) (owner: 10Subramanya Sastry) [17:29:32] subbu: Should be live on mwdebug1002 [17:29:34] (03CR) 10jenkins-bot: Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432621 (https://phabricator.wikimedia.org/T193685) (owner: 10Subramanya Sastry) [17:30:05] ok. let me verify. [17:30:49] hoo, lgtm. as long as you don't see any errors in logs, good to go. [17:32:44] Nothing related to this, so here we go! [17:33:54] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable RemexHtml on wikis with < 100 ns0 errors in high priority cats (T193685) (duration: 01m 22s) [17:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:59] T193685: Enable RemexHTML on all wikis with < 100 linter errors in all high priority linter categories in ns0 (main namespace) - https://phabricator.wikimedia.org/T193685 [17:35:55] !log milimetric@tin Finished deploy [analytics/refinery@a205447]: Fix drop partitions script (duration: 15m 12s) [17:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:37] (03CR) 10Mforns: [C: 031] Use standard logging approach similar to sqoop job [puppet] - 10https://gerrit.wikimedia.org/r/433420 (owner: 10Milimetric) [17:36:45] hoo thanks! 
[17:38:05] (03CR) 10Dzahn: [C: 032] tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) (owner: 10Dzahn) [17:39:11] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:39:20] PROBLEM - Host rdb2004 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:30] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:30] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:42:17] RECOVERY - Host labnet1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:44:48] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:46:48] (03PS2) 10Ottomata: Use standard logging approach similar to sqoop job [puppet] - 10https://gerrit.wikimedia.org/r/433420 (owner: 10Milimetric) [17:47:09] (03PS1) 10Andrew Bogott: labvirt pool: fix a typo in a comment about pooling/depooling [puppet] - 10https://gerrit.wikimedia.org/r/433429 [17:48:02] (03CR) 10Andrew Bogott: [C: 032] labvirt pool: fix a typo in a comment about pooling/depooling [puppet] - 10https://gerrit.wikimedia.org/r/433429 (owner: 10Andrew Bogott) [17:48:47] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:37] RECOVERY - Host labnet1002 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [17:50:11] (03CR) 10Ottomata: [C: 032] Use standard logging approach similar to sqoop job [puppet] - 10https://gerrit.wikimedia.org/r/433420 (owner: 10Milimetric) [17:50:15] (03PS3) 10Ottomata: Use standard logging approach similar to sqoop job [puppet] - 10https://gerrit.wikimedia.org/r/433420 (owner: 10Milimetric) [17:50:17] (03CR) 10Ottomata: [V: 032 C: 032] Use standard logging approach similar to sqoop job [puppet] - 10https://gerrit.wikimedia.org/r/433420 (owner: 
10Milimetric) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1800) [18:13:41] (03PS4) 10Andrew Bogott: wikitech: remove OpenStackManager private settings [puppet] - 10https://gerrit.wikimedia.org/r/432703 (https://phabricator.wikimedia.org/T161553) [18:13:43] (03PS1) 10Andrew Bogott: labvirt partman: Move labvirt1019-1022 to the standard labvirt partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/433432 (https://phabricator.wikimedia.org/T193264) [18:15:25] (03PS1) 10Milimetric: Shorten cron commands by using shorthand flags [puppet] - 10https://gerrit.wikimedia.org/r/433433 [18:32:28] (03CR) 10Ottomata: [C: 032] Shorten cron commands by using shorthand flags [puppet] - 10https://gerrit.wikimedia.org/r/433433 (owner: 10Milimetric) [18:48:49] go to a Gerrit change, click "topic", change topic to something containing a ":". try to download and amend to it. get "fatal .. is not a valid branch name" . heh. i gotta remember not using ":" [18:50:35] (03PS4) 10Dzahn: phabricator: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) [18:50:43] (03CR) 10Dzahn: phabricator: base::service_unit -> systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:55:10] (03PS2) 10Andrew Bogott: labvirt partman: Move labvirt1019-1022 to the standard labvirt partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/433432 (https://phabricator.wikimedia.org/T193264) [18:56:17] (03CR) 10Andrew Bogott: [C: 032] labvirt partman: Move labvirt1019-1022 to the standard labvirt partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/433432 (https://phabricator.wikimedia.org/T193264) (owner: 10Andrew Bogott) [19:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T1900). [19:05:28] (03CR) 10RobH: [C: 031] wdqs: partman config for wdqs10(09|10) [puppet] - 10https://gerrit.wikimedia.org/r/433348 (https://phabricator.wikimedia.org/T194184) (owner: 10Gehel) [19:06:42] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:52] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:52] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:02] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:02] PROBLEM - Host cp1051 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:02] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:03] (03CR) 10RobH: [C: 031] wdqs: configure new wdqs test cluster [puppet] - 10https://gerrit.wikimedia.org/r/433351 (https://phabricator.wikimedia.org/T194184) (owner: 10Gehel) [19:07:12] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:13] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:14] PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:14] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:22] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:22] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 [19:09:22] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [19:09:22] RECOVERY - Host cp1051 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:09:22] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [19:09:22] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:09:22] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [19:09:23] RECOVERY - Host cp1053 is UP: PING OK 
- Packet loss = 0%, RTA = 0.20 ms [19:09:23] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:09:24] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:09:24] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [19:09:32] RECOVERY - Host cp1049 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [19:09:32] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:09:52] RECOVERY - Host cp1099 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [19:10:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [19:13:32] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:35] (03CR) 10Smalyshev: wdqs: configure new wdqs test cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/433351 (https://phabricator.wikimedia.org/T194184) (owner: 10Gehel) [19:16:52] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:13] !log twentyafterfour@tin Synchronized php-1.32.0-wmf.4/extensions/CongressLookup: sync CongressLookup extension on wmf.4 refs T191050 (duration: 01m 22s) [19:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:19] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [19:33:25] (03PS1) 10Urbanecm: Enable $wgUseRCPatrol on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433440 (https://phabricator.wikimedia.org/T194389) [19:37:25] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:38:50] !log The train for 1.32.0-wmf.4 is blocked by fatals in Echo extension. 
See T194848 [19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:55] T194848: Fatal error: $this is null in /srv/mediawiki/php-1.32.0-wmf.4/extensions/Echo/includes/model/Event.php on line 345 - https://phabricator.wikimedia.org/T194848 [19:40:24] (03Abandoned) 10Urbanecm: New throttle rule for University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433138 (https://phabricator.wikimedia.org/T194666) (owner: 10Urbanecm) [19:41:11] stephanebisson: Can you take a look at T194848 [19:42:26] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [19:48:28] jouncebot: next [19:48:28] In 0 hour(s) and 11 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T2000) [19:57:28] @robh hey, can I pm please? [19:58:05] 08Warning Alert for device asw-c-eqiad.mgmt.eqiad.wmnet - Juniper environment status [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T2000). 
[20:06:14] pm sent =] [20:13:05] 08̶W̶a̶r̶n̶i̶n̶g Device asw-c-eqiad.mgmt.eqiad.wmnet recovered from Juniper environment status [20:13:38] !log Force WriteBack policy on db1067 - T194852 [20:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:43] T194852: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852 [20:28:08] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:17] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:17] PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:17] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:17] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:27] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:27] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:27] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:27] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:27] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:37] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:37] PROBLEM - Host cp1051 is DOWN: PING CRITICAL - Packet loss = 100% [20:30:52] hey, what's going on? [20:30:54] is this an outage? 
[20:30:57] RECOVERY - Host cp1047 is UP: PING WARNING - Packet loss = 61%, RTA = 131.99 ms [20:30:57] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [20:30:57] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:30:58] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [20:30:58] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:30:58] RECOVERY - Host cp1049 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:30:58] RECOVERY - Host cp1054 is UP: PING WARNING - Packet loss = 66%, RTA = 0.20 ms [20:30:59] RECOVERY - Host cp1052 is UP: PING WARNING - Packet loss = 73%, RTA = 0.31 ms [20:31:04] XioNoX, bblack ^^^ [20:31:07] RECOVERY - Host cp1051 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [20:31:07] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [20:31:07] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [20:31:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 [20:31:10] also ema, vgutierrez ^^^ [20:31:27] RECOVERY - Host cp1099 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:31:31] network issue I suppose? and why isn't that causing high-level issues and paging? [20:32:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [20:32:49] paravoid: asw-c8 is borked, it's the 2nd time it rebooted [20:32:58] ouch [20:33:23] how come this isn't cascading into an LVS failure / elevated 500s? [20:33:30] pybal just depooled those or what? 
[20:34:25] 1st time I thought it was Chris bumping into a power cord, but PEM1 seems to be in a bad state and PEM0 isn't holding [20:34:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:34:36] oh there we go heh [20:34:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:35:32] ok, so what's the plan? [20:36:09] fyi bblack is in the air [20:36:12] oops [20:36:29] checking if cmjohnson is still in the DC, but probably not [20:37:02] is asw2-c8 ready and well-connected? [20:37:07] 08Warning Alert for device asw-c-eqiad.mgmt.eqiad.wmnet - Juniper environment status [20:38:07] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:38:16] XioNoX: ^ [20:38:51] paravoid: asw-c8 is the 10G switch, we moved ms-be/ms-fe from there to the other (new) 10G racks (and asw2) earlier today, the only prod servers on that rack are cp servers, and we planned on not moving them as they will be replaced before end of Q [20:38:57] System booted: 2018-05-16 20:27:51 UTC (00:09:10 ago) [20:40:57] oh [20:41:12] did you discuss this earlier? [20:41:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:41:43] I have no idea what our total cp* capacity is, can we depool those servers now?
[20:41:56] as this will more than likely happen again :/ [20:42:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:42:12] although I guess we're lucky it's happening on a mostly empty rack :) [20:42:39] yeah, old stuff is going to die at one point, but never at a good time [20:43:08] 08̶W̶a̶r̶n̶i̶n̶g Device asw-c-eqiad.mgmt.eqiad.wmnet recovered from Juniper environment status [20:43:43] May 16 20:27:11 fpc 8: power budget received, old = 0 new = 0 [20:43:43] May 16 20:27:11 send: red alarm set, device Power Supply 41, reason FPC 8 PEM 1 is not powered [20:43:43] May 16 20:27:16 FPC 8 removed [20:44:14] My current guess is that PEM0 is reporting as healthy but is not, and PEM1 started flapping [20:44:33] causing the FPC to randomly reboot [20:44:42] I don't know about cp* capacity [20:46:56] May 16 16:40:35 FPC 8 removed [20:46:56] May 16 19:05:43 FPC 8 removed [20:46:56] May 16 20:27:16 FPC 8 removed [20:50:45] 2/4 of misc, 4/8 of text and 4/11 of upload are there [20:50:47] we need to fix this [20:51:18] we can get chris there, but what were you thinking?
[20:51:31] replacing PEM1, then PEM0 [20:52:24] ok [20:52:26] call Chris [20:52:53] ok [20:53:09] still not late at all, so he can drop by the DC and do this [20:53:15] (03PS5) 10Andrew Bogott: wikitech: remove OpenStackManager private settings [puppet] - 10https://gerrit.wikimedia.org/r/432703 (https://phabricator.wikimedia.org/T161553) [20:53:17] (03PS1) 10Andrew Bogott: horizon: Update policy rules for nova policies [puppet] - 10https://gerrit.wikimedia.org/r/433445 (https://phabricator.wikimedia.org/T192179) [20:54:36] (03CR) 10Andrew Bogott: [C: 032] horizon: Update policy rules for nova policies [puppet] - 10https://gerrit.wikimedia.org/r/433445 (https://phabricator.wikimedia.org/T192179) (owner: 10Andrew Bogott) [20:54:37] no answer [20:54:53] sending a text [20:58:51] XioNoX: still nothing? [20:59:17] didn't see the 2nd phone number, trying it [21:00:42] no answer, sent a text as well [21:00:47] (03PS1) 10Andrew Bogott: Revert "horizon: Update policy rules for nova policies" [puppet] - 10https://gerrit.wikimedia.org/r/433446 [21:01:18] (03CR) 10Andrew Bogott: [C: 032] Revert "horizon: Update policy rules for nova policies" [puppet] - 10https://gerrit.wikimedia.org/r/433446 (owner: 10Andrew Bogott) [21:03:57] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:38:01] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:01] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:01] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:10] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:10] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:10] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:10] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:10] PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:11] PROBLEM - Host 
cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:30] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:31] PROBLEM - Host cp1051 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:31] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:50] :( [21:40:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 [21:40:40] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [21:40:40] RECOVERY - Host cp1051 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [21:40:40] RECOVERY - Host cp1045 is UP: PING WARNING - Packet loss = 93%, RTA = 0.56 ms [21:40:40] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 16%, RTA = 0.35 ms [21:40:41] RECOVERY - Host cp1047 is UP: PING WARNING - Packet loss = 28%, RTA = 0.43 ms [21:40:41] RECOVERY - Host cp1046 is UP: PING WARNING - Packet loss = 28%, RTA = 0.90 ms [21:40:41] RECOVERY - Host cp1050 is UP: PING WARNING - Packet loss = 28%, RTA = 0.40 ms [21:40:50] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:40:50] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [21:40:50] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [21:40:50] RECOVERY - Host cp1049 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:41:10] RECOVERY - Host cp1099 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [21:41:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [21:44:10] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 3 minutes ago with 14 failures. 
Failed resources (up to 3 shown): File[/usr/share/varnish/tests],File[/etc/mtail/varnishxcps.mtail],File[/etc/varnishmtail-backend/varnishbackend.mtail],File[/etc/mtail/varnishreqstats.mtail] [21:44:20] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 3 minutes ago with 14 failures. Failed resources (up to 3 shown): File[/etc/conftool/schema.yaml],File[/etc/conftool/json-schema/],File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled] [21:46:08] 08Warning Alert for device asw-c-eqiad.mgmt.eqiad.wmnet - Juniper environment status [21:53:08] 08̶W̶a̶r̶n̶i̶n̶g Device asw-c-eqiad.mgmt.eqiad.wmnet recovered from Juniper environment status [21:58:44] !log swapping PEM 1 asw-c8-eqiad [21:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:07] (03PS1) 10Pnorman: Fix some issues found with process-osm-data [puppet] - 10https://gerrit.wikimedia.org/r/433488 (https://phabricator.wikimedia.org/T190237) [22:10:30] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:10:40] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:20:00] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/11231/" [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [22:36:32] (03PS5) 10Dzahn: phabricator: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) [22:38:03] (03CR) 10Dzahn: [C: 032] phabricator: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [22:42:00] RECOVERY - Check systemd state on phab2001 is OK: OK - running: The system is fully operational [22:45:59] (03CR) 10Dzahn: [C: 032] "puppet ran was no-op, restarting ssh-phab on 2001 
tested, paladox tested in labs too.. no difference" [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [22:55:15] (03PS1) 10Dzahn: otrs: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433491 (https://phabricator.wikimedia.org/T194724) [22:58:57] (03PS2) 10Dzahn: otrs: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433491 (https://phabricator.wikimedia.org/T194724) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180516T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:48] (03PS3) 10Krinkle: mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) [23:01:11] (03PS4) 10Krinkle: mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) [23:01:37] (03CR) 10jerkins-bot: [V: 04-1] mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) (owner: 10Krinkle) [23:05:01] (03PS1) 10Dzahn: librenms: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433495 (https://phabricator.wikimedia.org/T194724) [23:20:23] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11232/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/433495 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [23:22:13] (03CR) 10Dzahn: [C: 032] "i'm undecided whether i want to delete librenms-syslog.upstart.erb or keep it even though it's not used now" [puppet] - 10https://gerrit.wikimedia.org/r/433495 
(https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [23:23:40] (03CR) 10Dzahn: [C: 032] "noop on netmon1002/netmon2001 - service untouched" [puppet] - 10https://gerrit.wikimedia.org/r/433495 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [23:50:36] (03PS2) 10Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files [puppet] - 10https://gerrit.wikimedia.org/r/433296 [23:52:00] (03CR) 10Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433296 (owner: 10Dzahn) [23:56:36] (03PS3) 10Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files [puppet] - 10https://gerrit.wikimedia.org/r/433296