[01:16:45] (03PS7) 10Paladox: WIP: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/435327 [01:30:11] (03PS1) 10Alex Monk: Stop arguing with puppet over empty LVS_SERVICE_IPS [debs/wikimedia-lvs-realserver] - 10https://gerrit.wikimedia.org/r/435719 [02:12:42] (03PS8) 10Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 [03:14:18] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.4) (duration: 14m 30s) [03:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 934.93 seconds [04:05:46] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 42.39 seconds [05:18:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435723 (https://phabricator.wikimedia.org/T190148) [05:20:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435723 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:21:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435723 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:21:29] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435723 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:23:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 for alter table (duration: 01m 24s) [05:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:37] !log Deploy schema change on db1106 with replication, this will generate lag on labs - T190148 T191519 T188299 [05:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:41] T191519: Schema change for 
rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:23:41] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:23:42] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:25:19] !log Stop MySQL on db2075 to copy its content to db2095 - T190704 [05:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:23] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:38:05] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4235415 (10EddieGP) [05:38:11] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10media-storage, 10Patch-For-Review: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#4235414 (10EddieGP) 05Open>03Resolved [06:01:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.101 second response time [06:02:37] <_joe_> who doesn't love a wikidata alert in the morning [06:11:55] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 0.105 second response time [06:28:15] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints] [06:32:35] PROBLEM - puppet last run on labvirt1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:36:51] !log reimage druid1003 to Debian Stretch (Analytics cluster, backend for Pivot/Turnilo) [06:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:39] !log installing ruby-loofah security updates [06:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:55] RECOVERY - puppet last run on labvirt1021 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:35] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:10:28] !log uploaded hhvm-wikidiff2 1.7.0 (source package name php-wikidiff2) to apt.wikimedia.org [07:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:40] 10Operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#4235483 (10Aklapper) [07:12:46] 10Operations, 10Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#4235484 (10Aklapper) [07:12:53] 10Operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor throughput towards some destinations - https://phabricator.wikimedia.org/T120425#4235481 (10Aklapper) 05stalled>03declined Unfortunately closing this report as no further information has been provided and as the... 
[07:19:50] (03PS1) 10Elukey: Override druid1003's zookeeper version after reimage [puppet] - 10https://gerrit.wikimedia.org/r/435727 (https://phabricator.wikimedia.org/T192636) [07:20:21] (03CR) 10Elukey: [C: 032] Override druid1003's zookeeper version after reimage [puppet] - 10https://gerrit.wikimedia.org/r/435727 (https://phabricator.wikimedia.org/T192636) (owner: 10Elukey) [07:21:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435728 [07:23:37] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435728 (owner: 10Marostegui) [07:24:49] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435728 (owner: 10Marostegui) [07:26:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435729 (https://phabricator.wikimedia.org/T190148) [07:27:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 after alter table (duration: 01m 22s) [07:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435729 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [07:29:04] !log upgrading mw1238-mw1258 to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout) [07:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435729 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [07:29:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435728 (owner: 10Marostegui) [07:31:15] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Depool db1083 for alter table (duration: 01m 20s) [07:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:32] !log Deploy schema change on db1083 - T190148 T191519 T188299 [07:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:38] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [07:31:38] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [07:31:38] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [07:34:56] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435729 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [07:40:47] !log enable backup tunnel routing between cr2-ulsfo and cr1-eqdfw - T195584 [07:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:48] 10Operations, 10ops-eqiad: mw1280: CPU error - https://phabricator.wikimedia.org/T195734#4235522 (10MoritzMuehlenhoff) [07:49:30] ACKNOWLEDGEMENT - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T195734 [08:10:50] (03PS1) 10Ladsgroup: ores: Install hunspell-bs on ores nodes [puppet] - 10https://gerrit.wikimedia.org/r/435733 (https://phabricator.wikimedia.org/T194876) [08:27:56] !log upgrading mw1221-mw1235 to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout) [08:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:08] marostegui: thanks for the indexes :) [08:29:21] :) [08:29:56] (03CR) 10Volans: "Replied to comments inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: 10Volans) [08:33:32] marostegui: I'll be doing a bunch of deployed in an hour or so to undo all the stuff we did [08:33:38] *deploys 
[08:34:01] will turn everything on one by one and monitor [08:34:17] addshore: cool! I will only need to revert a db-eqiad.php change in a bit, I will ping you so we can coordinate the deployments [08:34:22] ack! [08:35:05] marostegui: whats the most realtime data for monitoring the number of connections on dbs on s8? :0 [08:35:20] addshore: Probably grafana [08:35:35] whats the dashboard called? I'm searching for DB and Database but failing.... [08:36:43] addshore: probably you might want to look at: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-24h&to=now [08:36:56] aaaah, mysql! [08:37:28] And also pick a couple of slaves and check: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104 [08:37:54] Either processlist, connections, queries per second and db traffic are good indicators [08:39:02] 10Operations, 10Maps-Sprint: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4235633 (10Gehel) p:05Triage>03Normal [08:44:26] (03PS1) 10Marostegui: s*.hosts: Add db2095 to s2,s4,s6,s7 [software] - 10https://gerrit.wikimedia.org/r/435737 (https://phabricator.wikimedia.org/T190704) [08:45:30] (03CR) 10Marostegui: [C: 032] s*.hosts: Add db2095 to s2,s4,s6,s7 [software] - 10https://gerrit.wikimedia.org/r/435737 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [08:45:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1946 bytes in 0.099 second response time [08:46:09] (03Merged) 10jenkins-bot: s*.hosts: Add db2095 to s2,s4,s6,s7 [software] - 10https://gerrit.wikimedia.org/r/435737 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [08:51:01] marostegui: another thought was, is there a way to force certain queries to have a shorter timeout 
that whatever is configured on the db hosts? [08:52:15] addshore: we have a general query killer that kills queries that take more than 60 seconds in production, the problem is that it gets overloaded and cannot kill as fast as possible [08:52:29] addshore: It should be done from the code side [08:52:32] ack [08:53:29] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,podsandbox_status,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:54:12] (03PS1) 10Ema: varnishrls: remove python daemon [puppet] - 10https://gerrit.wikimedia.org/r/435739 (https://phabricator.wikimedia.org/T184942) [08:54:29] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:54:59] (03PS1) 10Gehel: maps: reimage maps-test2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/435740 (https://phabricator.wikimedia.org/T195741) [08:55:03] (03PS1) 10Gehel: maps: set cassandra version based on role, not based on regexes [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) [08:55:29] im not even sure if the mw abstraction allows that [08:57:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435742 [08:58:12] addshore: Can I merge that? 
^ [09:00:12] !log Deploy schema change on db1052 (s1 primary master) - T190148 [09:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:16] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [09:00:56] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435742 (owner: 10Marostegui) [09:01:27] (03PS3) 10Ema: VCL: move RB Accept header normalization to text-fe [puppet] - 10https://gerrit.wikimedia.org/r/434706 [09:02:13] (03CR) 10Ema: [C: 032] VCL: move RB Accept header normalization to text-fe [puppet] - 10https://gerrit.wikimedia.org/r/434706 (owner: 10Ema) [09:02:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435742 (owner: 10Marostegui) [09:02:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435742 (owner: 10Marostegui) [09:04:11] !log Deploy schema change on s1 primary master (db1052) - T191519 [09:04:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 after alter table (duration: 01m 20s) [09:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:16] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [09:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:55] (03PS1) 10Gehel: maps: enable prometheus metrics for cassandra on maps-test instance [puppet] - 10https://gerrit.wikimedia.org/r/435746 (https://phabricator.wikimedia.org/T195741) [09:06:01] (03CR) 10Vgutierrez: [C: 031] "looking good :D" [puppet] - 10https://gerrit.wikimedia.org/r/435739 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [09:07:17] (03PS2) 10Ema: varnishrls: remove python daemon [puppet] - 10https://gerrit.wikimedia.org/r/435739 
(https://phabricator.wikimedia.org/T184942) [09:07:34] !log upgrading mw1299-mw1306 to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout) [09:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:35] marostegui: aaaah, today is a US holiday... [09:14:45] marostegui: yes merge away (sorry in a meeting) [09:15:21] :22 [09:15:46] marostegui: maybe we won't re enable stuff today then, just in case, but instead do the backports adding extra monitoring and preparing to re enable [09:16:27] Sure :) [09:16:57] (03CR) 10Vgutierrez: "This basically happens because jetty is binding the socket to 127.0.0.1:8080," [puppet] - 10https://gerrit.wikimedia.org/r/435670 (owner: 10Alex Monk) [09:23:30] PROBLEM - Apache HTTP on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.003 second response time [09:23:30] PROBLEM - HHVM rendering on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.003 second response time [09:23:49] PROBLEM - Nginx local proxy to apache on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 1.437 second response time [09:24:07] (03CR) 10Ema: [C: 032] varnishrls: remove python daemon [puppet] - 10https://gerrit.wikimedia.org/r/435739 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [09:24:21] 10Operations, 10Traffic, 10netops: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365#4235789 (10ayounsi) No more ICMP mentioning cp3039, which helps narrowing down the possible causes. Note that adding the static /32 does not bypass xfrm, traffic stays encrypted. [09:24:27] (03CR) 10Gehel: "ppc agrees, this is a noop: https://puppet-compiler.wmflabs.org/compiler02/11283/" [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [09:27:30] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:32:11] (03PS1) 10Jcrespo: mariadb: Add extra_port on port + 20 for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/435751 [09:32:26] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:34:07] (03PS2) 10Gehel: maps: set cassandra version based on role, not based on regexes [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) [09:34:09] (03PS2) 10Gehel: maps: enable prometheus metrics for cassandra on maps-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/435746 (https://phabricator.wikimedia.org/T195741) [09:35:06] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.421 second response time [09:35:21] (03PS2) 10Jcrespo: mariadb: Add extra_port on port + 20 for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/435751 [09:35:25] RECOVERY - Nginx local proxy to apache on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.053 second response time [09:36:06] RECOVERY - HHVM rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 79185 bytes in 0.111 second response time [09:36:34] (03PS6) 10Marostegui: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: 10Jcrespo) [09:38:00] (03PS3) 10Jcrespo: mariadb: Add extra_port on port + 20 for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/435751 [09:38:08] (03PS1) 10Ema: varnishrls: post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/435752 (https://phabricator.wikimedia.org/T184942) [09:38:10] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3443583 (10Gilles) [09:39:54] (03CR) 10Elukey: [C: 031] maps: set 
cassandra version based on role, not based on regexes [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [09:41:37] (03PS1) 10Marostegui: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) [09:41:49] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: 10Marostegui) [09:41:59] (03PS4) 10Elukey: Swap zookeeper on conf1002 with conf1005 [puppet] - 10https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924) [09:45:22] (03PS3) 10Gehel: maps: set cassandra version based on role, not based on regexes [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) [09:45:24] (03PS3) 10Gehel: maps: enable prometheus metrics for cassandra on maps-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/435746 (https://phabricator.wikimedia.org/T195741) [09:46:19] (03PS1) 10Marostegui: mariadb: Promote db1093 to master [puppet] - 10https://gerrit.wikimedia.org/r/435756 (https://phabricator.wikimedia.org/T187962) [09:47:52] (03CR) 10Marostegui: [C: 04-2] "This patch isn't meant to be merged unless there is an emergency" [puppet] - 10https://gerrit.wikimedia.org/r/435756 (https://phabricator.wikimedia.org/T187962) (owner: 10Marostegui) [09:49:03] (03PS1) 10Marostegui: db1061: Upgrade socket location [puppet] - 10https://gerrit.wikimedia.org/r/435757 (https://phabricator.wikimedia.org/T187962) [09:52:35] (03CR) 10Gehel: "ppc looks happy, noop for all except maps-test2004: https://puppet-compiler.wmflabs.org/compiler02/11291/" [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [09:56:07] (03CR) 10Gehel: "ppc looks happy: https://puppet-compiler.wmflabs.org/compiler03/11293/" [puppet] - 
10https://gerrit.wikimedia.org/r/435746 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [10:01:53] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435759 (https://phabricator.wikimedia.org/T128546) [10:03:25] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435759 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:04:39] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435759 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:04:53] 10Operations, 10Wikimedia-Mailing-lists: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750#4235875 (10Sylvain_WMFr) [10:06:54] (03PS1) 10Marostegui: db-eqiad.php: Promote db1093 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435760 (https://phabricator.wikimedia.org/T187962) [10:07:12] (03CR) 10Marostegui: [C: 04-2] "This is to be pushed only if we need to do an emergency failover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435760 (https://phabricator.wikimedia.org/T187962) (owner: 10Marostegui) [10:08:13] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:435759|Bumping portals to master (T128546)]] (duration: 01m 21s) [10:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:18] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:09:27] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435759 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:09:34] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:435759|Bumping portals to master (T128546)]] (duration: 01m 20s) [10:09:38] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:42] !log upgrading eqiad video scalers to hhvm-wikidiff 1.7.0 [10:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:29] !log upgrading mw1266-mw1275 to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout) [10:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:40] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1951 bytes in 0.102 second response time [10:42:11] (03CR) 10Ema: [C: 032] varnishrls: post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/435752 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [10:50:40] (03PS4) 10Mark Bergsma: Adapt ProxyFetch tests to use tcpClients and sslClients [debs/pybal] - 10https://gerrit.wikimedia.org/r/434695 [10:50:42] (03PS1) 10Mark Bergsma: Use MemoryReactorClock for testing the UDP monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/435763 [10:50:44] (03PS1) 10Mark Bergsma: Implement common base class for "looping check" monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/435764 [10:52:05] 10Operations, 10Wikimedia-Mailing-lists: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750#4236008 (10Aklapper) Which specific mailing lists was this tested with? Which web browsers was this tested with? (Likely unrelated, but I fail to see CSS appli... [11:00:04] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180528T1100). [11:08:02] 10Operations, 10Wikimedia-Mailing-lists: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750#4236034 (10Sylvain_WMFr) - Tested lists: wikimedia-l, wikimediafr and wlm-announce. 
- This was tested with Firefox 60 on Ubuntu 17.10 (on two computers), but I... [11:09:35] (03PS1) 10Ladsgroup: Disable search integration in Article Placeholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435769 (https://phabricator.wikimedia.org/T195753) [11:09:36] (03PS1) 10Ladsgroup: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435770 (https://phabricator.wikimedia.org/T195756) [11:10:52] !log Stop MySQL on db2092 to copy its content to db2094 - T190704 [11:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:57] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [11:11:45] (03PS2) 10Ladsgroup: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435770 (https://phabricator.wikimedia.org/T195756) [11:11:54] (03Abandoned) 10Ladsgroup: Disable search integration in Article Placeholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435769 (https://phabricator.wikimedia.org/T195753) (owner: 10Ladsgroup) [11:13:52] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364#4236045 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:24:47] 10Operations, 10Wikimedia-Mailing-lists: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750#4236086 (10Aklapper) Indeed my CSS issues above are unrelated. I wonder if https://gerrit.wikimedia.org/r/#/c/432168/ could be somehow related here. CC'ing @her... [11:25:09] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4210368 (10MoritzMuehlenhoff) The server is out of warranty since January. @Papaul: Do we have any decommissioned servers from which we could swap the broken CPU? 
[11:25:23] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4236090 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:27:10] 10Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T195306#4236094 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Papaul [11:27:38] !log upgrading API servers in codfw to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout) [11:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:34] !log cp1008: reboot with intel-microcode T127825 [11:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:38] T127825: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825 [11:33:17] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#4236103 (10ema) [11:33:20] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4236102 (10ema) 05stalled>03Open [11:33:44] (03PS1) 10Ladsgroup: Disable search integration with Article Placeholder temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435775 (https://phabricator.wikimedia.org/T195753) [11:34:24] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#3901180 (10ema) varnishrls removed, thanks @Krinkle. 
[11:36:35] (03CR) 10Mark Bergsma: [C: 032] Extend unit testing of RunCommand [debs/pybal] - 10https://gerrit.wikimedia.org/r/433702 (owner: 10Mark Bergsma) [11:37:13] (03Merged) 10jenkins-bot: Extend unit testing of RunCommand [debs/pybal] - 10https://gerrit.wikimedia.org/r/433702 (owner: 10Mark Bergsma) [11:41:01] PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:48:30] RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational [11:53:16] (03PS1) 10Aklapper: phabricator: List parent projects for archived projects with open tasks [puppet] - 10https://gerrit.wikimedia.org/r/435776 [11:58:05] (03PS3) 10Mobrovac: VCL: Normalise the Accept-Language header for the REST API [puppet] - 10https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) [12:05:01] !log aborrero@install1002:~$ sudo -i reprepro --noskipold -C 'thirdparty/mono-project-trusty' update trusty-wikimedia [12:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:19] (03CR) 10Mobrovac: "@Ema done, moved it to text-frontend." 
[puppet] - 10https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) (owner: 10Mobrovac) [12:08:22] !log T194665 aborrero@install1002:~$ sudo -i reprepro --noskipold -C 'thirdparty/mono-project-jessie' update jessie-wikimedia [12:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:26] T194665: Provide an up-to-date mono environment on toolforge - https://phabricator.wikimedia.org/T194665 [12:09:57] !log T194665 aborrero@install1002:~$ sudo -i reprepro --noskipold -C 'thirdparty/mono-project-stretch' update stretch-wikimedia [12:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:57] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4236298 (10aborrero) [12:47:05] (03PS2) 10Gehel: maps: reimage maps-test2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/435740 (https://phabricator.wikimedia.org/T195741) [12:47:54] (03CR) 10Gehel: [C: 032] maps: reimage maps-test2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/435740 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [12:48:31] (03PS4) 10Gehel: maps: set cassandra version based on role, not based on regexes [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) [12:49:27] (03CR) 10Gehel: [C: 032] maps: set cassandra version based on role, not based on regexes [puppet] - 10https://gerrit.wikimedia.org/r/435741 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [12:49:59] (03PS4) 10Gehel: maps: enable prometheus metrics for cassandra on maps-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/435746 (https://phabricator.wikimedia.org/T195741) [12:50:50] (03CR) 10Gehel: [C: 032] maps: enable prometheus metrics for cassandra on maps-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/435746 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) 
[12:52:49] 10Operations, 10Maps-Sprint, 10Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4235633 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The log can be... [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180528T1300). [13:00:04] Biplab and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:19] * Urbanecm is here [13:01:01] Biplab is on [13:01:01] I can SWAT today [13:02:21] Biplab: you are first, Urbanecm please stand by [13:02:42] Biplab: do you know how to test at mwdebug? 
(I can point you to the docs if not) [13:03:17] Please point [13:03:19] I don't know [13:03:41] !log upgrading app servers in codfw to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout) [13:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:04] Biplab: in short, this is the process I am following https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [13:04:24] Ok [13:04:30] Biplab: this is the server I will deploy the change first https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary [13:05:01] you have to install a browser extension https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [13:05:12] turn it on and select mwdebug1002 [13:05:32] then your traffic will be redirected to that one machine where the change is deployed [13:05:44] once you test it works fine, I will deploy to all machines [13:06:01] Biplab: I'll ping you in a few minutes when the change is at mwdebug [13:06:07] Sure [13:06:14] (03PS5) 10Zfilipin: Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) (owner: 10Biplab Anand) [13:07:31] Biplab, if you will be in troubles, feel free to ask - both me and zeljkof will be happy to help [13:07:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) (owner: 10Biplab Anand) [13:08:39] Biplab: all this extension work might look complicated, but it's not, feel free to ask for help [13:09:00] Urbanecm please help to execute the task [13:09:13] Biplab, what browser do you use? 
[13:09:17] (03Merged) 10jenkins-bot: Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) (owner: 10Biplab Anand) [13:09:20] Chrome [13:09:27] So, open Chrome Store please [13:09:31] (03CR) 10jenkins-bot: Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) (owner: 10Biplab Anand) [13:09:48] There's a link "Apps" in your bookmarks [13:09:55] And then there is "Chrome Store" [13:10:01] Search for Wikimedia Debug [13:10:17] And install "WikimediaDebug" extension [13:10:17] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:27] Ok [13:10:30] Biplab, Urbanecm: direct link https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb [13:10:34] Let me try once [13:10:41] from https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [13:11:05] Then, to the right of URL address bar, you will have a Wikimedia Icon. If you will press it, you will be able to select canary server you want to use [13:11:14] Biplab: your commit is at mwdebug1002, when ever you are ready you can test [13:11:26] The name of debug server will be given be zeljkof when deployed [13:11:31] (as it happened before a moment) [13:11:43] Biplab: this is how it looks like https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes [13:11:52] https://wikitech.wikimedia.org/wiki/File:X-Wikimedia-Debug_demo.png [13:12:17] you have to enable the extension (default is off, turn it to on) [13:12:32] and then select mwdebug1002 from the list of servers [13:13:14] Biplab, does it work? 
[13:13:20] then just go to newiki and test if the feature works [13:13:29] I need more time, apologies [13:13:34] !log restarted pdfrender on scb1002 [13:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:47] My internet is damn slow [13:13:52] Biplab: no problem, take your time, let us know if you need help [13:13:56] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [13:14:23] Could one of you test instead? [13:14:28] I am lost [13:14:31] we have some patches regarding the outage; it would be great if we could push them in this SWAT window :D [13:15:04] Amir1: we are testing 435106, if it's urgent go ahead [13:15:23] no it's not urgent :) [13:15:25] Biplab: :D what did you do so far? [13:15:30] just let me know when you're done [13:15:35] Amir1: sure [13:15:55] Thanks! [13:16:13] Actually I am installing the extension [13:16:39] Biplab: I could test, but I do not know how - how would I check if the feature is enabled? [13:18:22] Urbanecm: should I deploy 430123 to mwdebug, or skip it?
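Note on the canary test discussed above: the browser extension works by attaching an `X-Wikimedia-Debug` header to each request, so that the request is served by the chosen debug host. A minimal Python sketch of the same idea follows. The header value format (`backend=<host>`), the fully qualified host name, and the test URL are assumptions for illustration and are not taken from the log.

```python
from urllib.request import Request


def debug_headers(backend: str) -> dict:
    """Headers that ask for a request to be routed to one debug backend.

    Assumption: the header value has the form "backend=<host name>".
    """
    return {"X-Wikimedia-Debug": f"backend={backend}"}


# Hypothetical values: the log only names "mwdebug1002" and "newiki".
headers = debug_headers("mwdebug1002.eqiad.wmnet")
request = Request(
    "https://ne.wikipedia.org/wiki/Special:ListGroupRights",
    headers=headers,
)

# The request is only built here, not sent; pass it to
# urllib.request.urlopen(request) to actually fetch the page.
print(headers["X-Wikimedia-Debug"])
```

Building the headers in a small helper keeps the backend choice in one place, which mirrors picking a server from the list in the extension.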
[13:18:42] zeljkof, I will test Biplab patch if needed [13:18:55] Thanks urbanecm [13:18:56] Urbanecm: please do, if you know what to do [13:18:58] You can skip mwdebug in 430123 and 430124 [13:19:05] zeljkof, ok, will do [13:19:44] zeljkof, you can deploy Biplab patch [13:20:09] Biplab: Urbanecm can explain to you how to test during SWAT today, for future reference, we'll move ahead now since there are more patches to deploy [13:20:17] Urbanecm: ok, deploying [13:20:25] 10Operations, 10Puppet, 10cloud-services-team: Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553#4236401 (10MoritzMuehlenhoff) p:05Triage>03Low [13:20:46] Sure [13:20:57] Sorry for the inconvenience [13:21:14] Biplab: no problem, it was confusing to me too the first time I used it :) [13:21:36] Urbanecm: there should be short video on how to use the extension, want to create one ;) [13:21:47] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:435106|Enable template editor group on newiki (T195557)]] (duration: 01m 21s) [13:21:50] I can see it being confusing for new people [13:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] T195557: Enable Template Editor User Group on Nepali Wikipedia - https://phabricator.wikimedia.org/T195557 [13:22:14] Biplab, Urbanecm: 435106 is deployed, please test [13:22:16] zeljkof, create a phab task :D, can try it later, but I'm not sure if I will remember it [13:22:30] zeljkof, working, thanks [13:22:36] Biplab: now there is no need to use the browser extension, just test at newiki [13:22:58] Thanks for your help [13:23:17] Biplab: thanks for deploying with #releng! 
;) [13:23:21] zeljkof, you are welcome [13:23:35] Urbanecm: will try to create the video myself, if there is time today, should be short [13:23:55] zeljkof, you are staffer (if I recall correctly), I'm a volunteer, so I have to do something else as well :D [13:24:22] Urbanecm: yeah, it's my job to fix the docs as needed :D [13:24:36] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430123 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:24:39] How's that position called officially, if I may ask? [13:24:51] Urbanecm: my job title? [13:24:53] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4236420 (10Vgutierrez) [13:24:53] yep [13:24:56] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4236418 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [13:25:23] it used to be QA engineer, I think it is software engineer now, but I'm not sure, it may still be QA [13:25:43] (03Merged) 10jenkins-bot: Upload new logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430123 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:27:29] Thank you [13:28:09] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:430123|Upload new logos for yiwikisource (T193562)]] (duration: 01m 19s) [13:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:14] T193562: yiwikisource logo needs updating - https://phabricator.wikimedia.org/T193562 [13:28:20] Urbanecm: 430123 deployed [13:28:24] ack [13:28:43] (03CR) 10Zfilipin: [C: 032] "zfilipin@terbium:~$ echo "https://en.wikipedia.org/static/images/project-logos/yiwikisource.png" | mwscript purgeList.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430123 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) 
[13:29:00] Urbanecm: purged yiwikisource.png [13:29:07] ack [13:29:15] (03PS3) 10Zfilipin: Use uploaded HD logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:29:28] (03CR) 10jenkins-bot: Upload new logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430123 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:29:33] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4236437 (10Vgutierrez) [13:30:53] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:32:24] (03Merged) 10jenkins-bot: Use uploaded HD logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:34:15] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:430124|Use uploaded HD logos for yiwikisource (T193562)]] (duration: 01m 19s) [13:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:20] T193562: yiwikisource logo needs updating - https://phabricator.wikimedia.org/T193562 [13:34:43] (03CR) 10jenkins-bot: Use uploaded HD logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:34:50] Urbanecm: 430124 is deployed [13:34:58] ack [13:35:04] (03PS2) 10Zfilipin: Revert "Revert "Enable $wgUseRCPatrol on azwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435628 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:36:25] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435628 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:37:46] (03Merged) 
10jenkins-bot: Revert "Revert "Enable $wgUseRCPatrol on azwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435628 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:39:02] Urbanecm: 435628 is at mwdebug [13:40:14] zeljkof, testing [13:40:33] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:435628|Revert "Revert "Enable $wgUseRCPatrol on azwiki"" (T194389)]] (duration: 01m 20s) [13:40:36] (03CR) 10jenkins-bot: Revert "Revert "Enable $wgUseRCPatrol on azwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435628 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:37] T194389: Enable $wgUseRCPatrol on azwiki - https://phabricator.wikimedia.org/T194389 [13:40:56] Urbanecm: thanks to copy/paste error, I have already deployed it :) [13:41:35] It's working, so nothing happened :D [13:41:41] Thank you [13:41:45] Urbanecm: 434359 has merge conflict :/ [13:42:04] Is it possible to process 434508 while I will be solving the conflict? 
[13:42:46] Urbanecm: sure [13:42:57] (03PS2) 10Zfilipin: New protection level on the Hungarian Wikipedia - trusted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434508 (https://phabricator.wikimedia.org/T194568) (owner: 10Urbanecm) [13:44:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434508 (https://phabricator.wikimedia.org/T194568) (owner: 10Urbanecm) [13:45:40] (03PS5) 10Urbanecm: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [13:46:09] (03Merged) 10jenkins-bot: New protection level on the Hungarian Wikipedia - trusted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434508 (https://phabricator.wikimedia.org/T194568) (owner: 10Urbanecm) [13:46:23] (03CR) 10jenkins-bot: New protection level on the Hungarian Wikipedia - trusted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434508 (https://phabricator.wikimedia.org/T194568) (owner: 10Urbanecm) [13:46:39] (03PS9) 10Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 [13:48:17] Urbanecm: 434508 is at mwdebug [13:48:37] ack [13:49:15] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4236536 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:49:42] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4236538 (10elukey) The plan is to reimage phab1001 to Stre... 
[13:50:24] zeljkof, please deploy [13:50:30] Urbanecm: ok, deploying [13:50:32] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4082038 (10MoritzMuehlenhoff) Status update: Half of our active data centre and the majority of servers in our backup DC have been upgraded t... [13:51:18] ack [13:52:01] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:434508|New protection level on the Hungarian Wikipedia - trusted (T194568)]] (duration: 01m 20s) [13:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:06] T194568: New protection level on the Hungarian Wikipedia - https://phabricator.wikimedia.org/T194568 [13:52:54] Urbanecm: 434508 deployed [13:53:07] (03PS6) 10Zfilipin: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [13:55:20] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [13:56:42] (03Merged) 10jenkins-bot: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [13:57:51] Urbanecm: 434359 is at mwdebug [13:58:03] ack [13:58:03] Amir1: we are at the last commit, you are next :) [13:58:22] cool, thanks! 
[13:58:44] Urbanecm: uh, just noticed the time, not much time left :/ [13:58:45] (03CR) 10jenkins-bot: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [13:59:03] zeljkof, deploy please [13:59:22] Amir1: nothing after swat, so there should be time for your patches :) [13:59:34] Urbanecm: deploying [13:59:51] If we won't break wikis, I'm not sure how hard it will be to get ops [13:59:56] Thank you. I'm starting :D [14:00:10] Amir1 just a few more seconds... [14:00:14] <_joe_> Urbanecm: ? [14:00:32] I did not understand that too :D [14:00:44] Usually, on US Holiday, it is a little bit harder [14:00:46] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:434359|Enable "File mover" flag on zh.wikipedia (T195247)]] (duration: 01m 19s) [14:00:49] are you saying we have to break the wikis more often? :D [14:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:51] T195247: Enable "File mover" flag on zh.wikipedia - https://phabricator.wikimedia.org/T195247 [14:01:05] Urbanecm: 434359 is deployed [14:01:09] zeljkof, sure, if you are in SWAT = Setting Wiki Ablaze Team :D [14:01:10] Amir1: you are next [14:01:12] <_joe_> Urbanecm: not really during the EU day, give the SRE team is mostly located in the EU nowadays :) [14:01:20] <_joe_> that's why I was perplexed [14:01:42] Oh, ok. [14:01:47] zeljkof, thanks [14:01:55] (03PS2) 10Ladsgroup: Disable search integration with Article Placeholder temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435775 (https://phabricator.wikimedia.org/T195753) [14:02:05] Taking over! [14:02:13] Urbanecm: thank you for deploying with #releng! 
;) [14:02:26] yw [14:04:17] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435770 (https://phabricator.wikimedia.org/T195756) (owner: 10Ladsgroup) [14:05:41] (03Merged) 10jenkins-bot: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435770 (https://phabricator.wikimedia.org/T195756) (owner: 10Ladsgroup) [14:08:27] (03PS1) 10Ladsgroup: Revert "Disable Special:ItemDisambiguation in Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435788 [14:08:42] (03CR) 10Ladsgroup: "This gives out 500 on canary nodes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435788 (owner: 10Ladsgroup) [14:08:49] addshore: ^ [14:08:58] (03CR) 10Ladsgroup: [C: 032] Revert "Disable Special:ItemDisambiguation in Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435788 (owner: 10Ladsgroup) [14:09:37] (03CR) 10jenkins-bot: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435770 (https://phabricator.wikimedia.org/T195756) (owner: 10Ladsgroup) [14:09:40] o/ [14:10:12] addshore: your idea doesn't work :( [14:10:15] (03Merged) 10jenkins-bot: Revert "Disable Special:ItemDisambiguation in Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435788 (owner: 10Ladsgroup) [14:10:19] should we do it the Daniel way? [14:12:15] Amir1: doesnt work? 
I tested it locally [14:12:18] *looks at the patch* [14:12:33] I see why it's wrong [14:12:41] it should be SpecialBlankpage [14:12:47] fantastic [14:13:33] xD [14:13:48] =] [14:14:54] (03CR) 10jenkins-bot: Revert "Disable Special:ItemDisambiguation in Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435788 (owner: 10Ladsgroup) [14:15:35] (03PS1) 10Gehel: maps: don't fix postgresql version [puppet] - 10https://gerrit.wikimedia.org/r/435789 (https://phabricator.wikimedia.org/T195741) [14:16:26] (03PS1) 10Ladsgroup: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435790 (https://phabricator.wikimedia.org/T195756) [14:17:39] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435790 (https://phabricator.wikimedia.org/T195756) (owner: 10Ladsgroup) [14:19:13] (03Merged) 10jenkins-bot: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435790 (https://phabricator.wikimedia.org/T195756) (owner: 10Ladsgroup) [14:20:49] (03CR) 10jenkins-bot: Disable Special:ItemDisambiguation in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435790 (https://phabricator.wikimedia.org/T195756) (owner: 10Ladsgroup) [14:21:03] works now [14:21:05] deploying [14:21:20] !log cp1045,cp2001,cp3007,cp5001: reboot with intel-microcode T127825 [14:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:25] T127825: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825 [14:21:55] (03CR) 10Gehel: "ppc agrees this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/435789 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [14:22:00] (03CR) 10Gehel: [C: 032] maps: don't fix postgresql version [puppet] - 10https://gerrit.wikimedia.org/r/435789 (https://phabricator.wikimedia.org/T195741) (owner: 10Gehel) [14:22:00] Amir1: awesome [14:23:23] !log ladsgroup@tin Synchronized 
wmf-config/Wikibase-production.php: Disable Special:ItemDisambiguation in Wikidata (T195756) (duration: 01m 20s) [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] T195756: Disable Special:ItemDisambiguation - https://phabricator.wikimedia.org/T195756 [14:25:28] (03PS10) 10Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 [14:26:03] (03CR) 10jerkins-bot: [V: 04-1] WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (owner: 10Paladox) [14:27:32] (03PS11) 10Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 [14:28:00] (03CR) 10jerkins-bot: [V: 04-1] WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (owner: 10Paladox) [14:28:40] The canary looks fine [14:28:45] (03PS12) 10Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 [14:29:29] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435775 (https://phabricator.wikimedia.org/T195753) (owner: 10Ladsgroup) [14:29:48] !log ladsgroup@tin Synchronized php-1.32.0-wmf.5/extensions/ArticlePlaceholder: Add config variable to disable SearchHookHandler (T195753) (duration: 01m 18s) [14:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:53] 10Operations, 10Maps-Sprint, 10Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4236616 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The log can be... 
[14:29:53] T195753: Disable search integration with Article Placeholder temporarily - https://phabricator.wikimedia.org/T195753 [14:30:55] (03Merged) 10jenkins-bot: Disable search integration with Article Placeholder temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435775 (https://phabricator.wikimedia.org/T195753) (owner: 10Ladsgroup) [14:32:28] (03CR) 10jenkins-bot: Disable search integration with Article Placeholder temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435775 (https://phabricator.wikimedia.org/T195753) (owner: 10Ladsgroup) [14:32:31] Canary seems fine [14:32:34] moving to all nodes [14:33:00] (03PS13) 10Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 [14:34:55] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Disable search integration with Article Placeholder temporarily (T195753) (duration: 01m 20s) [14:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:00] T195753: Disable search integration with Article Placeholder temporarily - https://phabricator.wikimedia.org/T195753 [14:37:35] canary node seems fine as well [14:37:40] moving to everywhere [14:39:24] !log ladsgroup@tin Synchronized php-1.32.0-wmf.4/extensions/ArticlePlaceholder: Add config variable to disable SearchHookHandler (T195753) (duration: 01m 21s) [14:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] !log EU SWAT is done [14:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:10] (03PS1) 10Marostegui: db1124,db1125: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/435796 (https://phabricator.wikimedia.org/T190704) [15:04:05] (03CR) 10Marostegui: [C: 032] db1124,db1125: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/435796 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [15:09:49] (03PS1) 10Ayounsi: Use arping to detect 
duplicated IPs [puppet] - 10https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) [15:16:57] any ops to merge this? https://gerrit.wikimedia.org/r/#/c/435733/ [15:17:05] that'd be great, thanks [15:24:46] (03PS2) 10Ayounsi: Use arping to detect duplicated IPs [puppet] - 10https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) [15:25:47] (03CR) 10Mark Bergsma: [C: 032] Use MemoryReactorClock for monitor unit tests and adopt tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/434685 (owner: 10Mark Bergsma) [15:26:31] (03Merged) 10jenkins-bot: Use MemoryReactorClock for monitor unit tests and adopt tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/434685 (owner: 10Mark Bergsma) [15:27:27] Amir1: let me get to it [15:27:34] (03PS2) 10Marostegui: ores: Install hunspell-bs on ores nodes [puppet] - 10https://gerrit.wikimedia.org/r/435733 (https://phabricator.wikimedia.org/T194876) (owner: 10Ladsgroup) [15:27:37] marostegui: thank you! [15:28:14] (03CR) 10Marostegui: [C: 032] ores: Install hunspell-bs on ores nodes [puppet] - 10https://gerrit.wikimedia.org/r/435733 (https://phabricator.wikimedia.org/T194876) (owner: 10Ladsgroup) [15:28:28] Amir1: done [15:28:57] Great! [15:33:45] (03CR) 10Ayounsi: "Puppet compiler on a mix or internal and external hosts: https://puppet-compiler.wmflabs.org/compiler02/11298/" [puppet] - 10https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: 10Ayounsi) [15:37:13] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hunspell-bs] [15:38:12] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[hunspell-bs] [15:38:33] marostegui: ^^^ [15:38:40] I am checking yep [15:39:45] Amir1: I am going to revert it [15:40:02] what [15:40:04] I am taking a quick look but I don't have much time to deal with this now [15:40:16] https://packages.debian.org/stretch/hunspell-bs [15:40:29] it's in stretch [15:40:40] why is ores enabled on druid1001 [15:40:48] please revert [15:40:49] thanks [15:40:52] :) [15:40:57] (03PS1) 10Marostegui: Revert "ores: Install hunspell-bs on ores nodes" [puppet] - 10https://gerrit.wikimedia.org/r/435805 [15:41:13] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hunspell-bs] [15:41:14] druid1001 might be for testing spark/ORES integration [15:41:25] joal from analytics or maybe elukey would know. [15:41:47] I checked that it exists in stretch but it probably doesn't exist in older versions [15:41:49] (03CR) 10Marostegui: [C: 032] Revert "ores: Install hunspell-bs on ores nodes" [puppet] - 10https://gerrit.wikimedia.org/r/435805 (owner: 10Marostegui) [15:41:51] and now thorium [15:42:07] halfak: I suppose you're right [15:42:10] 10Operations, 10netops, 10Patch-For-Review: Detect IP address collisions - https://phabricator.wikimedia.org/T189522#4236820 (10ayounsi) a:03ayounsi [15:42:32] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hunspell-bs] [15:42:39] what is the os of druid1005?
[15:42:43] joal: ^ [15:42:59] Amir1: 1001 is jessie [15:43:12] and so is 1005 [15:43:19] Amir1: elukey has bumped druid1003 to stretch today, other nodes (1001,2,4,5,6) [15:43:22] are jessie [15:43:33] yes, it doesn't exist in jessie [15:43:38] https://packages.debian.org/jessie/hunspell-bs [15:44:10] I only thought about ores nodes that are stretch atm [15:44:42] elukey: Do you think we could remove ores from druid for now? [15:45:04] PROBLEM - puppet last run on druid1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hunspell-bs] [15:46:13] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:46:46] and thorium and analytics1003 [15:47:00] practically anything that runs jessie [15:47:22] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:47:32] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:47:33] PROBLEM - puppet last run on druid1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[hunspell-bs] [15:47:39] here I am sorry [15:48:13] RECOVERY - puppet last run on druid1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:17] I am checking up the puppet code [15:49:30] but I think it is coming from profile::hadoop::common [15:49:41] that it is deployed on a lot of hosts [15:49:43] <_joe_> !log uploading cergen 0.2.3 [15:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:02] RECOVERY - puppet last run on druid1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:50:22] Ahhhh elukey - I asked andrew to make hadoop nodes have the ores code, in order to test/try running models in hadoop [15:50:55] confirmed [15:50:55] # ores::base for ORES packages. [15:50:56] require ::ores::base [15:51:12] so all the hadoop worker nodes are on stretch now [15:51:20] except analytics1001/1002/1003 and the druid nodes [15:51:23] and thorium [15:51:52] is hunspell-bs something that we can backport to jessie-wikimedia? [15:51:57] this bosnian can wait if you're planning to move soon (week-ish) [15:52:20] I never backported so can't tell [15:52:42] RECOVERY - puppet last run on druid1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:52:48] Amir1: o/ - so a weekish is probably not enough for all those nodes :) [15:52:55] but we can explore two roads [15:53:02] 1) backport the package (will try to help) [15:53:19] 2) remove ores::base from profile::hadoop::common (but I'd need joal's approval :) [15:53:42] o/ [15:53:57] as you wish, I don't mind either way [15:54:06] elukey: no prob for me as of now - I've been testing and will not have time to test again soon [15:54:22] elukey: let's just comment with the reason, so that we remember [15:55:45] I think that hunspell-bs should be easily buildable for jessie, it is a dictionary afaics.. 
Amir1 is it good if I try it later on/tomorrow and the report back to you [15:55:48] ? [15:56:02] sure thing [15:56:15] !log Reboot db1124 and db1125 for more testing - T190704 [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:20] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [15:59:39] !log swap zookeeper from conf1002 to conf1005 [15:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:00] all right let's start [16:00:03] (03CR) 10Elukey: [C: 032] Swap zookeeper on conf1002 with conf1005 [puppet] - 10https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [16:00:08] (03PS5) 10Elukey: Swap zookeeper on conf1002 with conf1005 [puppet] - 10https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924) [16:02:07] !log Stop MySQL on db2095 for testing - T190704 [16:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:11] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [16:04:21] (03CR) 10Vgutierrez: Use arping to detect duplicated IPs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: 10Ayounsi) [16:05:48] ok so zookeeper on conf1005 is running, and I can see the znodes from zkCli.sh on it [16:06:03] going to wait a sec to get the new prometheus metrics [16:08:47] k [16:09:09] (03CR) 10Volans: [C: 04-1] "The general structure is ok, I'd make the script a clean script and not a template. See comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: 10Ayounsi) [16:11:07] someone familiar with mass msg here? 
[16:12:02] mobrovac: verified that my change didn't affect the main-codfw zk cluster this time :P
[16:12:11] cool
[16:12:18] !log upgrading job runners in codfw to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[16:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:25] all looking good on my side too elukey
[16:12:40] good I can see conf1005's metric, proceeding with the rolling restart
[16:13:50] kk
[16:14:46] so mirror maker just complained in the analytics chan, not sure why since I've only added conf1005
[16:15:13] ahahahha burrow
[16:15:38] hehe
[16:16:46] !log restart prometheus-burrow-exporter on kafkamon*
[16:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:02] now recovered
[16:18:12] okkkk
[16:18:34] !log stop and mask zookeeper on conf1002
[16:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:57] !log zookeeper cluster restart completed (main-eqiad / conf1*)
[16:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:10] all right so conf1005 is the new leader
[16:22:21] and conf1003/4 are correctly listed as followers
[16:22:41] going to do some sanity checks before rolling restart kafka main-eqiad
[16:23:52] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4236872 (Marostegui) The definitive hardware for codfw is now in place and replicating: db2094: s1, s3, s5, s8 db2095: s2, s...
[16:24:13] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4236873 (Marostegui)
[16:26:25] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4236874 (Marostegui)
[16:29:50] mobrovac: ready to roll restart kafka main eqiad
[16:30:18] k let's go elukey
[16:31:03] !log roll restart kafka on kafka100[1-3] to pick up new zookeeper settings
[16:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:08] restarting kafka1001
[16:36:52] weird thing from kafka topics --describe
[16:36:53] [2018-05-28 16:36:27,284] WARN Client session timed out, have not heard from server in 6004ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[16:37:49] hm
[16:37:50] ah lovely.. mobrovac it seems that kafka1001 can contact only conf1003
[16:37:53] and not 1004/5
[16:37:58] weird
[16:38:02] firewall?
[16:38:11] possibly, I am going to triple check all
[16:40:33] argh found it
[16:40:57] mobrovac: yes firewall, going to send a cr now
[16:41:06] hehe
[16:43:01] a while ago we switched the ipv6 addresses to interface::add_ip6_mapped { 'main': }
[16:43:56] ah right
[16:44:04] (PS3) Ayounsi: Use arping to detect duplicated IPs [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522)
[16:44:04] conf2* was fine since it still uses ipv4
[16:45:33] (PS4) Ayounsi: Use arping to detect duplicated IPs [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522)
[16:46:03] Last login: Sun May 27 21:58:00 2018 from 10.68.18.65
[16:46:05] Didn't this used to do a lookup?
[16:48:50] !log upgrading codfw video scalers to hhvm-wikidiff 1.7.0
[16:48:52] (PS1) Elukey: network::constants: update kafka[12]00[1-3] ipv6 addresses [puppet] - https://gerrit.wikimedia.org/r/435810 (https://phabricator.wikimedia.org/T182924)
[16:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:40] Krenair: not on stretch
[16:50:36] ah
[16:52:33] (CR) Ayounsi: Use arping to detect duplicated IPs (4 comments) [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[16:58:15] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/435810 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[16:58:31] moritzm: thanksss
[16:58:41] (CR) Elukey: [C: 2] network::constants: update kafka[12]00[1-3] ipv6 addresses [puppet] - https://gerrit.wikimedia.org/r/435810 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[16:59:47] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237007 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2004.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['maps-test2004.codfw.wmnet'] ```
[17:00:56] super, kafka topics --describe works fine now :)
[17:02:21] going to move forward with the rest of the restarts
[17:03:00] restarting kafka1002
[17:08:09] !log upgrading mwdebug servers in codfw to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:56] !log gehel@tin Started deploy [wdqs/wdqs@0e40344]: WDQS updater and GUI
[17:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:25] mobrovac: kafka main restarts completed
[17:19:33] !log restart kafka on kafka1012->23 to pick up the new zookeeper settings
[17:19:36]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:55] !log gehel@tin Finished deploy [wdqs/wdqs@0e40344]: WDQS updater and GUI (duration: 08m 59s)
[17:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:44] SMalyshev: deployment complete, tests are green
[17:26:41] !log roll restart of kafka on kafka-jumbo* to pick up the new zookeeper settings
[17:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:07] Blocked-on-Operations, Puppet, Reading-Infrastructure-Team-Backlog, Sentry, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#4237053 (Tgr) @DZahn it's experimental, and currently unmaintained. I hope to get back to it eventually.
[17:40:11] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237089 (Gehel) It looks like we have a few missing dependencies in Stretch: * jvm-tools (only available on jessie, could probably just be copied to stretch, need...
[17:55:59] (PS1) Alex Monk: role::mail::mx: Permit changing certificate [puppet] - https://gerrit.wikimedia.org/r/435814
[17:56:32] (CR) jerkins-bot: [V: -1] role::mail::mx: Permit changing certificate [puppet] - https://gerrit.wikimedia.org/r/435814 (owner: Alex Monk)
[17:58:12] ^ why does it not like that
[17:58:16] 17:56:30 modules/role/manifests/mail/mx.pp:3 wmf-style: Found hiera call in class 'role::mail::mx' for 'role::mail::mx::cert_name'
[17:58:20] is this some roles vs. profiles thing?
[17:58:36] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4235633 (MoritzMuehlenhoff) >>! In T195741#4237089, @Gehel wrote: > It looks like we have a few missing dependencies in Stretch: > > * jvm-tools (only available o...
[17:59:13] Krenair: yep - https://wikitech.wikimedia.org/wiki/Puppet_coding
[17:59:30] hiera calls are not allowed in roles
[18:10:12] elukey, so I've gotta what
[18:10:28] set them to default directly to $facts['...']
[18:10:37] then introduce a profile which lets it get set via hiera?
[18:10:42] that just passes through to the role?
[18:11:04] I didn't check the code review so I don't have a lot of suggestions to make, I was just pointing you to the right document :)
[18:12:35] hm okay thanks
[18:16:15] !log restart kafka mirror maker on kafka1012->14 - failed after the last round of kafka restarts
[18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:28] o/ Krenair, you might know the answer to my question while it is a US holiday!
[18:26:46] is it possible to get pages for individual site outages? rather than just general MW error increases?
[18:27:02] Or should i just file a ticket? :P
[18:29:33] If you have a metric to measure
[18:30:23] Or trigger
[18:32:33] Reedy: sure
[18:32:41] Reedy: is this all in puppet somewhere?
[18:32:47] Should be
[18:32:56] Actual contacts are in private puppet IIRC
[18:33:12] indeed
[18:35:27] yeah I never got pages so idk
[18:38:03] grepping around for "mediawiki error" and can't find it
[18:38:23] (CR) Volans: "Replies inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[18:39:10] addshore: afaik you'd need to have a contact-group and set up paging aliases for it
[18:39:28] and then set up the contact_group variable in the alerts in puppet
[18:39:33] I already have a contact and afaik wikidata also already has a group
[18:40:01] ah then even simpler - check for example listing analytics as contact_group
[18:40:15] addshore: what do you mean by 'individual site'?
[18:40:34] (datacenter vs website vs service)
[18:41:25] website, wikidata.org
[18:42:01] I know there is an alarm for "mediawiki errors / exceptions", that's what I'm currently looking for and then was going to see if I could add a similar alarm specifically for wikidatawiki
[18:42:08] going afk but let me know if you need more help tomorrow (Riccardo should be more knowledgeable than me for sure)
[18:43:44] Found it "Monitor MediaWiki fatals and exceptions."
[18:44:01] sumSeries(logstash.rate.mediawiki.fatal.ERROR.sum, logstash.rate.mediawiki.exception.ERROR.sum)
[18:44:12] I wonder if that is split per wiki at all
[18:46:05] that one probably not IIRC
[18:46:27] the general rule is that if there is a metric sure we can alarm on that :)
[18:47:11] btw I'm about to go off as well in few minutes, dinner time
[18:50:35] twentyafterfour: hi! are you still planning to do the remaining train deploy this aft?
[18:51:15] For CentralNotice, if u have a chance, the submodule should go to 4de28b6428c14820a71ee2ef178e41973fa7868d now
[18:51:27] https://gerrit.wikimedia.org/r/#/c/435817/
[18:51:29] AndyRussG: I think so
[18:51:57] cool! thx in advance, if there's any inconvenience also pls don't worry, it can also wait 'till tomorrow :)
[18:54:37] AndyRussG: ok, I've got it
[18:55:29] twentyafterfour: cool, thx much! :)
[18:58:23] PROBLEM - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6534: Connection refused
[18:58:52] PROBLEM - Check systemd state on maps-test2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:58:53] PROBLEM - cassandra CQL 10.192.16.35:9042 on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 9042: Connection refused
[18:58:53] PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused
[18:58:58] (PS1) Elukey: network::constants: update kafka-jumbo ipv6 addresses [puppet] - https://gerrit.wikimedia.org/r/435821 (https://phabricator.wikimedia.org/T182924)
[18:59:11] damn, expired downtime on maps-test2004
[18:59:22] PROBLEM - cassandra service on maps-test2004 is CRITICAL: NRPE: Command check_cassandra-state not defined
[18:59:49] (CR) Elukey: [C: 2] network::constants: update kafka-jumbo ipv6 addresses [puppet] - https://gerrit.wikimedia.org/r/435821 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[19:07:44] !log attempting to get the wmf.5 train back on track. Deploying a fix for T195514 (https://gerrit.wikimedia.org/r/c/435292/) to unblock T191051
[19:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:49] T195514: Can't copy and paste a list on office.wiki page in the visual editor - https://phabricator.wikimedia.org/T195514
[19:07:50] T191051: 1.32.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T191051
[19:08:00] jouncebot: next
[19:08:00] In 17 hour(s) and 51 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1300)
[19:12:08] !log roll restart of kafka-mirror maker (main eqiad -> jumbo) on kafka-jumbo* for zookeeper conf updates
[19:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:04] !log twentyafterfour@tin Synchronized php-1.32.0-wmf.5/extensions/CentralNotice/: sync wmf.5 CentralNotice for AndyRussG (duration: 01m 25s)
[19:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:33] :)
[19:53:07] can I assume that these test failures are
ignorable? https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/47836/
[19:53:30] definitely not caused by the change I'm trying to merge... hmm
[20:02:14] !log restart kafka on kafka1003 as attempt to solve the under-replicated partitions warning
[20:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:34] !log train still held up by test failures: https://gerrit.wikimedia.org/r/#/c/435825/
[20:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:30] !log Test failures on https://gerrit.wikimedia.org/r/#/c/435825/ are preventing deployment of the fix for a critical deployment blocker (see T195514) 1.32.0-wmf.5 still blocked refs T191051
[20:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:35] T195514: Can't copy and paste a list on office.wiki page in the visual editor - https://phabricator.wikimedia.org/T195514
[20:14:36] T191051: 1.32.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T191051
[20:14:58] sorry for repeating myself, forgot to reference task ids the first time
[20:16:50] (PS1) Subramanya Sastry: Enable RemexHtml on a bunch of additional wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/435829 (https://phabricator.wikimedia.org/T195263)
[20:36:01] twentyafterfour: hmm, so, blocked? Mmmm so it goes ;p
[20:36:17] for CN it can definitely wait, no rush
[21:32:45] Operations, Discovery, SRE-Access-Requests, Wikidata, Wikidata-Query-Service: Stas needs root access on WDQS test cluster - https://phabricator.wikimedia.org/T195797#4237424 (Smalyshev)
[22:40:57] (PS14) Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - https://gerrit.wikimedia.org/r/435327
[22:42:49] (CR) Paladox: "I think this is now ready."
[puppet] - https://gerrit.wikimedia.org/r/435327 (owner: Paladox)
[22:55:03] (PS15) Paladox: WIP: Planet: Redesgn UI for rawdog [puppet] - https://gerrit.wikimedia.org/r/435327
[22:55:16] (PS5) Anomie: WIP: wiki replicas - prepare for refactored actor storage [puppet] - https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: Bstorm)
[22:55:34] (CR) jerkins-bot: [V: -1] WIP: wiki replicas - prepare for refactored actor storage [puppet] - https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: Bstorm)
[23:07:40] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237539 (Pnorman) We may need to deploy different binaries for kartotherian and tilerator on Jessie and Stretch. The package contains [at least one hard-coded refe...
[23:08:36] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237541 (Pnorman)
[23:12:31] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237542 (Pnorman)
[23:14:45] (PS16) Paladox: Planet: Redesgn UI for rawdog [puppet] - https://gerrit.wikimedia.org/r/435327
[23:15:31] (PS17) Paladox: Planet: Redesgn UI for rawdog [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498)
[23:18:28] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4237548 (Paladox)
[23:19:12] (Abandoned) Paladox: Planet: Replace rss20.xml with atom.xml (backwards compat filename) [puppet] - https://gerrit.wikimedia.org/r/435218 (https://phabricator.wikimedia.org/T168490) (owner: Paladox)