[01:00:50] (CR) BryanDavis: "> It would also be good to have this read config from a default" [software/tools-webservice] - https://gerrit.wikimedia.org/r/435691 (https://phabricator.wikimedia.org/T148872) (owner: Nehajha)
[01:19:10] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[01:22:29] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:34:17] (CR) BryanDavis: [C: 1] Toolforge: add sqlite3 package to exec_environ [puppet] - https://gerrit.wikimedia.org/r/436903 (owner: Chico Venancio)
[01:34:53] (CR) jerkins-bot: [V: -1] Toolforge: add sqlite3 package to exec_environ [puppet] - https://gerrit.wikimedia.org/r/436903 (owner: Chico Venancio)
[02:08:59] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1196 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:11:05] (PS2) BryanDavis: Toolforge: add sqlite3 package to exec_environ [puppet] - https://gerrit.wikimedia.org/r/436903 (https://phabricator.wikimedia.org/T196006) (owner: Chico Venancio)
[02:43:02] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.6) (duration: 14m 33s)
[02:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:16] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Jun 4 02:53:16 UTC 2018 (duration 10m 14s)
[02:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 921.90 seconds
[04:01:49] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.060 second response time
[04:06:59] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.100 second response time
[04:11:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 178.50 seconds
[05:17:45] (CR) Dzahn: [C: 1] "ready to be merged - 3 days have passed since https://phabricator.wikimedia.org/T195837#4240357" [puppet] - https://gerrit.wikimedia.org/r/436093 (https://phabricator.wikimedia.org/T195837) (owner: Gilles)
[05:20:06] (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/437163 (https://phabricator.wikimedia.org/T191316)
[05:21:19] (CR) Dzahn: [C: 1] Add Reedy to contint-docker group [puppet] - https://gerrit.wikimedia.org/r/436860 (https://phabricator.wikimedia.org/T196192) (owner: Reedy)
[05:21:53] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/437163 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:23:19] (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/437163 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:23:36] (CR) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/437163 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:24:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1082 for alter table (duration: 00m 53s)
[05:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:23] !log Deploy schema change on db1082 with replication (this will generate lag on labs for s5) - T191316 T192926 T89737 T195193
[05:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:29] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:29:30] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:29:30] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:29:30] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:32:39] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:38:48] (PS1) Nehajha: Read rcfile if it exists and parse arguments from it [software/tools-webservice] - https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872)
[05:42:44] (CR) Nehajha: "I was already working on this solution not the most elegant one though. If this solution does not cover all the use cases or isn't the bes" [software/tools-webservice] - https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: Nehajha)
[05:46:14] (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/437165 (https://phabricator.wikimedia.org/T190704)
[05:48:02] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/437165 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:49:26] (Merged) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/437165 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:49:42] (CR) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/437165 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:50:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 - T190704 (duration: 00m 49s)
[05:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:39] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[05:52:12] !log Stop replication in sync on db1121 and db2051 - T190704
[05:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:31] (PS6) Elukey: Create profile::analytics::cluster::packages class [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[05:58:21] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4252661 (Marostegui) db2095:s4 has been finally moved under db2073 as db2051 (codfw master) already caught up with eqiad and...
[05:58:45] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4252662 (Marostegui)
[05:59:46] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/437166
[06:00:40] PROBLEM - MariaDB Slave Lag: s5 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.79 seconds
[06:00:49] PROBLEM - MariaDB Slave Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 683.43 seconds
[06:00:59] I silenced those I thought
[06:02:03] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/437166 (owner: Marostegui)
[06:03:10] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:03:36] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/437166 (owner: Marostegui)
[06:03:50] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/437166 (owner: Marostegui)
[06:05:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 - T190704 (duration: 00m 49s)
[06:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:31] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[06:06:30] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.075 second response time
[06:08:56] (PS1) Marostegui: db-codfw.php: Repool db2059 and db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/437172 (https://phabricator.wikimedia.org/T190704)
[06:10:40] PROBLEM - MariaDB Slave Lag: s5 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.26 seconds
[06:10:49] PROBLEM - MariaDB Slave Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.72 seconds
[06:11:36] I did silence those
[06:11:40] Maybe it is not working again?
[06:15:06] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2059 and db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/437172 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[06:16:31] (Merged) jenkins-bot: db-codfw.php: Repool db2059 and db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/437172 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[06:17:10] icinga downtimes aren't going thru
[06:17:12] PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 712.22 seconds
[06:18:41] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2059, db2075 - T190704 (duration: 00m 49s)
[06:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:45] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[06:19:38] (CR) jenkins-bot: db-codfw.php: Repool db2059 and db2075 [mediawiki-config] - https://gerrit.wikimedia.org/r/437172 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[06:21:42] RECOVERY - MariaDB Slave Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[06:21:42] (CR) Joal: [C: 1] "LGTM ! Thanks" [puppet] - https://gerrit.wikimedia.org/r/437056 (owner: Elukey)
[06:23:10] (CR) Elukey: [C: 2] profile::analytics::refinery::job:data_purge: fix webrequest datasource [puppet] - https://gerrit.wikimedia.org/r/437056 (owner: Elukey)
[06:26:55] (CR) Joal: "One nit about packages list" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[06:30:59] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:31:30] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf]
[06:42:10] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.070 second response time
[06:54:31] (PS2) Muehlenhoff: Add gilles to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/436093 (https://phabricator.wikimedia.org/T195837) (owner: Gilles)
[06:55:18] I am not sure if I trust this gilles person
[06:55:29] moritzm: is he a trustable person? :P
[06:56:13] (CR) Muehlenhoff: [C: 2] Add gilles to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/436093 (https://phabricator.wikimedia.org/T195837) (owner: Gilles)
[06:56:20] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:30] ( gilles: since I already pinged your username - let us know in the analytics chan if you need help with hadoop data)
[06:56:50] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:39] RECOVERY - MariaDB Slave Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[07:05:13] (PS2) Gehel: Send elasticsearch slowlogs to logstash [puppet] - https://gerrit.wikimedia.org/r/436841 (https://phabricator.wikimedia.org/T196180) (owner: EBernhardson)
[07:05:40] RECOVERY - MariaDB Slave Lag: s5 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[07:05:51] Operations, Analytics-Cluster, Analytics-Kanban, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4252716 (Vgutierrez) >>! In T182993#4248709, @Ottomata wrote: > Hm, ya, sounds like a way off before we get that in Debian then, ya? Is that...
[07:06:13] (CR) Gehel: [C: 2] Send elasticsearch slowlogs to logstash [puppet] - https://gerrit.wikimedia.org/r/436841 (https://phabricator.wikimedia.org/T196180) (owner: EBernhardson)
[07:11:05] !log starting elasticsearch cluster restart on eqiad - T193734
[07:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:11] T193734: Move Serbian language wikis from extra-analysis to extra-analysis-serbian plugin - https://phabricator.wikimedia.org/T193734
[07:12:29] Operations, ops-eqiad: Degraded RAID on wtp1043 - https://phabricator.wikimedia.org/T196260#4252729 (MoritzMuehlenhoff) p:Triage>Normal a:Cmjohnson
[07:13:32] (PS1) Elukey: geowiki::job::limn: add MAILTO env variable to avoid root@ spam [puppet] - https://gerrit.wikimedia.org/r/437178
[07:13:46] Operations, Analytics, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4252731 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff @Gilles You can now log int...
[07:14:34] (CR) Elukey: [C: 2] geowiki::job::limn: add MAILTO env variable to avoid root@ spam [puppet] - https://gerrit.wikimedia.org/r/437178 (owner: Elukey)
[07:15:57] Operations, ops-codfw, fundraising-tech-ops: frdb2001 RAID disk failure - https://phabricator.wikimedia.org/T196251#4252737 (MoritzMuehlenhoff) p:Triage>Normal a:Papaul
[07:16:29] PROBLEM - Host elastic1030 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:09] PROBLEM - Host elastic1033 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:19] PROBLEM - Host elastic1045 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:19] RECOVERY - Host elastic1030 is UP: PING WARNING - Packet loss = 80%, RTA = 1.00 ms
[07:17:53] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Patch-For-Review: Add Reedy to contint-docker group - https://phabricator.wikimedia.org/T196192#4252742 (MoritzMuehlenhoff) p:Triage>Normal Patch looks fine, but this request needs to pass the three day waiting/review...
[07:18:10] RECOVERY - Host elastic1033 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[07:18:19] RECOVERY - Host elastic1045 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[07:19:08] Operations, WMF-Blog-Social-Team, Wikimedia-Mailing-lists: Request mailman list for upcoming affiliate campaign - https://phabricator.wikimedia.org/T196003#4252748 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff @MelodyKramer The list has been created, let me know if you need any ad...
[07:22:04] elastic10(30|33|45) took a bit more time than expected to reboot, sorry for the noise
[07:24:06] (PS1) Marostegui: redact_sanitarium.sh: Add db1124,db1125 [puppet] - https://gerrit.wikimedia.org/r/437179 (https://phabricator.wikimedia.org/T190704)
[07:24:51] (CR) Marostegui: [C: 2] redact_sanitarium.sh: Add db1124,db1125 [puppet] - https://gerrit.wikimedia.org/r/437179 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:25:14] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/437180
[07:25:18] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/437180
[07:26:56] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/437180 (owner: Marostegui)
[07:27:33] Operations, Discovery, Discovery-Search, Wikidata, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4252763 (Gehel) Data load is complete, this can be closed.
[07:27:43] Morning all :) (no questions from me today, just a good morning)
[07:28:21] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/437180 (owner: Marostegui)
[07:29:28] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/437180 (owner: Marostegui)
[07:29:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 after alter table (duration: 00m 51s)
[07:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:10] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[07:35:43] (PS1) Muehlenhoff: Remove access for groovier [puppet] - https://gerrit.wikimedia.org/r/437184
[07:38:59] Operations, Dumps-Generation, Wikimedia-log-errors: High rate of "Memcached error .. CONNECTION FAILURE" on snapshot hosts - https://phabricator.wikimedia.org/T196303#4251960 (ArielGlenn) p:Triage>High a:ArielGlenn
[07:39:52] (PS2) Muehlenhoff: Remove access for groovier [puppet] - https://gerrit.wikimedia.org/r/437184
[07:41:33] (CR) Muehlenhoff: [C: 2] Remove access for groovier [puppet] - https://gerrit.wikimedia.org/r/437184 (owner: Muehlenhoff)
[07:44:59] PROBLEM - Host elastic1047 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 22, down: 2, shutdown: 0
[07:45:29] PROBLEM - Host elastic1050 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:32] (PS1) Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/437185 (https://phabricator.wikimedia.org/T191316)
[07:46:09] PROBLEM - Host elastic1046 is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:19] RECOVERY - Host elastic1047 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[07:47:20] RECOVERY - Host elastic1046 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[07:47:39] RECOVERY - Host elastic1050 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[07:57:36] (PS1) Gehel: wdqs: cleanup declarration of blazegraph options [puppet] - https://gerrit.wikimedia.org/r/437187 (https://phabricator.wikimedia.org/T194653)
[07:57:38] (PS1) Gehel: wdqs: reduce ban to a minimum on the internal cluster [puppet] - https://gerrit.wikimedia.org/r/437188 (https://phabricator.wikimedia.org/T194653)
[07:57:45] !log Stop replication on db2094:3315 for testing
[07:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:50] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 677.41 seconds
[08:04:31] ^that is me
[08:10:33] !log restarting icinga due to ongoing check/downtime issues
[08:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:42] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:14:36] (PS5) Muehlenhoff: Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991)
[08:16:26] (PS9) Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437)
[08:20:26] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:20:47] (CR) Vgutierrez: Implement kubernetes configuration observer (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: Vgutierrez)
[08:22:12] (CR) Mark Bergsma: "I believe due to the "magic" the ConfigObserver classes use to detect all available observers, kubernetes.py needs to be imported in main." [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: Vgutierrez)
[08:24:17] (CR) Vgutierrez: "> I believe due to the "magic" the ConfigObserver classes use to" [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: Vgutierrez)
[08:26:32] PROBLEM - Host pc2005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:26:47] mmmm
[08:27:05] indeed down
[08:27:39] looks like it is rebooting
[08:27:51] although the console is frozen
[08:28:10] let's wait a bit, see what happens
[08:28:24] yeah - I can only see: "Starting"
[08:28:36] So it might be even something older from when it first started
[08:28:44] let's wait a bit
[08:28:55] older, what do you mean?
[08:29:04] when it first booted
[08:29:12] rac has a critical error logged
[08:29:17] an old entry from when it was starting daemons
[08:29:20] so maybe it is not from now
[08:29:27] (CR) Mark Bergsma: "> > I believe due to the "magic" the ConfigObserver classes use to" [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: Vgutierrez)
[08:29:45] but it doesn't tell what: "Critical: A fatal error was detected on a component at bus 0 device 3 function 1"
[08:29:54] same for bus 1 device 0 function 0
[08:30:10] so descriptive
[08:31:37] there is nothing happening
[08:31:42] should we go for a reboot?
[08:31:44] hard reset
[08:31:45] yes
[08:31:58] server is under warranty for another six months
[08:32:10] but keep connected for diagnostic printouts
[08:32:32] I will start a ticket
[08:32:49] rebooted, let's see
[08:33:37] it is booting up normally now, no errors so far
[08:34:03] (PS7) Elukey: Create profile::analytics::cluster::packages class [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[08:34:06] we got something
[08:34:15] not mounting?
[08:34:20] jynus: once you've the ticket, let me know so I can paste it
[08:34:44] Operations, ops-codfw, DBA: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (jcrespo)
[08:34:47] I didn't have time to add more: https://phabricator.wikimedia.org/T196339
[08:35:13] Operations, ops-codfw, DBA: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (Marostegui) After hard resetting it and checking how it boots: ``` Enumerating Boot options... Enumerating Boot options... Done UEFI0067: A PCIe link training failure is observed in Embedded Network Device...
[08:36:26] so the network card?
[08:36:33] or its lane, whatever
[08:36:55] Operations, ops-codfw, DBA: pc2005 down - https://phabricator.wikimedia.org/T196339#4252928 (MoritzMuehlenhoff) a:Papaul
[08:38:25] yeah who knows, let's leave it like that until papaul can get to it I would say
[08:39:03] Operations, ops-codfw, DBA: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (Marostegui) p:Triage>High
[08:40:13] (PS1) Jcrespo: mariadb: Depool pc1005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194
[08:40:47] (PS2) Jcrespo: mariadb: Depool pc1005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339)
[08:41:35] (CR) Marostegui: "should be pc2005, but other than the title it looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339) (owner: Jcrespo)
[08:42:09] (PS3) Jcrespo: mariadb: Depool pc2005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339)
[08:42:45] (CR) Marostegui: [C: 1] mariadb: Depool pc2005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339) (owner: Jcrespo)
[08:43:12] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61481 MB (12% inode=99%)
[08:44:33] (CR) Jcrespo: [C: 2] mariadb: Depool pc2005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339) (owner: Jcrespo)
[08:45:28] I am going to downtime pc2005
[08:46:14] (Merged) jenkins-bot: mariadb: Depool pc2005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339) (owner: Jcrespo)
[08:46:49] (PS2) Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/437185 (https://phabricator.wikimedia.org/T191316)
[08:49:42] RECOVERY - Disk space on elastic1019 is OK: DISK OK
[08:51:17] Operations, Dumps-Generation, Wikimedia-log-errors: High rate of "Memcached error .. CONNECTION FAILURE" on snapshot hosts - https://phabricator.wikimedia.org/T196303#4252949 (ArielGlenn) T196125 I suppose. FWiW nutcracker shows about a million client errors (curl to localhost:2222) for memcached on...
[08:53:18] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2005 (duration: 00m 50s)
[08:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:39] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/437185 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[08:54:16] Logstash Error rate for mw1262.eqiad.wmnet' failed: ERROR: 33% OVER_THRESHOLD
[08:54:49] (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/437185 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[08:55:50] (CR) Gehel: "Puppet compiler agrees, this is a noop: https://puppet-compiler.wmflabs.org/compiler02/11350/" [puppet] - https://gerrit.wikimedia.org/r/437187 (https://phabricator.wikimedia.org/T194653) (owner: Gehel)
[08:56:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 for alter table (duration: 00m 49s)
[08:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:29] No errors for me with mw1262
[08:56:41] !log Deploy schema change on db1097:3315 - T191316 T192926 T89737 T195193
[08:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:48] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[08:56:48] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[08:56:48] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[08:56:48] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[08:57:10] (PS8) Elukey: Create profile::analytics::cluster::packages class [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[08:57:27] yeah, not seeing any unusual errors on mw1262 either
[08:58:01] I think it is important to communicate deploying errors, even if many times they may be false positives
[09:00:27] Operations, ops-codfw, DBA, Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4252966 (jcrespo)
[09:01:58] Operations, ops-codfw, DBA, Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4252912 (jcrespo) @papaul Please try any trivial thing you want, but these hosts are leased, and provider should take care of any hw issues.
[09:02:11] (CR) Elukey: "New pcc: https://puppet-compiler.wmflabs.org/compiler03/11353/" [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[09:02:41] (CR) Gehel: "Puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/11352/" [puppet] - https://gerrit.wikimedia.org/r/437188 (https://phabricator.wikimedia.org/T194653) (owner: Gehel)
[09:03:47] (CR) Joal: [C: 1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[09:04:31] !log addshore@terbium:~$ for i in {1..2500}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki
[09:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:03] Operations, Dumps-Generation, Wikimedia-log-errors: High rate of "Memcached error .. CONNECTION FAILURE" on snapshot hosts - https://phabricator.wikimedia.org/T196303#4252974 (ArielGlenn) Just a check that the basics work: ``` root@snapshot1005:~# telnet localhost 11212 Trying ::1... Trying 127.0.0.1...
[09:07:39] (PS4) Nehajha: Man page for webservice Added docs directory and rst files [software/tools-webservice] - https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097)
[09:08:20] (CR) jerkins-bot: [V: -1] Man page for webservice Added docs directory and rst files [software/tools-webservice] - https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: Nehajha)
[09:11:36] (CR) Nehajha: "I am not sure but probably conf.py file will be required to test this. I have tried to make this doc as concise as possible since we alrea" [software/tools-webservice] - https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: Nehajha)
[09:12:29] (CR) Nehajha: ">" (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: Nehajha)
[09:16:00] (CR) jenkins-bot: mariadb: Depool pc2005, hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/437194 (https://phabricator.wikimedia.org/T196339) (owner: Jcrespo)
[09:16:09] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/437185 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[09:26:29] (CR) Alexandros Kosiaris: "Sigh, quite hacky but fine by me" [software/debmonitor] - https://gerrit.wikimedia.org/r/436592 (https://phabricator.wikimedia.org/T167504) (owner: Volans)
[09:28:24] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/437202
[09:31:39] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/437202 (owner: Marostegui)
[09:33:03] (PS10) Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437)
[09:33:06] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/437202 (owner: Marostegui)
[09:33:23] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/437202 (owner: Marostegui)
[09:34:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 after alter table (duration: 00m 49s)
[09:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:30] (PS1) Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/437204
[09:35:02] (CR) jerkins-bot: [V: -1] dbproxy1010: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/437204 (owner: Marostegui)
[09:36:12] Come on jenkins! :)
[09:36:28] (PS2) Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/437204 (https://phabricator.wikimedia.org/T190704)
[09:36:57] marostegui: we don't know how to commit :(
[09:37:28] BUg vs Bug!
[09:38:01] (03CR) 10Vgutierrez: "> > > I believe due to the "magic" the ConfigObserver classes use to" [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [09:38:18] marostegui: OCD gods are pleased by our commit msg validator [09:38:57] (03CR) 10Marostegui: [C: 032] dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/437204 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [09:39:34] !log Reload haproxy on dbproxy1010 to depool labsdb1010 - https://phabricator.wikimedia.org/T190704 [09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:05] <_joe_> vgutierrez: if you try to complain, I'll ask volans to make a second version of it [09:46:13] <_joe_> it will take him 3 months, and it would be 15k LOC minimum, plus some spare cycles on the kubernetes machines for running tensorflow, but imagine how clean and standardised our commit messages would be afterwards [09:53:00] (03PS2) 10Arturo Borrero Gonzalez: toollabs: add /etc/aliases file for tools-mail server [puppet] - 10https://gerrit.wikimedia.org/r/436752 (https://phabricator.wikimedia.org/T196137) [09:53:05] (03PS1) 10Muehlenhoff: Enable microcode for Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/437206 (https://phabricator.wikimedia.org/T127825) [09:54:14] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: add /etc/aliases file for tools-mail server [puppet] - 10https://gerrit.wikimedia.org/r/436752 (https://phabricator.wikimedia.org/T196137) (owner: 10Arturo Borrero Gonzalez) [09:56:44] (03PS2) 10Ppchelko: Remove unused jobrunners. 
[puppet] - 10https://gerrit.wikimedia.org/r/436574 (https://phabricator.wikimedia.org/T190327) [10:03:02] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437208 (https://phabricator.wikimedia.org/T128546) [10:03:13] (03CR) 10jerkins-bot: [V: 04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437208 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:04:38] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437208 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:05:37] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437209 (https://phabricator.wikimedia.org/T128546) [10:07:23] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437209 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:08:46] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437209 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:09:32] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437209 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:09:58] (03CR) 10Elukey: [C: 031] Enable microcode for Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/437206 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [10:13:58] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:437209|Bumping portals to master (T128546)]] (duration: 00m 51s) [10:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:04] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:14:48] !log 
jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:437209|Bumping portals to master (T128546)]] (duration: 00m 49s) [10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:14] (03PS1) 10Giuseppe Lavagetto: Add entries for the videoscaler VIP [dns] - 10https://gerrit.wikimedia.org/r/437210 (https://phabricator.wikimedia.org/T188947) [10:20:42] (03PS3) 10Arturo Borrero Gonzalez: Toolforge: add sqlite3 package to exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/436903 (https://phabricator.wikimedia.org/T196006) (owner: 10Chico Venancio) [10:23:58] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Toolforge: add sqlite3 package to exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/436903 (https://phabricator.wikimedia.org/T196006) (owner: 10Chico Venancio) [10:25:05] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4253157 (10elukey) 05Open>03declined Sure we can decline and start another one. For the specs we don't have specific require... [10:25:07] 10Operations, 10Dumps-Generation, 10Wikimedia-log-errors: High rate of "Memcached error .. CONNECTION FAILURE" on snapshot hosts - https://phabricator.wikimedia.org/T196303#4253159 (10ArielGlenn) Because I've looked at the libmemcached code for where CONNECTION_FAILURE is set, and I don't see any good reason... 
[10:27:14] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: disable PrivateTmp everywhere [puppet] - 10https://gerrit.wikimedia.org/r/437212 [10:27:27] <_joe_> moritzm: ^^ [10:27:30] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | hardware request a new stat analytics host - https://phabricator.wikimedia.org/T196345#4253175 (10elukey) [10:29:20] (03CR) 10Muehlenhoff: profile::mediawiki::jobrunner: disable PrivateTmp everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437212 (owner: 10Giuseppe Lavagetto) [10:31:23] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: disable PrivateTmp everywhere [puppet] - 10https://gerrit.wikimedia.org/r/437212 [10:31:24] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: disable PrivateTmp everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437212 (owner: 10Giuseppe Lavagetto) [10:33:11] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/437212 (owner: 10Giuseppe Lavagetto) [10:33:43] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11355/" [puppet] - 10https://gerrit.wikimedia.org/r/437212 (owner: 10Giuseppe Lavagetto) [10:33:51] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: disable PrivateTmp everywhere [puppet] - 10https://gerrit.wikimedia.org/r/437212 [10:39:01] <_joe_> !log rolling restart of apache on the jobrunners to pick the changed privatetmp setting, rotating logs [10:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:47] (03PS2) 10Muehlenhoff: Enable microcode for Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/437206 (https://phabricator.wikimedia.org/T127825) [10:45:52] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/437206 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [10:51:14] !log reimage ganeti1004, ganeti1008 to 
stretch [10:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:03] (03PS3) 10Muehlenhoff: Switch video scalers to a profile [puppet] - 10https://gerrit.wikimedia.org/r/430892 [10:58:55] (03CR) 10Zhuyifei1999: "> Is this build failing because of conf.py?" (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: 10Nehajha) [11:00:04] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1100). [11:01:04] (03CR) 10Zhuyifei1999: [C: 04-1] Man page for webservice Added docs directory and rst files (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: 10Nehajha) [11:03:34] (03CR) 10Zhuyifei1999: Read rcfile if it exists and parse arguments from it (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [11:06:39] (03PS1) 10Arturo Borrero Gonzalez: toolforge: fix duplicated sqlite3 package declaration in toolforge bastions [puppet] - 10https://gerrit.wikimedia.org/r/437216 (https://phabricator.wikimedia.org/T196006) [11:07:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: fix duplicated sqlite3 package declaration in toolforge bastions [puppet] - 10https://gerrit.wikimedia.org/r/437216 (https://phabricator.wikimedia.org/T196006) (owner: 10Arturo Borrero Gonzalez) [11:08:10] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:12:53] (03PS1) 10Alexandros Kosiaris: kubestage1001: Test vg_to_remove functionality [puppet] - 10https://gerrit.wikimedia.org/r/437218 [11:19:09] (03CR) 10Alexandros Kosiaris: [C: 032] kubestage1001: Test vg_to_remove functionality [puppet] - 
10https://gerrit.wikimedia.org/r/437218 (owner: 10Alexandros Kosiaris) [11:31:47] (03CR) 10Alexandros Kosiaris: [C: 031] "Haven't tested this but LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/436812 (owner: 10Muehlenhoff) [11:33:03] 10Operations, 10Traffic: Package libvmod-re2 - https://phabricator.wikimedia.org/T196355#4253372 (10ema) p:05Triage>03Normal [11:38:33] !log rebalance row_A, row_C nodegroups in ganeti01.svc.eqiad.wmnet cluster [11:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:56] (03PS1) 10Alexandros Kosiaris: ganeti: Absent DSA SSH key [puppet] - 10https://gerrit.wikimedia.org/r/437219 (https://phabricator.wikimedia.org/T177371) [11:45:58] (03PS1) 10Alexandros Kosiaris: ganeti: Remove DSA ssh key [puppet] - 10https://gerrit.wikimedia.org/r/437220 (https://phabricator.wikimedia.org/T177371) [11:53:05] <_joe_> akosiaris: \o/ [11:53:50] (03PS1) 10Giuseppe Lavagetto: jobrunner_tls: generalize to support videoscalers as well [puppet] - 10https://gerrit.wikimedia.org/r/437223 (https://phabricator.wikimedia.org/T188947) [11:53:52] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: add TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/437224 (https://phabricator.wikimedia.org/T188947) [11:53:54] (03PS1) 10Giuseppe Lavagetto: conftool-data: Add missing data for videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/437225 (https://phabricator.wikimedia.org/T188947) [11:53:56] (03PS1) 10Giuseppe Lavagetto: lvs::configuration: Add configuration for the videoscaler service [puppet] - 10https://gerrit.wikimedia.org/r/437226 (https://phabricator.wikimedia.org/T188947) [11:54:08] 10Operations, 10Performance-Team, 10Traffic, 10HTTPS: TLS certificates renewal process - https://phabricator.wikimedia.org/T196248#4253412 (10Krinkle) [11:54:31] (03CR) 10jerkins-bot: [V: 04-1] role::mediawiki::videoscaler: add TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/437224 
(https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [12:07:02] (03PS2) 10Giuseppe Lavagetto: jobrunner_tls: generalize to support videoscalers as well [puppet] - 10https://gerrit.wikimedia.org/r/437223 (https://phabricator.wikimedia.org/T188947) [12:10:38] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/11357/ I will merge this later" [puppet] - 10https://gerrit.wikimedia.org/r/437223 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [12:18:03] !log krinkle@deploy1001 Started deploy [performance/navtiming@b229f75]: (no justification provided) [12:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:08] !log krinkle@deploy1001 Finished deploy [performance/navtiming@b229f75]: (no justification provided) (duration: 00m 05s) [12:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:27] 10Operations, 10Analytics, 10DC-Ops, 10procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4245913 (10faidon) We have a number of spreadsheets tracking inventory, refreshes, CapEx budgets etc. Which one are you referring to specifically (doc & sheet)? [12:30:34] elukey: ^ can discuss here too [12:47:50] 10Operations, 10Analytics, 10DC-Ops, 10procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4253576 (10elukey) 05Open>03Resolved a:03elukey Wrong tab in the spreadsheet! :) [12:49:04] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4253597 (10ayounsi) >>! In T196030#4246171, @RobH wrote: > I replaced both of the optics with wholly different optics and a wholly different fiber cable. So these are using a second set o... 
[12:49:09] !log restart elastic2001 to enable G1 GC - T156137 [12:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:13] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [12:53:38] jouncebot: next [12:53:38] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1300) [12:54:11] (03PS1) 10Gehel: elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] Hi everyone [13:00:18] o/ [13:00:28] (03PS1) 10Marostegui: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437235 (https://phabricator.wikimedia.org/T190704) [13:00:31] I can swat today [13:00:35] (03PS1) 10Gehel: wdqs: collect custom dropwizard metrics for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/437236 [13:02:34] Good! 
[13:04:11] Urbanecm: reviewing 436524 [13:04:16] zeljkof, ack [13:05:20] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) (owner: 10Urbanecm) [13:07:05] (03Merged) 10jenkins-bot: Assign movefile to autoreviewrs and patrollers on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) (owner: 10Urbanecm) [13:08:08] (03PS1) 10Urbanecm: Initial configuration for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437238 (https://phabricator.wikimedia.org/T196360) [13:08:37] Urbanecm: 436524 is at mwdebug1002 [13:08:49] zeljkof, ack [13:09:30] (03CR) 10jenkins-bot: Assign movefile to autoreviewrs and patrollers on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) (owner: 10Urbanecm) [13:09:42] (03PS2) 10Zfilipin: Temporarily enable MFMobileMainPageCss in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436310 (https://phabricator.wikimedia.org/T195905) (owner: 10Urbanecm) [13:10:15] zeljkof, please deploy 436524 [13:10:28] Urbanecm: deploying [13:10:34] zeljkof, ack [13:11:23] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:436524|Assign movefile to autoreviewrs and patrollers on zhwiki (T195247)]] (duration: 00m 52s) [13:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:28] T195247: Enable "File mover" flag on zh.wikipedia - https://phabricator.wikimedia.org/T195247 [13:11:38] Urbanecm: 436524 deployed [13:12:00] zeljkof, thx [13:12:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436310 (https://phabricator.wikimedia.org/T195905) (owner: 10Urbanecm) [13:13:10] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Access to usergroups for Marshall Miller - 
https://phabricator.wikimedia.org/T194550#4253706 (10Ottomata) Done, try now! Use your shell username (mmiller) and your wikitech/ldap password. [13:14:02] (03Merged) 10jenkins-bot: Temporarily enable MFMobileMainPageCss in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436310 (https://phabricator.wikimedia.org/T195905) (owner: 10Urbanecm) [13:14:33] (03PS2) 10Gehel: cassandra: cassandra-tools always make sense to be available [puppet] - 10https://gerrit.wikimedia.org/r/434462 [13:15:04] (03CR) 10jenkins-bot: Temporarily enable MFMobileMainPageCss in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436310 (https://phabricator.wikimedia.org/T195905) (owner: 10Urbanecm) [13:15:10] Urbanecm: 436310 is at mwdebug [13:15:36] (03PS2) 10Zfilipin: Set wgProofreadPagePageSeparator to '' for jawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436168 (https://phabricator.wikimedia.org/T195873) (owner: 10Urbanecm) [13:15:53] zeljkof, please deploy it. [13:15:59] (it==436310) [13:16:16] Urbanecm: deploying [13:16:58] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:436310|Temporarily enable MFMobileMainPageCss in ruwiki (T195905)]] (duration: 00m 50s) [13:17:00] Urbanecm: 436310 deployed [13:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:02] T195905: Temporarily enable MFMobileMainPageCss in Russian Wikipedia - https://phabricator.wikimedia.org/T195905 [13:17:15] zeljkof, ack [13:17:30] (03CR) 10Gehel: [C: 032] cassandra: cassandra-tools always make sense to be available [puppet] - 10https://gerrit.wikimedia.org/r/434462 (owner: 10Gehel) [13:18:16] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436168 (https://phabricator.wikimedia.org/T195873) (owner: 10Urbanecm) [13:19:53] (03Merged) 10jenkins-bot: Set wgProofreadPagePageSeparator to '' for jawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436168 
(https://phabricator.wikimedia.org/T195873) (owner: 10Urbanecm) [13:20:10] (03PS1) 10Alexandros Kosiaris: Bump lvm puppet module to 1.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/437241 [13:20:12] (03PS1) 10Alexandros Kosiaris: lvm: Always force vgremoval [puppet] - 10https://gerrit.wikimedia.org/r/437242 [13:20:40] Urbanecm: 436168 is at mwdebug [13:20:42] (03CR) 10jerkins-bot: [V: 04-1] Bump lvm puppet module to 1.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/437241 (owner: 10Alexandros Kosiaris) [13:20:56] (03CR) 10jerkins-bot: [V: 04-1] lvm: Always force vgremoval [puppet] - 10https://gerrit.wikimedia.org/r/437242 (owner: 10Alexandros Kosiaris) [13:21:11] zeljkof, ack [13:21:40] (03CR) 10jenkins-bot: Set wgProofreadPagePageSeparator to '' for jawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436168 (https://phabricator.wikimedia.org/T195873) (owner: 10Urbanecm) [13:22:17] (03PS2) 10Zfilipin: Set wgProofreadPagePageSeparator='' on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436169 (https://phabricator.wikimedia.org/T194875) (owner: 10Urbanecm) [13:22:21] (03CR) 10Alexandros Kosiaris: "This is upstream code, we should not be forcing our code conventions on it" [puppet] - 10https://gerrit.wikimedia.org/r/437241 (owner: 10Alexandros Kosiaris) [13:22:56] (03CR) 10Alexandros Kosiaris: "upstream code, no rubocop for this one" [puppet] - 10https://gerrit.wikimedia.org/r/437242 (owner: 10Alexandros Kosiaris) [13:25:12] zeljkof, working [13:26:10] Urbanecm: ok, deploying [13:27:11] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:436168|Set wgProofreadPagePageSeparator to empty string for jawikisource (T195873)]] (duration: 00m 49s) [13:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:16] T195873: Set ProofreadPage page separator on ja.source - https://phabricator.wikimedia.org/T195873 [13:28:21] (03PS5) 10Nehajha: Man page for webservice 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) [13:28:24] Urbanecm: 436168 is deployed [13:28:29] zeljkof, ack [13:29:07] (03PS1) 10Urbanecm: Initial configuration for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437245 (https://phabricator.wikimedia.org/T183706) [13:29:14] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436169 (https://phabricator.wikimedia.org/T194875) (owner: 10Urbanecm) [13:30:23] (03Merged) 10jenkins-bot: Set wgProofreadPagePageSeparator='' on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436169 (https://phabricator.wikimedia.org/T194875) (owner: 10Urbanecm) [13:30:54] (03PS3) 10Giuseppe Lavagetto: jobrunner_tls: generalize to support videoscalers as well [puppet] - 10https://gerrit.wikimedia.org/r/437223 (https://phabricator.wikimedia.org/T188947) [13:31:45] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner_tls: generalize to support videoscalers as well [puppet] - 10https://gerrit.wikimedia.org/r/437223 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [13:31:47] 10Operations, 10Traffic, 10Wikimedia-Site-requests, 10HTTPS: Wikimedia Hungary's website should use HTTPS - https://phabricator.wikimedia.org/T196368#4253761 (10Urbanecm) This domain is not controlled by Wikimedia Foundation, is it? 
[13:32:31] Urbanecm: 436169 is at mwdebug [13:32:34] ack [13:33:38] zeljkof, please deploy [13:33:43] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: add TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/437224 (https://phabricator.wikimedia.org/T188947) [13:33:48] Urbanecm: ok, deploying [13:33:49] (03CR) 10jenkins-bot: Set wgProofreadPagePageSeparator='' on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436169 (https://phabricator.wikimedia.org/T194875) (owner: 10Urbanecm) [13:33:52] 10Operations, 10Traffic, 10Wikimedia-Site-requests, 10HTTPS: Wikimedia Hungary's website should use HTTPS - https://phabricator.wikimedia.org/T196368#4253743 (10Vgutierrez) we don't control the domain AFAIK nor the server where is hosted (193.218.98.220 / dyna-220.sx5.cable.tolna.net) [13:33:52] ack [13:34:10] (03PS2) 10Zfilipin: Set $wgMetaNamespace to "Вікіцытатнік" on bewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436997 (https://phabricator.wikimedia.org/T196230) (owner: 10Urbanecm) [13:34:13] (03CR) 10jerkins-bot: [V: 04-1] role::mediawiki::videoscaler: add TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/437224 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [13:34:44] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:436169|Set wgProofreadPagePageSeparator to empty string on zhwikisource (T194875)]] (duration: 00m 49s) [13:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:50] T194875: Set ProofreadPage page separator on zh.source - https://phabricator.wikimedia.org/T194875 [13:35:03] Urbanecm: deployed [13:35:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436997 (https://phabricator.wikimedia.org/T196230) (owner: 10Urbanecm) [13:36:27] (03Merged) 10jenkins-bot: Set $wgMetaNamespace to "Вікіцытатнік" on bewikiquote [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/436997 (https://phabricator.wikimedia.org/T196230) (owner: 10Urbanecm) [13:37:17] ack [13:37:22] Urbanecm: 436997 is at mwdebug [13:37:40] ack [13:38:34] zeljkof, are you sure? [13:39:09] Urbanecm: let me check :) but I was pretty sure :) [13:39:30] !log Running populateExternallinksIndex60.php on group 2 for T59176. FYI: this will probably take until next Friday to complete. [13:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:34] T59176: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176 [13:40:14] Urbanecm: oops, my mistake, forgot to run scap at mwdebug :/ it's there now [13:41:28] zeljkof, ack [13:42:53] working [13:42:58] please deploy [13:43:07] Urbanecm: ok, deploying [13:43:44] ack [13:44:05] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:436997|Set $wgMetaNamespace to "Вікіцытатнік" on bewikiquote (T196230)]] (duration: 00m 49s) [13:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:09] T196230: Rename meta namespace on be.wikiquote.org - https://phabricator.wikimedia.org/T196230 [13:44:19] Urbanecm: it's deployed [13:44:24] ack [13:44:41] (03PS2) 10Gehel: wdqs: collect dropwizard and custom metrics for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/437236 [13:45:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436990 (https://phabricator.wikimedia.org/T196134) (owner: 10Urbanecm) [13:47:12] zeljkof, looks there is time, can I add one more patch? 
[13:47:13] (03Merged) 10jenkins-bot: Change bewikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436990 (https://phabricator.wikimedia.org/T196134) (owner: 10Urbanecm) [13:47:20] Urbanecm: sure [13:47:25] Ok, will add to the calendar [13:47:56] (03PS2) 10Marostegui: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437235 (https://phabricator.wikimedia.org/T190704) [13:48:11] Urbanecm: 436990 is at mwdebug [13:48:15] zeljkof, testing [13:48:48] zeljkof, please deploy [13:49:04] Note: The last two patches should go directly to production. [13:49:10] Urbanecm: ok, deploying [13:49:13] ack [13:49:32] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: add TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/437224 (https://phabricator.wikimedia.org/T188947) [13:49:48] !log zfilipin@deploy1001 Synchronized static/images/project-logos: SWAT: [[gerrit:436990|Change bewikiquote logo (T196134)]] (duration: 00m 49s) [13:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:53] T196134: Change bewikiquote logo - https://phabricator.wikimedia.org/T196134 [13:50:23] Urbanecm: deployed, purging [13:50:26] ack [13:50:44] (03CR) 10Ottomata: [C: 031] Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [13:51:29] Urbanecm: purged, please check [13:54:03] (03PS1) 10Giuseppe Lavagetto: Add fake private keys for videoscalers [labs/private] - 10https://gerrit.wikimedia.org/r/437247 [13:54:07] Urbanecm: I don't feel comfortable deploying 436988, I would like some of the experienced people to take a look at it first, like hashar or no_justification [13:54:35] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add fake private keys for videoscalers [labs/private] - 10https://gerrit.wikimedia.org/r/437247 (owner: 10Giuseppe Lavagetto) [13:55:29] zeljkof, to explain what it does, with no opposing of your decision: 
The files can be executed. My patch strips the bit that allows execution off the files. [13:56:04] Urbanecm: I understood that part, I'm not sure if that was a mistake or needed, I would not like to break stuff :/ [13:56:57] (03CR) 10Zfilipin: "This was scheduled for EU SWAT today, but I did not feel comfortable deploying it until somebody with more experience takes a look. I have" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436988 (https://phabricator.wikimedia.org/T196225) (owner: 10Urbanecm) [13:58:04] Ok, I'll get CR+1 from some other deployers and schedule it later. [13:58:10] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11360/mw1307.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/437224 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [13:58:15] From my side, I think you can close the window [13:58:19] Urbanecm: the same for 436994 [13:58:40] The other patch is depending on the first patch, system should not let you merge them out of order :) [13:58:48] So I understand it as well [13:58:51] Thank you for your work! [13:58:55] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11361/" [puppet] - 10https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [13:58:58] (03PS2) 10Ottomata: webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [13:59:00] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [13:59:00] please have at least one +1 from somebody with more experience [13:59:17] Urbanecm: thank you for deploying with #releng! ;) [13:59:30] Sure, will do. [13:59:39] Will you mark the patches as not done in calendar? 
Or should I? [14:00:11] Urbanecm: please you do it [14:00:36] Can I deploy db-eqiad.php then? :) [14:02:16] !log EU SWAT finished [14:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:23] marostegui: go ahead! :) [14:02:35] (03PS9) 10Ottomata: Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 [14:02:50] !log installing wireshark security updates [14:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:09] zeljkof: thanks! [14:03:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437235 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [14:04:31] (03Merged) 10jenkins-bot: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437235 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [14:05:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all sanitariums masters - T190704 (duration: 00m 49s) [14:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:50] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [14:06:21] (03CR) 10Ottomata: [C: 04-1] "Oh, I see what you did. 
The reason we have some of these listed in different places, is because the stuff in the statistics module is not" [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:09:03] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 365.28 seconds [14:09:03] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 366.19 seconds [14:09:18] that's probably the script anomie was mentioning earlier [14:09:30] 10Operations, 10Traffic: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4253860 (10Vgutierrez) p:05Triage>03Normal [14:09:59] !log Stop replication on all sanitarium masters to move labsdb1010 to another sanitarium host - T190704 [14:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:26] (03PS2) 10Giuseppe Lavagetto: conftool-data: Add missing data for videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/437225 (https://phabricator.wikimedia.org/T188947) [14:10:54] !log lithium:~# systemctl restart rsyslog.service [14:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:59] Yeah, at the moment the script is doing arwiki, which is on s7. [14:11:18] (03PS1) 10Ema: package_builder: disable lintian warning about ITP bug [puppet] - 10https://gerrit.wikimedia.org/r/437251 [14:12:20] ottomata: Curious what the best practice is for one-off consuming from kafka? Specifically, access to kafkacat. Should we install it on the webperf machines? Or use access to one of the stat machines, if so, which one/which group is recommended for staff needing EL/Kafka access? 
[14:13:54] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: Add missing data for videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/437225 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [14:14:22] (03CR) 10Alexandros Kosiaris: [C: 031] package_builder: disable lintian warning about ITP bug [puppet] - 10https://gerrit.wikimedia.org/r/437251 (owner: 10Ema) [14:15:12] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.080 second response time [14:16:11] (03CR) 10jenkins-bot: Set $wgMetaNamespace to "Вікіцытатнік" on bewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436997 (https://phabricator.wikimedia.org/T196230) (owner: 10Urbanecm) [14:16:16] (03CR) 10jenkins-bot: Change bewikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436990 (https://phabricator.wikimedia.org/T196134) (owner: 10Urbanecm) [14:16:21] (03CR) 10jenkins-bot: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437235 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [14:18:23] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [14:19:44] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=videoscaler,service=nginx,dc=eqiad [14:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:57] (03PS1) 10Alexandros Kosiaris: lvs: Make it explicit we use the kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/437254 [14:20:13] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.070 second response time [14:20:29] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=.*,service=mathoid,cluster=kubernetes,name=.* [14:20:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:18] (03CR) 10Muehlenhoff: [C: 031] package_builder: disable lintian warning about ITP bug [puppet] - 10https://gerrit.wikimedia.org/r/437251 (owner: 10Ema) [14:21:23] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: cluster=videoscaler,service=nginx,dc=codfw [14:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] (03PS2) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [14:21:46] Urbanecm: I have discovered phab badges recently... https://phabricator.wikimedia.org/people/badges/4747/ [14:21:52] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:21:52] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=videoscaler,service=nginx,dc=codfw,name=mw211.* [14:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:35] (03PS2) 10Ema: package_builder: disable lintian warning about ITP bug [puppet] - 10https://gerrit.wikimedia.org/r/437251 [14:23:09] (03CR) 10Ema: [C: 032] package_builder: disable lintian warning about ITP bug [puppet] - 10https://gerrit.wikimedia.org/r/437251 (owner: 10Ema) [14:24:23] (03PS1) 10Giuseppe Lavagetto: discovery: add videoscaler data [puppet] - 10https://gerrit.wikimedia.org/r/437258 [14:24:44] (03PS2) 10Giuseppe Lavagetto: discovery: add videoscaler data [puppet] - 10https://gerrit.wikimedia.org/r/437258 [14:25:30] (03CR) 10Giuseppe Lavagetto: [C: 032] discovery: add videoscaler data [puppet] - 10https://gerrit.wikimedia.org/r/437258 (owner: 10Giuseppe Lavagetto) [14:26:28] 10Operations, 10Traffic: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4253860 (10Jdforrester-WMF) Maybe combine the two so as to be 
something to give said IT admins something to go on? > Wikipedia is tightening its security measures,... [14:27:00] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=videoscaler,name=eqiad [14:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:56] !log installing wireshark security updates [14:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437263 [14:30:37] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4253943 (10Marostegui) [14:31:00] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231531 (10Marostegui) labsdb1010 was switched over to the new sanitarium hosts. [14:31:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437263 (owner: 10Marostegui) [14:32:31] (03PS2) 10Giuseppe Lavagetto: Add entries for the videoscaler VIP [dns] - 10https://gerrit.wikimedia.org/r/437210 (https://phabricator.wikimedia.org/T188947) [14:32:31] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.08 seconds [14:32:31] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.34 seconds [14:32:39] 10Operations, 10Traffic: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4253947 (10Vgutierrez) @Jdforrester-WMF the short message should be addressed to non-technical users on their language (if possible) but we will be also providing a... 
[14:32:48] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/437266 [14:32:55] (03PS2) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/437266 [14:33:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437263 (owner: 10Marostegui) [14:33:23] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437263 (owner: 10Marostegui) [14:33:40] (03CR) 10Marostegui: [C: 032] Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/437266 (owner: 10Marostegui) [14:33:51] (03CR) 10Giuseppe Lavagetto: [C: 032] Add entries for the videoscaler VIP [dns] - 10https://gerrit.wikimedia.org/r/437210 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [14:34:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all sanitariums masters - T190704 (duration: 00m 49s) [14:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:09] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [14:34:16] !log Reload haproxy on dbproxy1010 to repool labsdb1010 - T190704 [14:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:14] !log ladsgroup@terbium:~$ foreachwikiindblist large deleteAutoPatrolLogs.php --sleep 2 --check-old [14:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:41] !log ppchelko@deploy1001 Started restart [cpjobqueue/deploy@c6dc83d]: (no justification provided) [14:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:55] 10Operations, 10Traffic: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - 
https://phabricator.wikimedia.org/T196371#4253954 (10BBlack) English grammar nits: it would be `forward secret ciphers` (meaning "ciphers which have the property of forward secrecy"). But these terms "forwa... [14:41:18] (03PS1) 10Ema: Initial debianization [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/437268 (https://phabricator.wikimedia.org/T196355) [14:43:04] (03PS2) 10Giuseppe Lavagetto: lvs::configuration: Add configuration for the videoscaler service [puppet] - 10https://gerrit.wikimedia.org/r/437226 (https://phabricator.wikimedia.org/T188947) [14:44:06] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs::configuration: Add configuration for the videoscaler service [puppet] - 10https://gerrit.wikimedia.org/r/437226 (https://phabricator.wikimedia.org/T188947) (owner: 10Giuseppe Lavagetto) [14:44:18] (03PS2) 10Ema: Initial debianization [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/437268 (https://phabricator.wikimedia.org/T196355) [14:49:52] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [14:51:51] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [14:51:51] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tshark] [14:52:14] ^ that's me and will recover soonish [14:54:52] (03CR) 10BryanDavis: Read rcfile if it exists and parse arguments from it (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [14:54:52] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:55:52] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [14:56:22] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4253993 (10Nuria) Some docs for @Gilles regarding eventlogging access in hive: https://wikitech.wikimed... [14:56:51] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:57:47] (03CR) 10Nuria: [C: 031] profile::analytics::refinery::job:data_purge: fix webrequest datasource [puppet] - 10https://gerrit.wikimedia.org/r/437056 (owner: 10Elukey) [14:58:26] PROBLEM - puppet last run on labnodepool1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [15:00:55] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:01:55] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:03:15] RECOVERY - puppet last run on labnodepool1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:05:58] !log gzipped large rotated log files in analytics1003:/var/log/hive to clear icinga disk space warning [15:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:57] zeljkof, thanks! [15:09:28] <_joe_> !log performing rolling restart of authdns servers to pick up ip change for the videoscalers [15:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:25] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4254035 (10Joe) 05Open>03Resolved [15:11:04] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4024921 (10Joe) The LVS endpoint is now available at `videoscaler.discovery.wmnet` [15:14:03] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231989 (10Marostegui) [15:21:41] 10Operations, 10DBA, 10Performance-Team, 10Availability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#4254070 (10jcrespo) [15:21:47] 10Operations, 10DBA, 10Performance-Team, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review: Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#4254068 (10jcrespo) 05Open>03Resolved I am 
going to consider this resolved- testing was done, it is not enough... [15:22:21] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1001 threshold =0.1 breach: status: yellow, number_of_nodes: 32, unassigned_shards: 996, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3114, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 5, active_shar [15:22:21] r: 89.3521488133, active_shards: 8358, initializing_shards: 0, number_of_data_nodes: 32, delayed_unassigned_shards: 0 [15:22:39] hey [15:22:40] ^ looking... [15:22:42] ok [15:22:52] <_joe_> I'm here if needed [15:23:06] what's the impact? [15:23:12] no impact... [15:23:36] thats about 3 machines worth of shards, hmm [15:23:49] we can certainly run without 3 machines, it will still be fine. but curious [15:23:49] there were more shards than usual on the last 3 servers that were restarting, so we triggered our heuristic of number of unassigned shards [15:24:06] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378#4254084 (10jcrespo) [15:24:25] ebernhardson: yep, 3 nodes just restarted as part of the rolling restart of eqiad [15:24:35] gehel: ahh, that explains pretty well then :) [15:24:41] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 35, unassigned_shards: 788, number_of_pending_tasks: 253, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3114, task_max_waiting_in_queue_millis: 45525, cluster_name: production-search-eqiad, relocating_shards: 4, active_shards_percent_as_num [15:24:41] active_shards: 8549, initializing_shards: 17, number_of_data_nodes: 35, delayed_unassigned_shards: 0 [15:24:45] 
10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370#4254101 (10jcrespo) [15:24:47] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378#4254100 (10jcrespo) [15:25:08] we should probably lower the threshold on that alert [15:25:14] what is the threshold for unassigned shards out of curiosity? ~800? [15:25:32] 10Operations, 10DBA, 10Performance-Team, 10Availability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#4254103 (10jcrespo) [15:25:34] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378#4254084 (10jcrespo) [15:27:02] herron: 10% and we were at ~10.4% for a few minutes [15:27:10] (03CR) 10Elukey: "> Oh, I see what you did. The reason we have some of these listed in" [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [15:27:12] gotcha [15:28:04] restarting 3 nodes on a 35-node cluster should leave less than 10% of shards unassigned if they were evenly distributed, but... 
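A quick back-of-the-envelope check of the numbers in this exchange, as a minimal sketch rather than anything taken from the actual Icinga check. It assumes shards are spread perfectly evenly across nodes (which, per the discussion, they are not), takes the shard counts from the 15:22:21 alert text, and uses the 10% threshold quoted in the channel. The function and constant names are hypothetical and not part of any Wikimedia tooling.

```python
# Illustrative arithmetic only; names below are made up for this sketch.

ALERT_THRESHOLD_PCT = 10.0  # threshold quoted in the channel at 15:27:02


def evenly_distributed_unassigned_pct(restarted_nodes: int, total_nodes: int) -> float:
    """Share of shards that go unassigned if every node holds the same number of shards."""
    return 100.0 * restarted_nodes / total_nodes


def observed_unassigned_pct(unassigned: int, active: int, initializing: int = 0) -> float:
    """Share of shards that are unassigned, given counts from a cluster health report."""
    total = unassigned + active + initializing
    return 100.0 * unassigned / total


# 3 nodes restarting in a 35-node cluster, assuming a perfectly even spread.
expected = evenly_distributed_unassigned_pct(3, 35)

# Counts from the 15:22:21 alert: unassigned_shards 996, active_shards 8358,
# initializing_shards 0.
observed = observed_unassigned_pct(unassigned=996, active=8358)

print(f"expected if evenly distributed: {expected:.2f}%")
print(f"observed from alert counts:     {observed:.2f}%")
print(f"threshold breached: {observed > ALERT_THRESHOLD_PCT}")
```

With an even spread the restart would stay under the 10% threshold (about 8.6%), while the counts in the alert work out to a little over 10%, in the same ballpark as the figure quoted in the channel. That gap is what an uneven shard distribution would produce.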
[15:29:11] because we keep bumping into the disk thresholds various machines can't accept shards and we get the counts fairly unbalanced :)( [15:29:45] but because our shards vary from 50kb to 50GB the disk thresholds are hard to not hit :P [15:30:15] (03CR) 10Bstorm: [C: 032] ssh known_hosts: sort resources by certname [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [15:30:29] (03PS3) 10Bstorm: ssh known_hosts: sort resources by certname [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [15:30:31] 10Operations, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#4254110 (10Krinkle) Graphite's read-only API for metric values and meta views ( and ) are both open. And Grafana has a read-only version... [15:31:03] 10Operations, 10Security-Team, 10Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#4254111 (10Krinkle) [15:31:26] 10Operations, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#4254116 (10Krinkle) [15:31:32] 10Operations, 10Security-Team, 10Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#4254119 (10Krinkle) [15:32:02] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378#4254120 (10jcrespo) So this is the plan- use proxysql to perform connection pooling, only for cross-dc writes (read should always be loc... [15:32:29] (03PS1) 10Gehel: elasticsearch: raise unassigned shard alerting threshold [puppet] - 10https://gerrit.wikimedia.org/r/437274 [15:32:56] ebernhardson: ^ raising to 15% if that looks good to you too [15:32:57] (03CR) 10Ottomata: [C: 04-1] "Yeah, stat module is just crappy. 
If we wanted to unify like this, we would be making the profile::analytics::cluster::packages class LO" [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [15:33:32] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378#4254123 (10jcrespo) Work has been done already on puppetizing proxysql and creating a pakage for easy install, plus a small production t... [15:33:45] 10Operations, 10Traffic: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4254124 (10Vgutierrez) Long explanation: ``` We have removed support for non forward secret ciphers, specifically AES128-SHA, which your browser software relies on... [15:34:34] (03CR) 10EBernhardson: [C: 031] "looks like this alert pages (that's what critical = true does?) so 10% is way too close to normal operations. 15% should be pretty hard to" [puppet] - 10https://gerrit.wikimedia.org/r/437274 (owner: 10Gehel) [15:34:48] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4251148 (10Cmjohnson) this server is out of warranty. In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that. [15:35:03] (03PS2) 10Gehel: elasticsearch: raise unassigned shard alerting threshold [puppet] - 10https://gerrit.wikimedia.org/r/437274 [15:35:15] (03CR) 10Bstorm: [C: 032] "This seems to be working exactly as planned." 
[puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [15:35:40] (03CR) 10Gehel: "@EBernhardson: yes, critical => true means paging" [puppet] - 10https://gerrit.wikimedia.org/r/437274 (owner: 10Gehel) [15:35:44] (03CR) 10Gehel: [C: 032] elasticsearch: raise unassigned shard alerting threshold [puppet] - 10https://gerrit.wikimedia.org/r/437274 (owner: 10Gehel) [15:37:18] (03CR) 10Elukey: "> Yeah, stat module is just crappy. If we wanted to unify like this," [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [15:37:31] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4254133 (10Andrew) > In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that. You've done that on this exact server? If so, is it likely that a... [15:38:08] (03CR) 10Andrew Bogott: "Thank you, Alex and Brooke!" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [15:42:41] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4254150 (10Cmjohnson) Oh, then the issue is probably not related to the thermal paste. Was something else going on that would stress the CPU at that time? [15:54:47] akosiaris: FYI, I'm going to try to deploy the "drafttopic" model to the ORES cluster during the services window, 20:00 tonight. It's possible I'll need to reduce the number of workers, or that we'll see OOM errors. Should be fun either way. [16:00:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254230 (10Cmjohnson) @Marostegui I need to access the servers smart storage administrator which requires me to boot into during the post. When would be a good time for me to take the server down for 15... 
[16:00:58] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254231 (10Marostegui) @Cmjohnson I can depool the server for you tomorrow. Does that work? [16:01:59] (03PS1) 10Ppchelko: Specify videoscalers uri in hiera/changeprop manifest. [puppet] - 10https://gerrit.wikimedia.org/r/437281 (https://phabricator.wikimedia.org/T190327) [16:02:40] (03CR) 10jerkins-bot: [V: 04-1] Specify videoscalers uri in hiera/changeprop manifest. [puppet] - 10https://gerrit.wikimedia.org/r/437281 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:05:45] (03PS2) 10Ppchelko: Specify videoscalers uri in hiera/changeprop manifest. [puppet] - 10https://gerrit.wikimedia.org/r/437281 (https://phabricator.wikimedia.org/T190327) [16:08:32] (03CR) 10Ppchelko: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/11363/" [puppet] - 10https://gerrit.wikimedia.org/r/437281 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:10:04] (03CR) 10Imarlier: [C: 031] "Minor nit (spelling). Other than that, looks good." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:13:07] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254286 (10Cmjohnson) @marostegui that will work! 
Thanks [16:15:08] 10Operations, 10Packaging: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741#4254299 (10Imarlier) a:05MoritzMuehlenhoff>03Imarlier [16:16:02] 10Operations, 10Packaging: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729#4254312 (10Imarlier) a:05MoritzMuehlenhoff>03Imarlier [16:17:16] (03PS9) 10Krinkle: webperf: Add statsv, navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) [16:18:03] (03CR) 10Imarlier: [C: 031] webperf: Add statsv, navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:18:12] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4254340 (10Joe) >>! In T91820#4218746, @Krinkle wrote: > There are cases where a cookie doesn't work (specifically, for th... [16:21:50] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4254387 (10BBlack) Well, a potential lesser goal that involves fewer moving parts would just be to loadbalance non-session... 
[16:24:25] (03CR) 10Giuseppe Lavagetto: [C: 031] Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [16:24:53] _joe_: <3 [16:25:37] <_joe_> vgutierrez: you did all the work there, I basically did armchair reviewing :P [16:26:15] <_joe_> vgutierrez: next I'll try to take a look at how to manage the configuration from the API, then we can call pybal a cloud-native loadbalancer [16:28:36] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4254410 (10Joe) >>! In T91820#4254387, @BBlack wrote: > Well, a potential lesser goal that involves fewer moving parts wou... [16:29:15] 10Operations, 10WMF-Blog-Social-Team, 10Wikimedia-Mailing-lists: Request mailman list for upcoming affiliate campaign - https://phabricator.wikimedia.org/T196003#4254413 (10MelodyKramer) Thank you! @aubrie is going to take in on from here! ⚽ ⚽ [16:29:56] (03PS1) 10Ppchelko: Disable redis queue for videoscaler jobs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437286 (https://phabricator.wikimedia.org/T190327) [16:30:21] _joe_: damn.. you get all the cool stuff.. 
I need a way to steal it from you again :P [16:35:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:35:25] ema: ^ [16:35:53] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:35:55] seems to be upload-only, strangely [16:36:09] the esams alert I mean, now confirmed by the second alert heh [16:37:01] seems a spike only [16:37:42] although sustained for 2 minutes [16:39:25] esams only, it seems [16:42:32] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:43:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:48:47] XioNoX: cpu usage on cr1-esams seems unusually high? [16:49:33] looking [16:50:11] (03PS3) 10Ppchelko: Remove unused jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/436574 (https://phabricator.wikimedia.org/T190327) [16:53:24] ema: it's higher than previously but can't find anything wrong so far. Logs are quiet, processes are fine [17:00:04] gehel: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1700). [17:00:17] jouncebot: o/ [17:03:41] (03PS4) 10Ppchelko: Remove unused jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/436574 (https://phabricator.wikimedia.org/T190327) [17:08:09] (03PS4) 10RobH: Create new admin group with root access on WDQS test cluster [puppet] - 10https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: 10Muehlenhoff) [17:08:14] 10Operations, 10Discovery, 10SRE-Access-Requests, 10Wikidata, and 2 others: Stas needs root access on WDQS test cluster - https://phabricator.wikimedia.org/T195797#4254505 (10RobH) Please note this was approved in today's SRE team meeting. I'll go ahead and start merging @MoritzMuehlenhoff's patches. [17:08:53] (03CR) 10RobH: [C: 032] Create new admin group with root access on WDQS test cluster [puppet] - 10https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: 10Muehlenhoff) [17:09:54] (03PS5) 10Ppchelko: Remove unused jobrunners. 
[puppet] - 10https://gerrit.wikimedia.org/r/436574 (https://phabricator.wikimedia.org/T190327) [17:13:51] (03CR) 10Ppchelko: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/11367/" [puppet] - 10https://gerrit.wikimedia.org/r/436574 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [17:16:27] (03CR) 10Smalyshev: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/437188 (https://phabricator.wikimedia.org/T194653) (owner: 10Gehel) [17:16:46] (03PS2) 10Gehel: wdqs: cleanup declarration of blazegraph options [puppet] - 10https://gerrit.wikimedia.org/r/437187 (https://phabricator.wikimedia.org/T194653) [17:17:30] (03CR) 10Gehel: [C: 032] wdqs: cleanup declarration of blazegraph options [puppet] - 10https://gerrit.wikimedia.org/r/437187 (https://phabricator.wikimedia.org/T194653) (owner: 10Gehel) [17:17:47] (03PS2) 10Gehel: wdqs: reduce ban to a minimum on the internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/437188 (https://phabricator.wikimedia.org/T194653) [17:18:20] (03CR) 10Gehel: [C: 032] wdqs: reduce ban to a minimum on the internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/437188 (https://phabricator.wikimedia.org/T194653) (owner: 10Gehel) [17:18:25] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4234323 (10jcrespo) We can put labsdb1009 down, but for the future, we should install the utilities on the appropiate hosts- we shouldn't have to restart a server just to be able to change a disk. 
[17:18:39] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Deploy new meddo to test [17:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:47] (03PS1) 10RobH: adding smalyshev to wdqs-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/437293 (https://phabricator.wikimedia.org/T195797) [17:21:24] (03CR) 10RobH: [C: 032] adding smalyshev to wdqs-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/437293 (https://phabricator.wikimedia.org/T195797) (owner: 10RobH) [17:21:42] (03Abandoned) 10RobH: Add Stas to wdqs-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: 10Muehlenhoff) [17:22:04] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Deploy new meddo to test (duration: 03m 24s) [17:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:32] !log gehel@deploy1001 Started deploy [wdqs/wdqs@fd534fa]: WDQS: new GUI and blazegraph versions [17:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:45] 10Operations, 10Discovery, 10SRE-Access-Requests, 10Wikidata, and 2 others: Stas needs root access on WDQS test cluster - https://phabricator.wikimedia.org/T195797#4254585 (10RobH) 05Open>03Resolved a:03RobH These patches have been merged, and should be live shortly on hosts! [17:25:32] 10Operations, 10Discovery, 10SRE-Access-Requests, 10Wikidata, and 2 others: Stas needs root access on WDQS test cluster - https://phabricator.wikimedia.org/T195797#4254591 (10Gehel) @RobH : thanks! 
[17:27:09] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Deploy new meddo to test [17:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:03] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Deploy new meddo to test (duration: 02m 54s) [17:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:57] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@fd534fa]: WDQS: new GUI and blazegraph versions (duration: 08m 25s) [17:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:26] SMalyshev: deployment completed, tests are green [17:32:07] (03PS13) 10Imarlier: webperf: Make the different webperf roles explicit [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) [17:33:22] (03CR) 10Imarlier: "@Dzahn - Cumin alias change made." [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [17:35:14] (03CR) 10Smalyshev: [C: 031] wdqs: collect dropwizard and custom metrics for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/437236 (owner: 10Gehel) [17:37:10] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Disable configurations after v3view [17:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:36] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Disable configurations after v3view (duration: 00m 26s) [17:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:52] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254633 (10Marostegui) I thought about it but there are no deb packages: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_b256f556f71b41cf99c67fc608&swEnvOid=4004#tab1 We can probably use alie... 
[17:44:46] !log disabled SUL+wikitech 2FA for MarkAHershberger (T196370) [17:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:50] T196370: 2fa reset for MarkAHershberger - https://phabricator.wikimedia.org/T196370 [17:45:00] (03PS3) 10Gehel: wdqs: collect dropwizard and custom metrics for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/437236 [17:45:41] (03CR) 10Gehel: [C: 032] wdqs: collect dropwizard and custom metrics for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/437236 (owner: 10Gehel) [17:45:56] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4234323 (10MoritzMuehlenhoff) @Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already. [17:47:51] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Disable configurations after v3view [17:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:15] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Disable configurations after v3view (duration: 00m 24s) [17:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:59] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254650 (10Marostegui) Looks like it was easier than expected and I was able to extract the binary after converting the rpm to deb. I have run: ``` root@labsdb1009:/home/marostegui# ./hpssaducli -ssd -f... [17:49:25] (03CR) 10RobH: [C: 031] "This looks fine, but must await the 3 business day wait. I'll merge this Wednesday, 2018-06-06." [puppet] - 10https://gerrit.wikimedia.org/r/436860 (https://phabricator.wikimedia.org/T196192) (owner: 10Reedy) [17:50:05] (03CR) 10Dzahn: [C: 031] "thank you imarlier, perfect. 
i didn't mean to nitpick, this is for getting deb package upgrades based on role, merging it" [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [17:50:10] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254653 (10MoritzMuehlenhoff) >>! In T195690#4254646, @MoritzMuehlenhoff wrote: > @Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already. And the component is enabled... [17:51:00] marlier: about to merge that, k? [17:51:09] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Disable configurations after v3view [17:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:19] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4254655 (10Marostegui) >>! In T195690#4254653, @MoritzMuehlenhoff wrote: >>>! In T195690#4254646, @MoritzMuehlenhoff wrote: >> @Marostegui : hpssaducli is present in the thirdparty/hwraid component for s... [17:51:22] (03PS14) 10Dzahn: webperf: Make the different webperf roles explicit [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [17:51:33] mutante: great, go for it -- I'll keep an eye on the servers. [17:51:34] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Disable configurations after v3view (duration: 00m 25s) [17:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:41] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Enable osm-pbf source [17:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:46] marlier: starting a quick compiler run for good measure.. 
will be quick [17:53:06] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Enable osm-pbf source (duration: 00m 25s) [17:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:16] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11368/" [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [17:54:59] marlier: merged on puppet master, do you want to run the agents? [17:55:11] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4254658 (10Andrew) Sorry, @Cmjohnson, to be clear I was asking if you've already replaced the paste on this server, not saying that I think you have. [17:55:12] Will do [17:55:24] cool. and compiler output linked above. looks good [17:55:58] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Enable osm-intl source [17:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:24] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Enable osm-intl source (duration: 00m 25s) [17:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:38] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Access to usergroups for Marshall Miller - https://phabricator.wikimedia.org/T194550#4254662 (10MMiller_WMF) Thanks @ottomata -- it's working. [17:59:01] mutante: Did just what it was supposed to (no change on webperf1002, webperf1001 had its MOTD updated but that's all). [17:59:02] Thanks! [17:59:35] marlier: great! you're welcome [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Don't look at me like that. You signed up for it.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1800). [18:00:04] dmaza and bd808: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:28] here here [18:00:46] o/ [18:00:54] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Enable osm source [18:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:19] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Enable osm source (duration: 00m 25s) [18:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:39] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4251148 (10chasemp) A note on not-paging for our weekly meeting: https://phabricator.wikimedia.org/T152368#2849231 [18:06:30] (03PS6) 10Krinkle: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) [18:06:54] mutante: Could you approve this beta-only patch (already cherry-picked there) - https://gerrit.wikimedia.org/r/#/c/436586/ [18:07:31] dmaza: hmm... looks like we don't have any excited deployers for the swat window [18:07:39] yup :( [18:07:43] I was about to ask [18:08:51] technically I know how to do all of it, but I don't really have time for babysitting post deploy right now [18:09:51] I think my patch is pretty safe if you are up for it [18:11:31] Looks like pnorman is doing some deploying [18:13:28] (03CR) 1020after4: [C: 031] scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [18:13:39] thcipriani: do you have time to do a couple of small SWAT patches? 
[18:14:20] one is turning on a feature flag for itwiki and the other is a static html change for noc [18:14:22] uhh, sure! Give me a minute to get situated. [18:14:36] * bd808 is going to run and find food really quick [18:16:02] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:16:53] (03PS2) 10Thcipriani: Enable $wgCookieSetOnIpBlock on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121) (owner: 10Dmaza) [18:17:17] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121) (owner: 10Dmaza) [18:18:13] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [18:18:30] (03Merged) 10jenkins-bot: Enable $wgCookieSetOnIpBlock on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121) (owner: 10Dmaza) [18:19:16] dmaza: ^ is live on mwdebug1002, check please [18:19:29] (03CR) 10jenkins-bot: Enable $wgCookieSetOnIpBlock on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121) (owner: 10Dmaza) [18:19:31] thanks.. checking now [18:21:26] (03PS2) 10Thcipriani: Fix wrong link to Server Admin Log on noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430879 (https://phabricator.wikimedia.org/T193848) (owner: 10Aklapper) [18:25:30] thcipriani: I don't think it is working. mwdebug1002 right ? [18:25:59] dmaza: yep, lemme double check that it made it over [18:26:24] thanks [18:27:14] hrm, yep, seems like it should be there: https://phabricator.wikimedia.org/P7212 [18:27:53] uughh.. weird.. 
give me a minute [18:27:58] (03CR) 10Ottomata: [C: 031] profile::kafka::burrow: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429457 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [18:28:38] * thcipriani touches IS.php for good measure [18:30:59] !log pnorman@deploy1001 Started deploy [tilerator/deploy@074d01a] (cleartables): Redeploy to test [18:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:24] !log pnorman@deploy1001 Finished deploy [tilerator/deploy@074d01a] (cleartables): Redeploy to test (duration: 00m 25s) [18:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:38] !log otto@deploy1001 Started deploy [eventlogging/eventbus@3a5c395]: T196077 [18:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:45] T196077: EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077 [18:32:55] (03PS4) 10Herron: scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [18:35:16] !log deploying eventlogging-service-eventbus for T196077 [18:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:45] (03CR) 10Herron: [C: 032] "Great! Thanks for the reviews. Moving forward with this now" [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [18:36:45] !log otto@deploy1001 Finished deploy [eventlogging/eventbus@3a5c395]: T196077 (duration: 04m 07s) [18:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:58] (03PS3) 10Herron: profile::kafka::burrow: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429457 (https://phabricator.wikimedia.org/T175361) [18:40:32] thcipriani: I have no idea why this isn't working.
Feel free to roll it back [18:40:50] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4254816 (10Cmjohnson) @andrew sorry, I did misunderstand. I need to purchase more thermal paste. Can this wait a few days? [18:41:35] dmaza: ok, rolling back, sorry problems weren't obvious :( [18:42:03] no worries. Thank you [18:42:12] (03CR) 10Herron: [C: 032] profile::kafka::burrow: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429457 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [18:43:42] (03PS1) 10Thcipriani: Revert "Enable $wgCookieSetOnIpBlock on itwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437296 [18:44:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437296 (owner: 10Thcipriani) [18:44:48] !log bouncing kafka on kafka2003 to test T196077 [18:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:53] T196077: EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077 [18:45:13] (03Merged) 10jenkins-bot: Revert "Enable $wgCookieSetOnIpBlock on itwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437296 (owner: 10Thcipriani) [18:45:52] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430879 (https://phabricator.wikimedia.org/T193848) (owner: 10Aklapper) [18:47:28] (03Merged) 10jenkins-bot: Fix wrong link to Server Admin Log on noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430879 (https://phabricator.wikimedia.org/T193848) (owner: 10Aklapper) [18:47:49] (03PS1) 10Pnorman: Remove duplicate osm2pgsql parameter [puppet] - 10https://gerrit.wikimedia.org/r/437297 (https://phabricator.wikimedia.org/T194106) [18:52:46] !log thcipriani@deploy1001 Synchronized docroot/noc/index.html: SWAT: [[gerrit:430879|Fix wrong link to Server Admin Log on 
noc.wikimedia.org]] T193848 (duration: 00m 50s) [18:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:50] T193848: https://wikitech.wikimedia.org/view/ no longer redirects to /wiki - https://phabricator.wikimedia.org/T193848 [18:53:24] bd808: ^ live now, thanks for poking that patch [18:54:34] thcipriani: thanks for deploying. Looks like it may take a bit for the misc cache to forget the old version, but with a ?foo cache buster I see the correct link. [18:55:30] cool, thanks for checking :) [19:00:32] !log bouncing kafka2003 again to test T196077 with python-kafka 1.4.3 [19:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:37] T196077: EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077 [19:01:50] (03CR) 10jenkins-bot: Revert "Enable $wgCookieSetOnIpBlock on itwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437296 (owner: 10Thcipriani) [19:01:55] (03CR) 10jenkins-bot: Fix wrong link to Server Admin Log on noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430879 (https://phabricator.wikimedia.org/T193848) (owner: 10Aklapper) [19:04:45] !log otto@deploy1001 Started restart [eventlogging/eventbus@3a5c395]: bouncing eventbus after upgrading to python-kafka 1.4.3 for T196077 [19:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:12] PROBLEM - Host labtestneutron2002 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:17] (03PS7) 10Dzahn: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [19:05:21] (03CR) 10Gehel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/437297 (https://phabricator.wikimedia.org/T194106) (owner: 10Pnorman) [19:05:29] (03CR) 10Dzahn: [C: 032] deployment-prep: add webperf to scap::dsh::groups [puppet] - 
10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [19:05:38] (03PS1) 10Eevans: cassandra: update configuration for 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/437298 (https://phabricator.wikimedia.org/T178905) [19:05:42] PROBLEM - Host labtestneutron2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:08] (03PS2) 10Gehel: Remove duplicate osm2pgsql parameter [puppet] - 10https://gerrit.wikimedia.org/r/437297 (https://phabricator.wikimedia.org/T194106) (owner: 10Pnorman) [19:06:14] Krinkle: yep, done [19:06:32] RECOVERY - Host labtestneutron2002 is UP: PING OK - Packet loss = 0%, RTA = 37.12 ms [19:06:42] RECOVERY - Host labtestneutron2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [19:06:44] (03CR) 10Gehel: [C: 032] Remove duplicate osm2pgsql parameter [puppet] - 10https://gerrit.wikimedia.org/r/437297 (https://phabricator.wikimedia.org/T194106) (owner: 10Pnorman) [19:06:45] !log bouncing kafka2003 one more time for T196077 [19:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:49] T196077: EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077 [19:09:37] (03CR) 10Krinkle: [C: 04-1] Move scap::sources from role::deployment_server to common [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: 10Krinkle) [19:10:08] (03CR) 10Eevans: "[PC output](http://puppet-compiler.wmflabs.org/11371)" [puppet] - 10https://gerrit.wikimedia.org/r/437298 (https://phabricator.wikimedia.org/T178905) (owner: 10Eevans) [19:10:27] (03PS1) 10Kaldari: Deploy page creation log to Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437299 (https://phabricator.wikimedia.org/T196400) [19:10:29] !log elasticsearch cluster restart on eqiad completed - T193734 [19:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:33] T193734: 
Move Serbian language wikis from extra-analysis to extra-analysis-serbian plugin - https://phabricator.wikimedia.org/T193734 [19:10:38] (03PS2) 10Eevans: cassandra: update configuration for 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/437298 (https://phabricator.wikimedia.org/T178905) [19:13:16] (03PS2) 10Kaldari: Enable page creation log on Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437299 (https://phabricator.wikimedia.org/T196400) [19:17:41] (03PS1) 10Dzahn: phabricator: add role to node phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) [19:18:02] (03PS2) 10Dzahn: phabricator: add role to node phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) [19:28:03] marlier: so xhgui is already on webperf1002/2002 but there remains some work to remove it from tungsten, is that right? [19:32:05] mutante: that's right. [19:32:38] All of the data is stored in a mongo instance, we need to migrate that over to the new hosts, then tell the collectors where to send it. [19:33:04] Probably going to be a couple of weeks, because of offsite and annual reviews, unfortunately. [19:34:37] marlier: gotcha :) was just looking at the checkboxes on the ticket [19:34:48] 'tis all good [19:47:06] (03PS1) 10Pnorman: Increase number of osm2pgsql processes to 8\nThis goes along with changes to the maps/loading documentation to boost max_connections during import [puppet] - 10https://gerrit.wikimedia.org/r/437301 [19:51:14] (03CR) 10Merlijn van Deen: [C: 031] "The code looks sane to me." [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [19:53:54] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4255057 (10Andrew) > Can this wait a few days? Sure.
[19:57:50] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10PHP 7.0 support, 10Patch-For-Review: php-memcached 3.0 (PHP 7) incompatible with BagOStuff - https://phabricator.wikimedia.org/T196125#4255065 (10aaron) a:03aaron [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T2000). [20:01:04] I'm doing a slightly exciting deployment for ORES today. [20:01:24] "oh good" [20:01:27] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248#4255076 (10Imarlier) [20:01:31] in what way? [20:01:32] :) [20:03:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Performance-Team (Radar): Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#4255104 (10Imarlier) [20:04:18] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248#4255119 (10Krinkle) p:05Triage>03Low [20:05:39] (03CR) 10Paladox: [C: 031] phabricator: add role to node phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [20:05:46] A new type of classifier, "draft topic", is landing on the production cluster. It's the first to rely on * LFS and the first using large (several GB), in-memory word embeddings. If our current pre-fork memory allocation strategy holds steady, this won't risk using up the 20-ish-GB headroom available on each box. [20:06:31] But there is a small risk of OOM, so I'll probably start with a canary, in this case ores2001 I suppose [20:07:08] LFS has been a bumpy road, but the risk there is simply that the first deployment to each box might not succeed. 
[20:09:07] This isn't ideal obviously, and TODO [20:09:41] greg-g: ^ [20:09:54] (03PS6) 10Merlijn van Deen: Do not connect to SQL server for a dry run [puppet] - 10https://gerrit.wikimedia.org/r/432532 [20:09:56] (03PS4) 10Merlijn van Deen: labs/db: create basic integration test for maintain-meta_p [puppet] - 10https://gerrit.wikimedia.org/r/432698 [20:09:58] (03PS4) 10Merlijn van Deen: labs/db: maintain-meta_p restructuring [puppet] - 10https://gerrit.wikimedia.org/r/434323 [20:10:58] awight: ack [20:12:28] (03CR) 10BryanDavis: "> What happens in case of HSTS when a POST is made to a http url?" [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [20:13:22] greg-g: I've had this on two production canaries before now, then rolled back after an hour or so. [20:14:13] Just in case you mean the yucky "ack" rather than tcp:ack [20:21:09] (03CR) 10Eevans: [C: 031] cassandra: update configuration for 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/437298 (https://phabricator.wikimedia.org/T178905) (owner: 10Eevans) [20:23:09] !log arlolra@deploy1001 Started deploy [parsoid/deploy@828034c]: Updating Parsoid to bd5a840 [20:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:43] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248#4255150 (10BBlack) Speaking for the big unified certs we get from commercial vendors: we generally do wait ~24h (usually longer?) , between the issue date of new major... 
[20:27:15] !log andrew@deploy1001 Started deploy [horizon/deploy@12aa2d3]: fix for T192179 [20:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:20] T192179: Horizon should not show 'Launch' button on 'Images' page for non-project admins - https://phabricator.wikimedia.org/T192179 [20:28:54] awight: tcp :) [20:30:13] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248#4255184 (10BBlack) As for the rest, especially with the one-offs using LetsEncrypt scripting today, we definitely don't have this kind of resiliency, or any kind of dep... [20:30:50] !log andrew@deploy1001 Finished deploy [horizon/deploy@12aa2d3]: fix for T192179 (duration: 03m 35s) [20:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:40] !log awight@deploy1001 Started deploy [ores/deploy@d77e52c]: ores2001 canary of drafttopic; T176336 [20:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:44] T176336: Deploy drafttopic model to production ORES - https://phabricator.wikimedia.org/T176336 [20:33:24] * awight gives greg-g an extra OSI layer 4 hi-five [20:34:03] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@828034c]: Updating Parsoid to bd5a840 (duration: 10m 54s) [20:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:14] !log awight@deploy1001 Finished deploy [ores/deploy@d77e52c]: ores2001 canary of drafttopic; T176336 (duration: 02m 35s) [20:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:21] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@276ea43]: Update mobileapps to f579f0d [20:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:39] lots of deploys in parallel! 
:) [20:36:10] subbu spots the singularity before anyone [20:36:52] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:37:02] PROBLEM - ores on ores2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.073 second response time [20:37:54] this ease of deploy is one thing i am going to miss when parsoid gets integrated into core .. [20:40:10] !log awight@deploy1001 Started deploy [ores/deploy@d77e52c]: l ores2001.codfw.wmnet ores2001 canary of drafttopic; T176336 (take 2 after init'ing LFS) [20:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:15] T176336: Deploy drafttopic model to production ORES - https://phabricator.wikimedia.org/T176336 [20:40:15] !log awight@deploy1001 Started deploy [ores/deploy@d77e52c]: ores2001 canary of drafttopic; T176336 (take 2 after init'ing LFS)f [20:40:16] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@276ea43]: Update mobileapps to f579f0d (duration: 05m 54s) [20:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:38] !log awight@deploy1001 Finished deploy [ores/deploy@d77e52c]: ores2001 canary of drafttopic; T176336 (take 2 after init'ing LFS)f (duration: 00m 22s) [20:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:05] (03CR) 10Alex Monk: "One thing I should probably write down explicitly about this is that while Cumin and Puppetboard hosts are let through to this service by " [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [20:48:39] !log awight@deploy1001 Started deploy [ores/deploy@65e979f]: ores2001 canary of drafttopic; T176336 (take 3 after bumping revision) [20:48:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:43] T176336: Deploy drafttopic model to production ORES - https://phabricator.wikimedia.org/T176336 [20:50:26] !log awight@deploy1001 Finished deploy [ores/deploy@65e979f]: ores2001 canary of drafttopic; T176336 (take 3 after bumping revision) (duration: 01m 47s) [20:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:58] 10Operations, 10Dumps-Generation, 10Wikimedia-log-errors: High rate of "Memcached error .. CONNECTION FAILURE" on snapshot hosts - https://phabricator.wikimedia.org/T196303#4255217 (10ArielGlenn) Adding a quick note of some things that I looked into, all without any fruitful results: I ran /usr/bin/php7.0 /... [20:57:32] !log awight@deploy1001 Started deploy [ores/deploy@bf182e2]: roll back ores2001 [20:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:43] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational [20:58:43] !log awight@deploy1001 Finished deploy [ores/deploy@bf182e2]: roll back ores2001 (duration: 01m 11s) [20:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:02] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 3691 bytes in 0.095 second response time [20:59:13] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is WARNING: Test Transform wikitext to html responds with unexpected body: h2 id=HeadingHeading/h2 != /^h2.* Heading \/h2/: /en.w [20:59:13] e/media/{title}{/revision} (Get media in test page) is WARNING: Test Get media in test page responds with unexpected value at path /items[2] = Missing keys: [utitles, uthumbnail, ulicense] [21:00:04] bawolff and 
Reedy: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T2100). [21:00:36] greg-g: ores is back to normal, I couldn't deploy the thing. [21:01:20] awight: :/ [21:03:51] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#4255246 (10awight) [21:05:02] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4255248 (10Krinkle) [21:07:07] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#4255270 (10awight) [21:07:40] !log pnorman@deploy1001 Started deploy [kartotherian/deploy@d8dcba3] (cleartables): Redeploy kartotherian to test [21:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:02] !log pnorman@deploy1001 Finished deploy [kartotherian/deploy@d8dcba3] (cleartables): Redeploy kartotherian to test (duration: 00m 21s) [21:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:43] RECOVERY - kartotherian endpoints health on maps-test2004 is OK: All endpoints are healthy [21:26:40] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4255350 (10Johan) a:03Johan The message should also clearly state that this means they won't be able to access Wikipedia in the future (or won't be... [21:29:56] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#4255372 (10awight) I tried to do a deployment to ores2001 today, running three passes thus: 1. 
Deploy the master code (d77e52c) in case LFS works... [21:46:35] (03PS1) 10Krinkle: monitoring: Remove unused 'graphite_anomaly' command [puppet] - 10https://gerrit.wikimedia.org/r/437365 [21:53:05] (03PS1) 10Andrew Bogott: Horizon: Remove 'Labs' from page title [puppet] - 10https://gerrit.wikimedia.org/r/437367 (https://phabricator.wikimedia.org/T196199) [21:57:57] (03PS3) 10Andrew Bogott: horizon: fix Horizion title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (https://phabricator.wikimedia.org/T196199) (owner: 10Chico Venancio) [21:59:06] (03Abandoned) 10Andrew Bogott: Horizon: Remove 'Labs' from page title [puppet] - 10https://gerrit.wikimedia.org/r/437367 (https://phabricator.wikimedia.org/T196199) (owner: 10Andrew Bogott) [21:59:41] I'm getting rather low speeds downloading a file on a production server, using webproxy.eqiad.wmnet:8080 [22:00:57] ~40kB/s, and I get ~1 MB/s from home. In the past it's been faster. Are there any known issues? [22:02:50] (03PS4) 10Chico Venancio: horizon: fix Horizon title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (https://phabricator.wikimedia.org/T196199) [22:16:34] 10Operations, 10ops-codfw, 10fundraising-tech-ops: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417#4255460 (10Jgreen) [22:19:22] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [22:21:10] 10Operations, 10ops-eqiad, 10Traffic: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923#4255477 (10BBlack) [22:27:29] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4255518 (10Krenair) I looked at the puppetmaster apache config and noticed this line: ``` # If Apache complains about invalid signature... 
[22:33:54] (PS3) Krinkle: Move scap::sources from role::deployment_server to common [puppet] - https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675)
[22:34:11] (CR) jerkins-bot: [V: -1] Move scap::sources from role::deployment_server to common [puppet] - https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: Krinkle)
[22:35:27] !log pnorman@deploy1001 Started deploy [kartotherian/deploy@a588bf4] (cleartables): Deploy var name parameters
[22:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:48] !log pnorman@deploy1001 Finished deploy [kartotherian/deploy@a588bf4] (cleartables): Deploy var name parameters (duration: 00m 21s)
[22:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:27] Operations, Gerrit, ORES, Patch-For-Review, Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4255531 (awight) Closing, our plan is simple: * Get deployment working with a single LFS file, in the `submodules/assets` directory....
[22:38:56] Operations, ORES, Scap, Scoring-platform-team, Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#4255536 (awight)
[22:39:05] Operations, Gerrit, ORES, Patch-For-Review, Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4255532 (awight) Open>Resolved a:awight
[22:41:33] Operations, Packaging, Scap, Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4255542 (awight) Confirmed that git-lfs is installed on deploy1001, removing blocking task dependency.
[22:42:11] Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4255546 (awight)
[22:42:15] Operations, Packaging, Scap, Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4255544 (awight)
[22:48:30] Operations, Dumps-Generation, Wikimedia-log-errors: High rate of "Memcached error .. CONNECTION FAILURE" on snapshot hosts - https://phabricator.wikimedia.org/T196303#4255551 (Krinkle) @ArielGlenn I don't have sudo for `dumpsgen`, but I do for `mwdeploy` on snapshot hosts, and can trigger the error w...
[22:57:28] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@825863f]: Potential mitigation for T194325
[22:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:33] T194325: Unrecognized subject messages on Updater - https://phabricator.wikimedia.org/T194325
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T2300).
[23:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:11:58] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@825863f]: Potential mitigation for T194325
[23:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:03] T194325: Unrecognized subject messages on Updater - https://phabricator.wikimedia.org/T194325
[23:21:17] \o
[23:21:37] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@825863f]: Potential mitigation for T194325 (duration: 09m 39s)
[23:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:42] T194325: Unrecognized subject messages on Updater - https://phabricator.wikimedia.org/T194325
[23:26:12] I can SWAT
[23:26:40] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/437299 (https://phabricator.wikimedia.org/T196400) (owner: Kaldari)
[23:26:50] thanks!
[23:27:53] (Merged) jenkins-bot: Enable page creation log on Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/437299 (https://phabricator.wikimedia.org/T196400) (owner: Kaldari)
[23:29:02] (CR) jenkins-bot: Enable page creation log on Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/437299 (https://phabricator.wikimedia.org/T196400) (owner: Kaldari)
[23:29:10] kaldari: your patch is pulled over to mwdebug1002, check please
[23:30:20] ok...
[23:31:45] testing...
[23:33:22] okie doke
[23:33:46] thcipriani: hmm, doesn't seem to be functioning.
[23:35:25] that's strange. I can see the config on mwdebug1002. I just re-touched the file there.
[23:35:51] something similar happened today during "morning" SWAT :\
[23:36:19] I forget, how do I check that my mwdebug plug-in is working?
[23:36:31] https://phabricator.wikimedia.org/P7214
[23:37:26] should be a server header in the response, not sure if that's how you're supposed to check, but that's how I've always checked
[23:38:04] looks correct: mwdebug1002.eqiad.wmnet
[23:39:02] thcipriani: hmm, gimme a sec to investigate some more...
[23:39:40] sure
[23:39:48] FWIW: https://phabricator.wikimedia.org/P7214#41591
[23:47:10] thcipriani: well, I don't see anything wrong, but it doesn't seem to be doing anything. Would you mind syncing the config change just I can test on the actual Test Wiki back-end. If it still doesn't work, I'll give you a patch to revert it.
[23:48:03] kaldari: sure, I can try that, I'm suspicious myself since we had a similar situation in morning swat
[23:50:14] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:437299|Enable page creation log on Test Wikipedia]] T196400 (duration: 00m 50s)
[23:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:19] T196400: Deploy new page creation log - https://phabricator.wikimedia.org/T196400
[23:50:20] ^ kaldari should be everywhere now
[23:50:27] OK, checking...
[23:52:10] thcipriani: well, it definitely doesn't work :P
[23:52:17] (PS1) Kaldari: Revert "Enable page creation log on Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/437376
[23:52:18] :)
[23:52:29] Here's the revert: https://gerrit.wikimedia.org/r/#/c/437376/
[23:52:40] thanks
[23:52:50] thcipriani: Thanks for the assistance anyway!
[23:53:37] no problem! sorry the patch didn't work out :(
[23:53:45] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/437376 (owner: Kaldari)
[23:55:06] (Merged) jenkins-bot: Revert "Enable page creation log on Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/437376 (owner: Kaldari)
[23:57:52] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:437376|Revert "Enable page creation log on Test Wikipedia"]] T196400 (duration: 00m 49s)
[23:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:56] T196400: Deploy new page creation log - https://phabricator.wikimedia.org/T196400
[23:59:11] (CR) jenkins-bot: Revert "Enable page creation log on Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/437376 (owner: Kaldari)