[00:03:00] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:49] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[00:39:29] (PS1) Andrew Bogott: Add records for wikitech-static-ord [dns] - https://gerrit.wikimedia.org/r/356124 (https://phabricator.wikimedia.org/T164271)
[00:41:01] (CR) Andrew Bogott: [C: 2] Add records for wikitech-static-ord [dns] - https://gerrit.wikimedia.org/r/356124 (https://phabricator.wikimedia.org/T164271) (owner: Andrew Bogott)
[00:49:59] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:50:01] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:50:59] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[00:51:50] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[01:12:30] PROBLEM - HHVM rendering on mw2149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:19] RECOVERY - HHVM rendering on mw2149 is OK: HTTP OK: HTTP/1.1 200 OK - 73334 bytes in 0.255 second response time
[01:26:59] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[01:28:09] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:59] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[01:29:00] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:29:49] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[01:33:00] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[02:22:27] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 08m 22s)
[02:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:36] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 54s)
[02:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:20] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 30 02:49:20 UTC 2017 (duration 6m 44s)
[02:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:29] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:03:19] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[04:09:39] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1956.90 Read Requests/Sec=2446.40 Write Requests/Sec=5.50 KBytes Read/Sec=36570.40 KBytes_Written/Sec=2578.00
[04:16:43] (CR) BryanDavis: "A few nits inline, but overall it looks good." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) (owner: Andrew Bogott)
[04:17:28] (CR) BryanDavis: [C: 1] Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) (owner: Andrew Bogott)
[04:18:39] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=15.90 Read Requests/Sec=0.10 Write Requests/Sec=2.70 KBytes Read/Sec=0.80 KBytes_Written/Sec=58.00
[04:18:40] (CR) BryanDavis: [C: 1] Horizon sudo panel: Add policy checks [puppet] - https://gerrit.wikimedia.org/r/355459 (owner: Andrew Bogott)
[04:24:43] (PS2) BryanDavis: bd808's dotfiles [puppet] - https://gerrit.wikimedia.org/r/353937
[04:25:51] (CR) jerkins-bot: [V: -1] bd808's dotfiles [puppet] - https://gerrit.wikimedia.org/r/353937 (owner: BryanDavis)
[04:28:27] (CR) BryanDavis: logstash - start using elasticsearch-curator for indices cleanup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[04:38:59] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[04:40:59] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1002 is OK: OK ferm input default policy is set
[05:55:09] Operations, ops-eqiad, DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3299396 (Marostegui)
[05:55:44] Operations, ops-eqiad, DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3298788 (Marostegui) @Cmjohnson once you get the replacement BBU from HP, let us know as we need to depool this host before shutting it down. Thanks!
[06:23:58] !log Deploy alter table on s3 dbstore2001 - T166278
[06:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:08] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278
[06:29:30] (PS1) Marostegui: db-eqiad.php: Repool db1084, depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/356133 (https://phabricator.wikimedia.org/T166206)
[06:30:41] (PS2) Marostegui: db-eqiad.php: Repool db1084, depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/356133 (https://phabricator.wikimedia.org/T166206)
[06:32:35] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1084, depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/356133 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[06:34:02] (Merged) jenkins-bot: db-eqiad.php: Repool db1084, depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/356133 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[06:34:10] (PS2) Giuseppe Lavagetto: ChangeProp: Add Redis/Nutcracker connection info [puppet] - https://gerrit.wikimedia.org/r/356072 (https://phabricator.wikimedia.org/T161710) (owner: Mobrovac)
[06:35:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084, depool db1081 - T166206 (duration: 00m 59s)
[06:35:49] (CR) jenkins-bot: db-eqiad.php: Repool db1084, depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/356133 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[06:35:50] !log Deploy alter table s4 - db1081 - https://phabricator.wikimedia.org/T166206
[06:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:52] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[06:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] (PS9) Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933)
[06:41:24] !log Deploy alter table on s3 dbstore1002 - https://phabricator.wikimedia.org/T166278
[06:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:34] !log Deploy alter table on s3 db1038 - T166278
[06:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:43] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278
[06:48:55] (CR) Muehlenhoff: [C: 1] "Looks good, the kernel module names are identical on 3.13 to 4.9" [puppet] - https://gerrit.wikimedia.org/r/356118 (owner: Faidon Liambotis)
[06:49:49] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[07:00:17] (CR) Muehlenhoff: [C: 1] "The fix to address https://puppet.com/security/cve/cve-2017-2295 has been deployed on all the puppet masters: https://github.com/puppetlab" [puppet] - https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: Volans)
[07:03:19] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.23 seconds
[07:09:06] !log Deploy alter table on enwiki.revision on db1047 - T166452
[07:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:15] T166452: db1047 has been restarted - needs another restart - https://phabricator.wikimedia.org/T166452
[07:17:41] (PS1) Giuseppe Lavagetto: Reduce TTL of etcd conftool entries [dns] - https://gerrit.wikimedia.org/r/356135
[07:17:43] (PS1) Giuseppe Lavagetto: etcd: switch writes to eqiad [dns] - https://gerrit.wikimedia.org/r/356136
[07:17:45] (PS1) Giuseppe Lavagetto: Restore 5M TTL for conftool SRV records [dns] - https://gerrit.wikimedia.org/r/356137
[07:18:49] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[07:20:15] (PS1) Giuseppe Lavagetto: etcd: set codfw to read-only too. [puppet] - https://gerrit.wikimedia.org/r/356138
[07:20:17] (PS1) Giuseppe Lavagetto: etcd: enable replication eqiad => codfw [puppet] - https://gerrit.wikimedia.org/r/356139
[07:31:32] (CR) Muehlenhoff: "This looks really great, I'm looking forward to get rid of mysqld_safe! I've added a few remarks." (6 comments) [software] - https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[07:38:34] !log wdqs1002 back in LVS - T166524
[07:38:35] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1002.eqiad.wmnet
[07:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:42] T166524: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524
[07:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:49] Operations, Discovery, Wikidata, Wikidata-Query-Service: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3299450 (Gehel) @RobH: from racktables, it looks like wdqs1002 is 4.5 years old (purchase date = 2012-12-05, same as wdqs1001 - other servers are newer). I'm not s...
[07:43:05] Operations, Analytics, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3299453 (elukey)
[07:44:11] (CR) Gehel: [C: 2] elasticsearch - ignore some warnings related to 5.3.2 upgrade [puppet] - https://gerrit.wikimedia.org/r/356079 (https://phabricator.wikimedia.org/T163708) (owner: Gehel)
[07:48:34] Operations, Analytics, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3299460 (elukey) On analytics1033: ``` sudo megacli -AdpBbuCmd -a0 elukey@analytics1033:~$ sudo megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Volt...
[07:48:54] Operations, ops-eqiad, Analytics, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3299462 (elukey)
[07:49:58] (CR) Gehel: logstash - start using elasticsearch-curator for indices cleanup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[07:54:05] nuria_: hey :)
[07:55:44] (CR) DCausse: [C: 1] elasticsearch - correct naming of curator config files [puppet] - https://gerrit.wikimedia.org/r/356052 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[07:56:19] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2893.47 seconds
[07:56:30] I silenced it…strange
[07:56:33] db1047 is me
[08:02:27] (CR) DCausse: [C: -1] elasticsearch - configure logging for elasticsearch-curator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[08:06:19] (PS2) Gehel: elasticsearch - configure logging for elasticsearch-curator [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154)
[08:06:43] (CR) Gehel: elasticsearch - configure logging for elasticsearch-curator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[08:09:19] PROBLEM - DPKG on osmium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:10:26] ^ that's fine
[08:11:19] RECOVERY - DPKG on osmium is OK: All packages OK
[08:17:29] !log restart kafka on kafka1018 for jvm upgrades
[08:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:01] (CR) DCausse: [C: 1] "lgtm, one nitpick but feel free to ignore" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[08:19:23] (PS3) Gehel: elasticsearch - configure logging for elasticsearch-curator [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154)
[08:19:32] (CR) Gehel: elasticsearch - configure logging for elasticsearch-curator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[08:19:51] (CR) DCausse: [C: 1] elasticsearch - configure logging for elasticsearch-curator [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[08:23:31] !log restart jmxtrans on all the kafka brokers (analytics+main-codfw/eqiad) for jvm upgrades
[08:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:59] (CR) Jcrespo: "All your comments seems wise to me, I will give them a proper look." [software] - https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[08:51:39] (CR) Muehlenhoff: "I didn't think of dbstore, so I wasn't aware that the fd limit was dependant on the mariadb role, that makes total sense. My recommendatio" [software] - https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[09:05:32] (PS2) Filippo Giunchedi: Don't symlink systemd service instances [puppet] - https://gerrit.wikimedia.org/r/356038 (https://phabricator.wikimedia.org/T166389)
[09:06:34] (CR) Ema: CLI: add -i/--interactive option (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/354442 (https://phabricator.wikimedia.org/T165838) (owner: Volans)
[09:08:02] (CR) Volans: "Some general (and some also optional) comments on the structure. I'll review the eventlogging_cleaner.py script later. The ones marked wit" (22 comments) [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[09:08:33] only 22 comments! No -2, I am lucky today :P
[09:08:34] (CR) Filippo Giunchedi: [C: 2] Don't symlink systemd service instances [puppet] - https://gerrit.wikimedia.org/r/356038 (https://phabricator.wikimedia.org/T166389) (owner: Filippo Giunchedi)
[09:08:38] thanks volans!
[09:09:02] elukey: see the part "I'll review the eventlogging_cleaner.py script later" :-P
[09:09:15] hahahahaha
[09:09:21] 22 comments only for the rest?
[09:12:41] typo: only 22 comment for the rest :-P
[09:15:15] Operations, HHVM, Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3299560 (MoritzMuehlenhoff) Open>Resolved This is resolved in 3.18.2+dfsg-1+wmf3 and 3.18.2+dfsg-1+wmf4, all the hosts migrated to 3.18 are using that version, so closing.
[09:15:17] Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3299562 (MoritzMuehlenhoff)
[09:18:09] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 30 probes of 448 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[09:18:21] (PS11) Muehlenhoff: contint: skip hhvm experimental pin on Trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[09:19:12] (CR) Ema: [C: 1] "Nice!" [software/cumin] - https://gerrit.wikimedia.org/r/354637 (https://phabricator.wikimedia.org/T165842) (owner: Volans)
[09:23:09] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 448 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[09:23:47] (CR) Muehlenhoff: [C: 2] contint: skip hhvm experimental pin on Trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[09:24:55] (CR) Ema: [C: 1] "Looks great except for that one mildly obscure def h() thing. Thanks!" [software/cumin] - https://gerrit.wikimedia.org/r/354442 (https://phabricator.wikimedia.org/T165838) (owner: Volans)
[09:25:44] Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-05-23_(1.30.0-wmf.2)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3299607 (Gilles)
[09:27:08] (PS4) Gehel: elasticsearch - configure logging for elasticsearch-curator [puppet] - https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154)
[10:15:04] (PS12) Paladox: contint: skip hhvm experimental pin on Trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462)
[10:18:15] !log stopping and backing up db2048 in preparation for reimage
[10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:29] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter]
[10:19:40] that's me ^
[10:20:22] (PS1) Alexandros Kosiaris: calico-node: Enable IPv6 support [puppet] - https://gerrit.wikimedia.org/r/356158
[10:20:29] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[10:20:56] parallel port? but what if I need to install my 90's printer to one of the servers!
[10:26:30] (PS2) Alexandros Kosiaris: calico-node: Enable IPv6 support [puppet] - https://gerrit.wikimedia.org/r/356158
[10:26:36] (CR) Alexandros Kosiaris: [V: 2 C: 2] calico-node: Enable IPv6 support [puppet] - https://gerrit.wikimedia.org/r/356158 (owner: Alexandros Kosiaris)
[10:29:33] Operations, HHVM: HHVM segfault with mediawiki/core / Scribunto - https://phabricator.wikimedia.org/T166550#3299771 (hashar)
[10:32:00] !log enable calico IPv6 BGP peering for cr1-eqiad
[10:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:48] Operations, HHVM: HHVM segfault with mediawiki/core / Scribunto - https://phabricator.wikimedia.org/T166550#3299785 (hashar) That is a random trace of doom. My HHVM segfault at various place so will probably just abandon this one.
[10:43:46] (PS2) Filippo Giunchedi: tlsproxy: add support to change max_body_size [puppet] - https://gerrit.wikimedia.org/r/356155 (https://phabricator.wikimedia.org/T166482)
[10:44:37] !log run refreshFileHeaders for group 0 wikis on Terbium
[10:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:09] (CR) Filippo Giunchedi: [C: 2] tlsproxy: add support to change max_body_size [puppet] - https://gerrit.wikimedia.org/r/356155 (https://phabricator.wikimedia.org/T166482) (owner: Filippo Giunchedi)
[10:46:40] (PS2) Filippo Giunchedi: hieradata: set max_body_size for swift::proxy [puppet] - https://gerrit.wikimedia.org/r/356157 (https://phabricator.wikimedia.org/T166482)
[10:47:57] (CR) Filippo Giunchedi: [C: 2] hieradata: set max_body_size for swift::proxy [puppet] - https://gerrit.wikimedia.org/r/356157 (https://phabricator.wikimedia.org/T166482) (owner: Filippo Giunchedi)
[10:52:51] Operations, Patch-For-Review: Error while enabling symlinked units on stretch systemd - https://phabricator.wikimedia.org/T166389#3299806 (fgiunchedi) Open>Resolved a:fgiunchedi All done, verified on jessie/stretch that systemctl still DTRTs
[10:56:47] (PS1) Giuseppe Lavagetto: Minor fixes to the build, add patches/series [calico-containers] - https://gerrit.wikimedia.org/r/356161
[10:57:29] (CR) Jcrespo: "I added more questions, not necessarily just for Moritz, but for anyone that can weight in or knows more about my questions." (6 comments) [software] - https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[10:58:56] Operations, DC-Ops, Discovery, Interactive-Sprint: Decide what to do with maps-test cluster - https://phabricator.wikimedia.org/T158982#3299811 (Deskana) Open>Resolved This hasn't been touched for three months now. It therefore seems that the status quo of leaving the boxes there is the p...
[11:01:12] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Minor fixes to the build, add patches/series [calico-containers] - https://gerrit.wikimedia.org/r/356161 (owner: Giuseppe Lavagetto)
[11:01:40] (CR) Jcrespo: "I will also compare it more to the debian-shipped one and get the differences. I wonder if we should leave in the galera stuff, in case we" [software] - https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[11:01:40] Operations, TimedMediaHandler, media-storage, Patch-For-Review: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3299816 (fgiunchedi) Open>Resolved a:fgiunchedi @revent I see for example https://commons.wikimedia.org/wik...
[11:03:13] Operations, TimedMediaHandler, media-storage, Patch-For-Review: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3299822 (Revent) @fgiunchedi Thanks for jumping onto this so quickly. I'll start working on resetting the others, and l...
[11:10:52] (PS1) Ema: prometheus: add hwmon collector to default set [puppet] - https://gerrit.wikimedia.org/r/356163 (https://phabricator.wikimedia.org/T125205)
[11:11:05] (CR) Filippo Giunchedi: "LGTM overall, a couple of questions" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[11:11:38] Operations, TimedMediaHandler, media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3297326 (TheDJ)
[11:11:46] (CR) Ema: [C: -1] "Note that we need to upgrade node_exporter on the Ubuntu hosts before merging this." [puppet] - https://gerrit.wikimedia.org/r/356163 (https://phabricator.wikimedia.org/T125205) (owner: Ema)
[11:13:49] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active
[11:14:07] !log upgrade grafana to 4.3.1 on krypton
[11:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:09] (CR) Filippo Giunchedi: [C: 1] Also strip rpcbind/nfs-common deps on jessie installs [puppet] - https://gerrit.wikimedia.org/r/354190 (https://phabricator.wikimedia.org/T106477) (owner: Muehlenhoff)
[11:16:57] Operations, Patch-For-Review: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3299844 (Volans) @akosiaris @Joe @faidon I've changed to `stringify_facts = false` my labs project and this are the different facts: - `system_uptime`: ```lang=json # Stringifi...
[11:20:14] (PS1) Aude: Enable Wikibase notifications for Wikipedias (except enwiki, dewiki, frwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/356165 (https://phabricator.wikimedia.org/T142103)
[11:20:49] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 14, down: 2, shutdown: 0
[11:20:52] (CR) Mforns: [WIP] Add the eventlogging_cleaner script and base package (2 comments) [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[11:25:22] (PS1) Muehlenhoff: Add Kafka main brokers to network constants [puppet] - https://gerrit.wikimedia.org/r/356167
[11:29:35] (PS2) Aude: Enable Wikibase notifications for Wikipedias (except enwiki, dewiki, frwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/356165 (https://phabricator.wikimedia.org/T142102)
[11:31:24] !log installing fop security updates
[11:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:46] (CR) Gehel: logstash - start using elasticsearch-curator for indices cleanup (3 comments) [puppet] - https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[11:35:06] (PS2) Gehel: logstash - start using elasticsearch-curator for indices cleanup [puppet] - https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154)
[11:35:15] (PS1) Aude: Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678)
[11:38:16] (PS2) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449
[11:39:43] (CR) jerkins-bot: [V: -1] role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (owner: Elukey)
[11:42:16] copy/pasta fail
[11:47:29] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[11:47:29] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[11:48:04] !log Rename update table on enwiki on db1089 host - T139342
[11:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:12] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342
[11:48:19] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:48:19] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[11:50:48] Operations: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3299882 (Joe)
[11:51:10] Operations, User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3299894 (Joe) p:Triage>Normal a:Joe
[11:51:57] (PS3) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449
[12:01:56] !log installing jbig2dec security updates
[12:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:00] (PS4) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449
[12:07:06] !log restart kafka on kafka1012 for jvm upgrades
[12:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:05] !log installin jbig2dec security updates
[12:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:29] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3299928 (Gehel) Taking wdqs1002 out of LVS seems to have given it sufficient breathing space to catch up on replication. I add...
[12:30:54] I am running KEYS on rdb2003 to get some info, should be ok but if an alarm fires it is me
[12:36:19] PROBLEM - Host elastic2007 is DOWN: PING CRITICAL - Packet loss = 100%
[12:39:19] RECOVERY - Host elastic2007 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms
[12:42:25] (CR) Muehlenhoff: dbtools: Update package for stretch and include systemd support (4 comments) [software] - https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[12:43:59] !log restart kafka on kafka200[123] for jvm upgrades (main-codfw, eventbus)
[12:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:22] (PS2) Muehlenhoff: Also strip rpcbind/nfs-common deps on jessie installs [puppet] - https://gerrit.wikimedia.org/r/354190 (https://phabricator.wikimedia.org/T106477)
[12:55:24] (CR) Muehlenhoff: [C: 2] Also strip rpcbind/nfs-common deps on jessie installs [puppet] - https://gerrit.wikimedia.org/r/354190 (https://phabricator.wikimedia.org/T106477) (owner: Muehlenhoff)
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T1300). Please do the needful.
[13:00:04] aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:05:48] (CR) Hashar: [C: -1] "Luasandbox crashes with HHVM 3.18 T165043" [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[13:09:32] (CR) Elukey: [C: 1] "LGTM! Checked the IPs and the look correct." [puppet] - https://gerrit.wikimedia.org/r/356167 (owner: Muehlenhoff)
[13:13:24] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3299978 (Gehel) a:Gehel
[13:14:40] !log upgrade prometheus-node-exporter to 0.14.0~git20170523-0 on ubuntu systems
[13:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:01] (PS1) Alexandros Kosiaris: calico: Organize the required per DC hieradata [puppet] - https://gerrit.wikimedia.org/r/356182
[13:15:25] (CR) Alexandros Kosiaris: [V: 2 C: 2] calico: Organize the required per DC hieradata [puppet] - https://gerrit.wikimedia.org/r/356182 (owner: Alexandros Kosiaris)
[13:21:51] !log restart kafka on kafka1001 for jvm upgrades
[13:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:05] ah, swat
[13:22:21] i might move mine to evening swat
[13:23:10] or maybe now
[13:23:49] (CR) Aude: [C: 2] Enable Wikibase notifications for Wikipedias (except enwiki, dewiki, frwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/356165 (https://phabricator.wikimedia.org/T142102) (owner: Aude)
[13:25:04] (Merged) jenkins-bot: Enable Wikibase notifications for Wikipedias (except enwiki, dewiki, frwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/356165 (https://phabricator.wikimedia.org/T142102) (owner: Aude)
[13:25:41] (CR) jenkins-bot: Enable Wikibase notifications for Wikipedias (except enwiki, dewiki, frwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/356165 (https://phabricator.wikimedia.org/T142102) (owner: Aude)
[13:26:35] aude: forgot about swat, you are deploying your changes?
[13:26:45] yeah
[13:27:01] checking them on mwdebug
[13:29:03] (PS1) Muehlenhoff: Add Kafka analytics brokers to network constants [puppet] - https://gerrit.wikimedia.org/r/356183
[13:32:34] (PS10) Andrew Bogott: Horizon: Add sudo policy panel [puppet] - https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097)
[13:32:36] (PS7) Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097)
[13:32:38] (PS7) Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - https://gerrit.wikimedia.org/r/355459
[13:34:08] looks ok
[13:35:42] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Enable Wikibase echo notifications on Wikipedia, except enwiki, dewiki, frwiki T142102 (duration: 00m 42s)
[13:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:52] T142102: [Story] Deploy Wikibase notifications to Wikimedia projects - https://phabricator.wikimedia.org/T142102
[13:35:54] \O/
[13:36:44] still looks ok
[13:36:48] * aude has one more patch
[13:37:17] (CR) Aude: [C: 2] Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678) (owner: Aude)
[13:39:21] (CR) Ema: "node_exporter upgraded everywhere" [puppet] - https://gerrit.wikimedia.org/r/356163 (https://phabricator.wikimedia.org/T125205) (owner: Ema)
[13:41:00] (PS2) Aude: Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678)
[13:41:05] (CR) Aude: Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678) (owner: Aude)
[13:41:07] (CR) Aude: [C: 2] Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678) (owner: Aude)
[13:42:17] (Merged) jenkins-bot: Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678) (owner: Aude)
[13:42:28] (CR) jenkins-bot: Set wgPageImagesAPIDefaultLicense to 'any' for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/356171 (https://phabricator.wikimedia.org/T159678) (owner: Aude)
[13:44:30] !log restart kafka on kafka1013 for jvm upgrades
[13:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:03] checking on mwdebug again
[13:47:37] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Set wgPageImagesAPIDefaultLicense for wikidata (duration: 00m 41s)
[13:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:54] checking again
[13:49:47] done :)
[13:55:11] (CR) Elukey: [C: 1] "LGTM (checked all the IPs)" [puppet] - https://gerrit.wikimedia.org/r/356183 (owner: Muehlenhoff)
[13:55:25] Operations, Performance-Team, Traffic, Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3300123 (BBlack) stalled>Resolved a:BBlack >>! In T147569#3294214, @Gilles wrote: > There is an apparent performance improvement t...
[13:55:29] Operations, Monitoring, Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3300126 (ema) >>! In T125205#3043749, @Dzahn wrote: > check_ipmi_sensor has been installed across the fleet but doesn't work. > > running it with options for temperature makes it exi...
[13:55:31] 06Operations, 05MW-1.30-release-notes, 10Traffic, 07HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3105285 (10Gilles) @aaron you mentioned in your weekly notes that this had a minor performance effect. Positive or negative? How much are we talking about? [13:58:28] !log updating mw2240-mw2242, mw2254-mw2260 to HHVM 3.18 [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:08] (03PS3) 10BBlack: interface-rps: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/355787 [14:00:21] (03CR) 10BBlack: [V: 032 C: 032] interface-rps: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/355787 (owner: 10BBlack) [14:00:41] (03PS3) 10BBlack: interface-rps: clean up typing/format-string issues [puppet] - 10https://gerrit.wikimedia.org/r/355788 [14:00:46] (03CR) 10BBlack: [V: 032 C: 032] interface-rps: clean up typing/format-string issues [puppet] - 10https://gerrit.wikimedia.org/r/355788 (owner: 10BBlack) [14:00:54] (03CR) 10BBlack: [V: 032 C: 032] interface-rps: refactor opts handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/355789 (owner: 10BBlack) [14:01:01] (03PS3) 10BBlack: interface-rps: refactor opts handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/355789 [14:01:03] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: add hwmon collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/356163 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:01:05] (03CR) 10BBlack: [V: 032 C: 032] interface-rps: refactor opts handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/355789 (owner: 10BBlack) [14:01:42] bblack: I was looking into rewriting the whole interface module, btw :) [14:01:52] (03PS5) 10BBlack: interface-rps: optional NUMA awareness [puppet] - 10https://gerrit.wikimedia.org/r/355790 [14:02:00] (03CR) 10BBlack: [V: 032 C: 032] interface-rps: optional NUMA awareness [puppet] - 
10https://gerrit.wikimedia.org/r/355790 (owner: 10BBlack) [14:02:19] paravoid: yeah it needs a lot of cleanup :) [14:02:34] yeah, but also the way we're doing things needs to fundamentally change [14:02:38] I was looking at all that last night [14:02:45] I hope I can get another quiet slot like that one soon.. [14:02:47] the augeas stuff? [14:03:23] <_joe_> elukey: what clusterfuck is hiera for zookeeper [14:03:34] <_joe_> I didn't realize it was so fucked up [14:04:16] _joe_ I tried to make a compromise but I am not sure if I made the right choices or not [14:04:29] <_joe_> elukey: no the original sin is not yours [14:04:41] paravoid: we're moving towards static config of v4+v6 in /e/n/i that's set up from the installer, right? [14:04:42] <_joe_> you were just not aggressive enough in moving things [14:05:13] maybe :) [14:05:18] it's complicated [14:05:39] I was looking at d-i's source yesterday, it can't do static v4 and v6, we'll have to generate /e/n/i ourselves [14:05:54] and I was thinking about moving everything to /etc/network/interfaces.d/, separate file per interface+afi [14:06:05] the biggest functional problem I have with the interface module is the augeas mess with changing parameters [14:06:14] and then perhaps erb templating each file of those, and make rps/txqlen etc. 
parameters [14:06:20] of a single interface stanza [14:06:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See inline comments, and also:" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354449 (owner: 10Elukey) [14:06:23] (03Draft1) 10Paladox: Zuul: Update zuul-merger.systemd.erb to run in the background with a pid [puppet] - 10https://gerrit.wikimedia.org/r/356185 [14:06:25] get rid of all of the augeas stuff entirely [14:06:26] (03PS2) 10Paladox: Zuul: Update zuul-merger.systemd.erb to run in the background with a pid [puppet] - 10https://gerrit.wikimedia.org/r/356185 [14:06:32] (03PS3) 10Paladox: Zuul: Update zuul-merger.systemd.erb to run in the background with a pid [puppet] - 10https://gerrit.wikimedia.org/r/356185 [14:06:35] that would be nice [14:06:54] and then probably converting the rps stuff into an /etc/network/if-up.d script as well [14:07:01] I think the txqlen is already one, I don't remember [14:07:09] but basically push some of the logic there [14:07:09] e.g. part of the above refactorings of interface-rps was to get rid of its parameters other than interface-name, to avoid the duplicate-lines-from-augeas problem [14:07:15] (when changing a minor parameter) [14:07:28] and then write a structured fact (once we have them) for interface -> driver [14:07:34] and get rid of the bnx2x parameter [14:07:37] (03PS4) 10Paladox: Zuul: Update zuul-merger.systemd.erb to run in the background with a pid [puppet] - 10https://gerrit.wikimedia.org/r/356185 [14:08:24] the one for LVS you mean? 
[14:08:58] (03PS5) 10Paladox: Zuul: Update zuul-merger.systemd.erb to run in the background with a pid [puppet] - 10https://gerrit.wikimedia.org/r/356185 [14:09:53] (03PS3) 10Giuseppe Lavagetto: ChangeProp: Add Redis/Nutcracker connection info [puppet] - 10https://gerrit.wikimedia.org/r/356072 (https://phabricator.wikimedia.org/T161710) (owner: 10Mobrovac) [14:10:23] yes [14:12:12] (03PS6) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [14:12:49] paravoid: technically the caches need one too, but since they're all bnx2x presently, we just assume it (we could do the same for LVS once lvs1001-6 is gone). [14:13:17] and then we'll probably face the same issue in some future year when we decide to swap bnx2x for some intel variant or whatever [14:13:36] but there's no point abstracting things we don't have to in the present. the future can take care of itself when it gets here :P [14:13:44] no, my idea was [14:13:58] $facts['interface_drivers']['eth0'] = 'bnx2x' [14:14:00] (for example) [14:14:03] and then do [14:14:12] (03PS2) 10Ema: prometheus: add hwmon collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/356163 (https://phabricator.wikimedia.org/T125205) [14:14:17] if $facts['interface_drivers'][$interface] == 'bnx2x' { ... } [14:14:24] instead of if $bnx2x { ... 
} [14:14:24] (03CR) 10Ema: [V: 032 C: 032] prometheus: add hwmon collector to default set [puppet] - 10https://gerrit.wikimedia.org/r/356163 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [14:14:37] in modules/lvs/manifests/interface_tweaks.pp that is [14:14:49] you can populate the drivers from sysfs, it's fairly trivial [14:15:06] interface-rps.py does that already [14:15:24] os.readlink('/sys/class/net/%s/device/driver/module' % device) [14:16:03] anyway, that's a minor detail [14:16:14] basically I have a loose collections of various things to improve [14:16:40] with the ultimate goals of making things simpler, less dependent on augeas and modifying things on the system (more declarative), and enabling IPv6 by default [14:17:02] but I have to fix these step by step and in both existing systems and new systems, so it can get a little tricky [14:26:24] 06Operations, 10Citoid, 10VisualEditor, 06Services (doing), 15User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3300237 (10Mvolz) [14:27:16] <_joe_> !log restarting squid on aluminium. [14:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:13] 06Operations, 10Citoid, 10VisualEditor, 06Services (doing), 15User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3300257 (10Joe) Actually it was a dumb comment - the log I pasted clearly reported TCP_MISS/302, so I'm not su... 
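The sysfs lookup quoted in the discussion above (the os.readlink call attributed to interface-rps.py) could feed the proposed interface -> driver fact roughly as follows. This is a minimal sketch, not the actual fact or the interface-rps.py code: the function names and the sysfs_root parameter (added so the logic can be run against a fake tree) are assumptions made for illustration.

```python
import os


def interface_driver(device, sysfs_root="/sys/class/net"):
    """Return the kernel module name behind `device`, or None if there is none."""
    link = os.path.join(sysfs_root, device, "device", "driver", "module")
    try:
        target = os.readlink(link)
    except OSError:
        # Virtual interfaces such as lo have no device/driver/module symlink.
        return None
    # The symlink points at .../module/<driver>; keep only the last component.
    return os.path.basename(target)


def interface_drivers(sysfs_root="/sys/class/net"):
    """Build a {interface: driver} mapping (hypothetical shape for the fact)."""
    drivers = {}
    for device in sorted(os.listdir(sysfs_root)):
        driver = interface_driver(device, sysfs_root)
        if driver is not None:
            drivers[device] = driver
    return drivers
```

With a mapping like this, the check sketched in the chat becomes a dictionary lookup per interface instead of a per-role bnx2x flag.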
[14:28:57] (03PS1) 10Muehlenhoff: Add Hadoop masters to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356188 [14:32:21] (03PS1) 10Muehlenhoff: Add Druid hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356189 [14:33:12] 06Operations, 10Citoid, 10VisualEditor, 06Services (doing), 15User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3300269 (10Joe) In fact, I suppose the problem is our proxy IP in eqiad has been banned. From the proxy machin... [14:33:44] (03PS4) 10Giuseppe Lavagetto: ChangeProp: Add Redis/Nutcracker connection info [puppet] - 10https://gerrit.wikimedia.org/r/356072 (https://phabricator.wikimedia.org/T161710) (owner: 10Mobrovac) [14:33:52] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] ChangeProp: Add Redis/Nutcracker connection info [puppet] - 10https://gerrit.wikimedia.org/r/356072 (https://phabricator.wikimedia.org/T161710) (owner: 10Mobrovac) [14:36:42] (03CR) 10jerkins-bot: [V: 04-1] Add Druid hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356189 (owner: 10Muehlenhoff) [14:38:29] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3300275 (10Cmjohnson) Support cases for both ms-be1019 and 1020 have been opened with HPE Your case was successfully submitted. Please note your Case ID: 532010484... 
[14:42:43] (03PS2) 10Tjones: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) [14:44:08] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3300289 (10akosiaris) [14:44:10] 06Operations, 05Goal, 13Patch-For-Review, 07kubernetes: Assigning IP space for kubernetes IPs - https://phabricator.wikimedia.org/T165732#3300287 (10akosiaris) 05Open>03Resolved a:03akosiaris [14:44:34] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3150620 (10akosiaris) [14:44:48] 06Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3300292 (10Cmjohnson) a support case has been opened with HPE Your case was successfully submitted. Please note your Case ID: 5320105305 for future reference. [14:45:02] (03PS5) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [14:45:09] 06Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3300294 (10Marostegui) Thanks! [14:45:49] 06Operations, 05Goal, 07kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3300299 (10akosiaris) [14:45:51] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3150620 (10akosiaris) 05Open>03Resolved a:03akosiaris All of the above has been done and most of it has been tracked in the subtasks. Today I 've finished setting up BGP on `cr1-codfw` and... 
[14:46:09] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3300301 (10Papaul) disk wipe in progress [14:46:18] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300302 (10Cmjohnson) @ottomata: is there a better time this week or do you push it out to next week? Also, whatever we change this out with will probably not last l... [14:47:03] (03PS2) 10Muehlenhoff: Add Druid hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356189 [14:47:32] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3300305 (10akosiaris) I think this can be resolved. The kubernetes workers (kubernetes200{1,2,3,4} for which this task for are up and running. [14:47:39] (03PS6) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [14:47:53] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3300306 (10Cmjohnson) @volans and @Marostegui I can do this as soon as you give me the word go but keep in mind this is only going to be temporary. the bbu's... 
[14:48:29] !log updating mw2140-mw2147, mw2251-mw2253 to HHVM 3.18 [14:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:08] (03PS7) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [14:50:42] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3300324 (10Papaul) [14:50:45] 06Operations, 10ops-codfw, 06cloud-services-team, 10netops: codfw: labtestvirt2002 swith port configuration - https://phabricator.wikimedia.org/T166564#3300312 (10Papaul) [14:52:29] (03CR) 10Elukey: [C: 04-1] "This will probably not work since zookeeper_cluster_name variables are not shared anymore across multiple roles" [puppet] - 10https://gerrit.wikimedia.org/r/354449 (owner: 10Elukey) [14:53:09] <_joe_> !log failing citoid over to codfw, T165105 [14:53:15] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3300328 (10RobH) [14:53:16] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3300327 (10RobH) 05Open>03Resolved [14:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:18] T165105: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105 [14:53:39] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3300333 (10Marostegui) Thanks Chris, I will have this ready for tomorrow so we can do it tomorrow if that works for you? We are aware that this will happen ag... [14:54:48] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3300351 (10Cmjohnson) Great! ping when I can do the swap. 
[14:55:44] (03PS8) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [14:55:49] (03PS3) 10Ema: monitoring/base: add NRPE command to check temperature [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [14:56:29] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3300359 (10Papaul) [14:57:03] (03PS4) 10Ema: monitoring/base: add NRPE command to check temperature [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [14:58:46] (03CR) 10Elukey: [C: 031] Add Hadoop masters to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356188 (owner: 10Muehlenhoff) [15:02:41] 06Operations, 10Monitoring, 13Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3300370 (10ema) @Dzahn I've amended you patch by calling check_ipmi_sensor with `-ST Temperature` as mentioned above. I've also removed the `if >= jessie` guard since freeipmi seems to... [15:03:02] !log installing mysql-connector-java security update on analytics1031 [15:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:06] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3300388 (10Papaul) @chasemp I have already the server setup for HW RAID 10 what partman recipe do you want to use for the server? We have : - raid10-gpt.cfg - raid10-gpt-srv-ext4... [15:08:29] 06Operations, 13Patch-For-Review: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3300390 (10Volans) >>! In T166372#3299844, @Volans wrote: > Then a few diffs that are **labs-only**: > - I now have an `ec2_metadata` fact that I was not getting before > - Getting... 
[15:14:38] !log installing bash security updates on trusty (jessie already fixed) [15:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:00] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2108594 [15:33:22] !log Deploy alter table on s3.revision on labsdb1009 - T166278 [15:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:32] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [15:33:58] (03CR) 10Hashar: [C: 031] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [15:38:04] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, modulo nitpick on cron requires but non-blocking" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: 10Gehel) [15:41:01] 06Operations, 06Multimedia, 10UploadWizard, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3297644 (10matmarex) [15:42:27] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3297644 (10matmarex) (Per your description, this seems to affect all methods of up... [15:42:46] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3297644 (10fgiunchedi) @psubhashish1 can you try the upload again? We've fixed tod... 
[15:44:36] (03CR) 10Alexandros Kosiaris: [C: 031] raid: switch from stringified fact to array [puppet] - 10https://gerrit.wikimedia.org/r/356030 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis) [15:46:16] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3300574 (10thcipriani) >>! In T166345#3298245, @Gilles wrote: > There are no xhprof runs i... [15:48:22] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3300580 (10cscott) Would it make sense to bisect wmf.2 at least before the redeploy? That... [15:49:05] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3300581 (10Gilles) @krinkle are you available to do that? 19:00 UTC today [15:49:11] (03PS5) 10Filippo Giunchedi: swift: introduce storage policies [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) [15:49:13] (03PS1) 10Filippo Giunchedi: swift: introduce container-reconciler [puppet] - 10https://gerrit.wikimedia.org/r/356198 (https://phabricator.wikimedia.org/T151648) [15:50:48] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3300584 (10Gilles) @cscott my understanding is that the issue couldn't be reproduced at al... 
[15:52:06] (03PS1) 10Faidon Liambotis: Kill scs-ext.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/356199 [15:52:30] (03PS1) 10Faidon Liambotis: Add CAA records for wikimedia.org/wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/356200 (https://phabricator.wikimedia.org/T155806) [15:52:35] bblack: ^ :) [15:53:26] (03CR) 10Faidon Liambotis: [C: 032] Kill scs-ext.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/356199 (owner: 10Faidon Liambotis) [15:54:22] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293304 (10ssastry) >>! In T166345#3300580, @cscott wrote: > Would it make sense to bisect... [15:54:37] paravoid: don't we need both CA? [15:54:46] which are both? [15:55:16] (you have a stray "s" from https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=1760523 .) [15:55:25] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3300626 (10cscott) @Gilles Yes, that is what I meant. It's a little complicated of course... [15:55:38] (also, by looking at that page, I have no idea where we are with the trains, sadly. ) [15:55:51] paravoid: GS too, maybe I'm missing something [15:56:08] (03CR) 10BBlack: [C: 04-1] "I think we should still include GlobalSign now, whether they're currently querying or not." [dns] - 10https://gerrit.wikimedia.org/r/356200 (https://phabricator.wikimedia.org/T155806) (owner: 10Faidon Liambotis) [15:56:21] it can't happen [15:56:26] they don't have a CAA identifier yet [15:57:34] heh ok [15:57:45] is everyone else using canonical domainnames though? 
[15:57:51] got it :D [15:58:12] issue <Issuer Domain Name> [; <name>=<value> ]* : The issue property [15:58:15] entry authorizes the holder of the domain name <Issuer Domain Name> or a party acting under the explicit authority of the holder [15:58:21] of that domain name to issue certificates for the domain in which [15:58:24] the property is published. [15:58:27] is what the RFC says [15:58:42] so it's likely it'll be globalsign.com, yes [15:58:43] ok so what does "they don't have a CAA identifier yet" mean? [15:59:07] they didn't register with a registry of CAA'd issuer domains? [15:59:36] I don't think there's a registry [15:59:52] it's just some arbitrary string (an "Issuer Domain name") that the CA will use [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T1600). Please do the needful. [16:00:30] we can guess that it will be globalsign.com, if you want [16:00:36] but there is no evidence this will be the case :) [16:00:38] paravoid: yeah I don't know, I haven't read enough [16:00:53] also the issue vs issuewild thing seems odd to me, but I haven't looked at it in the spec [16:01:08] issue implies issuewild, but issuewild doesn't imply issue, is how that generator behaves [16:04:13] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300667 (10elukey) @Cmjohnson we just need to alert people a couple of days in advance, nothing more. Do you have a preferred date/time?
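The issue vs issuewild question raised above can be made concrete with a small model of how a CA might evaluate a CAA record set. This is a sketch under one reading of RFC 6844 (issuewild is ignored for non-wildcard requests; for wildcard requests issuewild, when present, replaces issue; a record set with no relevant property does not restrict issuance). It is not the behaviour of any particular CA or of the generator mentioned in the chat. The function names, the (tag, value) tuple shape and the policy=ev parameter are hypothetical, and the issuer strings are only illustrative.

```python
def parse_issue_value(value):
    """Split an issue/issuewild value into (issuer_domain, parameters)."""
    parts = [p.strip() for p in value.split(";")]
    issuer = parts[0].lower()  # empty string means "no CA may issue"
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, val = part.partition("=")
        params[name.strip()] = val.strip()
    return issuer, params


def ca_may_issue(records, ca_domain, wildcard=False):
    """records: list of (tag, value) tuples from the relevant CAA record set."""
    issue = [parse_issue_value(v)[0] for t, v in records if t.lower() == "issue"]
    issuewild = [parse_issue_value(v)[0] for t, v in records if t.lower() == "issuewild"]
    # For wildcard requests, issuewild (if any) takes over from issue;
    # for non-wildcard requests, issuewild is ignored entirely.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        # No relevant property published: issuance is not restricted.
        return True
    return ca_domain.lower() in relevant
```

Under this model, publishing only issue records also constrains wildcard certificates, while publishing only issuewild records leaves non-wildcard issuance unrestricted.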
[16:05:50] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300674 (10Cmjohnson) @elukey Let's do Thursday 1600UTC [16:08:28] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300692 (10Ottomata) +1 [16:10:51] (03PS1) 10Filippo Giunchedi: install_server: reimage ms-be2001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/356206 (https://phabricator.wikimedia.org/T162609) [16:11:12] (03PS2) 10Filippo Giunchedi: install_server: reimage ms-be2001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/356206 (https://phabricator.wikimedia.org/T162609) [16:13:28] (03CR) 10Filippo Giunchedi: [C: 032] install_server: reimage ms-be2001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/356206 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [16:13:58] (03PS2) 10RobH: Add jdittrich to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/355599 (https://phabricator.wikimedia.org/T165943) (owner: 10Alexandros Kosiaris) [16:14:26] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3295950 (10Cmjohnson) The disk was indeed bad...so it's been replaced. I don't know if I have enough bbu's to go around.. I am swapping them out of decom'd servers. [16:15:12] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300753 (10matmarex) [16:15:28] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3300755 (10RobH) This request has had the actual group figured out now for the 3 day wait. 
As there are no objections noted, I'm merging Alex's patchset live. [16:15:30] (03CR) 10RobH: [C: 032] Add jdittrich to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/355599 (https://phabricator.wikimedia.org/T165943) (owner: 10Alexandros Kosiaris) [16:16:33] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3300760 (10Marostegui) If it helps there are three more servers totally ready for you to decomm them: T166486 T163778 T164702 [16:17:22] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300764 (10matmarex) [16:17:59] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300419 (10jcrespo) That request never reached DBAs, someone closed it before we could even be aware o... 
[16:18:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3300773 (10RobH) 05Open>03Resolved a:03RobH [16:21:14] (03CR) 10EBernhardson: [C: 031] "not too familiar with scap3, but this looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [16:21:41] (03PS1) 10Jcrespo: mariadb: Remove db2048 from the list of full reimages, add db2044 [puppet] - 10https://gerrit.wikimedia.org/r/356208 [16:22:54] (03PS2) 10Jcrespo: mariadb: Remove db2048 from the list of full reimages, add db2044 [puppet] - 10https://gerrit.wikimedia.org/r/356208 [16:23:33] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3300786 (10elukey) ``` elukey@db1046:~$ sudo megacli -pdrbld -showprog -physdrv\[32:3\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 35% in 9 Minutes. Exit Code:... [16:24:23] 06Operations, 06Operations-Software-Development, 07Technical-Debt: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300#3300789 (10Volans) [16:25:05] 06Operations, 06Analytics-Kanban, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3300795 (10Nuria) [16:28:08] (03CR) 10Jcrespo: [C: 032] mariadb: Remove db2048 from the list of full reimages, add db2044 [puppet] - 10https://gerrit.wikimedia.org/r/356208 (owner: 10Jcrespo) [16:28:09] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3300842 (10Jan_Dittrich) Thanks! 
[16:29:02] (03PS10) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [16:29:32] 06Operations, 06Labs: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3300848 (10Volans) a:05Volans>03None [16:29:34] 06Operations, 10ops-eqiad, 10Analytics, 15User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3300849 (10Cmjohnson) Ordered both servers to get new cards You have successfully submitted request SR948957999. You have successfully submitted request SR948... [16:32:54] 06Operations, 10ops-eqiad, 06Labs: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T165220#3300869 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Resolving this [16:40:42] !log installing shadow regression update [16:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:25] (03PS3) 10Gehel: logstash - start using elasticsearch-curator for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) [16:42:22] (03CR) 10Gehel: logstash - start using elasticsearch-curator for indices cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: 10Gehel) [16:43:07] (03PS4) 10Gehel: logstash - start using elasticsearch-curator for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) [16:43:36] (03PS5) 10Gehel: logstash - start using elasticsearch-curator for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) [16:44:08] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3300947 (10Joe) Since the problem presented 
itself only after ~ 15 minutes after the deplo... [16:45:01] (03PS1) 10Alexandros Kosiaris: Force fact stringification servermon reporter [puppet] - 10https://gerrit.wikimedia.org/r/356212 (https://phabricator.wikimedia.org/T166203) [16:46:06] (03PS2) 10Alexandros Kosiaris: Force fact stringification in servermon reporter [puppet] - 10https://gerrit.wikimedia.org/r/356212 (https://phabricator.wikimedia.org/T166203) [16:46:10] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [16:47:02] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2955191 (10BBlack) * Domains: Why start with just wikipedia and wikimedia? We could go after our lower-traffic domains first as a test, but since we don't issue individual cert... [16:47:49] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3293197 (10RobH) [16:48:04] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#3300973 (10BBlack) >>! In T155806#3300964, @BBlack wrote: > (even the non-canonicals, IMHO). Possible with the empty issuer for now, since we don't have plans to issue certs for... [16:48:25] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3293197 (10RobH) a:03RobH Please note that the specifications for this hardware are identical to the spare pool with 4 * 4TB SATA, projected for possible purchase on T166265... [16:49:02] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3300982 (10Ottomata) After talking with Faidon, this order should no longer happen this quarter. New scb nodes are budgeted for next FY, and lated to be purchased in Q3. It... 
[16:49:55] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3300988 (10Ottomata) 05Open>03declined I'm inclined to decline this task. We can create a new one in Q3 when it is time to order these. [16:51:45] bblack: to be clear, CAA records are meant to be used only at the moment of issuance and only by CAs, not after the fact/during the lifetime of the certificate [16:52:52] we could in theory also have empty records with TTL 5m, and flip them every time we went to digicert/globalsign to get a cert [16:53:36] so the worst thing that can happen when GlobalSign adds support for it and we don't notice is that we can't issue a cert by them in under $TTL (currently 5m) [16:54:27] (03CR) 10Volans: [C: 04-1] "Waiting to ensure we don't break anything else, like servermon" [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [16:54:45] ^^^ just as a safety measure ;) [16:57:13] 06Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10Dzahn) [16:59:45] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3301074 (10Gilles) Sounds worth a try, that would also explain why group0 and group1 are f... [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T1700). Please do the needful. [17:02:23] Nothing for ORES today. [17:02:28] paravoid: yeah, but there is some language about auditing software too (e.g. 
ssllabs, etc) [17:02:38] it does explicitly say that UAs shouldn't check it, though [17:03:07] I kind of like the empty + 5m thing, but on the other hand short TTLs might make it easier to spoof to a cache, lacking DNSSEC [17:03:09] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#3301085 (10faidon) I'm not 100% sure if it was obvious from my description above, so forgive me if I'm repeating something you know already: CAA records are meant to be used only... [17:03:16] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: eqiad: (3)+ nodes for Druid / analytics - https://phabricator.wikimedia.org/T166510#3301092 (10RobH) Those have SSDs, and cannot be ordered within the time line for this fiscal year. [17:03:20] the empty +5m won't work because of LE [17:03:36] well, empty of non-LE, for the LE cases (which is mostly wikimedia.org) [17:03:38] (basically responded to the task what I just said above, since I wasn't sure if you were around :) [17:03:43] yeah [17:04:00] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: eqiad: (3)+ nodes for Druid / analytics - https://phabricator.wikimedia.org/T166510#3301093 (10Ottomata) p:05High>03Normal Oh foo, I forgot, druid nodes have SSDs. We won't be able to get this in time for this FY's remainder budget.... [17:04:42] the auditing part I was talking about was: [17:04:43] Certificate Evaluator: A party other than a relying party that [17:04:43] evaluates the trustworthiness of certificates issued by [17:04:44] Certification Authorities. [17:04:53] CAA records MAY be used by Certificate Evaluators as a possible indicator of a security policy violation. Such use SHOULD take account of the possibility that published CAA records changed between the time a certificate was issued and the time at which the certificate was observed by the Certificate Evaluator. 
[17:05:17] yeah I think that's a reference to tools like the CT logs [17:05:26] certspotter etc. [17:05:30] that's basically ssllabs and similar cases, they might raise a pointless yellow flag or something at some point, if they can see our CAA doesn't match our cert [17:05:39] and yeah CT logs too I'm sure [17:06:30] I don't think we need to be worried much about that, imagine if we had kept the symantec cert on payments for example but decided we weren't going to renew [17:06:31] it's kind of interesting that there's no easy way I can see (as a software author) to tie the CAA domainname to root certs / orgs in a trivial way [17:07:11] there's language in there about the symantec-like case, about allowing for that [17:07:14] we'd remove symantec from the CAA but keep having its certs around [17:08:00] it's really not supposed to be used by anything other than CAs I think and their auditors I think [17:08:10] it's a weird kind of scheme anyway, as it doesn't require DNSSEC either [17:09:00] !log arlolra@tin Started deploy [parsoid/deploy@744f719]: Updating Parsoid to d07dfe1a [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:10] so next steps? add it to all domains and add globalsign.com? [17:09:16] https://www.ssllabs.com/ssltest/analyze.html?d=mail.google.com [17:09:44] ^ e.g. here, it reports on CAA and lists the contents. I don't know if they'll ever start "validating" that it matches up. Probably not as a failure, but maybe as an info/warning flag [17:09:51] issue: pki.goog [17:09:52] issue: symantec.com [17:10:06] symantec? really? 
[17:10:06] is what they have, and then their validating root path is through GeoTrust which is owned by Symantec [17:10:14] yeah [17:10:14] haha [17:10:45] yes very ironic :) [17:10:57] but they're working on being their own root anyways [17:10:59] didn't even know pki.goog existed heh [17:11:11] they bought an old root from globalsign to start off with [17:11:15] (the domain, not the root) [17:11:53] https://security.googleblog.com/2017/01/the-foundation-of-more-secure-web.html [17:12:10] RECOVERY - Disk space on labstore1005 is OK: DISK OK [17:12:25] their out for the symantec issue I guess is that they're static-pinning their symantec-issued intermediate in the browser [17:12:34] (in chrome and whoever else copies their pkp pins, anyways) [17:12:46] they *bought* two root CAs, wow [17:13:43] they recommend DNSSEC in the CAA RFC of course, just don't require it [17:13:57] the world needs to get off the fence on that stuff one way or another [17:14:27] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3301115 (10psubhashish1) It's working now. Thanks a lot all of you! :) [17:14:44] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3301116 (10psubhashish1) 05Open>03Resolved [17:14:53] it's an awful standard, but it's the only one we've got. but other standards are loath to require it for fear of sharing its ill-adopted fate :P [17:16:04] paravoid: what about not using LE for the non-LE canonicals (wikipedia.org, wikivoyage.org, etc)? thoughts? [17:16:24] paravoid: and ditto in the other direction - for the non-canonicals like wikipedia.cz or whatever, should we just blank out their issuers for now? 
[17:16:36] I didn't add LE for the wikipedia.org [17:16:42] ...part of the changeset [17:16:43] right [17:16:50] so I did that already and +1 on doing it for the rest [17:16:59] and we can blank out issuers for parking domains too, sure [17:17:27] (03PS11) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [17:17:34] and we have some non-parked non-canonicals in DNS still too, e.g. wikimedia.ee [17:17:41] !log arlolra@tin Finished deploy [parsoid/deploy@744f719]: Updating Parsoid to d07dfe1a (duration: 08m 41s) [17:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:49] wmftest.org [17:18:00] we don't have to fix them all in one go, but it'd be nice to eventually [17:18:35] (03CR) 10jerkins-bot: [V: 04-1] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [17:18:50] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3301134 (10debt) 05Open>03Resolved [17:20:42] ack, sounds good to me [17:20:49] paravoid: the symlinks problem creeps into all that mess though [17:20:52] wikiverzita.cz -> wikiversity.org [17:21:11] I guess for a first pass, let's just make sure the canonicals are covered and not worry about the rest [17:21:50] 06Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301143 (10Dzahn) [17:24:41] (03PS2) 10Ottomata: [WIP] Puppetize TLS encryption and auth for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) [17:26:30] (03PS12) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [17:27:29] (03CR) 10jerkins-bot: [V: 04-1] Genericize ca-manager [puppet] - 
10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [17:28:55] !log Updated Parsoid to d07dfe1a (T161151, T136653) [17:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:05] T136653: Parsoid doesn't recognize interwiki shortcuts in the href attribute - https://phabricator.wikimedia.org/T136653 [17:29:05] T161151: Parsoid should resolve template paths before providing them to Linter - https://phabricator.wikimedia.org/T161151 [17:29:33] (03PS13) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [17:29:56] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#3301209 (10debt) 05Open>03Resolved a:03debt [17:31:27] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3301216 (10debt) 05Open>03Resolved a:03debt [17:31:28] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3301218 (10debt) [17:33:28] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3301239 (10debt) 05Open>03Resolved a:03debt [17:33:47] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#3301242 (10debt) [17:33:49] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: move data to /srv for the cirrus / elasticsearch clusters - https://phabricator.wikimedia.org/T151328#3301241 
(10debt) 05Open>03Resolved [17:34:57] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3301254 (10debt) [17:35:50] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3301271 (10Papaul) @RobH I have already mgmt and productions DNS for labtestvirt2002 shouldn't this be labtestvirt2003 [17:36:45] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 06Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3301280 (10debt) p:05Triage>03Normal [17:36:50] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 06Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3301281 (10Smalyshev) I think there was talk about replacing these older servers with new ones, maybe we should start with wdqs1... [17:38:19] (03PS3) 10Ottomata: Puppetize TLS encryption and auth for Kafka in confluent module [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) [17:40:55] (03PS14) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [17:41:30] (03PS1) 10Cmjohnson: Fixing several mgmt dns entries in wmnet file...had wrong zone [dns] - 10https://gerrit.wikimedia.org/r/356219 [17:43:55] (03CR) 10Ottomata: "Since file extensions have changed, ca-manager for cassandra certs will need to be run before this is merged. 
This way puppet can find th" [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [17:43:58] (03CR) 10Cmjohnson: [C: 032] Fixing several mgmt dns entries in wmnet file...had wrong zone [dns] - 10https://gerrit.wikimedia.org/r/356219 (owner: 10Cmjohnson) [17:44:00] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/6579/restbase1010.eqiad.wmnet/change.restbase1010.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [17:44:54] (03CR) 10Ottomata: "This adds PLAINTEXT://:9092 as the default value of listeners. This does change config on existing Kafka brokers, but should result in th" [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [17:45:03] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/6578/" [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [17:45:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:48:14] (03PS15) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [17:48:29] 06Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301370 (10Dzahn) [17:48:57] (03CR) 10Ottomata: "Not sure why this shows cassandra-ca-manager and ca-manager as a delete and add rather than a rename and modify." 
[puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [17:53:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:53:37] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3301373 (10RobH) It seems that there is confusion, due to the fact that racktables shows two labtestvirt2001 systems. I went ahead and connected to the mgmt dns for the existing l... [17:55:20] 06Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301378 (10Dzahn) [17:55:22] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3289416 (10RobH) [17:55:42] !log branching 1.30.0-wmf.3 T165957 [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:50] T165957: MW-1.30.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T165957 [17:56:00] (03CR) 10EBernhardson: Enable BM25 for Chinese wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [17:56:02] (03PS1) 10Cmjohnson: adding mgmt entries for ganeti1005-8 T166076 [dns] - 10https://gerrit.wikimedia.org/r/356222 [17:56:16] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3289416 (10RobH) [17:56:39] (03CR) 10EBernhardson: [C: 031] "tests pass, looks sane enough to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/355440 (owner: 10Hashar) [17:58:51] (03CR) 10Cmjohnson: [C: 032] adding mgmt entries for ganeti1005-8 T166076 [dns] - 10https://gerrit.wikimedia.org/r/356222 (owner: 10Cmjohnson) [18:00:22] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3301427 (10Cmjohnson) [18:00:58] 06Operations, 10ops-eqiad, 13Patch-For-Review, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3301431 (10Cmjohnson) [18:01:47] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3301436 (10Cmjohnson) [18:02:21] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10Cmjohnson) [18:02:53] (03CR) 10EBernhardson: [C: 031] phpunit: factor out logic to handle globals vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [18:02:58] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3301438 (10Cmjohnson) [18:03:49] 06Operations, 10ops-eqiad, 10Dumps-Generation: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3301439 (10Cmjohnson) [18:04:17] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3301440 (10Cmjohnson) [18:09:23] jouncebot: next [18:09:23] In 0 hour(s) and 50 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T1900) [18:10:02] (03CR) 10EBernhardson: [C: 031] mwgrep: Add --etitle option (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/349352 (owner: 10Krinkle) [18:12:29] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:14:12] (03CR) 10EBernhardson: [C: 031] "Seems this is ready to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) (owner: 10DCausse) [18:19:03] (03PS5) 10Reedy: Optionally filter private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) [18:23:03] (03CR) 10EBernhardson: [C: 031] contint: PHP packages cleanup [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [18:28:47] (03PS1) 10Jdlrobson: Page images can come outside the lead for all projects except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356225 (https://phabricator.wikimedia.org/T166493) [18:34:00] (03PS11) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [18:34:14] (03PS8) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [18:34:24] (03PS8) 10Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [18:37:19] !log T160570: Upgrading dev env to Cassandra 3.11 (snapshot) [18:37:27] (03CR) 10Andrew Bogott: [C: 032] Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) (owner: 10Andrew Bogott) [18:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:28] T160570: Cassandra 3.x Tracking - https://phabricator.wikimedia.org/T160570 [18:37:39] (03CR) 10Andrew Bogott: [C: 032] Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 
10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) (owner: 10Andrew Bogott) [18:37:51] (03CR) 10Andrew Bogott: [C: 032] Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 (owner: 10Andrew Bogott) [18:40:29] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:46:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:56:16] (03PS2) 10Jdlrobson: Add Wikipedia wordmark in Serbian/Macedonian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) (owner: 10Dereckson) [18:56:18] (03PS1) 10Jdlrobson: Compress all project SVG logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356231 [18:58:13] (03PS1) 10Ottomata: [WIP] profile for 'broad' (name TBD) Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [19:00:04] TBD: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T1900). [19:00:25] TBD probably means me in this instance [19:01:14] heh, oops [19:01:32] Krinkle: or AaronSchulz do you have time to try the plan from T166345 ? [19:01:33] T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345 [19:01:35] 06Operations, 10Citoid, 10VisualEditor, 06Services (blocked), 15User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3301685 (10mobrovac) >>! In T165105#3300269, @Joe wrote: > We might want to contact the admins at wiley.com... 
[19:04:53] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile for 'broad' (name TBD) Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:05:42] (03CR) 10EBernhardson: "it seems PS1 of this was left on the beta cluster, rather than the merged version. I've rebased deployment-puppetmaster02 with the correct" [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [19:14:11] (03CR) 10EBernhardson: "not seeing any related errors in logstash logs for production or beta. Not sure what exactly fixed things but seems safe enough to abandon" [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) (owner: 10Ladsgroup) [19:14:26] (03PS3) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [19:14:29] (03PS1) 10BryanDavis: flake8 fixes for E305 [puppet] - 10https://gerrit.wikimedia.org/r/356234 [19:14:31] (03PS1) 10BryanDavis: ganglia: remove dup define in postgresql check [puppet] - 10https://gerrit.wikimedia.org/r/356235 [19:14:32] (03PS1) 10BryanDavis: flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 [19:15:49] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:16:14] (03CR) 10jerkins-bot: [V: 04-1] flake8 fixes for E305 [puppet] - 10https://gerrit.wikimedia.org/r/356234 (owner: 10BryanDavis) [19:17:15] hmmm... jerkins had a meltdown there... 
[19:17:27] (03CR) 10jerkins-bot: [V: 04-1] bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) [19:18:13] (03CR) 10jerkins-bot: [V: 04-1] ganglia: remove dup define in postgresql check [puppet] - 10https://gerrit.wikimedia.org/r/356235 (owner: 10BryanDavis) [19:18:18] (03CR) 10jerkins-bot: [V: 04-1] flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [19:19:07] oh... lame. I checked in garbage with the first patch :/ [19:21:23] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, and 2 others: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3301783 (10debt) 05Open>03Resolved This has been deployed and is working well. :) [19:23:00] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#3301807 (10debt) 05Open>03Resolved [19:25:44] (03PS2) 10BryanDavis: flake8 fixes for E305 [puppet] - 10https://gerrit.wikimedia.org/r/356234 [19:25:46] (03PS2) 10BryanDavis: ganglia: remove dup define in postgresql check [puppet] - 10https://gerrit.wikimedia.org/r/356235 [19:25:48] (03PS2) 10BryanDavis: flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 [19:25:50] (03PS4) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [19:26:58] (03CR) 10jerkins-bot: [V: 04-1] flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [19:31:28] (03PS3) 10BryanDavis: flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 [19:31:30] (03PS5) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [19:32:31] (03Draft1) 10Paladox: contint: Fix stretch support in package_builder [puppet] - 10https://gerrit.wikimedia.org/r/356237 [19:32:50] (03PS2) 10Paladox: contint: Fix stretch support in package_builder [puppet] - 
10https://gerrit.wikimedia.org/r/356237 [19:39:20] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 07Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - https://phabricator.wikimedia.org/T166240#3301898 (10DStrine) @Dereckson thanks for mentioning this but fr-tech... [19:39:53] (03PS3) 10Paladox: contint: Fix stretch support in package_builder [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) [19:40:20] * AaronSchulz looks at what the plan is [19:43:09] 06Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301941 (10herron) [19:44:49] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:51:24] (03PS4) 10BryanDavis: flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 [19:52:09] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:52:18] (03Draft1) 10Paladox: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 [19:52:38] (03PS2) 10Paladox: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) [19:56:03] (03Draft1) 10Paladox: jenkins: Install java 8 onto of java 7 [puppet] - 10https://gerrit.wikimedia.org/r/356243 [19:56:05] (03PS2) 10Paladox: jenkins: Install java 8 onto of java 7 [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [19:56:21] mutante: these 3 patches are mostly trivial and get us a newer version of flake8 for testing in ops/puppet -- https://gerrit.wikimedia.org/r/#/q/topic:flake8-upgrade -- they also unblock the patch for my dotfiles [20:04:48] !log Disabled puppet on mwdebug1001 [20:04:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:58] (03PS3) 10Paladox: jenkins: Install java 8 onto of java 7 [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [20:05:11] !log Manually installed memcached on mwdebug1001, running on default port 11211 [20:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:57] !log Overwriting nutcracker.yml on mwdebug1001 to point memcache cluster only to memcached on localhost [20:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:09] !log Restarting nutcracker on mwdebug1001 [20:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:19] (03CR) 10EBernhardson: "seems reasonably sane" (032 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [20:11:08] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3302008 (10Aklapper) As this seems to be the same as T166482 I'm marking this task... 
[20:11:38] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Wikimedia-General-or-Unknown, 10media-storage: Commons UploadWizard fails after repeated attempt to upload a .FLAC audio file - https://phabricator.wikimedia.org/T166490#3302010 (10Aklapper) [20:11:40] 06Operations, 10TimedMediaHandler, 10media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3302013 (10Aklapper) [20:14:19] PROBLEM - nutcracker process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [20:15:19] RECOVERY - nutcracker process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [20:18:15] !log LDAP - added uid=herron to groups "ops" and "wmf" for ops onboarding of Keith (T166587) [20:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:24] T166587: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587 [20:18:47] (03Draft1) 10Paladox: contint: Update python.pp to support stretch [puppet] - 10https://gerrit.wikimedia.org/r/356246 [20:19:37] (03PS2) 10Paladox: contint: Only install libmysqlclient-dev if on trusty or jessie [puppet] - 10https://gerrit.wikimedia.org/r/356246 (https://phabricator.wikimedia.org/T166611) [20:22:29] (03PS3) 10Paladox: contint: Only install libmysqlclient-dev if on trusty or jessie [puppet] - 10https://gerrit.wikimedia.org/r/356246 (https://phabricator.wikimedia.org/T166611) [20:30:44] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 06Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3302041 (10GWicke) [20:31:38] (03PS1) 10Ottomata: Add ezachte's new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/356267 [20:33:24] (03CR) 10Ottomata: [C: 032] Add ezachte's new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/356267 (owner: 10Ottomata) [20:35:10] PROBLEM - Check 
systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:35:45] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3302058 (10thcipriani) >>! In T166345#3301074, @Gilles wrote: > Sounds worth a try, that w... [20:36:32] !log Set all wikis to wmf.2 via wikiversions.php on mwdebug1001 only; manual nutcracker running a screen to use local memcached for debugging [20:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:45] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3302071 (10Gilles) We're trying all of that as we speak with @aaron, on mwdebug1001 [20:38:16] 06Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3302076 (10herron) My GPG key ID is C574276C (keyserver hkp://pool.sks-keyservers.net) and below are my ssh keys: {P5508} [20:58:22] gilles: I have to go AFK for an appointment [20:58:46] AaronSchulz: ok. so far you haven't spotted anything unusual right? [20:58:53] nothing on my end either [20:59:31] the initial slowness was expected; nothing strange after that, though it's hard to know what to hit. It doesn't seem to just be normal browsing of pages/RC type stuff. [21:00:04] RoanKattouw: Respected human, time to deploy Catalan Wikiquote Flow conversion (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T2100). Please do the needful. [21:01:16] un-pleasing [21:01:32] (re the lack of insight into the perf issue) [21:02:49] Did the train not happen today? [21:03:02] And am I good to proceed with my cawikiquote thing? 
[21:03:30] (03PS1) 10Dzahn: admins: create shell account for Keith Herron [puppet] - 10https://gerrit.wikimedia.org/r/356299 (https://phabricator.wikimedia.org/T166587) [21:10:19] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:31] RoanKattouw: the train did not happen today. We are still blocked on T166345. You should be clear to do what you need to on the deployment server afaik. AaronSchulz and/or gilles may be using it still/currently. [21:12:32] T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345 [21:13:00] OK cool [21:13:13] I just need to run some maintenance scripts and then deploy a config change [21:17:23] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3302223 (10Gilles) We did what @joe suggested on mwdebug1001 and we couldn't reproduce the... 
[21:18:51] RoanKattouw: you're good to go, AaronSchulz just edited wikiversions.php on mwdebug1001 for our debugging [21:19:08] OK, I'll stay away from that box [21:19:47] I'll leave it as is for now if AaronSchulz wants to investigate further when he gets back from his appointment, but so far we've been unable to reproduce the issue [21:21:43] !log mobrovac@tin Started deploy [zotero/translators@f051fe7]: Translators update for T95128 and T166292 [21:21:48] !log mobrovac@tin Finished deploy [zotero/translators@f051fe7]: Translators update for T95128 and T166292 (duration: 00m 05s) [21:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:53] T95128: OpenLibrary citation has incorrect author information (bug filed upstream) - https://phabricator.wikimedia.org/T95128 [21:21:53] T166292: Update zotero translators for Citoid (May 2017) - https://phabricator.wikimedia.org/T166292 [21:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:01] (03CR) 10Herron: [C: 031] admins: create shell account for Keith Herron [puppet] - 10https://gerrit.wikimedia.org/r/356299 (https://phabricator.wikimedia.org/T166587) (owner: 10Dzahn) [21:28:29] !log Running Flow/convertNamespaceFromWikitext.php on all discussion namespaces on cawikiquote (T165497) [21:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:38] T165497: Enable Flow on all talk pages on Catalan Wikiquote - https://phabricator.wikimedia.org/T165497 [21:28:42] (03PS2) 10Dzahn: admins: create shell account for Keith Herron [puppet] - 10https://gerrit.wikimedia.org/r/356299 (https://phabricator.wikimedia.org/T166587) [21:33:04] (03PS4) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [21:33:06] (03CR) 10Dzahn: [C: 032] admins: create shell account for Keith Herron [puppet] - 10https://gerrit.wikimedia.org/r/356299 
(https://phabricator.wikimedia.org/T166587) (owner: 10Dzahn) [21:36:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:39:19] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [21:47:09] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:48:41] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3302329 (10Krinkle) @thcipriani @greg We unfortunately don't have hourly-precision debug p... [21:49:08] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for labtestvirt2003 [dns] - 10https://gerrit.wikimedia.org/r/356302 [21:50:35] (03PS1) 10Dzahn: admin: add herron to ops group [puppet] - 10https://gerrit.wikimedia.org/r/356303 (https://phabricator.wikimedia.org/T166587) [21:54:27] (03CR) 10Krinkle: [C: 032] phpunit: factor out logic to handle globals vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [21:55:38] (03CR) 10Krinkle: [C: 032] test: factor out wgConf loading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355440 (owner: 10Hashar) [21:56:50] (03Merged) 10jenkins-bot: phpunit: factor out logic to handle globals vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [21:56:51] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3302366 (10greg) >>! In T166345#3302329, @Krinkle wrote: > @Gilles and I agree that we sho... 
[21:56:59] (03CR) 10jenkins-bot: phpunit: factor out logic to handle globals vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [21:57:23] (03Merged) 10jenkins-bot: test: factor out wgConf loading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355440 (owner: 10Hashar) [21:58:07] Krinkle: thank you :] [21:58:08] (03Merged) 10jenkins-bot: Test wgLogoHD keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [21:58:16] Krinkle: and sorry for the confusion __destruct() call [21:59:01] (03CR) 10jenkins-bot: test: factor out wgConf loading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355440 (owner: 10Hashar) [22:00:43] funny ci/zuul auto rebased one of the change https://gerrit.wikimedia.org/r/#/c/344798/ [22:02:58] (03PS5) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [22:03:07] (03PS3) 10Paladox: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) [22:03:18] (03PS4) 10Paladox: contint: Fix stretch support in package_builder [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) [22:07:28] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3302399 (10Gilles) The best time for me would be Thursday EU morning, i.e. from 07:00am GM... 
[22:08:49] (03PS2) 10Dzahn: admin: add herron to ops group [puppet] - 10https://gerrit.wikimedia.org/r/356303 (https://phabricator.wikimedia.org/T166587) [22:10:24] !log Running populateContentModel.php on all talk namespaces for all tables on cawikiquote [22:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:30] (03PS1) 10Catrope: Make Flow default in all talk namespaces on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356306 (https://phabricator.wikimedia.org/T165497) [22:14:43] (03CR) 10Catrope: [C: 032] Make Flow default in all talk namespaces on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356306 (https://phabricator.wikimedia.org/T165497) (owner: 10Catrope) [22:15:37] (03CR) 10Herron: [C: 032] admin: add herron to ops group [puppet] - 10https://gerrit.wikimedia.org/r/356303 (https://phabricator.wikimedia.org/T166587) (owner: 10Dzahn) [22:16:04] (03Merged) 10jenkins-bot: Make Flow default in all talk namespaces on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356306 (https://phabricator.wikimedia.org/T165497) (owner: 10Catrope) [22:20:27] !log Welcome new root shell user herron (T166587) [22:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:36] T166587: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587 [22:20:44] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Make Flow default in all namespaces on cawikiquote (T165497) (duration: 00m 43s) [22:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:53] T165497: Enable Flow on all talk pages on Catalan Wikiquote - https://phabricator.wikimedia.org/T165497 [22:24:13] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3302446 (10demon) >>! 
In T166568#3300768, @jcrespo wrote: > That request never reached DBAs, someone c... [22:30:50] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3302456 (10Dzahn) [22:43:34] (03PS1) 10Dzahn: icinga: give permissions to run commands to herron [puppet] - 10https://gerrit.wikimedia.org/r/356309 (https://phabricator.wikimedia.org/T166587) [22:43:48] !log created securepoll_elections.el_owner on testwiki T166568 [22:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:57] T166568: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568 [22:45:15] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3302513 (10Reedy) el_owner has been around for 8 years, no retrospective patch was added till the bug... 
[22:45:51] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3302530 (10Dzahn) [22:46:38] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10Dzahn) [22:47:48] (03CR) 10Chad: Setup apache vhost on scap proxies as well (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [22:49:29] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3302552 (10Dzahn) [22:54:45] (03PS3) 10Krinkle: mwgrep: If --title is set, don't also require '*.js/.css' [puppet] - 10https://gerrit.wikimedia.org/r/349351 [22:54:59] (03PS4) 10Krinkle: mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 [22:55:04] (03PS2) 10Dzahn: icinga: give permissions to run commands to herron [puppet] - 10https://gerrit.wikimedia.org/r/356309 (https://phabricator.wikimedia.org/T166587) [22:55:06] (03PS5) 10Krinkle: mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 [22:57:22] (03PS2) 10Dzahn: DNS: Add mgmt and production DNS for labtestvirt2003 [dns] - 10https://gerrit.wikimedia.org/r/356302 (owner: 10Papaul) [22:58:59] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 107157 [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:52] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS for labtestvirt2003 [dns] - 10https://gerrit.wikimedia.org/r/356302 (owner: 10Papaul) [23:02:47] hello [23:02:54] who's doing swat today? 
[23:03:51] I can [23:04:03] jdlrobson: First one looks like wrong patch #, but other 2 are fine and I shall merge [23:04:12] (03PS2) 10Chad: Compress all project SVG logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356231 (owner: 10Jdlrobson) [23:04:19] (03CR) 10Chad: [C: 032] Compress all project SVG logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356231 (owner: 10Jdlrobson) [23:04:58] RainbowSprinkles: https://gerrit.wikimedia.org/r/#/c/355625/2 [23:05:41] (03Merged) 10jenkins-bot: Compress all project SVG logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356231 (owner: 10Jdlrobson) [23:05:43] https://gerrit.wikimedia.org/r/356225 is the other one but looks like it needs a rebase [23:05:58] (03PS3) 10Chad: Add Wikipedia wordmark in Serbian/Macedonian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) (owner: 10Dereckson) [23:06:04] Trivial rebase :) [23:06:10] (03CR) 10Chad: [C: 032] Add Wikipedia wordmark in Serbian/Macedonian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) (owner: 10Dereckson) [23:06:25] (03PS2) 10Jdlrobson: Page images can come outside the lead for all projects except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356225 (https://phabricator.wikimedia.org/T166493) [23:06:48] Ah, that's the other one, 356225 [23:07:07] (03Merged) 10jenkins-bot: Add Wikipedia wordmark in Serbian/Macedonian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) (owner: 10Dereckson) [23:08:16] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3302573 (10BBlack) The lack of graph data from falling off the history is a sad commentary on how long this ha... 
[23:08:21] !log demon@tin Synchronized static/images/mobile/copyright/: Compressed + new images (duration: 00m 42s) [23:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:26] jdlrobson: Ok, new logos live everywhere & enabled for Sr/Mk [23:09:36] RainbowSprinkles: testing :) [23:09:41] Er, enabling.... [23:09:45] 1002 ? [23:09:46] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Add Wikipedia wordmark in Serbian/Macedonian (duration: 00m 45s) [23:09:47] Now! [23:09:52] jdlrobson: Everywhere! :) [23:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:21] RainbowSprinkles: looks beautiful. Sync away! [23:10:27] It already is ;-) [23:10:54] I'm a super lazy rebel. I rebel against process, but only because it's extra keystrokes :p [23:11:19] (03PS3) 10Chad: Page images can come outside the lead for all projects except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356225 (https://phabricator.wikimedia.org/T166493) (owner: 10Jdlrobson) [23:11:23] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#3302582 (10greg) Adding deployment-systems because mwscript lives in the scap puppet modu... [23:11:30] Last one is a little more dangerous, so we'll do it right with mwdebug1002 [23:11:50] (03CR) 10Chad: [C: 032] Page images can come outside the lead for all projects except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356225 (https://phabricator.wikimedia.org/T166493) (owner: 10Jdlrobson) [23:13:00] (03Merged) 10jenkins-bot: Page images can come outside the lead for all projects except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356225 (https://phabricator.wikimedia.org/T166493) (owner: 10Jdlrobson) [23:13:00] RainbowSprinkles: dangerous in which way? 
[23:13:05] it only runs on LinksUpdate [23:13:08] I dunno, made me paranoid :p [23:13:32] I assume LinksUpdate runs with debug1002 ? [23:13:51] I'm sure it does, but how can we force it to run quickly? It'll get jobqueue'd right? [23:14:17] correct [23:14:22] we just hope i guess :/ [23:14:43] Yay hoping! [23:15:08] W00t [23:15:50] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Page images can come outside the lead for all projects except Wikipedia (duration: 00m 41s) [23:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:36] RainbowSprinkles: ready to test? [23:19:11] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#3302600 (10BBlack) We talked a bit on IRC. Probably the first step is to include all the canonical domains (the 14-domain set in our big unified cert): ``` wikipedia.org wikime... [23:19:59] RainbowSprinkles: it works so you can sync away [23:21:17] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3302614 (10Krinkle) >>! In T124418#3302573, @BBlack wrote: >>>! In T124418#1985526, @BBlack wrote: >> Continui... [23:25:03] jdlrobson: I already did cuz I thought we couldn't test and were just hoping :p [23:25:10] ah ok lolz [23:25:13] cool thanks a bunch :) [23:37:08] jdlrobson: yw [23:42:51] * AaronSchulz wishes he knew what caused https://ganglia.wikimedia.org/latest/?r=custom&cs=05%2F25%2F2017+19%3A00&ce=05%2F25%2F2017+20%3A00&c=API+application+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=bytes_in&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [23:43:55] 32 min or so after sync and a few good min before the probably-unrelated lag stuff at 19:47 [23:50:59] same with cpu user. Interesting that a few boxes dropped a bit at the same time while most increased. 
Perhaps like some doing pool countered work while the others wait on a full pool. Hard to say. Lots of PC activity in that frame at https://logstash.wikimedia.org/goto/7804750efb77076429317c9e9178c408. [23:53:20] We don't log the decision to wait vs work, so the poolcounter log graphs would match up cleanly with cpu ones anyway.