[01:04:13] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1504573450 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9626828 keys, up 4 minutes 8 seconds - replication_delay is 1504573450
[01:04:33] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[01:05:32] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4910879 keys, up 5 minutes 22 seconds - replication_delay is 0
[01:06:22] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9616133 keys, up 6 minutes 11 seconds - replication_delay is 0
[01:08:22] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[01:11:22] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:23:26] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 07m 28s)
[02:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 5 02:30:03 UTC 2017 (duration 6m 37s)
[02:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:22] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[02:41:32] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:45:26] (PS1) Andrew Bogott: nodepool: specify 'nova' availability zone [puppet] - https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447)
[03:07:02] (PS1) Andrew Bogott: nova: make default 'nova' availability-zone explicit [puppet] - https://gerrit.wikimedia.org/r/375941 (https://phabricator.wikimedia.org/T170447)
[03:29:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.32 seconds
[04:19:12] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 261.84 seconds
[05:48:29] (PS1) Marostegui: db1026.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/375950 (https://phabricator.wikimedia.org/T174763)
[05:50:05] (CR) Marostegui: [C: 2] db1026.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/375950 (https://phabricator.wikimedia.org/T174763) (owner: Marostegui)
[06:01:44] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/375953 (https://phabricator.wikimedia.org/T174509)
[06:06:38] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/375953 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[06:08:15] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/375953 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[06:08:27] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/375953 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[06:09:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T174509 (duration: 00m 55s)
[06:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:54] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[06:20:49] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/375956
[06:22:28] (PS1) Elukey: role::analytics_cluster::hadoop::master: raise HDFS alarms thresholds [puppet] - https://gerrit.wikimedia.org/r/375957
[06:23:04] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/375956 (owner: Marostegui)
[06:23:15] (CR) Elukey: [C: 2] role::analytics_cluster::hadoop::master: raise HDFS alarms thresholds [puppet] - https://gerrit.wikimedia.org/r/375957 (owner: Elukey)
[06:24:33] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/375956 (owner: Marostegui)
[06:25:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T174509 (duration: 00m 46s)
[06:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:37] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[06:25:52] (PS1) Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/375958 (https://phabricator.wikimedia.org/T174509)
[06:26:22] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/375956 (owner: Marostegui)
[06:27:28] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/375958 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[06:28:58] (Merged) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/375958 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[06:29:08] (CR) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/375958 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[06:29:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T174509 (duration: 00m 46s)
[06:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:07] !log Deploy alter table on db1083 - T174509
[06:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:20] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:03:25] <_joe_> !log launching manually 3 workers for refreshLinks jobs on commons, T173710
[07:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:37] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710
[07:09:24] Operations, CirrusSearch, Discovery, Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3579057 (Joe) We still have around 1.4 million items in queue for commons, evenly divided between `htmlCacheUpdate` jobs and `refreshLinks` jobs. I've...
[07:14:09] !log installing libgd security updates on canary app servers (along with hhvm restart)
[07:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:30] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3579083 (Gilles) Looking at the puppet code, I see that thumbor uses ::swift::user and ::swift::key. I pre...
[07:48:43] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3579084 (Gilles) a:Gilles>fgiunchedi
[07:58:17] (PS2) Muehlenhoff: Install debdeploy-client [puppet] - https://gerrit.wikimedia.org/r/375793 (https://phabricator.wikimedia.org/T164817)
[08:00:30] Operations, Performance-Team, Thumbor: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3579095 (Gilles)
[08:05:13] (CR) Muehlenhoff: [C: 2] Install debdeploy-client [puppet] - https://gerrit.wikimedia.org/r/375793 (https://phabricator.wikimedia.org/T164817) (owner: Muehlenhoff)
[08:07:35] (PS3) Filippo Giunchedi: Add STL support to Thumbor, behind flag [puppet] - https://gerrit.wikimedia.org/r/375781 (https://phabricator.wikimedia.org/T161719) (owner: Gilles)
[08:08:23] (CR) Filippo Giunchedi: [C: 2] Add STL support to Thumbor, behind flag [puppet] - https://gerrit.wikimedia.org/r/375781 (https://phabricator.wikimedia.org/T161719) (owner: Gilles)
[08:13:56] (PS5) Filippo Giunchedi: Thumbor: only use lua in nginx config if "extras" variant [puppet] - https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) (owner: Gilles)
[08:14:57] (CR) Filippo Giunchedi: [C: 2] Thumbor: only use lua in nginx config if "extras" variant [puppet] - https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) (owner: Gilles)
[08:15:49] (PS5) Filippo Giunchedi: Add logback filter for Thumbor [puppet] - https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) (owner: Gilles)
[08:19:34] (CR) Filippo Giunchedi: [C: 2] Add logback filter for Thumbor [puppet] - https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) (owner: Gilles)
[08:23:30] Operations, Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#3579139 (Joe)
[08:23:34] Operations, JobRunner-Service, Wikimedia-General-or-Unknown: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#3579138 (Joe) Open>Resolved
[08:26:39] (PS2) Filippo Giunchedi: Send Thumbor error log to logstash [puppet] - https://gerrit.wikimedia.org/r/375859 (https://phabricator.wikimedia.org/T150734) (owner: Gilles)
[08:33:07] Operations, OCG-General, Reading-Community-Engagement, Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3579162 (Aklapper)
[08:33:58] (CR) Filippo Giunchedi: [C: 2] Send Thumbor error log to logstash [puppet] - https://gerrit.wikimedia.org/r/375859 (https://phabricator.wikimedia.org/T150734) (owner: Gilles)
[08:34:23] (CR) Hashar: "Comes from the Official Debian repository: https://anonscm.debian.org/cgit/pkg-php/php-defaults.git/" [debs/pkg-php/php-defaults] (debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/374766 (owner: Hashar)
[08:35:44] (CR) Hashar: "Forked it from the Official Debian repository https://anonscm.debian.org/cgit/pkg-php/php.git/ They obviously no more support php 5.5 and" [debs/pkg-php/php] (debian/jessie-wikimedia-5.5) - https://gerrit.wikimedia.org/r/374782 (https://phabricator.wikimedia.org/T161882) (owner: Hashar)
[08:47:19] !log uploaded php-luasandbox/hhvm-luasandbox 2.0.14 to apt.wikimedia.org (T173705)
[08:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:33] T173705: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705
[08:49:16] !log upgrading canary app servers to hhvm-luasandbox 2.0.14 (T173705)
[08:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:56] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[08:55:15] (PS2) Alexandros Kosiaris: Add Croatian language assets [puppet] - https://gerrit.wikimedia.org/r/375802 (https://phabricator.wikimedia.org/T172046) (owner: Ladsgroup)
[08:55:25] (CR) Alexandros Kosiaris: [V: 2 C: 2] Add Croatian language assets [puppet] - https://gerrit.wikimedia.org/r/375802 (https://phabricator.wikimedia.org/T172046) (owner: Ladsgroup)
[09:05:05] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3121: Connection refused
[09:05:06] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3123: Connection refused
[09:05:06] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3126: Connection refused
[09:05:06] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3120: Connection refused
[09:05:06] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3122: Connection refused
[09:05:06] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 80: Connection refused
[09:05:06] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3127: Connection refused
[09:05:12] ACKNOWLEDGEMENT - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ema https://phabricator.wikimedia.org/T174891
[09:05:12] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3120 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3120: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:12] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3121 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3121: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:12] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3122 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3122: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:12] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3123 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3123: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:12] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3126 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3126: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:12] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3127 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3127: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:13] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 80 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 80: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:05:13] ACKNOWLEDGEMENT - puppet last run on cp4024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Ema https://phabricator.wikimedia.org/T174891
[09:05:46] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3125: Connection refused
[09:05:46] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3124: Connection refused
[09:06:05] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3124 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3124: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:06:05] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3125 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3125: Connection refused Ema https://phabricator.wikimedia.org/T174891
[09:06:34] cp4024 is looking great this fine morning
[09:06:35] 09:06:16 up 1 day, 23:39, 1 user, load average: 1100.73, 1099.75, 1098.43
[09:06:45] lol
[09:06:53] 1100 load ... nice
[09:08:25] RECOVERY - puppet last run on chlorine is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:09:09] "well, there is your problem"
[09:16:03] (PS1) Jayprakash12345: Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979)
[09:16:57] ema: seems that CPU#41 complains a lot :-P
[09:17:13] probably envy to not being the #42
[09:19:02] :)
[09:19:05] (CR) Liuxinyu970226: [C: 1] Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: Jayprakash12345)
[09:24:07] (PS1) Alexandros Kosiaris: Tabs to spaces in monitor_lvs.erb [puppet] - https://gerrit.wikimedia.org/r/375972
[09:26:19] (PS2) Alexandros Kosiaris: Tabs to spaces in monitor_lvs.erb [puppet] - https://gerrit.wikimedia.org/r/375972
[09:26:34] (CR) Alexandros Kosiaris: [V: 2 C: 2] Tabs to spaces in monitor_lvs.erb [puppet] - https://gerrit.wikimedia.org/r/375972 (owner: Alexandros Kosiaris)
[09:28:53] volans: you are a poet
[09:29:08] lol
[09:31:54] (CR) Mobrovac: [C: 1] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - https://gerrit.wikimedia.org/r/370004 (owner: Mobrovac)
[09:35:08] (PS1) Ladsgroup: Enable sendEchoNotification for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/375974 (https://phabricator.wikimedia.org/T142102)
[09:37:28] (PS2) Ladsgroup: Enable sendEchoNotification for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/375974 (https://phabricator.wikimedia.org/T142102)
[09:37:43] (PS1) Alexandros Kosiaris: Pass vhost in jobrunner icinga check_command [puppet] - https://gerrit.wikimedia.org/r/375975
[09:43:13] (CR) Alexandros Kosiaris: [C: 2] Pass vhost in jobrunner icinga check_command [puppet] - https://gerrit.wikimedia.org/r/375975 (owner: Alexandros Kosiaris)
[09:47:30] (CR) Alexandros Kosiaris: [C: 1] Rakefile: print offending files when searching typos [puppet] - https://gerrit.wikimedia.org/r/375759 (owner: Giuseppe Lavagetto)
[09:47:45] (PS1) Ladsgroup: dumps: Use black wmf icon [puppet] - https://gerrit.wikimedia.org/r/375976
[09:48:51] (PS1) Elukey: eventlogging_sync: fix bug in id/timestamp column handling [puppet] - https://gerrit.wikimedia.org/r/375977 (https://phabricator.wikimedia.org/T174815)
[09:49:13] Operations, Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3579286 (ema) On September 1st: ``` Sep 1 16:21:57 cp4024 kernel: [ 5179.145078] BUG: unable to handle kernel paging request at ffffffffba8fb2e3 Sep 1 16:21:57 cp4024 kernel: [ 5179.152873] IP: [] 0...
[09:52:42] !log reimage kubernetes[12]00[1-4] T170119
[09:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:58] T170119: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119
[09:53:09] (PS1) Alexandros Kosiaris: Bump calico debian revision [puppet] - https://gerrit.wikimedia.org/r/375978
[09:53:33] (CR) jerkins-bot: [V: -1] Bump calico debian revision [puppet] - https://gerrit.wikimedia.org/r/375978 (owner: Alexandros Kosiaris)
[09:54:31] Invalid commit message ?
[09:54:53] ah... the T
[09:54:54] lol
[09:54:55] Line 5: Bug: value must be a single phabricator task ID
[09:55:20] I can say I was impressed with that
[09:55:33] (PS2) Alexandros Kosiaris: Bump calico debian revision [puppet] - https://gerrit.wikimedia.org/r/375978 (https://phabricator.wikimedia.org/T170119)
[09:55:50] (CR) ArielGlenn: [C: 2] dumps: Use black wmf icon [puppet] - https://gerrit.wikimedia.org/r/375976 (owner: Ladsgroup)
[09:58:23] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[09:58:53] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:59:32] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[09:59:50] !log Stopped Nodepool on labnodepool1001.eqiad.wmnet
[09:59:53] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational
[10:00:00] instances are erroring out
[10:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:15] (PS2) Jcrespo: mediawiki: make the wikidata wb_terms rebuild a little bit faster [puppet] - https://gerrit.wikimedia.org/r/375741 (https://phabricator.wikimedia.org/T171460) (owner: Ladsgroup)
[10:00:42] (PS3) Alexandros Kosiaris: Bump calico debian revision [puppet] - https://gerrit.wikimedia.org/r/375978 (https://phabricator.wikimedia.org/T170119)
[10:01:03] (CR) Jcrespo: [C: 2] mediawiki: make the wikidata wb_terms rebuild a little bit faster [puppet] - https://gerrit.wikimedia.org/r/375741 (https://phabricator.wikimedia.org/T171460) (owner: Ladsgroup)
[10:06:26] (PS4) Alexandros Kosiaris: Bump calico debian revision [puppet] - https://gerrit.wikimedia.org/r/375978 (https://phabricator.wikimedia.org/T170119)
[10:06:40] (CR) Alexandros Kosiaris: [V: 2 C: 2] Bump calico debian revision [puppet] - https://gerrit.wikimedia.org/r/375978 (https://phabricator.wikimedia.org/T170119) (owner: Alexandros Kosiaris)
[10:07:42] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[10:08:53] !log copy python-logstash from jessie-wikimedia to stretch-wikimedia
[10:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:32] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[10:10:03] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:10:17] at least openstack manages to delete instances
[10:11:53] !log reboot cp4024 T174891
[10:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:05] T174891: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891
[10:12:32] (CR) Elukey: [C: 2] eventlogging_sync: fix bug in id/timestamp column handling [puppet] - https://gerrit.wikimedia.org/r/375977 (https://phabricator.wikimedia.org/T174815) (owner: Elukey)
[10:12:37] (PS2) Elukey: eventlogging_sync: fix bug in id/timestamp column handling [puppet] - https://gerrit.wikimedia.org/r/375977 (https://phabricator.wikimedia.org/T174815)
[10:12:58] (CR) Phedenskog: Make values stackable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[10:18:13] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[10:18:36] <_joe_> !log stopping puppet, nginx on mw2249 for experimentation
[10:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:42] PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100%
[10:22:02] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 450 bytes in 7.374 second response time
[10:22:02] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 450 bytes in 7.378 second response time
[10:22:02] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 7.372 second response time
[10:22:02] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 7.375 second response time
[10:22:02] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 450 bytes in 0.157 second response time
[10:22:03] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 450 bytes in 0.157 second response time
[10:22:12] RECOVERY - Host cp4024 is UP: PING OK - Packet loss = 0%, RTA = 78.55 ms
[10:22:30] at least openstack manages to delete instances
[10:22:33] grr
[10:22:33] RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational
[10:22:42] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.157 second response time
[10:22:42] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.157 second response time
[10:22:52] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.157 second response time
[10:22:53] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.157 second response time
[10:23:06] (CR) Phedenskog: Make values stackable (4 comments) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[10:25:44] !log Restarting Nodepool. I have managed to spawn instances manually for contintcloud tenant
[10:25:54] (PS1) Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - https://gerrit.wikimedia.org/r/375980
[10:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:12] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational
[10:26:42] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[10:29:00] (PS1) Urbanecm: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/375981 (https://phabricator.wikimedia.org/T174978)
[10:33:18] (PS1) ArielGlenn: add 'general' to the list of properties retrieved for siteinfo dumps [dumps] - https://gerrit.wikimedia.org/r/375982 (https://phabricator.wikimedia.org/T171400)
[10:36:43] db1009 is showing as down on the proxy
[10:36:51] (but the proxy is not in use)
[10:38:05] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1009&var-port=9104&from=1504600653026&to=1504607866327
[10:38:44] !log Nodepool / CI are fully backup and processing queued jobs
[10:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:26] (PS1) Alexandros Kosiaris: Stretch for kubernetes related boxes [puppet] - https://gerrit.wikimedia.org/r/375983 (https://phabricator.wikimedia.org/T170119)
[10:42:22] (CR) Alexandros Kosiaris: [C: 2] Stretch for kubernetes related boxes [puppet] - https://gerrit.wikimedia.org/r/375983 (https://phabricator.wikimedia.org/T170119) (owner: Alexandros Kosiaris)
[10:45:56] PROBLEM - Host kubernetes1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:46:55] RECOVERY - Host kubernetes1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[11:00:04] hoo: Dear anthropoid, the time has come. Please deploy ArticlePlaceholder (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1100).
[11:01:44] (CR) Hoo man: [C: 2] Enable ArticlePlaceholder on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/375015 (https://phabricator.wikimedia.org/T174335) (owner: Jayprakash12345)
[11:02:10] (PS2) Muehlenhoff: Fix broken regexp in site.pp [puppet] - https://gerrit.wikimedia.org/r/374825
[11:03:12] (CR) Muehlenhoff: [C: 2] Fix broken regexp in site.pp [puppet] - https://gerrit.wikimedia.org/r/374825 (owner: Muehlenhoff)
[11:03:30] (Merged) jenkins-bot: Enable ArticlePlaceholder on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/375015 (https://phabricator.wikimedia.org/T174335) (owner: Jayprakash12345)
[11:03:42] (CR) jenkins-bot: Enable ArticlePlaceholder on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/375015 (https://phabricator.wikimedia.org/T174335) (owner: Jayprakash12345)
[11:05:27] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on sqwiki (T174335) (duration: 00m 46s)
[11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:42] T174335: Enable ArticlePlaceholder on sqwiki - https://phabricator.wikimedia.org/T174335
[11:07:21] (forgot to git rebase)
[11:07:54] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: (Actually) Enable the ArticlePlaceholder on sqwiki (T174335) (duration: 00m 45s)
[11:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:18] ah, here we go
[11:20:46] PROBLEM - Check whether ferm is active by checking the default input chain on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:20:55] PROBLEM - dhclient process on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:20:56] PROBLEM - DPKG on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:05] PROBLEM - Check size of conntrack table on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:05] PROBLEM - Disk space on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:05] PROBLEM - cassandra-b service on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:06] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:15] PROBLEM - cassandra-c service on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:35] PROBLEM - configured eth on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:35] PROBLEM - puppet last run on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:21:35] PROBLEM - MD RAID on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:22:22] did restbase2001 went down?
[11:23:01] aren't these reimaged to cassandra 3?
[11:23:03] Puppet is disabled. filippo
[11:23:05] ok
[11:23:08] maybe downtime expired
[11:23:12] so maybe that ^
[11:23:25] PROBLEM - HP RAID on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:23:55] PROBLEM - Check systemd state on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:25:36] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:28:16] PROBLEM - salt-minion processes on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:28:59] (PS1) Gilles: Revert "Add logback filter for Thumbor" [puppet] - https://gerrit.wikimedia.org/r/375994
[11:29:13] (PS2) Gilles: Revert "Add logback filter for Thumbor" [puppet] - https://gerrit.wikimedia.org/r/375994
[11:31:05] PROBLEM - IPMI Temperature on restbase2001 is CRITICAL: Return code of 255 is out of bounds
[11:31:20] (PS1) Elukey: eventlogging_cleaner.py: force timestamp to CHAR [puppet] - https://gerrit.wikimedia.org/r/375995 (https://phabricator.wikimedia.org/T156933)
[11:47:35] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata]
[11:48:25] PROBLEM - Check systemd state on kubernetes1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:51:05] (CR) Volans: [C: 1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/375995 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[11:51:33] (CR) Elukey: [C: 2] "Volans rocks" [puppet] - https://gerrit.wikimedia.org/r/375995 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[11:54:51] !log Drop reader_feedback, reader_feedback_history, reader_feedback_pages tables - T174586
[11:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:07] T174586: Remove ReaderFeedback tables from wikis - https://phabricator.wikimedia.org/T174586
[12:00:04] Amir1: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1200).
[12:00:17] On it
[12:03:38] Operations, Electron-PDFs, Readers-Web-Backlog (Tracking), Services (blocked): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916#3579644 (ovasileva)
[12:04:47] (PS3) Filippo Giunchedi: Revert "Add logback filter for Thumbor" [puppet] - https://gerrit.wikimedia.org/r/375994 (owner: Gilles)
[12:05:03] Amir1: LMK when done, I'll hold off on ^ as that will bounce logstash
[12:05:36] okay
[12:05:37] thanks
[12:06:58] (CR) Ladsgroup: [C: 2] Enable sendEchoNotification for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/375974 (https://phabricator.wikimedia.org/T142102) (owner: Ladsgroup)
[12:08:25] (Merged) jenkins-bot: Enable sendEchoNotification for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/375974 (https://phabricator.wikimedia.org/T142102) (owner: Ladsgroup)
[12:08:36] (CR) jenkins-bot: Enable sendEchoNotification for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/375974 (https://phabricator.wikimedia.org/T142102) (owner: Ladsgroup)
[12:15:25] mwdebug1002 was okay
[12:15:31] already deploying
[12:15:36] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: Enable sendEchoNotification for enwiki, dewiki, frwiki (T142102) (duration: 00m 46s)
[12:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:47] T142102: [Story] Deploy Wikibase notifications to Wikimedia projects - https://phabricator.wikimedia.org/T142102
[12:17:08] godog: it's done now :)
[12:17:46] Amir1: neat, thanks for the heads up
[12:24:00] PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100%
[12:24:06] doh
[12:24:45] !log ci: switched mediawiki/core mediawiki/vendor php5.5 jobs from trusty to jessie - T161882
[12:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:58] T161882: Migrate PHP5.5 jobs from Trusty to Jessie - https://phabricator.wikimedia.org/T161882
[12:26:17] ema: try to send it back to the factory and get a new one, this is clearly a lemon ;) [12:28:27] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Add logback filter for Thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/375994 (owner: 10Gilles) [12:43:17] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3579781 (10Aklapper) >>! In T157761#3076941, @Dzahn wrote: > Yea, this will be done but it's supposed to happen not until September and i don't want to hold on to it until then. Most likely it... [12:46:35] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 15 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [12:47:23] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:47:23] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 22 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [12:47:32] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:54:52] RECOVERY - dhclient process on restbase2001 is OK: PROCS OK: 0 processes with command name dhclient [12:55:03] RECOVERY - Check size of conntrack table on restbase2001 is OK: OK: nf_conntrack is 0 % full [12:55:03] RECOVERY - Disk space on restbase2001 is OK: DISK OK [12:55:03] RECOVERY - cassandra-b service on restbase2001 is OK: OK - cassandra-b is active [12:55:12] RECOVERY - DPKG on restbase2001 is OK: All packages OK [12:55:12] RECOVERY - salt-minion processes on restbase2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:55:13] RECOVERY - cassandra-c service on restbase2001 is OK: OK - cassandra-c is active [12:55:32] RECOVERY - configured eth on restbase2001 is OK: OK - interfaces up [12:55:42] RECOVERY - MD RAID on restbase2001 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [12:55:42] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:55:43] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [12:55:43] RECOVERY - Check whether ferm is active by checking the default input chain on restbase2001 is OK: OK ferm input default policy is set [12:56:23] that's me ^ [12:56:52] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042 [12:57:52] (03PS9) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [12:59:18] (03CR) 10Phedenskog: Make values stackable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. 
Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1300). [13:00:05] Urbanecm, kart_, and Addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:31] o/ [13:00:36] I'm here [13:01:12] RECOVERY - IPMI Temperature on restbase2001 is OK: Sensor Type(s) Temperature Status: OK [13:01:24] o/ [13:01:26] Urbanecm: I'll start with your patches [13:01:33] Ok [13:01:42] addshore: want to deploy yourself, or should I do it? [13:01:58] (PS1) Ema: varnish-backend-restart: do not run if varnish-be is depooled [puppet] - https://gerrit.wikimedia.org/r/376004 [13:02:12] zeljkof: if you could do it that would be great! [13:02:20] addshore: sure [13:02:23] for the record... [13:02:26] I can SWAT today! [13:03:39] Operations, media-storage, User-fgiunchedi: swift-recon-cron on ms-be203[34]: [Errno 17] File exists: '/var/lock/swift-recon-object-cron' - https://phabricator.wikimedia.org/T174959#3579821 (fgiunchedi) [13:03:42] RECOVERY - HP RAID on restbase2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5 - Controller: OK - Battery/Capacitor: OK [13:03:43] (PS10) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [13:04:05] (CR) jerkins-bot: [V: -1] Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [13:04:09] Urbanecm: reviewing 375409 [13:04:37] Ack [13:04:52] zeljkof: Around :) [13:05:19] kart_: hi!
[13:05:29] (03PS11) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [13:06:25] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375409 (https://phabricator.wikimedia.org/T172284) (owner: 10Urbanecm) [13:07:57] (03Merged) 10jenkins-bot: Update logo for sr.wikibooks.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375409 (https://phabricator.wikimedia.org/T172284) (owner: 10Urbanecm) [13:08:07] (03CR) 10jenkins-bot: Update logo for sr.wikibooks.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375409 (https://phabricator.wikimedia.org/T172284) (owner: 10Urbanecm) [13:09:12] Urbanecm: 375409 is at mwdebug1002 [13:11:17] (03PS2) 10Zfilipin: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375981 (https://phabricator.wikimedia.org/T174978) (owner: 10Urbanecm) [13:12:16] ack [13:12:58] working, please deploy [13:13:06] deploying [13:13:15] (03CR) 10Zfilipin: [C: 031] Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375981 (https://phabricator.wikimedia.org/T174978) (owner: 10Urbanecm) [13:14:18] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:375409|Update logo for sr.wikibooks.org (T172284)]] (duration: 00m 46s) [13:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:30] T172284: Logo for sr.wikibooks.org - https://phabricator.wikimedia.org/T172284 [13:14:48] Urbanecm: deployed, purging caches [13:16:04] ack [13:17:03] Urbanecm: caches purged, looks good to me [13:17:19] !log Drop pr_index tables from where ProofreadPage isn't enabled - T174782 [13:17:20] all logos now have bottom text in one line [13:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:33] T174782: Drop pr_index from wikis where ProofreadPage isn't 
enabled - https://phabricator.wikimedia.org/T174782 [13:17:59] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/375981 (https://phabricator.wikimedia.org/T174978) (owner: Urbanecm) [13:18:06] Urbanecm: merging 375981 [13:18:09] ack [13:19:23] (PS1) Filippo Giunchedi: cassandra: reprovision restbase2005 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/376007 (https://phabricator.wikimedia.org/T169939) [13:19:27] (CR) Phedenskog: Make values stackable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [13:19:29] (Merged) jenkins-bot: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/375981 (https://phabricator.wikimedia.org/T174978) (owner: Urbanecm) [13:19:37] (CR) jerkins-bot: [V: -1] cassandra: reprovision restbase2005 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/376007 (https://phabricator.wikimedia.org/T169939) (owner: Filippo Giunchedi) [13:19:40] (CR) jenkins-bot: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/375981 (https://phabricator.wikimedia.org/T174978) (owner: Urbanecm) [13:20:22] (PS2) Filippo Giunchedi: cassandra: reprovision restbase2005 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/376007 (https://phabricator.wikimedia.org/T169939) [13:21:13] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase2001 is OK: OK: synced at Tue 2017-09-05 13:21:06 UTC. [13:21:21] Urbanecm: 375981 is at mwdebug1002 [13:21:59] kart_: your patch is next, will ping you in a few minutes when it's at mwdebug1002 [13:22:15] Doesn't seem good to me... :( [13:22:31] I'm an idiot, there's no => true... [13:22:50] Will upload followup [13:22:52] Urbanecm: revert?
[13:22:52] No [13:22:58] Deploy together with a follow-up [13:23:04] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [13:23:41] 10Operations, 10Analytics-Kanban, 10User-Elukey: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493#3579853 (10elukey) [13:23:54] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:39] (03PS1) 10Urbanecm: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376008 (https://phabricator.wikimedia.org/T174978) [13:24:44] zeljkof, uploaded ^^ [13:26:12] zeljkof: sure [13:27:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376008 (https://phabricator.wikimedia.org/T174978) (owner: 10Urbanecm) [13:28:21] !log reimage restbase2005 - T169939 [13:28:30] (03Merged) 10jenkins-bot: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376008 (https://phabricator.wikimedia.org/T174978) (owner: 10Urbanecm) [13:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:35] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [13:28:39] (03CR) 10jenkins-bot: Add abusefilter-view-private to rollbackers in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376008 (https://phabricator.wikimedia.org/T174978) (owner: 10Urbanecm) [13:28:51] zeljkof: room in the swat for one more? 
[13:29:13] Urbanecm: it's at mwdebug [13:29:16] Ack [13:29:31] Working, please deploy both patches [13:29:39] jdlrobson: not sure, but add it to the calendar [13:29:45] Urbanecm: deploying [13:29:50] ack [13:30:13] zeljkof: added to calendar [13:30:45] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:376008|Add abusefilter-view-private to rollbackers in zhwiki (T174978)]] (duration: 00m 46s) [13:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:58] T174978: Add abusefilter-view-private to rollbackers in zhwiki - https://phabricator.wikimedia.org/T174978 [13:30:59] Urbanecm: deployed, please check [13:31:15] working [13:31:31] Urbanecm: please add the third commit to the wiki, for reference [13:31:38] ok [13:32:35] kart_: is order in which files are deployed relevant? or any order would do? [13:33:17] zeljkof: no order [13:34:00] kart_: ok, in that case deploying files in random order :) [13:34:14] Urbanecm: thanks, and thanks for deploying with #releng! ;) [13:36:34] kart_: CI is taking forever to merge the patch :| [13:38:00] zeljkof: done. [13:38:26] kart_: yes, a few seconds ago, will be at mwdebug1002 in a few minutes [13:43:03] kart_: it's at mwdebug1002, please test and let me know if I can deploy [13:43:14] OK. Testing. [13:45:37] zeljkof: Looks good. I can publish in my userspace without known errors! So, go ahead. 
[13:45:53] kart_: ok, deploying [13:47:44] !log zfilipin@tin Synchronized php-1.30.0-wmf.16/extensions/ContentTranslation/: SWAT: [[gerrit:375954|encodeURIComponent title to escape / properly (T174792)]] (duration: 00m 48s) [13:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:56] T174792: Publishing fails when publish target is User namespace - https://phabricator.wikimedia.org/T174792 [13:48:05] kart_: hm, I guess this is just a warning [13:48:11] 13:47:33 Check 'Logstash Error rate for mw1261.eqiad.wmnet' failed: ERROR: 66% OVER_THRESHOLD (Avg. Error rate: Before: 0.08, After: 3.00, Threshold: 1.00) [13:48:30] because scap did deploy without failures [13:48:43] kart_: please test and let me know [13:48:46] zeljkof: okay. [13:51:32] addshore: merging 375518, will ping you in a few minutes when it's at mwdebug1002 [13:51:38] zeljkof: is there any possibility of swatting mine in parallel to Addshore's? The fix was meant to go out yesterday but because of holiday i was away and I can't make the swat window later today [13:51:54] zeljkof: ack [13:52:22] jdlrobson: I would really rather not deploy them together, I rarely deploy core and I am a bit nervous [13:52:52] jdlrobson: if you or addshore want to do the deploy yourselves, feel free to do it in parallel :) [13:53:10] zeljkof: Basic workflow and publishing is OK too, no console errors so far. [13:53:27] zeljkof: where did you see that error? [13:53:47] matt_flaschen: can I make the EU SWAT about 15 minutes longer? there is one patch that does not fit in the window [13:53:53] 10Operations, 10Analytics-Kanban, 10User-Elukey: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493#3580049 (10elukey) Tuning the kafka-authorizer appender is definitely important for us since it contains interesting info like: ``` [2017-09-05 13:39:32,147] DEBUG Princ... [13:53:55] kart_: it's in scap output [13:54:13] zeljkof: oh, ok! 
[13:54:43] kart_: sorry, forgot to mention that, I guess it's just a warning [13:55:06] OK! But, good to know if that's caused by CX or not :) [13:55:49] matt_flaschen: and looks like the current patch will not be deployed before your deployment window :( [13:55:56] (CI still merging it) [13:56:02] silly CI [13:57:03] speaking of silly CI [13:57:07] addshore: I could have merged it before, because it's the only one for core, but I really do not like doing that, I guess it could complicate reverts if something went wrong :| [13:57:08] xdebug makes phpunit wayy slower [13:58:10] (PS1) Elukey: confluent::kafka: set kafka-authorizer log to DEBUG [puppet] - https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) [13:58:20] hashar: yes [13:59:54] hashar: jenkins doesn't have xdebug does it? O_o [14:00:04] matt_flaschen: Respected human, time to deploy Watchlist filters added to RCFilters Beta feature (phab:T164234) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1400). Please do the needful. [14:00:04] T164234: Provide "RC Filters" functionality on the watchlist - https://phabricator.wikimedia.org/T164234 [14:00:19] addshore: for phpunit coverage [14:00:27] only for the coverage tests though right? [14:00:35] *coverage generation [14:00:42] matt_flaschen: can you wait until the current patch is deployed (EU SWAT) [14:01:12] addshore: i think so yeah.
Filed it as https://phabricator.wikimedia.org/T175028 [14:01:15] (CR) Elukey: "pcc looks good https://puppet-compiler.wmflabs.org/compiler02/7715/" [puppet] - https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: Elukey) [14:02:09] zeljkof: yaya jenkins finally merged it :) [14:02:21] addshore: yes, will be at mwdebug in a few minutes [14:04:04] (PS1) Ladsgroup: Move config variables from the extension to config repo [mediawiki-config] - https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) [14:05:57] (PS2) Ladsgroup: Move config variables from the extension to config repo [mediawiki-config] - https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) [14:08:26] Amir1: ahh yes, let's not forget that [14:08:54] https://phabricator.wikimedia.org/T175009 [14:09:21] I will work on this now, let's see what's going on there [14:10:04] addshore: the commit is at mwdebug1002, please test and let me know if I can deploy [14:10:27] checking [14:10:47] matt_flaschen: around for deploy? I am still deploying the last commit for EU SWAT [14:11:02] zeljkof: looks good [14:11:11] addshore: ok, deploying [14:11:27] zeljkof, here now, let me know when you're done.
[14:11:37] matt_flaschen: ok [14:13:29] (PS2) Volans: Failoid: migrate to Puppet's future parser [puppet] - https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) [14:13:49] !log zfilipin@tin Synchronized php-1.30.0-wmf.16/includes/EditPage.php: SWAT: [[gerrit:375518|Re add wpScrolltop id in EditPage (T174723)]] (duration: 00m 45s) [14:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:02] T174723: [Regression] wpScrolltop feature broken on edit page - https://phabricator.wikimedia.org/T174723 [14:14:02] addshore: deployed, please check [14:14:12] ack [14:14:33] zeljkof: looks good [14:14:41] jdlrobson: we are already over time, sorry, and matt_flaschen has this window, could you please add the commit to another SWAT window? [14:14:55] addshore: great! thanks for deploying with #releng! ;) [14:15:02] !log EU SWAT finished [14:15:13] (PS1) Giuseppe Lavagetto: profile::mediawiki::jobrunner: refactor things to the profile [puppet] - https://gerrit.wikimedia.org/r/376020 [14:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:14] i noticed. I'll have to talk to Chad and see if it can go out with the train as it's pretty serious and noticeable and I can't make 4pm swat window tonight.
[14:15:15] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) [14:15:17] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: relay requests to the local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376022 (https://phabricator.wikimedia.org/T174599) [14:15:19] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376023 [14:15:21] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: restrict firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/376024 [14:16:23] (03CR) 10Volans: "No diff with the compiler: https://puppet-compiler.wmflabs.org/compiler02/7716/index-future.html" [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [14:17:23] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase2005 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376007 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:19:12] (03PS3) 10Reedy: phpcs changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375709 [14:19:20] (03CR) 10Reedy: [C: 032] phpcs changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375709 (owner: 10Reedy) [14:19:35] (03CR) 10Zhuyifei1999: [C: 031] Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345) [14:20:41] (03PS3) 10Volans: failoid: migrate to Puppet's future parser [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) [14:20:53] (03Merged) 10jenkins-bot: phpcs changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375709 (owner: 10Reedy) [14:21:00] (03PS4) 10Reedy: Fix links on highlight.php for dblists [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) [14:21:02] (03CR) 10Reedy: [C: 032] Fix links on highlight.php for dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) (owner: 10Reedy) [14:21:04] (03CR) 10jenkins-bot: phpcs changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375709 (owner: 10Reedy) [14:21:06] (03PS4) 10Volans: failoid: migrate to Puppet's future parser [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) [14:21:37] (03CR) 10Volans: [C: 032] failoid: migrate to Puppet's future parser [puppet] - 10https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: 10Volans) [14:22:31] (03Merged) 10jenkins-bot: Fix links on highlight.php for dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) (owner: 10Reedy) [14:22:38] PROBLEM - puppet last run on kubestage1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 18 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:22:43] (03PS2) 10Reedy: Remove/collapse a few conditionals in CentralNotice config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374877 [14:23:06] (03CR) 10Reedy: [C: 032] Remove/collapse a few conditionals in CentralNotice config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374877 (owner: 10Reedy) [14:23:24] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: refactor things to the profile [puppet] - 10https://gerrit.wikimedia.org/r/376020 [14:23:28] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:23:48] (03CR) 10jenkins-bot: Fix links on highlight.php for dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) (owner: 10Reedy) [14:24:29] (03Merged) 10jenkins-bot: Remove/collapse a few conditionals in CentralNotice config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374877 (owner: 10Reedy) [14:25:06] (03PS1) 10Gehel: wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) [14:25:55] !log reedy@tin Synchronized docroot/noc/conf/highlight.php: Fix links for dblists in highlight.php (duration: 00m 44s) [14:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:19] (03PS1) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [14:26:38] (03CR) 10Gehel: [C: 031] "LGTM (minor point mostly unrelated to this change discussed directly with volans)" [software/cumin] - 10https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911) (owner: 10Volans) [14:26:48] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:49] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:49] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:49] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:26:54] !log reedy@tin Synchronized errorpages/404.php: phpcs (duration: 00m 45s) [14:27:00] (03CR) 10jenkins-bot: Remove/collapse a few conditionals in CentralNotice config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374877 (owner: 10Reedy) [14:27:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:08] RECOVERY - Host cp3008 is UP: PING WARNING - Packet loss = 54%, RTA = 83.86 ms [14:27:08] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 84.05 ms [14:27:08] RECOVERY - Host cp3007 is UP: PING OK - Packet loss = 0%, RTA = 83.96 ms [14:27:08] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 83.89 ms [14:27:36] ema: network issue? [14:27:48] volans: possibly [14:28:14] shouldn't they cause a broader set of alarms if this happens? [14:28:14] !log reedy@tin Synchronized tests/: phpcs (duration: 00m 46s) [14:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:30] maybe they are under maintenance? [14:28:48] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2052834 [14:29:01] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3580154 (10Volans) [14:30:36] !log reedy@tin Synchronized wmf-config/: phpcs (duration: 00m 47s) [14:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:17] (03PS1) 10DCausse: [cirrus] Disable native script for super noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376027 (https://phabricator.wikimedia.org/T174652) [14:32:03] all the cache hosts above are cache_misc [14:32:48] (03CR) 10DCausse: [C: 04-1] "not to be merged before elastic 5.5.2+ is up and running on all clusters." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/376027 (https://phabricator.wikimedia.org/T174652) (owner: 10DCausse) [14:40:36] (03PS1) 10Andrew Bogott: labtest: have horizon use the new labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/376028 [14:41:53] (03CR) 10Andrew Bogott: [C: 032] labtest: have horizon use the new labtestpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/376028 (owner: 10Andrew Bogott) [14:45:05] (03PS1) 10Muehlenhoff: Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 [14:45:36] (03CR) 10jerkins-bot: [V: 04-1] Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (owner: 10Muehlenhoff) [14:45:43] PROBLEM - puppet last run on kubestage1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:46:02] PROBLEM - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:32] PROBLEM - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:48:33] PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:40] that's me ^ [14:50:01] (03CR) 10Mobrovac: "Out of curiosity, do we know what DEBUG would bring us as opposed to, say, INFO? A connected question: do we have a guesstimate as to the " [puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [14:52:34] (03CR) 10Elukey: "As far as I can see INFO does not log anything, at least from my tests in labs (there is also a comment in the standard log4j config file " [puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [14:53:49] greg-g, we're going to go a little over. 
There isn't a window immediately after us. [14:55:14] (03CR) 10Ema: [C: 032] varnish-backend-restart: do not run if varnish-be is depooled [puppet] - 10https://gerrit.wikimedia.org/r/376004 (owner: 10Ema) [14:55:27] (03PS2) 10Ema: varnish-backend-restart: do not run if varnish-be is depooled [puppet] - 10https://gerrit.wikimedia.org/r/376004 [14:55:32] (03CR) 10Ema: [V: 032 C: 032] varnish-backend-restart: do not run if varnish-be is depooled [puppet] - 10https://gerrit.wikimedia.org/r/376004 (owner: 10Ema) [14:55:33] (03CR) 10Mobrovac: "I'm mostly concerned with the volume and the possible impact on performance/disk, but I like your plan of rolling it out on only one broke" [puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [14:56:13] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2099977 [14:58:30] (03CR) 10Ottomata: "This sets dimension types for any dimension with a given name, not just ones for a specific dataset?" 
[puppet] - 10https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T168550) (owner: 10Joal) [15:00:46] (03CR) 10Elukey: "I am not too concerned about disk performance since it is a simple log, but I agree with you that data needs to be collected and this is e" [puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [15:02:02] volans, elukey: the cp3 misc glitch we've seen before was due to issues on csw2-esams (see #-netops) [15:02:08] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3580262 (10Papaul) p:05Triage>03Normal [15:02:24] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3580263 (10Papaul) p:05Triage>03Normal [15:04:04] jouncebot: now [15:04:04] No deployments scheduled for the next 0 hour(s) and 55 minute(s) [15:04:09] ema: ack and thanks for the update (I'm not in that channel too) [15:06:46] (03PS2) 10Volans: Rakefile: print offending files when searching typos [puppet] - 10https://gerrit.wikimedia.org/r/375759 (owner: 10Giuseppe Lavagetto) [15:06:58] _joe_: FYI I'm merging it ^^^ [15:07:12] <_joe_> volans: go on, in a meeting [15:07:15] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3580285 (10RobH) a:03mark Assigning to @mark for approval of spare server usage. @Mark: We have 4 total spare systems on the shelf ident... 
[15:07:55] (03CR) 10Volans: [C: 032] Rakefile: print offending files when searching typos [puppet] - 10https://gerrit.wikimedia.org/r/375759 (owner: 10Giuseppe Lavagetto) [15:08:42] (03PS2) 10Volans: Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (owner: 10Muehlenhoff) [15:08:52] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [15:09:10] (03CR) 10jerkins-bot: [V: 04-1] Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (owner: 10Muehlenhoff) [15:10:38] (03PS3) 10Ema: varnish::common::vcl: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374739 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [15:10:43] (03CR) 10Ema: [V: 032 C: 032] varnish::common::vcl: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374739 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [15:12:40] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [15:16:18] (03Abandoned) 10Joal: Update pivot systemd start command [puppet] - 10https://gerrit.wikimedia.org/r/375538 (owner: 10Joal) [15:17:38] (03PS2) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:17:51] (03PS3) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:18:16] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:23:35] (03CR) 10Volans: [C: 04-1] "Two typos, see inline comments. I'll check the actual queries later." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/376029 (owner: 10Muehlenhoff) [15:25:42] (03PS4) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:26:10] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:26:13] (03PS5) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:26:40] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:27:32] !log mattflaschen@tin Started scap: Prepare to enable RCFilters (WLFilters) on Watchlist [15:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:06] (03PS3) 10Ema: varnish: convert to string integers [puppet] - 10https://gerrit.wikimedia.org/r/374778 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [15:28:10] (03CR) 10Ema: [V: 032 C: 032] varnish: convert to string integers [puppet] - 10https://gerrit.wikimedia.org/r/374778 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [15:29:22] (03CR) 10Ottomata: "If we decide to keep this, I'd like for this to be a parameter that we override in hiera, rather than hardcoding it in the file." 
[puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [15:30:11] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376037 [15:30:15] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376037 [15:31:43] (03CR) 10Ottomata: "Whoa big bug. Yikes. Nice find luca. This must have been happening for a long time. Oof." [puppet] - 10https://gerrit.wikimedia.org/r/375977 (https://phabricator.wikimedia.org/T174815) (owner: 10Elukey) [15:32:35] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376037 (owner: 10Marostegui) [15:34:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376037 (owner: 10Marostegui) [15:35:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 - T174509 (duration: 00m 47s) [15:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [15:36:14] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376037 (owner: 10Marostegui) [15:36:15] !log mattflaschen@tin scap aborted: Prepare to enable RCFilters (WLFilters) on Watchlist (duration: 08m 42s) [15:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:14] (03PS2) 10Ema: varnish: stringify instance ports [puppet] - 10https://gerrit.wikimedia.org/r/374946 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [15:40:19] (03CR) 10Ema: [V: 032 C: 032] varnish: stringify instance ports [puppet] - 10https://gerrit.wikimedia.org/r/374946 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) 
[15:43:49] !log 'Scap sync failed on i18n, so I'll deploy just the non-i18n ones' [15:43:54] ^ stephanebisson, James_F [15:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:16] Oh dear. [15:45:17] matt_flaschen: What failed with the scap? [15:45:18] James_F, I don't know what caused the failure, doesn't look like our code: https://phabricator.wikimedia.org/P5956 [15:46:05] James_F, CDB generation. I thought it might be invalid JSON, but hif-latn (one of the failures) looks fine for both core and WikimediaMessages (all we changed). [15:46:17] James_F, also, "Could not chdir to home directory /var/lib/mwdeploy: No such file or directory " [15:46:20] * James_F nods. [15:46:22] Maybe related [15:48:02] James_F, en and qqq are also fine for both repos. [15:48:21] Hmm. [15:48:47] (03PS5) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [15:49:44] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team, 10Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3580448 (10Gilles) a:05fgiunchedi>03Gilles [15:51:20] (03PS4) 10Elukey: stat1003: remove puppet configuration as part of decom [puppet] - 10https://gerrit.wikimedia.org/r/374332 (https://phabricator.wikimedia.org/T152712) [15:52:35] (03PS1) 10Gilles: Expose Thumbor swift username [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) [15:56:25] (03PS3) 10EBernhardson: Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) [15:56:57] (03CR) 10Gilles: [C: 04-1] "Needs to have the change to PrivateSettings deployed first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [15:57:44] (03PS1) 10Ema: 
varnish::wikimedia_vcl: explicitly pass $vcl_config [puppet] - 10https://gerrit.wikimedia.org/r/376045 [15:58:44] (03PS6) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:58:50] Reedy: around? [15:58:55] addshore: For you? [15:59:05] (03PS7) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:59:11] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:59:13] yes, for me :D [15:59:41] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:59:49] I was wondering if there is a 'nice' way to get the collection of create statements for an extension (newsletter) with $wgDBTableOptions etc populated? [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1600). Please do the needful. [16:00:04] reedy and Amir1: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:11] o/ [16:00:13] I should do it through update.php right? 
[16:00:54] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:01:04] RECOVERY - puppet last run on kubestage1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:01:05] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:01:05] RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:01:15] RECOVERY - puppet last run on kubestage1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:01:24] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:02:36] addshore: What are you trying to do? [16:02:50] mediawikiwiki newsletter tables [16:02:59] For what purpose? [16:03:02] there must be an 'easy' way right? [16:03:13] To do what exactly? :P [16:03:31] create the tables / get the 'expanded' sql to create the tables :P [16:03:52] Why do you need that? [16:04:16] so i can create them .... [16:04:19] Where? [16:04:20] On WMF wikis? 
[16:04:26] mediawikiwiki :P [16:04:28] ffs xD [16:04:36] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/createExtensionTables.php [16:04:40] Just add it to there [16:04:42] ack [16:04:42] Makes it easier [16:04:49] Because no doubt you're gonna wanna run it elsewhere too :P [16:05:01] wait we already did [16:05:04] so thats the answer :P [16:05:05] PROBLEM - Host kubestage1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:05] PROBLEM - Host kubernetes1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:10] gj [16:05:14] just i forgot what it was, and you forgot we already did it ;) [16:05:44] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [16:05:45] PROBLEM - Host kubestage1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:54] RECOVERY - Host kubernetes1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:05:54] RECOVERY - Host kubestage1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:05:54] RECOVERY - Host kubestage1001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:06:04] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [16:06:05] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational [16:06:05] who is going to do puppet SWAT? 
[16:06:14] RECOVERY - Check systemd state on kubernetes1002 is OK: OK - running: The system is fully operational [16:06:14] RECOVERY - Check systemd state on kubestage1002 is OK: OK - running: The system is fully operational [16:06:25] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational [16:06:37] (03CR) 10Chad: [C: 031] "This is forwards and backwards compatible, can land whenever :)" [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [16:06:48] I think that people are a bit busy at the moment, I can try to help if you guys want [16:07:26] Reedy: do you have a bit of patience to introduce me to your changes? [16:07:59] sure [16:08:03] !log reboot all kubernetes boxes T170119 [16:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:16] T170119: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119 [16:08:16] elukey: I don't mind you skipping the apache change if you're not comfortable, as that's a bit more involved [16:08:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:09:21] Reedy: it looks straightforward but I might need to review it carefully [16:09:27] let's start with the first two [16:09:56] ow, elukey while you do that I'll do Amir1's if that's ok [16:10:24] godog: o/ - I can leave all the patches if you want, I thought you guys were busy :( [16:10:47] (03PS5) 10Elukey: Generate FancyCaptchas in 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736) (owner: 10Reedy) [16:11:24] elukey: I am yeah, completely forgot about puppet swat heh but happy to help [16:11:45] godog: all right let's split then [16:12:38] The apache one should be the only real difficult one [16:12:40] Reedy: ok to merge https://gerrit.wikimedia.org/r/#/c/358395 ? 
I've checked the task and you guys seems to have done the test homeworks :) [16:13:00] Yup. The code it depends on has been in production for a few weeks [16:13:01] :) [16:13:09] (03CR) 10Elukey: [C: 032] Generate FancyCaptchas in 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736) (owner: 10Reedy) [16:13:28] (03PS8) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [16:13:55] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:14:14] !log addshore@terbium:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki mediawikiwiki --extension newsletter [16:14:20] (03PS7) 10Elukey: Do the echo when running update.php [puppet] - 10https://gerrit.wikimedia.org/r/354932 (owner: 10Reedy) [16:14:23] (03PS9) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [16:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:16:03] (03PS6) 10Filippo Giunchedi: Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [16:16:48] (03CR) 10Filippo Giunchedi: [C: 032] Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [16:17:12] (03PS8) 10Elukey: wmf-beta-update-databases.py: do the echo when running update.php [puppet] - 
10https://gerrit.wikimedia.org/r/354932 (owner: 10Reedy) [16:17:25] retitled to focus on what is changing [16:17:40] (03CR) 10Elukey: [C: 032] wmf-beta-update-databases.py: do the echo when running update.php [puppet] - 10https://gerrit.wikimedia.org/r/354932 (owner: 10Reedy) [16:17:45] (03PS9) 10Elukey: wmf-beta-update-databases.py: do the echo when running update.php [puppet] - 10https://gerrit.wikimedia.org/r/354932 (owner: 10Reedy) [16:18:19] Amir1: change is live [16:18:37] it's so fancy now [16:18:43] (03PS10) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [16:18:50] https://usercontent.irccloud-cdn.com/file/iLowqgZY/image.png [16:19:21] godog: Thank you! [16:19:28] I see a different font o_0 [16:19:33] no worries, my OCD twitched at the fact that "code review" is slightly narrower than "wikimedia" [16:20:10] godog: It's part of the visual identity guide [16:20:33] the font is also Montezert (or something like that) which WMF uses [16:21:45] https://wikimediafoundation.org/wiki/Visual_identity_guidelines#toc-foundation [16:21:58] the rendering I'm seeing is also different, 17:05 *nod* I'll think about it a bit more during the week too [16:21:59] (03PS11) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [16:22:04] Love it Amir1 good job [16:22:04] no [16:22:09] https://phabricator.wikimedia.org/F9342590 [16:22:11] I see http://imgur.com/a/qzLag [16:22:58] interesting, I think it's because the font is not installed [16:22:59] I guess [16:23:44] Reedy: https://gerrit.wikimedia.org/r/#/c/374389/1/modules/mediawiki/files/apache/sites/remnant.conf - I am a bit ignorant of this part of our apache config, but why only a vhost on port 80? Is the config for electcom.wikimedia.org already present ? (only a matter of adding the dns?) 
[16:23:58] I see the same of Reedy on macOS [16:24:17] elukey: So SSL termination is done infront of apache by a different cluster. So we don't need to listen on 443 [16:24:54] (03PS1) 10Herron: WIP: Change check_ipmi_temp to check_ipmi_sensor and monitor PSUs [puppet] - 10https://gerrit.wikimedia.org/r/376048 (https://phabricator.wikimedia.org/T109903) [16:24:56] Reedy: yep I know [16:25:03] What do you mean is the config already present? [16:25:09] DNS is in another patch in puppet swat [16:25:30] (03CR) 10Chad: "Ugh, this should not have been merged. There was still discussion going on on Phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [16:26:02] (03PS1) 10Chad: Revert "Use new logo of WMF for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/376049 [16:26:03] Reedy: ah sorry, the new vhost in the patch is going to serve the content, the redirects happens of course only if the protocol is http [16:26:09] PROBLEM - nova-scheduler process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [16:26:25] (and X-Forwarded-Proto is not there with https, set by nginx) [16:26:26] okok [16:26:59] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:27:19] PROBLEM - nova-conductor process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [16:28:05] 10Operations, 10monitoring, 10Patch-For-Review: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#3580635 (10RobH) So is the approach to avoid checking the PDU towers themselves directly? I can see the addition of server checks, but I still thin... [16:28:15] greg-g, Reedy, James_F, stephanebisson, I am still trying to recover from T175041 . 
I have a sync-dir in progress (not showing anything, but hopefully it is just calculating file diffs). [16:28:15] T175041: scap sync failed on i18n - https://phabricator.wikimedia.org/T175041 [16:28:24] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) (owner: 10Gehel) [16:29:00] godog: Please revert the Gerrit change. [16:29:08] (03CR) 10Elukey: [C: 031] "I'd test this first on mwdebug if not done yet" [puppet] - 10https://gerrit.wikimedia.org/r/374389 (https://phabricator.wikimedia.org/T174370) (owner: 10Reedy) [16:29:57] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3580647 (10akosiaris) [16:29:59] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review, 10Services (watching): Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119#3580644 (10akosiaris) 05Open>03Resolved a:03akosiaris Production clusters in both DCs as well as the staging cluster are now at kubernetes 1.7.4 and... [16:30:21] no_justification: will revert cc Amir1 [16:30:27] * godog looks at the bikeshed [16:30:28] (03CR) 10Ladsgroup: "You said "Symbolic -1 because I hate the logo. But +1 I guess :(", I took it as optional, in the phabricator, I can't see any discussion. " [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [16:30:50] (03PS2) 10Filippo Giunchedi: Revert "Use new logo of WMF for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/376049 (owner: 10Chad) [16:30:52] godog: just answered in gerrit [16:31:14] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3580649 (10Papaul) @elukey I follow the steps that were giving to me from the Dell engineer to power down the server using the NMI button, holding the button down doesn't power the server down.... 
[16:31:20] I think it's okay to revert if there is another logo suggested, there is no plan, no resource and designer to make the new logo [16:31:29] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review, 10Services (watching): Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119#3580653 (10akosiaris) [16:31:33] I'd rather have no logo at all than the one proposed :( [16:31:48] ok, I'll revert for now, please reschedule when there's consensus [16:31:55] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Use new logo of WMF for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/376049 (owner: 10Chad) [16:32:09] RECOVERY - nova-scheduler process on labtestcontrol2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler [16:32:19] RECOVERY - nova-conductor process on labtestcontrol2001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [16:32:20] Amir1: I was mostly referring to legoktm's comment https://phabricator.wikimedia.org/T174576#3568659 [16:32:26] As "discussion" [16:32:27] :) [16:32:29] I agree this one is not Picasso's work but the old one is super super ugly, the new one at least follows WMF identity guideline [16:32:32] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3580657 (10madhuvishy) > All of these solutions so far require onsite. @Cmjohnson If you are back and onsite today, could you please take a look? [16:32:46] Ugly is subjective. I think the new one is uglier :p [16:33:05] Reedy: change live on mwdebug1001, can you check? [16:33:20] elukey: It's not really gonna work without DNS... 
;P [16:33:29] PROBLEM - nova-api http on labtestnet2001 is CRITICAL: connect to address 10.192.20.5 and port 8774: Connection refused [16:33:30] PROBLEM - nova-api process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [16:33:32] no_justification: my comment is about legoktm's [16:33:43] it's okay to revert if there is another logo suggested, there is no plan, no resource and designer to make a brand-new new logo [16:33:54] Reedy: well you can simulate it with curl --header "Host: etc.." [16:33:55] !log mattflaschen@tin Synchronized php-1.30.0-wmf.16/: Prepare to enable RCFilters (WLFilters) on Watchlist, but without i18n changes (reverted) (duration: 12m 41s) [16:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:20] (03CR) 10Mobrovac: "> If we decide to keep this, I'd like for this to be a parameter that we override in hiera, rather than hardcoding it in the file." [puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [16:34:24] Amir1: I'm sure we can figure out something :) [16:34:29] RECOVERY - nova-api http on labtestnet2001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.077 second response time [16:34:30] RECOVERY - nova-api process on labtestnet2001 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/nova-api [16:34:30] * no_justification grabs his box of crayons [16:34:31] :) [16:34:44] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3580667 (10Papaul) @elukey The ISO file is about 3.3 GB can not use mifi to download it so will download it once home and bring it to the DC tomorrow to update the firmware. Thanks. 
[16:34:56] Reedy: like root@mwdebug1001:/home/elukey# curl --header "Host: electcom.wikimedia.org" localhost, works perfectly [16:35:04] (it shows the redirect) [16:35:34] lgtm then :) [16:35:39] meh, I will revert your revert in one month if a new one is there [16:35:40] the wiki itself hasn't been created [16:35:44] *is not [16:35:47] Reedy: and curl --header "Host: electcom.wikimedia.org" --header "X-Forwarded-Proto: https" localhost this one shows the page [16:36:53] That'll be fine until the wiki gets created then :) [16:37:04] Amir1: I'll revert your revert a month later. Yay it'll be like a monthly date we have :p [16:37:06] Hehehehehe [16:37:24] Reedy: aahh okok, that was my next question :) [16:37:24] :)))))) [16:37:29] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms [16:37:39] (03PS2) 10Ema: varnish: drop varnish::wikimedia_vcl [puppet] - 10https://gerrit.wikimedia.org/r/376045 [16:39:59] papaul: did you powercycle mw2256? [16:40:03] (morning) [16:40:59] ah sorry just seen the update in the task [16:42:27] elukey: yes [16:42:43] (03CR) 10Elukey: [C: 031] "tested on mwdebug1001, apachectl looks ok, redirect on http works and request with X-Forwarded-Proto: https header works as expected (no w" [puppet] - 10https://gerrit.wikimedia.org/r/374389 (https://phabricator.wikimedia.org/T174370) (owner: 10Reedy) [16:43:11] (03PS2) 10Elukey: Add electcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/374389 (https://phabricator.wikimedia.org/T174370) (owner: 10Reedy) [16:43:43] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10BBlack) I suspect this is a hardware failure, but we should do some more software testing on the other nodes first to confirm this isn't buggy / incorrect behavior being triggered by the (somewhat crazy) NUMA-i... [16:43:55] Okay, I'm done, we will have to figure out T175041 and try again later though (plus we have more patches). 
[16:43:55] T175041: scap sync failed on i18n - https://phabricator.wikimedia.org/T175041 [16:44:17] (03CR) 10Elukey: [C: 032] Add electcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/374389 (https://phabricator.wikimedia.org/T174370) (owner: 10Reedy) [16:45:48] !log add new virtualhost electcom.wikimedia.org to the appservers apache config - https://gerrit.wikimedia.org/r/374389 (implies apache config reload) [16:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:29] (03PS2) 10Elukey: Add electcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/374385 (https://phabricator.wikimedia.org/T174370) (owner: 10Reedy) [16:51:32] (03PS3) 10Ema: varnish: drop varnish::wikimedia_vcl [puppet] - 10https://gerrit.wikimedia.org/r/376045 [16:51:50] !log reloading dbproxy1005 to repoing to db1009 again- things seem stable right now [16:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:39] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [16:52:43] (03PS8) 10Filippo Giunchedi: Optionally filter private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy) [16:53:16] (03PS1) 10Andrew Bogott: nova-conductor: limit number of workers [puppet] - 10https://gerrit.wikimedia.org/r/376057 (https://phabricator.wikimedia.org/T175002) [16:54:17] (03CR) 10Filippo Giunchedi: [C: 032] Optionally filter private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy) [16:55:59] Reedy: mwgrep --no-private merged [16:56:05] sweet [16:56:15] should be on tin/terbium when puppet runs there next [16:56:20] No rush to force a run though :) [16:56:30] (03PS3) 10Muehlenhoff: Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 [16:56:33] (03CR) 10Muehlenhoff: Extend aliases (034 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/376029 (owner: 10Muehlenhoff) [16:56:34] godog: https://gerrit.wikimedia.org/r/374385 looks good, shall we merge + authdns-update on n1something ? [16:56:40] *nssomething [16:56:52] elukey: yup, lgtm [16:57:02] (03CR) 10Andrew Bogott: [C: 032] nova-conductor: limit number of workers [puppet] - 10https://gerrit.wikimedia.org/r/376057 (https://phabricator.wikimedia.org/T175002) (owner: 10Andrew Bogott) [16:57:06] (03PS2) 10Andrew Bogott: nova-conductor: limit number of workers [puppet] - 10https://gerrit.wikimedia.org/r/376057 (https://phabricator.wikimedia.org/T175002) [16:57:08] (03CR) 10Jcrespo: [C: 031] "Based on stats, I rarely see more than 3 workeds non-idle at each time, but reduing it to 8 may help issues like the 27 concurrent workers" [puppet] - 10https://gerrit.wikimedia.org/r/376057 (https://phabricator.wikimedia.org/T175002) (owner: 10Andrew Bogott) [16:57:10] (03CR) 10Elukey: [C: 032] Add electcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/374385 (https://phabricator.wikimedia.org/T174370) (owner: 10Reedy) [16:58:08] (03CR) 10Ema: [C: 04-1] "This is a partially successful albeit ugly attempt to please the future parser:" [puppet] - 10https://gerrit.wikimedia.org/r/376045 (owner: 10Ema) [16:58:30] !log ran authdns-update from ns1.w.o after https://gerrit.wikimedia.org/r/374385 to create electcom.wikimedia.org [16:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:02] Reedy: https://electcom.wikimedia.org/ [16:59:15] https://electcom.wikimedia.org/?foo [16:59:21] That's expected :) [16:59:25] !log reducing the number of concurrent nova-conductor workers. 
May cause hiccups as services restart [16:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:51] Reedy: one thing that looks a bit weird though is that .wikimedia.org is usually under misc afaik [16:59:51] (03PS3) 10Niharika29: Enable AbuseFilter runtime profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) (owner: 10Dmaza) [17:00:03] in dns? [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1700). Please do the needful. [17:00:25] no parsoid deploy today [17:01:10] really there's probably more wikimedia.org hostnames mapped to the text cluster than the misc one [17:01:14] no ORES today [17:01:35] Reedy: see I am wrong, bblack made me feel better, nevermind :) [17:01:45] (but almost all services mapped to the misc cluster are in wikimedia.org, which is a different thing) [17:01:45] lol [17:01:54] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3580830 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request is... [17:01:58] bblack: thanks! 
[17:02:24] so puppet swat completed [17:02:47] \o/ \o/ [17:02:59] 10Operations, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3559677 (10Reedy) Wiki can be created at any time now [17:03:29] (03PS5) 10Elukey: stat1003: remove puppet configuration as part of decom [puppet] - 10https://gerrit.wikimedia.org/r/374332 (https://phabricator.wikimedia.org/T152712) [17:03:31] now let's nuke stat1003 [17:03:33] (perhaps a better approximate mental dividing line is that small wikis which happen to be in wikimedia.org go through the text cluster, and other non-wiki services in wikimedia.org tend to route through the misc cluster) [17:04:03] yep makes sense [17:04:20] I was ok with having this wiki in text, but the .wikimedia.org for a moment was weird to see [17:04:45] the usual afterthoughts after running authdns-update [17:04:45] eventually that distinction will fade away regardless [17:04:50] 10Operations, 10monitoring, 10Patch-For-Review: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#3580848 (10herron) Monitoring both makes sense to me but it sounds like direct PDU monitoring isn't always an option. From the server perspective i... [17:04:56] that scares the hell out of me :D [17:05:11] (03CR) 10Elukey: [C: 032] stat1003: remove puppet configuration as part of decom [puppet] - 10https://gerrit.wikimedia.org/r/374332 (https://phabricator.wikimedia.org/T152712) (owner: 10Elukey) [17:05:13] we're slowly heading towards a model of just two big cache clusters: the multimedia one and the non-multimedia one [17:06:09] 10Operations, 10DBA, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3580853 (10jcrespo) > Going to remove the DBA tag from this task as our part is done, but I will remain subscribed as once the wiki is created, we'd need to do... 
[17:08:56] 10Operations, 10ops-eqiad, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3580870 (10RobH) Background: Restbase1010 was ordered on T126049, which did NOT include SSDs. Instead, the non-standard Samsung SSDs were pulled from existing restbase hosts r... [17:10:47] (03CR) 10Dbarratt: [C: 031] Enable AbuseFilter runtime profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) (owner: 10Dmaza) [17:11:18] (03PS4) 10Herron: WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) [17:11:40] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) (owner: 10Herron) [17:14:16] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.12 (duration: 02m 32s) [17:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:20] RECOVERY - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.46 port 9042 [17:17:49] RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.47 port 9042 [17:20:09] (03CR) 10Paladox: [C: 031] "Yep :)." 
[puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [17:20:13] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3580895 (10RobH) [17:23:02] (03PS1) 10Chad: group0 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376060 [17:23:23] (03CR) 10Chad: [C: 04-2] "NOT YET THAT WOULD BE CRAZY TIEMZ" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376060 (owner: 10Chad) [17:24:02] !log demon@tin Started scap: wmf.17 bootstrap [17:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:19] (03CR) 10RobH: [C: 031] wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) (owner: 10Gehel) [17:31:31] (03PS5) 10Herron: WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) [17:31:52] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) (owner: 10Herron) [17:33:42] (03CR) 10Ottomata: [C: 032] webperf: Convert ve.py from ZMQ to KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/375106 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [17:33:45] 10Operations, 10Discovery, 10Discovery-Analysis, 10Maps, and 3 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3581011 (10debt) 05Open>03Resolved Thanks @ema and @gehel! 
[17:34:08] (03PS3) 10Ottomata: webperf: Refactor ve.py and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/375105 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [17:34:14] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Refactor ve.py and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/375105 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [17:34:20] (03PS4) 10Ottomata: webperf: Convert ve.py from ZMQ to KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/375106 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [17:34:24] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Convert ve.py from ZMQ to KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/375106 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [17:37:39] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [17:38:10] 10Operations, 10Wikimedia-Logstash, 10vm-requests, 10Discovery-Search (Current work), 10Patch-For-Review: Provision VMs on Ganeti for logstash100[123] - https://phabricator.wikimedia.org/T173565#3581036 (10RobH) 05Open>03Resolved I've gone ahead and created sub-task T175045 to track the actual setup.... [17:38:43] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3580895 (10RobH) [17:38:54] !log cp1074 - restart varnish backend, mailbox lag [17:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:09] 10Operations, 10DBA, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3581045 (10Marostegui) >>! In T174370#3580853, @jcrespo wrote: >> Going to remove the DBA tag from this task as our part is done, but I will remain subscribed... 
[17:44:38] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:48] PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:46:19] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [17:46:36] bblack: lvs1007 just went down? ^^^ [17:47:36] cannot ping/ssh [17:51:22] volans: can ignore [17:51:28] volans: that's me...sorry thought it was in maintenance still [17:51:39] it probably should be, sometimes the maints expire or reset [17:51:52] yeah..i didn't check first [17:52:03] oh ok, thanks [17:52:18] those hosts have been in a perpetual state of "not quite ready to use in production" for one of series of different reasons for like... 2 years? :) [17:52:41] heheh, right [17:52:48] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:53:20] yeah phab dates them being initially racked in July 2015 :/ [17:54:36] :( [17:55:41] (03PS1) 10RobH: setting logstash100[7-9].eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/376062 (https://phabricator.wikimedia.org/T175045) [18:05:19] (03PS6) 10Herron: WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) [18:05:40] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) (owner: 10Herron) [18:09:20] (03PS7) 10Herron: WIP: Add letsencrypt certs to mx servers [puppet] - 10https://gerrit.wikimedia.org/r/375427 (https://phabricator.wikimedia.org/T174081) [18:09:58] (03CR) 10RobH: [C: 032] setting logstash100[7-9].eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/376062 (https://phabricator.wikimedia.org/T175045) (owner: 10RobH) [18:10:08] (03PS2) 10RobH: setting logstash100[7-9].eqiad.wmnet install params 
[puppet] - 10https://gerrit.wikimedia.org/r/376062 (https://phabricator.wikimedia.org/T175045) [18:10:54] (03PS3) 10RobH: setting logstash100[7-9].eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/376062 (https://phabricator.wikimedia.org/T175045) [18:11:02] !log demon@tin Finished scap: wmf.17 bootstrap (duration: 47m 00s) [18:11:04] (03PS4) 10RobH: setting logstash100[7-9].eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/376062 (https://phabricator.wikimedia.org/T175045) [18:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:20] bleh, should have locally rebased that.. [18:11:20] (03PS12) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [18:11:41] (03CR) 10RobH: [C: 032] setting logstash100[7-9].eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/376062 (https://phabricator.wikimedia.org/T175045) (owner: 10RobH) [18:11:52] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:14:13] (03PS13) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [18:14:45] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:16:41] (03PS14) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [18:17:16] (03PS15) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [18:18:13] (03PS2) 10Herron: Add 5 
second "greet pause" delay to lists.wikimedia.org SMTP [puppet] - 10https://gerrit.wikimedia.org/r/371958 (https://phabricator.wikimedia.org/T173143) [18:23:49] there's a report that wikipedia is now reachable from turkey for some... [18:24:19] Oh? That's good news if confirmed [18:25:00] yeah, will need some more people to confirm of course, might be just one ISP screwing their block or something. [18:28:48] (03CR) 10Herron: "That's true. I'm not expecting this to filter huge numbers of messages either but every bit helps" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/371958 (https://phabricator.wikimedia.org/T173143) (owner: 10Herron) [18:29:56] (03PS1) 10RobH: fixing new logstash entries [puppet] - 10https://gerrit.wikimedia.org/r/376071 (https://phabricator.wikimedia.org/T175045) [18:30:10] (03PS2) 10RobH: fixing new logstash entries [puppet] - 10https://gerrit.wikimedia.org/r/376071 (https://phabricator.wikimedia.org/T175045) [18:30:23] (03CR) 10RobH: [C: 032] fixing new logstash entries [puppet] - 10https://gerrit.wikimedia.org/r/376071 (https://phabricator.wikimedia.org/T175045) (owner: 10RobH) [18:37:58] hrmm, getting no root filesystem defined on these... which is odd [18:50:44] 10Operations, 10Continuous-Integration-Config, 10Release-Engineering-Team: operations-puppet-tests-docker console output lacks color - https://phabricator.wikimedia.org/T175057#3581282 (10hashar) [18:53:02] 10Operations, 10Continuous-Integration-Config, 10Release-Engineering-Team: operations-puppet-tests-docker console output lacks color - https://phabricator.wikimedia.org/T175057#3581296 (10hashar) From https://groups.google.com/forum/#!topic/docker-user/Bp4BaWRw6k4 > No colours: > `docker run -v ~/myproject/... 
[18:53:27] (03PS1) 10RobH: another install fix for logstash [puppet] - 10https://gerrit.wikimedia.org/r/376073 (https://phabricator.wikimedia.org/T175045) [18:53:43] (03CR) 10RobH: [C: 032] another install fix for logstash [puppet] - 10https://gerrit.wikimedia.org/r/376073 (https://phabricator.wikimedia.org/T175045) (owner: 10RobH) [18:54:58] 10Operations, 10ops-eqiad, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3581315 (10Cmjohnson) I do have spares....sorry forgot about those...Samsungs are not standard disks for us. The server gives me green lights for all of them. Can you tell me w... [19:00:05] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T1900). Please do the needful. [19:01:31] The irony that the weekly SF emergency system alarm test is at noon on tuesdays right as I roll out the new branch is not lost on me [19:01:37] AIR RAID SIREN, TIME TO DEPLOY! [19:03:18] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2064883 [19:03:29] (03CR) 10Chad: [C: 032] "https://www.youtube.com/watch?v=WmX4DSGvPY4&t=48" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376060 (owner: 10Chad) [19:05:06] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3581403 (10Cmjohnson) I ended up moving the cards to different pci slots and that fixed the issue. 
@robh passing this to you (again) [19:05:43] !log deleted querycache rows where qc_type = '' on all wikis T174513 [19:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:00] T174513: Cleanup querycache where qc_type = '' - https://phabricator.wikimedia.org/T174513 [19:06:37] (03Merged) 10jenkins-bot: group0 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376060 (owner: 10Chad) [19:06:48] (03CR) 10jenkins-bot: group0 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376060 (owner: 10Chad) [19:07:33] 10Operations, 10ops-eqiad, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3581444 (10RobH) So here is the thing for this particular error, while there are IO errors included above, the actual raid controller reports the disk is fine. => ctrl slot=0 l... [19:08:29] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table querycache: try to repair it on query. Default database: tewiktionary. [Query snipped] [19:13:48] PROBLEM - MD RAID on restbase1010 is CRITICAL: CRITICAL: State: degraded, Active: 14, Working: 14, Failed: 1, Spare: 0 [19:13:49] ACKNOWLEDGEMENT - MD RAID on restbase1010 is CRITICAL: CRITICAL: State: degraded, Active: 14, Working: 14, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175061 [19:13:54] 10Operations, 10ops-eqiad: Degraded RAID on restbase1010 - https://phabricator.wikimedia.org/T175061#3581482 (10ops-monitoring-bot) [19:17:58] 10Operations, 10HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3581515 (10Legoktm) I'm seeing failures in the Scribunto test suite using php-luasandbox 2.0.14 (my package for PHP7, not Wikimedia's HHVM one), but not under 2.0.13: ``` 1) LuaSandbox: SandboxTests[1]: setfenv invalid le... 
[19:19:25] 10Operations, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3581519 (10Cmjohnson) a:03fgiunchedi Replaced the disk with an on-site spare. Verified that the disk in slot 2 was /dev/sdc.....once I pulled it out from the server the raid cfg showed /de... [19:20:09] 10Operations, 10ops-eqiad: Degraded RAID on restbase1010 - https://phabricator.wikimedia.org/T175061#3581526 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Replaced the disk...associated with T174392 [19:20:48] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 874.85 seconds [19:20:54] (03PS1) 10Kaldari: Test ArticleCreationWorkflow extension on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) [19:27:05] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T174857#3581561 (10Cmjohnson) Replaced the disk and it's rebuilding Enclosure Device ID: 32 Slot Number: 6 Drive's position: DiskGroup: 0, Span: 3, Arm: 0 Enclosure position: 1 Device Id: 6 WWN: 5000C5006821E074 S... [19:27:38] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3581577 (10Cmjohnson) [19:28:04] 10Operations, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3581579 (10Eevans) Thanks @Cmjohnson ! [19:28:29] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: wmf.17 on group0 [19:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:11] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T174857#3575136 (10jcrespo) Thank you very much, will monitor when it finishes and close the ticket. 
[19:29:24] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3581583 (10Cmjohnson) p:05Triage>03Low [19:30:08] I am checking dbstore1002 [19:30:09] 10Operations, 10ops-eqiad, 10Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3581589 (10Cmjohnson) p:05Triage>03Lowest stat1002 is still off-site [19:30:24] 10Operations, 10ops-eqiad: Decommission or repair old asw-c2-eqiad - https://phabricator.wikimedia.org/T156398#3581591 (10Cmjohnson) p:05Triage>03Low [19:30:39] 10Operations, 10DBA, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3581600 (10Cmjohnson) p:05Triage>03Low [19:30:57] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3581603 (10Cmjohnson) p:05Triage>03Normal [19:32:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:32:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3581609 (10Cmjohnson) It appears to be the cpu. Creating a task with Dell to replace. Record: 16 Date/Time: 08/30/2017 16:13:51 Source: system Severity: Cri... [19:33:09] !log rebuilding querycache table on tewiktionary.dbstore1002, it had crashed [19:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:58] 10Operations, 10DBA, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3581627 (10jcrespo) @Cmjohnson We are going to decom db1048 (but we are not ready yet), please do not take any action here, we will just clone it and ask you to unrack it. Opened for DBA... 
[19:38:59] !log gilles@tin Synchronized private/PrivateSettings.php: Add Thumbor username to Swift configuration (duration: 00m 48s) [19:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:58] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 153.48 seconds [19:41:09] (03CR) 10Gilles: "Changed to PrivateSettings.php has been deployed both on Beta and Prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [19:41:30] 10Operations, 10DBA, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3581657 (10Cmjohnson) no worries, I was just moving it to a lower priority for me..I am couple of weeks away from tacking decom's [19:42:38] 10Operations, 10HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3581662 (10Anomie) I'm working on it. [20:00:59] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3581781 (10RobH) [20:06:53] 10Operations, 10HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3581815 (10Legoktm) Tests fixed by @anomie :) [20:12:40] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3581849 (10aaron) Those refreshLInks jobs (from wikibase) are the only ones that use multiple titles per job, so they will be a lot slower (seems to be 5... 
[20:18:42] (03PS1) 10BryanDavis: labsdb: Make /run/mysqld/mysqld.sock default socket location [puppet] - 10https://gerrit.wikimedia.org/r/376092 (https://phabricator.wikimedia.org/T172496) [20:20:54] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3581887 (10bd808) [20:23:43] 10Operations, 10DBA, 10Phabricator, 10hardware-requests: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3581897 (10Peachey88) [20:25:08] PROBLEM - cassandra-a service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:25:19] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.187 and port 9042: Connection refused [20:25:19] PROBLEM - cassandra-c SSL 10.64.32.196:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:25:28] PROBLEM - cassandra-b service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [20:25:29] PROBLEM - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:25:38] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [20:25:38] PROBLEM - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.196 and port 9042: Connection refused [20:25:49] PROBLEM - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.195 and port 9042: Connection refused [20:25:49] PROBLEM - cassandra-b SSL 10.64.32.195:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:25:58] PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:26:14] 10Operations, 10ops-eqiad, 10Release-Engineering-Team (Watching / External): tin has a failing hdd - https://phabricator.wikimedia.org/T174449#3581919 (10Peachey88) [20:26:16] 10Operations, 10ops-eqiad: Pending sectors for one disk on tin.eqiad.wmnet - https://phabricator.wikimedia.org/T174347#3581921 (10Peachey88) [20:35:11] ^^^ on that (they're not in production) [20:39:39] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:47:48] (03CR) 10Rush: [C: 032] labsdb: Make /run/mysqld/mysqld.sock default socket location [puppet] - 10https://gerrit.wikimedia.org/r/376092 (https://phabricator.wikimedia.org/T172496) (owner: 10BryanDavis) [20:48:08] PROBLEM - Restbase root url on restbase-dev1004 is CRITICAL: connect to address 10.64.0.89 and port 7231: Connection refused [20:48:48] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:49:29] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:54:29] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [20:58:37] (03CR) 10MarcoAurelio: [C: 04-1] Remove Extension:RelatedSites from zhwikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345) [20:59:03] (03PS14) 10MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) [21:06:09] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [21:12:38] (03PS4) 10Volans: cumin: extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [21:14:10] (03CR) 10Volans: "@moritzm: I've uploaded a new PS with some query fixes and correct alphabetical order (not including the colon). Please review them." [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [21:14:41] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3582163 (10RobH) [21:16:53] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3582171 (10RobH) [21:22:02] (03CR) 10DCausse: [C: 031] Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [21:26:49] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:26:59] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3582198 (10RobH) Ok, I thought labstore1006 was installed, since it was booted, but it and labstore1007 do not show the same disks in the same order. Example: the raid ar... [21:27:11] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3582199 (10RobH) [21:29:54] (03CR) 10BryanDavis: [C: 031] "I would not have bothered to write this patch, but now that Merlijn has written it I don't know of any reason not to merge it." [puppet] - 10https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) (owner: 10Merlijn van Deen) [21:39:58] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [21:42:46] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:55] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:48:18] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3582301 (10RobH) [21:54:04] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3582329 (10RobH) [22:04:55] * raynor is ready for SWAT deployment [22:06:32] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3582359 (10Krinkle) [22:06:41] !log demon@tin Synchronized php-1.30.0-wmf.16/extensions/FlaggedRevs/frontend/specialpages/reports/: (no justification provided) (duration: 03m 13s) [22:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:14] !log demon@tin Synchronized php-1.30.0-wmf.16/extensions/FlaggedRevs/frontend/specialpages/reports/: kill some warnings and stuff (duration: 02m 55s) [22:07:20] (03PS1) 10Krinkle: webperf: Decom webperf::ve service [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) [22:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:09] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3582379 (10RobH) [22:11:37] Is mw1163 down? I can't ssh and scap complained about it up there ^ [22:11:52] Needs conftool removal? [22:13:48] 21:06 < icinga-wm> PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [22:13:55] no_justification: yes :) [22:16:59] huh [22:17:03] lemme go take a look at it [22:17:07] we may be able to just reboot it back online [22:17:39] its locked up on serial, rebooting. 
[22:17:42] thanks, i wish this wasn't a manual/prone to delay process :/ [22:17:45] !log rebooting mw1163 as its serial console is locked up [22:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:11] ok its back online [22:21:15] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [22:21:20] greg-g / no_justification did you guys wanna redo your sync? [22:21:31] I'll just sync that machine by hand [22:21:46] ty robh [22:21:52] well, its crash has been admin logged, so if it starts to happen again we have a record [22:22:28] oh, now im checking the service event log [22:22:32] and it had a memory error =[ [22:23:01] warranty expired =P [22:23:12] if it has another memory error we'll likely just decommission it. [22:23:41] actually, it shows a bunch of older errors too, so im just going to create the task to get decom approval now. [22:25:34] or replace with decom memory that matches [22:30:39] 10Operations, 10hardware-requests: decommission mw1163 - https://phabricator.wikimedia.org/T175089#3582459 (10RobH) [22:31:53] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3582487 (10RobH) a:05RobH>03Gehel [22:32:31] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3580895 (10RobH) Ok, handing this to @gehel for followup. Feel free to use or resolve task as needed. 
[22:36:16] PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:36:36] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused [22:36:46] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.114 and port 9042: Connection refused [22:36:46] PROBLEM - Check systemd state on restbase1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:36:46] PROBLEM - cassandra-b service on restbase1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [22:36:56] PROBLEM - cassandra-c service on restbase1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [22:36:56] PROBLEM - cassandra-a service on restbase1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [22:37:25] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [22:39:42] (03CR) 10Andrew Bogott: "> I can't get anything at all to load." [puppet] - 10https://gerrit.wikimedia.org/r/375452 (owner: 10Andrew Bogott) [22:39:44] (03PS1) 10Odder: Update logo in the footer button image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376150 [22:41:26] (03CR) 10Odder: "All images have been optimised with the standard optipng -o 7 option before uploading." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376150 (owner: 10Odder) [22:42:51] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3582514 (10Krinkle) >>! In T158837#3497281, @fgiunchedi wrote: > > re: coal/coal-web it should be straightforward to use the prometheus python client either by add... 
[22:44:11] (03PS2) 10Odder: Update logo in the footer button image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376150 (https://phabricator.wikimedia.org/T174603) [22:46:56] RECOVERY - Host labstore1006 is UP: PING WARNING - Packet loss = 93%, RTA = 0.46 ms [22:48:56] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:48:56] PROBLEM - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:04] 10Operations, 10Performance-Team, 10hardware-requests: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Krinkle) [22:49:06] PROBLEM - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:17] PROBLEM - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:17] PROBLEM - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:26] PROBLEM - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:27] PROBLEM - Check size of conntrack table on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:36] PROBLEM - SSH on labstore1006 is CRITICAL: connect to address 208.80.154.7 and port 22: Connection refused [22:49:43] 10Operations, 10Performance-Team, 10hardware-requests: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Krinkle) [22:49:46] PROBLEM - salt-minion processes on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:46] PROBLEM - Check the NTP synchronisation status of timesyncd on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:54] 10Operations, 10Performance-Team, 10hardware-requests: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Krinkle) [22:49:56] PROBLEM - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds [22:49:57] 
10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) [22:50:09] 10Operations, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Krinkle) [22:56:55] (03PS1) 10Krinkle: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) [22:57:27] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100% [23:00:09] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170905T2300). Please do the needful. [23:00:09] Dmaza, raynor, and kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:36] I'm available [23:00:40] ^ [23:02:37] RECOVERY - Host labstore1006 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [23:02:46] RECOVERY - SSH on labstore1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [23:03:08] I'm available [23:04:37] PROBLEM - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:04:46] PROBLEM - Check size of conntrack table on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:04:56] PROBLEM - salt-minion processes on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:05:06] PROBLEM - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:05:06] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:05:07] PROBLEM - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:05:16] PROBLEM - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:05:36] PROBLEM - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:05:36] PROBLEM - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:06:56] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [23:10:08] (03PS1) 10RobH: updating labstore100[67] partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/376152 (https://phabricator.wikimedia.org/T167984) [23:10:32] (03CR) 10RobH: [C: 032] updating labstore100[67] partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/376152 (https://phabricator.wikimedia.org/T167984) (owner: 10RobH) [23:11:17] RECOVERY - Restbase root url on restbase-dev1004 is OK: HTTP OK: HTTP/1.1 200 - 15664 bytes in 0.011 second response time [23:12:03] 10Operations, 10Continuous-Integration-Config, 10Release-Engineering-Team (Backlog): operations-puppet-tests-docker console output lacks color - 
https://phabricator.wikimedia.org/T175057#3582656 (greg) [23:19:27] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100% [23:22:06] RECOVERY - Host labstore1006 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [23:23:06] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:24:16] PROBLEM - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:17] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:17] PROBLEM - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:26] PROBLEM - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:46] PROBLEM - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:46] PROBLEM - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:47] PROBLEM - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:24:56] PROBLEM - Check size of conntrack table on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:25:06] PROBLEM - salt-minion processes on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:27:57] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2017214 [23:29:07] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100% [23:30:16] RECOVERY - Host labstore1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [23:31:46] is there anyone to do the SWAT today? addshore, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika? 
[23:31:59] sorry, I'm in a meeting [23:32:17] PROBLEM - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:26] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:28] PROBLEM - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:36] PROBLEM - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:47] PROBLEM - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:47] PROBLEM - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:56] PROBLEM - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:32:57] PROBLEM - Check size of conntrack table on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:33:03] Me too. :( [23:33:04] !log 2FA disabled for Omshivaprakash (T175075) [23:33:16] PROBLEM - salt-minion processes on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:18] T175075: Lost two-factor authentication recovery keys & device - https://phabricator.wikimedia.org/T175075 [23:34:52] I'm not sure stashbot should be reading the titles of Security tasks. :/ [23:35:11] foks: It's public [23:35:16] So it will read it [23:35:16] PROBLEM - salt-minion processes on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:22] oh, I see! [23:35:26] PROBLEM - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:27] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:27] PROBLEM - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:29] what he said [23:35:29] I didn't see that. Sorry. 
:D [23:35:32] If it's actually hidden from public, it won't, and it won't post :P [23:35:36] PROBLEM - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:52] DMaza: kaldari Are you here for swat? [23:35:56] Saw the security tag and assumed [23:35:56] PROBLEM - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:56] PROBLEM - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:57] PROBLEM - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds [23:35:58] yup [23:36:00] Both said they are.. [23:36:01] * Reedy looks [23:36:01] kbai! [23:36:04] Reedy: Here [23:36:06] PROBLEM - Check size of conntrack table on labstore1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:36:15] (PS4) Reedy: Enable AbuseFilter runtime profile [mediawiki-config] - https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) (owner: Dmaza) [23:36:16] RECOVERY - salt-minion processes on labstore1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:36:18] (CR) Reedy: [C: 2] Enable AbuseFilter runtime profile [mediawiki-config] - https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) (owner: Dmaza) [23:36:26] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:36:27] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set [23:36:36] RECOVERY - Disk space on labstore1006 is OK: DISK OK [23:36:37] RECOVERY - DPKG on labstore1006 is OK: All packages OK [23:36:47] foks: OOI, how do you disable 2FA? SQL query? Or run the maintenance script? 
[23:36:56] RECOVERY - dhclient process on labstore1006 is OK: PROCS OK: 0 processes with command name dhclient [23:36:56] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational [23:37:06] RECOVERY - configured eth on labstore1006 is OK: OK - interfaces up [23:37:06] RECOVERY - Check size of conntrack table on labstore1006 is OK: OK: nf_conntrack is 0 % full [23:37:48] (Merged) jenkins-bot: Enable AbuseFilter runtime profile [mediawiki-config] - https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) (owner: Dmaza) [23:37:58] (CR) jenkins-bot: Enable AbuseFilter runtime profile [mediawiki-config] - https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) (owner: Dmaza) [23:38:07] Reedy: script as described on wikitech [23:38:32] tzatziki: Cool. Just wanted to make sure [23:38:42] Cause I did create the script to make it easier :) [23:39:11] Reedy: cool! [23:39:50] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable AbuseFilter runtime profile T161059 (duration: 00m 49s) [23:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:03] T161059: Measure AbuseFilter runtime - https://phabricator.wikimedia.org/T161059 [23:43:26] (CR) Reedy: [C: -1] "It can't be in wmf-config/extension-list as it's not branched in WMF deployment branches" [mediawiki-config] - https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) (owner: Kaldari) [23:45:01] (PS2) Reedy: Test ArticleCreationWorkflow extension on the Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) (owner: Kaldari) [23:45:04] (PS3) Reedy: Test ArticleCreationWorkflow extension on the Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) (owner: Kaldari) [23:47:25] Reedy: How do you see 
if an extension is branched in WMF deployment branches? [23:47:33] Reedy: Whatever happened to testing stuff on mwdebug1002? :P Move fast and break things? [23:47:53] https://github.com/wikimedia/mediawiki/tree/wmf/1.30.0-wmf.17/extensions [23:48:07] https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json [23:48:10] kaldari: ^ one of those :) [23:49:50] RECOVERY - Check the NTP synchronisation status of timesyncd on labstore1006 is OK: OK: synced at Tue 2017-09-05 23:49:43 UTC. [23:50:30] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [23:52:16] (CR) Reedy: [C: 2] Test ArticleCreationWorkflow extension on the Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) (owner: Kaldari) [23:53:50] raynor: tests keep failing on https://gerrit.wikimedia.org/r/375398 [23:54:06] let me check that ;/ [23:54:10] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 34 failures. Last run 2 minutes ago with 34 failures. 
Failed resources (up to 3 shown): Package[ntp],Service[systemd-timesyncd],Service[diamond],Service[prometheus-node-exporter] [23:54:17] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3582771 (RobH) [23:54:26] (Merged) jenkins-bot: Test ArticleCreationWorkflow extension on the Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) (owner: Kaldari) [23:54:44] Reedy: it's Luad sandbox [23:54:47] Lua* [23:55:03] this patch changes one SVG file [23:55:27] Operations, Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (RobH) a:RobH>madhuvishy Ok, after the cards were swapped, the disks now detect in the same order as other hosts. IE: the raid1 flex bays setup as raid1 are showing as... [23:55:31] it might be not related, let me check it [23:55:37] I don't think it is [23:56:08] (CR) jenkins-bot: Test ArticleCreationWorkflow extension on the Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/376079 (https://phabricator.wikimedia.org/T175054) (owner: Kaldari) [23:58:32] !log reedy@tin Synchronized wmf-config/: Test ArticleCreationWorkflow extension on the Beta Cluster (duration: 00m 51s) [23:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:57] Reedy - it has to be sth different. locally skin tests work properly, I'm executing the MW Tests scenario - it might take couple minutes on my machine [23:59:32] I checked the console output in jenkins and it looks like sth else [23:59:38] I'm pretty sure it is too [23:59:44] I'm just wondering why it's suddenly started breaking [23:59:47] It seems ok on master.. But broken on .16