[00:04:23] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:05:22] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74752 bytes in 1.535 second response time
[00:17:48] Operations, Traffic, Accessibility, Browser-Support-Internet-Explorer: Wikipedia no longer accessible to those using some braille devices - https://phabricator.wikimedia.org/T185582#3943836 (Cameron11598) Sent!
[00:30:22] PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:12] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 74750 bytes in 0.301 second response time
[00:48:07] Operations, Cloud-Services, netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#3943871 (brion) Resolved>Open I'm encountering this problem again; the routes seem to have changed but symptoms are similar -- I see a...
[00:54:48] * brion blames comcast, but it might be telia :D
[01:48:23] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:49:22] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74493 bytes in 0.298 second response time
[03:26:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 870.19 seconds
[03:58:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 47.14 seconds
[04:25:43] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:26:02] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[04:26:15] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[04:28:21] <_joe_> here I am
[04:28:25] <_joe_> loads of fun
[04:29:12] <_joe_> is anyone else getting paged?
[04:33:12] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[04:33:15] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time
[04:33:31] <_joe_> !log restarted etcdmirror on conf2002, failure caused by raid resyncs in codfw
[04:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:52] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[04:35:00] I did get the page
[05:42:22] PROBLEM - MegaRAID on analytics1038 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[06:02:22] RECOVERY - MegaRAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[06:02:23] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:02:36] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[06:02:52] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[06:04:23] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[06:04:45] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.076 second response time
[06:04:52] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[06:18:55] <_joe_> !log reduced raid resync speed on conf2* to 5000 KB/s
[06:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:10] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3944015 (Tgr) Probably should get Bitergia integration by the time of production deployment.
[07:12:22] PROBLEM - MegaRAID on analytics1038 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[08:14:32] (PS3) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393)
[08:21:58] (PS4) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393)
[09:12:22] RECOVERY - MegaRAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[09:32:12] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3944108 (Tgr) Will need monitoring as well. There is an [[https://meta.discourse.org/t/prometheus-exporter-plugin-for-discourse/72666|official Prometheus exporter]] whic...
[09:58:33] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[09:59:19] ahh how is gerrit now so fast
[09:59:32] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 74554 bytes in 0.092 second response time
[10:05:07] added downtime for an1038, we'll try to swap the bbu this week
[10:10:07] I have question
[10:10:19] Which tests mw-testskin run?
[10:14:38] (PS1) Amire80: Add sitename for sdwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/408032 (https://phabricator.wikimedia.org/T184521)
[10:15:07] (CR) Zoranzoki21: [C: 1] Add sitename for sdwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/408032 (https://phabricator.wikimedia.org/T184521) (owner: Amire80)
[11:24:32] PROBLEM - Apache HTTP on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:24:42] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[11:24:48] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[11:25:13] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:25:22] RECOVERY - Apache HTTP on mw2206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.117 second response time
[11:27:02] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:27:42] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[11:27:52] again?
[11:27:55] ^ I ran puppet on that host
[11:27:57] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time
[11:28:01] and it was brought up
[11:28:22] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[11:45:53] PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.601 second response time
[11:46:02] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[11:46:23] PROBLEM - Apache HTTP on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[11:46:53] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 74487 bytes in 0.215 second response time
[11:47:03] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.095 second response time
[11:47:23] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.062 second response time
[11:57:02] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:43:50] (Draft2) محمد شعیب: Enable ArticlePlaceholder ext for urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/408043 (https://phabricator.wikimedia.org/T186451)
[14:04:10] (PS1) Urbanecm: Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347)
[14:17:04] (PS1) Urbanecm: Change cswiki logo for celebration - 400k [mediawiki-config] - https://gerrit.wikimedia.org/r/408046 (https://phabricator.wikimedia.org/T186455)
[14:21:55] (CR) Zoranzoki21: Enable ArticlePlaceholder ext for urwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/408043 (https://phabricator.wikimedia.org/T186451) (owner: محمد شعیب)
[15:06:32] Changes related to phabricator merged on gerrit will be deployed for 2 days? I forgot..
[15:08:04] (PS3) محمد شعیب: Enable ArticlePlaceholder ext for urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/408043 (https://phabricator.wikimedia.org/T186451)
[15:18:01] (PS4) Zoranzoki21: Enable ArticlePlaceholder ext for urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/408043 (https://phabricator.wikimedia.org/T186451) (owner: محمد شعیب)
[15:18:16] (CR) Zoranzoki21: "Now is ok" [mediawiki-config] - https://gerrit.wikimedia.org/r/408043 (https://phabricator.wikimedia.org/T186451) (owner: محمد شعیب)
[15:22:11] (CR) Jayprakash12345: [C: 1] Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: Urbanecm)
[15:33:04] (CR) Zoranzoki21: [C: 1] Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: Urbanecm)
[17:04:33] (CR) Zoranzoki21: [C: -1] "Rebase patch and do tips which told user Framawiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/403120 (owner: محمد شعیب)
[17:05:13] (Abandoned) Framawiki: Allow euwiki bureaucrats to add/remove 'accountcreator' right [mediawiki-config] - https://gerrit.wikimedia.org/r/405771 (https://phabricator.wikimedia.org/T185531) (owner: Framawiki)
[17:17:04] Please abandon this patch: https://gerrit.wikimedia.org/r/#/c/137982/
[17:17:04] (PS1) Framawiki: Remove old 'accountcreator' rules now handled by default [mediawiki-config] - https://gerrit.wikimedia.org/r/408071 (https://phabricator.wikimedia.org/T185417)
[17:20:30] (PS2) Framawiki: Remove old 'accountcreator' rules now handled by default [mediawiki-config] - https://gerrit.wikimedia.org/r/408071 (https://phabricator.wikimedia.org/T185417)
[17:29:46] (PS3) محمد شعیب: Changing namespaces on some Urdu language projects. [mediawiki-config] - https://gerrit.wikimedia.org/r/403120
[17:29:55] (CR) jerkins-bot: [V: -1] Changing namespaces on some Urdu language projects. [mediawiki-config] - https://gerrit.wikimedia.org/r/403120 (owner: محمد شعیب)
[17:34:48] (Abandoned) محمد شعیب: Changing namespaces on some Urdu language projects. [mediawiki-config] - https://gerrit.wikimedia.org/r/403120 (owner: محمد شعیب)
[18:09:30] Operations: Backport firejail 0.9.52 for use on Wikimedia appservers - https://phabricator.wikimedia.org/T179022#3944791 (Legoktm) stalled>Open @MoritzMuehlenhoff firejail 0.9.52 has been released and is in unstable and stretch-backports.
[18:16:31] (PS1) Zoranzoki21: Disable Flow extension on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463)
[19:26:55] Why gerrit sending emails with content (example): $1 would like to $2 review the patch $3?
[19:30:23] It shouldn't be sending that. If it is then that's a bug.
[19:33:06] Oh is that an example?
[19:36:16] Example
[19:36:47] Real: Reviewer-bot would like Urbanecm to review this change.
[19:36:56] Why he sending it?
[19:37:27] He have not to send it
[19:37:42] a reviewer added you to a change
[19:37:46] and wants you to review it
[19:38:35] But why gerrit send ME to reviewer-bot would like Urbanecm to review this change WHICH IS NOT MY CHANGE
[19:39:34] Because you're on the change. And that was a known bug in 2.13 which was fixed.
[19:39:58] It is on 2.14.6-7-g55dde9d68b which you have current
[19:40:00] but with the introduction of notedb in 2.14+ (will be used in 2.15) it does not behave the same as reviewdb.
[19:40:30] ok
[19:40:36] thank you very much
[20:22:42] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4029 is CRITICAL: connect to address 10.128.0.129 and port 3128: Connection refused
[20:23:42] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4029 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time
[22:40:53] !log restart aphlict.service on phab1001 to force it to pick up the new logfile (/var/log/aphlict/aphlict.log rather than the .log.1)
[22:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:10] (PS1) Elukey: phabricator: add copytruncate to aphlict's logrotate [puppet] - https://gerrit.wikimedia.org/r/408222
[22:47:42] (CR) Elukey: [C: 2] phabricator: add copytruncate to aphlict's logrotate [puppet] - https://gerrit.wikimedia.org/r/408222 (owner: Elukey)
[22:47:50] mutante: --^
[22:53:33] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:23:33] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures