[00:00:23] Operations, Deployment-Systems, Performance-Team, HHVM, Release-Engineering-Team (Next): Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3417383 (greg)
[00:32:57] (CR) Legoktm: "<3" [mediawiki-config] - https://gerrit.wikimedia.org/r/363970 (owner: Chad)
[01:02:10] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499475730 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9160775 keys, up 2 minutes 8 seconds - replication_delay is 1499475730
[01:02:40] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:40] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[01:02:41] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:03:30] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9252009 keys, up 3 minutes 26 seconds - replication_delay is 0
[01:03:40] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4547821 keys, up 3 minutes 28 seconds - replication_delay is 0
[01:03:41] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9253208 keys, up 3 minutes 37 seconds - replication_delay is 0
[01:04:20] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9157154 keys, up 4 minutes 9 seconds - replication_delay is 0
[01:51:20] Operations, Graphite, User-fgiunchedi: Delete "servers" metrics in graphite older than 60d - https://phabricator.wikimedia.org/T169972#3415090 (Krinkle) +1 for pruning these for servers that no longer exist. Will also make autocompletion in Grafana easier by not constantly offering completion options...
[01:54:40] PROBLEM - salt-minion processes on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:55:30] RECOVERY - salt-minion processes on thumbor1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:57:01] Operations, Graphite, User-fgiunchedi: Audit groups of metrics in Graphite that allocate a lot of disk space - https://phabricator.wikimedia.org/T1075#3417503 (Krinkle)
[01:57:10] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - prometheus_80 - Could not depool server prometheus2004.codfw.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs2003.codfw.wmnet because of too many down!: thumbor_8800 - Could not depool server thumbor2002.codfw.wmnet because of too many down!
[01:59:10] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[02:27:48] (CR) D3r1ck01: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: D3r1ck01)
[04:10:00] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=647.80 Read Requests/Sec=376.40 Write Requests/Sec=3.20 KBytes Read/Sec=48179.20 KBytes_Written/Sec=1321.20
[04:19:10] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=15.30 Read Requests/Sec=0.60 Write Requests/Sec=1.40 KBytes Read/Sec=7.60 KBytes_Written/Sec=8.00
[05:38:55] (PS1) BryanDavis: Toolforge: Update motd banners for rebranding [puppet] - https://gerrit.wikimedia.org/r/364007 (https://phabricator.wikimedia.org/T168480)
[05:55:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:56:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:03:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:04:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:21:01] (PS1) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/364011 (https://phabricator.wikimedia.org/T170033)
[08:38:10] Hello, is it possible to deploy 364011 right now? It is a throttle rule requested a few hours before event with bad project in phab... Thank you in advance!
[08:38:27] See T170033 for details
[08:38:27] T170033: Temporarily Lift IP cap for account creation for Editathon in NYC - Saturday July 8th - https://phabricator.wikimedia.org/T170033
[08:42:16] I also asked at -releng
[09:13:34] Reedy or legoktm are probs your best bet at these hours, but probably not because its a weekend and all
[09:14:00] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:14:50] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 74528 bytes in 0.154 second response time
[09:14:51] ugh
[09:15:21] Urbanecm: I'm not gonna deploy this late
[11:37:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:38:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:44:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:46:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:29:33] (Abandoned) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/364011 (https://phabricator.wikimedia.org/T170033) (owner: Urbanecm)
[15:26:26] (PS3) Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734
[15:27:02] (Abandoned) Paladox: ORES: Create user deploy-service using user and group syntax [puppet] - https://gerrit.wikimedia.org/r/363042 (https://phabricator.wikimedia.org/T169164) (owner: Paladox)
[15:50:29] (PS12) Paladox: Gerrit: Add support for scap [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414)
[15:50:32] (CR) Paladox: Gerrit: Add support for scap (2 comments) [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[16:02:59] (CR) Paladox: Gerrit: Add support for scap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[16:12:22] (CR) Paladox: "@Dzahn per @Thcipriani comment at https://gerrit.wikimedia.org/r/#/c/363726/ both private and pub keys need to be added here. The keys can" [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[16:39:44] (CR) Paladox: "Im going to rewrite a few things which will look ugly at first but will look a lot cleaner when we move to scap fully. This is to preserve" [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[17:29:29] (PS13) Paladox: Gerrit: Add support for scap [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414)
[17:30:38] (CR) jerkins-bot: [V: -1] Gerrit: Add support for scap [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[17:30:54] (CR) Paladox: "Once you flip the switch to scap_deploy true, remove /srv/deployment/ on the gerrit host so everything gets generated correctly :)." [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[17:31:57] X.ioNoX started a nice page on wikitech about how to gather data and report a connectivity problem -- https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue. It could use some love from folks with Windows and mobile device troubleshooting skills.
[17:32:15] (PS14) Paladox: Gerrit: Add support for scap [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414)
[17:38:58] bd808 all users that use windows 10 can follow the linux sections. Though older versions of windows 10 will find it hard using some commands
[17:39:13] it's safe with the creator update and the upcoming update.
[17:39:34] all they have to do is open a bash application. and then start typing the commands in.
[17:39:48] assuming the latest Windows kit is not too safe with our typical end user :)
[17:40:22] what do you mean by safe?
[17:40:45] it runs ubuntu, and now opensuse and that other variant which i forgot :)
[17:40:50] I mean that it is not common for people to be using the latest windows version and patches
[17:41:00] not safe to assume
[17:41:10] ah i see
[17:41:55] More typical is that they are running whatever version of Windows was installed on their computer when they bought it
[17:41:56] hmm, the curl commands will work i think. They can install php using the wampstack for windows.
[17:42:06] yep.
[17:42:20] * bd808 -> errands
[17:42:25] ok
[17:55:15] (PS15) Paladox: Gerrit: Add support for scap [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414)
[17:55:58] (CR) Paladox: "Actually i will do that in a separate change. The separate change will migrate fully to scap. Because the way i did it fully here was ugly" [puppet] - https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[18:43:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[18:45:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[18:45:47] ugh i really need to find a way to create a way for icinga-wm_ to remove that _ when its other connection drops, like i (if i remember right) did with jouncebot...
[18:48:06] Operations, DBA, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3418222 (MF-Warburg) It is reported to me that the VisualEditor and the "2017 wikitext editor" don't work on the wiki.
[18:48:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[18:54:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:54:10] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:32:07] Operations, DBA, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3418237 (Urbanecm) >>! In T168764#3409167, @Dcljr wrote: >>>! In T168764#3407320, @Urbanecm wrote: >> Wiki is reopened and it can be edited by anyone...
[19:48:50] Operations, DBA, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3418243 (Urbanecm) >>! In T168764#3418222, @MF-Warburg wrote: > It is reported to me that the VisualEditor and the "2017 wikitext editor" don't work...
[20:04:40] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:32:40] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[22:14:20] !log Deleted ukwikimedia records in CentralAuth localuser and localnames tables for T170005.
[22:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:32] T170005: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005
[22:37:40] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:37:55] (PS4) Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734
[22:47:44] (PS6) Paladox: gerrit: DO NOT MERGE [software/gerrit] - https://gerrit.wikimedia.org/r/363738
[22:47:53] (PS5) Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734
[23:05:40] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures