[00:46:04] (03PS1) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:47:22] (03PS2) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:48:36] (03PS3) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:49:41] (03PS4) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:51:28] (03CR) 10Dzahn: [C: 032] "just like schana said "I think nginx needs "^~" before the node location to properly match."" [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782) (owner: 10Dzahn)
[01:17:24] (03CR) 10Dzahn: WIP: Gerrit: Setup rsync between old and new machines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296957 (owner: 10Chad)
[01:20:03] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:22:22] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[02:23:37] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.8) (duration: 08m 52s)
[02:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jul 2 02:29:17 UTC 2016 (duration 5m 40s)
[02:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:36:24] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:56:23] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:03:03] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[05:20:32] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:32:01] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:32:11] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:33:01] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:30] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:57:21] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:31] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:52] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:04:24] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:33] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[07:32:03] 06Operations, 10LDAP-Access-Requests: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2423349 (10siddharth11) @hashar it's fine :) I still don't have access to LDAP. Can someone help with its enrolment process?
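[Editor's note on the `^~` fix merged above as r297115: in nginx, `^~` is a prefix-location modifier; when such a location is the longest matching prefix for a request, nginx skips the regular-expression locations that would otherwise be checked afterwards and could steal the request. A minimal sketch of the idea, not the actual ORES configuration (the paths and the upstream address are hypothetical):]

```nginx
# Without ^~, the regex location below would win for a request like
# /node/app.js, because regex locations are checked after prefix
# matching. With ^~, the longest-prefix match is final and regex
# locations are not evaluated for this path.
location ^~ /node/ {
    proxy_pass http://127.0.0.1:8080;   # hypothetical upstream
}

location ~* \.(css|js|png)$ {
    root /srv/static;                   # hypothetical static root
}
```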
[07:40:04] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[07:54:53] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:07:02] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:11:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:13:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:17:22] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[08:19:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[08:20:13] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[08:26:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:28:52] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[08:29:12] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:21:41] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:08:57] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2423465 (10Trijnstel) HELLO? Is anyone working on this? It's pretty annoying that no one seems to care to quickly fix this... I apologize if I'm mistaken, but it cer...
[11:17:19] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:35:40] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2423481 (10Dereckson) Nobody is currently assigned to solve this task. @andre Could you try to find someone with a knowledge of how global rename code works, so we...
[13:33:37] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:44] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:50] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:56] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:05] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:12] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR
[13:34:39] PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 47, down: 27, dormant: 0, excluded: 0, unused: 0BRge-2/0/12: down - BRfab0: down - BRfab1.0: down - BRswfab1.0: down - BRfab0.0: down - BRge-2/0/6: down - rigelBRswfab0.0: down - BRswfab0: down - BRge-2/0/7: down - fdb2001BRvlan.2133: down - Subnet frack-bastion-codfwBRge-2/0/3: down - hekaBRge-2/0/15: down - fmsw1-codfw:??? {#???}BRge-
[13:34:42] <_joe_> sigh
[13:34:49] apergos: around?
[13:35:09] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[13:35:19] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms
[13:35:25] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms
[13:35:28] oh frack
[13:35:32] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms
[13:35:38] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms
[13:35:45] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 37.36 ms
[13:35:52] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms
[13:37:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[13:37:29] RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0
[13:37:35] and they're back
[13:37:37] meh
[13:38:11] 06Operations, 10Mobile-Content-Service, 10ORES, 06Services: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2421697 (10GWicke) @Ladsgroup, that would be great as a stop-gap. Resource usage was one of the concerns we brought up in the initial ORES-to-productio...
[13:40:19] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 37.57 ms
[13:45:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:46:08] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: Connection refused
[13:46:58] PROBLEM - cassandra-b service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[13:49:59] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[13:50:08] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:53:30] RECOVERY - cassandra-b service on restbase1012 is OK: OK - cassandra-b is active
[13:54:58] RECOVERY - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is OK: TCP OK - 0.003 second response time on port 9042
[13:55:08] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[14:00:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[14:05:15] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 156 seconds ago with 0 failures
[14:16:02] ostriches: Hi, are you currently here? I need you for something urgent
[14:17:21] But phabricator repo in phabricator.wikimedia.org will need updating before applied
[14:17:27] Luke081515
[14:17:39] ostriches https://phabricator.wikimedia.org/T139236
[14:19:06] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:38:00] (03PS1) 10Hashar: scap: align spec with other modules [puppet] - 10https://gerrit.wikimedia.org/r/297129 (https://phabricator.wikimedia.org/T78342)
[14:58:38] (03CR) 10Hashar: "The commit message is one year and a half old and hasn't been updated as the change has been rebased." [puppet] - 10https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar)
[15:01:10] (03PS11) 10Hashar: (WIP) wmflib: basic spec for os_version() (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/178810
[15:31:02] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:33:30] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[15:36:53] (03PS1) 10Hashar: wmflib: update spec to rspec 3.x [puppet] - 10https://gerrit.wikimedia.org/r/297132
[15:38:08] (03CR) 10jenkins-bot: [V: 04-1] wmflib: update spec to rspec 3.x [puppet] - 10https://gerrit.wikimedia.org/r/297132 (owner: 10Hashar)
[15:45:31] (03PS1) 10Hashar: wmflib: mute hiera debug log in spec [puppet] - 10https://gerrit.wikimedia.org/r/297133
[15:46:51] (03CR) 10jenkins-bot: [V: 04-1] wmflib: mute hiera debug log in spec [puppet] - 10https://gerrit.wikimedia.org/r/297133 (owner: 10Hashar)
[16:08:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:11:11] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[16:37:23] (03PS1) 10Odder: Update logo settings for the Nepali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240)
[16:38:11] (03CR) 10Odder: "I ran optipng -o7 on all three files before submitting this patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240) (owner: 10Odder)
[16:41:14] so I just had an e-mail from Gerrit marked as Spam by my e-mail filter
[16:41:35] threshold is 5.5, scored 5.7
[16:57:31] PROBLEM - Host asw-b-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[16:59:11] PROBLEM - Host mr1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[16:59:53] PROBLEM - Host asw-c-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:11] PROBLEM - Host asw-d-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:11] PROBLEM - Host asw-a-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:31] PROBLEM - Host msw1-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:32] PROBLEM - Host cr1-eqdfw IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ffff::6
[17:03:41] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[17:05:02] PROBLEM - Host mr1-codfw IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ffff::6
[17:23:13] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: puppet fail
[17:34:38] legoktm: you online right now? :)
[17:42:39] (03PS1) 10Ladsgroup: ores: reduce the web workers to 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/297137 (https://phabricator.wikimedia.org/T139177)
[17:49:31] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[18:37:28] (03CR) 10Ladsgroup: "halfak approved :)" [puppet] - 10https://gerrit.wikimedia.org/r/297137 (https://phabricator.wikimedia.org/T139177) (owner: 10Ladsgroup)
[18:58:02] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:08:42] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:11:02] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[19:15:47] !log Deployed hotfix to phabricator. Restarted apache2 on iridium
[19:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:22:32] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[21:07:38] Steinsplitter: hi
[21:08:01] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail
[21:21:39] * twkozlowski hurting
[21:22:09] thank you twentyafterfour
[21:22:13] for the phab update.
[21:22:15] any issues at this time? I'm having problems receiving CSS
[21:22:42] I don't know of any, no
[21:23:12] mw1109 is the one serving me
[21:28:05] hm yeah I've seen no other reports
[21:29:00] it's after midnight here so I'm pretty checked out
[21:29:02] 06Operations, 10Mobile-Content-Service, 10ORES, 06Services, 13Patch-For-Review: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2424171 (10Ladsgroup) Okay, I checked the ores uwsgi files and each node has 96 web processors (= 192 processors) since most of t...
[21:30:16] 06Operations, 10Mobile-Content-Service, 10ORES, 06Revision-Scoring-As-A-Service, and 2 others: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2424173 (10Ladsgroup)
[21:34:02] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[23:26:34] u-u.
[23:27:10] * twkozlowski can't reach Phabricator
[23:27:24] or Wikimania 2016 wiki through my mobile phone, for that matter
[23:27:28] works here
[23:32:36] works on my 4G, too
[23:46:59] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail
[23:54:47] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
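[Editor's note on the T139177 worker-count discussion above: uwsgi's `processes` option controls how many web workers each instance forks, and resident memory grows roughly linearly with that count, which is why "reduce the web workers to 3/4" (r297137) is a memory-pressure stop-gap. A hedged sketch of the kind of change involved; the file path and the before/after numbers are illustrative (96 comes from the log, 72 is simply 3/4 of it), not the actual ORES configuration:]

```ini
; /etc/uwsgi/apps-enabled/ores.ini  (hypothetical path)
[uwsgi]
; Each scb node was running 96 web workers (192 across the pair);
; dropping to 3/4 trades some peak concurrency for memory headroom.
processes = 72   ; was 96 (illustrative numbers)
```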
[23:56:58] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[23:57:25] twkozlowski, so it doesn't work on your mobile data connection unless you use 4G?