[00:46:04] (03PS1) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:47:22] (03PS2) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:48:36] (03PS3) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:49:41] (03PS4) 10Dzahn: ores: fix-up web node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782)
[00:51:28] (03CR) 10Dzahn: [C: 032] "just like schana said "I think nginx needs "^~" before the node location to properly match."" [puppet] - 10https://gerrit.wikimedia.org/r/297115 (https://phabricator.wikimedia.org/T134782) (owner: 10Dzahn)
[01:17:24] (03CR) 10Dzahn: WIP: Gerrit: Setup rsync between old and new machines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296957 (owner: 10Chad)
[01:20:03] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:22:22] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[02:23:37] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.8) (duration: 08m 52s)
[02:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jul 2 02:29:17 UTC 2016 (duration 5m 40s)
[02:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:36:24] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:56:23] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:03:03] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[05:20:32] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:32:01] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:32:11] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:33:01] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:30] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:57:21] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:31] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:52] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:04:24] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:33] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[07:32:03] 06Operations, 10LDAP-Access-Requests: LDAP Account required for Transparency Report - https://phabricator.wikimedia.org/T138369#2423349 (10siddharth11) @hashar it's fine :) I still don't have access to LDAP. Can someone help with its enrolment process?
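[Editor's note on the `^~` fix merged above as r297115: in nginx, `^~` is a prefix-location modifier; when such a location is the longest matching prefix for a request, nginx skips the regular-expression locations that would otherwise be checked afterwards and could steal the request. A minimal sketch of the idea, not the actual ORES configuration (the paths and the upstream address are hypothetical):]

```nginx
# Without ^~, the regex location below would win for a request like
# /node/app.js, because regex locations are checked after prefix
# matching. With ^~, the longest-prefix match is final and regex
# locations are not evaluated for this path.
location ^~ /node/ {
    proxy_pass http://127.0.0.1:8080;   # hypothetical upstream
}

location ~* \.(css|js|png)$ {
    root /srv/static;                   # hypothetical static root
}
```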
[07:40:04] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[07:54:53] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:07:02] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:11:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:13:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:17:22] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[08:19:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[08:20:13] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[08:26:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:28:52] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[08:29:12] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:21:41] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:08:57] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2423465 (10Trijnstel) HELLO? Is anyone working on this? It's pretty annoying that no one seems to care to quickly fix this... I apologize if I'm mistaken, but it cer...
[11:17:19] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:35:40] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2423481 (10Dereckson) Nobody is currently assigned to solve this task. @andre Could you try to find someone with a knowledge of how global rename code works, so we...
[13:33:37] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:44] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:50] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:56] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:05] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:12] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[13:34:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR
[13:34:39] PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 47, down: 27, dormant: 0, excluded: 0, unused: 0BRge-2/0/12: down - BRfab0: down - BRfab1.0: down - BRswfab1.0: down - BRfab0.0: down - BRge-2/0/6: down - rigelBRswfab0.0: down - BRswfab0: down - BRge-2/0/7: down - fdb2001BRvlan.2133: down - Subnet frack-bastion-codfwBRge-2/0/3: down - hekaBRge-2/0/15: down - fmsw1-codfw:??? {#???}BRge-
[13:34:42] <_joe_> sigh
[13:34:49] apergos: around?
[13:35:09] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[13:35:19] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms
[13:35:25] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms
[13:35:28] oh frack
[13:35:32] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms
[13:35:38] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms
[13:35:45] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 37.36 ms
[13:35:52] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms
[13:37:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[13:37:29] RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0
[13:37:35] and they're back
[13:37:37] meh
[13:38:11] 06Operations, 10Mobile-Content-Service, 10ORES, 06Services: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2421697 (10GWicke) @Ladsgroup, that would be great as a stop-gap. Resource usage was one of the concerns we brought up in the initial ORES-to-productio...
[13:40:19] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 37.57 ms
[13:45:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:46:08] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: Connection refused
[13:46:58] PROBLEM - cassandra-b service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[13:49:59] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[13:50:08] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:53:30] RECOVERY - cassandra-b service on restbase1012 is OK: OK - cassandra-b is active
[13:54:58] RECOVERY - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is OK: TCP OK - 0.003 second response time on port 9042
[13:55:08] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[14:00:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[14:05:15] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 156 seconds ago with 0 failures
[14:16:02] ostriches: Hi, are you currently here? I need you for something urgent
[14:17:21] But phabricator repo in phabricator.wikimedia.org will need updating before applied
[14:17:27] Luke081515
[14:17:39] ostriches https://phabricator.wikimedia.org/T139236
[14:19:06] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:38:00] (03PS1) 10Hashar: scap: align spec with other modules [puppet] - 10https://gerrit.wikimedia.org/r/297129 (https://phabricator.wikimedia.org/T78342)
[14:58:38] (03CR) 10Hashar: "The commit message is one year and a half old and hasn't been updated as the change has been rebased." [puppet] - 10https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar)
[15:01:10] (03PS11) 10Hashar: (WIP) wmflib: basic spec for os_version() (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/178810
[15:31:02] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:33:30] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[15:36:53] (03PS1) 10Hashar: wmflib: update spec to rspec 3.x [puppet] - 10https://gerrit.wikimedia.org/r/297132
[15:38:08] (03CR) 10jenkins-bot: [V: 04-1] wmflib: update spec to rspec 3.x [puppet] - 10https://gerrit.wikimedia.org/r/297132 (owner: 10Hashar)
[15:45:31] (03PS1) 10Hashar: wmflib: mute hiera debug log in spec [puppet] - 10https://gerrit.wikimedia.org/r/297133
[15:46:51] (03CR) 10jenkins-bot: [V: 04-1] wmflib: mute hiera debug log in spec [puppet] - 10https://gerrit.wikimedia.org/r/297133 (owner: 10Hashar)
[16:08:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:11:11] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[16:37:23] (03PS1) 10Odder: Update logo settings for the Nepali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240)
[16:38:11] (03CR) 10Odder: "I ran optipng -o7 on all three files before submitting this patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240) (owner: 10Odder)
[16:41:14] so I just had an e-mail from Gerrit marked as Spam by my e-mail filter
[16:41:35] threshold is 5.5, scored 5.7
[16:57:31] PROBLEM - Host asw-b-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[16:59:11] PROBLEM - Host mr1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[16:59:53] PROBLEM - Host asw-c-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:11] PROBLEM - Host asw-d-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:11] PROBLEM - Host asw-a-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:31] PROBLEM - Host msw1-codfw.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:32] PROBLEM - Host cr1-eqdfw IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ffff::6
[17:03:41] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[17:05:02] PROBLEM - Host mr1-codfw IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ffff::6
[17:23:13] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: puppet fail
[17:34:38] legoktm: you online right now? :)
[17:42:39] (03PS1) 10Ladsgroup: ores: reduce the web workers to 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/297137 (https://phabricator.wikimedia.org/T139177)
[17:49:31] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[18:37:28] (03CR) 10Ladsgroup: "halfak approved :)" [puppet] - 10https://gerrit.wikimedia.org/r/297137 (https://phabricator.wikimedia.org/T139177) (owner: 10Ladsgroup)
[18:58:02] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:08:42] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:11:02] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[19:15:47] !log Deployed hotfix to phabricator. Restarted apache2 on iridium
[19:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:22:32] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[21:07:38] Steinsplitter: hi
[21:08:01] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail
[21:21:39] * twkozlowski hurting
[21:22:09] thank you twentyafterfour
[21:22:13] for the phab update.
[21:22:15] any issues at this time? I'm having problems receiving CSS
[21:22:42] I don't know of any, no
[21:23:12] mw1109 is the one serving me
[21:28:05] hm yeah I've seen no other reports
[21:29:00] it's after midnight here so I'm pretty checked out
[21:29:02] 06Operations, 10Mobile-Content-Service, 10ORES, 06Services, 13Patch-For-Review: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2424171 (10Ladsgroup) Okay, I checked the ores uwsgi files and each node has 96 web processors (= 192 processors) since most of t...
[21:30:16] 06Operations, 10Mobile-Content-Service, 10ORES, 06Revision-Scoring-As-A-Service, and 2 others: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2424173 (10Ladsgroup)
[21:34:02] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[23:26:34] u-u.
[23:27:10] * twkozlowski can't reach Phabricator
[23:27:24] or Wikimania 2016 wiki through my mobile phone, for that matter
[23:27:28] works here
[23:32:36] works on my 4G, too
[23:46:59] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail
[23:54:47] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
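[Editor's note on the T139177 worker-count discussion above: uwsgi's `processes` option controls how many web workers each instance forks, and resident memory grows roughly linearly with that count, which is why "reduce the web workers to 3/4" (r297137) is a memory-pressure stop-gap. A hedged sketch of the kind of change involved; the file path and the before/after numbers are illustrative (96 comes from the log, 72 is simply 3/4 of it), not the actual ORES configuration:]

```ini
; /etc/uwsgi/apps-enabled/ores.ini  (hypothetical path)
[uwsgi]
; Each scb node was running 96 web workers (192 across the pair);
; dropping to 3/4 trades some peak concurrency for memory headroom.
processes = 72   ; was 96 (illustrative numbers)
```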
[23:56:58] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[23:57:25] twkozlowski, so it doesn't work on your mobile data connection unless you use 4G?