[00:11:16] !log removed 2FA from EVinente after verification T182373
[00:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:28] T182373: Disable 2FA for EVinente - https://phabricator.wikimedia.org/T182373
[00:23:05] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:25:05] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:25:56] Operations, Mail, fundraising-tech-ops: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively - https://phabricator.wikimedia.org/T182456#3824508 (Reedy)
[00:50:45] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 303 MB (3% inode=83%)
[01:16:05] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[01:17:05] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[03:13:45] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=83%)
[03:19:22] Operations: Backport firejail 0.9.52 for use on Wikimedia appservers - https://phabricator.wikimedia.org/T179022#3824685 (Legoktm)
[03:20:03] Operations, MediaWiki-Platform-Team, MediaWiki-Shell: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603#3824687 (Legoktm)
[03:24:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 825.17 seconds
[03:51:44] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 341 MB (3% inode=83%)
[03:52:17] Operations, ORES, Release-Engineering-Team, Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3824715 (mmodell)
[03:52:49] Operations, Packaging, Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3821018 (mmodell) p: Triage → High. High priority because the new version of scap will help with debugging {T181661}
[03:54:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.42 seconds
[04:21:45] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 334 MB (3% inode=83%)
[05:43:15] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 34 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[05:48:37] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:05:24] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:55] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:04] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:04] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:05] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:14] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:14] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:15] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:24] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:25] PROBLEM - SSH on ganeti1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:15] RECOVERY - SSH on ganeti1008 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:08:24] RECOVERY - Host actinium is UP: PING WARNING - Packet loss = 64%, RTA = 16.44 ms
[08:08:34] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms
[08:08:34] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 3.05 ms
[08:08:34] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[08:08:34] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[08:08:34] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms
[08:08:34] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 2.72 ms
[08:08:54] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[08:10:44] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[08:11:35] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:23:35] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:42:54] RECOVERY - DPKG on ganeti1006 is OK: All packages OK
[09:05:50] (PS2) ArielGlenn: move content translation dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396541 (https://phabricator.wikimedia.org/T179942)
[09:09:23] (PS1) Revi: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487)
[09:18:48] (PS3) ArielGlenn: move content translation dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396541 (https://phabricator.wikimedia.org/T179942)
[09:20:12] (CR) ArielGlenn: [C: 2] move content translation dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396541 (https://phabricator.wikimedia.org/T179942) (owner: ArielGlenn)
[09:25:39] (fyi the 50x were Piwik-related, due to the ganeti host issue)
[10:46:02] just wanted to say lots of people are getting a fatal error on fawiki when they want to edit or check preferences
[10:49:28] Amir1: fatal or mwexception?
[10:49:47] Amir1: "Global default 'soft' is invalid for field rcOresDamagingPref"
[10:50:00] mwexception
[10:50:16] it's all that exception
[10:50:39] https://phabricator.wikimedia.org/T182354
[10:51:40] legoktm: can we deploy a config change
[10:52:04] what do you want to change?
[10:52:31] change soft to its correct name
[10:52:46] but I'm not sure
[10:53:40] what does the exception even mean?
[10:56:25] (PS2) ArielGlenn: move production of lists of last good dumps from snapshot to web server [puppet] - https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303)
[10:57:05] >>> $wgDefaultUserOptions['rcOresDamagingPref'];
[10:57:05] => "soft"
[10:58:46] Amir1: it's because $wgOresFiltersThresholds['damaging']['likelybad'] = false;
[10:59:02] isset( $wgOresFiltersThresholds[ 'damaging' ][ $level ] ) &&
[10:59:02] $wgOresFiltersThresholds[ 'damaging' ][ $level ] !== false
[10:59:08] 'soft' maps to 'likelybad'
[10:59:24] hmm
[11:01:35] okay, let me change that to something else and we deploy
[11:01:49] legoktm: apergos: Is it okay?
[11:01:51] fawiki is the only one with ['damaging']['likelybad'] = false;
[11:02:17] eh?
[11:02:27] we have a UBN! task
[11:02:30] I'm not really here for a weekend deploy
[11:02:32] oh?
[11:02:37] Special:Preferences is broken on fawiki because of ORES
[11:02:44] https://phabricator.wikimedia.org/T182354
[11:03:12] It's not just preferences, sometimes it's editing in VE
[11:03:48] right, because preferences via api.php is broken too
[11:03:51] when did this start happening?
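The guard legoktm quotes above is PHP from the ORES extension's config validation. As a rough illustration only, the failure mode can be sketched in Python; the 'soft' → 'likelybad' mapping and the fawiki override come from the log, but every name in this sketch is invented for illustration and is not the extension's actual code.

```python
# Hypothetical sketch (not the real ORES extension code) of why fawiki broke.
# The quoted PHP guard accepts a default preference level only when the
# corresponding damaging threshold exists and is not explicitly false.

LEVEL_TO_THRESHOLD = {"soft": "likelybad"}  # other levels omitted; mapping from the log

def is_valid_damaging_default(level, ores_filters_thresholds):
    # Mirrors: isset($wgOresFiltersThresholds['damaging'][$level])
    #          && $wgOresFiltersThresholds['damaging'][$level] !== false
    key = LEVEL_TO_THRESHOLD.get(level, level)
    damaging = ores_filters_thresholds.get("damaging", {})
    return key in damaging and damaging[key] is not False

# Typical wiki: threshold configured, so the default 'soft' is accepted.
typical = {"damaging": {"likelybad": {"min": 0.6, "max": 1.0}}}
# fawiki had ['damaging']['likelybad'] = false, so the same default is rejected,
# which MediaWiki surfaces as
# "Global default 'soft' is invalid for field rcOresDamagingPref".
fawiki = {"damaging": {"likelybad": False}}

print(is_valid_damaging_default("soft", typical))  # True
print(is_valid_damaging_default("soft", fawiki))   # False
```

Since fawiki was the only wiki with that override, only fawiki hit the exception.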
[11:04:07] Amir1: honestly I'd rather just disable ORES again
[11:04:08] I mean, are we talking about something broken from a deploy last night?
[11:04:18] or it's been like this for a few days, or...?
[11:04:29] since thursday apparently
[11:04:30] and do we have the commit that broke it?
[11:04:43] good point
[11:05:21] what I would prefer is to roll back that deploy, unless it also was a response to an ubn
[11:05:37] https://gerrit.wikimedia.org/r/#/c/392452/ is the offending commit
[11:05:45] https://gerrit.wikimedia.org/r/#/c/394630/ maybe?
[11:06:03] probably
[11:06:37] december 1?
[11:06:45] this has been broken since then?
[11:07:00] it only hit fawiki on thursday I think
[11:07:15] yes
[11:07:36] I started to get reports of it from yesterday so probably yesterday
[11:07:40] *thursday
[11:08:01] I see
[11:08:16] legoktm: I can find a number for fawiki
[11:08:25] * apergos goes to look at sal for the deploys
[11:09:39] Amir1: I'd rather just disable ORES until Monday. That commit is large enough that I don't feel comfortable reverting it, and I really don't want to touch anything else over the weekend when we've already had enough problems with ORES recently
[11:10:16] legoktm: yeah, okay
[11:11:08] (PS1) Ladsgroup: Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354)
[11:11:21] https://gerrit.wikimedia.org/r/396572
[11:11:38] why did it only hit fawiki?
[11:11:49] it has a different configuration from the other wikis
[11:12:29] and I checked that no other wiki is in the exception.log
[11:12:32] ic
[11:12:53] well, disabling ORES is a pretty big hammer, but it's the cleanest
[11:12:58] given it's the weekend
[11:13:24] (CR) Legoktm: [C: 2] Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354) (owner: Ladsgroup)
[11:13:28] I'll sync it out
[11:13:43] okey dokey
[11:13:54] I'm here for a little while in case something is needed
[11:14:12] but at a certain point I'll be tuning out this channel again, so ping if there's an issue
[11:14:20] thanks, and will do
[11:14:28] I'll be up for another hour probably
[11:14:28] legoktm: thank
[11:14:30] *thanks
[11:15:04] (Merged) jenkins-bot: Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354) (owner: Ladsgroup)
[11:15:14] And I'll send an email to ops-l in a minute
[11:16:57] Amir1: test on mwdebug1002 please?
[11:17:03] sure
[11:17:12] (CR) jenkins-bot: Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354) (owner: Ladsgroup)
[11:17:31] legoktm: LGTM
[11:18:04] I successfully loaded Special:Preferences without an error (it was broken before), and changed my preferences through VE properly
[11:18:06] legoktm: I've pushed for moving RCFilters to a dedicated extension but they are not doing it :/
[11:19:27] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Disable ORES in fawiki - T182354 (duration: 00m 45s)
[11:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:39] T182354: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354
[11:21:34] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:34] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:35] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:44] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:44] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:45] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:45] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:54] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:55] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:05] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:24] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:25] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:25] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:36] uhhhh
[11:22:45] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:54] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:54] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:54] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:04] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:05] apergos: ^^
[11:23:57] Dec 9 11:21:02 mw1278 kernel: [2590628.390895] INFO: task hhvm:22782 blocked for more than 120 seconds.
[11:24:05] yeah I'm already looking, and hating it
[11:24:24] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 1.796 second response time
[11:24:31] did hhvm just lock up?
[11:25:32] sure seems that way
[11:25:39] I restarted it on mw1278
[11:25:49] though it is taking a long time to give me a command line prompt back
[11:26:31] this feels like https://phabricator.wikimedia.org/T103886
[11:27:35] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.115 second response time
[11:27:42] it happened also yesterday iirc
[11:27:51] 3/4 appservers locking up
[11:27:54] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.134 second response time
[11:28:09] I wasn't able to run hhvm-dump-debug to investigate
[11:28:56] Amir1: can you leave a note for fawiki that ORES is temporarily disabled?
[11:29:00] what are you using to restart them?
[11:29:09] I did the default (clearly wrong) service hhvm restart
[11:29:14] elukey:
[11:29:37] when it locks up in this way it is fine afaik
[11:29:39] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1278&from=now-1h&to=now
[11:30:22] legoktm: do you see anything out of the ordinary in --^
[11:30:53] where's wikibugs?
[11:30:57] I'm not really sure what to look for
[11:31:29] legoktm: I'll do
[11:31:31] legoktm: I was wondering if it's translation-cache related (seeing from the phab task that you posted)
[11:31:36] wikibugs: test
[11:31:45] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 5.751 second response time
[11:32:04] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.049 second response time
[11:32:05] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.021 second response time
[11:32:19] apergos: are you restarting them?
[11:32:24] I only restarted one
[11:32:28] ah super
[11:32:35] going to check mw1230
[11:32:37] I wasn't going to do them all at once
[11:33:04] elukey: I'm not sure if it was the TC, but the timing of me doing a deploy that touched InitialiseSettings.php and then everything locking up seemed too perfect to be a coincidence
[11:33:14] which reminded me of that task
[11:33:15] if I don't see mw1277 coming back pretty soon I'm going to do that one though
[11:33:44] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 71815 bytes in 0.111 second response time
[11:33:45] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time
[11:33:45] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.048 second response time
[11:34:50] trying to run hhvm-dump-debug on mw1230 but I think it will probably timeout
[11:34:54] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:35:15] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:35:15] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:35:18] restarting on 1277.
[11:36:26] maybe. this is really taking a long time
[11:37:07] [WivKzQpAMFQAAFwtn70AAAAR] 2017-12-09 11:36:45: Fatal exception of type "InvalidArgumentException"
[11:37:19] @wikimania2018
[11:37:25] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 0.385 second response time
[11:37:34] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time
[11:37:35] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time
[11:37:35] looking
[11:37:39] 1226 next
[11:37:52] thanks apergos
[11:37:56] Hauskatze: https://phabricator.wikimedia.org/T182344
[11:38:05] hhvm-dump-debug still hanging
[11:38:10] argh, not again :|
[11:38:32] I'll log through meta and then proxy to that wiki through
[11:39:56] mw1234 next
[11:40:05] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time
[11:40:35] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.106 second response time
[11:40:45] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 71818 bytes in 3.249 second response time
[11:40:50] Hauskatze: it was when you were trying to login?
[11:41:45] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 0.226 second response time
[11:41:47] legoktm: affirmative
[11:42:24] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.075 second response time
[11:42:30] also restarted mw1230
[11:42:32] !log restarted hhvm on api servers after lockup
[11:42:34] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.236 second response time
[11:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:44] apergos: can we specify which ones in the log ?
[11:42:52] so we'll have a trace
[11:43:14] sure
[11:43:14] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.124 second response time
[11:43:34] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.147 second response time
[11:43:45] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 71818 bytes in 1.788 second response time
[11:44:22] !log that server list: mw1278, 1277, 1226, 1234, 1230
[11:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:43] why I hate weekend deploys
[11:46:02] sorry :(
[11:46:14] not dinging you, I know it needed to be done
[11:46:16] just hate 'em
[11:46:52] we should have caught this on friday
[11:46:53] I also wish we understood that hhvm issue... but if wishes were horses, riders would go begging
[11:46:57] as the saying doesn't go :-P
[11:47:24] weird thing is that not even gdb was able to get stack traces
[11:47:35] usually this is possible
[11:47:43] what did it do, hang?
[11:49:05] do we know what... silly question maybe, but... an strace shows? is it hung on a system call, or in some tight loop, or...?
[11:49:28] (PS3) ArielGlenn: move production of lists of last good dumps from snapshot to web server [puppet] - https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303)
[11:50:48] (CR) ArielGlenn: [C: 2] move production of lists of last good dumps from snapshot to web server [puppet] - https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303) (owner: ArielGlenn)
[11:53:38] It should use ptrace to figure out where a process/thread is, but maybe that one was hanging as well for some reason
[11:55:15] apergos: everything seems stable, going afk now, will re-check later.. thanks !
[11:55:45] yep, thanks for looking
[11:57:59] thanks both :)
[12:00:12] Operations, Wikimedia-General-or-Unknown, WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#3825045 (Aklapper) a: csteipp → None
[12:06:25] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41268 MB (3% inode=99%)
[12:17:01] (PS1) ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942)
[12:17:33] (CR) jerkins-bot: [V: -1] move wikidata weekly dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: ArielGlenn)
[12:22:12] (CR) Framawiki: "And where can I find doc about it ? :)" [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki)
[12:23:53] (PS2) EddieGP: Delete mowiki and mowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: MarcoAurelio)
[12:25:16] (PS2) ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942)
[12:26:31] (CR) EddieGP: "PS2 is a rebase and re-adding a newline that PS1 wanted to remove at EOF of deleted.dblists" [mediawiki-config] - https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: MarcoAurelio)
[12:27:21] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825060 (EddieGP) @MarcoAurelio You're right about 1 at least. As far as I see we won't need any other Apache change (the virt...
[12:31:33] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3398651 (Legoktm) Can we swap the order on these changes? First, redirect the domains to the proper wikis (leaving the wikis in...
[12:31:34] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41447 MB (3% inode=99%)
[12:31:39] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825068 (MarcoAurelio) Note that 1 is abandoned as I was not able to run the ruby script to convert the dat to conf or vice vers...
[12:32:23] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825069 (10MarcoAurelio) @Legoktm Looks good to me. [12:33:04] (03PS3) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [12:38:45] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825080 (10EddieGP) >>! In T169450#3825066, @Legoktm wrote: > Can we swap the order on these changes? First, redirect the domains... [12:40:44] (03PS1) 10MarcoAurelio: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) [12:41:40] (03CR) 10jerkins-bot: [V: 04-1] wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [12:51:59] (03PS1) 10MarcoAurelio: [do not merge yet] wikimania2017: closing the wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493) [12:56:04] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [12:56:46] (03CR) 10MarcoAurelio: [C: 04-1] "Missing [ things." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [12:58:54] (03PS2) 10MarcoAurelio: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) [13:00:04] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [13:30:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:30:15] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [13:48:30] (03PS4) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [13:49:10] (03CR) 10jerkins-bot: [V: 04-1] move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [13:53:40] (03PS5) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [13:54:37] (03Restored) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [13:54:49] (03PS5) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [14:01:17] $ ruby compile_redirects.rb [14:01:17] compile_redirects.rb:1: syntax error, unexpected .. 
[14:01:17] ../../../../lib/puppet/parser/fu [14:01:17] ^ [14:01:17] compile_redirects.rb:1: unknown regexp options - lb [14:04:09] Reedy: any idea ^^ [14:05:42] * Hauskatze tries to execute the script in another location [14:10:25] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:26] PROBLEM - Nginx local proxy to apache on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.008 second response time [14:12:54] (03CR) 10EddieGP: "About the difference between funnel and rewrite:" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [14:13:24] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:13:25] RECOVERY - Nginx local proxy to apache on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time [14:14:28] (03CR) 10EddieGP: "> but instead to als.wiktionary.org/wiki/Wort:Houptsyte" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [14:22:32] (03PS1) 10ArielGlenn: remove some dead vars from content translation dump manifests [puppet] - 10https://gerrit.wikimedia.org/r/396583 [14:28:28] (03CR) 10ArielGlenn: [C: 032] remove some dead vars from content translation dump manifests [puppet] - 10https://gerrit.wikimedia.org/r/396583 (owner: 10ArielGlenn) [14:29:06] apergos: I think ../../../../lib/puppet/parser/functions/compile_redirects.rb routing is wrong [14:29:10] on puppet [14:29:31] modules/mediawiki/files/apache/sites/redirects/ [14:35:22] (03PS1) 10ArielGlenn: remove dead vars from misc dump cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/396584 [14:37:34] (03CR) 10ArielGlenn: [C: 032] remove dead vars from misc dump cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/396584 (owner: 10ArielGlenn) [14:40:26] (03PS6) 10MarcoAurelio: apache: redirect several 
wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [14:40:54] * apergos peeks in [14:41:39] (03PS7) 10MarcoAurelio: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [14:45:18] Hauskatze: you cd into the directory puppet/modules/mediawiki/files/apache/sites [14:45:38] you edit redirects/redirects.dat [14:45:44] (03CR) 10MarcoAurelio: "I finnaly managed to run that %&!!#$ script and the .conf file is now here." [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [14:46:22] you run ruby redirects/compile_redirects.rb redirects/redirects.dat > redirects.conf.saveme (say) [14:46:47] apergos, managed to run it from the lib folder and specifying manually the whole path [14:46:54] from redirects it is impossible [14:46:56] look at the diff of redirects.conf and whatever you just saved, presumably it's ok [14:47:01] says that ../ is not expected [14:47:09] you remove those, says they'reneeded [14:47:10] then put the new file into redirects.conf [14:47:30] try running it from puppet/modules/mediawiki/files/apache/sites [14:47:32] as [14:47:45] ruby redirects/compile_redirects.rb redirects/redirects.dat > something [14:47:50] https://gerrit.wikimedia.org/r/#/c/393289/ <-- apergos -- but not to be merged yet [14:48:13] most likely I will not be your merger [14:48:22] but if you got the script working, great [14:48:39] I did on lib: ruby compile_redirects.rb > redirects.conf.ma [14:48:52] checked it was okay, cut the file and pasted it with correct name [14:48:54] wrong folder [14:49:01] that's why you had to give the full path [14:49:21] but you got the output, so that's the important thing [14:49:44] always doing things more complicated that they have to be.. 
damn me :) [14:49:58] heh [14:50:34] ευχαριστώ apergos [14:50:44] τπτ [14:50:48] :) [14:52:52] (03CR) 10ArielGlenn: "Hoo, I added you as a fyi, you're not obligated to review (though of course if you see something glaring, please feel free)" [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:00:29] (03PS8) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) [15:02:44] (03PS9) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) [15:10:04] PROBLEM - Disk space on scb1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=78%) [15:21:00] (03PS1) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [15:21:35] (03CR) 10jerkins-bot: [V: 04-1] clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) (owner: 10ArielGlenn) [15:26:16] space on scb1001 is mostly the 1gb of ores celery worker logs in daemon.log, taking nearly 1g today, 3.5 million entries compared to 9k entries yesterday [15:27:14] apergos: FYI we turned the verbosity up recently… may need a new logrotate config [15:27:31] awight: please do so, I was just looking at the existig one [15:27:48] you should think about rotating when size exceeds X (300M?) [15:29:29] because this is the current log, I can't fix it unless the celery workers are restarted on that box [15:29:38] I can manually move the log elsewhere, do service uwsgi-ores reload [15:29:59] but let's get a new logrotate.conf in there in the next half hour, ok? 
awight [15:31:50] note this must cover daemon.log [15:31:53] apergos: I’m not sure how this will work—rotating alone won’t solve the problem, unless the files are flying off to an archive somewhere? [15:31:54] because that's where the issues are [15:34:43] 10Operations, 10ORES, 10Scoring-platform-team: Update logrotate config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825269 (10awight) p:05Triage>03High [15:44:04] RECOVERY - Disk space on scb1001 is OK: DISK OK [15:45:02] !log on scb1001 moved daemon.log out of the way, did "service rsyslog rotate", saved the last 5000 entries for use by ores team, removed the log [15:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:08] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update logrotate config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825284 (10ArielGlenn) Needs to happen: logging to ores logs instead of daemon.log, make sure logrot conf file for ores l... [15:48:35] !log Making an emergency deployment to ORES logging config to reduce verbosity. 
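[editor's note] A size-based stanza along the lines apergos suggests at 15:27 (rotate once the file passes 300M) might look like the following. The filename, retention count, and postrotate action are assumptions for illustration, not the config actually deployed for T182497:

```
# /etc/logrotate.d/daemon-size -- hypothetical sketch, not the deployed config
/var/log/daemon.log {
    size 300M          # rotate as soon as the log exceeds 300M
    rotate 4           # keep four old compressed logs
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        service rsyslog rotate > /dev/null 2>&1 || true
    endscript
}
```

Note that logrotate only evaluates `size` when it runs (daily by default on these systems), so a log growing at ~1G/day would likely also need logrotate invoked more often, e.g. hourly.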
[15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:04] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 329 MB (3% inode=83%) [15:51:12] I'll get there [15:51:17] I'm doing scb1002 now [15:53:04] RECOVERY - Disk space on scb1004 is OK: DISK OK [15:53:16] !log did same on scb1002,3,4 [15:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:28] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity [15:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:59] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (duration: 00m 31s) [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:24] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity [15:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:42] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (duration: 00m 17s) [15:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:13] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity [15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:06] the ores200* boxes have much smaller logs, so they must get a lot less/minimal traffic, also their / partitions are much larger so I'm ignoring them [16:02:11] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (duration: 05m 58s) [16:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:31] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3825309 (10awight) This is affecting 
me in production, now: ``` Timeout, server scb2004.codfw.wmnet not responding. 16:01:39 conn... [16:07:50] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (take 4\!) [16:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (take 4\!) (duration: 03m 01s) [16:11:01] the scb200* hosts actually do have 1G log files, but 80G root partitions so no worries there [16:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:03] (03PS2) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:16:38] (03CR) 10jerkins-bot: [V: 04-1] clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) (owner: 10ArielGlenn) [16:17:41] (03PS3) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:29:06] (03PS4) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:38:57] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update log config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825360 (10awight) [16:39:47] (03PS5) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:42:23] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10HTTPS: Check all wikis for inclusions of http resources on https - 
https://phabricator.wikimedia.org/T36670#3825367 (10Aklapper) a:05csteipp>03None [16:48:06] (03CR) 10ArielGlenn: [C: 032] clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) (owner: 10ArielGlenn) [16:48:35] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:55] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:04] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:59] hhvm hang again [16:50:08] elukey: in case you're around, wanna try anything there? [16:57:57] guess I'll go ahead and restart it [16:59:54] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.063 second response time [17:00:05] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 71830 bytes in 0.284 second response time [17:00:12] !log restarted hhvm on mw1276, the same old hang with the same old symptoms [17:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:35] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.039 second response time [17:02:32] all right, I'm checked out for the day [17:02:40] any emergencies... call someone else :-P [17:17:51] Lol [17:21:41] (03PS6) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [18:32:48] 10Operations, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#3825446 (10Aklapper) >>! In T146841#3729163, @Dzahn wrote: > @Seb35 @Peachey88 @Herron since T168467 is resolved meanwhile, d... 
[18:37:22] 10Operations, 10RT-Migration, 10Wikimedia Phabricator RfC, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#3825778 (10Ebe123) [20:07:34] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:54] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71826 bytes in 6.099 second response time [20:08:44] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time [20:09:24] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.080 second response time [20:15:25] Wikimedia\Rdbms\DBQueryError when trying to create item on Wikidata... [20:15:33] (WixEQApAIC4AAIKfjewAAAAM) [20:15:34] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:54] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:45] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.923 second response time [20:17:20] sjoerddebruin: looking... [20:17:24] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [20:17:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.208 second response time [20:17:56] sjoerddebruin: Lock wait timeout exceeded; try restarting transaction [20:18:06] Yeah, it's working again... 
[20:20:54] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:35] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:35] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 40974 MB (3% inode=99%) [20:43:20] (03PS2) 10Zoranzoki21: Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [20:48:57] (03PS2) 10Zoranzoki21: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [20:49:08] (03CR) 10Zoranzoki21: [C: 031] "Now is ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [20:55:44] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time [20:55:54] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71836 bytes in 0.096 second response time [20:56:04] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.054 second response time [21:02:14] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:04] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:44] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time [21:03:54] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71792 bytes in 0.135 
second response time [21:04:04] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [21:06:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:04] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:14] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:54] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.863 second response time [21:17:04] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71792 bytes in 0.116 second response time [21:17:05] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time [21:49:45] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 40381 MB (3% inode=99%) [22:16:04] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:14] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:24] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:04] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.594 second response time [22:18:07] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71792 bytes in 0.137 second response time [22:18:14] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [22:19:35] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update log config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825924 (10awight) p:05High>03Normal Urgent fix is deployed, lowering the 
priority. [22:19:55] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:25] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:24] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 8.623 second response time [22:22:45] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [22:25:45] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41502 MB (3% inode=99%) [23:19:24] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:19:24] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:14] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71832 bytes in 0.138 second response time [23:20:15] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.028 second response time [23:41:05] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:24] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:25] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:24] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.271 second response time [23:42:24] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71833 bytes in 5.889 second response time [23:43:04] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.035 second response time