[00:11:16] !log removed 2FA from EVinente after verification T182373
[00:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:28] T182373: Disable 2FA for EVinente - https://phabricator.wikimedia.org/T182373
[00:23:05] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:25:05] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:25:56] Operations, Mail, fundraising-tech-ops: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively - https://phabricator.wikimedia.org/T182456#3824508 (Reedy)
[00:50:45] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 303 MB (3% inode=83%)
[01:16:05] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[01:17:05] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[03:13:45] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=83%)
[03:19:22] Operations: Backport firejail 0.9.52 for use on Wikimedia appservers - https://phabricator.wikimedia.org/T179022#3824685 (Legoktm)
[03:20:03] Operations, MediaWiki-Platform-Team, MediaWiki-Shell: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603#3824687 (Legoktm)
[03:24:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 825.17 seconds
[03:51:44] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 341 MB (3% inode=83%)
[03:52:17] Operations, ORES, Release-Engineering-Team, Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3824715 (mmodell)
[03:52:49] Operations, Packaging, Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3821018 (mmodell) p: Triage → High. High priority because the new version of scap will help with debugging {T181661}
[03:54:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.42 seconds
[04:21:45] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 334 MB (3% inode=83%)
[05:43:15] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 34 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[05:48:37] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:05:24] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:55] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:04] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:04] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:05] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:14] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:14] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:15] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:24] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:25] PROBLEM - SSH on ganeti1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:15] RECOVERY - SSH on ganeti1008 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:08:24] RECOVERY - Host actinium is UP: PING WARNING - Packet loss = 64%, RTA = 16.44 ms
[08:08:34] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms
[08:08:34] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 3.05 ms
[08:08:34] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[08:08:34] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[08:08:34] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms
[08:08:34] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 2.72 ms
[08:08:54] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[08:10:44] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[08:11:35] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:23:35] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:42:54] RECOVERY - DPKG on ganeti1006 is OK: All packages OK
[09:05:50] (PS2) ArielGlenn: move content translation dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396541 (https://phabricator.wikimedia.org/T179942)
[09:09:23] (PS1) Revi: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487)
[09:18:48] (PS3) ArielGlenn: move content translation dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396541 (https://phabricator.wikimedia.org/T179942)
[09:20:12] (CR) ArielGlenn: [C: 2] move content translation dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396541 (https://phabricator.wikimedia.org/T179942) (owner: ArielGlenn)
[09:25:39] (fyi the 50x were Piwik-related, due to the ganeti host issue)
[10:46:02] just wanted to say lots of people are getting a fatal error on fawiki when they want to edit or check preferences
[10:49:28] Amir1: fatal or mwexception?
[10:49:47] Amir1: "Global default 'soft' is invalid for field rcOresDamagingPref"
[10:50:00] mwexception
[10:50:16] it's all that exception
[10:50:39] https://phabricator.wikimedia.org/T182354
[10:51:40] legoktm: can we deploy a config change
[10:52:04] what do you want to change?
[10:52:31] change soft to its correct name
[10:52:46] but I'm not sure
[10:53:40] what does the exception even mean?
[10:56:25] (PS2) ArielGlenn: move production of lists of last good dumps from snapshot to web server [puppet] - https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303)
[10:57:05] >>> $wgDefaultUserOptions['rcOresDamagingPref'];
[10:57:05] => "soft"
[10:58:46] Amir1: it's because $wgOresFiltersThresholds['damaging']['likelybad'] = false;
[10:59:02] isset( $wgOresFiltersThresholds[ 'damaging' ][ $level ] ) &&
[10:59:02] $wgOresFiltersThresholds[ 'damaging' ][ $level ] !== false
[10:59:08] 'soft' maps to 'likelybad'
[10:59:24] hmm
[11:01:35] okay, let me change that to something else and we deploy
[11:01:49] legoktm: apergos: Is it okay?
[11:01:51] fawiki is the only one with ['damaging']['likelybad'] = false;
[11:02:17] eh?
[11:02:27] we have a UBN! task
[11:02:30] I'm not really here for a weekend deploy
[11:02:32] oh?
[11:02:37] Special:Preferences is broken on fawiki because of ORES
[11:02:44] https://phabricator.wikimedia.org/T182354
[11:03:12] It's not just preferences, sometimes it's editing in VE
[11:03:48] right, because preferences via api.php is broken too
[11:03:51] when did this start happening?
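The guard legoktm quotes above is PHP from the ORES extension's config validation. As a rough illustration only, the failure mode can be sketched in Python; the 'soft' → 'likelybad' mapping and the fawiki override come from the log, but every name in this sketch is invented for illustration and is not the extension's actual code.

```python
# Hypothetical sketch (not the real ORES extension code) of why fawiki broke.
# The quoted PHP guard accepts a default preference level only when the
# corresponding damaging threshold exists and is not explicitly false.

LEVEL_TO_THRESHOLD = {"soft": "likelybad"}  # other levels omitted; mapping from the log

def is_valid_damaging_default(level, ores_filters_thresholds):
    # Mirrors: isset($wgOresFiltersThresholds['damaging'][$level])
    #          && $wgOresFiltersThresholds['damaging'][$level] !== false
    key = LEVEL_TO_THRESHOLD.get(level, level)
    damaging = ores_filters_thresholds.get("damaging", {})
    return key in damaging and damaging[key] is not False

# Typical wiki: threshold configured, so the default 'soft' is accepted.
typical = {"damaging": {"likelybad": {"min": 0.6, "max": 1.0}}}
# fawiki had ['damaging']['likelybad'] = false, so the same default is rejected,
# which MediaWiki surfaces as
# "Global default 'soft' is invalid for field rcOresDamagingPref".
fawiki = {"damaging": {"likelybad": False}}

print(is_valid_damaging_default("soft", typical))  # True
print(is_valid_damaging_default("soft", fawiki))   # False
```

Since fawiki was the only wiki with that override, only fawiki hit the exception.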
[11:04:07] Amir1: honestly I'd rather just disable ORES again
[11:04:08] I mean, are we talking about something broken from a deploy last night?
[11:04:18] or it's been like this for a few days, or...?
[11:04:29] since thursday apparently
[11:04:30] and do we have the commit that broke it?
[11:04:43] good point
[11:05:21] what I would prefer is to roll back that deploy, unless it also was a response to an ubn
[11:05:37] https://gerrit.wikimedia.org/r/#/c/392452/ is the offending commit
[11:05:45] https://gerrit.wikimedia.org/r/#/c/394630/ maybe?
[11:06:03] probably
[11:06:37] december 1?
[11:06:45] this has been broken since then?
[11:07:00] it only hit fawiki on thursday I think
[11:07:15] yes
[11:07:36] I started to get reports of it from yesterday so probably yesterday
[11:07:40] *thursday
[11:08:01] I see
[11:08:16] legoktm: I can find a number for fawiki
[11:08:25] * apergos goes to look at sal for the deploys
[11:09:39] Amir1: I'd rather just disable ORES until Monday. That commit is large enough that I don't feel comfortable reverting it, and I really don't want to touch anything else over the weekend when we've already had enough problems with ORES recently
[11:10:16] legoktm: yeah, okay
[11:11:08] (PS1) Ladsgroup: Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354)
[11:11:21] https://gerrit.wikimedia.org/r/396572
[11:11:38] why did it only hit fawiki?
[11:11:49] it has a different configuration from the other wikis
[11:12:29] and I checked that no other wiki is in the exception.log
[11:12:32] ic
[11:12:53] well, disabling ORES is a pretty big hammer, but it's the cleanest
[11:12:58] given it's the weekend
[11:13:24] (CR) Legoktm: [C: 2] Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354) (owner: Ladsgroup)
[11:13:28] I'll sync it out
[11:13:43] okey dokey
[11:13:54] I'm here for a little while in case something is needed
[11:14:12] but at a certain point I'll be tuning out this channel again, so ping if there's an issue
[11:14:20] thanks, and will do
[11:14:28] I'll be up for another hour probably
[11:14:28] legoktm: thank
[11:14:30] *thanks
[11:15:04] (Merged) jenkins-bot: Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354) (owner: Ladsgroup)
[11:15:14] And I'll send an email to ops-l in a minute
[11:16:57] Amir1: test on mwdebug1002 please?
[11:17:03] sure
[11:17:12] (CR) jenkins-bot: Disable ORES in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396572 (https://phabricator.wikimedia.org/T182354) (owner: Ladsgroup)
[11:17:31] legoktm: LGTM
[11:18:04] I successfully loaded Special:Preferences without an error (it was broken before), and changed my preferences through VE properly
[11:18:06] legoktm: I've pushed for moving RCFilters to a dedicated extension but they are not doing it :/
[11:19:27] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Disable ORES in fawiki - T182354 (duration: 00m 45s)
[11:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:39] T182354: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354
[11:21:34] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:34] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:35] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:44] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:44] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:45] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:45] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:54] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:21:55] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:05] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:24] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:25] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:25] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:36] uhhhh
[11:22:45] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:54] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:54] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:54] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:04] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:05] apergos: ^^
[11:23:57] Dec 9 11:21:02 mw1278 kernel: [2590628.390895] INFO: task hhvm:22782 blocked for more than 120 seconds.
[11:24:05] yeah I'm already looking, and hating it
[11:24:24] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 1.796 second response time
[11:24:31] did hhvm just lock up?
[11:25:32] sure seems that way
[11:25:39] I restarted it on mw1278
[11:25:49] though it is taking a long time to give me a command line prompt back
[11:26:31] this feels like https://phabricator.wikimedia.org/T103886
[11:27:35] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.115 second response time
[11:27:42] it happened also yesterday iirc
[11:27:51] 3/4 appservers locking up
[11:27:54] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.134 second response time
[11:28:09] I wasn't able to run hhvm-dump-debug to investigate
[11:28:56] Amir1: can you leave a note for fawiki that ORES is temporarily disabled?
[11:29:00] what are you using to restart them?
[11:29:09] I did the default (clearly wrong) service hhvm restart
[11:29:14] elukey:
[11:29:37] when it locks up in this way it is fine afaik
[11:29:39] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1278&from=now-1h&to=now
[11:30:22] legoktm: do you see anything out of the ordinary in --^
[11:30:53] where's wikibugs?
[11:30:57] I'm not really sure what to look for
[11:31:29] legoktm: I'll do
[11:31:31] legoktm: I was wondering if it's translation-cache related (seeing from the phab task that you posted)
[11:31:36] wikibugs: test
[11:31:45] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 5.751 second response time
[11:32:04] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.049 second response time
[11:32:05] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.021 second response time
[11:32:19] apergos: are you restarting them?
[11:32:24] I only restarted one
[11:32:28] ah super
[11:32:35] going to check mw1230
[11:32:37] I wasn't going to do them all at once
[11:33:04] elukey: I'm not sure if it was the TC, but the timing of me doing a deploy that touched InitialiseSettings.php and then everything locking up seemed too perfect to be a coincidence
[11:33:14] which reminded me of that task
[11:33:15] if I don't see mw1277 coming back pretty soon I'm going to do that one though
[11:33:44] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 71815 bytes in 0.111 second response time
[11:33:45] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time
[11:33:45] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.048 second response time
[11:34:50] trying to run hhvm-dump-debug on mw1230 but I think it will probably timeout
[11:34:54] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:35:15] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:35:15] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:35:18] restarting on 1277.
[11:36:26] maybe. this is really taking a long time
[11:37:07] [WivKzQpAMFQAAFwtn70AAAAR] 2017-12-09 11:36:45: Fatal exception of type "InvalidArgumentException"
[11:37:19] @wikimania2018
[11:37:25] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 0.385 second response time
[11:37:34] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time
[11:37:35] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time
[11:37:35] looking
[11:37:39] 1226 next
[11:37:52] thanks apergos
[11:37:56] Hauskatze: https://phabricator.wikimedia.org/T182344
[11:38:05] hhvm-dump-debug still hanging
[11:38:10] argh, not again :|
[11:38:32] I'll log through meta and then proxy to that wiki through
[11:39:56] mw1234 next
[11:40:05] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time
[11:40:35] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.106 second response time
[11:40:45] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 71818 bytes in 3.249 second response time
[11:40:50] Hauskatze: it was when you were trying to login?
[11:41:45] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 71816 bytes in 0.226 second response time
[11:41:47] legoktm: affirmative
[11:42:24] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.075 second response time
[11:42:30] also restarted mw1230
[11:42:32] !log restarted hhvm on api servers after lockup
[11:42:34] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.236 second response time
[11:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:44] apergos: can we specify which ones in the log ?
[11:42:52] so we'll have a trace
[11:43:14] sure
[11:43:14] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.124 second response time
[11:43:34] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.147 second response time
[11:43:45] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 71818 bytes in 1.788 second response time
[11:44:22] !log that server list: mw1278, 1277, 1226, 1234, 1230
[11:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:43] why I hate weekend deploys
[11:46:02] sorry :(
[11:46:14] not dinging you, I know it needed to be done
[11:46:16] just hate 'em
[11:46:52] we should have caught this on friday
[11:46:53] I also wish we understood that hhvm issue... but if wishes were horses, riders would go begging
[11:46:57] as the saying doesn't go :-P
[11:47:24] weird thing is that not even gdb was able to get stack traces
[11:47:35] usually this is possible
[11:47:43] what did it do, hang?
[11:49:05] do we know what... silly question maybe, but... an strace shows? is it hung on a system call, or in some tight loop, or...?
[11:49:28] (PS3) ArielGlenn: move production of lists of last good dumps from snapshot to web server [puppet] - https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303)
[11:50:48] (CR) ArielGlenn: [C: 2] move production of lists of last good dumps from snapshot to web server [puppet] - https://gerrit.wikimedia.org/r/395977 (https://phabricator.wikimedia.org/T182303) (owner: ArielGlenn)
[11:53:38] It should use ptrace to figure out where a process/thread is, but maybe that one was hanging as well for some reason
[11:55:15] apergos: everything seems stable, going afk now, will re-check later.. thanks !
[11:55:45] yep, thanks for looking
[11:57:59] thanks both :)
[12:00:12] Operations, Wikimedia-General-or-Unknown, WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#3825045 (Aklapper) a: csteipp → None
[12:06:25] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41268 MB (3% inode=99%)
[12:17:01] (PS1) ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942)
[12:17:33] (CR) jerkins-bot: [V: -1] move wikidata weekly dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: ArielGlenn)
[12:22:12] (CR) Framawiki: "And where can I find doc about it ? :)" [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki)
[12:23:53] (PS2) EddieGP: Delete mowiki and mowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: MarcoAurelio)
[12:25:16] (PS2) ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942)
[12:26:31] (CR) EddieGP: "PS2 is a rebase and re-adding a newline that PS1 wanted to remove at EOF of deleted.dblists" [mediawiki-config] - https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: MarcoAurelio)
[12:27:21] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825060 (EddieGP) @MarcoAurelio You're right about 1 at least. As far as I see we won't need any other Apache change (the virt...
[12:31:33] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3398651 (Legoktm) Can we swap the order on these changes? First, redirect the domains to the proper wikis (leaving the wikis in...
[12:31:34] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41447 MB (3% inode=99%)
[12:31:39] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825068 (MarcoAurelio) Note that 1 is abandoned as I was not able to run the ruby script to convert the dat to conf or vice vers...
[12:32:23] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825069 (10MarcoAurelio) @Legoktm Looks good to me. [12:33:04] (03PS3) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [12:38:45] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3825080 (10EddieGP) >>! In T169450#3825066, @Legoktm wrote: > Can we swap the order on these changes? First, redirect the domains... [12:40:44] (03PS1) 10MarcoAurelio: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) [12:41:40] (03CR) 10jerkins-bot: [V: 04-1] wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [12:51:59] (03PS1) 10MarcoAurelio: [do not merge yet] wikimania2017: closing the wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396581 (https://phabricator.wikimedia.org/T182493) [12:56:04] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [12:56:46] (03CR) 10MarcoAurelio: [C: 04-1] "Missing [ things." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [12:58:54] (03PS2) 10MarcoAurelio: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) [13:00:04] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [13:30:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:30:15] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [13:48:30] (03PS4) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [13:49:10] (03CR) 10jerkins-bot: [V: 04-1] move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [13:53:40] (03PS5) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [13:54:37] (03Restored) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [13:54:49] (03PS5) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [14:01:17] $ ruby compile_redirects.rb [14:01:17] compile_redirects.rb:1: syntax error, unexpected .. 
[14:01:17] ../../../../lib/puppet/parser/fu [14:01:17] ^ [14:01:17] compile_redirects.rb:1: unknown regexp options - lb [14:04:09] Reedy: any idea ^^ [14:05:42] * Hauskatze tries to execute the script in another location [14:10:25] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:26] PROBLEM - Nginx local proxy to apache on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.008 second response time [14:12:54] (03CR) 10EddieGP: "About the difference between funnel and rewrite:" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [14:13:24] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:13:25] RECOVERY - Nginx local proxy to apache on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time [14:14:28] (03CR) 10EddieGP: "> but instead to als.wiktionary.org/wiki/Wort:Houptsyte" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [14:22:32] (03PS1) 10ArielGlenn: remove some dead vars from content translation dump manifests [puppet] - 10https://gerrit.wikimedia.org/r/396583 [14:28:28] (03CR) 10ArielGlenn: [C: 032] remove some dead vars from content translation dump manifests [puppet] - 10https://gerrit.wikimedia.org/r/396583 (owner: 10ArielGlenn) [14:29:06] apergos: I think ../../../../lib/puppet/parser/functions/compile_redirects.rb routing is wrong [14:29:10] on puppet [14:29:31] modules/mediawiki/files/apache/sites/redirects/ [14:35:22] (03PS1) 10ArielGlenn: remove dead vars from misc dump cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/396584 [14:37:34] (03CR) 10ArielGlenn: [C: 032] remove dead vars from misc dump cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/396584 (owner: 10ArielGlenn) [14:40:26] (03PS6) 10MarcoAurelio: apache: redirect several 
wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [14:40:54] * apergos peeks in [14:41:39] (03PS7) 10MarcoAurelio: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [14:45:18] Hauskatze: you cd into the directory puppet/modules/mediawiki/files/apache/sites [14:45:38] you edit redirects/redirects.dat [14:45:44] (03CR) 10MarcoAurelio: "I finnaly managed to run that %&!!#$ script and the .conf file is now here." [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [14:46:22] you run ruby redirects/compile_redirects.rb redirects/redirects.dat > redirects.conf.saveme (say) [14:46:47] apergos, managed to run it from the lib folder and specifying manually the whole path [14:46:54] from redirects it is impossible [14:46:56] look at the diff of redirects.conf and whatever you just saved, presumably it's ok [14:47:01] says that ../ is not expected [14:47:09] you remove those, says they'reneeded [14:47:10] then put the new file into redirects.conf [14:47:30] try running it from puppet/modules/mediawiki/files/apache/sites [14:47:32] as [14:47:45] ruby redirects/compile_redirects.rb redirects/redirects.dat > something [14:47:50] https://gerrit.wikimedia.org/r/#/c/393289/ <-- apergos -- but not to be merged yet [14:48:13] most likely I will not be your merger [14:48:22] but if you got the script working, great [14:48:39] I did on lib: ruby compile_redirects.rb > redirects.conf.ma [14:48:52] checked it was okay, cut the file and pasted it with correct name [14:48:54] wrong folder [14:49:01] that's why you had to give the full path [14:49:21] but you got the output, so that's the important thing [14:49:44] always doing things more complicated that they have to be.. 
damn me :) [14:49:58] heh [14:50:34] ευχαριστώ apergos [14:50:44] τπτ [14:50:48] :) [14:52:52] (03CR) 10ArielGlenn: "Hoo, I added you as a fyi, you're not obligated to review (though of course if you see something glaring, please feel free)" [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:00:29] (03PS8) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) [15:02:44] (03PS9) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) [15:10:04] PROBLEM - Disk space on scb1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=78%) [15:21:00] (03PS1) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [15:21:35] (03CR) 10jerkins-bot: [V: 04-1] clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) (owner: 10ArielGlenn) [15:26:16] space on scb1001 is mostly the 1gb of ores celery worker logs in daemon.log, taking nearly 1g today, 3.5 million entries compared to 9k entries yesterday [15:27:14] apergos: FYI we turned the verbosity up recently… may need a new logrotate config [15:27:31] awight: please do so, I was just looking at the existig one [15:27:48] you should think about rotating when size exceeds X (300M?) [15:29:29] because this is the current log, I can't fix it unless the celery workers are restarted on that box [15:29:38] I can manually move the log elsewhere, do service uwsgi-ores reload [15:29:59] but let's get a new logrotate.conf in there in the next half hour, ok? 
awight [15:31:50] note this must cover daemon.log [15:31:53] apergos: I’m not sure how this will work—rotating alone won’t solve the problem, unless the files are flying off to an archive somewhere? [15:31:54] because that's where the issues are [15:34:43] 10Operations, 10ORES, 10Scoring-platform-team: Update logrotate config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825269 (10awight) p:05Triage>03High [15:44:04] RECOVERY - Disk space on scb1001 is OK: DISK OK [15:45:02] !log on scb1001 moved daemon.log out of the way, did "service rsyslog rotate", saved the last 5000 entries for use by ores team, removed the log [15:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:08] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update logrotate config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825284 (10ArielGlenn) Needs to happen: logging to ores logs instead of daemon.log, make sure logrot conf file for ores l... [15:48:35] !log Making an emergency deployment to ORES logging config to reduce verbosity. 
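[editor's note] A size-based stanza along the lines apergos suggests at 15:27 (rotate once the file passes 300M) might look like the following. The filename, retention count, and postrotate action are assumptions for illustration, not the config actually deployed for T182497:

```
# /etc/logrotate.d/daemon-size -- hypothetical sketch, not the deployed config
/var/log/daemon.log {
    size 300M          # rotate as soon as the log exceeds 300M
    rotate 4           # keep four old compressed logs
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        service rsyslog rotate > /dev/null 2>&1 || true
    endscript
}
```

Note that logrotate only evaluates `size` when it runs (daily by default on these systems), so a log growing at ~1G/day would likely also need logrotate invoked more often, e.g. hourly.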
[15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:04] PROBLEM - Disk space on scb1004 is CRITICAL: DISK CRITICAL - free space: / 329 MB (3% inode=83%) [15:51:12] I'll get there [15:51:17] I'm doing scb1002 now [15:53:04] RECOVERY - Disk space on scb1004 is OK: DISK OK [15:53:16] !log did same on scb1002,3,4 [15:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:28] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity [15:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:59] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (duration: 00m 31s) [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:24] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity [15:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:42] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (duration: 00m 17s) [15:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:13] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity [15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:06] the ores200* boxes have much smaller logs, so they must get a lot less/minimal traffic, also their / partitions are much larger so I'm ignoring them [16:02:11] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (duration: 05m 58s) [16:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:31] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3825309 (10awight) This is affecting 
me in production, now: ``` Timeout, server scb2004.codfw.wmnet not responding. 16:01:39 conn... [16:07:50] !log awight@tin Started deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (take 4\!) [16:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: Reducing ORES Celery log verbosity (take 4\!) (duration: 03m 01s) [16:11:01] the scb200* hosts actually do have 1G log files, but 80G root partitions so no worries there [16:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:03] (03PS2) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:16:38] (03CR) 10jerkins-bot: [V: 04-1] clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) (owner: 10ArielGlenn) [16:17:41] (03PS3) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:29:06] (03PS4) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:38:57] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update log config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825360 (10awight) [16:39:47] (03PS5) 10ArielGlenn: clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) [16:42:23] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10HTTPS: Check all wikis for inclusions of http resources on https - 
https://phabricator.wikimedia.org/T36670#3825367 (10Aklapper) a:05csteipp>03None [16:48:06] (03CR) 10ArielGlenn: [C: 032] clean up temp files created during xml/sql and other dumps generation [puppet] - 10https://gerrit.wikimedia.org/r/396585 (https://phabricator.wikimedia.org/T180102) (owner: 10ArielGlenn) [16:48:35] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:55] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:04] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:59] hhvm hang again [16:50:08] elukey: in case you're around, wanna try anything there? [16:57:57] guess I'll go ahead and restart it [16:59:54] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.063 second response time [17:00:05] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 71830 bytes in 0.284 second response time [17:00:12] !log restarted hhvm on mw1276, the same old hang with the same old symptoms [17:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:35] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.039 second response time [17:02:32] all right, I'm checked out for the day [17:02:40] any emergencies... call someone else :-P [17:17:51] Lol [17:21:41] (03PS6) 10ArielGlenn: move wikidata weekly dumps to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/396574 (https://phabricator.wikimedia.org/T179942) [18:32:48] 10Operations, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#3825446 (10Aklapper) >>! In T146841#3729163, @Dzahn wrote: > @Seb35 @Peachey88 @Herron since T168467 is resolved meanwhile, d... 
[18:37:22] 10Operations, 10RT-Migration, 10Wikimedia Phabricator RfC, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#3825778 (10Ebe123) [20:07:34] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:54] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71826 bytes in 6.099 second response time [20:08:44] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time [20:09:24] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.080 second response time [20:15:25] Wikimedia\Rdbms\DBQueryError when trying to create item on Wikidata... [20:15:33] (WixEQApAIC4AAIKfjewAAAAM) [20:15:34] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:54] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:45] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.923 second response time [20:17:20] sjoerddebruin: looking... [20:17:24] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [20:17:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.208 second response time [20:17:56] sjoerddebruin: Lock wait timeout exceeded; try restarting transaction [20:18:06] Yeah, it's working again... 
[20:20:54] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:35] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:35] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 40974 MB (3% inode=99%) [20:43:20] (03PS2) 10Zoranzoki21: Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [20:48:57] (03PS2) 10Zoranzoki21: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [20:49:08] (03CR) 10Zoranzoki21: [C: 031] "Now is ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [20:55:44] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time [20:55:54] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71836 bytes in 0.096 second response time [20:56:04] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.054 second response time [21:02:14] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:04] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:44] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time [21:03:54] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71792 bytes in 0.135 
second response time [21:04:04] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [21:06:54] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:04] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:14] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:54] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.863 second response time [21:17:04] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71792 bytes in 0.116 second response time [21:17:05] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time [21:49:45] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 40381 MB (3% inode=99%) [22:16:04] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:14] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:24] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:04] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.594 second response time [22:18:07] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71792 bytes in 0.137 second response time [22:18:14] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [22:19:35] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Update log config for scb* boxes, to deal with ORES verbose logging - https://phabricator.wikimedia.org/T182497#3825924 (10awight) p:05High>03Normal Urgent fix is deployed, lowering the 
priority. [22:19:55] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:25] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:24] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 8.623 second response time [22:22:45] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [22:25:45] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 41502 MB (3% inode=99%) [23:19:24] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:19:24] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:14] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71832 bytes in 0.138 second response time [23:20:15] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.028 second response time [23:41:05] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:24] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:25] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:24] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.271 second response time [23:42:24] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 71833 bytes in 5.889 second response time [23:43:04] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.035 second response time