[00:15:35] (CR) Krinkle: [C: -1] Use a textarea for content differences (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/370160 (https://phabricator.wikimedia.org/T172362) (owner: Giuseppe Lavagetto)
[00:31:41] Operations, Domains, Traffic, Wikimedia Resource Center, Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3502819 (Harej) >>! In T172417#3498059, @Harej wrote: > Note that this is pending final c-level approval. Update: Approval has b...
[00:32:28] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:33:11] mutante is that ^^ phd?
[00:40:13] Operations, Domains, Traffic, Wikimedia Resource Center, Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3497968 (Krinkle) > We want to use a short link that people can access that is easy to remember and type on any browser, on any d...
[00:42:57] (CR) Krinkle: [C: 1] phpcs for refresh-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/369808 (owner: Reedy)
[01:29:29] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn inactive server should not have phd
[01:32:08] paladox: rsync initial run finished. i think it was pretty busy with that. also without phd running how can it "feel slow" :)
[01:32:35] which protocol
[01:44:11] Operations, MediaWiki-extensions-Score: crackling at start of OGG renditions of MIDI files (fixed in TiMidity++ 2.14.0) - https://phabricator.wikimedia.org/T50029#3502881 (Reedy)
[02:16:13] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3502957 (Dzahn)
[02:17:43] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3502961 (Reedy)
[02:33:40] Reedy: Do you know of any recent code or config changes to AbuseFilter?
[02:34:08] I ran an XHProf profile on mwdebug1001 on mw.org when making an edit and finding some 750ms spent in AbuseFilter
[02:34:16] 950ms *
[02:34:23] Context: T172447
[02:34:24] T172447: Investigate 2017-08-02 Save Timing regression (+40-60%) - https://phabricator.wikimedia.org/T172447
[03:10:19] (CR) Andrew Bogott: [C: 2] toolforge: Add qstat-full to bastions [puppet] - https://gerrit.wikimedia.org/r/370298 (owner: BryanDavis)
[03:26:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.94 seconds
[03:59:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.75 seconds
[04:21:04] Question: is a fatal in production considered an UBN even if its trigger is rather obscure?
[04:56:04] harej: If it's a regression (it didn't use to fatal) then yes, I'd mark it as UBN.
[05:03:21] Krinkle: I don't know if this specific situation didn't use to fatal, but I think in principle a fatal in production is not good?
[05:13:04] harej: not good no, what's the issue? It also depends on severity (how many people effected, how frequently used)
[05:13:11] sadly our software is not currently bug free ;)
[05:13:34] https://phabricator.wikimedia.org/T172588
[05:23:16] harej: is that the only page you've found that fatals?
[05:24:24] That I've seen, yes
[05:24:48] that's good at least
[05:24:57] I wonder who proposed "How to volunteer editing server config and get a code change into Ops repos." on https://wikimania2017.wikimedia.org/wiki/Hackathon/Program
[05:24:59] And under that exact arrangement. The same page doesn't fatal if you restore it back to its non cursed version
[05:27:44] ah, mutante did :) (the session on getting things deployed)
[05:27:56] mutante: does that mean you'll be there?
[05:27:58] * greg-g goes
[05:33:08] harej: I've dug up the full error message for you and added it to the task
[05:33:14] Looks like it has to do with nested tags
[05:33:33] so if you haven't found a way around the fatal yet, this might help.
[05:33:51] Ah. Normally they're not allowed, but if you subst a thing that has Translate tags...
[05:34:03] The ideal outcome of the task in this case would be to error in a "better" way (tell you what's wrong), but it will likely remain an error, however.
[05:34:31] {{subst:translatable horror}}
[05:34:45] Yeah, applies after everything else, so there's no good safesubt:-ish thing for it.
[05:34:46] And yes the proper thing would be to have it error
[05:36:33] Or we could scorch the earth the translate extension lies on, but we can all dream. :)
[10:09:26] (PS1) MarcoAurelio: Enabling OAuth on foundationwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370310 (https://phabricator.wikimedia.org/T170301)
[10:17:31] (PS2) MarcoAurelio: Allow bureaucrats on WMF wikis to grant and remove 'confirmed' [mediawiki-config] - https://gerrit.wikimedia.org/r/368939 (https://phabricator.wikimedia.org/T101983)
[11:10:38] (PS1) MarcoAurelio: Grant 'autopatrol' to 'editor' in en.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/370311 (https://phabricator.wikimedia.org/T172561)
[11:16:47] (PS1) MarcoAurelio: Translate sitename for nl.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/370313 (https://phabricator.wikimedia.org/T172594)
[11:54:28] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:54:28] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:54:29] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:54:38] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:54:38] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:54:48] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:55:18] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:55:19] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:59:50] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:02:22] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3503306 (Urbanecm) a: Reedy
[12:04:19] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:04:28] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[12:04:28] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[12:04:29] RECOVERY - Disk space on stat1005 is OK: DISK OK
[12:04:38] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[12:04:38] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[12:04:39] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[12:04:49] RECOVERY - DPKG on stat1005 is OK: All packages OK
[12:29:48] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sat 2017-08-05 12:29:42 UTC.
[12:57:18] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:57:48] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:57:48] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:57:49] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:57:49] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:57:58] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:57:59] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[12:58:08] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:01:48] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:04:18] RECOVERY - DPKG on stat1005 is OK: All packages OK
[13:04:48] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:04:49] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[13:04:49] RECOVERY - Disk space on stat1005 is OK: DISK OK
[13:04:58] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[13:04:59] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[13:05:08] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[13:05:09] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[13:05:33] Database locked on Commons?
[13:12:49] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 385.61 seconds
[13:13:48] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
[13:17:34] (PS1) Ladsgroup: mediawiki: Another increase of batch size in dispatchChanges cronjob [puppet] - https://gerrit.wikimedia.org/r/370315 (https://phabricator.wikimedia.org/T171263)
[13:21:02] (CR) Sjoerddebruin: [C: 1] "Hope the dispatch will be "dope" with that batch size. ;)" [puppet] - https://gerrit.wikimedia.org/r/370315 (https://phabricator.wikimedia.org/T171263) (owner: Ladsgroup)
[13:22:43] (CR) Luke081515: [C: 1] Enabling OAuth on foundationwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370310 (https://phabricator.wikimedia.org/T170301) (owner: MarcoAurelio)
[13:31:48] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sat 2017-08-05 13:31:42 UTC.
[13:48:13] (CR) Lucas Werkmeister (WMDE): "Load on Terbium may be fine, but if I understand correctly, we also need to watch out that we don’t overflow the change queues on the clie" [puppet] - https://gerrit.wikimedia.org/r/370315 (https://phabricator.wikimedia.org/T171263) (owner: Ladsgroup)
[14:13:01] !log reedy@tin Synchronized php-1.30.0-wmf.12/extensions/WikimediaMaintenance/createExtensionTables.php: add oauth (duration: 00m 48s)
[14:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:20] Reedy: https://phabricator.wikimedia.org/diffusion/EWMA/browse/master/createExtensionTables.php still not updated?
[14:24:46] replication lag?
[14:24:55] i think phab only does it every few mins
[14:26:22] ever 30 secs
[14:26:25] if the repo is active
[14:26:30] in the last few days
[14:26:39] ever = every
[14:27:02] https://phabricator.wikimedia.org/diffusion/EWMA/manage/status/
[14:27:06] shows 45 minutes
[14:28:05] true that
[14:28:52] they're are there now :)
[14:32:07] yep :)
[14:32:18] clicking the update now button should make the repo update
[14:32:38] though only repo admins and owners of the repo and admins can do that
[14:40:43] !log created oauth tables on foundationwiki T172591
[14:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:55] T172591: Create OAuth tables for foundationwiki - https://phabricator.wikimedia.org/T172591
[14:41:16] Operations, Wikimedia-Mailing-lists: lists.wikimedia.org (208.80.154.21) blocked by Trend Micro - https://phabricator.wikimedia.org/T172602#3503473 (Platonides)
[14:42:30] Operations, Mail, Wikimedia-Mailing-lists: lists.wikimedia.org (208.80.154.21) blocked by Trend Micro - https://phabricator.wikimedia.org/T172602#3503485 (Platonides)
[14:45:08] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.16 seconds
[14:45:18] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.56 seconds
[14:45:19] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.69 seconds
[14:45:28] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.76 seconds
[14:45:38] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.18 seconds
[14:47:19] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 44.23 seconds
[14:47:28] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 16.97 seconds
[14:47:38] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[14:48:09] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[14:48:18] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.49 seconds
[14:54:18] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[15:01:18] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[15:05:08] Operations, Pybal, Traffic: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#3503596 (ema) A more general patch has been submitted by Julian Anastasov http://archive.linuxvirtualserver.org/html/lvs-devel/2017-08/msg00001.html \o/
[17:28:48] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused
[19:06:17] PROBLEM - Check systemd state on mw1285 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:06:17] PROBLEM - Disk space on mw1285 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:06:17] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 9.453 second response time
[19:06:18] PROBLEM - Apache HTTP on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 8.423 second response time
[19:07:07] RECOVERY - Check systemd state on mw1285 is OK: OK - running: The system is fully operational
[19:07:07] RECOVERY - Disk space on mw1285 is OK: DISK OK
[19:07:08] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 73363 bytes in 0.173 second response time
[19:07:08] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.074 second response time
[20:01:38] Operations, MediaWiki-Database, NewPHP, Patch-For-Review, Technical-Debt: Remove old mysql extension support in favor of mysqli - https://phabricator.wikimedia.org/T120333#3503847 (Reedy) We should probably make a move on this... However, if we look at WMF production where we're still using...
[21:58:46] hey twentyafterfour i think the status in topic should updated, but if im incorrect feel free to disregard this msg.
[22:08:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[22:37:05] (PS1) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T171372)
[22:45:08] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[23:41:29] (PS1) Ebe123: Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T171372)