[02:46:52] Operations: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Anomie)
[03:33:33] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[03:35:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 916.30 seconds
[03:50:41] Operations: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Huji) @Anomie Thanks, I couldn't have pinpointed this the way you did. Much appreciated.
[04:02:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 252.30 seconds
[04:04:47] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:46:34] (PS3) Tulsi Bhagat: Enable 'extendedmover' user group at ur.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978)
[05:49:08] (CR) Zoranzoki21: [C: +1] "Looks good, but see comments." (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: Tulsi Bhagat)
[05:56:48] (PS4) Tulsi Bhagat: Enable 'extendedmover' user group at ur.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978)
[05:59:20] (PS5) Zoranzoki21: Enable 'extendedmover' user group at ur.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: Tulsi Bhagat)
[06:28:31] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:51] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time
[06:37:19] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.551 second response time
[06:38:13] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:41:25] (PS2) Tulsi Bhagat: Update be.wikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/479713 (https://phabricator.wikimedia.org/T211795)
[07:17:27] PROBLEM - puppet last run on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:35] PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:38] PROBLEM - mysqld processes on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:41] PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:47] PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:49] PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:55] PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:17:58] What's up?
[07:18:03] That's tendril's host
[07:18:05] just saw the page
[07:18:07] PROBLEM - configured eth on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:18:13] PROBLEM - Check systemd state on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:18:15] From what I can see it is all fine
[07:18:23] MySQL is up and with a high uptime
[07:18:28] PROBLEM - MariaDB disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:18:28] PROBLEM - dhclient process on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
[07:18:42] Could it be icinga itself?
[07:18:58] Might be the network?
[07:19:46] network between them seems ok, just pinged icinga1001 from db1115
[07:20:03] RECOVERY - Check whether ferm is active by checking the default input chain on db1115 is OK: OK ferm input default policy is set
[07:20:08] RECOVERY - mysqld processes on db1115 is OK: PROCS OK: 1 process with command name mysqld
[07:20:09] RECOVERY - Disk space on db1115 is OK: DISK OK
[07:20:10] well
[07:20:15] RECOVERY - DPKG on db1115 is OK: All packages OK
[07:20:17] RECOVERY - Check size of conntrack table on db1115 is OK: OK: nf_conntrack is 1 % full
[07:20:23] Very strange
[07:20:23] RECOVERY - MD RAID on db1115 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:20:27] now it's recovered, going to be hard to tell innit
[07:20:35] RECOVERY - configured eth on db1115 is OK: OK - interfaces up
[07:20:41] RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational
[07:20:56] RECOVERY - MariaDB disk space on db1115 is OK: DISK OK
[07:20:57] RECOVERY - dhclient process on db1115 is OK: PROCS OK: 0 processes with command name dhclient
[07:21:04] Dec 16 07:20:00 db1115 systemd[1]: Reloading.
[07:21:13] Dec 16 07:20:00 db1115 systemd[1]: Starting Nagios Remote Plugin Executor...
[07:21:16] etc
[07:21:25] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=18&fullscreen&orgId=1&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql
[07:21:39] so that's what it was I guess, the plugin wasn't running there for a hot second on db1115
[07:21:53] (see last few lines of syslog)
[07:22:41] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:23:06] Why would that happen...
[07:23:15] good question
[07:23:16] <_joe_> heh just arrived sorry
[07:23:26] <_joe_> I didn't hear the fist page I was preparing coffee
[07:23:33] o/
[07:23:44] well we're recovered so drink your coffee as normal
[07:23:46] <_joe_> so looks like excessive load?
[07:23:57] Yeah
[07:24:11] I am going to restart MySQL as tendril hasn't fully recovered
[07:25:23] Dec 16 07:14:42 db1115 puppet-agent[17019]: Failed to apply catalog: Cannot allocate memory - /usr/bin/dpkg-query -W --showformat '${Status} ${Package} ${Version}\n' 2>&1
[07:25:29] <_joe_> why is memory reported in hertz
[07:26:16] Dec 16 07:14:53 db1115 nrpe[30452]: fork() failed with error 12, bailing out...
[07:26:16] Dec 16 07:14:53 db1115 systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
[07:26:16] Dec 16 07:14:53 db1115 systemd[1]: nagios-nrpe-server.service: Unit entered failed state.
[07:26:16] Dec 16 07:14:53 db1115 systemd[1]: nagios-nrpe-server.service: Failed with result 'exit-code'.
[07:26:20] here's where it died
[07:26:52] You guys ok if I fully reboot the host?
[07:27:09] fine by me lemme get off first
[07:27:16] done
[07:27:26] _joe_?
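(Annotation, not part of the channel traffic.) The two syslog excerpts pasted above plausibly describe the same condition: puppet reports "Cannot allocate memory", and nrpe reports that fork() failed with error 12, which on Linux is the errno value ENOMEM. A minimal Python sketch for decoding such a number when reading logs follows; the helper name `describe_errno` is hypothetical and is not a tool mentioned in the channel.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno taken from a log line."""
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable message for the code.
    return f"{name}: {os.strerror(code)}"


# 12 is the value nrpe logged when fork() failed on db1115.
print(describe_errno(12))
```

The mapping is platform dependent, so the sketch should be run on the same OS family as the host that produced the log.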
[07:28:29] maybe he went to actually drink that coffee :-)
[07:28:32] haha
[07:28:42] Going to stop mysql for now
[07:28:59] !log Stop MySQL on db1115 so tendril can get back to work
[07:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:50] I will upgrade mysql and kernel
[07:30:39] !log Reboot db1115 after OOM
[07:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:13] PROBLEM - HTTP-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 280 bytes in 0.003 second response time
[07:32:51] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:33:49] ^ that is expected as I am rebooting the tendril host
[07:37:08] * apergos is watcing the recovery on icingia
[07:37:12] s/ia/a/
[07:37:39] I am trying to get to the management console to see if the host is actually rebooting
[07:37:43] Or I had to do a hard reset
[07:38:10] I think I have to do it
[07:38:32] I guess you do
[07:39:40] Done
[07:39:55] It is booting now
[07:39:56] * apergos drums their fingers
[07:41:28] host up!
[07:42:07] Going to start mysql, run the upgrade and the event scheduler
[07:42:28] okay
[07:42:54] I think what causes the memory leak or OOM is the event scheduler, it is my theory
[07:43:04] (Kinda not the first time we see it happening)
[07:43:41] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 78926 bytes in 0.909 second response time
[07:43:46] orilly
[07:44:01] tendril is back and reporting
[07:44:20] icinga looks much better
[07:44:29] RECOVERY - HTTP-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 78924 bytes in 0.422 second response time
[07:45:37] apergos: See this: https://phabricator.wikimedia.org/T196726
[07:45:46] I will update this later
[07:46:29] ugh
[07:46:30] ok
[07:46:51] well happy Sunday to us all and may the rest of it be quieter
[07:47:08] I think we are good now
[07:47:23] Thanks for showing up apergos and _joe_ - much appreciated :)
[07:47:28] thanks for doing the work!
[07:55:30] <_joe_> heh np
[07:55:34] <_joe_> I did nothing :P
[09:52:05] RECOVERY - Check systemd state on sulfur is OK: OK - running: The system is fully operational
[09:52:21] !log mask + reset-failed kafkatee default instance on sulfur (kafkatee-webrequest works fine)
[09:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:25] paravoid: --^
[09:53:31] Operations, netops, User-Elukey: migrate netinsights from rhenium to sulfur - https://phabricator.wikimedia.org/T212011 (elukey)
[10:43:19] (PS4) Urbanecm: Upload custom minerva logo for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/479379 (https://phabricator.wikimedia.org/T210979)
[10:43:34] (PS3) Urbanecm: Use new minerva logos for cswiki in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/479963 (https://phabricator.wikimedia.org/T210979)
[10:46:22] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: Tulsi Bhagat)
[10:47:21] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/479713 (https://phabricator.wikimedia.org/T211795) (owner: Tulsi Bhagat)
[10:48:37] (CR) Urbanecm: [C: +1] "Tulsi, this looks good, but it looks like you didn't upload a follow-up patch that will add those new 1.5x and 2x logos to wgLogoHD. Due t" [mediawiki-config] - https://gerrit.wikimedia.org/r/479713 (https://phabricator.wikimedia.org/T211795) (owner: Tulsi Bhagat)
[11:08:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[11:09:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[11:09:53] Hi, I have 2 1-GB+ files stucked in Upload stash :((
[11:10:09] how to get them published?
[11:11:14] https://phabricator.wikimedia.org/T200820
[12:28:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[12:29:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[13:07:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[13:10:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:11:47] (CR) Framawiki: [C: +1] Enable 'extendedmover' user group at ur.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: Tulsi Bhagat)
[14:12:01] (CR) Framawiki: [C: +1] Update be.wikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/479713 (https://phabricator.wikimedia.org/T211795) (owner: Tulsi Bhagat)
[14:21:57] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:23:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:38:20] Hi, can anyone check logstash for srwikinews? Is there some errors/warning from yesterday?
[14:45:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:46:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:49:49] Zoranzoki21: from a quick look I see no errors and 6 warnings (4 Using cached lag value for $IP due to active transaction and 2 Expectation (readQueryTime <= 5) by MediaWiki::main not met)
[14:50:39] volans: Can you give me output?
[14:50:58] (on pastebin.com or simular)
[14:51:56] sorry didn't saw the yesterday part... since 24h there a more
[14:54:47] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[14:54:52] volans: Ok, what?
[14:55:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:58:14] Zoranzoki21: a bunch, 174 warnings and 3 errors in the last 24h
[14:58:26] volans: OMG
[14:59:21] but some might be transient and mediawiki retrying I guess
[15:01:16] volans: Ok, I asked because I have some problems with working via pywikibot on srwikinews related to API
[15:06:27] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: Rafidaslam)
[15:38:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[15:39:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:04:36] hi
[16:04:49] now I can't upload a file at all
[16:04:50] https://phabricator.wikimedia.org/T200820#4826332
[16:05:01] https://phabricator.wikimedia.org/T38587#4826578
[16:05:17] I got an error whatever option is used
[16:08:21] yannf, are you saying all file uploads are broken?
[16:09:08] my file is 1.1 GB
[16:09:28] I use https://commons.wikimedia.org/wiki/User:Rillke/bigChunkedUpload.js
[16:09:52] which mentions that there is an issue with big PDF file
[16:10:07] here https://commons.wikimedia.org/wiki/User_talk:Rillke/bigChunkedUpload.js#Troubleshooting
[16:10:28] so I tried again without "stash and async" as recommended
[16:10:38] but it also failed
[16:10:56] that's the 3rd time today
[16:11:46] so it's just certain files
[16:11:53] or particularly large files
[16:11:55] ?
[16:12:01] yeah
[16:12:18] but all my files are big
[16:12:29] and all options failed
[16:13:36] Rillke's script used to work very well
[16:13:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[16:14:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:16:24] Krenair, a few days before I uploaded https://commons.wikimedia.org/wiki/File:L%27Illustration,_1916.pdf
[16:16:40] 1.7 GB, no problem
[16:17:13] and https://commons.wikimedia.org/wiki/File:L%27Illustration,_Jul-Dec_1922.pdf 1.19 GB
[16:17:31] and https://commons.wikimedia.org/wiki/File:L%27Illustration,_Jan-Jun_1917.pdf 1.21 GB
[16:17:36] mm sounds like something that can't really be properly handled on a sunday over IRC
[16:17:56] I see :/
[16:17:57] your best option for now might be to leave a detailed comment about the problems on phab
[16:18:40] ok, do you need more information I posted in the reports I wrote?
[16:26:18] Krenair, ^
[16:34:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[16:35:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:38:48] Krenair, do you need more information than what I wrote in the reports?
[16:44:00] yannf, I don't know
[16:44:02] I won't be handling it
[16:44:13] ok
[16:44:21] thanks
[16:44:41] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:15:55] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:32:36] (PS1) ArielGlenn: switch all invocations of scripts in dumps rep to use python3 [puppet] - https://gerrit.wikimedia.org/r/479985 (https://phabricator.wikimedia.org/T210989)
[17:33:21] (CR) jerkins-bot: [V: -1] switch all invocations of scripts in dumps rep to use python3 [puppet] - https://gerrit.wikimedia.org/r/479985 (https://phabricator.wikimedia.org/T210989) (owner: ArielGlenn)
[17:34:37] (Abandoned) ArielGlenn: start conversion to python3 [dumps] (python3) - https://gerrit.wikimedia.org/r/477227 (https://phabricator.wikimedia.org/T210989) (owner: ArielGlenn)
[17:34:57] (PS11) ArielGlenn: convert dump scripts to python3 [dumps] - https://gerrit.wikimedia.org/r/478702 (https://phabricator.wikimedia.org/T210989)
[17:37:56] (PS2) ArielGlenn: switch all invocations of scripts in dumps rep to use python3 [puppet] - https://gerrit.wikimedia.org/r/479985 (https://phabricator.wikimedia.org/T210989)
[18:53:41] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:56:37] (PS2) ArielGlenn: use lbzip2 for recompression of wikidata weeky json dumps [puppet] - https://gerrit.wikimedia.org/r/474159 (https://phabricator.wikimedia.org/T206535)
[18:59:08] (PS3) ArielGlenn: use lbzip2 for recompression of wikidata weeky json dumps [puppet] - https://gerrit.wikimedia.org/r/474159 (https://phabricator.wikimedia.org/T206535)
[19:05:20] (CR) ArielGlenn: [C: +2] use lbzip2 for recompression of wikidata weeky json dumps [puppet] - https://gerrit.wikimedia.org/r/474159 (https://phabricator.wikimedia.org/T206535) (owner: ArielGlenn)
[19:10:38] (PS3) ArielGlenn: switch all invocations of scripts in dumps rep to use python3 [puppet] - https://gerrit.wikimedia.org/r/479985 (https://phabricator.wikimedia.org/T210989)
[19:24:55] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:40:07] Operations, Operations-Software-Development, Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (crusnov) >>! In T205899#4825625, @Volans wrote: >>>! In T205899#4824154, @crusnov wrote: >> Alri...
[20:11:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[20:12:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:41:54] (CR) Krinkle: [C: +1] "One nit about include vs require, but good to go either way, can follow-up later." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477956 (owner: Tim Starling)
[22:42:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[22:43:01] (CR) Krinkle: [C: +1] Put profiler hostnames in ProductionServices.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477957 (owner: Tim Starling)
[22:43:04] (CR) Krinkle: [C: +1] Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957 (owner: Tim Starling)
[22:43:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[23:28:50] (PS5) Paladox: php: Add support for php 7.3 [puppet] - https://gerrit.wikimedia.org/r/479144
[23:50:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[23:51:35] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy