[00:52:16] Operations, Gerrit, ORES, Scoring-platform-team, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3593368 (mmodell)
[00:54:29] Operations, Gerrit, ORES, Scoring-platform-team, and 2 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3593374 (awight)
[00:55:00] Operations, ORES, Scap, Scoring-platform-team, Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3593377 (awight)
[01:05:38] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[01:12:54] !log ebernhardson@tin Synchronized php-1.30.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: T171740: Reduce annoyance of survey by enforcing minimum 2 days between showing survey to same browser (duration: 00m 46s)
[01:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:08] T171740: [Epic] Search Relevance: graded by humans - https://phabricator.wikimedia.org/T171740
[03:28:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.44 seconds
[04:17:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 188.54 seconds
[07:18:39] RECOVERY - Check systemd state on restbase1010 is OK: OK - running: The system is fully operational
[10:02:08] (PS15) MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[10:37:13] Operations, DNS, Traffic: arbcom-fi.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T175447#3593617 (Stryn)
[10:38:34] Operations, DNS, Traffic: arbcom-fi.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T175447#3593618 (Peachey88)
[11:02:58] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2026851
[11:35:08] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2021227
[11:58:58] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[12:02:59] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[12:05:08] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[12:12:08] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:42:59] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 37
[12:52:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:53:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[12:55:57] Operations, DNS, Traffic: arbcom-fi.wikipedia.org does not have a mobile version at arbcom-fi.m.wikipedia.org - https://phabricator.wikimedia.org/T175447#3593692 (Aklapper)
[13:00:33] (Draft1) MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/376886
[13:00:40] (PS2) MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/376886
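The "Check Varnish expiry mailbox lag" alerts in the stretch of log above track how far the cache's expiry thread has fallen behind in reaping expired objects. Purely as an illustration of where such a number can come from (not the production Icinga plugin, which may compute it differently), the lag can be approximated on a Varnish 4 cache host from the MAIN.exp_mailed and MAIN.exp_received counters:

    # Rough manual approximation, assuming Varnish 4's varnishstat counters;
    # the actual check script used in production is not shown in this log.
    mailed=$(varnishstat -1 -f MAIN.exp_mailed | awk '{print $2}')
    received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
    echo "expiry mailbox lag: $((mailed - received))"

A gap of roughly two million, as reported for cp1049 and cp1074 above, is what trips the CRITICAL state; the matching RECOVERY lines show it draining back toward zero.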
[13:01:25] (PS3) MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/376886
[13:08:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:10:36] (Draft1) MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356)
[13:10:38] (PS2) MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356)
[13:10:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:19:19] (CR) Jayprakash12345: [C: +1] Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356) (owner: MarcoAurelio)
[13:19:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:20:08] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:20:49] (PS8) MarcoAurelio: Cloud VPS configuration for hi.wikivoyage [puppet] - https://gerrit.wikimedia.org/r/371096 (https://phabricator.wikimedia.org/T173013)
[13:28:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:34:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:34:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:39:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:41:37] (CR) Eranroz: "What exactly should be monitored?" [mediawiki-config] - https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: Eranroz)
[13:42:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:44:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:54:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:57:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:57:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:29:48] (PS1) Reedy: Add mobile subdomains for all arbcom wikis [dns] - https://gerrit.wikimedia.org/r/376904 (https://phabricator.wikimedia.org/T175447)
[15:34:37] Operations, DNS, Patch-For-Review: arbcom-fi.wikipedia.org does not have a mobile version at arbcom-fi.m.wikipedia.org - https://phabricator.wikimedia.org/T175447#3593816 (Reedy)
[16:00:29] I submitted a continuous job using jstart (for a nodeJS application), but it did not start the application.
[16:02:17] It is working when I just run the binary file.
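Regarding the jstart question just above: on the Toolforge grid, jstart submits a continuous job (restarted automatically if it exits), and Node.js applications usually need the interpreter and the script given as explicit full paths because the job does not inherit the tool's interactive shell environment. A minimal sketch with a made-up tool name, placeholder paths, and an assumed interpreter location of /usr/bin/nodejs; the pointer to #wikimedia-cloud later in this log is the right place for the authoritative answer:

    # Hypothetical example: "my-node-app", the tool directory and the interpreter
    # path are placeholders, not taken from this log.
    jstart -N my-node-app -mem 1g /usr/bin/nodejs /data/project/my-tool/app.js
    # Then check what the grid actually did:
    qstat
    tail /data/project/my-tool/my-node-app.err   # jsub/jstart write jobname.err/.out by default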
[17:08:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:19:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:52:49] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89972.41 seconds
[19:05:33] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (Gehel) Since we are decommissioning logstash100[1-3] fairly soon, it's probably not worth investing any time in them...
[20:40:05] (CR) Daniel Kinzler: "The most crucial aspect to monitor is indeed size/growth of the wbc_entity_usage table. We expect it to grow significantly, but hopefully " [mediawiki-config] - https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: Eranroz)
[20:41:46] !log legoktm@tin Synchronized php-1.30.0-wmf.17/includes/filerepo/file/LocalFile.php: Fix issues related to comments and files in UI and API - T175443 T175444 (duration: 00m 46s)
[20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:01] T175443: [BUG] API:Imageinfo returns null for &iiprop=comment - https://phabricator.wikimedia.org/T175443
[20:42:01] T175444: File History: Comments are not displayed anymore for non-current versions - https://phabricator.wikimedia.org/T175444
[20:42:28] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2046587
[21:31:19] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (241954s 200000s)
[21:32:08] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (241954s 200000s)
[21:42:14] acagastyaQUASSEL: You can get help with Toolforge problems like your jsub question in the #wikimedia-cloud channel.
[21:53:16] bd808: fyi the wt-static and wikitech in sync check maybe should be reworded so the check name (the beginning prior to where it says critical) isn't a question lol
[21:56:11] I think it's actually the right words, but missing punctuation
[21:57:40] bd808: I guess that can be true too
[21:58:36] (PS1) BryanDavis: wmcs: tweak wikitech-static test label [puppet] - https://gerrit.wikimedia.org/r/377023
[21:58:58] I like how it applies on labtestweb2001 as well
[21:59:57] anyway it looks like the last import to wikitech-static happened on the 7th
[22:01:42] bd808, I take it you can log into wikitech-static?
[22:01:59] Krenair: nope. I don't have the creds for it
[22:02:05] wow
[22:02:18] andrewbogott and mutante have been taking care of it
[22:02:23] (CR) Zppix: wmcs: tweak wikitech-static test label (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis)
[22:06:24] (CR) BryanDavis: wmcs: tweak wikitech-static test label (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis)
[22:07:01] Krenair: I'm sure I could have the login if I asked, but I've got enough jobs already :)
[22:07:04] (CR) Zppix: [C: +1] wmcs: tweak wikitech-static test label [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis)
[22:07:25] bd808, aren't you in the ops team now?
[22:07:34] no, I'm not a root
[22:07:39] IIRC ops share the root password
[22:07:47] oh
[22:07:58] so you manage ops but aren't one yourself?
[22:08:15] correct.
[22:08:59] I've never had a manager who was a software developer. It doesn't seem to be a weird arrangement to me
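A note on the "are wikitech and wt-static in sync" alerts above: the two values in parentheses are the age of the last successful wikitech-static import and the allowed maximum, i.e. 241954 s of staleness against a 200000 s limit, which matches the observation that the last import happened on the 7th. Purely as an illustration of that comparison, not the real plugin (which determines the import age itself):

    # Illustrative comparison only; values copied from the 21:31 alert text.
    age=241954      # seconds since wikitech-static last imported
    max_age=200000  # allowed maximum before the check goes CRITICAL
    if [ "$age" -gt "$max_age" ]; then
        echo "CRIT - wikitech and wikitech-static out of sync (${age}s ${max_age}s)"
    else
        echo "OK - wikitech and wikitech-static in sync"
    fi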
[22:43:59] !log Running service rabbitmq-server restart on labcontrol1001
[22:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:04] madhuvishy: restarting rabbit kk, is it just nodepool that is hosed up?
[22:47:16] chasemp: no
[22:47:35] my manual fullstack run is building forever and timing out too
[22:47:39] yeah I see nova-fullstack reporting timeouts
[22:47:41] k
[22:48:01] madhuvishy: before iirc andrew had to stop nodepool and clean everything out
[22:48:10] or it just kept drowning
[22:48:14] chasemp: still seeing timeout errors, not sure rabbit restart has fixed anything right now
[22:48:15] or last time anyhow
[22:48:15] I see
[22:48:29] I will stop nodepool
[22:48:52] idk why this is happening more often
[22:48:57] chasemp: join us on #releng?
[22:49:27] !log labnodepool1001:~# service nodepool stop
[22:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:38] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 376
[22:54:48] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:55:08] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[22:59:58] !log restart nova-api and nova-network
[23:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:31] !log freeswap && labcontrol1001:/home/rush# service rabbitmq-server restart
[23:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:01] getting timeout errors that make it seem like rabbit is still not playing nice when I issue openstack commands madhuvishy
[23:08:11] yeah
[23:08:22] https://www.irccloud.com/pastebin/nwYTFF5F/
[23:08:28] PROBLEM - salt-minion processes on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:08:38] chasemp: ^ repeatedly on nova-network logs
[23:08:40] madhuvishy: that could be from a restart
[23:08:48] depending on if it's actively happening now
[23:13:29] RECOVERY - salt-minion processes on labcontrol1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:21:48] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[23:29:58] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[23:40:49] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[23:41:29] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational
[23:42:30] !log labnodepool1001:~# service nodepool start
[23:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:49] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 38.46% of data above the critical threshold [140.0]
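For reference, the recovery steps !logged during the RabbitMQ/nodepool incident above, collected into one sequence. Hosts are taken from the log entries where stated; the host for the nova-api/nova-network restart is not named, and the cleanup of stale nodepool state mentioned at 22:48 is not shown here:

    # On labcontrol1001: restart the message broker (done at 22:43 and again,
    # together with freeing swap, at 23:06)
    service rabbitmq-server restart
    # On labnodepool1001: stop nodepool so it stops queueing work while the broker recovers
    service nodepool stop
    # Restart the OpenStack services that talk to RabbitMQ (host not named in the log)
    service nova-api restart
    service nova-network restart
    # On labnodepool1001: bring nodepool back once openstack commands respond again
    service nodepool start

The later RECOVERY lines for nova-fullstack, nodepoold, and the systemd state on labnodepool1001 show the stack settling afterwards, with the Zuul Gearman backlog alert at 23:49 as the remaining fallout.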