[00:52:16] Operations, Gerrit, ORES, Scoring-platform-team, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3593368 (mmodell)
[00:54:29] Operations, Gerrit, ORES, Scoring-platform-team, and 2 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3593374 (awight)
[00:55:00] Operations, ORES, Scap, Scoring-platform-team, Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3593377 (awight)
[01:05:38] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[01:12:54] !log ebernhardson@tin Synchronized php-1.30.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: T171740: Reduce annoyance of survey by enforcing minimum 2 days between showing survey to same browser (duration: 00m 46s)
[01:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:08] T171740: [Epic] Search Relevance: graded by humans - https://phabricator.wikimedia.org/T171740
[03:28:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.44 seconds
[04:17:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 188.54 seconds
[07:18:39] RECOVERY - Check systemd state on restbase1010 is OK: OK - running: The system is fully operational
[10:02:08] (PS15) MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[10:37:13] Operations, DNS, Traffic: arbcom-fi.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T175447#3593617 (Stryn)
[10:38:34] Operations, DNS, Traffic: arbcom-fi.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T175447#3593618 (Peachey88)
[11:02:58] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2026851
[11:35:08] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2021227
[11:58:58] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[12:02:59] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[12:05:08] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[12:12:08] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:42:59] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 37
[12:52:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:53:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[12:55:57] Operations, DNS, Traffic: arbcom-fi.wikipedia.org does not have a mobile version at arbcom-fi.m.wikipedia.org - https://phabricator.wikimedia.org/T175447#3593692 (Aklapper)
[13:00:33] (Draft1) MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/376886
[13:00:40] (PS2) MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/376886
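The "Check Varnish expiry mailbox lag" alerts in the stretch of log above track how far the cache's expiry thread has fallen behind in reaping expired objects. Purely as an illustration of where such a number can come from (not the production Icinga plugin, which may compute it differently), the lag can be approximated on a Varnish 4 cache host from the MAIN.exp_mailed and MAIN.exp_received counters:

    # Rough manual approximation, assuming Varnish 4's varnishstat counters;
    # the actual check script used in production is not shown in this log.
    mailed=$(varnishstat -1 -f MAIN.exp_mailed | awk '{print $2}')
    received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
    echo "expiry mailbox lag: $((mailed - received))"

A gap of roughly two million, as reported for cp1049 and cp1074 above, is what trips the CRITICAL state; the matching RECOVERY lines show it draining back toward zero.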
[13:01:25] (PS3) MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/376886
[13:08:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:10:36] (Draft1) MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356)
[13:10:38] (PS2) MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356)
[13:10:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:19:19] (CR) Jayprakash12345: [C: +1] Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356) (owner: MarcoAurelio)
[13:19:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:20:08] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:20:49] (PS8) MarcoAurelio: Cloud VPS configuration for hi.wikivoyage [puppet] - https://gerrit.wikimedia.org/r/371096 (https://phabricator.wikimedia.org/T173013)
[13:28:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:34:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:34:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:39:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:41:37] (CR) Eranroz: "What exactly should be monitored?" [mediawiki-config] - https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: Eranroz)
[13:42:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:44:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:54:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:57:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:57:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:29:48] (PS1) Reedy: Add mobile subdomains for all arbcom wikis [dns] - https://gerrit.wikimedia.org/r/376904 (https://phabricator.wikimedia.org/T175447)
[15:34:37] Operations, DNS, Patch-For-Review: arbcom-fi.wikipedia.org does not have a mobile version at arbcom-fi.m.wikipedia.org - https://phabricator.wikimedia.org/T175447#3593816 (Reedy)
[16:00:29] I submitted a continuous job using jstart (for a nodeJS application), but it did not start the application.
[16:02:17] It is working when I just run the binary file.
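Regarding the jstart question just above: on the Toolforge grid, jstart submits a continuous job (restarted automatically if it exits), and Node.js applications usually need the interpreter and the script given as explicit full paths because the job does not inherit the tool's interactive shell environment. A minimal sketch with a made-up tool name, placeholder paths, and an assumed interpreter location of /usr/bin/nodejs; the pointer to #wikimedia-cloud later in this log is the right place for the authoritative answer:

    # Hypothetical example: "my-node-app", the tool directory and the interpreter
    # path are placeholders, not taken from this log.
    jstart -N my-node-app -mem 1g /usr/bin/nodejs /data/project/my-tool/app.js
    # Then check what the grid actually did:
    qstat
    tail /data/project/my-tool/my-node-app.err   # jsub/jstart write jobname.err/.out by default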
[17:08:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:19:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:52:49] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89972.41 seconds
[19:05:33] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (Gehel) Since we are decommissioning logstash100[1-3] fairly soon, it's probably not worth investing any time in them...
[20:40:05] (CR) Daniel Kinzler: "The most crucial aspect to monitor is indeed size/growth of the wbc_entity_usage table. We expect it to grow significantly, but hopefully " [mediawiki-config] - https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: Eranroz)
[20:41:46] !log legoktm@tin Synchronized php-1.30.0-wmf.17/includes/filerepo/file/LocalFile.php: Fix issues related to comments and files in UI and API - T175443 T175444 (duration: 00m 46s)
[20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:01] T175443: [BUG] API:Imageinfo returns null for &iiprop=comment - https://phabricator.wikimedia.org/T175443
[20:42:01] T175444: File History: Comments are not displayed anymore for non-current versions - https://phabricator.wikimedia.org/T175444
[20:42:28] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2046587
[21:31:19] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (241954s 200000s)
[21:32:08] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (241954s 200000s)
[21:42:14] acagastyaQUASSEL: You can get help with Toolforge problems like your jsub question in the #wikimedia-cloud channel.
[21:53:16] bd808: fyi the wt-static and wikitech in sync check maybe should be reworded so the check name (the beginning prior to where it says critical) isn't a question lol
[21:56:11] I think it's actually the right words, but missing punctuation
[21:57:40] bd808: I guess that can be true too
[21:58:36] (PS1) BryanDavis: wmcs: tweak wikitech-static test label [puppet] - https://gerrit.wikimedia.org/r/377023
[21:58:58] I like how it applies on labtestweb2001 as well
[21:59:57] anyway it looks like the last import to wikitech-static happened on the 7th
[22:01:42] bd808, I take it you can log into wikitech-static?
[22:01:59] Krenair: nope. I don't have the creds for it
[22:02:05] wow
[22:02:18] andrewbogott and mutante have been taking care of it
[22:02:23] (CR) Zppix: wmcs: tweak wikitech-static test label (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis)
[22:06:24] (CR) BryanDavis: wmcs: tweak wikitech-static test label (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis)
[22:07:01] Krenair: I'm sure I could have the login if I asked, but I've got enough jobs already :)
[22:07:04] (CR) Zppix: [C: +1] wmcs: tweak wikitech-static test label [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis)
[22:07:25] bd808, aren't you in the ops team now?
[22:07:34] no, I'm not a root
[22:07:39] IIRC ops share the root password
[22:07:47] oh
[22:07:58] so you manage ops but aren't one yourself?
[22:08:15] correct.
[22:08:59] I've never had a manager who was a software developer. It doesn't seem to be a weird arrangement to me
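A note on the "are wikitech and wt-static in sync" alerts above: the two values in parentheses are the age of the last successful wikitech-static import and the allowed maximum, i.e. 241954 s of staleness against a 200000 s limit, which matches the observation that the last import happened on the 7th. Purely as an illustration of that comparison, not the real plugin (which determines the import age itself):

    # Illustrative comparison only; values copied from the 21:31 alert text.
    age=241954      # seconds since wikitech-static last imported
    max_age=200000  # allowed maximum before the check goes CRITICAL
    if [ "$age" -gt "$max_age" ]; then
        echo "CRIT - wikitech and wikitech-static out of sync (${age}s ${max_age}s)"
    else
        echo "OK - wikitech and wikitech-static in sync"
    fi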
[22:43:59] !log Running service rabbitmq-server restart on labcontrol1001
[22:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:04] madhuvishy: restarting rabbit kk, is it just nodepool that is hosed up?
[22:47:16] chasemp: no
[22:47:35] my manual fullstack run is building forever and timing out too
[22:47:39] yeah I see nova-fullstack reporting timeouts
[22:47:41] k
[22:48:01] madhuvishy: before iirc andrew had to stop nodepool and clean everything out
[22:48:10] or it just kept drowning
[22:48:14] chasemp: still seeing timeout errors, not sure rabbit restart has fixed anything right now
[22:48:15] or last time anyhow
[22:48:15] I see
[22:48:29] I will stop nodepool
[22:48:52] idk why this is happening more often
[22:48:57] chasemp: join us on #releng?
[22:49:27] !log labnodepool1001:~# service nodepool stop
[22:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:38] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 376
[22:54:48] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:55:08] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[22:59:58] !log restart nova-api and nova-network
[23:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:31] !log freeswap && labcontrol1001:/home/rush# service rabbitmq-server restart
[23:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:01] getting timeout errors that make it seem like rabbit is still not playing nice when I issue openstack commands madhuvishy
[23:08:11] yeah
[23:08:22] https://www.irccloud.com/pastebin/nwYTFF5F/
[23:08:28] PROBLEM - salt-minion processes on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:08:38] chasemp: ^ repeatedly on nova-network logs
[23:08:40] madhuvishy: that could be from a restart
[23:08:48] depending on if it's actively happening now
[23:13:29] RECOVERY - salt-minion processes on labcontrol1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:21:48] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[23:29:58] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[23:40:49] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[23:41:29] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational
[23:42:30] !log labnodepool1001:~# service nodepool start
[23:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:49] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 38.46% of data above the critical threshold [140.0]
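For reference, the recovery steps !logged during the RabbitMQ/nodepool incident above, collected into one sequence. Hosts are taken from the log entries where stated; the host for the nova-api/nova-network restart is not named, and the cleanup of stale nodepool state mentioned at 22:48 is not shown here:

    # On labcontrol1001: restart the message broker (done at 22:43 and again,
    # together with freeing swap, at 23:06)
    service rabbitmq-server restart
    # On labnodepool1001: stop nodepool so it stops queueing work while the broker recovers
    service nodepool stop
    # Restart the OpenStack services that talk to RabbitMQ (host not named in the log)
    service nova-api restart
    service nova-network restart
    # On labnodepool1001: bring nodepool back once openstack commands respond again
    service nodepool start

The later RECOVERY lines for nova-fullstack, nodepoold, and the systemd state on labnodepool1001 show the stack settling afterwards, with the Zuul Gearman backlog alert at 23:49 as the remaining fallout.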