[00:12:24] <wikibugs>	 (PS1) GTirloni: clush: Add sge group [puppet] - https://gerrit.wikimedia.org/r/487314
[00:16:11] <wikibugs>	 (CR) GTirloni: [C: +2] clush: Add sge group [puppet] - https://gerrit.wikimedia.org/r/487314 (owner: GTirloni)
[00:16:29] <icinga-wm>	 PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:19:45] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling)
[00:31:33] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Agreed. There is zero need for a subdomain: this is why CSS exists and has been supported since at least 1998.
[00:45:02] <wikibugs>	 Puppet, MobileFrontend, Readers-Web-Backlog (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (Krenair) See also {T214998}
[00:48:17] <icinga-wm>	 RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:18:35] <James_F>	 gerrit seems really slow?
[01:35:59] <MatmaRex>	 very slow for me as well. (everything works, it just takes several seconds)
[01:46:38] <paladox>	 Me too
[01:46:44] <paladox>	 Happening to others
[01:46:57] * paladox creates a task
[01:47:41] <wikibugs>	 Operations, Gerrit: Gerrit losing slowly - https://phabricator.wikimedia.org/T215004 (Paladox)
[01:47:47] <wikibugs>	 Operations, Gerrit: Gerrit losing slowly - https://phabricator.wikimedia.org/T215004 (Paladox) p:Triage→Unbreak!
[01:49:27] <paladox>	 https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc
[01:49:37] <paladox>	 CPU and load have gone up
[01:51:09] <wikibugs>	 Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (AntiCompositeNumber)
[01:51:16] <wikibugs>	 Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (TheDragonFire) I get intermittent `plugin failed to load` errors, as well as this in console:  ` Thu Jan 31 01:49:32 GMT+000 2019 Class$82 SEVERE: TypeError: a is null Class$S43: TypeError: a is null  at Unknown.TP(htt...
[01:53:39] <paladox>	 Anyone able to contact ops / releng?
[01:55:28] <TheDragonFire>	 ze servers r on fire!!!
[01:56:23] <AntiComposite>	 Ops Clinic Duty: <no one during all hands>
[01:56:28] <AntiComposite>	 Well that's discouraging.
[01:57:40] * gtirloni is taking a look
[01:57:57] <paladox>	 Thanks gtirloni 
[01:58:15] <paladox>	 I’m wondering what’s taking that much cpu
[01:59:35] <gtirloni>	 paladox: do you have the task #?
[01:59:45] <paladox>	 Yup
[01:59:48] <paladox>	 https://phabricator.wikimedia.org/T215004
[02:01:55] <gtirloni>	 !log T215004 restarted gerrit (using 1200% cpu, 71% mem)
[02:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:01:59] <stashbot>	 T215004: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004
[02:02:19] <paladox>	 Thanks!
[02:03:29] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[02:04:36] <gtirloni>	 paladox: does it look okay now?
[02:04:37] <icinga-wm>	 PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[02:04:55] <paladox>	 gtirloni: yup
[02:04:59] <paladox>	 Much better thanks!
[02:05:04] <gtirloni>	 cool, I'll blame java :) 
[02:05:23] <paladox>	 Lol
[02:05:29] <gtirloni>	 I don't know what's our error budget for gerrit but I'll let someone else more knowledgeable about gerrit investigate the root cause... a restart seems fine for now 
[02:05:45] <paladox>	 Thanks :)
[02:05:53] <gtirloni>	 thanks for reporting it
[02:05:55] <paladox>	 The bug could likely have been fixed already upstream
[02:06:07] <paladox>	 But they had such a big refactor that it's hard to tell
[02:06:30] <gtirloni>	 James_F: thanks for reporting it too
[02:06:33] <gtirloni>	 paladox: got it
[02:06:44] <paladox>	 :)
[02:06:49] <icinga-wm>	 PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[02:06:55] <icinga-wm>	 PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[02:07:02] <paladox>	 That should be recovering soon
[02:07:07] <paladox>	 Always happens after a restart
[02:07:13] <gtirloni>	 ok
[02:07:25] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[02:08:18] <wikibugs>	 Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (GTirloni) It seems the issue is gone after a reboot. I didn't see anything special in /var/log/gerrit/error_log.  Thanks for reporting this.
[02:08:34] <wikibugs>	 Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (GTirloni) Open→Resolved p:Unbreak!→High a:GTirloni
[02:08:43] <paladox>	 Thanks!
[02:29:59] <icinga-wm>	 RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:31:05] <icinga-wm>	 RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:33:19] <icinga-wm>	 RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:33:23] <icinga-wm>	 RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:33:55] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:41:05] <wikibugs>	 Operations, Cloud-VPS, Discovery-Search, cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (Mathew.onipe) a:Gehel→Mathew.onipe
[02:41:36] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team, Discovery-Search (Current work): rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (Mathew.onipe)
[03:23:58] <wikibugs>	 (CR) Mathew.onipe: cloudelastic: Add cloudelastic configs (9 comments) [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[03:25:13] <wikibugs>	 (PS3) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[05:43:53] <wikibugs>	 (CR) Sayant Mahato: "Any updates on this project? We do have namespaces with colon like Visarga(https://en.wikipedia.org/wiki/Visarga). Here are some examples-" [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[06:13:04] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Legoktm) > The m. subdomain, as in en.m.wikipedia.org, is annoying. Links shared on social media are randomly mobile or non-mobile, and so des...
[06:46:53] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Alternate proposal: create   - [langcode].braille.[project].org   - [langcode].embossed.[project].org   - [langcode].handheld.[project]...
[06:57:45] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:57:51] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:57:57] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:00:23] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:00:27] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:05:47] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:12:17] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Funnyjokes2019)  Commonly we attain confused when we finally hear the exact terms website development service in addition to internet develope...
[07:45:08] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) p:Triage→Unbreak!
[07:45:38] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Escalated to unbreak now to bring attention to the vandal: https://phabricator.wikimedia.org/p/Funnyjokes2019/  I've never seen vandali...
[08:03:39] <icinga-wm>	 PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:47] <icinga-wm>	 RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 75630 bytes in 0.112 second response time
[08:09:32] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Ammarpad) p:Unbreak!→Triage The account has been disabled.
[08:12:33] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) >>! In T214998#4919908, @Koavf wrote: > Alternate proposal: create >   - [langcode].braille.[project].org  I assume this sarcastic...
[08:24:10] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) >>! In T214998#4919947, @tstarling wrote: >>>! In T214998#4919908, @Koavf wrote: >> Alternate proposal: create >>   - [langcode].braill...
[08:58:59] <icinga-wm>	 PROBLEM - Backup of s7 in codfw on db1115 is CRITICAL: Backup for s7 at codfw taken more than 8 days ago: Most recent backup 2019-01-23 08:35:54
[08:59:59] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) Yeah, very funny.  In case it needs to be said, your comment was an attempt to support my proposal by reductio ad absurdum, via a r...
[09:11:29] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:14:03] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:15:09] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:34:29] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) My problem is that you are making fun of MobileFrontend, and by implication, the engineers who designed MobileFrontend, who may wel...
[09:36:38] <arturo>	 jynus: a cloud user is reporting an issue with EventStream
[09:40:59] <jynus>	 the api, the irc?
[09:41:35] <wikibugs>	 (PS1) Jcrespo: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478)
[09:43:39] <jynus>	 I see high load there: https://grafana.wikimedia.org/d/000000336/eventstreams?refresh=1m&orgId=1&from=1546335808416&to=1548927808416&panelId=1&fullscreen&var-stream=All&var-topic=All&var-scb_host=All
[09:54:31] <jynus>	 !log restarting pdfrender on scb1001
[09:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:23] <jynus>	 !log restarting pdfrender on scb1002,3,4
[10:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:04] <arturo>	 jynus: I asked the person to open a phab ticket
[10:20:20] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) It is ok, will that be next week, e.g. Tuesday?
[10:22:43] <jynus>	 !log resetting to defaults innodb consistency options for db2048 T188327
[10:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:47] <stashbot>	 T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327
[10:31:01] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (aborrero) >>! In T209029#4919182, @Cmjohnson wrote: > @aborrero can this server be re-installed.....there is a risk that removing /dev/sda will kill the OS....
[10:34:58] <wikibugs>	 (PS2) Jcrespo: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478)
[10:35:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478) (owner: Jcrespo)
[10:37:57] <wikibugs>	 (PS3) Jcrespo: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478)
[10:42:29] <Krenair>	 fuzheado in -cloud has said eventstreams is broken
[10:42:31] <Krenair>	 the example at https://wikitech.wikimedia.org/wiki/EventStreams#Python leads to HTTP 502
[10:42:36] <Krenair>	 requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://stream.wikimedia.org/v2/stream/recentchange
[10:43:05] <Krenair>	 ah I see it was mentioned further up the chat log
[10:43:24] <Krenair>	 if it's 502ing why are no alarms going off?
[10:45:10] <Krenair>	 GET / gives HTTP 503
[10:53:19] <p858snake>	 depends if there is even monitoring setup to alert
[11:09:48] <jynus>	 so eventstreams is working
[11:09:59] <jynus>	 I can curl http://scb1003.eqiad.wmnet:8092/v2/stream/recentchange
[11:10:08] <jynus>	 internally
[11:10:35] <jynus>	 and lvs endpoints are up
[11:11:45] <jynus>	 https://stream.wikimedia.org/?spec works
[11:11:46] <icinga-wm>	 PROBLEM - Host cloudvirt1015 is DOWN: PING CRITICAL - Packet loss = 100%
[11:12:55] <arturo>	 ???
[11:15:04] <jynus>	 I can also curl https://stream.wikimedia.org/v2/stream/recentchange
[11:15:36] <jynus>	 but only from the inside
[11:17:33] <jynus>	 it could be hitting some limitation (MAX_CONCURRENT_STREAMS == 128)
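(Editor's note: jynus's guess above is that the service may be hitting a cap on concurrent streams. The sketch below is a toy model of such a cap, not the EventStreams implementation. The class name, the method names and the 503 return value are hypothetical; only the figure 128 comes from the log. It shows how a capped service can be healthy and still turn away new clients once every slot is taken.)

```python
class StreamLimiter:
    """Toy model: tracks open streams and refuses new ones past a fixed cap."""

    def __init__(self, max_concurrent_streams=128):
        # 128 is the value quoted in the log; everything else is assumed.
        self.max_concurrent_streams = max_concurrent_streams
        self.open_streams = 0

    def try_open(self):
        # Return an HTTP-like status: 200 if a slot is free, 503 otherwise.
        # The 503 is an illustrative choice, not observed service behaviour.
        if self.open_streams >= self.max_concurrent_streams:
            return 503
        self.open_streams += 1
        return 200

    def close(self):
        # Free one slot when a client disconnects.
        if self.open_streams > 0:
            self.open_streams -= 1


limiter = StreamLimiter()
statuses = [limiter.try_open() for _ in range(130)]
print(statuses.count(200), statuses.count(503))
```

(In this model a restart corresponds to creating a fresh StreamLimiter, which resets the open count to zero and lets new clients in again.)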
[11:18:10] <icinga-wm>	 ACKNOWLEDGEMENT - Host cloudvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% Arturo Borrero Gonzalez Investigating
[11:21:47] <jynus>	 I am going to restart one eventstreams service
[11:22:43] <jynus>	 !log restart eventstreams on scb1001
[11:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:42] <jynus>	 Krenair: it works now?
[11:24:29] <jynus>	 !log restart eventstreams on scb1002,3,4
[11:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:32] <arturo>	 !log T215012 reboot cloudvirt1015
[11:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:35] <stashbot>	 T215012: cloudvirt1015: server down - https://phabricator.wikimedia.org/T215012
[11:26:17] <jynus>	 did the reporter file a ticket?
[11:27:09] <arturo>	 jynus: I just asked the reporter to open a ticket
[11:27:18] <icinga-wm>	 RECOVERY - Host cloudvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[11:29:03] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[11:30:29] <arturo>	 !log T215012 icinga downtime cloudvirt1015 for 4h while investigating issues
[11:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:33] <stashbot>	 T215012: cloudvirt1015: server down - https://phabricator.wikimedia.org/T215012
[11:32:27] <icinga-wm>	 RECOVERY - Backup of s7 in eqiad on db1115 is OK: Backup for s7 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-01-31 10:10:01 from db1116.eqiad.wmnet:3317 (101 GB)
[11:36:53] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1015 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64
[11:37:28] <wikibugs>	 (CR) Raz Shuty: [C: +1] "wait for SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: Ladsgroup)
[11:54:03] <wikibugs>	 (PS1) Jcrespo: mariadb: Grant m5 access to testreduce databases to scandium [puppet] - https://gerrit.wikimedia.org/r/487348 (https://phabricator.wikimedia.org/T214740)
[11:56:09] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Grant m5 access to testreduce databases to scandium [puppet] - https://gerrit.wikimedia.org/r/487348 (https://phabricator.wikimedia.org/T214740) (owner: Jcrespo)
[11:56:17] <wikibugs>	 (PS2) Jcrespo: mariadb: Grant m5 access to testreduce databases to scandium [puppet] - https://gerrit.wikimedia.org/r/487348 (https://phabricator.wikimedia.org/T214740)
[12:08:14] <wikibugs>	 (PS1) Jcrespo: mariadb: Style fixes for scandium grants [puppet] - https://gerrit.wikimedia.org/r/487349 (https://phabricator.wikimedia.org/T214740)
[12:09:01] <wikibugs>	 (PS2) Jcrespo: mariadb: Style fixes for scandium grants [puppet] - https://gerrit.wikimedia.org/r/487349 (https://phabricator.wikimedia.org/T214740)
[12:09:38] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Style fixes for scandium grants [puppet] - https://gerrit.wikimedia.org/r/487349 (https://phabricator.wikimedia.org/T214740) (owner: Jcrespo)
[12:12:11] <jynus>	 !log apply new grants to m5-master with replication T214740
[12:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:15] <stashbot>	 T214740: Provide access to testreduce* databases on scandium + revoke from ruthenium - https://phabricator.wikimedia.org/T214740
[12:13:15] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[12:17:20] <jynus>	 arturo: cloudvirt1019 may be a WIP-to-setup host ?
[12:24:52] <arturo>	 jynus: yes, it is in hardware limbo
[12:25:11] <jynus>	 can I ack the service alert?
[12:25:35] <jynus>	 PROCS CRITICAL: 0 processes with regex args 'qemu-system-x86_64' ?
[12:26:16] <jynus>	 or downtime it for some days/weeks ?
[12:26:53] <arturo>	 it has notification disabled, see T196507 for context
[12:26:54] <stashbot>	 T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507
[12:27:02] <arturo>	 feel free to ack anything
[12:27:09] <jynus>	 yes, but it shows on unacknowledged alerts
[12:27:13] <jynus>	 thanks
[12:28:12] <jynus>	 it is not a problem, I am just trying to triage existing alerts
[12:29:30] <arturo>	 cool
[12:54:54] <jynus>	 !log stop, upgrade and restart db2044
[12:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:49] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:18:25] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[13:18:54] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml stable/zotero [namespace: zotero, clusters: staging]
[13:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:14] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml --version=0.0.1 stable/zotero [namespace: zotero, clusters: staging]
[13:19:15] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero cluster staging completed
[13:19:15] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero finished
[13:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:52] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad]
[13:31:53] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero cluster eqiad completed
[13:31:53] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero finished
[13:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:23] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[13:34:25] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero cluster codfw completed
[13:34:25] <logmsgbot>	 !log mvolz@deploy1001 scap-helm zotero finished
[13:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:39] <wikibugs>	 Operations, Citoid, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (Mvolz)
[13:41:02] <wikibugs>	 Operations, Citoid, Patch-For-Review, Services (done), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (Mvolz)
[13:41:07] <wikibugs>	 Operations, Citoid, Core Platform Team Backlog (Watching / External), Services (watching), VisualEditor (Current work): Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (Mvolz) Open→Resolve...
[13:47:55] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (TheDragonFire) I don't think there is any significant animosity toward any engineers involved in building what was at the time a pragmatic sol...
[14:13:13] <wikibugs>	 (PS4) Marostegui: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478) (owner: Jcrespo)
[14:14:20] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478) (owner: Jcrespo)
[14:16:43] <wikibugs>	 (CR) Marostegui: [C: +2] analytics-grants.sql: Remove unused grants [puppet] - https://gerrit.wikimedia.org/r/487059 (https://phabricator.wikimedia.org/T214469) (owner: Marostegui)
[14:16:50] <wikibugs>	 (PS2) Marostegui: analytics-grants.sql: Remove unused grants [puppet] - https://gerrit.wikimedia.org/r/487059 (https://phabricator.wikimedia.org/T214469)
[14:20:51] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:27:21] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[14:39:16] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] "All good, but I don't have the right to do a +2 in this codebase." [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio)
[14:46:50] <wikibugs>	 (CR) Addshore: [C: +1] Fix typo 'neccessary' [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio)
[15:06:02] <wikibugs>	 Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Me, Niklas and Aaron had a long chat face...
[15:30:04] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:33:49] <wikibugs>	 (PS1) Mathew.onipe: maps: migrate maps2004 to stretch [puppet] - https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622)
[15:46:54] <icinga-wm>	 RECOVERY - Backup of s7 in codfw on db1115 is OK: Backup for s7 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-01-31 10:12:32 from dbstore2001.codfw.wmnet:3317 (101 GB)
[15:50:46] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (Paladox) This would have been usage for debugging T215004 , should we triage as high?
[15:51:19] <jynus>	 I am about to restart db1117 - that will cause some proxies to detect it as down - I have downtimed them
[15:52:06] <jynus>	 they should recover when db1117 restarts fully
[15:54:20] <wikibugs>	 Operations, Gerrit: Investigate why icinga did not report high cpu/load - https://phabricator.wikimedia.org/T215033 (Paladox)
[15:54:23] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) >>! In T214998#4920019, @tstarling wrote: > My problem is that you are making fun of MobileFrontend, and by implication, the engineers...
[15:54:26] <wikibugs>	 Operations, Gerrit: Investigate why icinga did not report high cpu/load - https://phabricator.wikimedia.org/T215033 (Paladox) p:Triage→High
[15:54:53] <jynus>	 !log stop, upgrade and restart db1117
[15:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:34] <wikibugs>	 Operations, Gerrit: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Paladox)
[15:57:00] <jynus>	 proxies complaining, as expected
[15:57:14] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[15:57:50] <jynus>	 monitoring just in case we have a double failure
[15:58:00] <jynus>	 as we are on reduced availability right now
[16:00:03] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvps: cold-migrate: exclude rotated console.log files [puppet] - https://gerrit.wikimedia.org/r/487362
[16:05:21] <jynus>	 and proxies are back to normal
[16:07:13] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "Assuming that --exclude handles wildcards like that, this is a definite improvement." [puppet] - https://gerrit.wikimedia.org/r/487362 (owner: Arturo Borrero Gonzalez)
[16:11:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:14:12] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active, AS6461/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:14:16] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): Stop NavPopups gadget conflict with PagePreviews on Wikivoyage (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01)
[16:14:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "> Assuming that --exclude handles wildcards like that, this is a" [puppet] - https://gerrit.wikimedia.org/r/487362 (owner: Arturo Borrero Gonzalez)
[16:29:42] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 22, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:32:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:59:22] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:02:43] <jynus>	 mutante subbu: should we add notifications_enabled:0 to scandium ^ until it is set up?
[17:26:36] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[17:35:44] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: metawiki. [Query snipped]
[17:39:17] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: reallocate openstack wrapper script [puppet] - https://gerrit.wikimedia.org/r/487367
[17:40:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: reallocate openstack wrapper script [puppet] - https://gerrit.wikimedia.org/r/487367 (owner: Arturo Borrero Gonzalez)
[17:44:12] <jynus>	 !log running alter table on metawiki.revision_actor_temp, trying to fix TokuDB horrible bugs
[17:44:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:06] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: prefix all of our customs scripts with 'wmcs-' [puppet] - https://gerrit.wikimedia.org/r/487368
[17:49:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: prefix all of our customs scripts with 'wmcs-' [puppet] - https://gerrit.wikimedia.org/r/487368 (owner: Arturo Borrero Gonzalez)
[17:50:14] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1043.04 seconds
[17:54:14] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[17:57:05] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1015: disable notifications [puppet] - https://gerrit.wikimedia.org/r/487370 (https://phabricator.wikimedia.org/T215012)
[17:57:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1015: disable notifications [puppet] - https://gerrit.wikimedia.org/r/487370 (https://phabricator.wikimedia.org/T215012) (owner: Arturo Borrero Gonzalez)
[18:00:48] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Arturo Borrero Gonzalez Server is depooled due to T215012
[18:01:48] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:29:02] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:29:20] <wikibugs>	 Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (D3r1ck01)
[18:41:26] <wikibugs>	 Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (D3r1ck01)
[18:45:41] <wikibugs>	 Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (D3r1ck01)
[18:55:40] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[18:59:11] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (stjn) If you add some redirection, add a way to circumvent it please (via some specific query?). Right now testing for mobile version involves...
[19:04:18] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.66 seconds
[19:22:52] <wikibugs>	 (CR) GTirloni: [C: +1] wmcs: prefix all of our customs scripts with 'wmcs-' [puppet] - https://gerrit.wikimedia.org/r/487368 (owner: Arturo Borrero Gonzalez)
[19:28:05] <wikibugs>	 (CR) D3r1ck01: "Thanks! :)" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01)
[19:29:25] <wikibugs>	 (PS3) D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878)
[19:40:01] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01)
[19:46:03] <wikibugs>	 Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (pmiazga) @Jhernandez I'll update documentation once all Tgr questions are answered.
[20:03:37] <wikibugs>	 Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (greg) Thanks @GTirloni
[20:11:34] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) >>! In T214998#4920810, @stjn wrote: > If you add some redirection, add a way to circumvent it please (via some specific query?). R...
[20:29:39] <wikibugs>	 Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (stjn) Wasn’t aware of this link, thanks for addressing my concern. (‘Mobile view’ link sets cookies that make mobile version permanent for a v...
[20:40:44] <wikibugs>	 Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Peachey88)
[20:41:57] <wikibugs>	 (PS1) CRusnov: management.py: trivial change to adapt for Netbox 2.5+ [software/netbox-reports] - https://gerrit.wikimedia.org/r/487380 (https://phabricator.wikimedia.org/T212524)
[21:21:39] <wikibugs>	 Operations, Wikibase-Containers, Wikidata, serviceops, and 2 others: Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (Ladsgroup) I would be in favor of not using nginx and turning WDQS gui to a proper npm package so we can have real deployment pipeline (right...
[21:29:01] <wikibugs>	 Operations: dbtree.wikimedia.org down - https://phabricator.wikimedia.org/T215040 (Marostegui)
[21:34:16] <wikibugs>	 (CR) Marostegui: [C: +1] "Although we the module isn't used anymore on core or misc I have run the PPC against a core host, misc, labs, sanitarium, dbstore...just i" [puppet] - https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) (owner: Ottomata)
[21:45:16] <wikibugs>	 (CR) Volans: [V: +2 C: +2] "LGTM" [software/netbox-reports] - https://gerrit.wikimedia.org/r/487380 (https://phabricator.wikimedia.org/T212524) (owner: CRusnov)
[21:59:28] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:02:30] <wikibugs>	 (CR) CRusnov: [V: +2 C: +2] management.py: trivial change to adapt for Netbox 2.5+ [software/netbox-reports] - https://gerrit.wikimedia.org/r/487380 (https://phabricator.wikimedia.org/T212524) (owner: CRusnov)
[22:07:30] <Trijnstel>	 I'm looking for a sysadmin to supervise a rename of an account with more than 100,000 edits
[22:07:36] <Trijnstel>	 anyone here who can help me?
[22:09:04] <Krenair>	 most of the foundation is in SF
[22:09:51] <Krenair>	 2PM there so they're probably doing stuff, maybe someone is watching IRC and can watch that, idk
[22:12:08] <Trijnstel>	 then I guess we could use more European sysadmins...
[22:12:25] <Krenair>	 you misunderstand
[22:12:27] <Platonides>	 it is night at Europe
[22:12:29] <Krenair>	 this week was all-hands
[22:12:53] <Krenair>	 the annual thing that has them all fly to SF
[22:15:31] <Trijnstel>	 what is all-hands?
[22:15:54] <Trijnstel>	 but anyway... no sysadmin around? :(
[22:16:50] <paladox>	 Nope, they are all busy doing all hands stuff.
[22:17:20] <Trijnstel>	 hmm, then I guess it shouldd wait
[22:17:22] <Trijnstel>	 -d
[22:17:44] <Trijnstel>	 is there an email address on which I can reach them?
[22:17:54] <Platonides>	 Trijnstel: you could open a phabricator task
[22:18:21] <Trijnstel>	 ah, and what or who could I tag?
[22:18:26] <MatmaRex>	 for emergencies, i'm sure folks would get to work at fixing them, but i hope this is not an emergency
[22:18:44] <MatmaRex>	 (i'm also at all hands, taking a break from the activities :) but not a sysadmin)
[22:18:51] <Trijnstel>	 it's not an emergency, but I prefer to have this done on irc - live with them
[22:19:33] <MatmaRex>	 Trijnstel: on monday everything should be back to normal
[22:19:41] <Trijnstel>	 hmm k
[22:19:53] <Trijnstel>	 wrong timing I guess :p
[22:26:58] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[22:46:16] <wikibugs>	 (CR) Marostegui: [C: +1] "We also need to remove the GRANTs from MySQL itself" [puppet] - https://gerrit.wikimedia.org/r/466833 (owner: Muehlenhoff)
[22:52:40] <wikibugs>	 Operations: dbtree.wikimedia.org down - https://phabricator.wikimedia.org/T215040 (Marostegui) Open→Resolved From what I can see it was failing on the call to:     google.setOnLoadCallback(drawChart);  It is now working, so it might have been a punctual issue.
[22:54:35] <wikibugs>	 Operations, Analytics, Product-Analytics, User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (elukey) Quick idea that could alleviate this issue: if we create a cgroup like https://wiki.archlinux.org/index.php/cgroups#Matlab on the stat/notebook...
[23:11:51] <wikibugs>	 Operations, MediaWiki-Cache, Performance-Team, Patch-For-Review, User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (elukey) @Andrew if you have time can we chat about the labswiki steps stated above?
[23:32:51] <wikibugs>	 (PS1) Elukey: Fix ports for wmcs/labs' Prometheus Memcached exporters [puppet] - https://gerrit.wikimedia.org/r/487453
[23:40:30] <wikibugs>	 (CR) Elukey: [C: -1] "Still WIP" [puppet] - https://gerrit.wikimedia.org/r/487453 (owner: Elukey)
[23:53:13] <wikibugs>	 Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Aklapper) Please include context. Is this related to T215004?
[23:56:30] <wikibugs>	 Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Paladox) Yup, it's related to that.
[23:59:58] <wikibugs>	 (PS2) Elukey: Fix ports for wmcs/labs' Prometheus Memcached exporters [puppet] - https://gerrit.wikimedia.org/r/487453