[00:00:45] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[00:00:53] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[00:00:55] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[00:02:53] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures
[00:04:22] Operations, Analytics, Jupyter-Hub: notebook1001 shown as DOWN in icinga, due to firewall rules - https://phabricator.wikimedia.org/T138685 (Dzahn) Open→Resolved a:Dzahn no response since 2016 and meanwhile there is no more notebook1001. closing.
[00:06:35] Nettrom: hi:)
[00:06:41] Nettrom: did you actually see my wall message? heh
[00:06:43] mutante: hey!
[00:06:53] yes, so I was wondering what was going on
[00:06:56] i am surprised it worked, was just a try :)
[00:06:59] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[00:07:03] ^ this is going on
[00:07:14] the server keeps running out of memory
[00:07:15] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[00:07:15] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[00:07:19] and then that kills the nagios-nrpe-server
[00:07:26] and that means all those monitoring alerts fire
[00:07:57] it's about RAM
[00:07:58] notebook1004 bash[4971]: # Native memory allocation (mmap) failed to map 44564480 bytes for committing reserved memory.
[00:08:01] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[00:08:09] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[00:08:18] sorry to hear that!
I stopped all my running notebook kernels just in case… looks like Nate is running a large R job there, currently eating up 40G of memory or so
[00:09:39] Nettrom: i see, you are the only one logged in and i just did echo | wall in case
[00:10:29] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[00:10:39] guess i should just make a ticket about it needing some kind of quota system
[00:10:45] mutante: no problem, at least I can ping HaeB so he can ping Nate and let him know that there's a bit of a problem
[00:11:15] Nettrom: ok, thank you
[00:11:55] or even better, I'll msg Nate to let him know
[00:12:06] cool
[00:13:04] {{done}}
[00:14:07] thanks Nettrom (nate = groceryheist on #wikimedia-analytics)
[00:14:57] mutante: hello
[00:15:09] heard you are looking for me :)
[00:15:52] groceryheist: hello, it looks like your job is using all the memory that this notebook server has
[00:16:01] and that kills other things like the service for monitoring
[00:16:18] and then we get a bunch of alerts about notebook1004
[00:16:36] sorry about that
[00:16:39] so i just tried "echo | wall" to tell users..and that worked
[00:18:20] is there maybe a way to run the same thing but eh.. slower?
[00:18:25] well
[00:18:35] this isn't a recurring job -- i'm working interactively
[00:18:36] i am creating a ticket that it needs some permanent solution
[00:19:00] R's approach to managing memory is not so efficient
[00:21:08] i'm fitting a handful of models and it copies the dataset for each model
[00:21:13] and the dataset is pretty big
[00:21:13] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[00:21:21] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[00:21:24] !log notebook1004 - started nagios-nrpe-server one more time
[00:21:25] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[00:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:41] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[00:21:41] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[00:22:02] groceryheist: ok, just might want to let the other users of notebook know
[00:22:09] i think they cant use it at the same time
[00:22:25] ok
[00:23:06] groceryheist: would you say this is an Analytics thing?
[00:23:12] yes
[00:23:28] i'm actually going to finish this task pretty shortly
[00:23:29] analytics-ops maintain notebook.. ok. i am adding that to ticket
[00:23:37] in maybe 20-30 min
[00:23:59] Operations, Analytics: notebook server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (Dzahn)
[00:24:39] you can try but letting you know it kills random processes
[00:24:57] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Thu 2019-01-03 00:24:56 UTC.
[00:26:10] Operations, Analytics: notebook server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (Dzahn) ` Jan 2 22:33:15 notebook1004 kernel: [9646042.221155] R invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0-1, order=0, oom_score_adj=0 Jan 3 00:06:33 no...
[00:28:49] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:32:12] (CR) Dzahn: [C: +1] "They could all be in one place in hieradata/role/common/ and i would have certainly done that.. if they were using the same role. But sinc" [puppet] - https://gerrit.wikimedia.org/r/479335 (owner: Dzahn)
[00:41:39] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:41:40] Operations, Puppet, Cloud-Services, Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (Dzahn) @Joe Do we (still) need `systemd::sidekick` ? I have https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/456312/ to replace base:service_unit...
[00:44:31] (CR) Dzahn: "i'm afraid since this has been created there were a bunch of changes, so this would need manual rebasing at least" [puppet] - https://gerrit.wikimedia.org/r/324808 (owner: 20after4)
[00:45:26] (PS3) Dzahn: refactoring bastion into profiles [puppet] - https://gerrit.wikimedia.org/r/386752 (owner: RobH)
[00:45:56] (CR) jerkins-bot: [V: -1] refactoring bastion into profiles [puppet] - https://gerrit.wikimedia.org/r/386752 (owner: RobH)
[00:50:23] (CR) Dzahn: [C: -2] "we need to keep at least one role to apply in site.pp and we still have different types of bastion, so probably more than one role unless " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/386752 (owner: RobH)
[00:51:43] (Abandoned) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[01:01:42] (PS7) Dzahn: prometheus::ops: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/400241
[01:02:30] (CR) jerkins-bot: [V: -1] prometheus::ops: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/400241 (owner: Dzahn)
[01:05:23] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1002/14132/aqs1006.eqiad.wmnet/change.aqs1006.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/478114 (owner: Dzahn)
[01:06:45] (PS8) Dzahn: prometheus::ops: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/400241
[01:07:30] (CR) jerkins-bot: [V: -1] prometheus::ops: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/400241 (owner: Dzahn)
[01:08:31] (PS9) Dzahn: prometheus::ops: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/400241
[01:09:20] (CR) jerkins-bot: [V: -1] prometheus::ops: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/400241 (owner: Dzahn)
[01:12:32] (PS10) Dzahn: prometheus::ops: convert role to profile
[puppet] - https://gerrit.wikimedia.org/r/400241
[01:21:14] (CR) Dzahn: "@Filippo I believe it's fixed now. The differences in the compiler output are there because there is a "File{}" setting defaults that are " [puppet] - https://gerrit.wikimedia.org/r/400241 (owner: Dzahn)
[01:49:54] (PS1) BryanDavis: toolforge: Redirect tools-static to https [puppet] - https://gerrit.wikimedia.org/r/481979 (https://phabricator.wikimedia.org/T102367)
[01:57:05] Operations, Toolforge, Traffic, HTTPS, Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (bd808)
[02:01:12] Operations, DBA, Jade, TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Krinkle) For this weeks' TechCom-RFC Inbox triage, I'm unsure whether to move to Under discussion or Backlog. We usually move t...
[02:05:48] (CR) Krinkle: [C: +1] "Woo!"
[puppet] - https://gerrit.wikimedia.org/r/481979 (https://phabricator.wikimedia.org/T102367) (owner: BryanDavis)
[02:41:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[02:43:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:47:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[02:48:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[02:51:37] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[02:51:37] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:52:43] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[02:53:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[02:53:57] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[02:54:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[03:02:35] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[03:02:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[03:03:45] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[03:03:47] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[03:04:59] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[03:06:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[03:25:35] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[03:26:29] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 0.002 second response time
[03:33:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 870.50 seconds
[04:24:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 281.07 seconds
[05:30:05] Operations, Puppet, Cloud-Services, Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (Joe) @Dzahn yup it's unused and useless as things stand, we should remove it.
[05:54:36] (PS4) Giuseppe Lavagetto: profile::mediawiki::jobrunner: convert to use class httpd [puppet] - https://gerrit.wikimedia.org/r/475770
[05:57:11] (PS5) Giuseppe Lavagetto: profile::mediawiki::jobrunner: convert to use class httpd [puppet] - https://gerrit.wikimedia.org/r/475770
[06:23:29] (CR) Giuseppe Lavagetto: [C: +1] Add test-commons.wikimedia.org to prod_sites.pp [puppet] - https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616) (owner: Reedy)
[06:24:16] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/14138/ it seems ok-ish as a diff, I'll apply with care." [puppet] - https://gerrit.wikimedia.org/r/475770 (owner: Giuseppe Lavagetto)
[06:28:07] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:33] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time
[06:29:37] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:37:41] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:38:07] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.551 second response time
[06:39:13] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:39:29] <_joe_> I've no idea what happened with netmon
[06:39:39] <_joe_> but I'm in a meeting, no time to look into it
[06:57:09] Operations, Release-Engineering-Team, Scap, Patch-For-Review, and 2 others: Make scap and opcache work consistently together - https://phabricator.wikimedia.org/T211964 (Joe)
[07:00:25] Operations, serviceops, User-Joe: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 - https://phabricator.wikimedia.org/T212828 (Joe)
[07:03:10] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (ArielGlenn)
[07:38:18] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:54] log rotation glitch is what that was
[07:39:00] on the netmon host
[07:39:02] <_joe_> uh what
[07:39:06] <_joe_> es2019
[07:39:20] <_joe_> is that in rotation?
[07:39:30] no idea
[07:40:16] <_joe_> ok, while I try to get into console, would you call the DBAs?
[07:40:20] <_joe_> wmf-config/db-codfw.php: '10.192.48.42' => 3, # es2019, D6 11TB 128GB
[07:40:20] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6
[07:40:25] <_joe_> it used
[07:40:25] 'cluster 25'
[07:40:43] ah we were both looking at it; sorry, when I'm typing I don't also see irc
[07:42:07] <_joe_> I'm in console
[07:42:19] <_joe_> it seems gloriously stuck
[07:42:36] huh
[07:43:01] Operations, CirrusSearch, Discovery-Search (Current work): Add chi, psi and omega selector to the elasticsearch dashboards in grafana - https://phabricator.wikimedia.org/T211956 (Mathew.onipe) @dcausse Done! Some of metrics are dependent on this T210592
[07:43:19] <_joe_> we need to call the dbas
[07:45:23] yeah i'm trying to do that
[07:45:33] <_joe_> calling balasz I guess
[07:46:28] grrrrr
[07:46:41] not getting through
[07:53:06] <_joe_> I am not rebooting that server though
[07:53:59] nope
[07:54:01] just leave it
[07:54:40] good morning :)
[07:54:46] there is a task opened for netmon
[07:57:01] (took some time to find it :P) https://phabricator.wikimedia.org/T212697
[07:58:17] ah great, thanks for that
[07:58:30] I thought I'd seen that behavior before too but didn't remember a ticket
[08:02:32] hey
[08:02:39] hello! :)
[08:02:52] morning
[08:04:05] this is a cluster25 host, I don't know how urgent it is that it come back up but anyways, if you could have a look?
[08:05:06] I am checking es2019
[08:05:21] great
[08:05:30] <_joe_> it stopped reporting metrics to prometheus around 7:40 UTC
[08:05:35] <_joe_> so 25 minutes ago
[08:05:58] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[08:06:01] that's about the time of the alert indeed
[08:06:04] (this was me --^)
[08:06:19] which command you used for the serial console?
[08:07:37] banyek: depending on the type of the server: when connecting to the mgmt interface it's "vsp" for HPs and "console com2" for Dells
[08:11:43] !log installing OpenSSL security updates
[08:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:03] Well, I am not 100% sure, in what I am saying, but as eqiad is the primary dc now, and I've checked the pair of es2019 (in this case es2018) it seems this host is not in use, which means this outage doesn't affects the service.
[08:12:22] I think I'd do a power cycle to see if the hosts is able to reboot
[08:12:32] it's surely not serving data, true
[08:12:37] and investigate after
[08:12:50] the issue is if we had to fall back to codfw for some reason
[08:13:15] I wouldn't depool the server, because if it comes back probably the recovery solves this, and I won't waste time on that
[08:13:33] your call :-)
[08:13:51] let's do this I say
[08:14:01] in worst case we can reclone the host from es2018
[08:14:06] okey dokey
[08:14:23] (PS3) Elukey: role::analytics_cluster::hadoop::ui: update documentation [puppet] - https://gerrit.wikimedia.org/r/481910
[08:14:25] (PS1) Elukey: profile::hadoop::hdfs-balancer: fix bash script [puppet] - https://gerrit.wikimedia.org/r/481984
[08:15:01] (CR) Elukey: [C: +2] role::analytics_cluster::hadoop::ui: update documentation [puppet] - https://gerrit.wikimedia.org/r/481910 (owner: Elukey)
[08:16:10] (CR) Elukey: [C: +2] profile::hadoop::hdfs-balancer: fix bash script [puppet] - https://gerrit.wikimedia.org/r/481984 (owner: Elukey)
[08:18:10] (PS1) Muehlenhoff: Enable base::service_auto_restart for certcentral [puppet] - https://gerrit.wikimedia.org/r/481985 (https://phabricator.wikimedia.org/T135991)
[08:18:44] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for certcentral [puppet] - https://gerrit.wikimedia.org/r/481985 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:19:18] (PS1) Gergő Tisza: Remove AICaptcha settings [mediawiki-config] - https://gerrit.wikimedia.org/r/481987 (https://phabricator.wikimedia.org/T186244)
[08:19:42] I can't reset the host
[08:19:54] and I can't power down as well
[08:19:58] I'll depool it
[08:20:02] and open a ticket
[08:20:11] (PS2) Muehlenhoff: Enable base::service_auto_restart for certcentral [puppet] - https://gerrit.wikimedia.org/r/481985 (https://phabricator.wikimedia.org/T135991)
[08:20:51] ok!
[08:25:27] (PS1) Banyek: mariadb: depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/481989 (https://phabricator.wikimedia.org/T212833)
[08:26:09] aspergos: can I have a quick CR on that? ^
[08:28:34] or moritzm?
[08:30:24] looking
[08:31:01] (CR) ArielGlenn: [C: +1] mariadb: depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/481989 (https://phabricator.wikimedia.org/T212833) (owner: Banyek)
[08:31:18] sorry, the ping doesn't work if there's a typo :-)
[08:31:24] ah
[08:31:31] tx :)
[08:31:36] (CR) Banyek: [C: +2] mariadb: depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/481989 (https://phabricator.wikimedia.org/T212833) (owner: Banyek)
[08:32:24] I downtimed the host in icinga
[08:32:39] great
[08:32:41] (Merged) jenkins-bot: mariadb: depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/481989 (https://phabricator.wikimedia.org/T212833) (owner: Banyek)
[08:33:21] I sent mail to the dba team a little before you came online, would you m ind to send a followup so everyone knows they don't all have to come check the host too?
:-)
[08:33:35] or so they know what is left to do anyways
[08:35:24] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: depool es2019, host is unsresponsible - T212833 (duration: 00m 49s)
[08:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:26] T212833: es2019 is not responsive - https://phabricator.wikimedia.org/T212833
[08:35:49] !log depooled es2019 as host was unsresponsive - T212833
[08:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:11] yes, sure
[08:37:21] sweet
[08:42:45] (CR) jenkins-bot: mariadb: depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/481989 (https://phabricator.wikimedia.org/T212833) (owner: Banyek)
[08:46:44] !log rolling restart of proton to pick up OpenSSL update
[08:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:18] Operations, MediaWiki Language Extension Bundle, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.12; 2019-01-08), and 4 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (Nikerabbit) Adding #MLEB tag a...
[08:56:18] (CR) Vgutierrez: [C: +1] "Thx Moritz!"
[puppet] - https://gerrit.wikimedia.org/r/481902 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:08:02] RECOVERY - Host es2019 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[09:18:17] es2019 is back in action, the replication is resumed, and catched up
[09:18:21] I'll repool the host
[09:18:39] !log repooling es2019 - T212833
[09:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:42] T212833: es2019 is not responsive - https://phabricator.wikimedia.org/T212833
[09:20:40] (PS1) Banyek: Revert "mariadb: depool es2019" [mediawiki-config] - https://gerrit.wikimedia.org/r/481993
[09:20:56] (CR) Mathew.onipe: "Few comments" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/481857 (owner: Volans)
[09:22:10] (CR) Banyek: [C: +2] Revert "mariadb: depool es2019" [mediawiki-config] - https://gerrit.wikimedia.org/r/481993 (owner: Banyek)
[09:23:17] (Merged) jenkins-bot: Revert "mariadb: depool es2019" [mediawiki-config] - https://gerrit.wikimedia.org/r/481993 (owner: Banyek)
[09:23:51] (CR) Filippo Giunchedi: "> Patch Set 10:" [puppet] - https://gerrit.wikimedia.org/r/400241 (owner: Dzahn)
[09:26:06] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: repool es2019 - T212833 (duration: 01m 33s)
[09:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:09] T212833: es2019 is not responsive - https://phabricator.wikimedia.org/T212833
[09:26:13] (CR) Mathew.onipe: [C: +1] puppet: fix subprocess call to check_output() [software/spicerack] - https://gerrit.wikimedia.org/r/481856 (owner: Volans)
[09:29:31] (CR) Mathew.onipe: [C: +1] dns: include NXDOMAIN in the DnsNotFound exception [software/spicerack] - https://gerrit.wikimedia.org/r/481855 (owner: Volans)
[09:30:28] (CR) Mathew.onipe: [C: +1] admin_reason: fix default value for task [software/spicerack] -
https://gerrit.wikimedia.org/r/481854 (owner: Volans)
[09:34:26] (CR) jenkins-bot: Revert "mariadb: depool es2019" [mediawiki-config] - https://gerrit.wikimedia.org/r/481993 (owner: Banyek)
[09:39:46] !log installing nginx updates on puppetdb*
[09:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:47] (CR) Mathew.onipe: [C: -1] "Since this is a workaround, I think we should discard it since the issue has been fixed. We can find a better implementation to supress ou" [software/spicerack] - https://gerrit.wikimedia.org/r/481858 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[09:45:13] (PS1) Elukey: Add -R 200 to memcached on mc1023 [puppet] - https://gerrit.wikimedia.org/r/481996 (https://phabricator.wikimedia.org/T208844)
[09:45:37] (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::jobrunner: convert to use class httpd [puppet] - https://gerrit.wikimedia.org/r/475770 (owner: Giuseppe Lavagetto)
[09:45:49] (PS6) Giuseppe Lavagetto: profile::mediawiki::jobrunner: convert to use class httpd [puppet] - https://gerrit.wikimedia.org/r/475770
[09:45:56] (CR) Giuseppe Lavagetto: [C: +1] Add -R 200 to memcached on mc1023 [puppet] - https://gerrit.wikimedia.org/r/481996 (https://phabricator.wikimedia.org/T208844) (owner: Elukey)
[09:46:35] !log remove imagemagick remnants from ATS hosts (obsoleted by upstream packaging change which dropped the webp plugin)
[09:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:44] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14140/" [puppet] - https://gerrit.wikimedia.org/r/481996 (https://phabricator.wikimedia.org/T208844) (owner: Elukey)
[09:47:52] (PS2) Elukey: Add -R 200 to memcached on mc1023 [puppet] - https://gerrit.wikimedia.org/r/481996 (https://phabricator.wikimedia.org/T208844)
[09:51:19] !log restart memcached on mc1023 to apply -R 200 - T208844
[09:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:22] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844
[09:52:02] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (elukey)
[09:52:26] PROBLEM - DPKG on maps1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:56:58] Operations, Discovery-Search, monitoring, User-CDanis, User-fgiunchedi: Remove "prometheus" from elasticsearch grafana dashboard names - https://phabricator.wikimedia.org/T212839 (fgiunchedi)
[09:57:03] Operations, Discovery-Search, monitoring, User-CDanis, User-fgiunchedi: Remove "prometheus" from elasticsearch grafana dashboard names - https://phabricator.wikimedia.org/T212839 (fgiunchedi)
[09:58:47] maps1004 is me, fixed
[09:59:01] moritzm: Thanks!
[09:59:15] (CR) Volans: [C: +2] admin_reason: fix default value for task [software/spicerack] - https://gerrit.wikimedia.org/r/481854 (owner: Volans)
[09:59:42] RECOVERY - DPKG on maps1004 is OK: All packages OK
[10:04:44] (Merged) jenkins-bot: admin_reason: fix default value for task [software/spicerack] - https://gerrit.wikimedia.org/r/481854 (owner: Volans)
[10:07:29] (CR) jenkins-bot: admin_reason: fix default value for task [software/spicerack] - https://gerrit.wikimedia.org/r/481854 (owner: Volans)
[10:08:01] (CR) Volans: "replies inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/481857 (owner: Volans)
[10:08:17] (CR) Volans: [C: +2] dns: include NXDOMAIN in the DnsNotFound exception [software/spicerack] - https://gerrit.wikimedia.org/r/481855 (owner: Volans)
[10:09:52] (PS2) Muehlenhoff: Enable base::service_auto_restart for uwsgi-certcentral [puppet] - https://gerrit.wikimedia.org/r/481902 (https://phabricator.wikimedia.org/T135991)
[10:12:34] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for uwsgi-certcentral [puppet] - https://gerrit.wikimedia.org/r/481902 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:13:14] how did the host come back up, banyek?
[10:13:18] es2019 I mean
[10:13:32] I did a hardreset on that
[10:14:20] good one
[10:14:25] first I tried `reset /system1` and that was the one which doesn't worked, I checked wikitech, and found the 'racadm' command which worked
[10:14:38] excellent
[10:14:56] yeah hard power off usually gets it
[10:15:17] 'have you tried turn off and on again?'
[10:15:17] the db is ok? full recovery?
[10:15:22] *exactly*
[10:15:34] the db recovered, and the replication continued
[10:15:41] ok
[10:15:56] I'm lways suspicious of mysql recoveries but luckily I'm not a dba :-D
[10:17:30] InnoDB is sturdy in that form.
I mean we can misconfigure it and sacrifice stability on performance, but if the logfiles synced properly it should be recovered
[10:18:40] (CR) jerkins-bot: [V: -1] dns: include NXDOMAIN in the DnsNotFound exception [software/spicerack] - https://gerrit.wikimedia.org/r/481855 (owner: Volans)
[10:18:45] anyways I think it will worth of running a table check, to make sure there are no skeletons in the closet, but I am 99(.99)% sure there won't be any problems
[10:19:20] cool
[10:19:43] how does the table check work, if I can ask?
[10:20:33] (CR) Volans: [C: +2] dns: include NXDOMAIN in the DnsNotFound exception [software/spicerack] - https://gerrit.wikimedia.org/r/481855 (owner: Volans)
[10:21:16] we have a tool for that:
[10:21:17] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/wmfmariadbpy/+/refs/heads/master/wmfmariadbpy/compare.py
[10:21:43] ah that's the one
[10:21:51] I should bookmark it
[10:22:23] basically it takes the table which needed to be checked on two hosts, split them to chuncks, and runs crc checksum on the chunks, and compares the checksums
[10:22:51] !log installing ghostscript security updates on jessie
[10:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:55] neat stuff
[10:25:57] (Merged) jenkins-bot: dns: include NXDOMAIN in the DnsNotFound exception [software/spicerack] - https://gerrit.wikimedia.org/r/481855 (owner: Volans)
[10:26:13] (CR) Volans: [C: +2] puppet: fix subprocess call to check_output() [software/spicerack] - https://gerrit.wikimedia.org/r/481856 (owner: Volans)
[10:27:00] (CR) jenkins-bot: dns: include NXDOMAIN in the DnsNotFound exception [software/spicerack] - https://gerrit.wikimedia.org/r/481855 (owner: Volans)
[10:28:08] * apergos didn't know about the group_concat function
[10:31:43] (Merged) jenkins-bot: puppet: fix subprocess call to check_output() [software/spicerack] - https://gerrit.wikimedia.org/r/481856
(owner: Volans)
[10:32:43] (CR) jenkins-bot: puppet: fix subprocess call to check_output() [software/spicerack] - https://gerrit.wikimedia.org/r/481856 (owner: Volans)
[10:46:26] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: check health-check.php from nagios [puppet] - https://gerrit.wikimedia.org/r/481864 (owner: Giuseppe Lavagetto)
[10:47:34] (PS2) Giuseppe Lavagetto: jobrunner: check health-check.php from nagios [puppet] - https://gerrit.wikimedia.org/r/481864
[10:49:11] Operations, Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (fgiunchedi) I've tried experimenting how CI integration would look like for repositories that choose to enforce code formatting, e.g. with `tox` and for example https://pypi.or...
[10:50:36] (PS6) Jforrester: Add test-commons.wikimedia.org to prod_sites.pp [puppet] - https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616) (owner: Reedy)
[10:50:53] herron: Hey, as on-duty person do you know who I should speak to re. https://gerrit.wikimedia.org/r/c/operations/dns/+/481795 ?
[10:55:27] !log installing apache updates on puppetmasters
[10:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:36] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:02:02] <_joe_> !log manually reloading icinga to pick up changes to commands.cfg
[11:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:29] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[11:04:37] RECOVERY - puppet last run on logstash1007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[11:05:41] PROBLEM - IPMI Sensor Status on kafka1013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[11:07:13] PROBLEM - puppet last run on scb2005 is CRITICAL: CRITICAL: Puppet has 35 failures. Last run 2 minutes ago with 35 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh],File[/etc/profile.d/bash_autologout.sh],File[/etc/profile.d/field.sh],File[/usr/local/bin/gen_fingerprints]
[11:07:13] PROBLEM - puppet last run on ms-be2050 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown)
[11:07:15] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:07:29] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:07:36] mmmh a bit tooo many failures for puppet, what's happening?
[11:07:56] ah mori.tzm restarting apaches... sorry missed the !log
[11:08:05] <_joe_> yep
[11:08:13] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: Puppet has 65 failures. Last run 3 minutes ago with 65 failures. Failed resources (up to 3 shown): File[/home/crusnov],File[/home/cdanis],File[/home/fsero],File[/home/dr0ptp4kt]
[11:08:59] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 30 failures. Last run 4 minutes ago with 30 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats],File[/etc/rsyslog.d],File[/etc/profile.d/mysql-ps1.sh],File[/etc/profile.d/bash_autologout.sh]
[11:09:47] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 24 failures. Last run 5 minutes ago with 24 failures.
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/],File[/usr/local/bin/prometheus-intel-microcode],File[/usr/local/bin/apt-upgrade-activity],File[/usr/lib/nagios/plugins/check_sysctl] [11:10:15] (03PS2) 10Giuseppe Lavagetto: jobrunner: enable php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/481865 [11:10:47] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:10:51] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 6 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/home/wmde-leszek],File[/home/jhuneidi],File[/home/wmde-fisch],File[/home/phedenskog] [11:12:17] RECOVERY - puppet last run on ms-be2050 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:14:03] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:14:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14141/mw1300.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/481865 (owner: 10Giuseppe Lavagetto) [11:14:53] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:15:41] 10Operations, 10ops-eqiad, 10Analytics: kakfa1013 shows a failed PSU - https://phabricator.wikimedia.org/T212844 (10elukey) [11:15:55] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:17:25] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. 
Failed resources (up to 3 shown): File[/home/vgutierrez],File[/home/jiji],File[/home/cwhite],File[/home/banyek] [11:17:25] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:17:29] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:17:47] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Puppet has 24 failures. Last run 2 minutes ago with 24 failures. Failed resources (up to 3 shown) [11:18:08] uh? [11:18:23] vgutierrez: apache updates [11:18:25] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:18:31] PROBLEM - puppet last run on mw1328 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:18:34] ack :) [11:18:59] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:47] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 72 failures. Last run 4 minutes ago with 72 failures. Failed resources (up to 3 shown): File[/home/crusnov],File[/home/cdanis],File[/home/fsero],File[/home/dr0ptp4kt] [11:19:53] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 4 minutes ago with 15 failures. 
Failed resources (up to 3 shown): File[/usr/share/GeoIP],File[/etc/R/update-library.R],File[/etc/R/biocLite.R],File[/etc/modprobe.d/nf_conntrack.conf] [11:20:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:20:09] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:20:49] <_joe_> ok this is not good otoh [11:21:05] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:21:25] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:21:25] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:22:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:22:33] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:22:37] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:22:57] <_joe_> ema: I don't see many 5xx in the logs, yet avail numbers are bad? 
[11:22:59] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:23:45] RECOVERY - puppet last run on mw1328 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:11] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:24:57] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:25:05] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:25:09] <_joe_> I mostly see pdf generation timeouts [11:26:37] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:26:37] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:30:25] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:13] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-nonexistent.conf] [11:34:33] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:35:37] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:39:23] RECOVERY - puppet last run on mw1326 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:39:45] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:42:50] (03CR) 10Jforrester: [C: 03+1] "Will deploy this next week once we're out of the freeze. Ping me if I forget." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/481987 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [11:44:21] (03CR) 10Jforrester: "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [11:57:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: linuxbridge_agent: typo in libosinfo package name [puppet] - 10https://gerrit.wikimedia.org/r/482009 (https://phabricator.wikimedia.org/T212302) [11:58:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: linuxbridge_agent: typo in libosinfo package name [puppet] - 10https://gerrit.wikimedia.org/r/482009 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [12:18:09] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:19:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:20:33] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:20:39] mhhh looks like a spike of 500s from a single client, trying to get a pdf via the rest api [12:20:50] already recovered [12:21:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:22:09] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: mitaka: stretch: install python-dogpile.core from jessie [puppet] - 10https://gerrit.wikimedia.org/r/482013 (https://phabricator.wikimedia.org/T212302) [12:22:20] _joe_: I guess you'd not want to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/481796 until after the DNS patch is live, for verification? [12:22:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova: mitaka: stretch: install python-dogpile.core from jessie [puppet] - 10https://gerrit.wikimedia.org/r/482013 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [12:25:42] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova: mitaka: stretch: install python-dogpile.core from jessie [puppet] - 10https://gerrit.wikimedia.org/r/482013 (https://phabricator.wikimedia.org/T212302) [12:26:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: mitaka: stretch: install python-dogpile.core from jessie [puppet] - 10https://gerrit.wikimedia.org/r/482013 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [12:26:58] (03CR) 10Mathew.onipe: [C: 03+1] icinga: fix command_file property (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/481857 (owner: 10Volans) [12:27:53] !log restarting tor on torrelay1001 to pick up OpenSSL security update [12:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:48] !log depooling db1094 for schema change - T85757 [12:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:51] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [12:28:52] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481832 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [12:30:00] (03Merged) 
10jenkins-bot: mariadb: depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481832 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [12:33:03] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1094 for schema change - T85757 (duration: 00m 46s) [12:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:29] _joe_: strange that you didn't see many 503s, there's been a few spikes [12:33:37] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=1546513373724&to=1546518205520 [12:34:50] (03CR) 10Volans: [C: 03+2] icinga: fix command_file property [software/spicerack] - 10https://gerrit.wikimedia.org/r/481857 (owner: 10Volans) [12:35:08] affecting only eqiad, it seems [12:38:08] 10Operations, 10Discovery-Search, 10Elasticsearch: Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10Mathew.onipe) [12:39:02] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10Mathew.onipe) [12:39:19] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10Mathew.onipe) p:05Triage→03Normal [12:39:54] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): fix broken visualizations in Elasticsearch Node comparison dashboard - https://phabricator.wikimedia.org/T212831 (10Mathew.onipe) [12:40:26] (03Merged) 10jenkins-bot: icinga: fix command_file property [software/spicerack] - 10https://gerrit.wikimedia.org/r/481857 (owner: 10Volans) [12:40:59] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): fix broken visualizations in Elasticsearch Node comparison dashboard - https://phabricator.wikimedia.org/T212831 (10Mathew.onipe) 
a:03Mathew.onipe [12:41:19] (03CR) 10jenkins-bot: icinga: fix command_file property [software/spicerack] - 10https://gerrit.wikimedia.org/r/481857 (owner: 10Volans) [12:41:26] !log T212302 reimaging again cloudvirt1030 to test final puppet code [12:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:29] T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton - https://phabricator.wikimedia.org/T212302 [12:42:37] (03CR) 10jenkins-bot: mariadb: depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481832 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [12:44:07] mobrovac: any ideas? https://logstash.wikimedia.org/goto/31927bd3398b7eea29db8228720b4284 [12:44:55] <_joe_> James_F: yes, and I can help with both when I'm back later in the afternoon [12:46:39] _joe_: Awesome, thank you! [12:49:30] !log repooling db1094 after schema change - T85757 [12:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:33] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [12:50:03] (03PS1) 10Banyek: Revert "mariadb: depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482015 [12:52:07] (03PS2) 10ArielGlenn: Check for truncated file content in certain circumstances [dumps] - 10https://gerrit.wikimedia.org/r/481893 (https://phabricator.wikimedia.org/T212462) [12:52:15] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482015 (owner: 10Banyek) [12:53:19] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482015 (owner: 10Banyek) [12:54:33] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1094 after schema change - T85757 (duration: 00m 45s) [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:35] T85757: Dropping user.user_options on wmf 
databases - https://phabricator.wikimedia.org/T85757 [12:55:11] !log depooling db1098:3317 for schema change - T85757 [12:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:26] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481837 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [12:55:46] (03CR) 10jenkins-bot: Revert "mariadb: depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482015 (owner: 10Banyek) [12:56:27] (03Merged) 10jenkins-bot: mariadb: depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481837 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [12:56:41] (03CR) 10jenkins-bot: mariadb: depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481837 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [12:58:39] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1098:3317 for schema change - T85757 (duration: 00m 45s) [12:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] (03PS10) 10Mathew.onipe: cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [13:01:40] (03PS1) 10Elukey: Remove two Analytics Hadoop worker nodes for decom [puppet] - 10https://gerrit.wikimedia.org/r/482016 (https://phabricator.wikimedia.org/T209929) [13:12:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:14:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:17:26] (03CR) 10Jforrester: [C: 03+1] Make password policy code saner (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 (owner: 10Gergő Tisza) [13:19:44] Doing small deploy in cxserver.. [13:20:13] !log kartik@deploy1001 Started deploy [cxserver/deploy@3b2ede7]: Update cxserver to 2369a18 [13:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:42] !log kartik@deploy1001 Finished deploy [cxserver/deploy@3b2ede7]: Update cxserver to 2369a18 (duration: 04m 30s) [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:16] !log repooling db1098:3317 after schema change - T85757 [13:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:19] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [13:34:35] (03PS1) 10Banyek: Revert "mariadb: depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482017 [13:35:59] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482017 (owner: 10Banyek) [13:37:05] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482017 (owner: 10Banyek) [13:38:28] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1098:3317 after schema change - T85757 (duration: 00m 44s) [13:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:04] !log depooling db1101:3317 for schema change - T85757 [13:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [13:41:23] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1101:3317 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/481839 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:42:25] (03Merged) 10jenkins-bot: mariadb: depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481839 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:42:27] (03PS1) 10Volans: phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) [13:43:41] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1101:3317 for schema change - T85757 (duration: 00m 44s) [13:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:10] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:48:00] (03CR) 10jenkins-bot: Revert "mariadb: depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482017 (owner: 10Banyek) [13:48:02] (03CR) 10jenkins-bot: mariadb: depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481839 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:49:07] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:50:50] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:51:08] (03PS1) 10Jforrester: Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) [13:51:55] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 
10Jforrester) [13:51:55] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Trizek-WMF) It that task still not ready to announce to the users? [13:57:03] (03PS3) 10Bmansurov: Recommendation API: increase mysql connection limit for service [puppet] - 10https://gerrit.wikimedia.org/r/481871 (https://phabricator.wikimedia.org/T205294) [13:58:31] (03PS4) 10Bmansurov: Recommendation API: increase mysql connection limit for service [puppet] - 10https://gerrit.wikimedia.org/r/481871 (https://phabricator.wikimedia.org/T205294) [13:58:57] (03PS2) 10Volans: phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) [14:00:38] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:01:03] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1029: introduce it to the openstack eqiad1 deployment [puppet] - 10https://gerrit.wikimedia.org/r/482022 (https://phabricator.wikimedia.org/T209616) [14:01:44] (03PS3) 10Volans: phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) [14:02:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1029: introduce it to the openstack eqiad1 deployment [puppet] - 10https://gerrit.wikimedia.org/r/482022 (https://phabricator.wikimedia.org/T209616) (owner: 10Arturo Borrero Gonzalez) [14:04:36] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1029.eqiad.wmnet'] ` The log can be f... 
[14:05:27] !log T209616 reimage cloudvirt1029 as debian stretch [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:30] T209616: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 [14:07:45] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) @Trizek-WMF service is ready, but it doesn't handle the production traffic yet. We're planning to replace ElectronPDF with chromium-rende... [14:07:48] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) [14:09:06] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Trizek-WMF) So no need to announce anything for now, since it is not going to impact anyone? [14:10:29] (03CR) 10DCausse: [C: 03+1] cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [14:14:11] !log rebooting kubernetes masters in codfw to pick up SSBD-enabled qemu [14:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:46] (03PS1) 10Ema: WIP: cache: flag to use ATS as local backend [puppet] - 10https://gerrit.wikimedia.org/r/482024 [14:21:36] !log rebooting kubernetes masters in eqiad to pick up SSBD-enabled qemu [14:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:41] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) This should have no impact on anyone. 
We're experiencing stability issues with ElectronPDF, that's why we want to replace it with a new s... [14:26:59] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.12; 2019-01-08), and 4 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Very interesting: I ca... [14:27:05] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1029.eqiad.wmnet'] ` and were **ALL** successful. [14:29:08] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1029: introduce hiera overrides for new iface names [puppet] - 10https://gerrit.wikimedia.org/r/482025 (https://phabricator.wikimedia.org/T209616) [14:29:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1029: introduce hiera overrides for new iface names [puppet] - 10https://gerrit.wikimedia.org/r/482025 (https://phabricator.wikimedia.org/T209616) (owner: 10Arturo Borrero Gonzalez) [14:31:19] (03PS1) 10Muehlenhoff: Add library hint for libseccomp [puppet] - 10https://gerrit.wikimedia.org/r/482027 [14:31:54] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) [14:32:38] !log repooling db1101:3317 after schema change - T85757 [14:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:41] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [14:32:56] (03PS1) 10Banyek: Revert "mariadb: depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482028 [14:35:06] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482028 (owner: 10Banyek) [14:36:11] 
(03Merged) 10jenkins-bot: Revert "mariadb: depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482028 (owner: 10Banyek) [14:37:43] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1101:3317 after schema change - T85757 (duration: 00m 44s) [14:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:46] !log rebooting kubernetes workers in codfw for kernel security update [14:37:50] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [14:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] (03CR) 10jenkins-bot: Revert "mariadb: depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482028 (owner: 10Banyek) [14:40:30] (03PS2) 10Ema: WIP: cache: flag to use ATS as local backend [puppet] - 10https://gerrit.wikimedia.org/r/482024 [14:45:36] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:46:16] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:48:26] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:50:04] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:53:30] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:59:04] (03CR) 10Effie Mouzeli: [C: 03+2] Add test-commons.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/481795 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [14:59:23] (03PS3) 10Effie Mouzeli: Add test-commons.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/481795 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [15:04:18] (03PS1) 10Andrew Bogott: Prepare new/empty cloudvirts for Stretch/Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/482033 (https://phabricator.wikimedia.org/T209616) [15:04:21] (03PS2) 10Jforrester: Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) [15:04:41] (03CR) 10Jforrester: Initial configuration for test-commons.wikimedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 10Jforrester) [15:06:18] PROBLEM - IPMI Sensor Status on an-worker1078 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:06:20] PROBLEM - IPMI Sensor Status on an-worker1079 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:06:32] nice [15:06:38] two of the new worker nodes [15:06:55] checking console [15:07:18] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [15:09:38] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [15:13:34] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 
3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Eevans) Is there any status update, or ETA on this? [15:14:20] 10Operations, 10ops-eqiad, 10Analytics: PSU broken on two Analytics Hadoop Workers - https://phabricator.wikimedia.org/T212861 (10elukey) p:05Triage→03High [15:17:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.35 seconds [15:17:33] 10Operations, 10ops-eqiad, 10Analytics: PSU broken on two Analytics Hadoop Workers - https://phabricator.wikimedia.org/T212861 (10fgiunchedi) Judging by icinga there's a few other hosts with PS alerts, all in A2. I suspect it has to do with one of the rack PDU themselves ` cloudelastic1001 db1082 db1107 ms-... [15:18:14] PROBLEM - IPMI Sensor Status on ms-be1045 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:20:10] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10elukey) [15:20:25] 10Operations, 10ops-eqiad, 10Analytics: kakfa1013 shows a failed PSU - https://phabricator.wikimedia.org/T212844 (10elukey) [15:20:27] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10elukey) [15:20:45] godog: thanks! --^ [15:20:56] this morning kafka1013 alarmed as well, in A2 too [15:21:16] I was checking icinga as well and the two ms-be alerts were indeed suspicious [15:22:17] elukey: np! 
seemed a little suspicious when I looked at the other hosts with ps failures [15:22:58] (03PS1) 10ArielGlenn: Fix a long-standing bug that allowed some incomplete output files to sneak in [dumps] - 10https://gerrit.wikimedia.org/r/482042 (https://phabricator.wikimedia.org/T212462) [15:28:54] PROBLEM - IPMI Sensor Status on ms-be1044 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:29:06] PROBLEM - IPMI Sensor Status on db1082 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] [15:30:22] PROBLEM - IPMI Sensor Status on db1107 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:31:01] !log Disabling puppet on mw servers to test 481796 - T197616 [15:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] T197616: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 [15:31:23] (03CR) 10Effie Mouzeli: [C: 03+2] Add test-commons.wikimedia.org to prod_sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [15:31:37] (03PS7) 10Effie Mouzeli: Add test-commons.wikimedia.org to prod_sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [15:34:42] (03PS1) 10Fsero: Revert "Narrow down ferm etcd allow_from" [puppet] - 10https://gerrit.wikimedia.org/r/482046 [15:35:01] (03PS2) 10Fsero: Revert "Narrow down ferm etcd allow_from" [puppet] - 10https://gerrit.wikimedia.org/r/482046 [15:35:32] (03CR) 10jerkins-bot: [V: 04-1] Revert "Narrow down ferm etcd allow_from" [puppet] - 10https://gerrit.wikimedia.org/r/482046 (owner: 10Fsero) [15:37:54] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.40 seconds 
[15:39:11] (03PS3) 10Fsero: Revert "Narrow down ferm etcd allow_from" [puppet] - 10https://gerrit.wikimedia.org/r/482046 [15:40:15] (03CR) 10Fsero: [C: 03+2] Revert "Narrow down ferm etcd allow_from" [puppet] - 10https://gerrit.wikimedia.org/r/482046 (owner: 10Fsero) [15:42:30] cmjohnson1: you about? T212861 would need a look just in case it could be more widespread than a single rack [15:42:30] T212861: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 [15:43:06] (03PS2) 10Arturo Borrero Gonzalez: Prepare new/empty cloudvirts for Stretch/Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/482033 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [15:43:09] (03PS1) 10Jforrester: Move testcommonswiki from group2 to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482047 [15:43:11] (03PS1) 10Jforrester: Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 [15:43:19] !log Enabled puppet on mw servers after merging 481796 - T197616 [15:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:21] T197616: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 [15:43:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Prepare new/empty cloudvirts for Stretch/Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/482033 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [15:44:54] (03PS5) 10Bmansurov: Recommendation API: increase mysql connection limit for service [puppet] - 10https://gerrit.wikimedia.org/r/481871 (https://phabricator.wikimedia.org/T205294) [15:48:43] !log restart parsoid on wtp1025 to pick up OpenSSL update for nodejs [15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:49:08] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:51:09] (03PS1) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [15:51:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:51:56] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [15:53:27] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: add roles for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482052 (https://phabricator.wikimedia.org/T209616) [15:54:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvps: add roles for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482052 (https://phabricator.wikimedia.org/T209616) (owner: 10Arturo Borrero Gonzalez) [15:57:30] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet are marked down but pooled [15:58:44] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [16:00:53] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1026.eqiad.wmnet', 'cloudvirt1027.eqi... 
[16:01:47] (03PS2) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [16:02:39] !log reimaging cloudvirt1013 cloudvirt1026-1028 to stretch [16:02:39] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:01] 10Operations, 10Operations-Software-Development, 10Kubernetes: Create Spicerack cook book to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (10MoritzMuehlenhoff) [16:04:04] (03PS3) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [16:04:56] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:09:02] (03PS4) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [16:09:58] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:14:48] (03PS2) 10Andrew Bogott: toolforge: Redirect tools-static to https [puppet] - 10https://gerrit.wikimedia.org/r/481979 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:15:41] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Redirect tools-static to https [puppet] - 10https://gerrit.wikimedia.org/r/481979 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:22:24] !log rebooting kubernetes workers in eqiad for kernel security update [16:22:25] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:31] (03PS1) 10BryanDavis: toolforge: add missing ; in tools-static nginx config [puppet] - 10https://gerrit.wikimedia.org/r/482069 [16:24:45] andrewbogott: ^ [16:25:06] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: add missing ; in tools-static nginx config [puppet] - 10https://gerrit.wikimedia.org/r/482069 (owner: 10BryanDavis) [16:27:20] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) For the record, cloudvirt1027.eqiad.wmnet reports being a `Dell PowerEdge R640` (instead of R630) [16:32:56] !log remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad - T207663 [16:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:58] T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 [16:35:17] (03PS5) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [16:36:12] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:36:36] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) [16:39:22] (03PS1) 10Ayounsi: Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/482071 (https://phabricator.wikimedia.org/T207663) [16:41:16] (03PS1) 10Andrew Bogott: cloudvirt1013: rename neutron nics [puppet] - 10https://gerrit.wikimedia.org/r/482072 [16:42:20] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1013: rename neutron nics [puppet] - 10https://gerrit.wikimedia.org/r/482072 (owner: 10Andrew Bogott) [16:42:30] 
10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) I'll be attending the meeting that @Krinkle mentions. But in the meantime, I'd like to offer an explanation for what Ja... [16:43:43] (03PS2) 10Ayounsi: Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/482071 (https://phabricator.wikimedia.org/T207663) [16:43:44] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) Thanks everyone for comments so far. This ticket in its current state is definitely not a ready RFC, you're right. We're goin... [16:45:04] (03CR) 10Ayounsi: [C: 03+2] Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/482071 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [16:46:01] (03PS6) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [16:46:02] (03PS4) 10Cwhite: hiera: add puppetboard and puppetdb to puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/479772 (https://phabricator.wikimedia.org/T210486) [16:46:35] PROBLEM - HHVM rendering on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:50] (03CR) 10Cwhite: [C: 03+2] hiera: add puppetboard and puppetdb to puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/479772 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:46:52] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:47:27] RECOVERY - HHVM rendering on mw2258 is OK: HTTP OK: HTTP/1.1 200 OK - 75492 bytes in 
0.298 second response time [16:48:17] andrewbogott: cloudvirt1013 puppet changes ready to merge? [16:48:39] shdubsh: yes, sorry, I had a terminal with a 'yes' waiting for me to hit return [16:48:41] which I just did [16:48:56] cool, thanks :) [16:51:23] (03PS1) 10Andrew Bogott: cloudvirt1013: remove stretch installer overrides [puppet] - 10https://gerrit.wikimedia.org/r/482075 [16:51:30] (03CR) 10BryanDavis: [C: 04-1] toolforge: Redirect GET & HEAD to https (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:52:18] (03CR) 10Vgutierrez: [C: 03+1] cloudvirt1013: remove stretch installer overrides [puppet] - 10https://gerrit.wikimedia.org/r/482075 (owner: 10Andrew Bogott) [16:52:38] (03PS1) 10Ayounsi: Revert "Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad" [dns] - 10https://gerrit.wikimedia.org/r/482076 [16:52:48] (03PS2) 10Andrew Bogott: cloudvirt1013: remove stretch installer overrides [puppet] - 10https://gerrit.wikimedia.org/r/482075 [16:53:15] (03PS3) 10BryanDavis: toolforge: Redirect GET & HEAD to https [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) [16:53:30] (03CR) 10Ayounsi: [C: 03+2] Revert "Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad" [dns] - 10https://gerrit.wikimedia.org/r/482076 (owner: 10Ayounsi) [16:53:36] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1013: remove stretch installer overrides [puppet] - 10https://gerrit.wikimedia.org/r/482075 (owner: 10Andrew Bogott) [16:53:59] (03PS4) 10Cwhite: hiera: add management cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) [16:54:54] (03CR) 10Cwhite: [C: 03+2] hiera: add management cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480664 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:56:01] (03PS7) 10GTirloni: wmcs::nfs::misc - Refactor into 
profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [16:58:09] (03PS1) 10Ayounsi: Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/482077 (https://phabricator.wikimedia.org/T207663) [16:59:14] 10Operations, 10Wikimedia-Mailing-lists, 10Security: Mass unsubscribe of legitimate recipients from wiki-research-l mailing list - https://phabricator.wikimedia.org/T212234 (10sbassett) [17:01:15] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1027.eqiad.wmnet', 'cloudvirt1028.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvi... [17:02:52] (03PS8) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [17:03:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/482077 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [17:03:30] (03PS1) 10Andrew Bogott: Stretch cloudvirts: remove Stretch installer overrides [puppet] - 10https://gerrit.wikimedia.org/r/482079 (https://phabricator.wikimedia.org/T209616) [17:04:08] (03CR) 10Ayounsi: [C: 03+2] Remove old 10.64.22.0/24 IPs from cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/482077 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [17:04:36] (03CR) 10Andrew Bogott: [C: 03+2] Stretch cloudvirts: remove Stretch installer overrides [puppet] - 10https://gerrit.wikimedia.org/r/482079 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [17:05:11] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10ayounsi) [17:05:15] 10Operations, 10netops, 10Patch-For-Review, 
10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) 05Open→03Resolved All cleaned up! [17:08:03] 10Operations, 10Discovery-Search, 10monitoring, 10User-CDanis, 10User-fgiunchedi: Remove "prometheus" from elasticsearch grafana dashboard names - https://phabricator.wikimedia.org/T212839 (10EBjune) Should be fine to rename, adding @Mathew.onipe who is currently working on dashboard issues to validate [17:09:06] (03PS9) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [17:15:28] (03PS10) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [17:17:35] 10Operations, 10CirrusSearch, 10Discovery-Search, 10serviceops: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10EBernhardson) >>! In T210717#4827144, @Joe wrote: > I see another problem here: > > say we do what makes sense and make MediaWiki con... [17:20:34] (03CR) 10CRusnov: [C: 03+1] "LGTM. Tests out locally also. Minor comment/q inline." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/481914 (owner: 10Volans) [17:20:54] 10Operations, 10CirrusSearch, 10Discovery-Search, 10serviceops: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10EBernhardson) I suppose more generally, the previous would be cleaner if we had some way to "tag" servers and check those server tags... 
[17:22:29] (03CR) 10GTirloni: "This only configures a standalone server like labstore1003 (so it's missing bdsync stuff) but I would appreciate your comments on the reor" [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [17:27:10] 10Operations, 10Wikimedia-Mailing-lists, 10Security: Mass unsubscribe of legitimate recipients from wiki-research-l mailing list - https://phabricator.wikimedia.org/T212234 (10Krenair) Should this ticket really be set as public given the big list of subscribed addresses? [17:28:14] 10Operations, 10Wikimedia-Mailing-lists, 10Security: Mass unsubscribe of legitimate recipients from wiki-research-l mailing list - https://phabricator.wikimedia.org/T212234 (10sbassett) Was debating that. I can make it private again for now. [17:28:47] (03PS18) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [17:28:49] (03PS18) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [17:28:51] (03PS20) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [17:28:53] (03PS1) 10DCausse: [cirrus] re-enable HHVM connection pooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482082 (https://phabricator.wikimedia.org/T212768) [17:31:13] 10Operations, 10cloud-services-team: cloudvirt1027 and 1028 won't PXE boot - https://phabricator.wikimedia.org/T212874 (10Andrew) p:05Triage→03High [17:32:25] 10Operations, 10cloud-services-team (Kanban): cloudvirt1027 and 1028 won't PXE boot - https://phabricator.wikimedia.org/T212874 (10aborrero) [17:45:15] (03PS2) 10Muehlenhoff: Add library hint for libseccomp [puppet] - 10https://gerrit.wikimedia.org/r/482027 [17:46:06] (03CR) 10Muehlenhoff: [C: 03+2] 
Add library hint for libseccomp [puppet] - 10https://gerrit.wikimedia.org/r/482027 (owner: 10Muehlenhoff) [17:57:22] (03CR) 10ArielGlenn: [C: 03+2] Fix a long-standing bug that allowed some incomplete output files to sneak in [dumps] - 10https://gerrit.wikimedia.org/r/482042 (https://phabricator.wikimedia.org/T212462) (owner: 10ArielGlenn) [17:58:23] !log ariel@deploy1001 Started deploy [dumps/dumps@10dc8ad]: return properly if commands failed [17:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:32] !log ariel@deploy1001 Finished deploy [dumps/dumps@10dc8ad]: return properly if commands failed (duration: 00m 08s) [17:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:29] 10Operations, 10Analytics: notebook server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10Milimetric) p:05Triage→03High [18:05:39] 10Operations, 10Discovery-Search, 10monitoring, 10User-CDanis, 10User-fgiunchedi: Remove "prometheus" from elasticsearch grafana dashboard names - https://phabricator.wikimedia.org/T212839 (10EBjune) p:05Triage→03Normal [18:10:17] 10Operations, 10Analytics: notebook server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10Milimetric) p:05High→03Normal A proper fix is to manage resources through containerization (kubernetes), so marking low priority for now as other solutions we could think of are a little hacky. 
[18:11:10] (03PS1) 10Andrew Bogott: Enable notifications on some new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482093 [18:11:12] (03PS1) 10Andrew Bogott: nova: adjust scheduling pool for new hardware [puppet] - 10https://gerrit.wikimedia.org/r/482094 [18:12:11] 10Operations, 10cloud-services-team (Kanban): cloudvirt1027 and 1028 won't PXE boot - https://phabricator.wikimedia.org/T212874 (10Dzahn) cloudvirt1027 has multiple NICs and multiple MACs per NIC, like so: The third NIC is being used and it has these: ` NIC.Integrated.1-3-1 Ethernet = D0:9... [18:15:22] (03PS1) 10Jforrester: Disable ZeroBanner and ZeroPortal on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482097 (https://phabricator.wikimedia.org/T212864) [18:15:24] (03PS1) 10Jforrester: Re-write mobilelanding.php to not break when we drop ZeroBanner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482098 (https://phabricator.wikimedia.org/T212865) [18:15:26] (03PS1) 10Jforrester: Drop the Wikipedia Zero debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482099 (https://phabricator.wikimedia.org/T212865) [18:15:28] (03PS1) 10Jforrester: robots.php: Drop the special treatment for Wikipedia Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) [18:15:30] (03PS1) 10Jforrester: zerowiki: Stop whitelisting ZeroPortal to logged out users, no longer available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482101 (https://phabricator.wikimedia.org/T212865) [18:15:34] (03PS1) 10Jforrester: Drop ZeroBanner and ZeroPortal from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482102 (https://phabricator.wikimedia.org/T212865) [18:15:38] (03PS1) 10Jforrester: Stop configuring ZeroBanner and ZeroPortal, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482103 (https://phabricator.wikimedia.org/T212865) [18:15:41] (03PS1) 10Jforrester: Stop loading i18n for ZeroBanner and 
ZeroPortal, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482104 (https://phabricator.wikimedia.org/T212865) [18:15:58] 10Operations, 10Analytics: notebook server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) A couple of things that we discussed with the team: * this is the same problem that happens on stat machines, sometimes users are not conservative in their usage of those hosts consuming... [18:16:07] 10Operations, 10Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) [18:19:11] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:34] 10Operations, 10Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) I am pretty ignorant about it, but would cgroups fit in this use case? @MoritzMuehlenhoff ? [18:21:43] !log restart pdfrender on scb1003 [18:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:07] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [18:26:47] 10Operations, 10cloud-services-team (Kanban): cloudvirt1027 and 1028 won't PXE boot - https://phabricator.wikimedia.org/T212874 (10Dzahn) The fix should be: "D0:94:66:62:A8:D8" (FIP) -> D0:94:66:62:A8:D7 (Ethernet) for cloudvirt1027 and accordingly for the other servers. 
[18:27:01] (03PS1) 10Andrew Bogott: cloudvirt1026: enable alerting [puppet] - 10https://gerrit.wikimedia.org/r/482107 [18:27:46] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1026: enable alerting [puppet] - 10https://gerrit.wikimedia.org/r/482107 (owner: 10Andrew Bogott) [18:28:16] (03PS1) 10Cwhite: hiera: add certcentral cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/482108 (https://phabricator.wikimedia.org/T210486) [18:29:12] (03PS1) 10Andrew Bogott: nova: add cloudvirt1026 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/482109 [18:30:07] (03CR) 10Andrew Bogott: [C: 03+2] nova: add cloudvirt1026 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/482109 (owner: 10Andrew Bogott) [18:31:28] (03PS2) 10Andrew Bogott: Enable notifications on some new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482093 [18:31:42] (03PS3) 10Dzahn: delete systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:32:03] (03PS4) 10Dzahn: systemd::sidekick: replace base_service::unit comment with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:32:40] (03PS5) 10Dzahn: delete systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:33:01] (03PS6) 10Dzahn: delete systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:33:13] (03PS3) 10Andrew Bogott: Enable notifications on some new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482093 [18:33:15] (03PS2) 10Andrew Bogott: nova: adjust scheduling pool for new hardware [puppet] - 10https://gerrit.wikimedia.org/r/482094 [18:33:46] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@c470ed2]: Update mobileapps to f6ad0e5: Set timeout for backend /page/html requests [18:33:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:06] (03PS7) 10Dzahn: delete systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:34:51] (03PS1) 10Herron: mail::smarthost: exim4: replace '-' and '.' in cert names with '_' [puppet] - 10https://gerrit.wikimedia.org/r/482113 [18:35:31] (03CR) 10Dzahn: [C: 03+2] "per https://phabricator.wikimedia.org/T194724#4850882" [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:35:44] (03PS8) 10Dzahn: delete systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:37:58] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@c470ed2]: Update mobileapps to f6ad0e5: Set timeout for backend /page/html requests (duration: 04m 11s) [18:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:04] (03PS2) 10Herron: mail::smarthost: exim4: replace '-' and '.' in cert names with '_' [puppet] - 10https://gerrit.wikimedia.org/r/482113 (https://phabricator.wikimedia.org/T212736) [18:40:22] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [18:41:41] (03CR) 10Alex Monk: [C: 03+1] mail::smarthost: exim4: replace '-' and '.' in cert names with '_' [puppet] - 10https://gerrit.wikimedia.org/r/482113 (https://phabricator.wikimedia.org/T212736) (owner: 10Herron) [18:41:44] (03PS3) 10Herron: mail::smarthost: exim4: replace '-' and '.' in cert names with '_' [puppet] - 10https://gerrit.wikimedia.org/r/482113 (https://phabricator.wikimedia.org/T212736) [18:43:03] (03CR) 10Herron: [C: 03+2] mail::smarthost: exim4: replace '-' and '.' 
in cert names with '_' [puppet] - 10https://gerrit.wikimedia.org/r/482113 (https://phabricator.wikimedia.org/T212736) (owner: 10Herron) [18:45:01] (03PS1) 10Dzahn: librenms: delete unused upstart initscript [puppet] - 10https://gerrit.wikimedia.org/r/482116 (https://phabricator.wikimedia.org/T194724) [18:46:22] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@1182b3b]: Update mobileapps to f6ad0e5: Set timeout for backend /page/html requests, part 2 [18:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:24] (03PS2) 10Dzahn: librenms: delete unused upstart initscript [puppet] - 10https://gerrit.wikimedia.org/r/482116 (https://phabricator.wikimedia.org/T194724) [18:48:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14151/" [puppet] - 10https://gerrit.wikimedia.org/r/482116 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:51:16] (03PS6) 10MarcoAurelio: Initial configuration for hyw.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) [18:51:49] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@1182b3b]: Update mobileapps to f6ad0e5: Set timeout for backend /page/html requests, part 2 (duration: 05m 27s) [18:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:12] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 (10Papaul) [18:55:28] (03CR) 10Dzahn: [C: 04-1] "there is no more systemd::service_unit, it's systemd::unit" [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:56:12] (03CR) 10Reedy: Initial configuration for test-commons.wikimedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 10Jforrester) [19:00:16] 
(03PS3) 10Gergő Tisza: Make password policy code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 [19:00:35] (03CR) 10Gergő Tisza: Make password policy code saner (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 (owner: 10Gergő Tisza) [19:02:32] (03CR) 10Jforrester: [C: 03+1] Make password policy code saner (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 (owner: 10Gergő Tisza) [19:02:59] (03PS1) 10Dzahn: k8s::flannel: remove upstart, use systemd::service instead [puppet] - 10https://gerrit.wikimedia.org/r/482118 (https://phabricator.wikimedia.org/T194724) [19:10:41] Reedy: If we put it in s3 we'll have to add it as a manual exemption in one of the calculated dblists. Happy to do that if you are OK to deploy. :-) [19:11:02] I don't mind doing the actual creation [19:13:35] you can take, if you wish, the other pending wiki creations [19:13:38] I'm guessing s4 isn't overloaded... And it's probably more useful in terms of checking perf in comparison to commons [19:14:15] How many others are waiting? [19:14:25] Are they DNS/apached etc if necessary? [19:14:50] Reedy: 2 iirc [19:15:01] napwikisource and hywwiki [19:15:09] i added hyw to DNS yesterday, go ahead [19:15:10] Reedy: Changed to use s3. [19:15:29] (03PS3) 10Jforrester: Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) [19:15:37] hywwiki is still missing in interwikisortorders mutante / Reedy [19:15:38] that's Western Armenian.. they went through the entire process to even get an ISO language code.. 
and got it [19:15:43] I'm not sure how to handle that [19:15:46] Hauskatze: That file is a mess [19:15:56] So don't worry about that too much [19:16:16] jouncebot: now [19:16:16] No deployments scheduled for the next 87 hour(s) and 13 minute(s) [19:16:18] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 10Jforrester) [19:16:19] jouncebot: next [19:16:19] In 87 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T1030) [19:16:37] It's a no-deploy week, so definitely clear. ;-) [19:16:38] adding new languages now just means merging langlist.tmpl , no more manual commands to run after that, thanks to Traffic team :) [19:16:51] hyw was the first one where i just had to merge and was done [19:17:02] authdns-update no longer needed? [19:17:58] note that I'm not sure if everything is ready for hywwiki, for napwikisource all pre-install should be done iirc [19:18:22] hywwiki request is very new and would benefit from more eyes [19:18:36] Maybe do the other new ones next week? [19:18:46] Doing the rest in one go isn't much work [19:18:54] Sure. [19:18:57] Biggest hurdle is getting addWiki working xD [19:19:05] It's you doing the hard work, so it's your call. :-) [19:19:15] Shall we just jfdi with test-commons on s3? [19:19:21] creating soon to be dead wikis can wait o.t.o.h [19:19:29] Reedy: yeah. [19:19:36] Hauskatze: that is always needed to merge any DNS change but what is not needed anymore is all the "authdns-gen-zones" on each DNS server https://phabricator.wikimedia.org/T97051#1994679 [19:20:00] mutante: less work for ops then :) [19:20:47] yea, and automated feels safer than "manually" re-generating all the zones [19:21:07] Reedy: Umm. I have no idea how what I did broke the tests. 
[19:21:35] (03PS11) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [19:21:44] Oh, right, I'm doing dblist fiddling but testcommonswiki isn't a dblist. [19:21:47] Meh. [19:21:55] heh [19:22:01] * James_F sighs. [19:22:07] Reedy: Shall we JFDI with s4? [19:22:13] If it's easier, yeah [19:22:16] WFM and all that [19:22:35] er why s4? [19:22:44] commons and commons [19:22:56] apergos: Some code treats "is s4" as an alias for "is Commons". [19:23:05] that's... notgood [19:23:08] (03PS4) 10Jforrester: Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) [19:23:08] no. [19:23:20] Notgood, also known as mw-config. [19:23:28] oh happy happy 2019 [19:23:51] anyways I am butting right back out before I get nerd-sniped into anything [19:23:54] "Properly" I should create filerepo.dblist or something, but testCommons is to be killed in a few months' time, so… [19:24:06] apergos: boring :P [19:24:12] Reedy: Let's rock? [19:24:17] yeah at 21:24 i am entitled to be boring [19:24:19] tah :-P [19:26:00] (03CR) 10Reedy: [C: 03+2] Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 10Jforrester) [19:26:46] * Hauskatze reviews napwikisource [19:27:14] (03Merged) 10jenkins-bot: Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 10Jforrester) [19:27:55] (03PS1) 10Papaul: DNS: Remove mgmt DNS entries for restbase200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/482121 (https://phabricator.wikimedia.org/T211070) [19:29:03] Initialising external storage cluster24... 
[19:29:03] [6f30182d25a2d432f75d027c] [no req] Wikimedia\Rdbms\DBReadOnlyError from line 1163 of /srv/mediawiki/php-1.33.0-wmf.9/includes/libs/rdbms/database/Database.php: Database is read-only: The database has been automatically locked until the replica database servers become available [19:29:03] rofl [19:29:47] Ha. [19:29:55] xD [19:30:00] Well, I guess the brand new tables are indeed not on replicas yet. [19:30:16] napwikisource is missing the restbase/parsoid stuff.- I guess it can be added later? [19:30:23] They can [19:30:31] I'll patch restbase [19:30:58] parsoid is easier when the wiki is created (sitematrix) [19:32:00] (03CR) 10jenkins-bot: Initial configuration for test-commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482019 (https://phabricator.wikimedia.org/T197616) (owner: 10Jforrester) [19:34:33] yep, the guide says parsoid only afterwards [19:34:42] restbase says ping mobrovac :) [19:35:15] They're pretty proactive when the patches are up [19:36:04] somebody could explain to me why we have to restbase repos, one with /deploy and the other without /deploy? [19:36:49] not me [19:36:51] Hauskatze: Production servers don't do out-bound contact to install npm. [19:37:19] Hauskatze: So we download all the npm stuff on a dev's machine, push it in a commit into to /deploy repo, and use it in a known-good place from there. [19:37:38] Hauskatze: But "real" development happens in the normal repo with the usual, floating npm modules. [19:38:13] James_F: the wikitech:Add_a_wiki page says to commit RESTBase stuff to /deploy [19:38:29] that's why I'm wondering [19:38:37] I've not checked if contents match though [19:38:39] Yeah, WMF-specific config also goes in /deploy [19:38:45] aha [19:40:34] Reedy: Anything I can do to help? [19:45:06] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 972.17 seconds [19:45:07] James_F: Hacked some stuff... 
[19:45:19] Uh [19:45:23] I wonder if that's what was up [19:45:31] Oops. [19:45:36] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1003.85 seconds [19:45:46] That’s almost certainly us, yes. [19:45:56] volans: ^ this one might be [19:45:58] S4 and X1 but no others? [19:47:00] (03PS4) 10Andrew Bogott: Enable notifications on some new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482093 [19:47:10] (03PS3) 10Andrew Bogott: nova: adjust scheduling pool for new hardware [puppet] - 10https://gerrit.wikimedia.org/r/482094 [19:47:28] Reedy: looking [19:47:47] (03CR) 10Andrew Bogott: [C: 03+2] Enable notifications on some new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/482093 (owner: 10Andrew Bogott) [19:48:45] Hauskatze: you can feel free to assign those directly to me (add new wikis to stats).. why stalled btw? [19:48:49] (03CR) 10Andrew Bogott: [C: 03+2] nova: adjust scheduling pool for new hardware [puppet] - 10https://gerrit.wikimedia.org/r/482094 (owner: 10Andrew Bogott) [19:49:16] mutante: wiki not yet created [19:49:24] (03PS1) 10Andrew Bogott: Fix MAC addresses for cloudvirt1027 and 1028 [puppet] - 10https://gerrit.wikimedia.org/r/482125 (https://phabricator.wikimedia.org/T212874) [19:49:37] Hauskatze: ok! 
took it [19:50:08] (03CR) 10Andrew Bogott: [C: 03+2] Fix MAC addresses for cloudvirt1027 and 1028 [puppet] - 10https://gerrit.wikimedia.org/r/482125 (https://phabricator.wikimedia.org/T212874) (owner: 10Andrew Bogott) [19:50:08] :) [19:50:58] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10Dzahn) a:03Dzahn [19:51:04] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:41] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10Dzahn) Is there an offboarding ticket for all the other things that need to be done? [19:53:20] Reedy: there is a create index on testcommonswiki on dbstore1002 [19:53:27] that seems related to T146585 [19:53:27] T146585: Add a primary key to user_newtalk - https://phabricator.wikimedia.org/T146585 [19:53:45] Uhhh, what? [19:53:50] The patch isn't merged [19:54:03] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10Dzahn) 05Open→03Resolved Done. removed jalexander from security@ alias. I do wonder where the rest of the offboarding is handled though. [19:54:05] CREATE /* AddWiki::execute www-data@mwmain... */ INDEX user_ip ON `user_newtalk` (user_ip) [19:54:34] Ah, those are in mw core [19:54:40] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.143 second response time [19:54:41] have been for ages [19:54:46] Has it failed to create? [19:54:55] no it's just taking long time [19:55:03] on an empty table? :/ [19:55:20] Eurgh. 
[19:55:24] maybe is locked by something else, I'm scrolling the processlist [19:55:28] there is another create indecx [19:55:49] echo_notification_user_base_timestamp ON `echo_notification` [19:55:50] creating a new wiki will create many new indexes :P [19:55:55] from /srv/mediawiki/php-1.33.0-wmf.9/extensions/Echo/echo.sql [19:55:59] But should be mostly empty tables [19:56:22] Oh. X1 won’t be empty, right? Shared db. [19:56:33] RoanKattouw? [19:57:13] No, Echo doesn't use a shared DB [19:57:27] It uses X1, but on X1 there are separate DBs for each wiki, and that's where the Echo tables are [19:57:33] Ah. [19:57:36] I'm wondering if they are deadlocking between the two [19:57:57] * volans reloading DBA-stuff from old generation memory [19:57:59] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10chasemp) >>! In T212621#4853210, @Dzahn wrote: > Done. removed jalexander from security@ alias. I do wonder where the rest of the offboarding is handled though.... [19:58:15] I wonder if with the MW Core changes... [19:58:22] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:39] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/c0569fb89a449d86c5ad5aad5a178ac109de96df [19:58:46] (03PS4) 10Andrew Bogott: toolforge: Redirect GET & HEAD to https [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [19:58:57] against mw core [19:58:57] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/c0569fb89a449d86c5ad5aad5a178ac109de96df [19:58:59] ffs [19:59:03] // Close connections and make future ones use the new database as the local domain [19:59:03] $lbFactory->redefineLocalDomain( $dbName ); [19:59:25] But we don't do that for Echo stuff [19:59:29] Oops. [19:59:35] I dunno if it's needed or not... 
[19:59:37] AaronSchulz: About? [19:59:57] Definitely looks like it might be a funky interaction here [20:00:04] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: point wikilovesmonuments.org ns to wmf - https://phabricator.wikimedia.org/T118468 (10Dzahn) It seems this ticket should be closed as rejected based on the comments above where Jan as the task author said he has been convinced by Faidon to not do thi... [20:00:12] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Redirect GET & HEAD to https [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [20:00:56] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: point wikilovesmonuments.org ns to wmf - https://phabricator.wikimedia.org/T118468 (10Dzahn) 05Open→03Declined Still adding @CRoslof for his information .. in case WMF wants to take over the WLM domain names. In that case we can always reopen here. [20:03:20] banyek|away: you around by any chance? [20:03:50] about 20 minutes [20:04:03] is there a problem? [20:04:26] volans: Which database does it say it's working on to create those indexes? [20:04:36] Want to confirm whether it is trying to do some stuff on the wrong db/tables [20:04:40] Reedy: both testcommonswiki [20:04:43] volans^ [20:04:45] Hm [20:05:18] banyek|away: dbstore1002 is lagging behind for s4 and x1, and from processlist there are 2 create index that might be related, not yet sure [20:05:34] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 5.030 second response time [20:05:39] Reedy: the one from AddWiki::execute www-data@mwmain is progressing (different table now) [20:06:02] while the on echo_notification is still there, but might be a red herring? [20:06:32] it's indeed an empty table... so maybe just locked by the maintenance script? [20:06:39] hm, I am not home yet, but almost, I'll check this, ok? 
[20:06:57] I'm nto sure about x1 tbh [20:07:00] (03PS1) 10Andrew Bogott: Revert "toolforge: Redirect GET & HEAD to https" [puppet] - 10https://gerrit.wikimedia.org/r/482129 [20:07:00] *not [20:07:10] PROBLEM - toolschecker: tools nginx proxy health on tools.wmflabs.org is CRITICAL: connect to address tools.wmflabs.org and port 80: Connection refused [20:07:38] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: connect to address tools.wmflabs.org and port 80: Connection refused [20:07:46] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:07:48] (03CR) 10Andrew Bogott: [C: 03+2] Revert "toolforge: Redirect GET & HEAD to https" [puppet] - 10https://gerrit.wikimedia.org/r/482129 (owner: 10Andrew Bogott) [20:08:02] banyek|away: thanks that would be helpful! :) [20:08:24] RECOVERY - toolschecker: tools nginx proxy health on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 785 bytes in 0.090 second response time [20:08:50] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1158 bytes in 0.061 second response time [20:09:00] RECOVERY - HTTPS-wmflabs on tools.wmflabs.org is OK: SSL OK - Certificate *.wmflabs.org valid until 2019-11-16 15:41:05 +0000 (expires in 316 days) [20:09:20] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:09:31] !log reedy@deploy1001 Synchronized dblists/: T197616 (duration: 00m 45s) [20:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:34] T197616: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 [20:09:42] 👍 [20:11:08] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: T197616 [20:11:09] Reedy: I’ve got to bail in about five minutes’ time. 
:-( [20:11:44] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.640 second response time [20:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:25] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: T197616 (duration: 00m 44s) [20:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:34] Reedy: do you know what generates this one? [20:13:34] Wikimedia\Rdbms\Database::sourceFile( /srv/mediawiki/php-1.33.0-wmf.9/extensions/Echo/echo.sql ) www-data@mwmain... [20:13:37] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T197616 (duration: 00m 44s) [20:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:40] (03PS1) 10Andrew Bogott: site.pp: consolidate cloudvirt entries [puppet] - 10https://gerrit.wikimedia.org/r/482133 [20:13:49] volans: Running the script to create new wikis [20:13:57] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (10Dzahn) Was also wondering this. Is this ticket deprecated due to CertCentral work? [20:14:08] could it be that one came through the x1 replication andthe others through the s4 one? [20:14:16] dbstore1002 is mult-replica [20:14:31] both for testcommonswiki ofc [20:14:51] How are the databases setup for other wikis on that host? [20:15:13] PHP fatal error: entire web request took longer than 60 seconds and timed out [20:15:15] uhu.. 
[20:15:28] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:48] 10Operations, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10hardware-requests: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815 (10Dzahn) [20:16:03] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815 (10Dzahn) [20:16:54] (03PS1) 10Reedy: Remove excess wiki suffix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482134 (https://phabricator.wikimedia.org/T197616) [20:16:55] James_F: ^ lol [20:17:11] (03CR) 10Reedy: [C: 03+2] Remove excess wiki suffix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482134 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [20:18:15] (03Merged) 10jenkins-bot: Remove excess wiki suffix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482134 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [20:18:44] Oops. [20:18:51] heh [20:18:51] (Cannot access the database: No working replica DB server: Unknown error (10.64.48.35)) [20:19:09] Some silly edge case [20:19:30] OH [20:19:32] James_F: tut [20:19:51] ? [20:20:04] Reedy: dbstore1002 is multi-replication so it gets the data from all and it goes in the same db if it belongs to the same db [20:20:28] so in /srv/sqldata/testcommonswiki we have bith the echo tables and all the others from s4 [20:20:39] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [20:20:56] Ah. Yeah, that’s a bunch of data. 
[20:20:59] my current theory is that the create indexes from both replication channels are deadlocking themselves although progressing [20:21:06] (03PS1) 10Reedy: Add testcommonswiki to db-*.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482135 (https://phabricator.wikimedia.org/T197616) [20:21:09] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Andrew) [20:21:11] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1027 and 1028 won't PXE boot - https://phabricator.wikimedia.org/T212874 (10Andrew) 05Open→03Resolved a:05Andrew→03None [20:21:23] (03CR) 10Reedy: [C: 03+2] Add testcommonswiki to db-*.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482135 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [20:21:40] but my DBA-foo in interpreting the innodb status is fading away ;) [20:22:05] OK, now I’m going afk. Back in a couple of hours. :-( Thank you so much Reedy and volans and others. 
[20:22:14] It could just be worth killing those create indexes, that's for sure [20:22:25] (03Merged) 10jenkins-bot: Add testcommonswiki to db-*.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482135 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [20:22:43] bbl [20:22:48] Reedy: not that easy, are coming through replication ;) so better not [20:22:52] x1 has recovered [20:23:00] KILLL IT ALL [20:23:06] 10Operations, 10serviceops, 10User-Joe: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 - https://phabricator.wikimedia.org/T212828 (10herron) p:05Triage→03Normal [20:23:18] we just have the AddWiki::execute now [20:23:28] 10Operations, 10Operations-Software-Development, 10Kubernetes: Create Spicerack cook book to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (10herron) p:05Triage→03Normal [20:23:31] !log reedy@deploy1001 Synchronized wmf-config/db-eqiad.php: T197616 (duration: 00m 44s) [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:34] T197616: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 [20:23:35] so I guess the echo tables are done [20:23:42] and the rest should progress now (I hope) [20:24:12] (03CR) 10jenkins-bot: Remove excess wiki suffix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482134 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [20:24:14] (03CR) 10jenkins-bot: Add testcommonswiki to db-*.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482135 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [20:24:29] !log reedy@deploy1001 Synchronized wmf-config/db-codfw.php: T197616 (duration: 00m 44s) [20:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:33] I'll have to step out for dinner in few minutes (it's not user facing) [20:25:20] Cheers volans :) [20:25:46] James_F: Wiki is created and up. 
We need a test logo :P [20:25:57] usually those issues involves the lock on the metadata, I remember vaguely something similar we hit in the past [20:26:07] I wish my memory was better [20:26:20] computers suck [20:26:57] Reedy: let's take the commons logo and remove the colors, so it's greyscale [20:28:04] as I see s4 catched up [20:29:17] the actual logo with the text is not in https://commons.wikimedia.org/wiki/Category:SVG_Wikimedia_Commons_logos [20:29:38] Reedy: https://commons.wikimedia.org/wiki/File:Commons_logo_semitrans.svg [20:30:33] banyek: x1 catched, s4 still lagging, I've sent you the summary ;) [20:32:14] (03PS1) 10Reedy: Set $wgMultiContentRevisionSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482139 (https://phabricator.wikimedia.org/T197616) [20:33:31] (03PS6) 10Dzahn: puppet:Reduce cronspam from modules/mediawiki/ [puppet] - 10https://gerrit.wikimedia.org/r/470877 (https://phabricator.wikimedia.org/T150375) (owner: 10Thifranc) [20:36:13] 10Operations, 10netops: Netbox switches consistency report - https://phabricator.wikimedia.org/T212878 (10Reedy) [20:37:54] (03PS1) 10BryanDavis: toolforge: Redirect GET & HEAD to https (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/482142 (https://phabricator.wikimedia.org/T102367) [20:38:37] (03PS1) 10Andrew Bogott: Turn on alerting and add cloudvirt1027 and 1028 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/482143 (https://phabricator.wikimedia.org/T209616) [20:40:54] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS entries for restbase200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/482121 (https://phabricator.wikimedia.org/T211070) (owner: 10Papaul) [20:41:54] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482144 [20:41:56] (03CR) 10Reedy: [C: 03+2] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482144 (owner: 10Reedy) 
[20:42:05] (03CR) 10Dzahn: [C: 03+2] "confirmed these are unracked" [dns] - 10https://gerrit.wikimedia.org/r/482121 (https://phabricator.wikimedia.org/T211070) (owner: 10Papaul) [20:42:18] (03CR) 10Andrew Bogott: [C: 03+2] Turn on alerting and add cloudvirt1027 and 1028 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/482143 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [20:43:06] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482144 (owner: 10Reedy) [20:43:33] (03Abandoned) 10Cwhite: hiera: add debmonitor cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/480666 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [20:43:56] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 02m 05s) [20:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:08] (03PS2) 10Reedy: Move testcommonswiki from group2 to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482047 (owner: 10Jforrester) [20:44:08] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Andrew) [20:44:14] (03CR) 10Reedy: [C: 03+2] Move testcommonswiki from group2 to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482047 (owner: 10Jforrester) [20:44:19] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Andrew) 05Open→03Resolved thanks all! 
[20:45:21] (03Merged) 10jenkins-bot: Move testcommonswiki from group2 to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482047 (owner: 10Jforrester) [20:46:54] !log reedy@deploy1001 Synchronized dblists/group0.dblist: Add testcommonswiki to group0 (duration: 00m 43s) [20:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 (10Dzahn) @RobH I still see a bunch of production DNS records for these.. although most of the check boxes above are checke... [20:50:01] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482144 (owner: 10Reedy) [20:50:03] (03CR) 10jenkins-bot: Move testcommonswiki from group2 to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482047 (owner: 10Jforrester) [20:50:29] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: Fix error for testcommons (duration: 00m 44s) [20:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:11] (03CR) 10BryanDavis: [C: 03+1] "Tested nginx config change on tools-proxy-01 to ensure that syntax is valid and works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/482142 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [20:56:04] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (10Dzahn) It seems this ticket is permanently stalled. It hasn't had updates since over 2 years now. Does anyone have new input? Did anything cha... 
[20:57:38] (03PS2) 10Andrew Bogott: toolforge: Redirect GET & HEAD to https (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/482142 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [20:59:09] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Redirect GET & HEAD to https (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/482142 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [21:00:38] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) 05Stalled→03Open unstalling because npm is now available for stretch via backports @robh @ssastry @arlolra This should be finally unblocked now . [21:01:16] James_F: Just created a few random extension tables that were missing too [21:01:27] 10Operations, 10Patch-For-Review: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920 (10Dzahn) T201366 is about replacing ruthenium with scandium (and jessie -> stretch upgrade) [21:04:37] (03PS1) 10Cwhite: hiera: add wmcs cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/482149 (https://phabricator.wikimedia.org/T210486) [21:08:11] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: point wikilovesmonuments.org ns to wmf - https://phabricator.wikimedia.org/T118468 (10Effeietsanders) Adding @Slaporte @LilyOfTheWest FYI [21:08:50] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) @ssastry i don't see any mention of the npm package in the puppet code, yet it is installed on ruthenium. was it installed manually? 
[21:12:27] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) Conditional redirect is live and announced at https://phabricator.wikimedia.org/phame/post/view/132/migrating_tools.wmflabs.org_t... [21:13:34] So test-commons is raising issues I see. [21:15:32] (03PS1) 10Dzahn: testreduce: if on stretch, use stretch-backports to get npm package [puppet] - 10https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) [21:16:08] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6434.07 seconds [21:16:24] (03CR) 10jerkins-bot: [V: 04-1] testreduce: if on stretch, use stretch-backports to get npm package [puppet] - 10https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:18:10] (03PS2) 10Dzahn: testreduce: if on stretch, use stretch-backports to get npm package [puppet] - 10https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) [21:20:36] (03CR) 10Muehlenhoff: testreduce: if on stretch, use stretch-backports to get npm package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:23:20] (03CR) 10Dzahn: testreduce: if on stretch, use stretch-backports to get npm package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:24:59] (03CR) 10Dzahn: testreduce: if on stretch, use stretch-backports to get npm package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:25:47] moritzm: it actually fails to install npm for "unmet dependencies" not for not finding it as i thought [21:25:59] but: npm : Depends: node-abbrev (>= 1.1.1~) but 1.0.9-1 is to be installed etc [21:28:20] 
Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) >>! In T201366#4842983, @MoritzMuehlenhoff wrote: > npm 5.8 is now finally available in stretch-backports: https://lists.debian.org/debian-backports-ch... [21:34:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 8.957 second response time [21:37:54] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:08] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) ` apt-get -t stretch-backports install npm Reading package lists... Done Building dependency tree Reading state information... Done Some package... [21:42:03] (CR) Dzahn: testreduce: if on stretch, use stretch-backports to get npm package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn) [21:43:12] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1755 MB (3% inode=91%) [21:44:26] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/proton for deploy-service],Exec[ip addr add 2620:0:861:103:10:64:32:61/64 dev ens5] [21:50:28] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1640 MB (3% inode=91%) [22:10:30] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:11:40] an-coord1001 looks like lots of logs from hdfs rebalancing. known issue?
[22:14:12] !log stopping all slaves on labsdb1002 [22:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:34] !log stopping all slaves on dbstore1002 (NOT labsdb) [22:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:55] !log restarted all slaves on dbstore1002 (relayed from banyek) [22:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:38] thanks volans [22:37:40] yw :) [22:49:21] Reedy: You are wonderful. [22:49:41] James_F: TLDR for the dbstore issue might just be it's an unhappy host [22:50:56] ACKNOWLEDGEMENT - IPMI Sensor Status on db1082 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] Banyek T212909 [22:52:46] Fun. [22:53:14] And I owe you a crate of something when you get to the office. [22:53:16] ACKNOWLEDGEMENT - IPMI Sensor Status on db1107 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Banyek T212910 [22:53:49] I made https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/482139/1/wmf-config/InitialiseSettings.php too [22:53:56] Also, when can we remove Zero? :P [22:56:29] Zero is next week. :-) [22:57:47] pffft [22:57:57] (PS1) BryanDavis: toolforge: Update regex for parsing nginx logs [puppet] - https://gerrit.wikimedia.org/r/482236 [22:57:59] (PS1) BryanDavis: toolforge: process dynamicproxy access logs [puppet] - https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) [22:58:00] James_F: Do you want that wgMultiContentRevisionSchemaMigrationStage config? [22:58:29] Or was that an old comment?
[22:59:12] (CR) jerkins-bot: [V: -1] toolforge: process dynamicproxy access logs [puppet] - https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) (owner: BryanDavis) [23:01:19] Operations, Phabricator, Traffic, Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342 (Aklapper) Open→Declined Declining per T187716#4852639 [23:01:53] dbstore1002 lag on s4 is 8800 and going down [23:01:56] should recover in few hours [23:02:58] (PS1) BryanDavis: toolforge: profile::toolforge::toolviews::mysql_password [labs/private] - https://gerrit.wikimedia.org/r/482238 (https://phabricator.wikimedia.org/T87001) [23:05:13] (CR) Volans: tests: test also with Python 3.7 (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/481914 (owner: Volans) [23:05:23] (PS2) BryanDavis: toolforge: process dynamicproxy access logs [puppet] - https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) [23:05:55] (CR) jerkins-bot: [V: -1] toolforge: process dynamicproxy access logs [puppet] - https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) (owner: BryanDavis) [23:07:14] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.065 second response time [23:07:39] (PS3) BryanDavis: toolforge: process dynamicproxy access logs [puppet] - https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) [23:08:03] !log restarted pdfrender on scb1004 [23:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:46] Operations, Commons, Multimedia, Reading-Infrastructure-Team-Backlog, and 5 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400 (MaxSem) Open→Declined No more Zero.
[23:17:11] Operations, Commons, MediaWiki-File-management, Multimedia, and 10 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (MaxSem) [23:30:27] Operations, Traffic, Zero, ZeroPortal: Move proxy IP lists to META for Varnish XFF decoding - https://phabricator.wikimedia.org/T89838 (MaxSem) Open→Declined Zero is getting dismantled. [23:31:47] Operations, Traffic, Zero: Security: Is it safe to enable Zero spoofing - https://phabricator.wikimedia.org/T120631 (Dzahn) Open→Declined Declined per T187716#4852639 since there is no more Wikipedia Zero [23:32:23] Operations, Traffic, Zero, ZeroPortal: Move proxy IP lists to META for Varnish XFF decoding - https://phabricator.wikimedia.org/T89838 (MaxSem) Declined→Open Err, not necessarily related to Zero, please feel free to reclose if I'm wrong. [23:34:50] (CR) BryanDavis: [C: +1] "Related to I8f8fc23c9f0d47288f3c72490b28d48436f2a5c9" [labs/private] - https://gerrit.wikimedia.org/r/482238 (https://phabricator.wikimedia.org/T87001) (owner: BryanDavis) [23:42:48] Operations, Traffic, Zero, ZeroPortal: Move proxy IP lists to META for Varnish XFF decoding - https://phabricator.wikimedia.org/T89838 (Reedy) Open→Declined [23:43:18] Operations, Commons, MediaWiki-File-management, Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (MaxSem) [23:58:52] Operations, media-storage, Patch-For-Review, User-fgiunchedi: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (Dzahn) @Cmjohnson @fgiunchedi There are 2 new Icinga alerts saying that on ms-be1044 and ms-be1045 the power supplies are not redundant anymor...
[23:59:26] ACKNOWLEDGEMENT - IPMI Sensor Status on ms-be1044 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] daniel_zahn https://phabricator.wikimedia.org/T209618 [23:59:26] ACKNOWLEDGEMENT - IPMI Sensor Status on ms-be1045 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] daniel_zahn https://phabricator.wikimedia.org/T209618