[00:01:27] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2727431 (BBlack) Lack of namespace enforcement is an interesting issue (both in terms of collisions on popular names, and also in terms of organizing...
[00:02:27] Gerrit is very slow, not quite down but basically unusable
[00:03:49] ( greg-g ostriches )
[00:03:49] for me it's quite down (ERR_TIMED_OUT)
[00:05:58] seems to have recovered
[00:06:24] Working again for me now
[00:06:32] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2727433 (Eevans)
[00:07:59] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2725296 (Eevans)
[00:10:59] (CR) BBlack: [C: -1] "I think having the cluster name be "logstash" is more-appropriate here.
It's ok that not all logstash-cluster hosts have the kibana servi" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[00:11:38] (CR) BBlack: [C: 1] kibana - activate icinga check on new LVS service [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[00:11:55] (CR) BBlack: [C: 1] kibana - configure varnish to use new LVS service as backend [puppet] - https://gerrit.wikimedia.org/r/315677 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[00:12:24] (CR) BBlack: [C: -1] kibana - activate icinga check on new LVS service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[00:14:09] (PS2) BBlack: cache_misc: pybal_config: use puppetmaster1001.eqiad only [puppet] - https://gerrit.wikimedia.org/r/315531 (https://phabricator.wikimedia.org/T147847)
[00:14:59] (CR) BBlack: [C: 2] "Yeah I hear you, but right now the focus is on getting things standardized so we can kill complexity.
We'll come back to x-dc for cache_m" [puppet] - https://gerrit.wikimedia.org/r/315531 (https://phabricator.wikimedia.org/T147847) (owner: BBlack)
[00:15:13] (CR) BBlack: [V: 2] cache_misc: pybal_config: use puppetmaster1001.eqiad only [puppet] - https://gerrit.wikimedia.org/r/315531 (https://phabricator.wikimedia.org/T147847) (owner: BBlack)
[00:16:12] Operations, Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2727450 (BBlack)
[00:16:14] Operations, Traffic, Patch-For-Review: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2727448 (BBlack) Open>Resolved a:BBlack
[00:22:16] (PS1) Madhuvishy: notebook: Apply analytics client role to test spark on jupyterhub [puppet] - https://gerrit.wikimedia.org/r/316725
[00:23:15] !log a L v A r O    m O l I n A
[00:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:23:27] (CR) Madhuvishy: [C: 2] notebook: Apply analytics client role to test spark on jupyterhub [puppet] - https://gerrit.wikimedia.org/r/316725 (owner: Madhuvishy)
[00:25:07] !log freenode staffer dax is idiot
[00:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:25:21] !log Elreysintrono loves AlvadoMolina
[00:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:25:31] blerg. need to fix that bot
[00:25:47] !log Vito is trush
[00:25:50] !ops
[00:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:08] yawn
[00:26:30] Vito.. stupid
[00:26:41] robh:
[00:27:43] Ugh...
[00:27:58] Might want to revert the log page….(i’m not on wikitech, so)
[00:28:05] I'm on it
[00:28:22] nice ec
[00:28:26] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:28:28] bd808: And T148119 - if you can see it.
[00:28:39] Actually, Krenair ^
[00:29:39] Ah. Someone removed -r from this channel.
[00:30:11] deleted from @wikimediatech twitter
[00:30:16] bleh, i dunno what the login details are for the wikitech twitter
[00:30:18] Thanks.
[00:30:19] so cannot scrub it from there
[00:30:23] Krenair: ahh, thx
[00:30:47] that bot needs a whitelist for accepting log messages =P
[00:31:07] robh: Already got a task in. It's marked security so it can't be found. Poke me and I'll add you as a subscriber.
[00:31:08] it's the identica_password on tools' /data/project/morebots/confs/production-logbot.py
[00:31:19] he can see security things
[00:31:21] Krenair: can you pgp encrypt those and send it over so i have them?
[00:31:25] ahh
[00:31:31] yes
[00:31:33] robh: https://phabricator.wikimedia.org/T148120
[00:31:35] you are sending the info like 1 second before i hit enter ;D
[00:31:44] :D
[00:32:07] still want me to email it?
[00:32:37] only if you have my pgp key and dont mind, otherwise ill go poke into the repo later
[00:32:51] im also cooking, i just walked to laptop when i heard myself pinged =]
[00:33:26] i see someone beat me to the ban with a kline, even better
[00:35:23] Operations, Deployment-Systems, Release-Engineering-Team: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2727463 (Dzahn) p:Unbreak!>High
[00:39:10] sent
[00:40:18] Krenair: Thanks again for taking care of that Twitter.
[00:40:42] Operations, Discovery, Discovery-Analysis (Current work), Patch-For-Review, Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2727468 (Dzahn) Open>Resolved a:Dzahn The gerrit patch has been amended to use...
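Editor's sketch, following the 00:30:47 remark that the bot needs a whitelist for accepting log messages: a minimal illustration in Python of what such a check could look like. The nick set, the function name and the message shape are all hypothetical; the real bot's code is not shown in this log, and a nick-based list is only meaningful if nicks are authenticated.

```python
# Hypothetical allowlist check for "!log" commands; not the real bot's code.
ALLOWED_NICKS = frozenset({"robh", "bd808", "Krenair"})  # assumed example nicks


def should_log(nick: str, message: str, allowed=ALLOWED_NICKS) -> bool:
    """Return True only for !log commands sent by an allowlisted nick."""
    if not message.startswith("!log "):
        return False
    # Compare case-insensitively so "KRENAIR" and "Krenair" match.
    return nick.lower() in {n.lower() for n in allowed}
```

With a check like this, the 00:25:21 message from an unknown nick would be dropped before it reached the Server Admin Log.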
[00:41:01] Operations, Discovery, Discovery-Analysis (Current work), Patch-For-Review, Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2727471 (Dzahn) Resolved>Open
[00:41:21] Operations, Discovery, Discovery-Analysis (Current work), Patch-For-Review, Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (Dzahn) The status change wasn't intended, sorry.
[00:41:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[00:42:06] yurik: "tracking" bug
[00:42:10] detected
[00:45:35] * bd808 does not agree with the jihad against #tracking
[00:52:45] "nothing in my comment shall be considered an opinion on the usage of tracking tickets and may or may not reflect the views of any organization i'm affiliated with" :)
[00:52:46] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:23] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[00:57:05] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:58:05] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:01:00] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[01:10:13] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[01:22:54] !log AlvadoMolina loves Ajraddatz
[01:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:26:32] !ops
[02:06:22] Operations, WMF-Communications: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061#2727541 (Dzahn) I think it's possible to do.
Basically all it needs is the audio files, a webserver and some RSS/XML. Here is a description how to write the XML part. ht...
[02:07:24] Operations, Discovery, Discovery-Analysis (Current work), Patch-For-Review, Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2727542 (Dzahn) a:Dzahn>None
[02:11:49] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2727543 (AndyRussG) @aaron thanks for the explanation! Just to check, I see the same sort of optio...
[02:22:56] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2727545 (aaron) When lag is > 7 seconds, the TTL cap is TTL_LAGGED, which is 30 seconds.
[02:38:56] Operations, WMF-Communications: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061#2713810 (BBlack) I think it does probably make sense to host the podcasts themselves on Commons, and then just set up a simpler microsite to host indexing (RSS/XML and som...
[03:03:43] so is kibana just broken?
[03:04:29] AaronSchulz: works for me. are you having trouble searching or ?
[03:04:38] nothing works at all
[03:05:06] manually cookie clear seems to work
[03:05:15] weird
[03:05:24] the session reset link suggested by the error message did nothing useful
[03:06:06] oh did you stumble on a query that made it flip out?
I've had issues with that before
[03:08:15] prior queries were OK, it just stopped working at some point, so even the main dashboard kept giving the same backtrace
[03:08:37] queries shouldn't persist over cookies anyway afaik...I hope not
[03:09:06] its all a giant weird pile of javascript
[03:33:19] (PS1) Aaron Schulz: Switch to LoadMonitorMySQL instead of the generic one [mediawiki-config] - https://gerrit.wikimedia.org/r/316732
[03:39:43] Dereckson: you pinged
[03:59:24] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:13:50] bd808: Kibana4 sux. There, I said it.
[04:16:13] no argument from me. kibana4 is worse than kibana3 imo and I'm worried that the next major version will be even worse
[04:16:58] There are 5111 commits on their master since the latest 4.x tag
[04:17:35] so 5.x will either be really really better or another mostly from scratch rewrite
[04:21:18] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2727598 (AndyRussG) Ah K thanks... Mmmm just to follow up, to check that I'm understanding (sorry...
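Editor's sketch of the rule stated in the 02:22:56 comment on T144952 (when lag is > 7 seconds, the TTL cap is TTL_LAGGED, which is 30 seconds). This is not the actual MediaWiki code; the function name and the MAX_LAG_SECONDS constant name are made up for illustration, and only the two numbers come from the log.

```python
TTL_LAGGED = 30       # seconds; value quoted in the log
MAX_LAG_SECONDS = 7   # lag threshold quoted in the log (constant name is assumed)


def effective_ttl(requested_ttl: int, replica_lag: float) -> int:
    """Cap the cache TTL at TTL_LAGGED while replication lag exceeds the threshold."""
    if replica_lag > MAX_LAG_SECONDS:
        return min(requested_ttl, TTL_LAGGED)
    return requested_ttl
```

Under these assumptions, a value requested with a one-hour TTL is kept for at most 30 seconds while the replica is lagged, so possibly stale data ages out quickly.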
[04:25:19] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:53] (PS1) Giuseppe Lavagetto: role::mediawiki::webserver: silence output of hhvm-needs-restart [puppet] - https://gerrit.wikimedia.org/r/316738
[06:47:01] <_joe_> elukey: ^^
[06:47:22] thankssss
[06:47:48] (CR) Giuseppe Lavagetto: [C: 2 V: 2] role::mediawiki::webserver: silence output of hhvm-needs-restart [puppet] - https://gerrit.wikimedia.org/r/316738 (owner: Giuseppe Lavagetto)
[06:50:23] <_joe_> !log installing jemalloc with memory profiling enabled on mw1189
[06:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:59:14] PROBLEM - HHVM rendering on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.006 second response time
[06:59:55] PROBLEM - Apache HTTP on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time
[07:03:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[07:03:19] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[07:20:17] (PS1) Muehlenhoff: Convert ferm::rule deployment-ssh in nova to ferm service [puppet] - https://gerrit.wikimedia.org/r/316739
[07:31:31] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.091 second response time
[07:31:36] <_joe_> !log disabled profiling on mw1189, hhvm keeps crashing
[07:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:31:56] very nice -
https://librenms.wikimedia.org/device/device=92/tab=port/port=8277/view=events/
[07:32:32] probably maintenance? Even if I don't see anything in mails/calendar (but possibly I missed it)
[07:33:35] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 72598 bytes in 0.818 second response time
[07:34:39] (PS1) Giuseppe Lavagetto: base::standard_packages: install quickstack again on jessie [puppet] - https://gerrit.wikimedia.org/r/316740
[07:35:31] ahhh xe- are the management interfaces?
[07:35:53] <_joe_> moritzm: ^^ care to review?
[07:36:39] maybe the eqinix issue that I was reading yesterday from RT
[07:39:57] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/316740 (owner: Giuseppe Lavagetto)
[07:40:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[07:40:13] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[07:46:25] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[07:53:11] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 7.97 ms
[07:54:52] Operations, Analytics-Kanban, Traffic, Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2727782 (elukey) Found other occurrences of the same issue but with different URIs on other hosts: ``` - D...
[07:58:15] (CR) Gehel: kibana - activate icinga check on new LVS service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[08:00:07] (CR) Giuseppe Lavagetto: [C: 2] base::standard_packages: install quickstack again on jessie [puppet] - https://gerrit.wikimedia.org/r/316740 (owner: Giuseppe Lavagetto)
[08:06:09] (PS2) Gehel: kibana - activate icinga check on new LVS service [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458)
[08:06:31] (CR) jenkins-bot: [V: -1] kibana - activate icinga check on new LVS service [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[08:10:22] Operations, Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2727787 (elukey) @dr0ptp4kt: this is my fault, I am going to give you a bit of background story on why we are asking these questions :) The stat hosts should, theoretical...
[08:12:33] !log installing quagga security updates
[08:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:14:55] !log installing tor security update on radium
[08:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:36] !log Stopping db2062.codfw.wmnet to use it to clone another server - T146261
[08:15:37] T146261: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261
[08:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:21:54] (PS2) Gehel: kibana - move to an LVS service [puppet] - https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458)
[08:22:09] (CR) jenkins-bot: [V: -1] kibana - move to an LVS service [puppet] - https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[08:22:45] (CR) Gehel: "Changed cluster name to "logstash"" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458) (owner: Gehel)
[08:27:29] (CR) Alexandros Kosiaris: [C: 2] url_downloader: move standard/firewall to role [puppet] - https://gerrit.wikimedia.org/r/315883 (owner: Dzahn)
[08:27:33] (PS2) Alexandros Kosiaris: url_downloader: move standard/firewall to role [puppet] - https://gerrit.wikimedia.org/r/315883 (owner: Dzahn)
[08:27:35] (CR) Alexandros Kosiaris: [V: 2] url_downloader: move standard/firewall to role [puppet] - https://gerrit.wikimedia.org/r/315883 (owner: Dzahn)
[08:30:30] Operations, Monitoring: Icinga check for Tor - https://phabricator.wikimedia.org/T148614#2727820 (MoritzMuehlenhoff)
[08:31:58] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[08:34:00] (CR) Alexandros Kosiaris: [C: 1] restbase: move standard include to role [puppet] - https://gerrit.wikimedia.org/r/315880 (owner: Dzahn)
[08:38:53] Operations, Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2727839 (Joe)
[08:38:55] Operations, HHVM, Patch-For-Review, User-Joe, discovery-system: Restart HHVM on API appservers every about 48 hours - https://phabricator.wikimedia.org/T147773#2727838 (Joe) Open>Resolved
[08:40:35] (CR) Alexandros Kosiaris: [C: 2] "PCC gives the expected result. Merging" [puppet] - https://gerrit.wikimedia.org/r/315255 (owner: Alexandros Kosiaris)
[08:40:40] (PS5) Alexandros Kosiaris: ntp: Update neon specific ACLs to be more generic [puppet] - https://gerrit.wikimedia.org/r/315255
[08:40:42] (CR) Alexandros Kosiaris: [V: 2] ntp: Update neon specific ACLs to be more generic [puppet] - https://gerrit.wikimedia.org/r/315255 (owner: Alexandros Kosiaris)
[08:42:58] Operations, DBA, Labs, Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2727848 (Marostegui) @chasemp are we talking about this user from labsdb1008 that you want to get replicated to the other labs hosts? If so, I will get...
[08:51:36] Operations, ChangeProp, Citoid, ContentTranslation-CXserver, and 7 others: Update Node on SCB to v4.6.0 - https://phabricator.wikimedia.org/T148615#2727861 (mobrovac)
[08:52:03] Operations, ChangeProp, Citoid, ContentTranslation-CXserver, and 7 others: Update Node on SCB to v4.6.0 - https://phabricator.wikimedia.org/T148615#2727878 (mobrovac)
[08:52:06] Operations, ChangeProp, Services (doing), User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2727877 (mobrovac)
[08:52:30] Operations, ChangeProp, Citoid, ContentTranslation-CXserver, and 7 others: Update Node on SCB to v4.6.0 - https://phabricator.wikimedia.org/T148615#2727861 (mobrovac)
[08:52:33] Operations, ops-eqiad, DC-Ops, Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2727879 (mobrovac)
[08:53:15] Operations, ops-eqiad, DC-Ops, Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2721397 (mobrovac) {T148615} will be done soon, so let's finish that up before merging ex-S...
[08:56:12] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:57:50] Operations, DBA, Labs, Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2727898 (jcrespo) A couple of things- I created a new pasword for something similar to this, which is currently living on palladium secrets git store. I...
[09:04:30] Operations, DBA, Labs, Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2727921 (Marostegui) @jcrespo thanks - I see it now.
You suggest using the same password for this new user? or even using the same user? @chasemp do yo...
[09:16:22] (PS1) Filippo Giunchedi: site: add varnish_exporter to ulsfo/codfw maps/misc [puppet] - https://gerrit.wikimedia.org/r/316742
[09:20:11] !log Deploying schema change on db1069 S4 instance commonswiki revision table - T147305
[09:20:12] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[09:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:30] (CR) Filippo Giunchedi: "Note that this will only install the exporter, Prometheus will start doing the actual polling once https://gerrit.wikimedia.org/r/#/c/3108" [puppet] - https://gerrit.wikimedia.org/r/316742 (owner: Filippo Giunchedi)
[09:22:44] !log installing rsyslog bugfix updates
[09:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:31:52] Operations, Monitoring, Prometheus-metrics-monitoring: Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case - https://phabricator.wikimedia.org/T148541#2727995 (fgiunchedi) I've tried a sample configuration with OIDs from Sentry3.mib (ftp://ftp.servertech.com/Pub/SNMP/sentry3) e.g. for...
[09:44:24] RECOVERY - ores on scb1003 is OK: HTTP OK: HTTP/1.0 200 OK - 2822 bytes in 0.018 second response time
[09:44:32] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[09:44:52] RECOVERY - ores uWSGI web app on scb1003 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app
[09:45:21] RECOVERY - cxserver endpoints health on scb1003 is OK: All endpoints are healthy
[09:45:22] RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy
[09:46:33] (PS1) Marostegui: mariadb: db1064 is finished with the ALTER table and it is ready to go back to production [mediawiki-config] - https://gerrit.wikimedia.org/r/316743
[09:46:56] (PS2) Marostegui: mariadb: db1064 is finished with the ALTER table and it is ready to go back to production [mediawiki-config] - https://gerrit.wikimedia.org/r/316743
[09:47:37] (CR) Marostegui: [C: 2] mariadb: db1064 is finished with the ALTER table and it is ready to go back to production [mediawiki-config] - https://gerrit.wikimedia.org/r/316743 (owner: Marostegui)
[09:48:07] (Merged) jenkins-bot: mariadb: db1064 is finished with the ALTER table and it is ready to go back to production [mediawiki-config] - https://gerrit.wikimedia.org/r/316743 (owner: Marostegui)
[09:50:17] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: Repool db1064 after finishing the ALTER table - T147305 (duration: 01m 08s)
[09:50:18] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[09:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:55:31] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[09:55:52] (PS3) Gehel: kibana - move to an LVS service [puppet] - https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458)
[09:55:54] (PS3) Gehel: kibana - activate icinga check on new LVS service [puppet]
- https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458)
[09:55:56] (PS2) Gehel: kibana - configure varnish to use new LVS service as backend [puppet] - https://gerrit.wikimedia.org/r/315677 (https://phabricator.wikimedia.org/T132458)
[09:59:52] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[10:00:53] (PS4) Gehel: kibana - move to an LVS service [puppet] - https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458)
[10:00:57] marostegui: please don't ack or create a task, I'm about to merge the auto creation, what a perfect test for it
[10:01:48] volans: I will ack
[10:01:49] thanks
[10:01:59] no, don't
[10:02:01] marostegui:
[10:02:08] Ah sorry
[10:02:10] I read: dont
[10:02:13] I read: act
[10:02:20] Anyways, I didn't do it :)
[10:02:26] lol
[10:02:28] thanks
[10:02:33] XD
[10:02:34] <_joe_> !log ran usermod -u 10002 l10nupdate on tin
[10:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:03:06] (PS4) Gehel: kibana - activate icinga check on new LVS service [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458)
[10:05:03] <_joe_> !log converting owner of files for l10nupdate usermod on tin
[10:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:08:42] (PS8) Volans: Monitoring: add event handler for RAID checks [puppet] - https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085)
[10:10:18] (CR) Volans: [C: 2] Monitoring: add event handler for RAID checks [puppet] - https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: Volans)
[10:12:26] (PS4) MarcoAurelio: Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799)
[10:13:16] (PS5) Gehel: kibana - activate icinga check
on new LVS service [puppet] - https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458)
[10:15:20] (PS1) Alexandros Kosiaris: conftool: Add scbX00X boxes to the cluster [puppet] - https://gerrit.wikimedia.org/r/316747
[10:16:11] (PS1) Volans: icinga: cleaning Puppet dependency that was removed [puppet] - https://gerrit.wikimedia.org/r/316748 (https://phabricator.wikimedia.org/T142085)
[10:16:20] Operations, ops-eqiad, DC-Ops, Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2728100 (akosiaris)
[10:16:54] Operations, ops-eqiad, DC-Ops, Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2721397 (akosiaris) Actually the boxes are up and running and ready to be pooled, sporting...
[10:17:11] (CR) Volans: [C: 2] icinga: cleaning Puppet dependency that was removed [puppet] - https://gerrit.wikimedia.org/r/316748 (https://phabricator.wikimedia.org/T142085) (owner: Volans)
[10:17:25] (CR) Alexandros Kosiaris: [C: 2] conftool: Add scbX00X boxes to the cluster [puppet] - https://gerrit.wikimedia.org/r/316747 (owner: Alexandros Kosiaris)
[10:17:28] (PS2) Alexandros Kosiaris: conftool: Add scbX00X boxes to the cluster [puppet] - https://gerrit.wikimedia.org/r/316747
[10:17:30] (CR) Alexandros Kosiaris: [V: 2] conftool: Add scbX00X boxes to the cluster [puppet] - https://gerrit.wikimedia.org/r/316747 (owner: Alexandros Kosiaris)
[10:17:52] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:18:00] this is me
[10:18:02] already fixed
[10:18:09] ^^^
[10:18:12] ok
[10:18:30] actually I had it disabled...
strange
[10:19:11] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:19:22] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:19:24] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:19:25] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:19:52] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:19:52] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:00] volans: gonna kill icinga-wm for a while
[10:20:02] PROBLEM - puppet last run on aqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:04] akosiaris: damn, it's me can t=hanks
[10:20:06] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:07] I was asking about it
[10:20:08] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:11] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Cannot reassign variable service_description at /etc/puppet/modules/raid/manifests/
[10:20:46] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:53] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[10:20:54] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:54] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:22:28] (PS1) Volans: icinga: fix variable re-assignement [puppet] - https://gerrit.wikimedia.org/r/316749 (https://phabricator.wikimedia.org/T142085)
[10:22:30] why it didn't failed on puppet compiler AND on a random host
[10:23:00] akosiaris: if you have a sec for a second eye: https://gerrit.wikimedia.org/r/316749
[10:24:01] ah.. you poor soul
[10:24:06] yeah
[10:24:07] though you were writing python ? :P
[10:24:11] thought*
[10:24:21] or well.. any decent language that's not puppet
[10:24:23] I wrote this 2 months ago :D
[10:24:37] and on db1046 is still not failing
[10:24:39] what the hell
[10:25:16] ahhh of course has a megaraid raid, is the first one
[10:25:23] (CR) Alexandros Kosiaris: [C: 1] "Probably a hash and create_resources would be a better way, but let's unbreak this for now" [puppet] - https://gerrit.wikimedia.org/r/316749 (https://phabricator.wikimedia.org/T142085) (owner: Volans)
[10:25:39] (CR) Volans: [C: 2] icinga: fix variable re-assignement [puppet] - https://gerrit.wikimedia.org/r/316749 (https://phabricator.wikimedia.org/T142085) (owner: Volans)
[10:25:52] so no reassigment on megaraid hosts
[10:27:12] elukey: around?
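Editor's sketch of the idea behind the 10:25:23 review comment (a hash instead of one variable set in several branches). Puppet allows a variable to be assigned only once per scope, which is what the "Cannot reassign variable service_description" error at 10:20:11 reports. The Python below only illustrates the table-driven shape; the real raid module is not shown in this log, so the table contents and the function name are assumptions, and only the "megaraid" type and the "MegaRAID" check name appear in the log.

```python
# Hypothetical per-RAID-type table; only "megaraid" is named in the log.
RAID_CHECKS = {
    "megaraid": "MegaRAID",
    "mdadm": "MD RAID",  # assumed example entry
}


def checks_for(raid_types):
    """Build one check definition per RAID type, with no shared name to overwrite."""
    return [
        {"raid": raid, "service_description": RAID_CHECKS[raid]}
        for raid in raid_types
        if raid in RAID_CHECKS
    ]
```

Because each entry carries its own description, a host with several RAID types simply gets several definitions, which is the case that broke when a single variable was reused.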
[10:27:17] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[10:27:18] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:29:50] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:30:54] (PS1) Alexandros Kosiaris: ores: Add scb1003, scb1004 in client hosts [puppet] - https://gerrit.wikimedia.org/r/316751
[10:34:43] (CR) Alexandros Kosiaris: [C: 2] ores: Add scb1003, scb1004 in client hosts [puppet] - https://gerrit.wikimedia.org/r/316751 (owner: Alexandros Kosiaris)
[10:34:47] (PS2) Alexandros Kosiaris: ores: Add scb1003, scb1004 in client hosts [puppet] - https://gerrit.wikimedia.org/r/316751
[10:34:49] (CR) Alexandros Kosiaris: [V: 2] ores: Add scb1003, scb1004 in client hosts [puppet] - https://gerrit.wikimedia.org/r/316751 (owner: Alexandros Kosiaris)
[10:35:11] (PS1) Gehel: elasticsearch: Fully qualify the lvs::realserver inclusion [puppet] - https://gerrit.wikimedia.org/r/316753
[10:36:10] (CR) Alexandros Kosiaris: [C: 2] elasticsearch: Fully qualify the lvs::realserver inclusion [puppet] - https://gerrit.wikimedia.org/r/316753 (owner: Gehel)
[10:36:15] (CR) Alexandros Kosiaris: [V: 2] elasticsearch: Fully qualify the lvs::realserver inclusion [puppet] - https://gerrit.wikimedia.org/r/316753 (owner: Gehel)
[10:36:20] (PS2) Alexandros Kosiaris: elasticsearch: Fully qualify the lvs::realserver inclusion [puppet] - https://gerrit.wikimedia.org/r/316753 (owner: Gehel)
[10:36:22] (CR) Alexandros Kosiaris: [V: 2]
elasticsearch: Fully qualify the lvs::realserver inclusion [puppet] - 10https://gerrit.wikimedia.org/r/316753 (owner: 10Gehel) [10:36:37] akosiaris: thanks [10:44:31] thanks a lot akosiaris for the support, and sorry everyone for the mess [10:58:54] 06Operations, 10ops-eqiad: Degraded RAID on 10.64.16.35 - https://phabricator.wikimedia.org/T148627#2728188 (10ops-monitoring-bot) [11:06:15] 06Operations, 10ORES, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2728193 (10elukey) p:05Triage>03Normal [11:07:50] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1046.eqiad.wmnet. - https://phabricator.wikimedia.org/T148627#2728197 (10elukey) [11:07:59] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1046.eqiad.wmnet. - https://phabricator.wikimedia.org/T148627#2728188 (10elukey) p:05Triage>03High [11:10:44] 06Operations, 10hardware-requests: codfw/eqiad: 2x systems for prometheus - https://phabricator.wikimedia.org/T148513#2728201 (10elukey) p:05Triage>03Normal [11:11:15] 06Operations, 10Monitoring: Icinga check for Tor - https://phabricator.wikimedia.org/T148614#2728202 (10elukey) p:05Triage>03Normal [11:11:44] 06Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2728203 (10elukey) p:05Triage>03Normal [11:12:27] 06Operations, 10OTRS: Intermittent 503 errors on OTRS ticket system when sending responses to tickets - https://phabricator.wikimedia.org/T148299#2728204 (10elukey) p:05Triage>03Low [11:13:10] 06Operations, 06WMF-Communications: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061#2728205 (10elukey) p:05Triage>03Normal [11:14:30] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] Filippo Giunchedi new sca/scb cluster -- ores creating its metrics [11:14:45] (03PS1) 10Muehlenhoff: Add salt grain for 
sec-tools/zosma [puppet] - 10https://gerrit.wikimedia.org/r/316759 [11:16:05] (03CR) 10Muehlenhoff: [C: 032] Add salt grain for sec-tools/zosma [puppet] - 10https://gerrit.wikimedia.org/r/316759 (owner: 10Muehlenhoff) [11:17:26] 06Operations, 10Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2728212 (10Joe) 05Open>03Resolved [11:17:53] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2728228 (10mark) >>! In T136340#2699842, @Eevans wrote: > From https://phabricator.wikimedia.org/T139961#2541110 re: the avail... [11:18:03] (03PS1) 10Volans: Icinga: raid_handler - hostname instead of address [puppet] - 10https://gerrit.wikimedia.org/r/316760 (https://phabricator.wikimedia.org/T142085) [11:19:29] (03CR) 10Volans: [C: 032] Icinga: raid_handler - hostname instead of address [puppet] - 10https://gerrit.wikimedia.org/r/316760 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [11:19:42] (03PS2) 10Volans: Icinga: raid_handler - hostname instead of address [puppet] - 10https://gerrit.wikimedia.org/r/316760 (https://phabricator.wikimedia.org/T142085) [11:21:28] moritzm: can I merge on puppet? [11:21:57] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2728237 (10elukey) >>! In T136340#2728228, @mark wrote: > > Assuming we're talking about aqs1001-1003, and Analytics is willi... [11:22:17] volans: my sectools change? 
that's already merged [11:22:24] yep, ok thanks [11:23:28] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2728239 (10mobrovac) [11:30:14] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [11:30:19] 06Operations, 10ops-eqiad: Degraded RAID on 10.64.16.35 - https://phabricator.wikimedia.org/T148629#2728246 (10ops-monitoring-bot) [11:30:40] !log upgrading nginx on cp2001 (codfw text canary) - T144523 [11:30:41] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [11:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:35] 06Operations, 10ops-eqiad: Degraded RAID on 10.64.16.35 - https://phabricator.wikimedia.org/T148629#2728254 (10Volans) 05Open>03Invalid Automatically generated, was supposed to have the hostname instead of the address. [11:35:06] !log upgrading nginx on cp2002 (codfw upload canary) - T144523 [11:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:40] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2728262 (10mobrovac) [11:35:43] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2728261 (10mobrovac) [11:42:17] (03PS1) 10Giuseppe Lavagetto: role::aqs: convert to role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316765 [11:42:19] (03PS1) 10Giuseppe Lavagetto: eventbus: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316766 [11:42:21] (03PS1) 10Giuseppe Lavagetto: misc: fix relative inclusion of lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316767 [11:42:31] <_joe_> bblack, akosiaris ^^ [11:42:43] <_joe_> I need to step out for lunch [11:43:03] <_joe_> I'll merge those 
when I am back [11:43:35] 06Operations, 10ops-eqiad: Degraded RAID on 10.64.0.56 - https://phabricator.wikimedia.org/T148630#2728267 (10ops-monitoring-bot) [11:44:13] what? checking [11:46:47] lol [11:47:29] 06Operations, 10ops-eqiad: Degraded RAID on 10.64.0.56 - https://phabricator.wikimedia.org/T148630#2728279 (10Volans) 05Open>03Invalid The host was failing all checks on Icinga or was restarted and the scheduled downtime was done only on the host and not it's services [11:47:57] akosiaris: is possible that icinga doesn't get reloaded after a change from puppet? [11:48:08] yeah that's the icinga-downtime script fault I guess :) [11:48:15] (03PS1) 10Alexandros Kosiaris: servermon: Fix a syntax error in the report handler [puppet] - 10https://gerrit.wikimedia.org/r/316770 [11:48:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: Fix a syntax error in the report handler [puppet] - 10https://gerrit.wikimedia.org/r/316770 (owner: 10Alexandros Kosiaris) [11:49:02] I'm looking if I can get check in a reliable way the Status information [11:51:23] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [11:55:31] !log Deploying schema change db2055 - S1 enwiki.change_tag - T147166 [11:55:36] (03PS1) 10Gehel: kibana - allow access to both /status and /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316771 (https://phabricator.wikimedia.org/T132458) [11:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:55:38] (03PS1) 10Gehel: kibana - change probe URL to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316772 (https://phabricator.wikimedia.org/T132458) [11:55:40] (03PS1) 10Gehel: kibana - only allow unauthenticated access to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316773 (https://phabricator.wikimedia.org/T132458) [11:55:42] (03PS1) 10Gehel: kibana - move to an LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316774 
(https://phabricator.wikimedia.org/T132458) [11:55:44] (03PS1) 10Gehel: kibana - activate icinga check on new LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316775 (https://phabricator.wikimedia.org/T132458) [11:55:46] (03PS1) 10Gehel: kibana - configure varnish to use new LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/316776 (https://phabricator.wikimedia.org/T132458) [11:55:50] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [11:56:15] (03Abandoned) 10Gehel: kibana - configure varnish to use new LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/315677 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [11:56:18] (03Abandoned) 10Gehel: kibana - activate icinga check on new LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315676 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [11:56:21] (03Abandoned) 10Gehel: kibana - move to an LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315675 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [12:00:27] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [12:00:31] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [12:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=ores']) [12:01:02] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=ores']) [12:01:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:36] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:58] ostriches: I'm going to accept config patches to the next EU SWAT, as they don't depend of the current /srv/mediawiki-staging subfolders correctness, but defer any backport code patches. [12:13:55] (03PS1) 10Volans: Icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/316779 (https://phabricator.wikimedia.org/T142085) [12:17:12] !log update cr{1,2}-eqiad configuration to add tegmen+einsteinium [12:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:44] PROBLEM - DPKG on cp3013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:24:08] !log Stopping db2055 to clone another host - T146261 [12:24:09] T146261: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261 [12:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:17] akosiaris: when you have a sec, could you please check if I'm using correctly notify here? https://gerrit.wikimedia.org/r/#/c/316779/1/modules/icinga/manifests/event_handlers/raid.pp [12:24:19] ^fixing cp3013 [12:26:29] Dereckson: backports should be fine now too [12:27:00] can someone get morebots to rejoin the channel [12:27:20] (03PS2) 10Volans: Icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/316779 (https://phabricator.wikimedia.org/T142085) [12:28:28] (03CR) 10Alexandros Kosiaris: Icinga: raid_handler improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316779 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [12:29:27] RECOVERY - DPKG on cp3013 is OK: All packages OK [12:29:28] akosiaris: thanks! 
I've grepped around and found it was uses like this for icinga :) [12:30:56] (03CR) 10Volans: [C: 032] Icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/316779 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [12:31:20] volans: I am pretty sure ALL over the place [12:31:21] ostriches: okay, ack'ed [12:31:25] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:31:32] the pattern sucks but for now we have to deal with it [12:31:34] (03PS1) 10Muehlenhoff: Add versioned dependency on kernel to ensure that latest version is pulled in [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/316781 [12:32:29] 06Operations, 10MediaWiki-Cache, 10Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2728350 (10elukey) p:05Triage>03Normal [12:32:38] eheh ok [12:33:25] 06Operations, 10MediaWiki-Cache, 10Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2636791 (10elukey) It is not super clear to me if we need to keep both Traffic and Operations tags (and if so what is requested from both), @hashar let me know :) [12:34:21] 06Operations, 07discovery-system: Replace etcd internal auth mechanism with a frontend proxy - https://phabricator.wikimedia.org/T146355#2728368 (10elukey) p:05Triage>03Normal [12:35:56] 06Operations, 10Monitoring, 13Patch-For-Review: Extract metrics from logs - https://phabricator.wikimedia.org/T147923#2728370 (10elukey) p:05Triage>03Normal [12:37:29] (03PS2) 10Giuseppe Lavagetto: role::aqs: convert to role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316765 [12:38:02] (03PS1) 10Muehlenhoff: role::mirrors: Move base::firewall include in the role [puppet] - 10https://gerrit.wikimedia.org/r/316782 [12:38:15] <_joe_> akosiaris: should we set those new machines to a higher load in ores? 
[12:38:47] 06Operations, 10Pybal: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425#2728373 (10elukey) p:05Triage>03Normal @Gehel: did the the restart fail or just proceeded anyway? [12:39:37] 06Operations, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#2728375 (10elukey) p:05Triage>03Low [12:40:18] 06Operations, 10hardware-requests: Site: (3) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2728376 (10elukey) p:05Triage>03Normal [12:40:38] (03CR) 10Giuseppe Lavagetto: [C: 032] role::aqs: convert to role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316765 (owner: 10Giuseppe Lavagetto) [12:41:24] (03PS2) 10Giuseppe Lavagetto: eventbus: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316766 [12:42:27] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2728381 (10elukey) p:05Triage>03Normal @hashar: will T146286 also solve this one? [12:43:13] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2728383 (10Ottomata) Perhaps labs discovery of relevant hiera variables could be done via good class docs? ``` /** * * [*nonya_param1*] * ... c... [12:43:26] 06Operations, 07Upstream: Trusty: debug information found in "/usr/lib/debug//usr/lib/php5/20121212/mysql.so" does not match "/usr/lib/php5/20121212/mysql.so" (CRC mismatch). 
- https://phabricator.wikimedia.org/T145706#2728384 (10elukey) p:05Triage>03Low [12:43:46] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2728385 (10elukey) p:05Triage>03Normal [12:45:46] jouncebot: next [12:45:46] In 0 hour(s) and 14 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161019T1300) [12:45:54] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#2728396 (10elukey) p:05Triage>03Low [12:46:26] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [12:46:28] ACKNOWLEDGEMENT - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T148633 [12:46:28] 06Operations, 06Operations-Software-Development, 07HHVM, 13Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2728397 (10elukey) [12:46:32] 06Operations, 10ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T148633#2728398 (10ops-monitoring-bot) [12:46:35] yes! [12:46:52] nice!!!! [12:47:43] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1046.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T148627#2728403 (10Volans) [12:47:45] 06Operations, 10ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T148633#2728405 (10Volans) [12:49:21] it's rare to see people happy about a degraded RAID :-) [12:50:20] :) [12:50:31] lol [12:50:35] (03CR) 10Ottomata: [C: 032] Update R and C++-related stats puppet configs [puppet] - 10https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682) (owner: 10Bearloga) [12:50:40] (03PS3) 10Ottomata: Update R and C++-related stats puppet configs [puppet] - 10https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682) (owner: 10Bearloga) [12:50:41] haha [12:50:42] (03CR) 10Ottomata: [V: 032] Update R and C++-related stats puppet configs [puppet] - 10https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682) (owner: 10Bearloga) [12:52:04] !log upgrading nginx on codfw text+upload caches - T144523 [12:52:06] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [12:53:15] (03CR) 10Giuseppe Lavagetto: [C: 032] eventbus: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316766 (owner: 10Giuseppe Lavagetto) [12:53:21] (03PS3) 10Giuseppe Lavagetto: eventbus: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316766 [12:53:24] (03CR) 10Giuseppe Lavagetto: [V: 032] eventbus: use role::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316766 (owner: 10Giuseppe Lavagetto) [12:54:49] (03PS2) 10Giuseppe Lavagetto: misc: fix relative inclusion of lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316767 [12:55:14] 06Operations, 10Monitoring, 15User-Joe, 07Wikimedia-Incident: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#1570438 (10elukey) Adding a reference to the outage mentioned: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150825-Redis [12:56:26] 06Operations, 10DBA, 06Labs, 
10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728476 (10chasemp) >>! In T148560#2727848, @Marostegui wrote: > @chasemp are we talking about this user from labsdb1008 that you want to get replicated t... [12:57:04] 06Operations, 06Discovery, 06Discovery-Analysis (Current work), 13Patch-For-Review, 07Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2728477 (10Ottomata) ``` Notice: /Stage[main]/Statistics::Compute/Package[libgsl0-dev]/ensure... [12:57:05] (03CR) 10Giuseppe Lavagetto: [C: 032] misc: fix relative inclusion of lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/316767 (owner: 10Giuseppe Lavagetto) [12:57:24] _joe_: new machines ? they are actually a bit older IIRC [12:57:51] <_joe_> akosiaris: meaning "new to scb" [12:57:58] <_joe_> while other services are depooled [12:58:09] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2728485 (10elukey) [12:58:15] I think mobrovac plans to have them pooled today [12:58:33] yup yup [12:58:38] <_joe_> oh ok [12:58:38] a test deploy is in progress [12:58:58] <_joe_> mobrovac: btw our depool/repool script has an issue [12:59:07] <_joe_> I'm going to fix that [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161019T1300). [13:00:04] Urbanecm and mafk: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:06] I can SWAT. 
[13:00:08] kk _joe_ [13:00:12] Hi [13:00:22] hi mafk [13:01:52] 06Operations, 05Prometheus-metrics-monitoring: Port redis statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T148637#2728497 (10elukey) [13:02:19] 06Operations, 10Monitoring, 15User-Joe, 07Wikimedia-Incident: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#2728513 (10elukey) I created https://phabricator.wikimedia.org/T148637 to add Redis metrics to Prometheus. [13:02:37] 06Operations, 05Prometheus-metrics-monitoring: Port redis statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T148637#2728497 (10elukey) Please also take https://phabricator.wikimedia.org/T110169 in consideration [13:03:22] (03PS4) 10Dereckson: Create a 'templateeditor' user group at en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007) (owner: 10MarcoAurelio) [13:03:40] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T148633#2728534 (10elukey) [13:04:22] Maybe we can also do some group renaming if there's time left [13:04:31] but requires running a maintenance script [13:04:33] <_joe_> Dereckson: please proceed with caution, I am not sure what's the status with deployments [13:04:36] PROBLEM - Disk space on cp3012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%) [13:04:47] <_joe_> bblack: ^^ uh? [13:05:00] looking [13:05:24] probably vk spam? [13:05:27] _joe_: we wanted to resume it yesterday, but got a new issue after tin reimaging we lost /srv/mediawiki-staging, it has been restored yesterday [13:05:45] bblack: varnish.log [13:05:52] I'll carefully monitor logs. 
[13:05:59] <_joe_> Dereckson: I am aware, but there were some issues as of this eu morning [13:06:03] <_joe_> that I think I fixed [13:06:12] okay [13:06:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007) (owner: 10MarcoAurelio) [13:07:13] ema: supposed to be rotated... [13:07:32] and empty, I thought [13:07:46] mafk: what's the priority of the group renaming ones? [13:07:59] (03Merged) 10jenkins-bot: Create a 'templateeditor' user group at en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315653 (https://phabricator.wikimedia.org/T148007) (owner: 10MarcoAurelio) [13:08:07] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728548 (10Marostegui) >>! In T148560#2728476, @chasemp wrote: >>>! In T148560#2727848, @Marostegui wrote: >> @chasemp are we talking about this user from... [13:08:14] weird, a binary file? [13:08:26] bblack: varnishlog.service seems involved, it exited ~3minutes ago [13:08:30] Dereckson: none, just work I did which is pending SWAT [13:08:43] mafk: we'll do that later in this case [13:08:49] if there are problems with SWATs, servers, etc we can do it another time [13:08:51] elukey: yeah, /usr/bin/varnishlog -a -w /var/log/varnish/varnish.log [13:08:51] mafk: templateeditor live on mw1099 [13:08:51] sure [13:08:55] checking [13:08:58] ema: yeah it's supposed to be stopped/disabled in puppet, but it's not for v4 :) [13:09:08] 12:57:31 -!- Urbanecm [~chatzilla@posta.ssakhk.cz] has quit [Ping timeout: 256 seconds] [13:09:18] let's start with rm that file and stop the service manually on the v4 hosts for now to avoid more diskfull? [13:09:22] Urbanecm left just before the SWAT. [13:09:35] bblack: agreed, I'll do that [13:09:47] well, it's a new namespace, I'll take care of it and will test. 
[13:09:53] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316608 (https://phabricator.wikimedia.org/T148563) (owner: 10Urbanecm) [13:10:06] (03PS2) 10Dereckson: Create a new namespace "Príloha" for skwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316608 (https://phabricator.wikimedia.org/T148563) (owner: 10Urbanecm) [13:10:07] Dereckson: LGTM on mw1099 [13:10:44] I can see the rights on Special:ListGroupRights and on the protection dropdown there's a new option as well [13:10:55] PROBLEM - Disk space on cp3013 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%) [13:11:02] (03PS1) 10BBlack: varnish: stop varnishlog on v4 hosts, too [puppet] - 10https://gerrit.wikimedia.org/r/316786 [13:11:03] let me check if admins can add remove [13:11:11] !log stopping varnishlog service on v4 cp hosts and removing log file [13:11:16] yep, they can [13:11:20] all okay I think [13:11:24] oh wait [13:11:39] ema: ... these aren't real hosts [13:12:03] oh they're spares [13:12:39] RECOVERY - Disk space on cp3012 is OK: DISK OK [13:13:10] looking into that, I made upgraded those esams spares earlier since it'll take a bit until they're actually decommed [13:13:36] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:13:52] <_joe_> spares but they both have varnish running [13:14:02] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Create a 'templateeditor' user group at en.wiktionary plus adittional configuration (T148007) (duration: 02m 33s) [13:14:03] T148007: Add template editor user group to en.wiktionary - https://phabricator.wikimedia.org/T148007 [13:14:12] fixed, I removed the vintage 3.16 kernels [13:14:28] I noticed a new error in the logs, filled as https://phabricator.wikimedia.org/T148639 [13:14:32] !log change-prop stopping instance on scb1004 so that scb1004 picks up more load [13:14:41] regexp related (not with this SWAT so), Compilation failed: nothing to repeat at offset 13 in /srv/mediawiki/php-1.28.0-wmf.22/extensions/SpamBlacklist/EmailBlacklist.php on line 61 [13:16:06] ema: https://gerrit.wikimedia.org/r/#/c/316786/ should be right, right? there's not some v4-specific reason we need it? [13:16:07] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728584 (10Krenair) I think it needs to be able to at least read non-_p databases to be allowed to create views selecting them? 
[13:16:17] already reported by matanya (but wasn't in #wikimedia-log-errors) [13:16:43] matanya: when you see something in logs, add #wikimedia-log-errors to the projects on Phabricator, it's the board to track error messages [13:17:44] (03CR) 10Dereckson: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316608 (https://phabricator.wikimedia.org/T148563) (owner: 10Urbanecm) [13:17:57] (03CR) 10Ema: [C: 031] varnish: stop varnishlog on v4 hosts, too [puppet] - 10https://gerrit.wikimedia.org/r/316786 (owner: 10BBlack) [13:18:02] bblack: nope, lgtm [13:18:07] (03CR) 10BBlack: [C: 032] varnish: stop varnishlog on v4 hosts, too [puppet] - 10https://gerrit.wikimedia.org/r/316786 (owner: 10BBlack) [13:18:58] (03CR) 10Dereckson: [C: 032] "SWAT, take two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316608 (https://phabricator.wikimedia.org/T148563) (owner: 10Urbanecm) [13:19:19] PROBLEM - Disk space on cp3014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [13:19:24] (03Merged) 10jenkins-bot: Create a new namespace "Príloha" for skwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316608 (https://phabricator.wikimedia.org/T148563) (owner: 10Urbanecm) [13:19:41] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728602 (10chasemp) At the moment the user has full permissions iirc. @Marostegui for my part I meant more like "Let's not use this user for anything els... [13:19:55] 316608 live on mw1099 [13:20:19] Works. 
[13:21:22] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Create a new namespace "Príloha" for skwikt (T148563) (duration: 00m 50s) [13:21:23] T148563: Create a new namespace for skwikt "Príloha" - https://phabricator.wikimedia.org/T148563 [13:21:28] !log change-prop stopping instances on scb100[12] so that scb1003 picks up more load [13:21:47] PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:22:32] known ^ [13:23:28] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2728606 (10Eevans) These nodes have half the RAM of the proposed AMS nodes, and (I just learned), have H310 raid controllers, which are apparently notorious for "extremely poor performance". [13:24:41] ACKNOWLEDGEMENT - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Muehlenhoff stopped explicitly by Marko to drive traffic to scb1003 [13:24:41] ACKNOWLEDGEMENT - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Muehlenhoff stopped explicitly by Marko to drive traffic to scb1003 [13:24:42] ACKNOWLEDGEMENT - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Muehlenhoff stopped 
explicitly by Marko to drive traffic to scb1003 [13:27:06] RECOVERY - Disk space on cp3013 is OK: DISK OK [13:27:28] RECOVERY - Disk space on cp3014 is OK: DISK OK [13:29:56] 06Operations, 10Analytics-EventLogging: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#2728649 (10elukey) [13:29:58] 06Operations, 10Analytics-EventLogging, 10Icinga: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#2728646 (10elukey) 05Open>03Resolved a:03elukey This is an old task, eventlog2001 is not running EL anymore (it is a spare host). [13:30:14] (03PS1) 10Giuseppe Lavagetto: restart-service: repool only previously pooled services [puppet] - 10https://gerrit.wikimedia.org/r/316788 [13:31:44] !log upgrading nginx on ulsfo text+upload caches - T144523 [13:31:45] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [13:33:17] HEY [13:33:18] 12:24:14 -!- mode/#wikimedia-operations [+b *!*@*internal-server-nat.wmflabs.org] by BanBot [13:33:21] 12:24:14 -!- morebots was kicked from #wikimedia-operations by BanBot [morebots] [13:33:34] We don't have any log for the last 50 minutes. [13:34:31] heh [13:34:46] Dereckson: what about https://gerrit.wikimedia.org/r/#/c/313601/ ? shouldn't be problematic? [13:34:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 724 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3063314 keys - replication_delay is 724 [13:35:04] mafk: is that urgent?. [13:35:08] nope [13:35:11] we can wait [13:35:35] uh? [13:35:38] I'm manually adding back the logs to SAL page. [13:35:44] why did we lose grrrit-wm [13:35:49] Urbanecm: I deployed sk.wikt namespace change [13:35:51] Dereckson: Sorry I have forgotten the SWAT window. [13:36:03] I noticed in mail. Thanks a lot! 
[13:36:10] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Create a 'templateeditor' user group at en.wiktionary plus adittional configuration (T148007) (duration: 02m 33s) [13:36:11] T148007: Add template editor user group to en.wiktionary - https://phabricator.wikimedia.org/T148007 [13:36:22] bot is not here [13:36:24] :( [13:36:42] I just unbanned it, no clue where it is running from though [13:36:58] why it got banned? [13:37:06] <_joe_> paravoid: toollabs [13:37:23] neon.wikimedia.org [13:37:32] logmsgbot <~logmsgbot@neon.wikimedia.org> “logmsgbot” [13:37:51] it's morebots we're missing [13:37:55] <_joe_> mafk: w;ere talking about grrrit-wm [13:38:02] nope, grrrit-wm is back [13:38:11] morebots is indeed what we're missing [13:38:16] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:38:44] https://wikitech.wikimedia.org/wiki/Morebots [13:38:58] Log restored: https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&type=revision&diff=912301&oldid=912192 [13:41:29] (03PS1) 10Volans: Icinga: peroperly detect timeouts in raid_handler [puppet] - 10https://gerrit.wikimedia.org/r/316791 (https://phabricator.wikimedia.org/T142085) [13:41:47] https://wikitech.wikimedia.org/wiki/Special:Contributions/Lololololololololololl <-- don't expect anything good from this account I guess [13:42:41] (03PS1) 10Alexandros Kosiaris: icinga: Update labtest network::constants as well [puppet] - 10https://gerrit.wikimedia.org/r/316792 [13:44:47] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Update labtest network::constants as well [puppet] - 10https://gerrit.wikimedia.org/r/316792 (owner: 10Alexandros Kosiaris) [13:47:28] (03PS2) 10Volans: Icinga: peroperly detect timeouts in raid_handler [puppet] - 10https://gerrit.wikimedia.org/r/316791 (https://phabricator.wikimedia.org/T142085) [13:49:00] (03PS1) 10Alexandros Kosiaris: Add tendril role to tegmen [puppet] - 
10https://gerrit.wikimedia.org/r/316794 [13:49:12] (03CR) 10Volans: [C: 032] Icinga: peroperly detect timeouts in raid_handler [puppet] - 10https://gerrit.wikimedia.org/r/316791 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [13:50:02] akosiaris: I've just puppet-merged and on one host I got your previous commit too [13:50:05] did it failed? [13:50:06] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728703 (10Marostegui) As @chasemp and myself agreed on IRC, we have decided to do some tweaking with a new user (to avoid touching maintain-views one) .... [13:50:25] (03PS3) 10BBlack: MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) [13:56:16] volans: sha1 hash ? [13:56:28] WARNING: Revision range includes commits from multiple committers! [13:56:29] I see on all hosts 31e0ff02dabc45b366132daa7c6c047c329f9e4d. during the puppet-merge run [13:56:33] git merge --ff-only d190183734cf6f418cbe8fe7968debc48dc8e591 [13:56:36] modules/network/manifests/constants.pp | 5 ++++- [13:56:39] this file was from yours [13:56:42] depooled mw1239.eqiad.wmnet to allow hw investigation (T148421) [13:56:42] T148421: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421 [13:56:47] akosiaris: only on one host [13:56:58] one of the puppet-merge blocks [13:57:39] akosiaris: that has in the middle: Connection to puppetmaster1002.eqiad.wmnet closed. [13:57:45] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2728724 (10elukey) ``` elukey@puppetmaster1001:~$ sudo -i confctl --quiet --find --action set/pooled=inactive mw1239.eqiad.wmnet mw1239.eqiad.wmnet: pooled changed yes => inactive elukey@puppetmaster1001:~$ s... 
[13:57:57] OK: puppet-merge on puppetmaster1002.eqiad.wmnet succeded [13:58:11] hmm it maybe a PEBKAC on my side [13:58:16] ^CConnection to puppetmaster1002.eqiad.wmnet closed. [13:58:22] maybe a Ctrl+C ? [13:58:32] 06Operations, 10Pybal: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425#2728728 (10Gehel) The restart went just fine. There is no functional problem that I know related to this log. I was just surprised to see the error in the log... [13:58:39] but then OK should not be returned... weird [13:58:49] akosiaris: https://phabricator.wikimedia.org/P4263 [13:59:35] Maybe morebots needs a boot (restart) in the ass to get going again? I dont know if you guys have access to it or not though (unsure if morebots is tools or not) [13:59:37] volans: yeah, must have been an erroneuous ctrl+c on my side [14:00:47] !log scb expanding and deploying services to scb[12]00[1234] change-prop citoid cxserver graphoid mathoid mobileapps [14:00:51] akosiaris: ^ [14:01:07] akosiaris: ok np, just checking [14:02:59] !log bdsync tools from labstore1001 to labstore2001 [14:03:51] chasemp: morebots got ripped up and eaten by BanBot logs i believe need done manually for the time being [14:05:00] (03CR) 10Gehel: [C: 04-1] Add es-tool also on jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314695 (owner: 10Muehlenhoff) [14:05:52] RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy [14:07:45] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3041381 keys - replication_delay is 0 [14:09:05] (03PS4) 10Muehlenhoff: elasticsearch: Use domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304483 [14:12:59] 06Operations, 10Icinga, 06Operations-Software-Development, 13Patch-For-Review: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2728771 (10Volans) 
05Open>03Resolved All done and released to production, first usage in T148633. [14:14:22] !log [done] scb expanding and deploying services to scb[12]00[1234] change-prop citoid cxserver graphoid mathoid mobileapps [14:14:44] akosiaris: ^^ you can now pool the "new" nodes [14:14:50] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:14:55] yippi! [14:14:59] doing so now [14:15:47] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [14:15:48] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [14:15:50] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=citoid']) [14:15:51] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=citoid']) [14:15:53] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=graphoid']) [14:15:53] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=graphoid']) [14:15:55] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=cxserver']) [14:15:56] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=cxserver']) [14:15:58] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mobileapps']) [14:15:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; 
selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mobileapps']) [14:16:06] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728781 (10chasemp) With maintain-views2 error: > pymysql.err.InternalError: (1227, 'Access denied; you need (at least one of) the SUPER privilege(s) for... [14:16:24] (03CR) 10Muehlenhoff: [C: 032] elasticsearch: Use domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304483 (owner: 10Muehlenhoff) [14:16:43] 06Operations, 10ChangeProp, 10Citoid, 10ContentTranslation-CXserver, and 8 others: Update Node on SCB to v4.6.0 - https://phabricator.wikimedia.org/T148615#2728784 (10mobrovac) [14:16:46] 06Operations, 10ChangeProp, 06Services (doing), 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2728782 (10mobrovac) 05Open>03Resolved The `htcp-purge` did the trick! Thank you @Pchelolo ! [14:16:57] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [14:16:58] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [14:17:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [14:17:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [14:17:03] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [14:17:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [14:17:05] !log 
akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [14:17:06] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [14:17:08] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [14:17:09] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [14:17:24] mobrovac: done. Let's witness all hell break loose now [14:17:26] :) [14:17:43] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2728790 (10mobrovac) [14:17:49] akosiaris: \o/ [14:17:51] 06Operations, 10ChangeProp, 10Citoid, 10ContentTranslation-CXserver, and 8 others: Update Node on SCB to v4.6.0 - https://phabricator.wikimedia.org/T148615#2727861 (10mobrovac) 05Open>03Resolved All of the services are now operating on Node 4.6.0 on SCB. [14:17:55] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2728791 (10akosiaris) a:05akosiaris>03None [14:18:04] 06Operations, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#2672637 (10faidon) This is most likely related to Yahoo's DMARC policy, cf. T66818. 
[14:18:40] akosiaris: shouldn't you resolve T148380 [14:18:41] T148380: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380 [14:18:42] ? [14:18:52] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2721397 (10akosiaris) That leaves only the DC labels stuff and switchport descriptions. @Cmjo... [14:19:18] ah i see [14:19:19] :P [14:20:18] 06Operations, 06Services (done), 15User-mobrovac: Expand SCB cluster - https://phabricator.wikimedia.org/T147903#2728800 (10mobrovac) 05Open>03Resolved a:03mobrovac This is done. [14:29:25] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 611 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3042292 keys - replication_delay is 611 [14:35:48] (03PS1) 10Volans: Icinga: disable event_handler on passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/316806 (https://phabricator.wikimedia.org/T142085) [14:35:52] akosiaris: ^^^ [14:37:38] (03PS2) 10Volans: Icinga: disable event_handler on passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/316806 (https://phabricator.wikimedia.org/T142085) [14:37:42] fixed spacing [14:38:32] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2728868 (10fgiunchedi) [14:39:00] 06Operations, 10media-storage, 05Goal: expand swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2122471 (10fgiunchedi) a:03fgiunchedi [14:39:21] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:40:36] (03CR) 10Alexandros Kosiaris: [C: 032] Icinga: disable event_handler on passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/316806 
(https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [14:42:43] grrrit-wm where are you going... [14:42:52] akosiaris: oh already merged, thanks man! [14:45:39] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2728891 (10Paladox) Your connection reset by peer looks very similar to https://issues.apache.org/jira/browse/DIRMINA-1006 which... [14:51:15] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2728901 (10Marostegui) So far I have only been able to create it with either SUPER or ALL PRIVILEGES ON: `GRANT SELECT ON *.* TO 'maintain-views2'@'10.64.... [14:54:38] (03PS1) 10Filippo Giunchedi: graphite: change Cassandra '.count' metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) [14:54:53] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2728906 (10Papaul) @fgiunchedi yes in codfw we do have physical space and ports available also on the 10 GB switches [14:55:43] (03CR) 10jenkins-bot: [V: 04-1] graphite: change Cassandra '.count' metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) (owner: 10Filippo Giunchedi) [14:56:53] (03PS4) 10Giuseppe Lavagetto: kubernetes: introduce 1st-stage worker role [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) [14:56:53] nested arrows I love you [14:59:02] (03PS2) 10Filippo Giunchedi: graphite: change Cassandra '.count' metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) [14:59:30] PROBLEM - mediawiki-installation DSH group on mw1239 is CRITICAL: Host mw1239 is not in mediawiki-installation dsh group 
[15:02:12] this one is me --^ [15:02:41] set it to inactive to allow hw investigation [15:04:48] <_joe_> elukey: then you probably need this https://gerrit.wikimedia.org/r/#/c/316788/ to be merged [15:04:56] <_joe_> (please, take a good look) [15:05:58] ah snap [15:06:40] I am trying to get the whole picture [15:06:57] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2728945 (10Paladox) I reported it here https://bugs.chromium.org/p/gerrit/issues/detail?id=4791 to see if this is a gerrit issue... [15:10:23] _joe_ so without this one if mw1239 gets selected it is repooled [15:10:45] I mean, if something calls restart service.. like your script :D [15:10:46] <_joe_> yes [15:11:09] all right so I'll re-pool it and wait for the CR to be merged [15:11:15] it is not super urgent [15:11:20] thanks for the heads up [15:11:52] (03CR) 10BBlack: [C: 031] "Better than it was before this commit! :)" [puppet] - 10https://gerrit.wikimedia.org/r/316788 (owner: 10Giuseppe Lavagetto) [15:12:17] (03CR) 10Mobrovac: [C: 031] "Very nice find!" [puppet] - 10https://gerrit.wikimedia.org/r/316788 (owner: 10Giuseppe Lavagetto) [15:13:42] aaaaand maybe I don't need to wait :D [15:14:18] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2728977 (10elukey) Please wait https://gerrit.wikimedia.org/r/316788 to be merged before proceeding :) [15:25:07] 06Operations, 06Release-Engineering-Team, 07HHVM: 2016-10-17 API cluster overload - https://phabricator.wikimedia.org/T148652#2729010 (10ori) [15:29:12] <_joe_> elukey: what happened on mw1239, btw? 
[15:29:21] <_joe_> oh I see the message [15:29:28] (03CR) 10Giuseppe Lavagetto: [C: 032] restart-service: repool only previously pooled services [puppet] - 10https://gerrit.wikimedia.org/r/316788 (owner: 10Giuseppe Lavagetto) [15:29:38] (03PS2) 10Giuseppe Lavagetto: restart-service: repool only previously pooled services [puppet] - 10https://gerrit.wikimedia.org/r/316788 [15:29:41] (03CR) 10Giuseppe Lavagetto: [V: 032] restart-service: repool only previously pooled services [puppet] - 10https://gerrit.wikimedia.org/r/316788 (owner: 10Giuseppe Lavagetto) [15:31:28] <_joe_> elukey: is mw1239 depooled now? [15:32:57] 06Operations, 06Release-Engineering-Team, 07HHVM: 2016-10-17 API cluster overload - https://phabricator.wikimedia.org/T148652#2729028 (10ori) Backtrace from mw1194: {F4626363} [15:34:00] wat? I set the ACL to ops only [15:35:39] 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2729029 (10dr0ptp4kt) Thanks, @elukey. This is a bit of a splinter question now: do I infer correctly that lighter weight data crunching (e.g., a python script that post-pr... 
[15:35:41] changed it to a secure paste [15:36:21] doesn't appear to contain anything sensitive but just in case [15:39:50] (03PS1) 10Giuseppe Lavagetto: restart-service: use service, not title [puppet] - 10https://gerrit.wikimedia.org/r/316812 [15:40:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] restart-service: use service, not title [puppet] - 10https://gerrit.wikimedia.org/r/316812 (owner: 10Giuseppe Lavagetto) [15:40:16] (03PS2) 10Giuseppe Lavagetto: restart-service: use service, not title [puppet] - 10https://gerrit.wikimedia.org/r/316812 [15:40:19] (03CR) 10Giuseppe Lavagetto: [V: 032] restart-service: use service, not title [puppet] - 10https://gerrit.wikimedia.org/r/316812 (owner: 10Giuseppe Lavagetto) [15:48:13] (03CR) 10Matěj Suchánek: Show changes from last 14 days in watchlist in cswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) [15:52:17] !log esams cache_misc: seamless restarts for nginx [15:55:50] (03PS1) 10Giuseppe Lavagetto: restart-service: consider the case the server is not in the pool [puppet] - 10https://gerrit.wikimedia.org/r/316813 [15:56:21] (03PS2) 10Giuseppe Lavagetto: restart-service: do not fail if the server is not found in conftool [puppet] - 10https://gerrit.wikimedia.org/r/316813 [15:57:39] (03CR) 10Giuseppe Lavagetto: [C: 032] restart-service: do not fail if the server is not found in conftool [puppet] - 10https://gerrit.wikimedia.org/r/316813 (owner: 10Giuseppe Lavagetto) [15:58:31] !log Issuing secure erase on cp3021 sdb [15:58:31] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2729061 (10Eevans) [16:01:53] Is email servers for dewiki having issues at the moment? 
[16:03:11] (03CR) 10Gehel: service::node: Adding minimal test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316560 (owner: 10Gehel) [16:05:37] (03PS9) 10Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 [16:05:39] (03PS9) 10Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 [16:05:41] (03PS9) 10Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 [16:05:44] (03PS5) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [16:05:46] (03PS6) 10Alexandros Kosiaris: icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 [16:05:48] (03PS6) 10Alexandros Kosiaris: naggen2: Kill hostextinfo support [puppet] - 10https://gerrit.wikimedia.org/r/315243 [16:05:50] (03PS6) 10Alexandros Kosiaris: Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 [16:05:52] (03PS1) 10Alexandros Kosiaris: icinga: Bring the web part up to 2.4 standards [puppet] - 10https://gerrit.wikimedia.org/r/316814 [16:05:54] (03PS1) 10Alexandros Kosiaris: icinga: Fix permissions of a few directories [puppet] - 10https://gerrit.wikimedia.org/r/316815 (https://phabricator.wikimedia.org/T110893) [16:09:39] 06Operations, 10Cassandra, 06Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2729072 (10Eevans) [16:09:54] 06Operations, 10Cassandra, 06Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2729085 (10Eevans) p:05Triage>03Normal [16:11:10] 06Operations, 10Cassandra, 06Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2729072 (10Eevans) [16:11:23] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:14:33] (03PS1) 10Madhuvishy: maps nfs: Remove mount at /project sourced from labstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/316816 (https://phabricator.wikimedia.org/T147657) [16:17:37] (03CR) 10Madhuvishy: [C: 032] maps nfs: Remove mount at /project sourced from labstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/316816 (https://phabricator.wikimedia.org/T147657) (owner: 10Madhuvishy) [16:22:13] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:23:36] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Bring the web part up to 2.4 standards [puppet] - 10https://gerrit.wikimedia.org/r/316814 (owner: 10Alexandros Kosiaris) [16:23:40] (03PS2) 10Alexandros Kosiaris: icinga: Bring the web part up to 2.4 standards [puppet] - 10https://gerrit.wikimedia.org/r/316814 [16:23:42] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Bring the web part up to 2.4 standards [puppet] - 10https://gerrit.wikimedia.org/r/316814 (owner: 10Alexandros Kosiaris) [16:23:53] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Fix permissions of a few directories [puppet] - 10https://gerrit.wikimedia.org/r/316815 (https://phabricator.wikimedia.org/T110893) (owner: 10Alexandros Kosiaris) [16:23:57] (03PS2) 10Alexandros Kosiaris: icinga: Fix permissions of a few directories [puppet] - 10https://gerrit.wikimedia.org/r/316815 (https://phabricator.wikimedia.org/T110893) [16:24:00] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Fix permissions of a few directories [puppet] - 10https://gerrit.wikimedia.org/r/316815 (https://phabricator.wikimedia.org/T110893) (owner: 10Alexandros Kosiaris) [16:30:40] !log depooling cp3009 (esams cache_misc), possible HW issues [16:30:59] still no morebots? 
[16:32:03] Nope [16:33:22] I'll restart it [16:34:26] (03PS1) 10Alexandros Kosiaris: icinga: Remove the access_compat removal [puppet] - 10https://gerrit.wikimedia.org/r/316818 [16:34:39] (03PS1) 10Madhuvishy: maps nfs: Change resource naming for absenting mounts to hack around duplicate resource decl error [puppet] - 10https://gerrit.wikimedia.org/r/316819 [16:34:46] (03PS2) 10Alexandros Kosiaris: icinga: Remove the access_compat removal [puppet] - 10https://gerrit.wikimedia.org/r/316818 [16:34:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Remove the access_compat removal [puppet] - 10https://gerrit.wikimedia.org/r/316818 (owner: 10Alexandros Kosiaris) [16:35:24] (03PS2) 10Madhuvishy: maps nfs: Change resource naming for absenting mounts to hack around duplicate resource decl error [puppet] - 10https://gerrit.wikimedia.org/r/316819 [16:36:08] morebots: hello? [16:36:09] I am a logbot running on tools-exec-1410. [16:36:09] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:36:09] To log a message, type !log . [16:36:25] !log depooling cp3009 (esams cache_misc), possible HW issues [16:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:57] (03CR) 10Madhuvishy: [C: 032] maps nfs: Change resource naming for absenting mounts to hack around duplicate resource decl error [puppet] - 10https://gerrit.wikimedia.org/r/316819 (owner: 10Madhuvishy) [16:39:21] Did we log all msgs that werent logged while morebots wasnt in here? 
[16:39:28] If not im willing to help [16:39:36] (03PS5) 10Giuseppe Lavagetto: kubernetes: introduce 1st-stage worker role [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) [16:39:38] (03PS2) 10Giuseppe Lavagetto: kubernetes: install kubernetes1001-4 as worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/315923 (https://phabricator.wikimedia.org/T147933) [16:42:42] Zppix|mobile: you should be able to get the data from https://tools.wmflabs.org/sal/production [16:44:00] did the bot get flood banned for all those conftool actions? [16:44:19] Banned by BanBot [16:44:26] oh fuck banbot [16:44:59] morebots was gone from here for atleast 1-?2? Hours [16:45:59] More then 1 to 2 hours [16:46:11] like 3 - 4 hours [16:46:11] It missed things from 13:31 to 16:36 [16:47:53] Want me to log them if i can?? Or does someone else have too [16:48:00] (03CR) 10Yuvipanda: [C: 04-1] "I looked at the contents of the docker engine class itself, and it lgtm (assuming the lvm module works as intended)! It'll be a bit of wor" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [16:48:10] Zppix|mobile: go for it! and thanks [16:48:21] Nm [16:48:22] Np [16:48:24] * [16:48:25] subbu: would it be possible to reconstruct the API request that resulted in this log entry in parsoid's main.log? https://dpaste.de/y2Fi/raw [16:49:38] Platonides: can you get banbot out of here or at least teach it that we like logmsgbot? [16:49:48] and morebots [16:50:40] Someone i think is gonna whitelist morebots serv [16:51:00] What timezone is that you gave the timeframes in utc? [16:51:27] oh, I think it's to https://www.mediawiki.org/wiki/Extension:ParsoidBatchAPI#preprocess [16:51:38] Zppix|mobile: yes, UTC. You can get the messages that need to be logged and their timestamps from https://tools.wmflabs.org/sal/production [16:52:23] ^ so all thoose need logged correct. 
[16:53:40] Zppix|mobile: all of them that aren't on https://wikitech.wikimedia.org/wiki/Server_Admin_Log already. [16:53:50] Ok [16:54:54] !log 15:58 Issuing secure erase on cp3021 sdb [16:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:16] it failed, anyway :) [16:56:11] Zppix|mobile: oh. just edit the wiki [16:56:22] Oh [17:00:09] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2729176 (10Joe) >>! In T147718#2725420, @Andrew wrote: > Just to confirm my understanding... the proposal is basically to rename everything currently ca... [17:01:21] Ill finish up in a bit when i get to my laptop so i can do it all in one go it should be within the next 25 mins [17:04:18] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3040714 keys - replication_delay is 0 [17:05:28] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2729187 (10Joe) >>! In T147718#2725812, @BBlack wrote: [CUT] > # How does labs work with the above? I imagine we don't use labs-specific roles, but we... [17:09:29] 06Operations, 10ops-codfw, 10DBA: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2729189 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [17:09:31] (03CR) 10Paladox: "Oh, what should I do, i doint know what i should put in ensure?" 
[puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [17:12:30] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security and fluorine for Petr - https://phabricator.wikimedia.org/T148473#2729214 (10elukey) [17:12:32] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security for Petr. - https://phabricator.wikimedia.org/T148476#2729212 (10elukey) 05Open>03Resolved Pchelolo added, confirmed on IRC that he is able to access. Please note: the chan is +r so even if the user has no clo... [17:13:39] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2729234 (10BBlack) >>! In T147718#2727431, @BBlack wrote: > What are good examples of data that naturally belongs in a profile:: namespace? Answering m... [17:15:52] !log depooled mw1239.eqiad.wmnet to allow hw investigation (T148421) (was done today but didn't logged properly) [17:15:53] T148421: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421 [17:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:17:14] elukey: when i get to a pc within next 5-10 mins imma log manually what morebots missed just so your aware [17:18:25] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2729247 (10elukey) Got some requests logged with the new query string but no trace of backend tags. Maybe this i... [17:19:40] (03CR) 10Mobrovac: [C: 031] service::node: Adding minimal test [puppet] - 10https://gerrit.wikimedia.org/r/316560 (owner: 10Gehel) [17:22:11] Zppix|mobile: thanks! 
This case was a PEBCAK because I didn't put the log command in front of the message :D [17:23:46] Operations, Ops-Access-Requests, Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2729261 (mobrovac) Given that Petr is an essential member of the Services team and is involved a lot with the EventBus system which has its own extension, I think we should add... [17:24:01] Ah [17:24:12] (PS2) Ori.livneh: Enable AbuseFilterCachingParser by default [mediawiki-config] - https://gerrit.wikimedia.org/r/316364 [17:25:10] I'm an adopter of the 'wrong channel' pebcak, myself... do it often enough [17:25:57] (CR) Ori.livneh: [C: 2] Enable AbuseFilterCachingParser by default [mediawiki-config] - https://gerrit.wikimedia.org/r/316364 (owner: Ori.livneh) [17:26:25] (Merged) jenkins-bot: Enable AbuseFilterCachingParser by default [mediawiki-config] - https://gerrit.wikimedia.org/r/316364 (owner: Ori.livneh) [17:30:37] !log ori@mira Synchronized wmf-config/InitialiseSettings.php: Ieb8cdab9: Enable AbuseFilterCachingParser by default (duration: 01m 01s) [17:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:09] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2729288 (Eevans) [17:33:43] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2725296 (Eevans) p:High>Normal [17:33:49] !log cp1008 / pinkunicorn reboot [17:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:16] !log rebooting xenon/cerium/praseodymium to new kernels [17:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:37] ACKNOWLEDGEMENT - MD RAID on cp1008 is
CRITICAL: Connection refused or timed out nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T148659 [17:36:40] 06Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T148659#2729303 (10ops-monitoring-bot) [17:38:17] (03PS1) 10Ori.livneh: Disable AbuseFilterCachingParser on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316825 (https://phabricator.wikimedia.org/T148660) [17:39:04] 06Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T148659#2729326 (10Volans) 05Open>03Invalid Icinga raid_handler didn't detect this as a timeout. Sending a fix. [17:39:21] (03CR) 10Ori.livneh: [C: 032] Disable AbuseFilterCachingParser on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316825 (https://phabricator.wikimedia.org/T148660) (owner: 10Ori.livneh) [17:39:47] 06Operations, 06Discovery, 06Maps, 06Services (watching), 15User-mobrovac: Update Node on Maps to v4.6.0 - https://phabricator.wikimedia.org/T148661#2729329 (10mobrovac) [17:39:49] (03Merged) 10jenkins-bot: Disable AbuseFilterCachingParser on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316825 (https://phabricator.wikimedia.org/T148660) (owner: 10Ori.livneh) [17:41:03] !log ori@mira Synchronized wmf-config/InitialiseSettings.php: I6d28e534: Disable AbuseFilterCachingParser on bgwiki (T148660) (duration: 00m 50s) [17:41:04] T148660: Stack overflow in AbuseFilter when using AbuseFilterCachingParser - https://phabricator.wikimedia.org/T148660 [17:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:18] mutante it dosent seem https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network that is updating [17:41:29] 
https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [17:41:56] it shows 17:40, whereas shouldn't it show 18:41 [17:42:30] Oh never mind [17:42:38] paladox: it's 17:42 UTC [17:42:41] it's in UTC, and doesn't use my current time [17:42:44] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2724106 (10AlexMonk-WMF) Note deployment requires operations meeting approval (it has sudo privileges) whereas mw-log-readers is just the 3 day wait [17:42:48] yes, that [17:42:48] yep [17:42:52] yep [17:43:15] that little white gap on the left is normal [17:43:19] eh, on the right [17:45:37] (03PS1) 10Madhuvishy: maps nfs: Symlink project and home to mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/316826 [17:45:45] yep [17:52:40] (03PS1) 10Volans: Icinga: improve raid_handler false alarm detection [puppet] - 10https://gerrit.wikimedia.org/r/316827 (https://phabricator.wikimedia.org/T142085) [18:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161019T1800). [18:00:32] Morning SWAT??
[18:01:14] it's always morning somewhere [18:01:22] (03PS2) 10Dzahn: Phab: Use correct location for phd's homedir [puppet] - 10https://gerrit.wikimedia.org/r/316474 (owner: 10Chad) [18:02:42] !log T133395: RESTBase: Altering keyspace local_group_wikipedia_T_parsoid_html.data to enable time-window compaction [18:02:43] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [18:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:11] (03PS1) 10Yuvipanda: Add git to the toollabs base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316828 [18:06:21] (03CR) 10Dzahn: [C: 032] Phab: Use correct location for phd's homedir [puppet] - 10https://gerrit.wikimedia.org/r/316474 (owner: 10Chad) [18:06:21] valhallasw`cloud: ^ [18:06:52] * mutante watches puppet run on phab server to fix that home dir [18:07:13] and .. it fails. sigh [18:07:57] (03CR) 10Merlijn van Deen: [C: 032] Add git to the toollabs base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316828 (owner: 10Yuvipanda) [18:10:34] (03PS1) 10Dzahn: Revert "Phab: Use correct location for phd's homedir" [puppet] - 10https://gerrit.wikimedia.org/r/316829 [18:10:59] (03PS1) 10Yuvipanda: pythons: Add -dev packages for libxml2 into image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316830 (https://phabricator.wikimedia.org/T140117) [18:11:07] Zppix|mobile: the relevant timezone for "morning" and "evening" SWAT is the San Francisco one.
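The timezone confusion in this stretch of the log (Ganglia graphs in UTC, SWAT windows named after San Francisco time) can be checked with a small conversion. The following is only a sketch: it assumes the 18:00 UTC announcement timestamp from the log, and the IANA zone names are illustrative guesses at where participants might be, not something the log states. The helper name `local_time` is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The "Morning SWAT" announcement in the log is stamped 18:00 UTC on 2016-10-19.
swat_utc = datetime(2016, 10, 19, 18, 0, tzinfo=timezone.utc)


def local_time(moment, zone_name):
    """Convert an aware datetime to the given IANA zone (hypothetical helper)."""
    return moment.astimezone(ZoneInfo(zone_name))


# Zones chosen for illustration only; DST rules come from the tz database.
zones = (
    "America/Los_Angeles",  # San Francisco
    "America/New_York",
    "Europe/London",
    "Europe/Paris",
)

for zone in zones:
    print(f"{zone:22} {local_time(swat_utc, zone):%H:%M %Z}")
```

Under these assumptions the San Francisco line should read 11:00, which agrees with "it is 11am" in the channel, and the gap between Paris and San Francisco comes out at nine hours.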
[18:11:24] Its not morn in san fran [18:11:26] (03CR) 10Dzahn: [C: 032] "need to switch this during a maintenance window (not during the releng offsite)" [puppet] - 10https://gerrit.wikimedia.org/r/316829 (owner: 10Dzahn) [18:11:32] And before the EU SWAT window, this morning SWAT windows was at 8h SF time [18:11:57] er yes, it is 11am [18:12:15] If you're in Europe, -6 is only for New York, for SF it's -9 [18:12:18] what do you call that time of day before noon? [18:12:49] (03PS2) 10Yuvipanda: pythons: Add -dev packages for libxml2 into image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316830 (https://phabricator.wikimedia.org/T140117) [18:13:03] mutante: brunch time [18:13:11] (03CR) 10Merlijn van Deen: [C: 032] pythons: Add -dev packages for libxml2 into image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316830 (https://phabricator.wikimedia.org/T140117) (owner: 10Yuvipanda) [18:13:26] Oh ya forgot sf was 2 hrs behind (damn dst) [18:13:31] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): User[phd] [18:13:34] So no change for morning SWAT. [18:13:45] SWAT is done so. [18:14:36] valhallasw`cloud: I think there's a bunch of dependent patches to it that also need to be +2'd :D [18:14:42] arseny92: ping? [18:15:36] (03CR) 10Dereckson: "This change should be rescheduled in a new SWAT window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: 10Huji) [18:16:03] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:16:06] (03CR) 10Merlijn van Deen: [C: 032] Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) (owner: 10Yuvipanda) [18:16:20] (03CR) 10Merlijn van Deen: [C: 032] Switch jessie continer locale to en_US too [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315708 (owner: 10Yuvipanda) [18:16:25] (03Merged) 10jenkins-bot: Add legacy trusty container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315706 (https://phabricator.wikimedia.org/T148054) (owner: 10Yuvipanda) [18:16:37] (03Merged) 10jenkins-bot: Switch jessie continer locale to en_US too [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/315708 (owner: 10Yuvipanda) [18:16:43] (03Merged) 10jenkins-bot: Add git to the toollabs base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316828 (owner: 10Yuvipanda) [18:16:46] (03PS6) 10Paladox: Enable JVM heap log to debug gerrit slowing down [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) [18:16:58] (03PS1) 10Yuvipanda: python2: Add packages to built MySQLdb-python [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316832 (https://phabricator.wikimedia.org/T140112) [18:17:02] valhallasw`cloud: ^ one more [18:17:10] valhallasw`cloud: and then I'll kick off a build [18:17:18] (03CR) 10Dzahn: "requires stopping phd completely, then running puppet, then starting it again, maybe more. let's do it during a maintenance window / phab " [puppet] - 10https://gerrit.wikimedia.org/r/316474 (owner: 10Chad) [18:18:18] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:18:58] (03CR) 10Chad: "Why does it need a restart? phd shouldn't care all that much where its homedir is, this is mainly for tidying up and making user SSH confi" [puppet] - 10https://gerrit.wikimedia.org/r/316474 (owner: 10Chad) [18:19:46] yuvipanda: no --no-dependencies thing there? [18:19:57] (the one you added for git) [18:20:04] --no-suggested, probably [18:20:14] valhallasw`cloud: no, since for -dev we want them to bring 'em in [18:20:19] since I think that'll bring in the library itself [18:20:25] ohhh [18:20:38] (03CR) 10Merlijn van Deen: [C: 032] python2: Add packages to built MySQLdb-python [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316832 (https://phabricator.wikimedia.org/T140112) (owner: 10Yuvipanda) [18:20:54] (03Merged) 10jenkins-bot: pythons: Add -dev packages for libxml2 into image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316830 (https://phabricator.wikimedia.org/T140117) (owner: 10Yuvipanda) [18:21:22] (03Merged) 10jenkins-bot: python2: Add packages to built MySQLdb-python [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316832 (https://phabricator.wikimedia.org/T140112) (owner: 10Yuvipanda) [18:21:43] valhallasw`cloud: I kicked off a build [18:22:20] (03PS1) 10Yuvipanda: Fix typo in build.py [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316833 [18:22:24] valhallasw`cloud: ^ too :) [18:24:00] valhallasw`cloud: it's runnning now, you'll have git in it in about 20mins when it's all done building (and you've to restart your containers) [18:24:48] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2729494 (10Eevans) >>! In T147926#2728606, @Eevans wrote: > These nodes have half the RAM of the proposed AMS nodes, and (I just learned), have H310 raid controllers, which are apparently... 
[18:26:12] (03CR) 10Merlijn van Deen: [C: 032] Fix typo in build.py [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316833 (owner: 10Yuvipanda) [18:26:18] yuvipanda: ok! [18:26:35] (03Merged) 10jenkins-bot: Fix typo in build.py [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/316833 (owner: 10Yuvipanda) [18:27:43] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2708643 (10mobrovac) Indeed, @Eevans . Getting the nodes just for the sake of it doesn't seem prudent, especially since it would not allow us to synthesise production load no more than we... [18:31:03] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:33:20] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2708643 (10Dzahn) reading "have role spare" and then "return to spare pool", doesn't that mean it's already in the spare pool? 
[18:33:50] !log restarting stuck Jenkins [18:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:35:24] (03PS3) 10Addshore: Enable simple-json-datasource on prod Grafana [puppet] - 10https://gerrit.wikimedia.org/r/314029 (https://phabricator.wikimedia.org/T147329) [18:37:04] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2729570 (10Gehel) a:03Gehel [18:38:31] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2729573 (10Gehel) [18:38:33] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2384405 (10Gehel) [18:40:23] 06Operations, 06Discovery, 06Maps, 10Maps-data, 03Interactive-Sprint: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2729576 (10Gehel) [18:42:44] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:43:46] bblack jenkins wont shut down unles you cancel the jobs [18:43:47] at [18:43:48] https://integration.wikimedia.org/ci/ [18:43:51] Which are stuck [18:43:52] 06Operations, 10RESTBase, 10RESTBase-Cassandra, 06Services (done): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#2729579 (10Eevans) >>! In T94329#2709602, @GWicke wrote: > @Eevans, is there anything actionable left to do here? @GWicke, the only remaining blocker is {T92471},... 
[18:44:13] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 06Services (done): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#2729583 (10Eevans) [18:44:37] https://integration.wikimedia.org/ci/job/composer-php55-trusty/ [18:44:53] https://integration.wikimedia.org/ci/job/npm-node-4/10836/console [18:44:59] https://integration.wikimedia.org/ci/job/composer-php55-trusty/424/console [18:45:03] they need canceling [18:45:10] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) [18:48:10] You might want to cancel https://integration.wikimedia.org/ci/job/performance-webpagetest-wmf/2851/console [18:48:43] paladox: we got it, being discussed in another channel [18:48:50] ok [18:49:05] thanks [18:55:04] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:10:56] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:12:47] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2729681 (10GWicke) >>! In T148475#2729261, @mobrovac wrote: > Given that Petr is an essential member of the Services team and is involved a lot with the EventBus system which has... [19:15:57] 07Puppet, 06Release-Engineering-Team: Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607#2729719 (10greg) This needs a home not in a team project. [19:16:27] 07Puppet, 06Release-Engineering-Team: Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607#2729723 (10greg) (and not just in #puppet either as that's just a tag. Traditionally ops puppet means adding #operations). 
[19:18:31] !log installing new kernel packages on cp* [19:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:03] 06Operations, 06Release-Engineering-Team, 07HHVM, 07Wikimedia-Incident: 2016-10-17 API cluster overload - https://phabricator.wikimedia.org/T148652#2729732 (10greg) [19:22:46] (03PS2) 10Dzahn: Convert ferm::rule deployment-ssh in nova to ferm service [puppet] - 10https://gerrit.wikimedia.org/r/316739 (owner: 10Muehlenhoff) [19:24:20] (03CR) 10Dzahn: [C: 032] "yep, this should be 100% the same as before. hosts affected: silver, labtestweb2001 will double check" [puppet] - 10https://gerrit.wikimedia.org/r/316739 (owner: 10Muehlenhoff) [19:26:13] !log installing new kernel packages on authdns [19:26:18] !log installing new kernel packages on lvs:secondary [19:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:37] (03CR) 10Dzahn: "yup, the ferm config changes (as expected) but the iptables result is the same as before, no diff" [puppet] - 10https://gerrit.wikimedia.org/r/316739 (owner: 10Muehlenhoff) [19:30:29] !log upgrading nginx+openssl on remaining cache nodes (eqiad+esams/text+upload) - T144523 [19:30:30] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523 [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:34:08] (03PS2) 10Dzahn: role::mirrors: Move base::firewall include in the role [puppet] - 10https://gerrit.wikimedia.org/r/316782 (owner: 10Muehlenhoff) [19:35:14] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:36:53] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Testing on Production, and 3 others: Banner not showing up on site - 
https://phabricator.wikimedia.org/T144952#2729791 (10AndyRussG) Quoting @aaron's response on IRC about the [[ https://phabricator.wikimedia.or... [19:37:06] (03CR) 10Dzahn: [C: 032] role::mirrors: Move base::firewall include in the role [puppet] - 10https://gerrit.wikimedia.org/r/316782 (owner: 10Muehlenhoff) [19:39:28] (03CR) 10Dzahn: "If you want to install it, use "present"." [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:40:06] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:40:19] (03PS4) 10Paladox: Install postgresql on ci in php.pp [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:40:22] (03PS5) 10Paladox: Install postgresql on ci in php.pp [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:40:28] (03CR) 10Paladox: "> If you want to install it, use "present"." [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:42:52] (03CR) 10Paladox: "@Hashar would you be able to review this when ever your free please?" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:43:32] (03CR) 10Paladox: "+1 will do please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:47:31] (03CR) 10BBlack: [C: 031] kibana - allow access to both /status and /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316771 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:47:46] (03CR) 10BBlack: [C: 031] kibana - change probe URL to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316772 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:47:56] (03CR) 10BBlack: [C: 031] kibana - only allow unauthenticated access to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316773 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:48:36] (03CR) 10BBlack: [C: 031] kibana - move to an LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316774 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:49:05] (03CR) 10BBlack: [C: 031] kibana - activate icinga check on new LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316775 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:49:23] (03CR) 10BBlack: [C: 031] kibana - configure varnish to use new LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/316776 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:50:00] bblack: thanks for the reviews! A bit late to deploy tonight, but I'll do that tomorrow... [19:51:22] gehel: awesome work :) [19:51:42] bblack: awesome if probably a bit too strong ... but thanks! [19:53:16] (03PS1) 10Chad: Add Tyler's GPG key cuz he's gonna do releasin' stuffz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316839 [19:56:50] !log installing new kernel packages on lvs:primary [19:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:20] Who is joinning with movie names and tv names [19:58:26] ? 
[19:58:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3051665 keys - replication_delay is 626 [19:59:02] I'm more curious about what rules banbot is using to select them as bad so quickly [19:59:17] IPs from previous abuse? [19:59:32] bblack there was a MaggieSimpson in #mediawiki [19:59:35] with threats [20:00:05] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161019T2000). Please do the needful. [20:00:21] yeah what I mean is, what are the criteria banbot is using to autoban these ones popping up here now? [20:00:32] I think it remembers users' IPs [20:00:36] ok [20:00:43] So the ones on its ban list it blocks [20:00:55] but I'm really guessing and Platonides should know more [20:01:15] It should not affect legitimate users in here [20:01:34] Nope, it shouldn't, but it does, and I don't know why [20:01:40] paladox: but where does the ban list come from? [20:01:44] are those IPs banned on wiki? [20:02:21] mutante I don't know where the list comes from, I think Platonides adds the IPs to the list but I am guessing [20:04:49] maybe BartSimpson is a morebots sockpuppet and the bot caught them sharing an IP!
:) [20:06:22] Well ^^ i think that one comes from a movie [20:06:26] and it has Bart [20:06:34] yes maybe [20:07:12] mutante ^^ [20:07:14] LOL [20:09:01] (03CR) 10Chad: [C: 032] Add Tyler's GPG key cuz he's gonna do releasin' stuffz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316839 (owner: 10Chad) [20:09:28] (03Merged) 10jenkins-bot: Add Tyler's GPG key cuz he's gonna do releasin' stuffz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316839 (owner: 10Chad) [20:11:22] !log demon@mira Synchronized docroot/mediawiki/keys/: adding tylers key (duration: 01m 09s) [20:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:38] 06Operations, 10Ops-Access-Requests: Add Tyler Cipriani to mediawiki-releases - https://phabricator.wikimedia.org/T148681#2729869 (10demon) [20:12:48] 06Operations, 10Ops-Access-Requests, 06Release-Engineering-Team: Add Tyler Cipriani to mediawiki-releases - https://phabricator.wikimedia.org/T148681#2729882 (10demon) [20:12:56] that's legit, we're in person here, verifying things :) ^^^ [20:13:03] (the adding to keys.html) [20:13:26] 06Operations, 10Ops-Access-Requests, 06Release-Engineering-Team: Add Tyler Cipriani to mediawiki-releases - https://phabricator.wikimedia.org/T148681#2729884 (10greg) Approved, yes please :) [20:15:07] 06Operations, 10Ops-Access-Requests, 06Release-Engineering-Team: Add Tyler Cipriani to releasers-mediawiki - https://phabricator.wikimedia.org/T148681#2729887 (10demon) [20:15:42] (03PS1) 10Chad: WIP: Adding Tyler to releasers-mediawiki group [puppet] - 10https://gerrit.wikimedia.org/r/316843 (https://phabricator.wikimedia.org/T148681) [20:15:52] (03CR) 10Chad: [C: 04-1] WIP: Adding Tyler to releasers-mediawiki group [puppet] - 10https://gerrit.wikimedia.org/r/316843 (https://phabricator.wikimedia.org/T148681) (owner: 10Chad) [20:18:10] grrrit-wm: poor thcipriani [20:18:16] damn it [20:18:19] greg-g: ^^ [20:18:34] Reedy ? 
[20:24:23] Reedy: :) yup [20:34:43] (03PS2) 10Kaldari: Switching 10 wikis to numeric category collation per T146675 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316486 (https://phabricator.wikimedia.org/T146675) [20:36:20] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical [20:36:44] (03PS2) 10Dzahn: restbase: move standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/315880 [20:40:14] (03CR) 10Dzahn: [C: 032] restbase: move standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/315880 (owner: 10Dzahn) [20:48:24] !log JOIN #wikimedia-ayuda [20:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:35] <|L> Krenair: can you take a look at this: ^ [20:50:02] LOL [20:50:17] |L, look at what? [20:50:20] (03PS2) 10Dzahn: add mapped IPv6 address for krypton [puppet] - 10https://gerrit.wikimedia.org/r/316041 [20:50:27] Krenair: Remove from SAL and twitter I guess [20:50:41] Ive removed it from SAL [20:50:50] <|L> see PM [20:51:48] done [20:52:09] <|L> thx [20:55:48] Why is there so many trolls today [20:56:06] <|L> not many [20:56:10] <|L> just many attemps of one [20:58:38] They've been around a while [20:58:39] on and off [20:59:32] !log starting mobileapps deploy [20:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:20] !log deployed mobileapps 2551db4 [21:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:06] 06Operations, 10Traffic: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523#2730003 (10BBlack) 05Open>03Resolved a:03BBlack Done for now, assuming we don't find a reason to revert! 
[21:10:36] 06Operations, 10Traffic: Extend check_sslxnn to check OCSP Stapling - https://phabricator.wikimedia.org/T148490#2730007 (10BBlack) a:05BBlack>03None [21:14:10] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2730013 (10BBlack) [21:14:12] 06Operations, 10Traffic: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2730015 (10BBlack) [21:14:49] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1149975 (10BBlack) Seems like the nginx-internal + ssl_stapling_proxy path is the way to go, so folding the other subtask back in (alre... [21:15:31] 06Operations, 10Traffic: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715770 (10BBlack) [21:15:33] 06Operations, 10Traffic: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2730018 (10BBlack) [21:27:31] (03PS1) 10BBlack: ssl_ciphersuite: re-order AES vs ECDSA priorities [puppet] - 10https://gerrit.wikimedia.org/r/316889 (https://phabricator.wikimedia.org/T144626) [21:27:33] (03PS1) 10BBlack: ssl_ciphersuite: commentary update re: chapoly [puppet] - 10https://gerrit.wikimedia.org/r/316890 (https://phabricator.wikimedia.org/T144626) [21:27:35] (03PS1) 10BBlack: ssl_ciphersuite: switch AES bits order for GCM [puppet] - 10https://gerrit.wikimedia.org/r/316891 (https://phabricator.wikimedia.org/T144626) [21:29:36] yuvipanda: hi! 
[21:29:45] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: re-order AES vs ECDSA priorities [puppet] - 10https://gerrit.wikimedia.org/r/316889 (https://phabricator.wikimedia.org/T144626) (owner: 10BBlack) [21:29:49] hello awight [21:30:28] yuvipanda: I was hoping you could remind me what the conclusion was to the discussion around an extension on production wikis relying on ORES as a labs service... [21:30:49] awight: we deployed ORES in production and mediawiki extensions now hit that [21:30:52] Did you have to put ORES on production? Or is it okay to not do that cos ORES is a beta feature? [21:31:00] yup, we had to put ORES In production [21:31:02] ok cool. [21:31:22] Wondering because I'm trying to help WM Taiwan with this fun ideographic description sequence plugin [21:31:48] pretty much the same deal as latex rendering [21:32:25] But the rendering backend is written in Java, so it would be a long negotiation to get it onto production hardware AFAIK [21:33:58] awight: production services aren't allowed to depend on non-production services, afaik [21:34:10] I've been daydreaming about how the generated images will be cached somewhere productiony, so the impact of labs downtime would only be that new sequences couldn't be rendered until the service is back up [21:34:13] aka, labs is a no-go due to level of service that is provided [21:34:27] awight: there is some sort of virtualization in the prod network now that you might be able to get onto, to avoid the hardware problem (Assuming you don't need a ton of resources) [21:34:36] p858snake|L2: That seems quite reasonable, but perhaps there are workarounds? [21:34:44] awight: we have had java in production before iirc, and there is currently a mailing list thread about that somewhere [21:34:53] ebernhardson: Yes that's a great start. 
This would be a very low traffic service AIUI [21:34:56] awight: yeap, move it into production [21:35:53] we certainly have Java services, but as akosiaris mentioned in an email ops doesn't actually run any of those (gerrit, wdqs, elasticsearch, maybe others) [21:35:57] p858snake|L2: I started one a few days ago, but unfortunately the historical precedent I've noticed is that maxsem was unable to get his Hierator deployed [21:36:05] so they kinda need a long-term owner [21:37:08] ebernhardson: Agreed, it would be a problem if some stuff embedded in wikitext was suddenly unsupported. [21:37:18] ganeti is the name of the virtualisation system for production [21:37:28] Krenair: ty, that's helpful [21:37:50] awight: it's not a no, it's a "we need someone that can look after it" [21:38:15] PROBLEM - puppet last run on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:18] or a "rewrite it in something different that current ops can handle" type situation [21:38:46] p858snake|L2: Like, an op? Or you think a mere developer with some basic prod keys and puppet meddling will satisfy that? [21:39:08] mere developers can't do puppet meddling without the help of ops [21:39:18] p858snake|L2: That's the path I'd prefer, but this Java app is a bit of a beast and relies on a lot of prior art also written in Java [21:53:08] robh: any word on those misc nodes for kafka brokers? [21:53:28] lemme pull it up and see where I'm at on it [21:53:41] (03PS2) 10Volans: Icinga: improve raid_handler false alarm detection [puppet] - 10https://gerrit.wikimedia.org/r/316827 (https://phabricator.wikimedia.org/T142085) [21:55:23] ottomata: so the dell quote was in, but it's expired due to being over 30 days.
I'm emailing them now to refresh it [21:56:28] same with the quote from dasher, so checking with them both [21:56:45] if they are good, we have in the options, will summarize once i get a reply, sorry that you had to followup =] [21:56:57] i kind of fell into the juniper audit for support contract until today =P [21:56:58] (03CR) 10Volans: [C: 032] Icinga: improve raid_handler false alarm detection [puppet] - 10https://gerrit.wikimedia.org/r/316827 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [21:57:09] robh, salright [21:57:14] there's a free misc machine in eqiad, right? [21:57:19] can we go ahead and set that one up? [21:57:30] the codfw doesn't need to happen at the same time [21:58:54] ottomata: technically we need mark to sign off on the allocation, lemme check out the spare levels [21:59:00] since this sat awhile things may have changed [21:59:31] ok [21:59:33] thanks [22:00:02] looks like WMF4723 is still free. Let me spell out that we need that approval from him on the task and assign to him. once he approves (likely tomorrow), then I can get it spun up for ya [22:01:04] alright I'm on my pc, imma fix what morebots missed earlier today [22:01:41] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2730117 (10RobH) a:05RobH>03mark >>! In T145082#2620384, @RobH wrote: > First off, that is easily one of the best damned requests ever (in terms of... [22:01:54] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:02:13] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:05:29] thanks robh!
[22:05:50] quite welcome, hopefully tomorrow i'll be handing you a new server, heh [22:13:10] would jvm gc show as a user running on https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [22:13:11] ? [22:13:17] or system? [22:16:20] bblack, paladox: it has a few rules into play [22:16:30] oh [22:16:37] but the main basis is that [22:16:46] remembering banned fields from past abuse [22:17:08] oh [22:17:34] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:17:47] (PS1) Alexandros Kosiaris: check_ssl: Unbreak by not verifying server certs [puppet] - https://gerrit.wikimedia.org/r/316906 [22:20:25] (CR) Alexandros Kosiaris: "so, this unbreak specifically the cassandra ssl checks on jessie. An alternative and definitely better way in the long term would be to po" [puppet] - https://gerrit.wikimedia.org/r/316906 (owner: Alexandros Kosiaris) [22:27:23] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:33:58] Operations, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2730181 (awight) [22:40:32] Operations, Adminbot, Labs, Tool-Labs: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2730198 (Zppix) [22:41:03] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [22:55:02] !log JOIN #wikimedia-ayuda [22:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:55:43] thank you for K-lining that vandal :) [22:57:36] greg-g, around? [22:57:36] Krenair: You sent me a contentless ping. This is a contentless pong. 
Please provide a bit of information about what you want and I will respond when I am around. [22:57:54] lol [22:58:26] and of course that message resets idle time [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161019T2300). Please do the needful. [23:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:01] Krenair: I saw tweets from him mentioning wine, so I'd guess he's being social with the team [23:01:05] kaldari: ping? [23:01:07] . [23:01:11] here [23:01:16] Okay, I can SWAT. [23:01:17] bd808, oh, right, they have an offsite [23:01:30] arseny92: ah you added the votewiki change, good [23:01:32] I have something to add to the end of the SWAT window [23:01:32] jouncebot caches page or something? [23:01:50] arseny92: yeah. it does [23:01:52] indeed, so if you add in last minute a change, it won't announce it [23:02:00] stashbot: refresh [23:02:11] jouncebot, refresh [23:02:12] I refreshed my knowledge about deployments. [23:02:14] jouncebot, current [23:02:15] heh [23:02:20] was there a comment for that? hmm [23:02:21] jouncebot: now [23:02:21] For the next 0 hour(s) and 57 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161019T2300) [23:02:23] command* [23:02:41] it doesn't re-ping on now [23:03:10] i just woke up 30m ago, and so thought why not to act on that change :p [23:04:17] (CR) Krinkle: "Sounds like addDBData() is running before the unittest tables are created. Is this the same version of PHPUnit?" 
[puppet] - https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) (owner: Hashar) [23:04:24] (CR) Krinkle: "https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-php7/8/console" [puppet] - https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) (owner: Hashar) [23:05:39] Dereckson, I'll probably do my own patch in swat [23:05:56] okay [23:06:04] (CR) Dereckson: [C: 2] Switching 10 wikis to numeric category collation per T146675 [mediawiki-config] - https://gerrit.wikimedia.org/r/316486 (https://phabricator.wikimedia.org/T146675) (owner: Kaldari) [23:06:16] it's a wikitech thing so runs on different servers [23:06:35] (Merged) jenkins-bot: Switching 10 wikis to numeric category collation per T146675 [mediawiki-config] - https://gerrit.wikimedia.org/r/316486 (https://phabricator.wikimedia.org/T146675) (owner: Kaldari) [23:06:59] (PS6) Arseny1992: Reverting votewiki back to en [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: Huji) [23:07:00] yeah, I'm now a little used with that after Math and Flow [23:07:47] rebased [23:08:00] arseny92: by the way, there is a coming vote for en. or it's more a question to get a more universal language? [23:08:08] kaldari: live on mw1099 [23:08:20] ok, checking... [23:09:09] Dereckson: this will actually take a little while since I need to check all 10 wikis [23:09:51] err yes, to set it on universal lang [23:10:40] but if we're limited on time i guess that can be set to the lang of that vote: where is it? 
[23:12:15] if you were asking if there is a vote for en, then there isnt that I know of [23:13:07] (if you meant to be asking that, you mistyped it (there is -> is there) ) [23:14:46] !log JOIN #wikimedia-ayuda [23:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:24] arseny92: ok [23:16:03] reverted sal [23:16:09] (PS3) Elukey: Remove HHVM X-Powered-By header from static Apache responses [puppet] - https://gerrit.wikimedia.org/r/314519 [23:16:15] (CR) Krinkle: [C: 1] Remove HHVM X-Powered-By header from static Apache responses [puppet] - https://gerrit.wikimedia.org/r/314519 (owner: Elukey) [23:16:56] logbot need measures for who is allowed to log [23:17:12] we know [23:18:00] !log JOIN #wikimedia-ayuda [23:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:03] Dereckson: OK, all wikis look good. Feel free to sync. [23:19:19] kaldari: ack'ed [23:19:24] Dereckson: Then I'll run the updateCollation scripts for each [23:19:45] "ack'ed", I like that one :) [23:20:04] I stole it from th.cipriani [23:20:11] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Switching 10 more wikis to numeric category collation (T146675) (duration: 00m 59s) [23:20:12] T146675: Convert more wikis to numerical sorting - https://phabricator.wikimedia.org/T146675 [23:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:39] (PS7) Dereckson: Reverting votewiki back to en [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: Huji) [23:21:12] arseny92: no need to rebase, repo is fast forward, so a last rebase will often be needed [23:22:18] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: Huji) [23:22:45] (Merged) jenkins-bot: Reverting votewiki back to en 
[mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: Huji) [23:23:17] arseny92: live on mw1099. Have you already used X-Wikimedia-Debug? [23:24:19] a header I have to set? [23:24:23] right [23:24:31] There is an extension to automate the process. [23:24:34] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [23:25:15] The header can ask what server should serve the request (here mw1099) or to do some specific extra logging/profiling. [23:31:01] arseny92: ping? [23:33:53] (PS3) Dzahn: installserver: move standard include to role [puppet] - https://gerrit.wikimedia.org/r/315882 [23:34:26] seem to be working [23:35:00] ok [23:35:52] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Reverting votewiki back to English (T148352) (duration: 00m 50s) [23:35:53] T148352: Default language of votewiki set to Persian (fa) for anonymous users. Change to English (en) - https://phabricator.wikimedia.org/T148352 [23:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:01] Krenair: I'm done, up to you. [23:36:18] hmmmm [23:36:43] we'll need some minute for cache refresh it seems. [23:37:13] with https://vote.wikimedia.org/wiki/Main_Page?debug=true I've it in English, without still in Persan [23:37:17] I'd ask ops to ban the whole wiki from cache, otherwise that stuff can take up to 30 days [23:37:26] although... [23:37:36] votewiki, mostly only going to be logged-in users, right? 
[23:37:50] so maybe varnish doesn't matter too much [23:38:00] i see in on en already [23:38:14] Krenair: there isn't any election soon [23:38:30] without the debug things as i've been doing it manually in fiddler composer [23:38:53] and im anonymoys for that wiki [23:41:02] !log Host mw1239 is not in mediawiki-installation dsh group [23:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:09] what does that mean now [23:41:20] mutante, hm? [23:41:25] what does the mediawiki-installation dsh group do? [23:41:59] Krenair: i was more wondering where it is [23:42:13] and if that "generate dsh groups from puppet" ticket got resovled recently or so [23:42:24] /etc/dsh/group/mediawiki-installation [23:42:32] on tin and mira [23:42:58] ok, so that's the generated result [23:43:01] Yep [23:43:06] and the source for that? [23:44:45] 'cause last time i added one it was still files in puppet repo [23:45:06] mutante, modules/scap/templates/dsh/dsh-group.erb [23:45:13] modules/scap/manifests/dsh/group.pp [23:45:29] Date: Mon Oct 19 [23:45:41] but 2015, eh [23:45:44] scap uses the dsh group files to determine which hosts to scap to [23:45:55] what's happening with https://gerrit.wikimedia.org/r/#/c/316909/? Is that blocked by progress on a different swat patch? 
[23:45:55] i know [23:46:10] it's about the generation of the dsh group files [23:46:26] andrewbogott, I got sidetracked [23:46:34] ok :) [23:47:01] Krenair: thanks, looking at dsh-group.erb [23:47:04] this " scap: use conftool data to populate dsh groups [23:47:15] so populated from conftool, that's what i meant [23:47:34] so mw1239 is not right in conftool somehow [23:48:01] thank you ori [23:50:13] ori for example of the 2 batch requests, one might be something like this: https://eu.wikipedia.org/w/api.php?action=expandtemplates&text={{Frantziako%20udalerri%20infotaula%20INSEE|%20izena=%20%20Saint-R%C3%A9my-en-Rollat|armarria=%20%20|%20irudia%20=%20%20|%20irudiaren%20testua%20=%20|%20nazionalitatea%20=%20Okzitania%20%20|%20kokapena%20=%20kokapenmapa|%20dn%20=%2046.184662|%20de%20=%203.3917|%20INSEE=%2003258|%20postakodea%20=%2003110 [23:50:14] |%20azalera%20=%2021|%20web%20=%20}}&prop=wikitext|categories|properties|modules|jsconfigvars&format=json [23:50:20] andrewbogott, it should be on labtestweb2001/labtestwikitech now [23:50:43] I got rid of the newlines there however. [23:50:43] ah, I guess we need to fix the sidebar by hand [23:51:37] ori, but those are 2 different expandtemplates requests to the batch api extension [23:51:44] you can construct the second one similarly [23:53:11] Krenair: other than the sidebar, labtestwikitech looks ok to me [23:53:43] * subbu signs off irc again [23:53:45] I can deal with the sidebar on wikitech [23:54:01] Though my labtestalex account on labtestwikitech has no admin permissions [23:54:36] It's just wiki/MediaWiki:Sidebar/GROUP:somethingsomething right? 
[23:54:49] Yeah look at Special:PrefixIndex/MediaWiki:Sidebar [23:55:13] https://labtestwikitech.wikimedia.org/wiki/MediaWiki:Sidebar/Group:projectadmin [23:55:33] just that one [23:56:07] ok, removed [23:56:51] and you've done proper wikitech too [23:56:52] ok [23:58:26] !log krenair@mira Synchronized php-1.28.0-wmf.22/extensions/OpenStackManager: https://gerrit.wikimedia.org/r/#/c/316909/ (duration: 01m 00s) [23:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:39] andrewbogott,