[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171110T0000). [00:00:04] AaronSchulz: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:04:23] hehe, never noticed the "sticker" bit [00:07:28] 10Operations, 10Cloud-VPS, 10Traffic, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749641 (10bd808) [00:31:53] wait, why do we deploy today if tomorrow is a holiday? [00:35:20] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3599592 (10tstarling) >>! In T175672#3635865, @aaron wrote: >>>! In T175672#3635140,... [00:40:25] MaxSem: "reasons" [00:40:47] mostly to not let the train get too far behind given we're getting close to the holiday season where it will be forced to be [00:41:10] and yeah, maybe I should have cancelled swats :/ my bad [00:42:41] 10Operations, 10Discovery-Search, 10Patch-For-Review: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3749719 (10demon) Hmmm, nobody but wikipedia? I'm really wondering if we can drop the other backends from here. [00:43:50] ebernhardson: https://gerrit.wikimedia.org/r/#/c/390347/ will kill one of the sources of 500s (how many, I can't tell ya) [00:45:14] The 4xx ones probably don't matter [00:45:28] 301 is kinda ugly (is that some apache rewrite or http -> https redirect?) 
[00:45:40] no_justification: i would bet its http->https [00:45:46] My guess [00:45:47] but didn't look close enough. Easy to find out if important [00:45:47] too [00:46:08] The 503 can be discarded: just means the apache wasn't able to respond that time probably because it was down or something [00:46:30] The 500s are probably all due to error conditions passed to the script. Which is arguably a pretty shitty way to respond [00:46:34] (hence my fix to limit) [00:46:44] Same thing if we can drop the non-wikipedia search options [00:46:49] no_justification: tbh dieOut probably shouldn't return 5xx, which indicates a backend error, it should really be a 4xx client error. But it might be too late to change that [00:47:04] I don't see any requirement that we have to return a 500 [00:50:45] switching all the client validation errors to 4xx should get rid of many of the 5xx. Similarly if the mw api returns a 4xx we should forward that along (it's typically code: request_too_large because we don't autocomplete strings longer than titles can be) [00:56:25] Actually, could just pretend they didn't even give params there at all [00:56:26] :p [00:59:39] (03PS1) 10Chad: search.wikimedia.org: Don't bail on bad $site or $lang params [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390352 [01:02:25] (03PS1) 10Chad: search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 [01:02:37] ebernhardson: I have a couple up now :) [01:03:26] (03CR) 10jerkins-bot: [V: 04-1] search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 (owner: 10Chad) [01:03:40] (03CR) 10EBernhardson: [C: 031] search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) (owner: 10Chad) [01:04:02] (03CR) 10EBernhardson: [C: 031] search.wikimedia.org: Don't bail on bad $site or $lang params 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/390352 (owner: 10Chad) [01:04:04] (03PS1) 10Chad: search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 [01:04:41] (03CR) 10EBernhardson: [C: 031] search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 (owner: 10Chad) [01:04:43] (03CR) 10Chad: [C: 032] search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) (owner: 10Chad) [01:04:49] (03CR) 10Chad: [C: 032] search.wikimedia.org: Don't bail on bad $site or $lang params [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390352 (owner: 10Chad) [01:05:00] (03CR) 10jerkins-bot: [V: 04-1] search.wikimedia.org: Don't bail on bad $site or $lang params [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390352 (owner: 10Chad) [01:05:02] (03CR) 10EBernhardson: [C: 031] search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 (owner: 10Chad) [01:05:08] (03CR) 10jerkins-bot: [V: 04-1] search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 (owner: 10Chad) [01:05:46] heh, so obvious but easy to miss: $limitParam => 0 [01:06:01] (03Merged) 10jenkins-bot: search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) (owner: 10Chad) [01:06:26] (03PS2) 10Chad: search.wikimedia.org: Don't bail on bad $site or $lang params [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390352 [01:06:45] ebernhardson: Yeah, I fixed that but never rebased my local patches before pushing :) [01:06:54] (03CR) 10jenkins-bot: search.wikimedia.org: simplify limit handling [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) (owner: 10Chad) [01:08:13] (03PS2) 10Chad: search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 [01:10:09] (03CR) 10jenkins-bot: search.wikimedia.org: Don't bail on bad $site or $lang params [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390352 (owner: 10Chad) [01:10:24] (03CR) 10Chad: [C: 032] search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 (owner: 10Chad) [01:11:31] (03Merged) 10jenkins-bot: search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 (owner: 10Chad) [01:11:46] (03PS2) 10Chad: search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 [01:13:23] (03CR) 10jenkins-bot: search.wikimedia.org: Nicer handling of bad search parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390353 (owner: 10Chad) [01:13:25] (03CR) 10Chad: [C: 032] search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 (owner: 10Chad) [01:14:39] (03Merged) 10jenkins-bot: search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 (owner: 10Chad) [01:17:11] (03CR) 10jenkins-bot: search.wikimedia.org: Remove silly configuration switch for caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390354 (owner: 10Chad) [01:18:39] !log demon@tin Synchronized docroot/search.wikimedia.org/index.php: minor cleanups, less 500s (duration: 00m 47s) [01:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:15] ebernhardson: Wikimedia search service bad request. 
Request must include a 'search' parameter [01:19:16] Wheeee [01:19:19] That's "better" [01:19:19] no_justification: I'm going to finally sync 390346 after playing with mwdebug [01:19:20] haha :) [01:19:29] already staged a bit ago [01:20:56] (03PS1) 1020after4: Update scap to 3.7.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/390355 [01:21:11] !log aaron@tin Synchronized php-1.31.0-wmf.7/includes/db: Use the main stash for LBFactory "memStash" parameter (duration: 00m 47s) [01:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:39] (03PS1) 10Chad: search.wikimedia.org: Clean up result returning logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390357 [01:24:41] (03PS1) 10Chad: WIP: search.wikimedia.org: Stop supporting non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390358 [01:25:38] Ok ok, I'm done now [01:27:41] no_justification: as am I ;) [01:33:16] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3749801 (10tstarling) >>! In T175672#3655525, @aaron wrote: > I forget to mention, u... [02:10:40] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3696477 (10Krinkle) Random fly-by comment ahead. Apologies for any useless information that no longer applies. As I understand... 
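[editor's note] The 00:46 to 00:50 exchange above argues that bad request parameters should produce a 4xx client error rather than a 5xx, and that a 4xx from the MediaWiki API should be forwarded as-is. A minimal Python sketch of that idea follows. It is not the actual index.php change: the function names, the 255-character limit, and the choice of 400 are all assumptions made for illustration. Only the "Request must include a 'search' parameter" message is taken from the log.

```python
from http import HTTPStatus

# Hypothetical limit, for illustration only; the real title length limit
# is not stated in the log.
MAX_SEARCH_LENGTH = 255


def validate_search_request(params):
    """Return (status, message) for the query parameters of a search request.

    Validation failures are the client's fault, so they map to 4xx, not 5xx.
    """
    search = params.get("search", "")
    if not search:
        return HTTPStatus.BAD_REQUEST, "Request must include a 'search' parameter"
    if len(search) > MAX_SEARCH_LENGTH:
        return HTTPStatus.BAD_REQUEST, "Search string is too long"
    return HTTPStatus.OK, "ok"


def status_for_backend_reply(backend_status):
    """Pick the status to send to the client given the backend's status.

    A backend 4xx is forwarded unchanged; any other backend failure is
    reported as a generic 500; success is reported as 200.
    """
    if 400 <= backend_status < 500:
        return backend_status
    if backend_status >= 500:
        return int(HTTPStatus.INTERNAL_SERVER_ERROR)
    return int(HTTPStatus.OK)
```

With this split, only genuine backend failures count toward the 5xx rate that T179266 tracks, while malformed requests show up as 4xx.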
[02:46:15] (03CR) 10Krinkle: [C: 031] search.wikimedia.org: Clean up result returning logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390357 (owner: 10Chad) [03:34:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.37 seconds [03:56:33] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 121.08 seconds [04:00:22] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [06:34:12] elukey: I will force a BBU relearn on db1046 [06:35:56] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390366 (https://phabricator.wikimedia.org/T178359) [06:37:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390366 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:39:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390366 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:39:12] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390366 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:39:54] !log Force a BBU relearn on db1046 - T166141 [06:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:01] T166141: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141 [06:40:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T178359 (duration: 00m 49s) [06:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:30] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:41:00] !log Stop MySQL on db1055 to copy its content to db1105 - T178359 [06:41:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:21] !log Deploy alter table on s5 eqiad master (db1063) - T172207 [06:50:23] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [06:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:28] T172207: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207 [06:51:07] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3750027 (10Marostegui) After the BBU re-learn: ``` ˜/icinga-wm 7:50> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` [07:01:14] (03PS1) 10Marostegui: wiki-replicas.sql: Add quarry user with 48 conn [puppet] - 10https://gerrit.wikimedia.org/r/390368 (https://phabricator.wikimedia.org/T180141) [07:01:54] (03CR) 10Marostegui: [C: 032] wiki-replicas.sql: Add quarry user with 48 conn [puppet] - 10https://gerrit.wikimedia.org/r/390368 (https://phabricator.wikimedia.org/T180141) (owner: 10Marostegui) [07:17:33] !log Deploy alter table on s3.codfw master (db2018) with replication, this will generate lag on codfw - T174569 [07:18:53] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:19:02] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:19:41] ^ just paged me [07:21:57] looks down yes [07:24:10] <_joe_> I didn't get paged [07:26:46] _joe_: I'm on 24 hour paging, may be that's why I got it? 
[07:27:08] <_joe_> !log restarting apache on phab1001 [07:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:48] seems up right now [07:27:53] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 34523 bytes in 0.231 second response time [07:28:12] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 34524 bytes in 0.285 second response time [07:29:45] thanks _joe_ [07:30:36] <_joe_> greg-g: I didn't look for the root cause, tbh [07:30:57] :/ [07:30:58] <_joe_> I just saw apache children blocked in a futex, so decided it was surely something that would be cured by a restart [07:31:16] <_joe_> it's relatively early and I'm very sleepy today [07:31:40] <_joe_> greg-g: if this is a recurring issue, we will surely have time to find out. Let's hope it wasn't [07:31:40] hopefully mukunda (or mutante) is around next time [07:31:47] and very late for me, and greg-g :) [07:31:51] yeah, g'night! [07:32:05] <_joe_> I'm not sure mukunda or daniel would've extracted more information [07:32:12] good night greg-g :) [07:32:18] <_joe_> this needs strace(1) to be tracked down [07:36:47] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: allow wdqs-admins to pool / depool wdqs servers - https://phabricator.wikimedia.org/T172798#3750054 (10Smalyshev) 05Resolved>03Open Doesn't look like it is working: ``` Depooling wdqs2001.codfw.wmnet from all services... WARNI... 
[07:38:18] 10Operations, 10ops-codfw, 10DBA: db2059 storage crash - https://phabricator.wikimedia.org/T180196#3750056 (10Marostegui) [07:38:29] 10Operations, 10ops-codfw, 10DBA: db2059 storage crash - https://phabricator.wikimedia.org/T180196#3750068 (10Marostegui) 05Open>03Resolved a:03Marostegui [07:49:50] !log smalyshev@tin Started deploy [wdqs/wdqs@213f864]: (no justification provided) [07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:23] !log smalyshev@tin Finished deploy [wdqs/wdqs@213f864]: (no justification provided) (duration: 00m 33s) [07:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:32] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [07:51:23] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [07:56:03] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: allow wdqs-admins to pool / depool wdqs servers - https://phabricator.wikimedia.org/T172798#3509813 (10Dzahn) try with "sudo -i" [07:58:20] (03CR) 10Hashar: "Indeed :-) Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/389432 (owner: 10Dzahn) [08:20:46] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: allow wdqs-admins to pool / depool wdqs servers - https://phabricator.wikimedia.org/T172798#3750093 (10Smalyshev) @Dzahn it asks me for password then. 
[08:25:03] (03PS1) 10Muehlenhoff: Record extended MOU for west1 [puppet] - 10https://gerrit.wikimedia.org/r/390376 [08:28:12] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o [08:28:12] e was received [08:29:52] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received [08:30:53] (03CR) 10Muehlenhoff: [C: 032] Record extended MOU for west1 [puppet] - 10https://gerrit.wikimedia.org/r/390376 (owner: 10Muehlenhoff) [08:31:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [08:32:03] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [08:34:46] (03CR) 10Filippo Giunchedi: [C: 031] Log every retry warning [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389742 (owner: 10Eevans) [08:35:04] (03CR) 10Filippo Giunchedi: [C: 031] Use a more realistic defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 (owner: 10Eevans) [08:35:31] Thanks _joe_, not sure why I didn't get paged either. No code has been deployed recently so I'm not sure what would suddenly cause apache to get locked. But I guess we'll find out something if it happens again. [08:36:33] thanks marostegui!!! 
[08:36:35] (03CR) 10Muehlenhoff: [C: 031] Use a more realistic defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 (owner: 10Eevans) [08:37:39] (03PS3) 10Muehlenhoff: cumin: use new syntax in aliases [puppet] - 10https://gerrit.wikimedia.org/r/389983 (owner: 10Volans) [08:38:40] marostegui: db1046 knows that we are about to decom it and it is trying to get attention :P [08:38:48] hahaha [08:38:58] ah, btw, yes, let's migrate its data next week :) [08:39:38] whenever you have time, I'll send an email before hand explaining the maintenance to alert people that data on the replicas might not be up to date for a bit [08:40:32] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [08:40:49] elukey: ^ XDDDDD [08:41:10] let me poke the BBU again [08:41:20] ahahhaha [08:43:07] !log rebooting mw2200-mw2223 to 4.9.51 (and to pick up OpenSSL updates) [08:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:39] (03PS2) 10Lokal Profil: [WIP]Support prefixed dump types [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328) [08:51:23] 10Operations, 10Prod-Kubernetes, 10monitoring, 10Kubernetes, and 3 others: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3657156 (10fgiunchedi) I've built a k8s-enabled deb from Debian package and imported the repo in `operations/debs/prometheus`. I'll test and u... 
[09:02:39] (03PS1) 10Gehel: base: purge apt sources on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/390377 [09:04:13] (03CR) 10Gehel: [C: 04-1] "Make sure all stray apt sources have been cleaned up before merging this: https://phabricator.wikimedia.org/P6286" [puppet] - 10https://gerrit.wikimedia.org/r/390377 (owner: 10Gehel) [09:12:56] (03PS1) 10Marostegui: db-eqiad.php: Repool db1055 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390378 (https://phabricator.wikimedia.org/T178359) [09:13:22] !log powercycling mw2213, stuck after reboot [09:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:05] (03CR) 10Hashar: "From what I understand, on deployment-prep we have some custom ones:" [puppet] - 10https://gerrit.wikimedia.org/r/390377 (owner: 10Gehel) [09:15:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1055 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390378 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:15:36] (03CR) 10Muehlenhoff: "wikimedia-experimental.list (we also use that one in production, where the change has been made already) and project-aptly.list are manage" [puppet] - 10https://gerrit.wikimedia.org/r/390377 (owner: 10Gehel) [09:16:30] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1055 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390378 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:16:39] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1055 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390378 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:17:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 with low weight - T178359 (duration: 00m 47s) [09:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:47] T178359: Support multi-instance on core hosts - 
https://phabricator.wikimedia.org/T178359 [09:21:10] marostegui: working right now in T173647 to give a final review [09:21:10] T173647: Prepare and check storage layer for hif.wiktionary - https://phabricator.wikimedia.org/T173647 [09:21:21] arturo: cool! thanks [09:23:01] !log rebooting mw2097-mw2117 to 4.9.51 (and to pick up OpenSSL updates) [09:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:24] 10Operations, 10Cloud-VPS, 10Traffic, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (10akosiaris) Just as a note, if it proves it's not possible to do this in our openstack, vagrant is also a valid... [09:30:06] !log Upgrading operations-puppet-tests-docker jenkins job to stop passing docker --tty and thus have signals forwarded from 'docker run' - T176747 [09:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:12] T176747: When jenkins kills a build due to max execution time the docker containers stay running - https://phabricator.wikimedia.org/T176747 [09:31:50] (03CR) 10Alexandros Kosiaris: [C: 032] puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 (owner: 10Paladox) [09:31:55] (03PS4) 10Alexandros Kosiaris: puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 (owner: 10Paladox) [09:31:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 (owner: 10Paladox) [09:38:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:28] (03CR) 10Elukey: "Thanks Filippo!" 
(036 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [09:39:32] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [09:39:56] (03PS9) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [09:42:28] (03CR) 10Hashar: [C: 04-1] "contint::packages::apt is no more relevant. It is missing bits and not properly catching up packages." [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [09:43:39] pff [09:44:01] those friday migrations are boring [09:45:19] !log powercycling mw2108, stuck after reboot [09:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:34] (03PS2) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [09:47:42] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [09:48:32] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [09:53:23] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390380 [09:54:56] !log Compress enwiki on db1105.s1 - T178359 [09:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:03] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [09:55:40] (03PS3) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [09:55:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390380 (owner: 10Marostegui) [09:56:31] marostegui: https://phabricator.wikimedia.org/T173647#3750201 [09:57:03] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390380 (owner: 10Marostegui) [09:57:04] 10Operations, 10ops-codfw: Broken memory on mw2108 - https://phabricator.wikimedia.org/T180200#3750202 (10MoritzMuehlenhoff) [09:57:13] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390380 (owner: 10Marostegui) [09:57:36] ACKNOWLEDGEMENT - Host mw2108 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T180200 [09:58:01] mw2108 lost the battle :D [10:00:03] yeah :-) [10:00:22] arturo: i will check later [10:01:05] arturo: i guess something intermediate is missing, as you can do sql --cluster web kbpwiki_p and that works (and I assume you have never accessed that view before), so the grants are there for everything but for that new view, so something intermediate might be missing from that procedure :) [10:01:20] arturo: but I don't know the magic the clouds team do to generate the views+grants [10:01:41] I know nothing about the grants yet [10:02:14] I know about the 
maintain-views + maintain-meta_p stuff, not sure if that creates grants or not [10:02:41] (03PS4) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:03:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1055 weight - T178359 (duration: 00m 58s) [10:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:16] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:04:00] arturo: check the last comment from Reedy :-) [10:04:08] yeah [10:04:20] I guess I can simply give +2? [10:04:34] (03PS4) 10Marostegui: Add hifwiktionary too labsdb.yaml [puppet] - 10https://gerrit.wikimedia.org/r/389555 (https://phabricator.wikimedia.org/T173643) (owner: 10Reedy) [10:04:45] do you have rights/know how to merge puppet? [10:04:49] if not, I can do it for you [10:05:34] I think so, let's try? [10:05:43] sure [10:05:49] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Add hifwiktionary too labsdb.yaml [puppet] - 10https://gerrit.wikimedia.org/r/389555 (https://phabricator.wikimedia.org/T173643) (owner: 10Reedy) [10:08:07] (03PS5) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:09:47] marostegui: done [10:11:28] !log rebooting Parsoid servers in codfw to 4.9.51 (and to pick up OpenSSL updates) [10:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:17] arturo: cool, I don't know if you need to re-run views stuff to pick up Reedy's patch, as I said, I don't know clouds's magic :) [10:12:38] marostegui: ok, will re-check everything [10:12:57] I think it's DNS mostly... 
[10:12:59] Maybe [10:13:19] yeah, could be, because I do see the grants for labsdbuser and the users have that role assigned [10:16:04] marostegui: if I run the maintain-views script I see lot of warnings like this [10:16:07] DEBUG Skipping full view wb_terms on database hifwiktionary as the table does not seem to exist. [10:16:21] this may indicate that we are creating views for tables which doesn't exist yet? [10:17:40] that is fine, that table only exists on wikidatawiki [10:18:51] (03PS6) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:21:12] (03PS1) 10Alexandros Kosiaris: Use kubelet user for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/390384 [10:24:24] !log Deploy schema change on db2089 - T179106 [10:24:31] (03PS7) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:33] T179106: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106 [10:28:08] (03CR) 10Alexandros Kosiaris: [C: 032] Use kubelet user for kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/390384 (owner: 10Alexandros Kosiaris) [10:30:32] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [10:30:56] elukey: ^ [10:31:57] (03PS4) 10Arturo Borrero Gonzalez: apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) [10:32:33] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [10:32:48] (03PS1) 10Addshore: Add AdvancedSearch to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390385 (https://phabricator.wikimedia.org/T180147) 
[10:32:50] (03PS1) 10Addshore: Enable AdvancedSearch on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390386 (https://phabricator.wikimedia.org/T180147) [10:33:00] (03PS2) 10Addshore: Enable AdvancedSearch on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390386 (https://phabricator.wikimedia.org/T180147) [10:34:01] (03PS1) 10Addshore: Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) [10:37:32] marostegui: <3 [10:37:32] (03PS8) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:39:21] (03PS5) 10Arturo Borrero Gonzalez: apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) [10:47:00] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390388 [10:48:49] (03PS9) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:49:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390388 (owner: 10Marostegui) [10:49:12] !log powercycling wtp2017, stuck after reboot [10:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:18] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390388 (owner: 10Marostegui) [10:50:27] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390388 (owner: 10Marostegui) [10:50:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat 
page via media route) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response [10:50:33] main}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received [10:51:11] <_joe_> something is going on on scb1001, lemme check [10:51:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1055 weight - T178359 (duration: 00m 47s) [10:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:22] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:52:23] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [10:52:32] <_joe_> !log restarting ores on scb1001, causing memory exhaustion [10:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:22] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [10:53:23] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:53:42] <_joe_> loadavg was 500+ [10:54:38] <_joe_> I'm depooling that server for now [10:54:47] <_joe_> load is still too high [10:55:09] marostegui: not sure how to follow up with T173647 [10:55:10] T173647: Prepare and check storage layer for hif.wiktionary - https://phabricator.wikimedia.org/T173647 [10:55:16] <_joe_> !log depooling scb1001 from all services while it becomes healthy again [10:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:53] ACKNOWLEDGEMENT - MD RAID on wtp2017 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T180211 [10:55:56] 10Operations, 10ops-codfw: Degraded RAID on wtp2017 - 
https://phabricator.wikimedia.org/T180211#3750452 (10ops-monitoring-bot) [10:57:07] (03PS10) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:57:57] arturo: As I said, I don't know how the views creation works behind the scenes, you might want to ask madhu if she was helping you with that ticket [10:58:20] great, will wait then a few hours, thanks1 [10:58:54] arturo: you'll need to wait till Monday, as far as I remember there is an US holiday today [10:58:56] (03PS11) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [10:59:08] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: wtp2017.codfw.wmnet [10:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:26] 10Operations, 10ops-codfw: Degraded RAID on wtp2017 - https://phabricator.wikimedia.org/T180211#3750473 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Papaul [10:59:29] (03Abandoned) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 (https://phabricator.wikimedia.org/T159254) (owner: 10Hashar) [10:59:32] (03Abandoned) 10Hashar: contint: update unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 (owner: 10Hashar) [11:00:34] marostegui: ACK [11:02:32] (03PS1) 10Marostegui: db-eqiad.php: Restore db1055 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390389 (https://phabricator.wikimedia.org/T178359) [11:04:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1055 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390389 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:05:51] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1055 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390389 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:06:27] (03CR) 10jenkins-bot: 
db-eqiad.php: Restore db1055 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390389 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:06:48] (03CR) 10Hashar: [C: 031] "On the CI puppetmaster, I have removed my two other patches https://gerrit.wikimedia.org/r/#/c/315084/ https://gerrit.wikimedia.org/r/#/c" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [11:06:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1055 original weight - T178359 (duration: 00m 46s) [11:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:00] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [11:09:15] (03PS1) 10Muehlenhoff: Add library hints for openssl11 and openssl1.0 [puppet] - 10https://gerrit.wikimedia.org/r/390390 [11:10:31] (03PS1) 10Marostegui: db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390391 [11:10:58] (03PS2) 10Muehlenhoff: Add library hints for openssl11 and openssl1.0 [puppet] - 10https://gerrit.wikimedia.org/r/390390 [11:12:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390391 (owner: 10Marostegui) [11:14:21] (03Merged) 10jenkins-bot: db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390391 (owner: 10Marostegui) [11:15:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove old comment about db1080 (duration: 00m 46s) [11:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:19] (03CR) 10jenkins-bot: db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390391 (owner: 10Marostegui) [11:16:33] (03PS1) 10Addshore: Disable AdvancedSearch on deployment.beta [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/390392 (https://phabricator.wikimedia.org/T180201) [11:17:30] Im gong to +2 that one on mediawiki-config now (beta only) [11:17:35] (03CR) 10WMDE-Fisch: [C: 031] Disable AdvancedSearch on deployment.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390392 (https://phabricator.wikimedia.org/T180201) (owner: 10Addshore) [11:17:48] (03CR) 10Addshore: [C: 032] Disable AdvancedSearch on deployment.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390392 (https://phabricator.wikimedia.org/T180201) (owner: 10Addshore) [11:18:14] (03CR) 10Muehlenhoff: [C: 032] Add library hints for openssl11 and openssl1.0 [puppet] - 10https://gerrit.wikimedia.org/r/390390 (owner: 10Muehlenhoff) [11:18:56] (03Merged) 10jenkins-bot: Disable AdvancedSearch on deployment.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390392 (https://phabricator.wikimedia.org/T180201) (owner: 10Addshore) [11:19:05] (03CR) 10jenkins-bot: Disable AdvancedSearch on deployment.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390392 (https://phabricator.wikimedia.org/T180201) (owner: 10Addshore) [11:21:41] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:390392|Disable AdvancedSearch on deployment.beta]] BETA ONLY T180201 (duration: 00m 46s) [11:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:47] T180201: Regular search box doesn't appear alongside AdvancedSearch on deployment.beta - https://phabricator.wikimedia.org/T180201 [11:22:20] <_joe_> !log stopping changeprop, celery-ores, cpjobqueue on scb1001 [11:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:05] (03PS1) 10Elukey: druid: remove com.metamx.metrics.JvmMonitor from default monitors [puppet] - 10https://gerrit.wikimedia.org/r/390393 (https://phabricator.wikimedia.org/T177459) [11:24:42] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. [11:24:53] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8723/" [puppet] - 10https://gerrit.wikimedia.org/r/390393 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [11:27:16] !log rebooting wtp1025 to 4.9.51 [11:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:52] (03CR) 10Elukey: [C: 032] druid: remove com.metamx.metrics.JvmMonitor from default monitors [puppet] - 10https://gerrit.wikimedia.org/r/390393 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [11:32:05] <_joe_> !log stopping mobileapps as well on scb1001 [11:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:29] !log rebooting mw2163-2199 to 4.9.51 (and to pick up OpenSSL updates) [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:52] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [11:46:03] <_joe_> !log restarted all services and repooled scb1001 [11:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:02] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2066994 [11:59:41] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (watching), and 3 others: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786#3750605 (10Joe) [12:11:46] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:16:15] (03PS10) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [12:26:25] (03PS11) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [12:48:55] !log rebooting video scalers in codfw to 4.9.51 (and to pick up OpenSSL updates) [12:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:18] !log cp4021: restart varnish-be due to mbox lag [13:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:14] !log powercycling mw2118, stuck after reboot [13:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] !log truncate /var/log/nginx/error.log.1 on install1002 as it is filling up [13:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:14:46] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3750820 (10akosiaris) `debian/changelog` in that package is wrongly formatted and hence package is currently unbuildable. 
See D875 [13:15:08] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0 [13:17:18] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [13:19:05] (03PS1) 10Alexandros Kosiaris: check_eth: Ignore calico interfaces [puppet] - 10https://gerrit.wikimedia.org/r/390400 [13:19:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:27:12] (03PS1) 10Gehel: elasticsearch: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) [13:27:49] (03CR) 10Gehel: [C: 04-1] "Don't merge before we are ready to upgrade to elastic 5.5.x" [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:29:40] (03CR) 10Muehlenhoff: elasticsearch: dedicated components in our APT repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:29:46] (03CR) 10Alexandros Kosiaris: [C: 032] check_eth: Ignore calico interfaces [puppet] - 10https://gerrit.wikimedia.org/r/390400 (owner: 10Alexandros Kosiaris) [13:30:26] (03PS2) 10Gehel: elasticsearch: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) [13:30:54] (03CR) 10Gehel: [C: 04-1] "Don't merge before we are ready to upgrade to elastic 5.5.x" [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:31:00] (03CR) 10Gehel: [C: 04-1] elasticsearch: dedicated components in our APT repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:31:02] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: dedicated components in our APT repository [puppet] - 
10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:31:29] (03PS3) 10Gehel: elasticsearch: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) [13:31:51] (03CR) 10Gehel: [C: 04-1] "Don't merge before we are ready to upgrade to elastic 5.5.x" [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:38:29] (03PS1) 10Gehel: elasticsearch: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390402 (https://phabricator.wikimedia.org/T179964) [13:39:07] (03PS1) 10Alexandros Kosiaris: profile::etcd: Move hiera lookups to parameters [puppet] - 10https://gerrit.wikimedia.org/r/390403 [13:39:09] (03PS1) 10Alexandros Kosiaris: profile::etcd::auth: Rename hiera key [puppet] - 10https://gerrit.wikimedia.org/r/390404 [13:39:46] (03PS4) 10Gehel: elasticsearch: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) [13:40:08] (03PS2) 10Gehel: logstash: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390402 (https://phabricator.wikimedia.org/T179964) [13:43:31] (03CR) 10Mobrovac: [C: 031] Log every retry warning [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389742 (owner: 10Eevans) [13:43:55] (03CR) 10Mobrovac: [C: 031] Use a more realistic defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 (owner: 10Eevans) [13:46:18] (03PS2) 10Alexandros Kosiaris: profile::etcd: Move hiera lookups to parameters [puppet] - 10https://gerrit.wikimedia.org/r/390403 [13:46:20] (03PS2) 10Alexandros Kosiaris: profile::etcd::auth: Rename hiera key [puppet] - 10https://gerrit.wikimedia.org/r/390404 [13:49:12] (03CR) 10Alexandros Kosiaris: [C: 031] "The only role using this is role::etcd::kubernetes and we don't set 
that hiera key in production nor in tools-k8s-etcd prefix in toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/390404 (owner: 10Alexandros Kosiaris) [13:49:20] (03PS1) 10Muehlenhoff: Add component/icu57 [puppet] - 10https://gerrit.wikimedia.org/r/390406 [13:59:52] (03PS2) 10Gehel: archiva: generate git-fat sha1 for .tar.gz and .whl [puppet] - 10https://gerrit.wikimedia.org/r/389932 [14:06:00] (03CR) 10Giuseppe Lavagetto: [C: 031] archiva: generate git-fat sha1 for .tar.gz and .whl [puppet] - 10https://gerrit.wikimedia.org/r/389932 (owner: 10Gehel) [14:08:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] profile::etcd: Move hiera lookups to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390403 (owner: 10Alexandros Kosiaris) [14:09:12] (03CR) 10Giuseppe Lavagetto: [C: 031] "this was a leftover of when that setting was used in multiple places in the past." [puppet] - 10https://gerrit.wikimedia.org/r/390404 (owner: 10Alexandros Kosiaris) [14:12:48] (03CR) 10Giuseppe Lavagetto: [C: 031] profile: allow Prometheus to access k8s kubelet [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [14:16:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: 10Mobrovac) [14:17:42] !log rebooting image scalers in codfw to 4.9.51 (and to pick up OpenSSL updates) [14:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:42] (03CR) 10Muehlenhoff: [C: 031] elasticsearch: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [14:39:53] (03PS3) 10Alexandros Kosiaris: profile::etcd: Move hiera lookups to parameters [puppet] - 10https://gerrit.wikimedia.org/r/390403 [14:39:55] (03PS3) 10Alexandros 
Kosiaris: profile::etcd::auth: Rename hiera key [puppet] - 10https://gerrit.wikimedia.org/r/390404 [14:39:57] (03PS1) 10Alexandros Kosiaris: Add system::role to role::etcd::kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/390413 [14:43:06] (03CR) 10Alexandros Kosiaris: profile::etcd: Move hiera lookups to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390403 (owner: 10Alexandros Kosiaris) [14:43:36] (03PS5) 10Alexandros Kosiaris: prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) [14:43:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [14:51:27] (03PS4) 10Alexandros Kosiaris: Prometheus: add kubernetes node cadvisor job [puppet] - 10https://gerrit.wikimedia.org/r/390267 [14:53:46] (03CR) 10Alexandros Kosiaris: Prometheus: add kubernetes node cadvisor job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390267 (owner: 10Alexandros Kosiaris) [14:54:35] (03CR) 10Alexandros Kosiaris: [C: 032] Prometheus: add kubernetes node cadvisor job [puppet] - 10https://gerrit.wikimedia.org/r/390267 (owner: 10Alexandros Kosiaris) [14:55:22] (03CR) 10Alexandros Kosiaris: [C: 032] profile::etcd::auth: Rename hiera key [puppet] - 10https://gerrit.wikimedia.org/r/390404 (owner: 10Alexandros Kosiaris) [14:55:28] (03PS4) 10Alexandros Kosiaris: profile::etcd::auth: Rename hiera key [puppet] - 10https://gerrit.wikimedia.org/r/390404 [14:55:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] profile::etcd::auth: Rename hiera key [puppet] - 10https://gerrit.wikimedia.org/r/390404 (owner: 10Alexandros Kosiaris) [14:56:01] (03PS2) 10Alexandros Kosiaris: Add system::role to role::etcd::kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/390413 [14:56:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] 
Add system::role to role::etcd::kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/390413 (owner: 10Alexandros Kosiaris) [14:56:39] (03CR) 10Alexandros Kosiaris: [C: 032] profile: allow Prometheus to access k8s kubelet [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [14:56:43] (03PS5) 10Alexandros Kosiaris: profile: allow Prometheus to access k8s kubelet [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [14:56:45] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] profile: allow Prometheus to access k8s kubelet [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [14:57:36] !log Decommissioning Cassandra, restbase2006-a.codfw.wmnet (T179422) [14:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:43] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:03:06] !log rebooting remaing API servers in codfw to 4.9.51 (and to pick up OpenSSL updates) [15:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:22] (03PS12) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [15:05:00] (03CR) 10Filippo Giunchedi: [C: 031] Add component/icu57 [puppet] - 10https://gerrit.wikimedia.org/r/390406 (owner: 10Muehlenhoff) [15:07:04] (03PS13) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [15:12:28] (03PS14) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [15:16:47] (03PS15) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [15:17:57] (03PS2) 10WMDE-leszek: Wikidata dispatcher: Choose a better value for --randomness [puppet] - 
10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [15:20:33] (03PS16) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [15:21:01] (03CR) 10jerkins-bot: [V: 04-1] [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 (owner: 10Gehel) [15:22:08] (03CR) 10Zoranzoki21: [C: 031] Add AdvancedSearch to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390385 (https://phabricator.wikimedia.org/T180147) (owner: 10Addshore) [15:22:55] (03CR) 10Zoranzoki21: [C: 031] Enable AdvancedSearch on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390386 (https://phabricator.wikimedia.org/T180147) (owner: 10Addshore) [15:23:16] (03CR) 10Zoranzoki21: [C: 031] Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: 10Addshore) [15:24:51] (03PS17) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [15:29:34] (03PS1) 10Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) [15:30:01] (03CR) 10jerkins-bot: [V: 04-1] profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [15:30:32] (03PS2) 10Muehlenhoff: Add component/icu57 [puppet] - 10https://gerrit.wikimedia.org/r/390406 [15:32:38] (03PS2) 10Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) [15:33:26] (03PS3) 10Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) [15:38:09] (03CR) 10Muehlenhoff: 
[C: 032] Add component/icu57 [puppet] - 10https://gerrit.wikimedia.org/r/390406 (owner: 10Muehlenhoff) [15:39:38] (03PS4) 10Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) [15:43:09] (03PS5) 10Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) [15:45:38] (03PS6) 10Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) [16:16:01] (03PS1) 10Ema: role::prometheus::ops: add banner message to MOTD [puppet] - 10https://gerrit.wikimedia.org/r/390428 [16:16:31] (03CR) 10jerkins-bot: [V: 04-1] role::prometheus::ops: add banner message to MOTD [puppet] - 10https://gerrit.wikimedia.org/r/390428 (owner: 10Ema) [16:17:26] (03PS2) 10Ema: role::prometheus::ops: add banner message to MOTD [puppet] - 10https://gerrit.wikimedia.org/r/390428 [16:17:59] (03CR) 10jerkins-bot: [V: 04-1] role::prometheus::ops: add banner message to MOTD [puppet] - 10https://gerrit.wikimedia.org/r/390428 (owner: 10Ema) [16:18:19] (03PS3) 10Ema: role::prometheus::ops: add banner message to MOTD [puppet] - 10https://gerrit.wikimedia.org/r/390428 [16:22:46] 10Operations, 10ops-codfw: Broken memory on mw2108 - https://phabricator.wikimedia.org/T180200#3751421 (10Papaul) Thank you will work on it on Monday. 
[16:23:18] 10Operations, 10ops-codfw: Broken memory on mw2108 - https://phabricator.wikimedia.org/T180200#3751423 (10Papaul) p:05Triage>03Normal [16:33:27] (03PS1) 10Arturo Borrero Gonzalez: apt: unattended upgrades -updates suites by default [puppet] - 10https://gerrit.wikimedia.org/r/390431 (https://phabricator.wikimedia.org/T180254) [16:34:02] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended upgrades -updates suites by default [puppet] - 10https://gerrit.wikimedia.org/r/390431 (https://phabricator.wikimedia.org/T180254) (owner: 10Arturo Borrero Gonzalez) [16:34:27] 10Operations, 10Traffic, 10Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3751470 (10ema) [16:34:34] 10Operations, 10Traffic, 10Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3751482 (10ema) p:05Triage>03Normal [16:36:20] Line 7: Use 'Suggested-By:' not 'Suggested-by:' <-- really? 
[16:36:46] 10Operations, 10Traffic, 10netops, 10Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3751487 (10ema) p:05Triage>03Normal [16:39:31] 10Operations, 10Traffic: Puppet / LVS: confusion in service vs IP name - https://phabricator.wikimedia.org/T180257#3751493 (10Gehel) [16:43:19] (03PS6) 10Arturo Borrero Gonzalez: apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) [16:43:22] (03PS2) 10Arturo Borrero Gonzalez: apt: unattended upgrades -updates suites by default [puppet] - 10https://gerrit.wikimedia.org/r/390431 (https://phabricator.wikimedia.org/T180254) [16:46:18] (03PS1) 10Gehel: logstash: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390433 (https://phabricator.wikimedia.org/T179964) [16:46:24] (03PS1) 10Ayounsi: Fix wrong hostname for dns4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/390434 [16:47:53] !log Deploy alter table on s6, dbstore1001, dbstore1002 abd db1039 - T174569 [16:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:59] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:50:26] (03CR) 10Hashar: [C: 031] "Sounds good to me :)" [puppet] - 10https://gerrit.wikimedia.org/r/390431 (https://phabricator.wikimedia.org/T180254) (owner: 10Arturo Borrero Gonzalez) [17:00:08] (03PS1) 10Ottomata: 2.1.2-2 release for Hadoop 2.6 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/390435 (https://phabricator.wikimedia.org/T158334) [17:00:46] (03CR) 10BBlack: [C: 031] Fix wrong hostname for dns4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/390434 (owner: 10Ayounsi) [17:01:07] (03CR) 10Ottomata: [V: 032 C: 032] 2.1.2-2 release for Hadoop 2.6 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/390435 
(https://phabricator.wikimedia.org/T158334) (owner: 10Ottomata) [17:02:52] (03CR) 10Ayounsi: [C: 032] Fix wrong hostname for dns4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/390434 (owner: 10Ayounsi) [17:09:09] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 2570 MB (3% inode=98%) [17:11:46] !log freed some disk space on install1002 [17:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:11] (03PS12) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [17:20:58] !log uploaded icu 57.1-6+wmf1 for jessie-wikimedia/component/icu57 (co-installable build for ICU migration) [17:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:02] (03CR) 10Ottomata: [C: 031] "Some nits, but +1 Feel free to merge after fixing nits, OR NOT if you don't like them :)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [17:45:28] (03PS4) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) [17:49:18] PROBLEM - DPKG on stat1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:49:38] PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:50:34] this should be Andrew instaling spark --^ [17:51:01] ah! [17:51:02] yes [17:51:03] sorry [17:51:10] yeah my .deb is getting better. 
[17:53:32] (PS13) Elukey: First commit [software/druid_exporter] - https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459)
[17:59:18] RECOVERY - DPKG on stat1005 is OK: All packages OK
[17:59:38] RECOVERY - DPKG on analytics1003 is OK: All packages OK
[18:01:54] (PS7) Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459)
[18:03:46] (PS8) Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459)
[18:06:24] (PS9) Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459)
[18:08:04] !log smalyshev@tin Started deploy [wdqs/wdqs@ccab8ce]: data reload/T176593
[18:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:12] (CR) Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/8736/" [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[18:08:39] !log smalyshev@tin Finished deploy [wdqs/wdqs@ccab8ce]: data reload/T176593 (duration: 00m 34s)
[18:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:49] (PS1) Gehel: wdqs: allow port 9876 inside the wdqs clsuter for netcat file transfer [puppet] - https://gerrit.wikimedia.org/r/390440 (https://phabricator.wikimedia.org/T176593)
[18:10:38] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/spark/conf.analytics-hadoop]
[18:14:41] (PS2) Gehel: wdqs: allow port 9876 inside the wdqs clsuter for netcat file transfer [puppet] - https://gerrit.wikimedia.org/r/390440 (https://phabricator.wikimedia.org/T176593)
[18:16:54] (CR) Smalyshev: [C: 1] wdqs: allow port 9876 inside the wdqs clsuter for netcat file transfer [puppet] - https://gerrit.wikimedia.org/r/390440 (https://phabricator.wikimedia.org/T176593) (owner: Gehel)
[18:20:19] (CR) Gehel: [C: 2] wdqs: allow port 9876 inside the wdqs clsuter for netcat file transfer [puppet] - https://gerrit.wikimedia.org/r/390440 (https://phabricator.wikimedia.org/T176593) (owner: Gehel)
[20:17:39] hi copying this question over from the #wikimedia-tech channel, since this is more appropriate
[20:17:48] it looks like wikimedia's servers recently updated to nginx 1.13.6
[20:17:54] there's a known issue with this version where it breaks a number of older popular Android HTTP libraries
[20:18:02] https://trac.nginx.org/nginx/ticket/1397
[20:19:19] since this change is likely to break a number of Android applications that use wikimedia content, I was wondering if it made sense to roll back to a previous version of nginx, at least for long enough to give app developers time to upgrade
[20:24:12] Operations, ops-codfw, ops-eqdfw, ops-eqiad: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3751867 (GabrielF)
[20:25:50] Operations, Traffic, Wikimedia-General-or-Unknown, HTTPS: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3751881 (Reedy)
[21:46:04] (Draft1) Paladox: puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504
[21:46:07] (PS2) Paladox: puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504
[21:46:32] (CR) jerkins-bot: [V: -1] puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504 (owner: Paladox)
[21:51:07] (PS3) Paladox: puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504
[21:51:30] (CR) jerkins-bot: [V: -1] puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504 (owner: Paladox)
[21:51:44] (PS1) Krinkle: Add warning and documentation comment to HHVMRequestInit.php [mediawiki-config] - https://gerrit.wikimedia.org/r/390526
[21:52:55] (PS4) Paladox: puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504
[21:53:23] (CR) jerkins-bot: [V: -1] puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504 (owner: Paladox)
[21:53:24] Hello!
[21:53:46] (PS2) Krinkle: Add warning and documentation comment to HHVMRequestInit.php [mediawiki-config] - https://gerrit.wikimedia.org/r/390526
[21:54:11] May I ask a question about a slow database that was fast two weeks ago?
[21:54:27] You may
[21:54:38] (PS5) Paladox: puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504
[21:55:04] (CR) jerkins-bot: [V: -1] puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504 (owner: Paladox)
[21:55:12] hmm that should have fixed it
[21:57:03] Well, I am connecting to the dewiki-database with host dewiki.labsdb. One big job which starts at 4 a.m. (UTC) is usually done by 10 or 11 a.m. But for the past two weeks it has finished on the next day. So instead of 6 hours it takes more than 24 hours
[21:57:21] wrong channel then :)
[21:57:27] #wikimedia-labs
[21:57:31] Aha!
[21:58:14] However, when trying dewikisource.analytics.db.svc.eqiad.wmflabs instead, it seems to be as fast as it was before
[21:58:33] But I will go to the other channel
[22:00:10] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:01:08] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[22:01:22] (PS6) Paladox: puppetdb: Allow customising username and password for active record [puppet] - https://gerrit.wikimedia.org/r/390504
[22:01:40] Reedy wrong channel
[22:01:46] #wikimedia-cloud now :)
[22:01:53] It should redirect...
[22:03:42] yep
[22:11:53] Operations, Traffic, Wikimedia-General-or-Unknown, HTTPS: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3751867 (Reedy) Do we have any examples of this actually affecting any apps? Are these apps actively maintained?...
[22:16:47] (CR) Paladox: puppetdb: Allow customising username and password for active record (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390504 (owner: Paladox)
[22:24:00] (CR) Krinkle: [C: 2] Add warning and documentation comment to HHVMRequestInit.php [mediawiki-config] - https://gerrit.wikimedia.org/r/390526 (owner: Krinkle)
[22:26:20] (Merged) jenkins-bot: Add warning and documentation comment to HHVMRequestInit.php [mediawiki-config] - https://gerrit.wikimedia.org/r/390526 (owner: Krinkle)
[22:26:30] (CR) jenkins-bot: Add warning and documentation comment to HHVMRequestInit.php [mediawiki-config] - https://gerrit.wikimedia.org/r/390526 (owner: Krinkle)
[22:50:45] Operations, Trending-Service, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban), and 4 others: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786#3735963 (bearND) There's already a 2.2.0 out. Pushed an additional commit for that.
[23:04:58] (PS1) Krinkle: Improve StartProfiler.php file documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183)
[23:06:12] (PS2) Krinkle: Improve StartProfiler.php file documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183)
[23:38:57] Hello all. I (OTRS) require some help. A company can't accesswikipedia.org from certain IP addresses from Norcross. Can someone help out?
[23:41:09] accesswikipedia.org does not exist for me.
[23:41:23] access wikipedia.org *
[23:41:38] sorry, my spacebar is acting up
[23:42:44] Creating Phab ticket
[23:46:23] https://phabricator.wikimedia.org/T180277
[23:47:50] XioNoX: You around?
[23:49:24] or perhaps paravoid
[23:49:36] I'm on my phone
[23:49:58] XioNoX: When you get a chance, would you mind looking at that phab task for me^^ ?
[23:50:26] can you make it public?
[23:50:57] I'm not authenticated to phab on my phone
[23:51:24] no, but i can CC you to the task...it contains personal details from OTRS; so it needs to be "private"
[23:52:03] ok, I'll look at it in about 1h when I get to my laptop
[23:52:07] https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue
[23:52:15] https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue
[23:52:18] I was just gonna say...
[23:52:30] mentioning that also in case there are not enough details
[23:52:33] hehe
[23:52:50] "being blacklisted by a security service used by the destination site."
[23:52:54] Yeah, not gonna be the case here
[23:53:11] ok, thanks! Yeah, I'm gonna respond back and ask them to perform some troubleshooting
[23:53:11] Josve05a: They're gonna have to do the network diagnostic stuff
[23:53:37] * Josve05a has never read tatpage before...gonna add it to a help page on OTRS wiki, thanks!
[23:53:43] that page*
[23:55:07] "not working" could be sooo many things