[02:14:49] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2069495
[02:31:56] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 08m 18s)
[02:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Nov 13 02:38:38 UTC 2017 (duration 6m 42s)
[02:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:58] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0
[03:29:28] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 898.60 seconds
[03:55:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.03 seconds
[04:01:05] !log Decommissioning Cassandra, restbase1007-c.eqiad.wmnet (T179422)
[04:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:13] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[04:12:09] (PS3) Santhosh: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - https://gerrit.wikimedia.org/r/378833 (owner: KartikMistry)
[04:12:35] (CR) Santhosh: [C: 1] Remove wgContentTranslationEnableSuggestions [mediawiki-config] - https://gerrit.wikimedia.org/r/378833 (owner: KartikMistry)
[04:40:28] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[05:00:24] (PS4) KartikMistry: Beta: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879)
[05:10:28] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:30:17] (CR) Phedenskog: "This is great, it will be nice to watch the number for a while and then add an alert for it." [puppet] - https://gerrit.wikimedia.org/r/390061 (owner: Krinkle)
[05:35:28] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2099967
[05:38:22] (PS3) Phedenskog: webperf: Refactor tests to directly associate expected data with cases [puppet] - https://gerrit.wikimedia.org/r/390083 (owner: Krinkle)
[05:39:19] (CR) Phedenskog: [C: 1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/390061 (owner: Krinkle)
[05:46:36] (CR) Phedenskog: "I tested it and really like that you know see in which section there's an problem (that really helps out)." [puppet] - https://gerrit.wikimedia.org/r/390083 (owner: Krinkle)
[06:13:38] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received
[06:13:49] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:14:28] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy
[06:14:48] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 1.051 second response time
[06:14:49] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.)
timed out before a response was received
[06:15:48] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[06:18:21] !log Deploy alter table db2086 - T179106
[06:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:29] T179106: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106
[06:23:23] (PS1) Marostegui: db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/390943 (https://phabricator.wikimedia.org/T174569)
[06:25:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/390943 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:26:10] (Merged) jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/390943 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:26:22] (CR) jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/390943 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:27:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098 - T174569 (duration: 00m 49s)
[06:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:25] !log Deploy alter table on db1098 - T174569
[06:27:25] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:48] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:32:18] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:38:18] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:39:48] RECOVERY - Eqiad HTTP 5xx
reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:40:27] (PS1) Marostegui: db1103.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/390944 (https://phabricator.wikimedia.org/T178359)
[06:41:10] (CR) Marostegui: [C: 2] db1103.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/390944 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:44:49] !log Deploy alter table directly on codfw s5 master (db2023), this will generate lag on codfw - T179793
[06:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:56] T179793: Consider dropping the "wb_items_per_site.wb_ips_site_page" index - https://phabricator.wikimedia.org/T179793
[06:56:34] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359)
[07:30:31] PROBLEM - Nginx local proxy to apache on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:31:21] RECOVERY - Nginx local proxy to apache on mw2134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.201 second response time
[07:33:30] Operations, Ops-Access-Requests, Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3753732 (MoritzMuehlenhoff) p:Triage>Normal
[07:33:35] !log Optimize wb_terms table on db2052 - T179106
[07:33:41] Operations, Puppet: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230#3753734 (MoritzMuehlenhoff) p:Triage>Normal
[07:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:42] T179106: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106
[07:34:03] Operations, Goal, User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production -
https://phabricator.wikimedia.org/T177197#3753736 (MoritzMuehlenhoff) p:Triage>High
[07:34:38] Operations, Puppet: Add require_package() variant with repository component to wmflib - https://phabricator.wikimedia.org/T178575#3753737 (MoritzMuehlenhoff) p:Triage>Normal
[07:34:45] Operations, Goal, Patch-For-Review, User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565#3753738 (MoritzMuehlenhoff) p:Triage>High
[07:35:15] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3753740 (MoritzMuehlenhoff) p:Triage>Normal
[07:36:42] Operations, monitoring: Export ipsec counters as Prometheus metrics - https://phabricator.wikimedia.org/T154619#3753744 (MoritzMuehlenhoff) p:Triage>Normal
[07:38:36] Operations, Ops-Access-Requests, Discovery, Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3753745 (MoritzMuehlenhoff) I'm adding @RStallman-legalteam for preparing the NDAs for Martijn and Kirk.
[07:38:53] !log Deploy alter table to db1104 - T179106
[07:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:00] T179106: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106
[07:39:53] log Deploy alter table to db1105
[07:42:57] !log Deploy alter table to db1105
[07:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:12] !log installing ruby2.3 security updates
[07:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:31] (PS1) Alexandros Kosiaris: Add AAAA and PTR records for kubernetes boxes [dns] - https://gerrit.wikimedia.org/r/390953
[08:11:48] (PS5) Nikerabbit: Beta: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: KartikMistry)
[08:12:15] (CR) Nikerabbit: [C: 1] Beta: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: KartikMistry)
[08:23:05] (CR) Hashar: "That is really just a quick hack to unbreak ferm on deployment-prep.
I have no idea how to fix it properly and take in account that AAAA " [puppet] - https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: Hashar)
[08:25:19] (PS1) Alexandros Kosiaris: Calico: Force IPv4/IPv6 assignment [puppet] - https://gerrit.wikimedia.org/r/390958
[08:31:47] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/390962
[08:33:10] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/390962 (owner: Marostegui)
[08:33:13] !log rebooting mw1238-mw1258 for update to Linux 4.9.51 (and to pick up OpenSSL update)
[08:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:29] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/390962 (owner: Marostegui)
[08:35:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1098 after alter table - T174569 (duration: 00m 47s)
[08:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:42] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:36:19] (PS1) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/390965 (https://phabricator.wikimedia.org/T174569)
[08:38:27] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/390965 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:39:40] (Merged) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/390965 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 - T174569 (duration: 00m 46s)
[08:41:05] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:05] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:41:38] !log Deploy alter table on db1093 - T174569
[08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:58] (CR) Alexandros Kosiaris: [C: 2] Calico: Force IPv4/IPv6 assignment [puppet] - https://gerrit.wikimedia.org/r/390958 (owner: Alexandros Kosiaris)
[08:51:02] (PS10) Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459)
[08:51:48] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/390962 (owner: Marostegui)
[08:51:50] (CR) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/390965 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:52:04] (PS1) Alexandros Kosiaris: Fix missing comma typo in cni.conf.erb [puppet] - https://gerrit.wikimedia.org/r/390966
[08:52:16] (CR) Alexandros Kosiaris: [V: 2 C: 2] Fix missing comma typo in cni.conf.erb [puppet] - https://gerrit.wikimedia.org/r/390966 (owner: Alexandros Kosiaris)
[08:54:30] (PS11) Elukey: profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459)
[08:58:07] (CR) Elukey: [C: 2] profile::druid::broker: add prometheus jmx exporter config (jvm only) [puppet] - https://gerrit.wikimedia.org/r/390419 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[09:02:42] !log restart of druid brokers on druid100[1-6] to apply https://gerrit.wikimedia.org/r/390419 - https://gerrit.wikimedia.org/r/390419
[09:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:48] noooooo
[09:02:56] uffff
copy/paste error
[09:03:47] (CR) Filippo Giunchedi: role::prometheus::ops: add banner message to MOTD (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390428 (owner: Ema)
[09:23:04] (PS4) Ema: role::prometheus: add banner messages to MOTD [puppet] - https://gerrit.wikimedia.org/r/390428
[09:23:32] (CR) jerkins-bot: [V: -1] role::prometheus: add banner messages to MOTD [puppet] - https://gerrit.wikimedia.org/r/390428 (owner: Ema)
[09:27:21] (PS5) Ema: role::prometheus: add banner messages to MOTD [puppet] - https://gerrit.wikimedia.org/r/390428
[09:29:10] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359)
[09:29:20] !log rebooting mw1221-mw1235 for update to Linux 4.9.51 (and to pick up OpenSSL update)
[09:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:06] (PS3) Marostegui: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359)
[09:35:58] (PS1) Elukey: profile::druid::*: add prometheus jvm monitoring via jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459)
[09:36:25] (CR) jerkins-bot: [V: -1] profile::druid::*: add prometheus jvm monitoring via jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[09:38:07] you a right jenkins, my bad to upset you on a Monday morning
[09:39:03] elukey: its motto is "whippings will continue until morale improves"
[09:39:30] (PS2) Elukey: profile::druid::*: add prometheus jvm monitoring via jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459)
[09:39:43] ahahahah
[09:41:02] (PS4) Marostegui: db-eqiad,db-codfw.php: Pool db1103 as rc
for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359)
[09:42:12] (PS5) Marostegui: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359)
[09:42:30] (PS6) Marostegui: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359)
[09:44:15] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:44:31] !log test upgrade of prometheus 1.8.1 with k8s on prometheus2003 - T177395
[09:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:38] T177395: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395
[09:49:13] (PS3) Elukey: profile::druid::*: add prometheus jvm monitoring via jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459)
[09:58:40] Operations, Continuous-Integration-Config, Traffic, Tracking: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754081 (hashar)
[09:59:40] Operations, Continuous-Integration-Config: Add CI to all operations/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180330#3754100 (hashar)
[10:00:45] Operations, Continuous-Integration-Config, Traffic: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754081 (hashar)
[10:06:53] (PS4) Elukey: profile::druid::*: add prometheus jvm monitoring via jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459)
[10:12:28] !log rebooting remaining Parsoid
hosts in eqiad for update to Linux 4.9.51 (and to pick up OpenSSL update)
[10:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:28] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3754174 (Addshore)
[10:16:33] (CR) Addshore: [C: -1] "Also need to add arwiki to this" [mediawiki-config] - https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: Addshore)
[10:18:48] PROBLEM - Host kubernetes1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:59] that's ^ me
[10:19:08] RECOVERY - Host kubernetes1001 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[10:21:32] (CR) Jcrespo: [C: 1] db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:21:34] (CR) Alexandros Kosiaris: [C: 2] Add AAAA and PTR records for kubernetes boxes [dns] - https://gerrit.wikimedia.org/r/390953 (owner: Alexandros Kosiaris)
[10:22:05] (PS1) Ladsgroup: Whitelist jenkins in test wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/390975 (https://phabricator.wikimedia.org/T167432)
[10:24:47] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:25:59] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:26:30] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db1103 as rc for s2,s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/390947 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:27:02] (PS4) Alexandros Kosiaris: profile::etcd: Move hiera lookups to parameters [puppet] - https://gerrit.wikimedia.org/r/390403
[10:27:06] (CR) Alexandros Kosiaris: [V: 2 C: 2] profile::etcd: Move hiera lookups to parameters [puppet] - https://gerrit.wikimedia.org/r/390403 (owner: Alexandros Kosiaris)
[10:27:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1103 as multi-instance host for s2 and s4 - T178359 (duration: 00m 47s)
[10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:25] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[10:27:28] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/8743/" [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:27:33] (PS5) Elukey: profile::druid::*: add prometheus jvm monitoring via jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390968 (https://phabricator.wikimedia.org/T177459)
[10:28:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db1103 as multi-instance host for s2 and s4 - T178359 (duration: 00m 46s)
[10:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:44] (PS5) Gehel: elasticsearch: dedicated components in our APT repository [puppet] - https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964)
[10:34:29] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/etc/druid/jvm_prometheus_coordinator_jmx_exporter.yaml]
[10:34:29] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/druid/jvm_prometheus_coordinator_jmx_exporter.yaml]
[10:34:39] (PS1) Elukey: profile::druid::monitoring::coordinator: fix source for jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390976 (https://phabricator.wikimedia.org/T177459)
[10:34:41] (CR) Gehel: [C: 2] elasticsearch: dedicated components in our APT repository [puppet] - https://gerrit.wikimedia.org/r/390401 (https://phabricator.wikimedia.org/T179964) (owner: Gehel)
[10:35:38] (CR) Elukey: [C: 2] profile::druid::monitoring::coordinator: fix source for jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390976 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:35:42] (PS2) Elukey: profile::druid::monitoring::coordinator: fix source for jmx exporter [puppet] - https://gerrit.wikimedia.org/r/390976 (https://phabricator.wikimedia.org/T177459)
[10:35:52] * elukey +2 snipered by gehel
[10:36:02] :)
[10:39:30] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/390977
[10:39:58] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/390977
[10:39:59] PROBLEM - puppet last run on druid1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/druid/jvm_prometheus_coordinator_jmx_exporter.yaml]
[10:41:31] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/390977 (owner: Marostegui)
[10:42:52] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/390977 (owner: Marostegui)
[10:43:06] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/390977 (owner: Marostegui)
[10:44:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 - T174569 (duration: 00m 46s)
[10:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:13] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[10:44:29] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:45:20] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/390978 (https://phabricator.wikimedia.org/T174569)
[10:46:50] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/390978 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[10:48:01] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/390978 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[10:48:10] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/390978 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[10:49:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1088 - T174569 (duration: 00m 54s)
[10:49:18] !log Deploy schema change on db1088 - T174569
[10:49:19] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:20] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[10:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:11] !log upgrade elasticsearch on cirrus / codfw - T178411
[10:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:20] T178411: Upgrade cirrus elasticsearch clusters to 5.5.x - https://phabricator.wikimedia.org/T178411
[11:04:29] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:09:59] RECOVERY - puppet last run on druid1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:18:21] !log restart of all the druid daemons on druid100[1-6] to apply the new prometheus jmx jvm exporters - T177459
[11:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:28] T177459: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459
[11:21:42] (CR) Filippo Giunchedi: [C: 1] role::prometheus: add banner messages to MOTD [puppet] - https://gerrit.wikimedia.org/r/390428 (owner: Ema)
[11:25:46] (PS4) Filippo Giunchedi: mx: export metrics from exim4 mainlog [puppet] - https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565)
[11:27:30] (CR) Filippo Giunchedi: [C: 2] mx: export metrics from exim4 mainlog [puppet] - https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) (owner: Filippo Giunchedi)
[11:36:20] PROBLEM - Check systemd state on druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:36:32] checking --^
[11:36:40] PROBLEM - Druid coordinator on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator
[11:40:29] RECOVERY - Check systemd state on druid1001 is OK: OK - running: The system is fully operational
[11:40:39] RECOVERY - Druid coordinator on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator
[11:41:02] this makes 0 sense --^
[11:45:42] Operations, Prod-Kubernetes, monitoring, Kubernetes, and 3 others: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3754369 (akosiaris) Just created our first grafana dashboard at https://grafana.wikimedia.org/dashboard/db/kubernetes-api?orgId=1
[11:48:46] !log installing irssi security updates
[11:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:37] PROBLEM - Check systemd state on druid1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:52:37] PROBLEM - Druid coordinator on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator
[11:53:08] still working on it --^
[11:56:10] !log installing imagemagick security updates
[11:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:38] PROBLEM - Check systemd state on druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:59:38] PROBLEM - Druid coordinator on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator
[12:00:03] (PS6) Ema: role::prometheus: add banner messages to MOTD [puppet] - https://gerrit.wikimedia.org/r/390428
[12:00:11] (CR) Ema: [V: 2 C: 2] role::prometheus: add banner messages to MOTD [puppet] - https://gerrit.wikimedia.org/r/390428 (owner: Ema)
[12:01:37] RECOVERY - Check systemd state on druid1002 is OK: OK - running: The system is fully operational
[12:01:38] RECOVERY - Druid coordinator on druid1002 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator
[12:02:18] !log cp4021: restart varnish-be (mbox lag)
[12:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:37] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0
[12:08:57] !log cp3008: upgrade varnish to 5.1.3-1wm2
[12:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:08] ah now I think I know what's happening to the druid coordinators
[12:13:36] shouldn't be an availability issue though, coordinators are responsible to dictate new/old segments
[12:13:55] pivot works fine
[12:14:51] !log cache_misc: upgrade varnish to 5.1.3-1wm2
[12:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:23] (PS1) ArielGlenn: change scap for dumps to use the dumpsgen user [dumps/scap] - https://gerrit.wikimedia.org/r/390981
[12:17:15] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.)
timed out before a response was received: /_info/version (retrieve service version) timed out before a response was received: /_info/name (retrieve service name) timed out before a
[12:17:15] ved: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[12:17:15] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[12:17:15] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[12:18:45] PROBLEM - Check systemd state on druid1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:18:45] PROBLEM - Druid coordinator on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator
[12:19:15] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[12:19:15] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[12:19:45] RECOVERY - Check systemd state on druid1001 is OK: OK - running: The system is fully operational
[12:19:46] RECOVERY - Druid coordinator on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator
[12:19:53] (CR) ArielGlenn: [V: 2 C: 2] change scap for dumps to use the dumpsgen user [dumps/scap] - https://gerrit.wikimedia.org/r/390981 (owner: ArielGlenn)
[12:20:29] seems good now, the db password to an1003 was old and if backfire only when i restarted the service (old cruft left when we split the druid cluster into two)
[12:21:35] (PS1) Filippo Giunchedi: mtail: allow changing running group [puppet] - https://gerrit.wikimedia.org/r/390982 (https://phabricator.wikimedia.org/T179565)
[12:21:45] RECOVERY - Check systemd state on druid1002 is OK: OK - running: The system is fully operational
[12:21:46] RECOVERY - Druid coordinator on druid1002 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator
[12:24:28] (PS3) Filippo Giunchedi: mtail: add test scaffolding [puppet] - https://gerrit.wikimedia.org/r/388478 (https://phabricator.wikimedia.org/T179565)
[12:27:48] (CR) Filippo Giunchedi: [C: 2] mtail: add test scaffolding [puppet] - https://gerrit.wikimedia.org/r/388478 (https://phabricator.wikimedia.org/T179565) (owner: Filippo Giunchedi)
[12:28:01] (PS2) Filippo Giunchedi: mtail: allow changing running group [puppet] - https://gerrit.wikimedia.org/r/390982 (https://phabricator.wikimedia.org/T179565)
[12:33:09] Operations, PAWS, Pywikibot-Commons, Traffic, and 2 others: Server error (500)
while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3754469 (10Chicocvenancio) The problem is solved. I eventually moved it to [a tool](https://tools.wmflabs.org/merge2pdf/)... [12:33:20] (03CR) 10Filippo Giunchedi: [C: 032] mtail: allow changing running group [puppet] - 10https://gerrit.wikimedia.org/r/390982 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [12:43:24] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390984 [12:43:25] PROBLEM - Nginx local proxy to apache on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:15] RECOVERY - Nginx local proxy to apache on mw2207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.202 second response time [12:55:37] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390984 (owner: 10Marostegui) [12:57:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390984 (owner: 10Marostegui) [12:57:31] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390984 (owner: 10Marostegui) [12:58:30] 10Operations, 10Cloud-VPS, 10Traffic, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3754514 (10ema) p:05Triage>03Normal [12:58:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 - T174569 (duration: 00m 47s) [12:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:38] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:59:45] 10Operations, 10Continuous-Integration-Config, 10Traffic: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - 
https://phabricator.wikimedia.org/T180329#3754553 (10ema) p:05Triage>03Normal [13:00:06] (03PS1) 10Marostegui: db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390985 [13:01:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390985 (owner: 10Marostegui) [13:03:27] (03Merged) 10jenkins-bot: db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390985 (owner: 10Marostegui) [13:03:40] (03CR) 10jenkins-bot: db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390985 (owner: 10Marostegui) [13:04:37] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390988 (https://phabricator.wikimedia.org/T174569) [13:06:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390988 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:06:46] PROBLEM - puppet last run on lvs4005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:08:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390988 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:08:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390988 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:08:43] 10Operations, 10Continuous-Integration-Config, 10Traffic: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754581 (10ema) [13:09:39] !log Deploy schema change on db1083 - T174569 [13:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:46] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:10:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T174569 (duration: 00m 46s) [13:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:01] ^ that was db1085 [13:13:03] 10Operations, 10Continuous-Integration-Config, 10Traffic: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754602 (10ema) I've updated the task description with comments about all repos. They're all debian packages with the except... 
[13:31:53] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390991 (https://phabricator.wikimedia.org/T178359) [13:33:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390991 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:35:06] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390991 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:36:21] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Increase weight for db1103 on s2 and s4 - T178359 (duration: 00m 46s) [13:36:22] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390991 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:28] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [13:36:46] RECOVERY - puppet last run on lvs4005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:37:15] (03PS1) 10Hashar: Add linters for i18n json files [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) [13:37:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1103 on s2 and s4 - T178359 (duration: 00m 46s) [13:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:28] (03CR) 10Hashar: "That is the boiler plate we are using to validate i18n files are valid." 
[dumps/dcat] - 10https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [13:49:38] !log Stop replication on labsdb1010 to copy cebwiki.geo_tags table [13:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:38] (03PS1) 10Hashar: Add PHP linter [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) [13:53:24] (03CR) 10Hashar: "The bit to enable it in CI https://gerrit.wikimedia.org/r/#/c/390998/" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [13:54:59] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3754715 (10Gehel) A short tcpdump session indicates that the only log producers still using logstash100[123] are udp... [14:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171113T1400). [14:00:06] Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:38] I can SWAT today [14:02:26] o/ [14:02:32] Sorry for being late [14:02:59] Amir1: no problem :) want to deploy yourself, or should I? [14:04:06] 10Operations, 10Continuous-Integration-Config, 10Traffic: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754739 (10hashar) If `operations/software/varnish/libvmod-header` is obsolete and you are never going to change it later:... 
[14:04:51] zeljkof: I can do it [14:05:07] Amir1: great, I can run the job that will test the commit [14:05:48] (03CR) 10Ladsgroup: [C: 032] Whitelist jenkins in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390975 (https://phabricator.wikimedia.org/T167432) (owner: 10Ladsgroup) [14:06:58] (03Merged) 10jenkins-bot: Whitelist jenkins in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390975 (https://phabricator.wikimedia.org/T167432) (owner: 10Ladsgroup) [14:07:02] Amir1: this is the job https://integration.wikimedia.org/ci/job/selenium-Wikibase-chrome/ [14:07:17] feel free to run it, or I can, when the deploy is finished [14:07:47] it runs for 40 minutes, so we will not know immediatelly [14:07:48] zeljkof: I'm not sure if I have enough right [14:07:58] (03CR) 10jenkins-bot: Whitelist jenkins in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390975 (https://phabricator.wikimedia.org/T167432) (owner: 10Ladsgroup) [14:08:01] I can run it, it's just a click [14:08:07] !log Deploy alter table on db1102.s6 (with replication - sanitarium master) - T174569 [14:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:14] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [14:10:24] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Whitelist jenkins in test wiki (T167432) (duration: 00m 47s) [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:30] T167432: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432 [14:11:22] zeljkof: It's done now :) [14:11:31] Amir1: ok, running the job [14:11:44] (03CR) 10Lokal Profil: [C: 031] Add linters for i18n json files [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [14:12:15] Thanks! 
[14:12:19] when the job finishes, in about 40 minutes, the graph on top right should have more green https://integration.wikimedia.org/ci/job/selenium-Wikibase-chrome/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=DebianJessie%20&&%20contintLabsSlave/ [14:13:30] fingers crossed [14:15:20] no more commits for EU SWAT [14:15:27] !log EU SWAT finished [14:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:01] (03CR) 10Herron: [C: 032] puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [14:29:09] (03PS7) 10Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) [14:34:27] (03PS1) 10Elukey: role::prometheus::analytics: add druid jmx exporter settings [puppet] - 10https://gerrit.wikimedia.org/r/391007 (https://phabricator.wikimedia.org/T177459) [14:41:21] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3754824 (10jcrespo) > I think this is because the version of yaSSL that MySQL bundle... 
[14:44:05] (03PS1) 10Filippo Giunchedi: role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) [14:44:17] (03CR) 10jerkins-bot: [V: 04-1] role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [14:45:59] (03PS2) 10Elukey: role::prometheus::analytics: add druid jmx exporter settings [puppet] - 10https://gerrit.wikimedia.org/r/391007 (https://phabricator.wikimedia.org/T177459) [14:49:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3754851 (10herron) [14:50:39] Is anyone around that can create gerrit repos? [14:50:58] (03PS1) 10Herron: puppet: change puppet_major_version to 4 on codfw puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/391014 (https://phabricator.wikimedia.org/T177254) [14:54:45] moritzm can you create gerrit repos? 
:D [14:54:50] I hate being blocked on such silly thingds [14:58:08] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8747/" [puppet] - 10https://gerrit.wikimedia.org/r/391014 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:58:10] (03CR) 10Herron: [C: 032] puppet: change puppet_major_version to 4 on codfw puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/391014 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:58:49] !log upgrading puppetmaster2002 to puppet 4 [14:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:22] (03PS2) 10Filippo Giunchedi: role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) [15:01:48] (03CR) 10jerkins-bot: [V: 04-1] role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [15:06:05] !log otto@tin Started deploy [eventlogging/analytics@5796c27]: T179625 [15:06:09] !log otto@tin Finished deploy [eventlogging/analytics@5796c27]: T179625 (duration: 00m 04s) [15:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:12] T179625: Resolve EventCapsule / MySQL / Hive schema discrepancies - https://phabricator.wikimedia.org/T179625 [15:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:52] (03PS3) 10Filippo Giunchedi: role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) [15:07:05] (03CR) 10jerkins-bot: [V: 04-1] role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [15:07:31] (03PS4) 10Filippo Giunchedi: role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 
(https://phabricator.wikimedia.org/T179565) [15:07:57] (03CR) 10jerkins-bot: [V: 04-1] role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [15:08:16] !log Decommissioning Cassandra, restbase1012-a.eqiad.wmnet (T179422) [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:23] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:08:56] hasharAway: when you are back, I'd need to add a debian package as a dependency for a task run by tox, what manifest should I change to get the packagei nstalled? [15:09:48] !log otto@tin Started deploy [eventlogging/analytics@03285e4]: Reverting, got an error: userAgent is a . [15:09:49] godog: out of curiosity is something tox cannot get as dependency? [15:09:50] !log otto@tin Finished deploy [eventlogging/analytics@03285e4]: Reverting, got an error: userAgent is a . (duration: 00m 02s) [15:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:17] (03CR) 10Greg Sabino Mullane: "> Greg, and is there a corresponding RSS/Atom feed? 
I tried appending" [puppet] - 10https://gerrit.wikimedia.org/r/390873 (owner: 10Nemo bis) [15:12:11] volans: nope, it is mtail [15:13:50] got it, no I don't know where to add it and how to add it only for your jobs [15:15:42] yeah it'd be nice if running given tests was selective based on what changed [15:22:47] !log otto@tin Started deploy [eventlogging/analytics@e024af3]: T179625 [15:22:49] !log otto@tin Finished deploy [eventlogging/analytics@e024af3]: T179625 (duration: 00m 02s) [15:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:53] T179625: Resolve EventCapsule / MySQL / Hive schema discrepancies - https://phabricator.wikimedia.org/T179625 [15:22:58] 10Operations, 10ops-codfw: check mw2176 power supply redundancy - https://phabricator.wikimedia.org/T177639#3755093 (10Papaul) 05Open>03Resolved Resolving this since we didn't have any more issue for a week. [15:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:45] 10Operations, 10ops-codfw: check mw2160 power supply redundancy - https://phabricator.wikimedia.org/T177638#3755098 (10Papaul) 05Open>03Resolved Resolving this since we didn't have any more issue for a week. [15:23:45] (03PS1) 10Elukey: Modify CNAME for analytics-slave from db1047 to db1108 [dns] - 10https://gerrit.wikimedia.org/r/391020 (https://phabricator.wikimedia.org/T156844) [15:25:44] (03PS7) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [15:25:57] (03CR) 10Mforns: [C: 031] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/391020 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [15:26:15] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20891136 [15:27:15] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [15:28:56] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3755116 (10herron) [15:29:09] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3715465 (10herron) [15:34:39] (03PS14) 10Elukey: First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [15:45:06] (03PS1) 10Ottomata: Temporarily disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 (https://phabricator.wikimedia.org/T179625) [15:45:33] (03CR) 10jerkins-bot: [V: 04-1] Temporarily disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [15:46:22] (03PS2) 10Ottomata: Temporarily disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 (https://phabricator.wikimedia.org/T179625) [15:46:45] (03CR) 10jerkins-bot: [V: 04-1] Temporarily disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [15:46:47] (03PS1) 10Filippo Giunchedi: prometheus: add redis jobs [puppet] - 10https://gerrit.wikimedia.org/r/391024 (https://phabricator.wikimedia.org/T148637) [15:47:16] (03PS3) 10Ottomata: Temporarily disable EventLogging refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/391023 
(https://phabricator.wikimedia.org/T179625) [15:50:59] (03CR) 10Lokal Profil: [C: 031] "Both patches look fine to me." [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [15:51:52] (03PS1) 10BBlack: cache_upload: do not apply 256K hfp to CL-less requests [puppet] - 10https://gerrit.wikimedia.org/r/391025 [15:55:27] (03CR) 10Lokal Profil: "We don't want to include phpcs ?" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [15:58:34] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=wtp2017.codfw.wmnet [15:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:57] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wtp2017.codfw.wmnet [15:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:09] (03CR) 10Lokal Profil: [C: 031] "Both meaning this and the one in integration/config" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [16:02:56] PROBLEM - Host wtp2017 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:16] (03PS1) 10Mholloway: Make all Reading Infrastructure engineers deployers for MCS, trendingedits [puppet] - 10https://gerrit.wikimedia.org/r/391026 (https://phabricator.wikimedia.org/T180366) [16:07:54] 10Operations, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3755301 (10Mholloway) [16:09:40] 10Operations, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3755390 (10Mholloway) 
[16:11:27] (03PS1) 10Ottomata: Install spark2 on Hadoop workers for use with Oozie [puppet] - 10https://gerrit.wikimedia.org/r/391028 (https://phabricator.wikimedia.org/T158334) [16:12:46] 10Operations, 10Traffic, 10monitoring, 10Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3755414 (10fgiunchedi) [16:12:55] (03Abandoned) 10EBernhardson: Support .whl in archiva git-fat link [puppet] - 10https://gerrit.wikimedia.org/r/389889 (owner: 10EBernhardson) [16:13:06] PROBLEM - Host wtp2017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:13:08] (03PS8) 10EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 [16:15:11] (03CR) 10EBernhardson: "I shouldn't need to be around for a merge. It won't immediately work as i need to push the new deployment out, but that wont break anythin" [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [16:18:16] RECOVERY - Host wtp2017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [16:23:10] (03PS1) 10WMDE-leszek: Load DataTypes extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062) [16:23:15] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391030 [16:23:18] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391030 [16:25:53] (03PS1) 10Herron: puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) [16:25:56] (03PS23) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [16:25:58] (03PS21) 10Paladox: Gerrit: Upgrading gerrit to 2.14.6-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [16:26:19] (03CR) 
10jerkins-bot: [V: 04-1] puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:26:55] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391030 (owner: 10Marostegui) [16:27:38] (03PS2) 10Herron: puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) [16:27:40] (03CR) 10Ottomata: [C: 032] Install spark2 on Hadoop workers for use with Oozie [puppet] - 10https://gerrit.wikimedia.org/r/391028 (https://phabricator.wikimedia.org/T158334) (owner: 10Ottomata) [16:28:02] (03CR) 10jerkins-bot: [V: 04-1] puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:28:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391030 (owner: 10Marostegui) [16:29:03] (03PS3) 10Herron: puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) [16:30:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 - T174569 (duration: 00m 46s) [16:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:50] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:30:56] (03PS4) 10Herron: puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) [16:32:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391030 (owner: 10Marostegui) [16:36:22] (03CR) 10Herron: 
"https://puppet-compiler.wmflabs.org/compiler02/8749/" [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:36:24] (03CR) 10Herron: [C: 032] puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:36:43] (03PS5) 10Herron: puppet: fix puppetmaster and puppetmaster-common package names [puppet] - 10https://gerrit.wikimedia.org/r/391031 (https://phabricator.wikimedia.org/T177254) [16:36:59] 10Operations, 10Goal, 10User-fgiunchedi: Port postgresql metrics to Prometheus - https://phabricator.wikimedia.org/T179306#3755503 (10akosiaris) p:05Triage>03High a:03akosiaris [16:40:25] 10Operations, 10Traffic, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755514 (10BBlack) What I really need to dig on this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion l... 
[16:42:03] !log mobrovac@tin Started deploy [mathoid/deploy@63b2ddc]: Update to service-template-node v0.5.3 - T151396 [16:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:09] T151396: Update Mathoid to service-template-node v0.5.3 - https://phabricator.wikimedia.org/T151396 [16:43:05] (03PS2) 10Elukey: Remove any trace of db1047 from analytics CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/391020 (https://phabricator.wikimedia.org/T156844) [16:43:25] RECOVERY - Host wtp2017 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [16:43:31] (03CR) 10MusikAnimal: Enable per-filter profiling on enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390153 (https://phabricator.wikimedia.org/T179323) (owner: 10Dmaza) [16:44:46] (03PS1) 10Marostegui: db-eqiad.php: Increase db1103 traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391035 (https://phabricator.wikimedia.org/T178359) [16:44:58] (03CR) 10Addshore: "It looks like duplicate wfLoadExtension calls don't cause any issues (i just does it once), so shouldn't be an issue to add this asap." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062) (owner: 10WMDE-leszek) [16:45:48] !log mobrovac@tin Finished deploy [mathoid/deploy@63b2ddc]: Update to service-template-node v0.5.3 - T151396 (duration: 03m 45s) [16:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:55] PROBLEM - Check whether ferm is active by checking the default input chain on wtp2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:45:55] PROBLEM - Check systemd state on wtp2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:45:57] (03CR) 10Marostegui: [C: 031] Remove any trace of db1047 from analytics CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/391020 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [16:46:07] ACKNOWLEDGEMENT - MD RAID on wtp2017 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T180373 [16:46:10] 10Operations, 10ops-codfw: Degraded RAID on wtp2017 - https://phabricator.wikimedia.org/T180373#3755540 (10ops-monitoring-bot) [16:48:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1103 traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391035 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [16:48:55] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:02] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10HTTPS: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3755551 (10BBlack) p:05Triage>03Normal > Unfortunately, there is a known issue with this version of nginx where... 
[16:50:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1103 traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391035 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [16:51:06] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1103 traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391035 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [16:52:24] (03CR) 10Elukey: [C: 032] Remove any trace of db1047 from analytics CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/391020 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [16:53:23] (03CR) 10Ema: [C: 031] cache_upload: do not apply 256K hfp to CL-less requests [puppet] - 10https://gerrit.wikimedia.org/r/391025 (owner: 10BBlack) [16:54:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1103 on s2 and s4 - T178359 (duration: 00m 47s) [16:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:55] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [16:55:53] 10Operations, 10ops-codfw: Broken memory on mw2108 - https://phabricator.wikimedia.org/T180200#3755578 (10Papaul) 05Open>03Resolved Memory test came up with no errors. IF this happen again we will have to purchase a 8GB DDR3 because I have no 8GB memory on site and the system is out of warranty. System is... [16:58:55] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:01:08] (03CR) 10Addshore: [C: 04-1] "I think I would want to split this in 2 though and first deploy it for beta only, and then also in production!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062) (owner: 10WMDE-leszek) [17:07:01] (03CR) 10Zoranzoki21: "This is not deployed. You can abandon patch. I am sorry." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [17:07:02] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3755623 (10EBjune) @MoritzMuehlenhoff @RStallman-legalteam they have already signed the NDAs [17:10:05] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3018 MB (3% inode=98%) [17:10:16] PROBLEM - mediawiki-installation DSH group on mw2108 is CRITICAL: Host mw2108 is not in mediawiki-installation dsh group [17:12:12] !log restbase depooling restbase1007, restbase1012, restbase1014, restbase2002, restbase2004, restbase2006 for T179422 [17:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:19] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:15:08] (03PS1) 10Gehel: wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 [17:28:08] (03PS1) 10Jdlrobson: Enable the download icon on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) [17:28:13] !log Ran "scap pull" on mwdebug1001 after tests re T177486 [17:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:20] T177486: [Tracking] Wikidata entity dumpers need to cope with the immense Wikidata growth recently - https://phabricator.wikimedia.org/T177486 [17:32:05] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) (owner: 10Jdlrobson) [17:32:31] (03PS8) 10Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [17:34:08] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3755723 (10RStallman-legalteam) @MoritzMuelenhoff @EBjune Martijn has signed, but Kirk has still yet to sign. I've... [17:38:18] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3755742 (10mobrovac) In order to be able to depl... [17:38:34] 10Operations, 10Traffic, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755746 (10JEumerus) The user-side of deletion logs does not inherently have a search function, unless the specific actions are marked with a tag. [17:49:28] (03CR) 10Eevans: [C: 032] Log every retry warning [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389742 (owner: 10Eevans) [17:49:55] (03CR) 10Chad: [C: 032] search.wikimedia.org: Clean up result returning logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390357 (owner: 10Chad) [17:50:00] 10Operations, 10Traffic, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755802 (10BBlack) Err, we should really move the sub-conversation back to T171881 . This ticket is more about general reliability problems and/or race-conditions, not about the WP0 abuse s... 
[17:51:07] (03Merged) 10jenkins-bot: search.wikimedia.org: Clean up result returning logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390357 (owner: 10Chad) [17:51:21] (03CR) 10jenkins-bot: search.wikimedia.org: Clean up result returning logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390357 (owner: 10Chad) [17:53:56] (03PS2) 10Eevans: Use a more realistic defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 [17:55:32] (03CR) 10Eevans: "I bumped this default up a bit more after having at least one restart failure at `-a 15`." [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 (owner: 10Eevans) [17:55:37] (03CR) 10Eevans: [C: 032] Use a more realistic defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 (owner: 10Eevans) [17:55:57] ACKNOWLEDGEMENT - Disk space on flerovium is CRITICAL: DISK CRITICAL - free space: /mnt/1a 1309253 MB (3% inode=96%): /mnt/2a 1248881 MB (3% inode=96%): Volans expected, temporary [17:56:06] ACKNOWLEDGEMENT - Disk space on furud is CRITICAL: DISK CRITICAL - free space: /mnt/1a 1309240 MB (3% inode=96%): /mnt/2a 1248862 MB (3% inode=96%): Volans expected, temporary [17:57:25] PROBLEM - Nginx local proxy to apache on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:15] RECOVERY - Nginx local proxy to apache on mw2131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.201 second response time [18:00:04] gehel: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171113T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:47] !log demon@tin Synchronized docroot/search.wikimedia.org/index.php: minor cleanup (duration: 00m 47s) [18:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:45] !log Restarting Cassandra, restbase-dev1004.eqiad.wmnet (testing new `c-foreach-restart`) [18:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:15] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:03:37] (03PS1) 10Hoo man: Include php5 packages on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/391045 [18:04:05] (03CR) 10jerkins-bot: [V: 04-1] Include php5 packages on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/391045 (owner: 10Hoo man) [18:06:36] (03CR) 10Hoo man: [C: 04-1] "Will need to put this into a profile" [puppet] - 10https://gerrit.wikimedia.org/r/391045 (owner: 10Hoo man) [18:07:18] (03PS1) 10Jforrester: Switch submit button from 'save' to 'publish' on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 [18:08:39] (03PS2) 10Chad: WIP: search.wikimedia.org: Stop supporting non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390358 [18:09:05] !log Ran "scap pull" on mwdebug1001/snapshot1001 after (further) tests re T177486 [18:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:11] T177486: [Tracking] Wikidata entity dumpers need to cope with the immense Wikidata growth recently - https://phabricator.wikimedia.org/T177486 [18:12:29] (03PS3) 10Elukey: role::prometheus::analytics: add druid jmx exporter settings [puppet] - 10https://gerrit.wikimedia.org/r/391007 (https://phabricator.wikimedia.org/T177459) [18:16:16] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3019 MB (3% inode=98%) [18:17:45] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:20:40] !log drain + shutdown analytics1029 as prep step to replace the BBU - T178742 [18:20:47] Cc: ottomata --^ [18:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:50] T178742: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742 [18:21:18] +1 [18:22:04] apergos: the dumps initial rsync is all done :) [18:22:14] nice! [18:22:55] wanna do a second pull? that will probably finish up by tomorrow and I can see about adding the box into the dumspdata rsync job after that [18:22:58] apergos: it says used: 38T though, and that confuses me [18:23:24] is that what we expected? I thought it was more like 17T [18:23:47] I didn't run a du on the public directory to check on ms1001 though [18:25:20] hm [18:25:27] dunno why stat1005 is being weird [18:25:48] it fails on an timeout for a hdfs check via puppet , but works fine manually [18:28:00] something heavy is hitting the disk - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=stat1005&refresh=1m&orgId=1&from=now-24h&to=now [18:28:33] oh yeha wow [18:28:33] !log deploy latest blazegraph + GUI on wdqs200[23] to switch vocabulary - T176593 [18:28:34] ok [18:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:41] T176593: Reload WDQS dataset - https://phabricator.wikimedia.org/T176593 [18:32:25] RECOVERY - Disk space on install1002 is OK: DISK OK [18:37:45] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:41:06] PROBLEM - Host analytics1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:42:43] (03PS3) 10Gehel: archiva: generate git-fat sha1 for .tar.gz and .whl [puppet] - 10https://gerrit.wikimedia.org/r/389932 [18:43:16] (03CR) 10Gehel: [C: 032] archiva: generate git-fat sha1 for .tar.gz and .whl [puppet] - 10https://gerrit.wikimedia.org/r/389932 (owner: 10Gehel) 
[18:43:22] (03PS1) 10Hoo man: Use --no-cache for dumping Wikidata entities [puppet] - 10https://gerrit.wikimedia.org/r/391053 (https://phabricator.wikimedia.org/T180048) [18:45:34] (03PS9) 10Gehel: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [18:46:16] RECOVERY - Host analytics1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [18:47:15] (03CR) 10Gehel: [C: 032] Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [18:47:45] (03PS2) 10Mholloway: Add tgr to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/391026 (https://phabricator.wikimedia.org/T180366) [18:48:28] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Use --no-cache for dumping Wikidata entities [puppet] - 10https://gerrit.wikimedia.org/r/391053 (https://phabricator.wikimedia.org/T180048) (owner: 10Hoo man) [18:51:36] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3756151 (10mobrovac) [18:55:10] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3756170 (10Mholloway) >>! In T180366#3755742, @m... [18:57:49] (03PS1) 10Herron: puppet: fix puppet package names in puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/391060 (https://phabricator.wikimedia.org/T177254) [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171113T1900). 
[19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:55] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[search/mjolnir/deploy] [19:01:18] gehel: FYI ^^^ [19:01:33] mmm.. my patch didn't get picked up again [19:01:33] volans: thanks! [19:01:46] why would naos fail when tin was ok... I'll check [19:01:59] anyone interest in doing a quick swat ? [19:02:05] elukey: swapped new controller in an1029 [19:02:12] cmjohnson1: ack! checking [19:03:33] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3756237 (10Fjalapeno) @mobrovac that is not any type of assurance that would happen… sorry if that is confusing. We should not plan... [19:04:30] DMaza I don't think the swatter has been volunteered. :/ [19:04:45] :( [19:04:50] cmjohnson1: this one seems ok! [19:05:36] I'll keep an eye on it and close the task tomorrow [19:05:40] thanks a lot! [19:05:55] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:06:28] DMaza I guess we'll ask when someone volunteers to SWAT [19:06:41] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3756252 (10elukey) Chris swapped the battery (two times) and it seems that the second one is ok! Will keep an eye on it and close the task tomorrow if everything is good. [19:06:59] DMaza: I can SWAT, sorry for the delay! 
[19:07:18] thcipriani: thank you, and no worries :) [19:07:27] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8751/" [puppet] - 10https://gerrit.wikimedia.org/r/391060 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [19:07:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390153 (https://phabricator.wikimedia.org/T179323) (owner: 10Dmaza) [19:07:57] (03CR) 10Herron: [C: 032] puppet: fix puppet package names in puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/391060 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [19:08:02] (03PS2) 10Herron: puppet: fix puppet package names in puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/391060 (https://phabricator.wikimedia.org/T177254) [19:10:19] (03Merged) 10jenkins-bot: Enable per-filter profiling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390153 (https://phabricator.wikimedia.org/T179323) (owner: 10Dmaza) [19:10:29] (03CR) 10jenkins-bot: Enable per-filter profiling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390153 (https://phabricator.wikimedia.org/T179323) (owner: 10Dmaza) [19:11:18] (03CR) 10Mobrovac: [C: 031] Add tgr to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/391026 (https://phabricator.wikimedia.org/T180366) (owner: 10Mholloway) [19:12:25] PROBLEM - Nginx local proxy to apache on mw2150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:15] RECOVERY - Nginx local proxy to apache on mw2150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.200 second response time [19:13:38] DMaza: you change is live on mwdebug1002, check there if possible please [19:14:06] thcipriani: checking [19:14:46] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[search/mjolnir/deploy] [19:15:01] !log Kick of second dumps rsync from ms1001 to labstore1006 [19:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:11] ^ stat1005 is me, revert coming up [19:15:13] s/of/off - meh [19:15:33] (03PS1) 10Gehel: Revert "Deploy MjoLniR with new deploy repository" [puppet] - 10https://gerrit.wikimedia.org/r/391064 [19:15:36] thcipriani: everything looks good [19:15:53] DMaza: cool, going live momentarily [19:16:13] (03CR) 10Gehel: [C: 032] Revert "Deploy MjoLniR with new deploy repository" [puppet] - 10https://gerrit.wikimedia.org/r/391064 (owner: 10Gehel) [19:16:18] !log thcipriani@tin Synchronized docroot/search.wikimedia.org/index.php: [[gerrit:390357|search.wikimedia.org: Clean up result returning logic]] (duration: 00m 47s) [19:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:07] thcipriani: thank you [19:17:41] (03PS2) 10RobH: Add shell user for phedenskog [puppet] - 10https://gerrit.wikimedia.org/r/390244 (https://phabricator.wikimedia.org/T179729) (owner: 10Muehlenhoff) [19:18:28] 10Operations, 10Ops-Access-Requests, 10Performance-Team, 10Patch-For-Review: Adding phedenskog to perf-team - https://phabricator.wikimedia.org/T179729#3734308 (10RobH) Please note this was reviewed and approved in our operations team meeting today. I'll go ahead and rebase/merge @MoritzMuehlenhoff's patc... 
[19:18:40] (03CR) 10RobH: [C: 032] Add shell user for phedenskog [puppet] - 10https://gerrit.wikimedia.org/r/390244 (https://phabricator.wikimedia.org/T179729) (owner: 10Muehlenhoff) [19:18:40] !log thcipriani@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:390153|Enable per-filter profiling on enwiki]] T179323 (duration: 00m 45s) [19:18:44] ^ DMaza live now [19:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:48] T179323: Enable AbuseFilter per-filter profiling on English Wikipedia & monitor if there is a performance impact - https://phabricator.wikimedia.org/T179323 [19:19:13] thcipriani: thank you [19:19:19] (03PS2) 10RobH: Add phedenskog to perf-team group [puppet] - 10https://gerrit.wikimedia.org/r/390245 (https://phabricator.wikimedia.org/T179729) (owner: 10Muehlenhoff) [19:19:59] (03CR) 10RobH: [C: 032] Add phedenskog to perf-team group [puppet] - 10https://gerrit.wikimedia.org/r/390245 (https://phabricator.wikimedia.org/T179729) (owner: 10Muehlenhoff) [19:20:27] !log smalyshev@tin Started deploy [wdqs/wdqs@b44cf27]: data reload/T176593 [19:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:47] !log smalyshev@tin Finished deploy [wdqs/wdqs@b44cf27]: data reload/T176593 (duration: 00m 20s) [19:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:12] 10Operations, 10Ops-Access-Requests, 10Performance-Team: Adding phedenskog to perf-team - https://phabricator.wikimedia.org/T179729#3756345 (10RobH) 05Open>03Resolved a:05MoritzMuehlenhoff>03None This is now live (after ops meeting approval). Since this affects a fair number of systems, manually kic... 
[19:24:08] (03PS2) 10BBlack: cache_upload: do not apply 256K hfp to CL-less requests [puppet] - 10https://gerrit.wikimedia.org/r/391025 [19:24:26] (03PS1) 10Herron: puppet: fix puppetmaster-passenger package in puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/391067 (https://phabricator.wikimedia.org/T177254) [19:24:44] (03CR) 10BBlack: [C: 032] cache_upload: do not apply 256K hfp to CL-less requests [puppet] - 10https://gerrit.wikimedia.org/r/391025 (owner: 10BBlack) [19:28:09] (03Abandoned) 10BBlack: normalize_path: stop on fragment marker [puppet] - 10https://gerrit.wikimedia.org/r/274086 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [19:33:23] (03CR) 10Gilles: [C: 031] xenon: pass --mindwidth to flamegraph.pl [puppet] - 10https://gerrit.wikimedia.org/r/390645 (owner: 10Ori.livneh) [19:34:35] (03PS1) 10RobH: Adding Ian Marlier to wmf ldap [puppet] - 10https://gerrit.wikimedia.org/r/391068 (https://phabricator.wikimedia.org/T180381) [19:35:29] (03CR) 10RobH: [C: 032] Adding Ian Marlier to wmf ldap [puppet] - 10https://gerrit.wikimedia.org/r/391068 (https://phabricator.wikimedia.org/T180381) (owner: 10RobH) [19:42:12] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3756422 (10RobH) 05stalled>03declined This is still pending after over a month without update. I'm going to go ahead and close this t... [19:44:45] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [19:45:04] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3756435 (10thcipriani) >>! In T179984#3750820, @akosiaris wrote: > `debian/changelog` in that package is wrongly formatted and hence package is cur... 
[19:47:23] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler03/8754/" [puppet] - 10https://gerrit.wikimedia.org/r/391067 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [19:47:35] (03CR) 10Herron: [C: 032] puppet: fix puppetmaster-passenger package in puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/391067 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [19:50:33] (03CR) 10Dereckson: [C: 04-1] Adjust throttle.php for dewiki workshop (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [19:51:34] (03PS2) 10Herron: puppet: fix puppetmaster-passenger package in puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/391067 (https://phabricator.wikimedia.org/T177254) [19:57:10] (03CR) 10Hashar: "My primary goal is to configure that repository in CI. I went with simple changes that add just the minimum and do not touch the source co" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [19:58:41] (03CR) 10Zoranzoki21: "> This is not deployed. You can abandon patch. I am sorry." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [19:59:36] (03PS2) 10Gehel: wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 [20:00:04] (03CR) 10jerkins-bot: [V: 04-1] wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 (owner: 10Gehel) [20:00:39] (03PS3) 10Gehel: wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 [20:01:06] (03CR) 10jerkins-bot: [V: 04-1] wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 (owner: 10Gehel) [20:02:03] (03PS4) 10Gehel: wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 [20:05:21] (03PS2) 10Ori.livneh: xenon: pass --mindwidth to flamegraph.pl [puppet] - 10https://gerrit.wikimedia.org/r/390645 [20:05:44] (03PS5) 10TerraCodes: Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) [20:06:36] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [20:07:04] (03CR) 10TerraCodes: ">" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [20:07:06] (03CR) 10Ori.livneh: [C: 032] xenon: pass --mindwidth to flamegraph.pl [puppet] - 10https://gerrit.wikimedia.org/r/390645 (owner: 10Ori.livneh) [20:07:26] (03CR) 10Zoranzoki21: "> recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [20:10:37] (03CR) 10BearND: [C: 031] Add tgr to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/391026 
(https://phabricator.wikimedia.org/T180366) (owner: 10Mholloway) [20:21:03] (03PS1) 10Herron: puppet: puppetmaster remove puppetmaster-common package ensure [puppet] - 10https://gerrit.wikimedia.org/r/391076 (https://phabricator.wikimedia.org/T177254) [20:21:30] (03CR) 10jerkins-bot: [V: 04-1] puppet: puppetmaster remove puppetmaster-common package ensure [puppet] - 10https://gerrit.wikimedia.org/r/391076 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [20:23:27] (03PS2) 10Herron: puppet: puppetmaster remove puppetmaster-common package ensure [puppet] - 10https://gerrit.wikimedia.org/r/391076 (https://phabricator.wikimedia.org/T177254) [20:23:49] (03PS14) 10Addshore: Add ::statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) [20:26:56] (03CR) 10Addshore: "11:05:19 NEW violations:" [puppet] - 10https://gerrit.wikimedia.org/r/387211 (owner: 10Addshore) [20:28:51] (03PS2) 10Catrope: Enable structured change filters by default on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382333 (https://phabricator.wikimedia.org/T177444) [20:32:16] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382333 (https://phabricator.wikimedia.org/T177444) (owner: 10Catrope) [20:33:27] (03CR) 10Ottomata: [C: 032] Add ::statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [20:33:37] (03CR) 10Ottomata: [C: 031] "+1 let me know when you are ready for merge." [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [20:35:01] (03CR) 10Addshore: "@ottomata, I think we are ready to get this part merged and out the door! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [20:35:04] (03CR) 10Addshore: [C: 031] Add ::statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [20:35:09] ottomata: ^^ [20:35:19] (03PS15) 10Ottomata: Add ::statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [20:35:26] (03CR) 10Ottomata: [V: 032 C: 032] Add ::statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [20:38:42] (03PS6) 10TerraCodes: Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) [20:39:10] (03PS13) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [20:39:41] (03PS2) 10TerraCodes: Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) [20:40:33] (03CR) 10Zoranzoki21: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [20:41:05] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [20:41:06] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8755/" [puppet] - 10https://gerrit.wikimedia.org/r/391076 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [20:41:59] (03CR) 10Zoranzoki21: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [20:43:55] (03CR) 10Herron: [C: 032] puppet: puppetmaster remove 
puppetmaster-common package ensure [puppet] - 10https://gerrit.wikimedia.org/r/391076 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [20:44:49] (03PS3) 10Herron: puppet: puppetmaster remove puppetmaster-common package ensure [puppet] - 10https://gerrit.wikimedia.org/r/391076 (https://phabricator.wikimedia.org/T177254) [20:48:23] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3756627 (10debt) [20:56:59] (03CR) 10Lokal Profil: [C: 031] "> My primary goal is to configure that repository in CI. I went with simple changes that add just the minimum and do not touch the source " [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171113T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:26] Nothing for ORES today [21:09:27] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? - https://phabricator.wikimedia.org/T170640#3756669 (10Jgreen) >>! In T170640#3449856, @cwdent wrote: > @Ejegg - there are functioning sites at /reports and /webfiledrop, but no real idea if they are still in use... 
[21:10:15] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 51 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:15:18] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 12 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:27:54] apergos: second rsync is done too fyi [21:28:18] no parsoid deploy today [21:30:23] madhuvishy: excellent! [21:30:39] XioNoX: ^^ you should be good to do the needful on the switch [21:31:16] apergos: also, I had this before: [21:31:16] madhuvishy> Madhumitha Viswanathan apergos: it says used: 38T though, and that confuses me [21:31:16] 10:23 AM is that what we expected? I thought it was more like 17T [21:31:16] 10:23 AM I didn't run a du on the public directory to check on ms1001 though [21:31:29] 38? that's no good [21:31:51] /usr/bin/rsync --archive --progress ms1001.wikimedia.org::data/xmldatadumps/public /srv/dumps/ [21:31:54] is what I ran [21:31:55] yeah [21:32:19] apergos: thx, will do in a few minutes [21:32:23] let me look and see what's going on [21:32:49] oh [21:32:53] just me not being able to read [21:32:57] :) [21:32:58] that's probably about right [21:33:19] it does say 40T used in ms1001 [21:33:28] Last 5 good dumps (most desired option): 15 TB for 5 most recent dumps, as of October 2017. This would be 3 sets of full dumps and 2 sets of partial dumps. [21:33:33] I thought public was much smaller [21:33:42] look I even documented it. recently. and that'ss only "last 5" and not archives and other/* [21:34:17] https://www.irccloud.com/pastebin/CMEDFVCx/ [21:34:27] ^ looks like more than 5 copies to me? [21:34:44] that's fine, it shoul dbe [21:34:47] we don't keep 5 [21:34:48] okay [21:35:00] we just ask out mirrors to pick up the last 5 as a good number [21:35:03] *our [21:35:12] I see [21:35:18] and we keep 10? 
[21:35:20] other/* is taking up a lot of those extra T [21:35:25] for small wikis we keep 9-10 [21:35:29] I see [21:35:32] for large ones, 7? 6-7? [21:35:43] apergos: alright, all done [21:35:48] enwiki has 7 + latest [21:35:53] XioNoX: ok, thanks! [21:36:00] madhuvishy: then it must be 7 [21:36:04] may be latest is just a symlink [21:36:05] right [21:36:14] yes, latest is always a set of links only [21:36:32] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Find and clear oversized x-content-dimensions headers - https://phabricator.wikimedia.org/T179595#3756718 (10Gilles) [21:36:47] right, cool [21:37:22] apergos: yeah, other is 18T [21:37:26] yep [21:38:22] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps [21:38:59] see I even updated it recently... sure doesn't mean I can *remember* it though 😕 [21:40:53] !log Decommissioning Cassandra, restbase1012-b.eqiad.wmnet (T179422) [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:03] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [21:41:54] apergos: :) nice! thanks for the doc [21:42:48] yw [21:43:18] apergos i wonder if there should be indepth version @ wikitech? 
[21:44:29] wikitech has piles and piles of dump docs [21:45:10] apergos i guess that is true [21:45:11] https://wikitech.wikimedia.org/wiki/Dumps it will be time to update them again once everything gets shifted around [21:50:46] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [21:51:46] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [21:57:06] !log mholloway-shell@tin Started deploy [mobileapps/deploy@9b10959]: Update mobileapps to c002862 [21:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:05] dapatrick, bawolff, and Reedy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171113T2200). [22:00:05] No GERRIT patches in the queue for this window AFAICS. 
[22:00:14] (03PS1) 10Ayounsi: reserve internal anycast range [dns] - 10https://gerrit.wikimedia.org/r/391131 [22:02:37] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@9b10959]: Update mobileapps to c002862 (duration: 05m 31s) [22:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:25] (03PS2) 10Ayounsi: Reserve internal anycast range [dns] - 10https://gerrit.wikimedia.org/r/391131 [22:22:59] 10Operations, 10Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756812 (10Krinkle) [22:26:34] 10Operations, 10Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756828 (10Krinkle) [22:42:44] !log deployed patch T124404 [22:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:25] PROBLEM - Apache HTTP on mw2209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:15] RECOVERY - Apache HTTP on mw2209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.496 second response time [22:49:20] !log deployed patch T119158 (Will affect language converter. -{}- no longer allowed in link urls) [22:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:00] (03PS1) 10Ottomata: [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) [22:51:29] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [22:53:12] 10Operations, 10Edit-Review-Improvements, 10Collaboration-Feature-Rollouts (Collaboration-WL-Graduated-Everywhere), 10Collaboration-Team-Triage (Collab-Team-This-Quarter), 10Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3756905 (10... 
[23:09:21] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3756990 (Jdlrobson) I understand, but I must say, this bums me out as I just migrated a bunch of side projects to this service an... [23:13:06] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [23:14:05] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [23:15:05] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (Krenair) >>! In T180384#3756237, @Fjalapeno wrote: > Really the concept needs more testing for product viability. Unfort... [23:16:43] Operations, ops-ulsfo, Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3757008 (RobH) Supposedly this was delivered to ulsfo today, but I didn't get any email from UL support. Dropped them an email and will update. If it is onsite, I'll plan to go to ulsfo tomorrow (Tuesda... [23:28:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200) [23:29:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:40:56] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark/share/lib] [23:43:15] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is C [23:43:16] eve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [23:44:15] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:50:16] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 400 (expecting: 200) [23:52:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy