[00:29:12] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Papaul) Checked switch configuration is done. This task can be closed [00:30:07] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Papaul) [00:30:23] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Papaul) 05Open→03Resolved [00:35:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Papaul) ` papaul@asw2-b-eqiad# show | compare [edit interfaces] - ge-3/0/18 { - description labnodepool1001; - } [00:35:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Papaul) [00:50:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-2/0/15 { - description labstore1001; - } [00:52:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Papaul) [01:01:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Papaul) [01:03:01] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10wiki_willy) 
@ayounsi - from @Papaul 's comment above, it seems like an issue with the threshold being set too low. If the dotted line on this graph represents when ps1-a5-eqiad starts alerti... [01:03:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Papaul) @Jclark-ctr once you done with this task you can resolve. I checked switch part is done, mgmt DNS is done as well. Thanks [01:05:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Papaul) [01:06:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Papaul) 05Open→03Resolved This is done [01:09:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 (10Papaul) ` papaul@asw2-a-eqiad# show | compare [edit interfaces] - ge-4/0/8 { - description oxygen; - } [01:12:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 (10Papaul) [01:12:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 (10Papaul) 05Open→03Resolved This is done [01:12:51] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10Papaul) [01:23:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces interface-range disabled] member ge-5/0/39 { ... } + member ge-7/0/0; [edit interfaces] - ge-7/0/... 
[01:24:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Papaul) [02:01:55] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:03:19] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10wiki_willy) Hi @jijiki - chatted with John on this a bit earlier today. He'll prioritize getting these racked, along with a couple other installs, in earl... [04:36:55] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:36:57] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:39:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:39:09] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:52:27] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:52:27] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:04:16] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 
5/6 UP : OSPFv3: 5/6 UP CDanis Cyrusone maintenance 1572395 - The acknowledgement expires at: 2020-03-01 11:03:12. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:04:16] ACKNOWLEDGEMENT - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Cyrusone maintenance 1572395 - The acknowledgement expires at: 2020-03-01 11:03:12. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:10:09] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:12:21] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:10:47] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:11:57] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:19:36] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis Cyrusone maintenance 1572395 - The acknowledgement expires at: 2020-03-01 11:19:17. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:19:36] ACKNOWLEDGEMENT - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Cyrusone maintenance 1572395 - The acknowledgement expires at: 2020-03-01 11:19:17. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:32:45] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:33:55] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:56:37] (03PS19) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [09:01:28] (03PS20) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [09:18:54] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) [11:11:35] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::refine: fix el blacklist [puppet] - 10https://gerrit.wikimedia.org/r/575573 (owner: 10Elukey) [11:22:38] (03PS1) 10Elukey: Fix Hive zookeeper references in Analytics Hadoop client/daemons [puppet] - 10https://gerrit.wikimedia.org/r/575682 [11:24:15] (03PS2) 10Elukey: Fix Hive zookeeper references in Analytics Hadoop client/daemons [puppet] - 10https://gerrit.wikimedia.org/r/575682 [11:30:36] 10Operations, 10DNS, 10Technical blog, 10Traffic, 10cloud-services-team (Kanban): Setup DNS to direct techblog.wikimedia.org to new Wordpress VIP hosting - https://phabricator.wikimedia.org/T246507 (10Reedy) [11:31:06] (03PS3) 10Elukey: Remove Hive zookeeper settings from Analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/575682 [11:33:50] (03CR) 10Elukey: [C: 03+2] Remove Hive zookeeper settings from Analytics nodes [puppet] - 10https://gerrit.wikimedia.org/r/575682 (owner: 10Elukey) [11:36:35] 
(03PS1) 10Reedy: Add viwiki to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) [11:39:06] (03PS2) 10Reedy: Add viwiki to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) [12:28:56] (03PS3) 10Reedy: Add viwiki to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) [12:30:04] (03CR) 10jenkins-bot: [V: 04-1] Add viwiki to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) (owner: 10Reedy) [12:30:46] (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) (owner: 10Reedy) [12:32:05] (03CR) 10Reedy: [C: 03+2] "ship it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) (owner: 10Reedy) [12:33:04] (03Merged) 10jenkins-bot: Add viwiki to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575684 (https://phabricator.wikimedia.org/T246511) (owner: 10Reedy) [12:34:17] !log reedy@deploy1001 Synchronized dblists/all-labs.dblist: T246511 (duration: 00m 57s) [12:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:23] T246511: Create beta viwiki - https://phabricator.wikimedia.org/T246511 [12:35:23] !log reedy@deploy1001 Synchronized wikiversions-labs.json: T246511 (duration: 00m 56s) [12:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:14] !log reedy@deploy1001 Synchronized wmf-config/config/viwiki.yaml: T246511 (duration: 00m 56s) [12:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:37] still 'no wiki found' [12:38:11] beta scap is still running [12:38:23] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/289950/console [12:38:26] 12:34:32 Started by upstream project
"beta-mediawiki-config-update-eqiad" build number 16798 [12:38:36] works now [12:49:35] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [13:21:41] 10Operations, 10Research, 10Traffic, 10Patch-For-Review: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10bmansurov) @BBlack is the switchover complete? Can the previous content host stop hosting? [13:59:59] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [14:04:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:08:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:45:28] 10Operations, 10Research, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10Reedy) >>! In T242374#5929016, @bmansurov wrote: > @BBlack is the switchover complete? Can the previous content host stop hosting? ` Address: 91.198.174.192 ` It's bei... 
[14:51:17] 10Operations, 10DNS, 10Technical blog, 10Traffic, 10cloud-services-team (Kanban): Setup DNS to direct techblog.wikimedia.org to new Wordpress VIP hosting - https://phabricator.wikimedia.org/T246507 (10Aklapper) [15:13:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:16:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:13] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:20:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:24:31] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:27:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:49:57] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:51:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:52:07] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) is WARNING: Test Ensure Zotero is working responds with unexpected value at path [0]/itemType = webpage https://wikitech.wikimedia.org/wiki/Citoid [15:54:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:55:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:08:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:14] (03PS1) 10Urbanecm: Set cswiki to use custom minerva logo again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) [16:22:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:24:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:29:04] (03PS1) 10Andrew Bogott: nfs: remove scratch, dumps, project mounts from utrs project [puppet] - 10https://gerrit.wikimedia.org/r/575697 (https://phabricator.wikimedia.org/T208414) [16:30:54] (03CR) 10Andrew Bogott: [C: 03+2] nfs: remove scratch, dumps, project mounts from utrs project [puppet] - 10https://gerrit.wikimedia.org/r/575697 (https://phabricator.wikimedia.org/T208414) (owner: 10Andrew Bogott) [16:46:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:55:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:56:16] (03CR) 10Krinkle: "Based on I17a6261009444a, I re-reviewed the config diff from Jenkins and found one other regression:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575363 (owner: 10Jforrester) [16:56:38] (03CR) 10Krinkle: [C: 03+1] "See my comment at https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575363/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) (owner: 10Urbanecm) [16:57:08] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) (owner: 10Urbanecm) [16:58:11] (03PS2) 10Urbanecm: Set cswiki and cywiki to use custom minerva logo again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) [17:02:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:05:01] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:15:47] (03PS1) 10Elukey: role::prometheus::analytics: add jmx job for Hive in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/575703 [17:18:41] (03CR) 10Elukey: [C: 03+2] role::prometheus::analytics: add jmx job for Hive in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/575703 (owner: 10Elukey) [17:28:40] (03PS1) 10Elukey: profile::hive::server: add prometheus exporter by default [puppet] - 10https://gerrit.wikimedia.org/r/575705 [17:30:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:32:15] (03CR) 10Elukey: [C: 03+2] profile::hive::server: add prometheus exporter by default [puppet] - 10https://gerrit.wikimedia.org/r/575705 (owner: 10Elukey) [17:32:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:36:43] (03PS1) 10Elukey: profile::hive::metastore: add prometheus monitoring by default [puppet] - 10https://gerrit.wikimedia.org/r/575706 [17:37:03] (03CR) 10Elukey: [C: 03+2] profile::hive::metastore: add prometheus monitoring by default [puppet] - 10https://gerrit.wikimedia.org/r/575706 (owner: 10Elukey) [18:07:19] PROBLEM - PHP opcache health on mw1251 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:13:28] (03PS1) 10Krinkle: Enable LCStoreStaticArray on test.wikipedia.org and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575709 (https://phabricator.wikimedia.org/T99740) [18:14:47] (03PS2) 10Krinkle: Enable LCStoreStaticArray on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575709 (https://phabricator.wikimedia.org/T99740) [18:44:43] (03CR) 10ArielGlenn: [C: 03+2] cleanup of page content dumps run() [dumps] - 10https://gerrit.wikimedia.org/r/575581 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [18:47:29] (03CR) 10ArielGlenn: [C: 03+2] fix up file list methods [dumps] - 10https://gerrit.wikimedia.org/r/575584 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [18:52:25] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::ui: add BigTop/Hive settings [puppet] - 10https://gerrit.wikimedia.org/r/575710 [18:54:13] (03CR) 10Elukey: [C: 03+2] 
role::analytics_test_cluster::hadoop::ui: add BigTop/Hive settings [puppet] - 10https://gerrit.wikimedia.org/r/575710 (owner: 10Elukey) [19:46:38] (03PS1) 10Krinkle: Enable LCStoreStaticArray on commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575711 (https://phabricator.wikimedia.org/T99740) [19:46:40] (03PS1) 10Krinkle: Enable LCStoreStaticArray on hewiki and cawiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575712 [19:46:42] (03PS1) 10Krinkle: Enable LCStoreStaticArray on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575713 (https://phabricator.wikimedia.org/T99740) [19:47:53] (03PS2) 10Krinkle: Enable LCStoreStaticArray on wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575711 (https://phabricator.wikimedia.org/T99740) [19:47:55] (03PS2) 10Krinkle: Enable LCStoreStaticArray on hewiki and cawiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575712 [19:47:57] (03PS2) 10Krinkle: Enable LCStoreStaticArray on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575713 (https://phabricator.wikimedia.org/T99740) [19:47:59] (03PS1) 10Krinkle: Enable LCStoreStaticArray on commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575714 (https://phabricator.wikimedia.org/T99740) [19:48:52] (03PS3) 10Krinkle: Enable LCStoreStaticArray on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575709 (https://phabricator.wikimedia.org/T99740) [19:48:54] (03PS3) 10Krinkle: Enable LCStoreStaticArray on wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575711 (https://phabricator.wikimedia.org/T99740) [19:48:56] (03PS2) 10Krinkle: Enable LCStoreStaticArray on commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575714 (https://phabricator.wikimedia.org/T99740) [19:48:58] (03PS3) 10Krinkle: Enable LCStoreStaticArray on hewiki and cawiki (group1) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/575712 [19:49:00] (03PS3) 10Krinkle: Enable LCStoreStaticArray on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575713 (https://phabricator.wikimedia.org/T99740) [19:49:19] (03PS4) 10Krinkle: Enable LCStoreStaticArray on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575709 (https://phabricator.wikimedia.org/T99740) [19:49:21] (03PS4) 10Krinkle: Enable LCStoreStaticArray on wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575711 (https://phabricator.wikimedia.org/T99740) [19:49:23] (03PS3) 10Krinkle: Enable LCStoreStaticArray on commons.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575714 (https://phabricator.wikimedia.org/T99740) [19:49:25] (03PS4) 10Krinkle: Enable LCStoreStaticArray on hewiki and cawiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575712 (https://phabricator.wikimedia.org/T99740) [19:49:27] (03PS4) 10Krinkle: Enable LCStoreStaticArray on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575713 (https://phabricator.wikimedia.org/T99740) [19:59:49] (03CR) 10ArielGlenn: [C: 03+2] convert all file list methods to use common args [dumps] - 10https://gerrit.wikimedia.org/r/575585 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [20:02:26] (03CR) 10ArielGlenn: [C: 03+2] move StubProvider out to its own module [dumps] - 10https://gerrit.wikimedia.org/r/575586 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [20:07:39] (03CR) 10ArielGlenn: [C: 03+2] move some dfname/pagerange munging methods to their own class [dumps] - 10https://gerrit.wikimedia.org/r/575587 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [20:18:49] (03CR) 10ArielGlenn: [C: 03+2] move some output file listing methods to their own module [dumps] - 10https://gerrit.wikimedia.org/r/575588 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [20:23:39] (03CR) 10ArielGlenn: [C: 03+2] use 
only jobFileLister instance methods in other modules [dumps] - 10https://gerrit.wikimedia.org/r/575589 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [20:27:28] (03PS1) 10Alex Monk: Fix some incorrect uses of the lookup function [puppet] - 10https://gerrit.wikimedia.org/r/575718 [20:27:30] (03PS1) 10Alex Monk: Move more eqiad1.yaml hieradata to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/575719 (https://phabricator.wikimedia.org/T242607) [20:29:19] (03PS3) 10ArielGlenn: add some unit tests for prefetch arg generation [dumps] - 10https://gerrit.wikimedia.org/r/575591 (https://phabricator.wikimedia.org/T246465) [20:53:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:55:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:55:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:02:07] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:06:31] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [21:08:39] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:09:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:11:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:57] (03CR) 10ArielGlenn: [C: 03+2] add some unit tests for prefetch arg generation [dumps] - 10https://gerrit.wikimedia.org/r/575591 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [21:29:07] (03PS1) 10Alex Monk: codfw1: Register our bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/575737 [21:29:53] (03PS2) 10Alex Monk: codfw1dev: Register our bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/575737 [21:41:13] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:42:35] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:51:47] (03CR) 10Alex Monk: "Cherry-picked on cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud and puppetmaster-01.cloudinfra-codfw1d" [puppet] - 10https://gerrit.wikimedia.org/r/575737 (owner: 10Alex Monk) [21:58:16] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (10bd808) [22:04:58] (03CR) 10Masumrezarock100: [C: 03+1] Set cswiki and cywiki to use custom minerva logo again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) (owner: 10Urbanecm) [22:09:01] (03PS1) 10Alex Monk: profile::mariadb::cloudinfra: Allow overriding of 
hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) [22:12:22] 10Operations, 10cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10bd808) [22:13:54] (03PS2) 10Alex Monk: profile::mariadb::cloudinfra: Allow overriding of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) [22:15:43] 10Operations, 10cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:16:13] 10Operations, 10cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) Also did Toolforge in particular in {T245365} [22:46:59] 10Operations, 10cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) Worth noting there's a small army of random puppetmasters laying around running puppet 4: * jeh-puppetmaster.testlabs.eqiad.wmflabs * integration-puppetmaster01.integra... [23:01:09] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [23:02:27] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [23:34:39] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 124.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1